Automata Inference Algorithms
Automata inference algorithms are algorithms that take a set of program execution traces and utilise techniques such as state-merging in order to try and hypothesise an automata representation of a program from any given trace.
Program Traces
A program trace is a log of sequences executed during the run-time of a program. Generally, they capture details about the sequence of operations, function calls, and changes of state. In the case of automata inference, we require program traces which detail the events, in order to attempt to construct an automaton generalising from observed data.
Prefix Tree Acceptor
Automata inference algorithms generally start by creating a prefix tree acceptor, or PTA. A prefix tree acceptor is a tree-like deterministic finite automaton (DFA) built from a program trace log, by taking all the traces in the log as state transitions and constructing the smallest DFA which is a tree, consistent with the program traces.
K-tails Algorithm
K-tails is a classic automata inference algorithm, which generates an inferred deterministic finite automata (DFA) from an input parameter k and a program trace log. Several variants of the k-tails algorithm have been defined since its initial introduction, but this tool teaches a 'modern' variant which uses prefix tree acceptors and the blue-fringe algorithm for state exploration.
Modern k-tails starts by taking a program trace log and parameter k, and constructs an initial PTA representing the program traces. It then iterates through the states of the graph and merges any two states that have equivalent k-tails. A state's k-tail is defined as the future sequence of k events/transitions from the current state.
Evidence-Drive State Merging (EDSM) Algorithm
EDSM is a competition winning state-merging algorithm, which uses evidence to determine which states should be merged. The algorithm performs better than other algorithms due to its inclusion of a scoring system, which makes sure early decisions are correct, preventing a snowballing of wrong decisions being made later in the algorithm. The following explanation describes the blue-fringe variant of the EDSM algorithm.
EDSM starts with a PTA constructed from program traces. The root of this PTA is coloured red, every child of the root is coloured blue, and all other nodes are left white.
A valid merge is a merge where no transition exists for a pair of states that leads to different subsequent states. With sparse training data, the algorithm can’t be sure that a compatible merge is truly valid, but it labels it as valid anyway.
When the algorithm reaches a point in which it can merge states, all valid merges are hypothetically computed, and are scored as follows: a negative score is given if the merge is not valid, zero is given if there are no common states, and the number of states eliminated by the merge is given otherwise.
The algorithm then proceeds to repeat the following steps until completion:
- Evaluate all the red/blue merges
- If there is a blue node that cannot be merged with a red node, promote the shallowest blue node to red, then go back to step 1
- Otherwise (if no blue is promoted), perform the red-blue merge with the highest score, and go back to step 1
- Halt if no more blue nodes
- There is a connected graph of mutually unmergeable red nodes
- All non-red children of red nodes are blue
- Blue nodes are the roots of isolated trees