The Aho-Corasick algorithm is a classical multi-pattern matching algorithm and is widely used in practice.
The data structure behind the Aho-Corasick algorithm is the Aho-Corasick automaton, usually called the AC automaton.
Most programmers will have met finite automata (FA), which are subdivided into deterministic finite automata (DFA) and non-deterministic finite automata (NFA). An ordinary automaton of this kind does not handle multi-pattern matching on its own. The AC automaton achieves it by adding failure transitions: when the current match cannot be extended, the automaton moves to the state for the longest suffix of the text read so far that is still a prefix of some pattern.
1. Multi-pattern matching
Multi-pattern matching means finding, for a set of pattern strings p1, p2, p3, ..., pm, every position at which any of these patterns occurs in a continuous text t1...tn.
For example: find every occurrence of the patterns in the set {"nihao", "hao", "hs", "hsr"} in the given text "sdmfhsgnshejfgnihaofhsrnihao".
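To make the expected result concrete, here is a naive baseline in Python. It is not the Aho-Corasick algorithm; it simply scans the text once per pattern, which is the cost the AC automaton is meant to avoid. The function name find_all_naive is made up for this sketch, and the patterns and text are written in lowercase.

```python
def find_all_naive(text, patterns):
    """Return sorted (start, pattern) pairs for every occurrence of every pattern."""
    matches = []
    for pattern in patterns:
        start = text.find(pattern)
        while start != -1:
            matches.append((start, pattern))
            # Advance by one position so overlapping occurrences are found too.
            start = text.find(pattern, start + 1)
    return sorted(matches)


text = "sdmfhsgnshejfgnihaofhsrnihao"
patterns = ["nihao", "hao", "hs", "hsr"]
for start, pattern in find_all_naive(text, patterns):
    print(start, pattern)
# 4 hs
# 14 nihao
# 16 hao
# 20 hs
# 20 hsr
# 23 nihao
# 25 hao
```

Positions are 0-based start indices. Note that "hao" is reported inside each "nihao", and "hs" inside "hsr"; a multi-pattern matcher has to report these overlapping hits as well.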
2. Aho-Corasick algorithm
Using the Aho-Corasick algorithm takes three steps:
1. Build a trie from the patterns
2. Add failure paths to the trie
3. Search the text to be scanned with the resulting AC automaton
These three steps are described below:
2.1 Building a trie for the pattern set
A trie is itself an automaton. For the pattern set {"say", "she", "shr", "he", "her"}, the corresponding trie is shown below, where the circles marked in red denote accepting states:
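A minimal Python sketch of this step follows. The node layout (a dict with "children" and "output" keys) and the names new_node, build_trie and count_nodes are assumptions made for illustration, not part of any library. A node whose "output" list is non-empty plays the role of an accepting state.

```python
def new_node():
    # Each node stores its outgoing edges and the patterns that end at it.
    return {"children": {}, "output": []}


def build_trie(patterns):
    root = new_node()
    for pattern in patterns:
        node = root
        for ch in pattern:
            # Reuse the child for a shared prefix, otherwise create a new one.
            if ch not in node["children"]:
                node["children"][ch] = new_node()
            node = node["children"][ch]
        # The node reached by the last character is an accepting state.
        node["output"].append(pattern)
    return root


def count_nodes(node):
    # Count this node plus all of its descendants.
    return 1 + sum(count_nodes(child) for child in node["children"].values())


root = build_trie(["say", "she", "shr", "he", "her"])
print(sorted(root["children"]))  # ['h', 's']
print(count_nodes(root))         # 10 (the root plus 9 character nodes)
```

Because "she" and "shr" share the prefix "sh", and "he" and "her" share "he", the five patterns need only nine character nodes below the root.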
2.2 Adding failure paths to the trie to build the AC automaton
The construction of failure pointers can be summed up in one sentence: suppose the letter on the current node is c; walk along the failure pointers starting from its parent until reaching a node that has a child labelled c, then point the current node's failure pointer at that child. If no such node is found even at the root, point the failure pointer at the root.
The failure paths are computed with a breadth-first search (BFS), which visits the nodes level by level, so a node's parent always has its failure pointer set before the node itself is handled.
Special case: the nodes on the second level (the children of the root) need special handling; their failure pointers point directly to their parent, i.e. the root.
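The rule above can be sketched in Python as follows. This is one possible rendering under assumed names (new_node, build_trie, add_failure_links) and an assumed dict-based node layout; the root's own failure pointer is left as None so that the upward walk has a natural stopping point.

```python
from collections import deque


def new_node():
    return {"children": {}, "fail": None, "output": []}


def build_trie(patterns):
    root = new_node()
    for pattern in patterns:
        node = root
        for ch in pattern:
            if ch not in node["children"]:
                node["children"][ch] = new_node()
            node = node["children"][ch]
        node["output"].append(pattern)
    return root


def add_failure_links(root):
    queue = deque()
    # Special case: children of the root fail directly to the root.
    for child in root["children"].values():
        child["fail"] = root
        queue.append(child)
    # BFS: a node is dequeued only after its own failure pointer is known.
    while queue:
        node = queue.popleft()
        for ch, child in node["children"].items():
            # Walk the parent's failure chain until a node has a child on ch.
            fallback = node["fail"]
            while fallback is not None and ch not in fallback["children"]:
                fallback = fallback["fail"]
            # Walked past the root without success: fail to the root.
            child["fail"] = fallback["children"][ch] if fallback is not None else root
            queue.append(child)
    return root


root = add_failure_links(build_trie(["say", "she", "shr", "he", "her"]))
she = root["children"]["s"]["children"]["h"]["children"]["e"]
he = root["children"]["h"]["children"]["e"]
print(she["fail"] is he)  # True: "he" is the longest proper suffix of "she" in the trie
```

For the same pattern set, the node for "sh" fails to the node for "h", while the node for "shr" fails to the root because no pattern starts with "hr" or "r".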
2.3 Searching the text with the AC automaton
Starting from the root node, move down the automaton one character at a time as the text is read.
When the character just read has no matching branch at the current node, follow the failure path repeatedly until a node with a matching branch is found. If the failure path reaches the root and the root has no matching branch either, skip the character and process the next one.
The AC automaton always sits at the longest suffix of the text read so far that is also a prefix of some pattern, so shorter patterns ending at the same position are reachable only through failure pointers. Therefore, after each character is read, follow the failure path from the current node back to the root and report every accepting state on the way; this ensures that all patterns are detected.
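Putting the three steps together, a compact Python sketch might look like the following. The names build_automaton and search and the dict-based node layout are assumptions of this sketch, and the patterns and text are written in lowercase. Matches are reported as (0-based start index, pattern).

```python
from collections import deque


def build_automaton(patterns):
    root = {"children": {}, "fail": None, "output": []}
    # Step 1: build the trie.
    for pattern in patterns:
        node = root
        for ch in pattern:
            if ch not in node["children"]:
                node["children"][ch] = {"children": {}, "fail": None, "output": []}
            node = node["children"][ch]
        node["output"].append(pattern)
    # Step 2: add failure pointers level by level (BFS).
    queue = deque()
    for child in root["children"].values():
        child["fail"] = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node["children"].items():
            fallback = node["fail"]
            while fallback is not None and ch not in fallback["children"]:
                fallback = fallback["fail"]
            child["fail"] = fallback["children"][ch] if fallback is not None else root
            queue.append(child)
    return root


def search(root, text):
    matches = []
    node = root
    for i, ch in enumerate(text):
        # No branch for ch: follow failure pointers until one exists or we hit the root.
        while node is not root and ch not in node["children"]:
            node = node["fail"]
        # If even the root has no branch for ch, stay at the root (character skipped).
        node = node["children"].get(ch, root)
        # Report every pattern ending at position i by walking the failure path to the root.
        probe = node
        while probe is not root:
            for pattern in probe["output"]:
                matches.append((i - len(pattern) + 1, pattern))
            probe = probe["fail"]
    return matches


automaton = build_automaton(["nihao", "hao", "hs", "hsr"])
print(search(automaton, "sdmfhsgnshejfgnihaofhsrnihao"))
# [(4, 'hs'), (14, 'nihao'), (16, 'hao'), (20, 'hs'), (20, 'hsr'), (23, 'nihao'), (25, 'hao')]
```

Walking the failure path after every character keeps the sketch close to the description above. A common refinement, not shown here, is to merge each node's output with that of its failure target while building the automaton, so the search does not have to walk the chain at every position.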
3. Aho-Corasick algorithm code example
Pattern set: {"nihao", "hao", "hs", "hsr"}
Text to search: "sdmfhsgnshejfgnihaofhsrnihao"
Output:
(The two figures above are taken from: http://www.cppblog.com/mythit/archive/2009/04/21/80633.html)