In the spirit of being lazy whenever I can get away with it, I had not planned to look into AC automata at all. But HanLP's source code implements user-defined dictionary recognition with an AC automaton, so there is no way around it. Let's take a look.
Theory of AC automata
The Aho-Corasick automaton is called the AC automaton for short. To learn AC automata, you first need to know what a trie is, that is, a dictionary tree. A trie, also known as a word search tree or key tree, is a tree-shaped structure and a variant of the hash tree. It is typically used to count and sort large numbers of strings (though it is not limited to strings), so search engine systems often use it for word-frequency statistics. Its advantage is that it minimizes unnecessary string comparisons, which can make lookups more efficient than with a hash table. I have previously written articles about the trie and the double-array trie (DAT, hereafter).
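As a reminder of what a trie looks like in code, here is a minimal sketch. The class and method names (TrieNode, Trie, insert, contains) are made up for illustration and are not taken from HanLP; children are kept in a plain dict rather than in a double array.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            # Reuse the existing branch, or create it when it is missing.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie()
for w in ["fg", "he", "hers", "his", "she"]:
    trie.insert(w)
```

Words that share a prefix, such as "he" and "hers", share the nodes for that prefix, which is where the saving in string comparisons comes from.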
Advantages of AC automata
An AC automaton has all the advantages of a trie and is more powerful still. Consider the following problem:
Given a long string S and a pattern string T, does T appear in S?
The naive algorithm for this problem walks through S, comparing its characters with T one by one; on a mismatch it moves on to the next character of S and starts over. The worst-case time complexity of this algorithm is O(len(S) * len(T)).
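The naive approach can be sketched in a few lines. The function name naive_find is an illustrative choice; it returns the first position at which T occurs in S, or -1.

```python
def naive_find(s, t):
    """Return the first index at which t occurs in s, or -1 if it never does."""
    n, m = len(s), len(t)
    for i in range(n - m + 1):      # every possible start position in s
        j = 0
        while j < m and s[i + j] == t[j]:
            j += 1
        if j == m:                  # all m characters matched
            return i
    return -1
```

The two nested loops are what give the O(len(S) * len(T)) worst case.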
The well-known KMP algorithm avoids restarting the character-by-character comparison: on a mismatch it consults the next array to decide where to resume, which brings the complexity down to O(len(S) + len(T)).
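A minimal KMP sketch follows. Conventions for the next array differ between textbooks; this version assumes next_[i] is the length of the longest proper prefix of t[:i+1] that is also a suffix of it. The names build_next and kmp_find are illustrative.

```python
def build_next(t):
    """next_[i] = length of the longest proper prefix of t[:i+1] that is also its suffix."""
    next_ = [0] * len(t)
    k = 0
    for i in range(1, len(t)):
        while k > 0 and t[i] != t[k]:
            k = next_[k - 1]        # fall back to a shorter border
        if t[i] == t[k]:
            k += 1
        next_[i] = k
    return next_


def kmp_find(s, t):
    """Return the first index at which t occurs in s, or -1."""
    if not t:
        return 0
    next_ = build_next(t)
    k = 0                           # number of pattern characters matched so far
    for i, ch in enumerate(s):
        while k > 0 and ch != t[k]:
            k = next_[k - 1]        # reuse what already matched, never rewind i
        if ch == t[k]:
            k += 1
        if k == len(t):
            return i - k + 1
    return -1
```

The index into S never moves backwards; only the position in T falls back. The fail table of an AC automaton plays the same role for many patterns at once.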
Compared with a DAT, the advantage of an AC automaton is the added fail table, which removes the needless backtracking that multi-pattern matching on a DAT incurs. (HanLP's author later ran an experiment, however, and found it to be less efficient than the DAT.)
The construction of the AC automaton
Now that we know what an AC automaton can do, let's look at its structure. Before constructing one, first look at a diagram of an AC automaton. The trie built from the dictionary {fg, he, hers, his, she} is shown in the figure:
At first glance this diagram looks very complex, but it is really just a trie in which every node has an extra fail pointer, drawn as a dashed line.
An AC automaton is built on top of a trie. The two elements of a DAT are the base table and the check table, while the three elements of an AC automaton are the goto table, the fail table and the output table:
Goto table: the branches from a state to its child nodes
Fail table: when matching fails in a state, fall back to the state that fail points to and continue matching from there
Output table: the output for each state
Putting an elephant into a refrigerator takes 3 steps, and so does constructing an AC automaton:
1. Build the trie structure for the dictionary. Building the trie also builds the goto table (that is, the pointers to child nodes). The output table is built in a preliminary form at this point and is expanded later according to the fail pointers. The content inside "[]" is the preliminary output table, holding the corresponding subscripts in the dictionary.
2. The trie generated in step 1 has a tree structure, but every status value is still 0. Next, construct the DAT from the trie and assign a value to each state's status.
Note that during construction, a node that completes a word (a green node in the diagram) is given an extra child node when the DAT is built. This child marks the end of a word, similar to the leaf-node structure mentioned in the previous article: the transition character to the added child is taken to be 0, and that gives the new node's state value.
3. Traverse the trie, which now has status values, and construct the fail table and the output table.
Principle for constructing the failure pointers: suppose a node is reached from its parent by the letter C. Starting from the parent's failure pointer, keep following failure pointers until you reach a node that also has a branch for the letter C; the current node's failure pointer then points to the child of that node along C. If the walk arrives at root and root has no such branch either, the failure pointer points to root. The principle is the same as in the KMP algorithm; if it is unclear, look it up on Google. The finished result is the first figure above.
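Steps 1 and 3 can be sketched as follows. This is a simplified illustration and not HanLP's implementation: it skips step 2 entirely and stores the tables in plain Python lists and dicts instead of a double array, so state numbers are just insertion order. The names build_ac and find_all are hypothetical.

```python
from collections import deque


def build_ac(words):
    """Build the goto, fail and output tables; state 0 is root."""
    goto = [{}]     # goto[state][char] -> child state
    output = [[]]   # output[state] -> subscripts into `words`
    # Step 1: build the trie (goto table) and the preliminary output table.
    for index, word in enumerate(words):
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(index)

    # Step 3: breadth-first traversal, so a parent's fail is known before its children's.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # children of root keep fail = root
    while queue:
        parent = queue.popleft()
        for ch, child in goto[parent].items():
            f = fail[parent]
            while f != 0 and ch not in goto[f]:
                f = fail[f]           # keep following failure pointers
            fail[child] = goto[f].get(ch, 0)
            # Expand the output table with everything the fail target outputs.
            output[child] = output[child] + output[fail[child]]
            queue.append(child)
    return goto, fail, output


def find_all(text, words, tables):
    """Return (start, word) for every dictionary word occurring in text."""
    goto, fail, output = tables
    state = 0
    hits = []
    for end, ch in enumerate(text):
        while state != 0 and ch not in goto[state]:
            state = fail[state]       # mismatch: fall back without rewinding the text
        state = goto[state].get(ch, 0)
        for index in output[state]:
            hits.append((end - len(words[index]) + 1, words[index]))
    return hits


words = ["fg", "he", "hers", "his", "she"]
tables = build_ac(words)
hits = find_all("ushers", words, tables)
# hits == [(1, "she"), (2, "he"), (2, "hers")]
```

In this sketch the state for "she" fails to the state for "he", which is why reading "she" in the text reports both words in a single pass.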
After the construction is complete, the contents of each table are:
At this point the AC automaton is fully constructed. The next step is to see how to use it for word segmentation.
Resources
Aho-Corasick automaton combined with DoubleArrayTrie for fast multi-pattern matching
Chinese Word Segmentation Series (II): an AC automaton based on the double-array trie