Chinese Word Segmentation Series (II): AC Automaton Based on the Double-Array Trie


Faithful to the spirit of being lazy whenever possible, I originally had no intention of looking at AC automata. But the user-defined dictionary recognition in the HanLP source code is implemented with an AC automaton, so there was no way around it: I had to take a look.

Theory of AC automata

The Aho-Corasick automaton is called the AC automaton for short. To learn it, you first need to know what a trie is, that is, a dictionary tree. A trie, also known as a word search tree or key tree, is a tree-shaped structure and a variant of the hash tree. It is typically used to count and sort large numbers of strings (though not only strings), so search engine systems often use it for word frequency statistics. Its advantage is that it minimizes unnecessary string comparisons, and its query efficiency can be higher than that of a hash table. I have previously written articles about the trie and the double-array trie (DAT for short hereafter).
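As a reminder of what a trie looks like in code, here is a minimal sketch. The names TrieNode, Trie, insert and contains are made up for this illustration and are not taken from HanLP; children are kept in a plain dict rather than in double arrays.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            # Reuse the existing branch for a shared prefix, or create one.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie(["fg", "he", "hers", "his", "she"])
print(trie.contains("hers"))  # True
print(trie.contains("her"))   # False: "her" is only a prefix, not a word
```

Words with a common prefix share the same path from the root, which is where the saving in string comparisons comes from.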

Advantages of AC automata

The AC automaton has all the advantages of the trie and is more powerful besides. Consider the following problem:

Given a long string S and a pattern string T, does T appear in S?

The naive algorithm traverses S, comparing its characters with T one by one and, on a mismatch, moving on to the next character of S. Its worst-case time complexity is O(len(S) * len(T)).
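A minimal sketch of the naive approach, with naive_find as a name chosen only for this example:

```python
def naive_find(s, t):
    """Return the first index at which t occurs in s, or -1 if it never does."""
    n, m = len(s), len(t)
    for i in range(n - m + 1):      # try every possible start position in s
        j = 0
        while j < m and s[i + j] == t[j]:
            j += 1
        if j == m:                  # all m characters matched
            return i
    return -1


print(naive_find("ushers", "she"))  # 1
print(naive_find("ushers", "fg"))   # -1
```

The nested loops are what produce the O(len(S) * len(T)) worst case: after each mismatch the comparison restarts from the beginning of T.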

The well-known KMP algorithm skips these one-by-one restarts: on a mismatch it jumps directly according to the contents of the next array, which brings the complexity down to O(len(S) + len(T)).
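A short sketch of KMP follows. It uses one common variant of the next array, in which nxt[i] is the length of the longest proper prefix of t[:i+1] that is also its suffix; other texts shift or offset this array differently. The function names are illustrative only.

```python
def build_next(t):
    """nxt[i] = length of the longest proper prefix of t[:i+1] that is also a suffix."""
    nxt = [0] * len(t)
    k = 0
    for i in range(1, len(t)):
        while k > 0 and t[i] != t[k]:
            k = nxt[k - 1]          # fall back to a shorter matching prefix
        if t[i] == t[k]:
            k += 1
        nxt[i] = k
    return nxt


def kmp_find(s, t):
    """Return the first index at which t occurs in s, or -1."""
    if not t:
        return 0
    nxt = build_next(t)
    k = 0                           # number of pattern characters matched so far
    for i, ch in enumerate(s):
        while k > 0 and ch != t[k]:
            k = nxt[k - 1]          # reuse what already matched; never move back in s
        if ch == t[k]:
            k += 1
        if k == len(t):
            return i - k + 1
    return -1


print(build_next("abab"))              # [0, 0, 1, 2]
print(kmp_find("abababc", "ababc"))    # 2
```

The index into s only ever moves forward; the fail table of an AC automaton plays the same role as the next array, but over a whole dictionary of patterns instead of a single one.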

Compared with the DAT, the AC automaton adds a fail table, which eliminates the needless backtracking that the DAT performs during multi-pattern matching. (However, the HanLP author later ran an experiment and found it was not as efficient as the DAT.)

The structure of the AC automaton

Now that we know how powerful the AC automaton is, let's look at its structure. Before building one, first look at a diagram of an AC automaton; the trie built from the dictionary {fg, he, hers, his, she} is as shown:

At first glance the diagram looks complicated, but it is really just a trie in which every node additionally carries a fail pointer, drawn as a dashed line.

Building the AC automaton

The AC automaton is built on top of the trie. The two elements of a DAT are the base and check tables, while the three elements of an AC automaton are the goto table, the fail table, and the output table.

Goto table: the branches of a node and the child nodes they lead to

Fail table: when matching fails in a state, fall back to the state the fail pointer indicates and continue matching from there

Output table: the output of each state

Putting an elephant into a refrigerator takes 3 steps, and so does building an AC automaton:

1. Build the trie structure for the dictionary. Building the trie also builds the goto table (that is, the pointers to child nodes) and a preliminary output table, which is later extended by following the fail pointers. In the diagram, the content in "[]" is the preliminary output table, given as the word's subscript in the dictionary.
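Step 1 can be sketched as follows. This is an illustration rather than HanLP's code: build_goto is a hypothetical name, and states are simply numbered in creation order here instead of receiving DAT state values.

```python
def build_goto(words):
    """Build the goto table and the preliminary output table for a dictionary."""
    goto = [{}]      # goto[state][char] -> next state; state 0 is the root
    output = [[]]    # output[state] -> subscripts of dictionary words ending here
    for index, word in enumerate(words):
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1   # allocate a new state
            state = goto[state][ch]
        output[state].append(index)
    return goto, output


words = ["fg", "he", "hers", "his", "she"]
goto, output = build_goto(words)
print(len(goto))     # 12 states, including the root
print(output[4])     # [1]: state 4 is reached by "he", which is words[1]
```

At this stage the output of the state reached by "she" holds only the subscript of "she"; the subscript of "he" is added to it once the fail pointers exist.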

2. The trie generated in step 1 has a tree structure, but every node's state value is still 0. Next, build the DAT from the trie and assign a value to each node's state.

Note that during construction, a node that is itself a word (a green node in the diagram) gets one extra child node when the DAT is built. This child marks the end of the word, similar to the leaf node structure mentioned in the previous article. The transition character to this added child is taken to be 0, and that determines the new node's state value.
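The idea of step 2 can be sketched as below, under simplifying assumptions that are mine and not HanLP's: base and check are sparse dicts instead of arrays, the base of each node is found by a first-fit search, and a character's code is its Unicode code point. The functions build_dat and dat_contains are hypothetical names. A child's state is base[parent] + code, and check[child] records the parent.

```python
def build_dat(words):
    """Assign double-array state values to the trie of `words` (simplified sketch)."""
    trie = {}                         # nested dicts; the int key 0 marks "word ends here"
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[0] = {}                  # extra end-of-word child, transition character 0

    def code(c):
        return 0 if c == 0 else ord(c)

    base, check = {}, {0: 0}          # the root is state 0
    queue = [(0, trie)]
    while queue:
        state, node = queue.pop(0)
        if not node:
            continue                  # end-of-word leaf: no outgoing transitions
        codes = [code(c) for c in node]
        b = 1
        while any(b + k in check for k in codes):
            b += 1                    # first base whose target slots are all free
        base[state] = b
        for c, child in node.items():
            check[b + code(c)] = state            # slot b + code now belongs to `state`
            queue.append((b + code(c), child))
    return base, check


def dat_contains(base, check, word):
    state = 0
    for k in [ord(ch) for ch in word] + [0]:      # the final 0 tests for end of word
        if state not in base:
            return False
        nxt = base[state] + k
        if check.get(nxt) != state:
            return False
        state = nxt
    return True


words = ["fg", "he", "hers", "his", "she"]
base, check = build_dat(words)
print(all(dat_contains(base, check, w) for w in words))  # True
print(dat_contains(base, check, "her"))                  # False
```

Because the end-of-word transition uses code 0, the state of that extra child is exactly the base value of the word node.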

3. Traverse the trie, which now has state values, and construct the fail table and the final output table.

How the fail pointer is constructed: suppose a node is reached from its parent by the letter C. Walk along the parent's fail pointer until reaching a node that also has a branch for the letter C, then point the current node's fail pointer at the child that branch leads to. If the walk reaches the root without finding such a node, the fail pointer points to the root. The principle is the same as in the KMP algorithm; if it is unclear, a web search will help. The finished result is the first diagram above.
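The rule above can be sketched with a breadth-first traversal, so that a parent's fail pointer is always known before its children are processed. As before, this is an illustration with made-up names (build_automaton, find_all) and states numbered in creation order rather than by DAT state values.

```python
from collections import deque


def build_automaton(words):
    """Build goto, fail and output tables for a dictionary of words."""
    goto, output = [{}], [[]]
    for index, word in enumerate(words):
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(index)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())        # children of the root fail to the root
    while queue:
        parent = queue.popleft()
        for ch, child in goto[parent].items():
            f = fail[parent]
            while f != 0 and ch not in goto[f]:
                f = fail[f]                # keep walking along fail pointers
            fail[child] = goto[f].get(ch, 0)
            output[child] += output[fail[child]]   # extend output via the fail pointer
            queue.append(child)
    return goto, fail, output


def find_all(text, words, goto, fail, output):
    """Return (start index, word) for every dictionary word occurring in text."""
    hits, state = [], 0
    for i, ch in enumerate(text):
        while state != 0 and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for index in output[state]:
            hits.append((i - len(words[index]) + 1, words[index]))
    return hits


words = ["fg", "he", "hers", "his", "she"]
goto, fail, output = build_automaton(words)
print(find_all("ushers", words, goto, fail, output))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

In "ushers", the state reached by "she" fails to the state reached by "he", which is why "he" is reported at the same position without rescanning the text.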

After the construction is complete, the contents of each table are:

At this point the AC automaton is fully constructed. The next step is to see how to use it for word segmentation.

