The Aho-Corasick multi-pattern matching algorithm: AC automata in detail

Source: Internet
Author: User

The Aho-Corasick algorithm is a classic multi-pattern matching algorithm and is widely used in practice.

The data structure behind the Aho-Corasick algorithm is the Aho-Corasick automaton, AC automaton for short.

Programmers are generally familiar with finite automata (FA), which subdivide into deterministic finite automata (DFA) and non-deterministic finite automata (NFA). An ordinary automaton cannot perform multi-pattern matching by itself. The AC automaton adds failure transitions: when no ordinary transition is available, it moves to the state that corresponds to a suffix of the text read so far, and this is what makes multi-pattern matching possible.

1. Multi-pattern matching

Multi-pattern matching means: given multiple pattern strings p1, p2, p3, ..., pm, find every position at which any of these patterns occurs in a continuous text T[1..n].

For example: find every occurrence of the patterns in the set {"nihao", "hao", "hs", "hsr"} in the given text "sdmfhsgnshejfgnihaofhsrnihao".
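To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the Aho-Corasick algorithm; it simply scans the text once per pattern and can serve as a reference result to compare against. The function name `find_all_naive` is an illustrative choice, positions are 0-based, and the patterns and text are the lowercase forms of the example above.

```python
def find_all_naive(text, patterns):
    """Return sorted (start, pattern) pairs for every occurrence of every pattern."""
    hits = []
    for p in patterns:
        start = text.find(p)
        while start != -1:
            hits.append((start, p))
            # Advance by one so that overlapping occurrences are also found.
            start = text.find(p, start + 1)
    return sorted(hits)


text = "sdmfhsgnshejfgnihaofhsrnihao"
patterns = ["nihao", "hao", "hs", "hsr"]
for start, p in find_all_naive(text, patterns):
    print(start, p)
```

This rescans the whole text for each pattern, which is the repeated work the AC automaton avoids by reading the text only once.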

2. Aho-Corasick algorithm

Using the Aho-corasick algorithm requires three steps:

1. Build a trie from the pattern set

2. Add failure paths to the trie

3. Search the text to be matched using the AC automaton

These three steps are described below:

2.1 Building the trie for the pattern set

A trie is itself an automaton. For the pattern set {"say", "she", "shr", "he", "her"}, the corresponding trie is shown below, where the circles marked in red denote accepting states:
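The trie for this pattern set can be sketched in Python as follows. The representation (two parallel lists indexed by node id, with node 0 as the root) and the names `build_trie`, `children` and `accepting` are assumptions made for illustration, not something prescribed by the algorithm.

```python
def build_trie(patterns):
    """Build a trie as parallel lists indexed by node id (0 is the root)."""
    children = [{}]     # children[node] maps a letter to a child node id
    accepting = [None]  # accepting[node] is the pattern that ends here, if any
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in children[node]:
                # Create a new node for a letter not seen at this position yet.
                children.append({})
                accepting.append(None)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        accepting[node] = p  # mark the accepting state
    return children, accepting


children, accepting = build_trie(["say", "she", "shr", "he", "her"])
print(len(children))                 # 10 nodes, including the root
print([p for p in accepting if p])   # ['say', 'she', 'shr', 'he', 'her']
```

Patterns that share a prefix share nodes: "she" and "shr" both pass through the nodes for "s" and "sh", which is why five patterns with 14 letters in total need only 9 non-root nodes.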

  

2.2 Adding failure paths to the trie to build the AC automaton

The construction of a failure pointer can be summed up in one sentence: let C be the letter on the current node; starting from the failure pointer of its parent, keep following failure pointers until you reach a node that has a child labeled C, then point the current node's failure pointer at that child. If no such node is found even at the root, point the failure pointer at the root.

A breadth-first search (BFS) traverses the nodes level by level and sets each node's failure path.

Special case: the second layer needs special handling. The failure path of every node in this layer points directly to its parent (i.e. the root node).
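The construction just described can be sketched in Python as follows. This is a minimal sketch under assumptions: the list-based node layout and the names `build_automaton`, `children`, `fail`, `accepting` and `node_of` are illustrative. The trie is built first, and the failure pointers are then filled in by BFS.

```python
from collections import deque


def build_automaton(patterns):
    """Return (children, fail, accepting); node 0 is the root."""
    children, fail, accepting = [{}], [0], [None]
    for p in patterns:  # step 1: the trie
        node = 0
        for ch in p:
            if ch not in children[node]:
                children.append({})
                fail.append(0)
                accepting.append(None)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        accepting[node] = p
    # Step 2: failure pointers, level by level.
    # Second layer (children of the root): fail already points at the root.
    queue = deque(children[0].values())
    while queue:
        parent = queue.popleft()
        for ch, node in children[parent].items():
            # Follow the parent's failure pointers until a node has a child ch.
            f = fail[parent]
            while f != 0 and ch not in children[f]:
                f = fail[f]
            # If even the root has no child ch, fall back to the root.
            fail[node] = children[f].get(ch, 0)
            queue.append(node)
    return children, fail, accepting


def node_of(children, word):
    """Helper for the demo: id of the node reached by spelling out word."""
    node = 0
    for ch in word:
        node = children[node][ch]
    return node


children, fail, accepting = build_automaton(["say", "she", "shr", "he", "her"])
print(fail[node_of(children, "she")] == node_of(children, "he"))  # True
print(fail[node_of(children, "sh")] == node_of(children, "h"))    # True
print(fail[node_of(children, "shr")])                             # 0 (the root)
```

BFS order matters here: when a node is processed, the failure pointers of all shallower nodes, which are the only ones the inner loop can visit, are already final.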

2.3 Searching the text with the AC automaton

Starting from the root node, move down the automaton according to each character read.

When the current node has no branch for the character just read, follow the failure path repeatedly until a node with such a branch is found. If the failure path reaches the root node and the root has no such branch either, the character is skipped and the next character is processed.

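The AC automaton always follows the longest suffix of the text read so far that is also a prefix of some pattern. Shorter patterns that end at the same position are therefore found by following the failure path from the current node until the root node is reached, checking for accepting states along the way. Doing this after each character is read ensures that all patterns are detected.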
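Putting the three steps together, a complete search might look like the sketch below. It repeats the construction so that the block runs on its own. The names `build_automaton` and `search` and the list-based node layout are illustrative assumptions, reported positions are 0-based start indices, and the input is the lowercase form of the example from section 1.

```python
from collections import deque


def build_automaton(patterns):
    """Steps 1 and 2: trie plus failure pointers; node 0 is the root."""
    children, fail, accepting = [{}], [0], [None]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in children[node]:
                children.append({})
                fail.append(0)
                accepting.append(None)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        accepting[node] = p
    queue = deque(children[0].values())
    while queue:
        parent = queue.popleft()
        for ch, node in children[parent].items():
            f = fail[parent]
            while f != 0 and ch not in children[f]:
                f = fail[f]
            fail[node] = children[f].get(ch, 0)
            queue.append(node)
    return children, fail, accepting


def search(text, patterns):
    """Step 3: return (start, pattern) pairs in the order they are found."""
    children, fail, accepting = build_automaton(patterns)
    hits = []
    node = 0
    for i, ch in enumerate(text):
        # No branch for ch: follow failure pointers until one exists
        # or the root is reached.
        while node != 0 and ch not in children[node]:
            node = fail[node]
        # At the root with no branch, the character is simply skipped.
        node = children[node].get(ch, 0)
        # Walk the failure path to the root to report every pattern ending at i.
        m = node
        while m != 0:
            if accepting[m] is not None:
                hits.append((i - len(accepting[m]) + 1, accepting[m]))
            m = fail[m]
    return hits


text = "sdmfhsgnshejfgnihaofhsrnihao"
for start, p in search(text, ["nihao", "hao", "hs", "hsr"]):
    print(start, p)
```

The text is read once from left to right. When "nihao" is matched, the failure path from its last node leads to the last node of "hao", so both patterns are reported at the same position without rescanning the text.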

3. Aho-Corasick algorithm code example

Pattern set: {"nihao", "hao", "hs", "hsr"}

Text to search: "sdmfhsgnshejfgnihaofhsrnihao"

Output:

  

(The two figures above are taken from: http://www.cppblog.com/mythit/archive/2009/04/21/80633.html)
