Aho-Corasick automaton algorithm (the AC algorithm explained)

Source: Internet
Author: User
Tags: goto

To understand this algorithm, you should have a basic grounding in finite-state automata.


The basic idea of the algorithm is this:
In the preprocessing stage, the AC algorithm builds three functions from the keyword set: the goto function (also called the steering function), the failure function, and the output function. Together they define a tree-shaped finite automaton.
In the search stage, the text is scanned once and these three functions are used together to locate every occurrence of every keyword in the text.


This algorithm has two notable properties. First, it scans the text without backtracking. Second, the scan takes O(n) state transitions for a text of length n, independent of the number and length of the keywords (reporting the matches themselves takes additional time proportional to their number).



Tree-shaped finite automaton for the pattern set {he, she, his, hers}

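Goto function (state 0 loops back to itself on every character other than h and s):

    0 --h--> 1 --e--> 2 --r--> 8 --s--> 9
             1 --i--> 6 --s--> 7
    0 --s--> 3 --h--> 4 --e--> 5

Fail function:

    i      1  2  3  4  5  6  7  8  9
    f(i)   0  0  0  1  2  0  3  0  3
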

Output function:

    i      output(i)
    2      {he}
    5      {she, he}
    7      {his}
    9      {hers}


3. Construction of the goto, failure and output functions
We now show how to build the goto (steering), failure, and output functions from a set of keywords. The construction has two parts: the first determines the states and the goto function, and the second computes the failure function. The output function is started in the first part and completed in the second.
To construct the goto function, we build a state transition graph. Initially the graph contains a single vertex, representing state 0. Each keyword p is then entered in turn by adding a path that starts at the start state: new vertices and edges are added so that there is a path spelling out p, and p is added to the output function of the state at the end of that path. New edges are added only when necessary, that is, when the existing path cannot be followed any further.


First, the matching algorithm:

Algorithm 1. Pattern matching machine
Input: a text x = a1 a2 ... an and a pattern matching automaton M (with goto function g(), failure function f(), and output function output())
Output: the patterns that occur in the text and their positions.

state ← 0
for i ← 1 until n do              // consume text character ai
    while g(state, ai) = fail
        state ← f(state)          // follow failure links until a transition exists; state 0 never fails
    state ← g(state, ai)          // move to the next state
    if output(state) ≠ empty      // one or more patterns end here
        print i
        print output(state)
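
As a rough illustration of Algorithm 1, the sketch below hard-codes the goto, failure, and output tables of the example automaton for {he, she, his, hers} and runs the matching loop over them. The names GOTO, FAIL, OUTPUT, g, and search are made up for this sketch, and the dictionary representation is an assumption, not part of the algorithm description.

```python
# Hand-written tables for the example pattern set {he, she, his, hers}.
# State numbers follow the worked example in this article.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAIL = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, ch):
    """Goto function: None stands for 'fail'; state 0 never fails."""
    nxt = GOTO.get((state, ch))
    if nxt is None and state == 0:
        return 0
    return nxt


def search(text):
    """Return (end position, patterns) pairs; positions are 1-based as in Algorithm 1."""
    hits = []
    state = 0
    for i, ch in enumerate(text, start=1):
        # Follow failure links until a transition on ch exists.
        while g(state, ch) is None:
            state = FAIL[state]
        state = g(state, ch)
        if state in OUTPUT:
            hits.append((i, sorted(OUTPUT[state])))
    return hits
```

For example, search("ushers") reports he and she ending at position 4 and hers ending at position 6. The text index i only ever moves forward, which is the "no backtracking" property described earlier.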



Building the goto function:

Algorithm 2. Construction of the goto function
Input: a set of patterns K = {y1, y2, ..., yk}
Output: the goto function g and a partially computed output function output

/*
We assume output(s) is empty when state s is first created, and g(s, a) = fail if a is
undefined or if g(s, a) has not yet been defined. The procedure enter(y) inserts into the
goto graph a path that spells out y.
*/

newstate ← 0
for i ← 1 until k                   // insert each pattern yi into the automaton
    enter(yi)
for all a such that g(0, a) = fail
    g(0, a) ← 0                     // state 0 loops to itself on every other character


enter(a1 a2 ... am)
{
    state ← 0; j ← 1
    while g(state, aj) ≠ fail       // follow the existing path as far as it goes
        state ← g(state, aj)
        j ← j + 1

    for p ← j until m               // extend a new path for the remaining characters
        newstate ← newstate + 1
        g(state, ap) ← newstate
        state ← newstate

    output(state) ← {a1 a2 ... am}  // the state at the end of the path outputs this pattern
}
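
A minimal Python sketch of Algorithm 2 follows, assuming the goto function is stored as a dictionary keyed by (state, character), with a missing key standing for fail. The names build_goto and g are hypothetical. Instead of filling in g(0, a) ← 0 for every character of the alphabet, the lookup helper treats state 0 as looping to itself.

```python
def build_goto(patterns):
    """Build the goto table and the partial output table (Algorithm 2)."""
    goto = {}      # (state, char) -> state; a missing key means 'fail'
    output = {}    # state -> set of patterns that end at that state
    newstate = 0
    for pattern in patterns:
        state = 0
        j = 0
        # Follow the existing path as far as it goes.
        while j < len(pattern) and (state, pattern[j]) in goto:
            state = goto[(state, pattern[j])]
            j += 1
        # Create new states for the remaining characters.
        for ch in pattern[j:]:
            newstate += 1
            goto[(state, ch)] = newstate
            state = newstate
        output.setdefault(state, set()).add(pattern)
    return goto, output


def g(goto, state, ch):
    """Goto lookup: None means 'fail'; state 0 loops to itself instead of failing."""
    nxt = goto.get((state, ch))
    return 0 if nxt is None and state == 0 else nxt
```

Entering the patterns in the order he, she, his, hers gives the state numbering used in this article: he ends at state 2, she at 5, his at 7, and hers at 9.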

Construction of the failure function (this part is more abstract):

The failure function is not defined for state 0. The construction starts by setting f(s) = 0 for every state s of depth 1; the failure values of all states of depth d are then computed from the states of depth d-1.

Specifically, to compute the failure values of all states of depth d, we consider each state r of depth d-1:

1. If g(r, a) = fail for every character a, do nothing. In that case r is a leaf of the trie.

2. Otherwise, for each character a with g(r, a) = s, perform the following three steps:

(a) Set state = f(r), that is, start from the failure state of r.

(b) Execute state = f(state) zero or more times, until g(state, a) != fail. Such a state is always found, because g(0, a) != fail for every a.

(c) Set f(s) = g(state, a). In other words, f(s) is the state reached from that state by the transition on character a.

Example analysis:

First, for the states of depth 1 we set f(1) = f(3) = 0. We then consider the depth-2 nodes 2, 6 and 4.

When calculating f(2), we set state = f(1) = 0; because g(0, e) = 0, f(2) = 0.

When calculating f(6), we set state = f(1) = 0; because g(0, i) = 0, f(6) = 0.

When calculating f(4), we set state = f(3) = 0; because g(0, h) = 1, f(4) = 1.

Next we consider the depth-3 nodes 8, 7 and 5.

When calculating f(8), we set state = f(2) = 0; because g(0, r) = 0, f(8) = 0.

When calculating f(7), we set state = f(6) = 0; because g(0, s) = 3, f(7) = 3.

When calculating f(5), we set state = f(4) = 1; because g(1, e) = 2, f(5) = 2.

Finally we consider node 9, of depth 4.

When calculating f(9), we set state = f(8) = 0; because g(0, s) = 3, f(9) = 3.

Specific algorithm:

Algorithm 3. Construction of the failure function
Input: the goto function g and output function output from Algorithm 2
Output: the failure function f and the completed output function output

queue ← empty
for each a such that g(0, a) = s and s ≠ 0        // enqueue every depth-1 state
    list_add_tail(queue, s)
    f(s) ← 0

while queue ≠ empty {
    r ← pop(queue)                                // r is the state at the head of the queue
    for each a such that g(r, a) = s and s ≠ fail // s is a child of r, so depth(s) = depth(r) + 1
        list_add_tail(queue, s)                   // enqueue the new state
        state ← f(r)                              // start from the failure state of r
        while g(state, a) = fail                  // follow failure links until a transition on a exists
            state ← f(state)
        f(s) ← g(state, a)                        // the failure state of s
        output(s) ← output(s) ∪ output(f(s))      // patterns ending at f(s) also end at s
}
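
The breadth-first construction can be sketched in Python as below. This is an illustrative sketch under the same assumptions as before (a dictionary-based goto table, a missing key meaning fail, state 0 looping to itself); build_automaton and the children helper table are hypothetical names. The sketch rebuilds the goto table first so that it is self-contained, and it merges output(f(s)) into output(s), which is how state 5 comes to output both she and he.

```python
from collections import deque


def build_automaton(patterns):
    """Return (goto, fail, output) for the given patterns."""
    goto, output, newstate = {}, {}, 0
    # Part 1: goto function (a trie); a missing key means 'fail'.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if (state, ch) not in goto:
                newstate += 1
                goto[(state, ch)] = newstate
            state = goto[(state, ch)]
        output.setdefault(state, set()).add(pattern)

    # Children of each state, to iterate over "each a such that g(r, a) = s".
    children = {}
    for (state, ch), nxt in goto.items():
        children.setdefault(state, []).append((ch, nxt))

    # Part 2: failure function, computed breadth-first, one depth at a time.
    fail = {}
    queue = deque()
    for _, s in children.get(0, []):
        fail[s] = 0                      # depth-1 states fail to state 0
        queue.append(s)
    while queue:
        r = queue.popleft()
        for a, s in children.get(r, []):
            queue.append(s)
            state = fail[r]
            # Follow failure links until a transition on a exists; state 0 never fails.
            while state != 0 and (state, a) not in goto:
                state = fail[state]
            fail[s] = goto.get((state, a), 0)
            # Patterns that end at fail[s] also end at s.
            inherited = output.get(fail[s])
            if inherited:
                output.setdefault(s, set()).update(inherited)
    return goto, fail, output
```

For the pattern list he, she, his, hers, the resulting failure values agree with the worked example: f(4) = 1, f(5) = 2, f(7) = f(9) = 3, and 0 for the remaining states.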


