The classic Aho-Corasick (AC) multi-pattern matching algorithm

Source: Internet
Author: User

From: http://blog.csdn.net/ijuliet/article/details/4210858

Today we will talk about the multi-pattern matching AC algorithm (Aho and Corasick). Thanks to chase for helping sort out the materials: while (1) {Juliet.Say("3q");}. We have already covered the BM and Wu-Manber algorithms. WM is derived from BM, but AC is unrelated to both of them. It is a different approach to matching.

1. A first look at the AC algorithm

Step 1: Build a finite state automaton from the set of patterns (so that multiple patterns can be matched at the same time).

Step 2: Feed the text to be matched to the automaton as input, and output every pattern that occurs in the text together with its position.

 

 

The behavior of the automaton is defined by three functions:

(1) A goto function

(2) A failure function

(3) An output function

 

First, let's look at these three functions through a concrete example and see how the automaton operates. The goal here is a general impression; the details are explained later. The pattern set is {he, she, his, hers}, and we want to find all matches in the text "ushers".

 

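0 -h-> 1 -e-> 2 -r-> 8 -s-> 9
       1 -i-> 6 -s-> 7
0 -s-> 3 -h-> 4 -e-> 5

g(0, a) = 0 for every character a other than h and s; all other undefined transitions are fail.

(1) goto function (the original figure is missing; the transitions above are reconstructed from the rest of this article)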

 

 

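i     1  2  3  4  5  6  7  8  9
f(i)  0  0  0  1  2  0  3  0  3

(Notice the pattern: the string that leads to state f(i) is the longest proper suffix of the string that leads to state i that is also a prefix of some pattern.)

(2) failure function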

 

 

i    output(i)
2    {he}
5    {she, he}
7    {his}
9    {hers}

(3) output function

 

 

We start in state 0 and read the first character of the text, u. The goto function gives g(0, u) = 0, so we stay in state 0. Next we read the second character, s, and move to state 3. output(3) is empty, so no pattern has been matched. We continue with h and move to state 4, where the output is again empty. We continue with e and move to state 5. output(5) contains the two patterns she and he, so we report them together with their position in the text. Next we read r, but g(5, r) = fail. We therefore consult the failure function, find f(5) = 2, and move to state 2. From there r leads to state 8, and the final s leads to state 9. output(9) = {hers}, so we record that match and its position. At this point the whole text ushers has been processed.

 

The specific matching algorithm is as follows:

Algorithm 1. Pattern Matching Machine

Input: A text string x = a1 a2 ... an and a pattern matching automaton M with goto function g(), failure function f(), and output function output().

Output: The patterns that occur in the text and their locations.

state ← 0
for i ← 1 until n do                    // consume the text character ai
    while g(state, ai) = fail do
        state ← f(state)                // fall back until a move is possible; state 0 never fails
    state ← g(state, ai)                // move to the next state
    if output(state) ≠ empty then       // there is something to report
        print i
        print output(state)
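The matching loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch rather than the original author's code: the three tables for {he, she, his, hers} are hard-coded from the example, the names `GOTO`, `FAILURE`, `OUTPUT`, `g`, and `ac_search` are chosen here for illustration, and `None` stands in for fail. Positions are the 1-based index of the last character of each match, as in the pseudocode.

```python
# Tables for the pattern set {he, she, his, hers}, copied from the example.
GOTO = {
    0: {"h": 1, "s": 3},
    1: {"e": 2, "i": 6},
    2: {"r": 8},
    3: {"h": 4},
    4: {"e": 5},
    5: {},
    6: {"s": 7},
    7: {},
    8: {"s": 9},
    9: {},
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, ch):
    """Goto function: None means fail; state 0 loops on unknown characters."""
    nxt = GOTO[state].get(ch)
    if nxt is None and state == 0:
        return 0
    return nxt


def ac_search(text):
    """Return a list of (end_position, sorted_patterns); positions are 1-based."""
    matches = []
    state = 0
    for i, ch in enumerate(text, start=1):
        while g(state, ch) is None:   # follow failure links until a move exists
            state = FAILURE[state]
        state = g(state, ch)          # take the goto transition
        if state in OUTPUT:           # one or more patterns end at position i
            matches.append((i, sorted(OUTPUT[state])))
    return matches


print(ac_search("ushers"))  # [(4, ['he', 'she']), (6, ['hers'])]
```

The result agrees with the walkthrough: she and he end at position 4, and hers ends at position 6.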

 

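The matching phase of the AC algorithm runs in O(n) time, where n is the length of the text. This bound does not depend on the number or the lengths of the patterns (apart from the time needed to print the matches themselves). Because every character of the text has to be fed to the automaton, the cost is O(n) in the best case as well as in the worst case. Preprocessing takes time proportional to M, the sum of the pattern lengths, so the total including preprocessing is O(M + n).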
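One way to see the linear bound: every text character causes exactly one goto move, which increases the depth of the current state by at most 1, while every failure move decreases the depth by at least 1. The number of failure moves can therefore never exceed the number of goto moves, which is n. The sketch below counts both kinds of moves on the example automaton; the tables are hard-coded from the example, and `count_transitions` is a hypothetical helper name used only for this illustration.

```python
# Goto and failure tables for {he, she, his, hers}, copied from the example.
GOTO = [
    {"h": 1, "s": 3},  # state 0 (all other characters loop back to 0)
    {"e": 2, "i": 6},  # state 1
    {"r": 8},          # state 2
    {"h": 4},          # state 3
    {"e": 5},          # state 4
    {},                # state 5
    {"s": 7},          # state 6
    {},                # state 7
    {"s": 9},          # state 8
    {},                # state 9
]
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}


def count_transitions(text):
    """Return (goto_moves, failure_moves) made while scanning text."""
    state = 0
    goto_moves = 0
    failure_moves = 0
    for ch in text:
        # g(state, ch) = fail exactly when state != 0 and no edge is labelled ch.
        while state != 0 and ch not in GOTO[state]:
            state = FAILURE[state]
            failure_moves += 1
        state = GOTO[state].get(ch, 0)
        goto_moves += 1
    return goto_moves, failure_moves


print(count_transitions("ushers"))  # (6, 1)
```

For ushers the automaton makes six goto moves and a single failure move (from state 5 to state 2).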

2. Constructing the three tables

Now let's see how the three functions above are constructed from the pattern set.

The construction is divided into two phases:

(1) We determine the states and construct the goto function.

(2) We compute the failure function.

The computation of the output function is spread over both phases.

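2.1 Filling in the goto and output tables

We again build the tables step by step for the example pattern set {he, she, his, hers}.

First, we insert the pattern he: 0 -h-> 1 -e-> 2, and output(2) = {he}.

Then we insert she: 0 -s-> 3 -h-> 4 -e-> 5, and output(5) = {she}.

Then we insert his. Because state 0 already moves to state 1 on h, no new state is needed for that h. This is just like building a trie: patterns with a common prefix share a path. Only the new part is added: 1 -i-> 6 -s-> 7, and output(7) = {his}.

Finally we insert hers, which shares the path 0 -h-> 1 -e-> 2 with he. The new part is 2 -r-> 8 -s-> 9, and output(9) = {hers}.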

The algorithm for constructing the goto function is as follows:

Algorithm 2. Construction of the goto function

Input: Set of patterns K = {y1, y2, ..., yk}

Output: Goto function g and a partially computed output function output

/*
We assume output(s) is empty when state s is first created, and g(s, a) = fail if a is
undefined or if g(s, a) has not yet been defined. The procedure enter(y) inserts
into the goto graph a path that spells out y.
*/

 

newstate ← 0
for i ← 1 until k do                    // insert each pattern yi into the automaton
    enter(yi)
for all a such that g(0, a) = fail do
    g(0, a) ← 0

procedure enter(a1 a2 ... am):
{
    state ← 0; j ← 1
    while g(state, aj) ≠ fail do        // follow the existing path as far as possible
        state ← g(state, aj)
        j ← j + 1

    for p ← j until m do                // then extend it with new states
        newstate ← newstate + 1
        g(state, ap) ← newstate
        state ← newstate

    output(state) ← {a1 a2 ... am}      // the state reached at the end of the pattern
}
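Algorithm 2 is essentially trie insertion, which can be sketched in Python as below. This is an illustrative sketch, not the original author's code: `build_goto` is a name chosen here, the goto table is a list of dictionaries indexed by state number, a missing dictionary key plays the role of fail, and the rule g(0, a) = 0 for unknown characters is left to the lookup code instead of being stored.

```python
def build_goto(patterns):
    """Build the goto table (a trie) and the partial output table."""
    goto = [{}]    # goto[state][char] -> next state; state 0 is the start state
    output = {}    # state -> set of patterns that end exactly at that state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:          # no existing path: create a new state
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]            # follow the (old or new) edge
        output.setdefault(state, set()).add(pattern)
    return goto, output


goto, output = build_goto(["he", "she", "his", "hers"])
for state, edges in enumerate(goto):
    print(state, edges, output.get(state, set()))
```

Inserting the patterns in the order he, she, his, hers produces the same state numbers as the example: state 2 ends he, state 5 ends she, state 7 ends his, and state 9 ends hers.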

2.2 The failure and output functions

Construction of the failure function (this part is rather abstract):

Note that the failure function is not defined for state 0. The construction proceeds by depth. First, f(s) = 0 for every state s of depth 1. Then the failure values of all states of depth d are computed from those of the states of depth d-1.

Specifically, to compute the failure values of the states of depth d, we consider each state r of depth d-1:

1. If g(r, a) = fail for every character a, do nothing. (In that case r is a leaf of the trie.)

2. Otherwise, for each character a such that g(r, a) = s, perform the following three steps:

(a) Set state = f(r) // state now stands for the longest proper suffix of r's string that is also a path in the goto graph

(b) Execute state = f(state) zero or more times until g(state, a) ≠ fail (such a state is always found, because g(0, a) ≠ fail for every a) // we must find a state from which a can be read

(c) Set f(s) = g(state, a), that is, f(s) is the state reached from that state on the character a.

Worked example:

First, for the states of depth 1 we set f(1) = f(3) = 0. Then we consider the states 2, 6, 4 of depth 2.

To compute f(2), we set state = f(1) = 0. Because g(0, e) = 0, f(2) = 0.

To compute f(6), we set state = f(1) = 0. Because g(0, i) = 0, f(6) = 0.

To compute f(4), we set state = f(3) = 0. Because g(0, h) = 1, f(4) = 1.

Then we consider the states 8, 7, 5 of depth 3.

To compute f(8), we set state = f(2) = 0. Because g(0, r) = 0, f(8) = 0.

To compute f(7), we set state = f(6) = 0. Because g(0, s) = 3, f(7) = 3.

To compute f(5), we set state = f(4) = 1. Because g(1, e) = 2, f(5) = 2.

Finally, we consider state 9 of depth 4.

To compute f(9), we set state = f(8) = 0. Because g(0, s) = 3, f(9) = 3.

 

The algorithm in full:

Algorithm 3. Construction of the failure function

Input: Goto function g and output function output from Algorithm 2

Output: Failure function f and output function output

queue ← empty
for each a such that g(0, a) = s ≠ 0 do        // this is a breadth-first search (BFS)
    queue ← queue ∪ {s}
    f(s) ← 0

while queue ≠ empty do
    let r be the next state in queue
    queue ← queue - {r}
    for each a such that g(r, a) = s ≠ fail do // r is the head of the queue and can move on a
        queue ← queue ∪ {s}                    // enqueue s
        state ← f(r)                           // the fallback state of r
        while g(state, a) = fail do            // always terminates, because g(0, a) never fails
            state ← f(state)
        f(s) ← g(state, a)                     // the fallback state of s
        output(s) ← output(s) ∪ output(f(s))   // e.g. g(4, e) = 5 and f(5) = 2, so output(5) ← output(5) ∪ output(2) = {she, he}
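The breadth-first construction of Algorithm 3 can be sketched in Python as below. The goto table and the partial output table are hard-coded from the example so that the block is self-contained; `build_failure` is a name chosen here for illustration, and a missing dictionary key again plays the role of fail.

```python
from collections import deque

# Goto table and partial output for {he, she, his, hers}, as produced in 2.1.
GOTO = [
    {"h": 1, "s": 3}, {"e": 2, "i": 6}, {"r": 8}, {"h": 4}, {"e": 5},
    {}, {"s": 7}, {}, {"s": 9}, {},
]
PARTIAL_OUTPUT = {2: {"he"}, 5: {"she"}, 7: {"his"}, 9: {"hers"}}


def build_failure(goto, partial_output):
    """Return (failure, output) computed breadth-first from the goto table."""
    failure = {}
    output = {s: set(p) for s, p in partial_output.items()}
    queue = deque()
    for s in goto[0].values():          # states of depth 1 fall back to state 0
        failure[s] = 0
        queue.append(s)
    while queue:
        r = queue.popleft()
        for a, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            # g(state, a) = fail only if state != 0 and there is no edge on a.
            while state != 0 and a not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(a, 0)
            if failure[s] in output:    # inherit the patterns ending at f(s)
                output.setdefault(s, set()).update(output[failure[s]])
    return failure, output


failure, output = build_failure(GOTO, PARTIAL_OUTPUT)
print(sorted(failure.items()))
print(sorted((s, sorted(p)) for s, p in output.items()))
```

The computed values agree with the worked example, for instance f(4) = 1, f(5) = 2, f(7) = 3, f(9) = 3, and output(5) = {she, he}.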

 

2.3 The output function

The construction of the output function is covered by Algorithms 2 and 3 above.


3. Algorithm Optimization

Improvement 1: Observe that the failure function produced by Algorithm 3 is not optimal.

Consider g(4, e) = 5. Suppose the current state is 4 and the current character is t ≠ e. Because g(4, t) = fail, the automaton moves to state 1 according to f(4) = 1. But we already know that t ≠ e, so there is no point in trying g(1, e) again. With the full pattern set, state 1 also has a transition on i, so the move to state 1 is still needed here. If the pattern his were absent, however, state 1 would have nothing new to offer and the automaton could go from state 4 directly to f(1) = 0.

To avoid such unnecessary state transitions, we can define a new failure function f1, in the same way as the optimized next table of the KMP algorithm:

f1(1) = 0,

for i > 1, if every character a with g(f(i), a) ≠ fail also satisfies g(i, a) ≠ fail, then f1(i) = f1(f(i)),

otherwise, f1(i) = f(i).
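The definition of f1 can be sketched in Python as below. To make the effect visible, this sketch assumes the reduced pattern set {he, she, hers} (his removed), for which state 1 has only the transition on e. The tables are hand-built for that reduced set, so the state numbers differ from the main example (her is state 6 and hers is state 7), and `optimize_failure` is a hypothetical helper name. States are processed in order of increasing depth so that f1(f(i)) is already known, and a fallback to state 0 is kept as is because state 0 has a transition on every character.

```python
# Hand-built tables for the reduced pattern set {he, she, hers} (an assumption
# made for this illustration): 0-h->1-e->2-r->6-s->7 and 0-s->3-h->4-e->5.
GOTO = [
    {"h": 1, "s": 3}, {"e": 2}, {"r": 6}, {"h": 4}, {"e": 5}, {}, {"s": 7}, {},
]
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3}
BY_DEPTH = [1, 3, 2, 4, 5, 6, 7]   # states listed in order of increasing depth


def optimize_failure(goto, failure, order):
    """Return f1, skipping fallback states that cannot read anything new."""
    f1 = {}
    for i in order:
        target = failure[i]
        # Skip target only if every character it can read, state i can read too.
        if target != 0 and goto[target].keys() <= goto[i].keys():
            f1[i] = f1[target]
        else:
            f1[i] = target
    return f1


f1 = optimize_failure(GOTO, FAILURE, BY_DEPTH)
print(f1[4])   # 0: state 4 now falls back to state 0 directly instead of state 1
```

With the full pattern set {he, she, his, hers}, the same procedure leaves f1(4) = 1, because state 1 can read i and state 4 cannot.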

 

Improvement 2: In the goto function, not every state has a transition for every character. Whenever a transition is fail, we have to consult the failure function and retry from another state. Instead, we can use the goto function and the failure function to construct the next-move function of a deterministic finite automaton. In this automaton every state has a transition on every character, so the failure function is no longer needed.

 

The algorithm for constructing the next-move function is as follows:

Algorithm 4. Construction of a deterministic finite automaton

Input: Goto function g and failure function f

Output: Next-move function delta

queue ← empty
for each symbol a do
    delta(0, a) ← g(0, a)
    if g(0, a) ≠ 0 then
        queue ← queue ∪ {g(0, a)}

while queue ≠ empty do
    let r be the next state in queue
    queue ← queue - {r}
    for each symbol a do
        if g(r, a) = s ≠ fail then
            queue ← queue ∪ {s}
            delta(r, a) ← s
        else
            delta(r, a) ← delta(f(r), a)
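Algorithm 4 can be sketched in Python as below. The goto and failure tables are hard-coded from the example, `build_delta` and `dfa_trace` are names chosen here for illustration, and the alphabet is assumed to be just the characters that occur in the patterns; any other character sends every state back to state 0, which the lookup handles with a default.

```python
from collections import deque

# Goto and failure tables for {he, she, his, hers}, copied from the example.
GOTO = [
    {"h": 1, "s": 3}, {"e": 2, "i": 6}, {"r": 8}, {"h": 4}, {"e": 5},
    {}, {"s": 7}, {}, {"s": 9}, {},
]
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}


def build_delta(goto, failure, alphabet):
    """Return delta as a list of dicts with delta[state][char] for all chars."""
    delta = [{} for _ in goto]
    queue = deque()
    for a in alphabet:
        delta[0][a] = goto[0].get(a, 0)     # g(0, a) never fails
        if delta[0][a] != 0:
            queue.append(delta[0][a])
    while queue:
        r = queue.popleft()
        for a in alphabet:
            if a in goto[r]:
                s = goto[r][a]
                queue.append(s)
                delta[r][a] = s
            else:
                # The row of f(r) is complete: f(r) is shallower than r.
                delta[r][a] = delta[failure[r]][a]
    return delta


def dfa_trace(text, delta):
    """Return the list of states visited; exactly one table lookup per character."""
    state = 0
    trace = []
    for ch in text:
        state = delta[state].get(ch, 0)     # characters outside the alphabet lead to 0
        trace.append(state)
    return trace


delta = build_delta(GOTO, FAILURE, "ehirs")
print(dfa_trace("ushers", delta))  # [0, 3, 4, 5, 8, 9]
```

On ushers the automaton goes from state 5 to state 8 in a single step on r, where the goto/failure version first had to fall back to state 2.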

 

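The next-move function delta computed for the example is as follows (reconstructed by applying Algorithm 4 to the tables above, since the original figure is missing):

state      input symbol -> next state
0          h -> 1, s -> 3, . -> 0
1          e -> 2, i -> 6, h -> 1, s -> 3, . -> 0
3, 7, 9    h -> 4, s -> 3, . -> 0
2, 5       r -> 8, h -> 1, s -> 3, . -> 0
6          s -> 7, h -> 1, . -> 0
4          e -> 5, i -> 6, h -> 1, s -> 3, . -> 0
8          s -> 9, h -> 1, . -> 0

'.' denotes any character other than the ones listed explicitly for that state.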

 

Improvement 2 has an advantage and a disadvantage. The advantage is that it reduces the number of state transitions during matching. The disadvantage is that it needs much more storage, because every state now stores a transition for every character.
