- About the AC automaton
- The AC automaton (Aho-Corasick automaton) was developed at Bell Labs in 1975 and is one of the best-known multi-pattern matching algorithms. A typical problem: given n words and an article of m characters, find how many of the words appear in the article. To understand the AC automaton, you first need basic knowledge of the trie (dictionary tree) and of the KMP pattern matching algorithm. The algorithm has three steps: construct the trie, construct the failure pointers, and run the matching process.
- Simply put, an AC automaton is an efficient algorithm for multi-pattern matching (a single main string, multiple pattern strings).
- The construction process of the AC automaton
Three steps are required to use the Aho-Corasick algorithm:
- Build a trie from the keywords
- Add failure pointers to the trie
- Search the text with the resulting automaton
Let's use the following example to walk through how the AC automaton operates.
It is the classic example problem HDU 2222 Keywords Search, with the following test data:
given 5 words, say she shr he her, and then the string yasherhs, how many of the words appear in the string?
- Determining the data structures
First, we need to settle on the data structures the AC automaton needs; they are used throughout the code that follows.
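The structure definition itself is not shown in this write-up, so the following is only a sketch of what the later code appears to assume: a node with 26 child pointers (`next`), a failure pointer (`fail`) and a word counter (`cnt`), plus the globals `root`, `queue`, `head`, `tail` and `s` and the helper `Init`. The names `ALPHABET`, `MAX_NODES`, `MAX_TEXT` and `new_node`, and the array sizes, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET  26        /* assumption: lowercase 'a'..'z' only */
#define MAX_NODES 500001    /* hypothetical upper bound on trie nodes */
#define MAX_TEXT  1000001   /* hypothetical upper bound on text length */

typedef struct Node {
    struct Node *next[ALPHABET]; /* child for each letter, NULL if absent */
    struct Node *fail;           /* failure pointer, filled in step two */
    int cnt;                     /* number of keywords ending at this node */
} Node;

Node *root;                 /* root of the trie */
Node *queue[MAX_NODES];     /* BFS queue used when building failure pointers */
int head, tail;             /* queue indices */
char s[MAX_TEXT];           /* the main string (text) to search */

/* Reset a node: no children, no failure pointer, no word ending here. */
void Init(Node *p)
{
    for (int i = 0; i < ALPHABET; i++)
        p->next[i] = NULL;
    p->fail = NULL;
    p->cnt = 0;
}

/* Hypothetical helper: allocate and initialise one node. */
Node *new_node(void)
{
    Node *p = malloc(sizeof(Node));
    assert(p != NULL);
    Init(p);
    return p;
}
```

With these definitions, `root = new_node();` would create the empty trie before any keyword is inserted.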
Step one: Build the trie
Insert the input keywords into the trie one by one:
    void Build_trie(char *keyword)   // build the trie
    {
        Node *p, *q;
        int i, v;
        int len = strlen(keyword);
        for (i = 0, p = root; i < len; i++) {
            v = keyword[i] - 'a';
            if (p->next[v] == NULL) {
                q = (Node *)malloc(sizeof(Node));
                Init(q);
                p->next[v] = q;      // link the new node
            }
            p = p->next[v];          // move to the next node
        }
        p->cnt++;                    // a keyword ends at this node
    }
After the build is complete, the trie has two branches under root: s, which leads to say, she and shr, and h, which leads to he and her.
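As a quick check of that shape, here is a small self-contained sketch that inserts the five example words and lets you inspect the result. The helpers `new_node`, `insert`, `walk`, `count_nodes` and `build_example` are hypothetical names for this illustration, and the input is assumed to be lowercase only.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
    struct Node *next[26];
    struct Node *fail;
    int cnt;
} Node;

/* Allocate a zeroed node (NULL children and cnt == 0 on mainstream platforms). */
static Node *new_node(void)
{
    Node *p = calloc(1, sizeof(Node));
    assert(p != NULL);
    return p;
}

/* Insert one lowercase keyword below root. */
static void insert(Node *root, const char *keyword)
{
    Node *p = root;
    for (size_t i = 0; keyword[i] != '\0'; i++) {
        int v = keyword[i] - 'a';
        if (p->next[v] == NULL)
            p->next[v] = new_node();
        p = p->next[v];
    }
    p->cnt++;
}

/* Follow `prefix` from root; return the node reached, or NULL if the path is absent. */
static Node *walk(Node *root, const char *prefix)
{
    Node *p = root;
    for (size_t i = 0; p != NULL && prefix[i] != '\0'; i++)
        p = p->next[prefix[i] - 'a'];
    return p;
}

/* Count all nodes in the subtree rooted at p, including p itself. */
static int count_nodes(const Node *p)
{
    int n = 1;
    for (int i = 0; i < 26; i++)
        if (p->next[i] != NULL)
            n += count_nodes(p->next[i]);
    return n;
}

/* Build the trie for the five example words. */
static Node *build_example(void)
{
    static const char *words[] = {"say", "she", "shr", "he", "her"};
    Node *root = new_node();
    for (int i = 0; i < 5; i++)
        insert(root, words[i]);
    return root;
}
```

For the example, `count_nodes(build_example())` gives 10 (root plus nine letter nodes), and `walk(root, "sh")` reaches a node with `cnt == 0` because sh is only a shared prefix, not a word.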
- Building the failure pointers is the key to the AC automaton: without them, the so-called AC automaton is just a trie.
- Failure pointer principle:
- A failure pointer tells the search where to jump when the current character fails to match. The target is the node whose path from root is the longest proper suffix of the currently matched string that is also a prefix of some word, just as the KMP algorithm jumps along its mismatch (next) array when a match fails. After a jump, the string spelled by the new node must be a suffix of the string matched before the jump, and the depth of the new node (the number of matched characters) must be smaller than the depth of the node before the jump; otherwise the new prefix could not be a suffix of the old string. Because every failure pointer leads to a shallower node, we can compute the failure pointers with a BFS over the trie.
- Using the failure pointer:
- Suppose the current pointer p mismatches at character s[m+1], i.e. p->next[s[m+1]] == NULL, so the trie has no path for s[1...m+1]. If the failure pointer of p points to root, no suffix of the current sequence is a prefix of any word. If it does not point to root, some suffix s[i...m] of the current sequence is a prefix of a word: jump to the failure pointer of p and, with s[i...m] as the matched prefix, continue by matching s[m+1].
- For the matched sequence s[1...m], a suffix s[i...m] may itself be a complete word even when s[1...m] is not. The current pointer marks the matching position and must not move, so we use a temporary pointer temp: set temp to the current pointer and follow failure pointers from it, testing whether s[1...m], s[i...m], and so on are words.
- Simply put, a failure pointer lets the search quickly reach every trie node whose string is a suffix of the part of the main string matched so far, so no word ending at the current position is missed and the main string is never rescanned.
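To make the definition concrete, the sketch below computes a failure target by brute force, directly from the rule "longest proper suffix of the matched string that is also a prefix of some word". It is not how the automaton is built in practice (step two does that with a BFS); it only illustrates what the pointer means. The names `is_prefix_of_some_word`, `fail_length` and `EXAMPLE_WORDS` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* The five words of the running example. */
static const char *const EXAMPLE_WORDS[] = {"say", "she", "shr", "he", "her"};

/* Return 1 if the first n characters of t are a prefix of at least one word. */
static int is_prefix_of_some_word(const char *t, size_t n,
                                  const char *const *words, int nwords)
{
    for (int i = 0; i < nwords; i++)
        if (strlen(words[i]) >= n && strncmp(words[i], t, n) == 0)
            return 1;
    return 0;
}

/*
 * Depth of the node that the failure pointer of the node spelling `path`
 * leads to: the length of the longest proper suffix of `path` that is a
 * prefix of some word. 0 means the failure pointer goes to root.
 */
static size_t fail_length(const char *path,
                          const char *const *words, int nwords)
{
    assert(path != NULL);
    size_t len = strlen(path);
    for (size_t start = 1; start < len; start++)   /* longest suffix first */
        if (is_prefix_of_some_word(path + start, len - start, words, nwords))
            return len - start;
    return 0;
}
```

For the example words, `fail_length("she", EXAMPLE_WORDS, 5)` is 2, because the suffix he is itself a prefix of he and her, while `fail_length("shr", EXAMPLE_WORDS, 5)` is 0, because neither hr nor r starts any word.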
Step two: Build the failure pointers
- After constructing the trie, the next task is to construct the failure pointers. The process can be summed up in one sentence: if the letter on the current node is c, walk along the failure pointers starting from its parent until reaching a node that has a child labelled c, then point the current node's failure pointer at that child. If the walk passes root without finding such a child, point the failure pointer at root. To implement this, first add root to the queue (root's failure pointer is itself or NULL), and then, each time a node is processed, add all of its children to the queue.
- Let's trace the construction against the diagram. First, root's fail pointer is set to NULL, root is queued, and the loop begins. Root is popped from the queue. Its children are the nodes h and s; they are on the first layer, so no shorter prefix exists for them and both fail pointers point to root (dashed lines (1) and (2) in the diagram), and both nodes enter the queue. Next, h (the right one) is popped. It has a single child e, so the scan pointer starts at the node that the fail pointer of e's parent h points to, which is root. Since root->next['e'] == NULL and root->fail == NULL, the matching suffix is empty and the fail pointer of e points to root ((3) in the diagram); e then enters the queue. Next, s is popped. Its children are a and h (the left one). For a, the scan pointer starts at the fail pointer of its parent s, which is root; root->next['a'] == NULL and root->fail == NULL, so the fail pointer of a points to root ((4) in the diagram) and a enters the queue. For h, the scan pointer again starts at root, but root->next['h'] != NULL, so the fail pointer of the left h points to the right h ((5) in the diagram), and h enters the queue. Continuing in the same way, every node eventually gets its fail pointer.
Code to build the failure pointers:
    void Build_ac_automation(Node *root)
    {
        head = 0, tail = 0;              // head: write index, tail: read index
        queue[head++] = root;            // root is queued first (root->fail must be NULL)
        while (head != tail) {
            Node *p = NULL;
            Node *temp = queue[tail++];  // pop the front node
            for (int i = 0; i < 26; i++) {
                if (temp->next[i] != NULL) {   // an existing child; temp is its parent
                    if (temp == root) {
                        // first-layer nodes fail to root
                        temp->next[i]->fail = root;
                    } else {
                        // walk the parent's failure chain until a node with a
                        // child for the same letter i is found
                        p = temp->fail;
                        while (p != NULL) {
                            if (p->next[i] != NULL) {
                                temp->next[i]->fail = p->next[i];
                                break;
                            }
                            p = p->fail;
                        }
                        // nothing found all the way past root: fail to root
                        if (p == NULL)
                            temp->next[i]->fail = root;
                    }
                    queue[head++] = temp->next[i];   // queue the child
                }
            }
        }
    }
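The same BFS can be written a little more compactly and checked on the example words. The sketch below is a self-contained restatement under a few assumptions: lowercase input, a locally allocated queue instead of the global one, and hypothetical helper names (`new_node`, `insert`, `build_fail`). It relies on root's failure pointer being NULL, so first-layer nodes fall through to root without a special case.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
    struct Node *next[26];
    struct Node *fail;
    int cnt;
} Node;

static Node *new_node(void)
{
    Node *p = calloc(1, sizeof(Node));
    assert(p != NULL);
    return p;
}

/* Insert a lowercase word and return the node where it ends. */
static Node *insert(Node *root, const char *w)
{
    Node *p = root;
    for (size_t i = 0; w[i] != '\0'; i++) {
        int v = w[i] - 'a';
        if (p->next[v] == NULL)
            p->next[v] = new_node();
        p = p->next[v];
    }
    p->cnt++;
    return p;
}

/* BFS over the trie; max_nodes must be at least the number of trie nodes. */
static void build_fail(Node *root, size_t max_nodes)
{
    Node **queue = malloc(max_nodes * sizeof *queue);
    assert(queue != NULL);
    size_t head = 0, tail = 0;      /* pop at head, push at tail */
    root->fail = NULL;
    queue[tail++] = root;
    while (head != tail) {
        Node *cur = queue[head++];
        for (int i = 0; i < 26; i++) {
            Node *child = cur->next[i];
            if (child == NULL)
                continue;
            /* walk the parent's failure chain looking for a child on letter i */
            Node *p = cur->fail;
            while (p != NULL && p->next[i] == NULL)
                p = p->fail;
            child->fail = (p == NULL) ? root : p->next[i];
            queue[tail++] = child;
        }
    }
    free(queue);
}
```

After inserting say, she, shr, he and her and calling `build_fail(root, 10)`, the node for she fails to the node for he and the left h fails to the right h, while the nodes for say, shr, he and her all fail to root.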
- Why does the method above work? That is, why is the string from root to the jump target guaranteed to be shorter than the currently matched string, and to be the longest prefix that equals a suffix of the currently matched string?
- The failure pointer of a node is built from the failure pointer of its parent. Because the trie merges identical prefixes of all words, a failure pointer can never jump sideways, to another node at the same depth: two distinct nodes at the same depth spell different strings of the same length, so neither can be a suffix of the other (if the strings were equal, the trie would have merged them into one node). A jump can therefore only reach a node shallower than the current one, and since the search starts from the failure pointer of the parent, the string from root to the jump target is guaranteed to be shorter than the currently matched string. For maximality, borrow the idea KMP uses to compute its next array. When building a failure pointer, we visit the jump positions on the parent's failure chain in turn and check whether each has a child for the current character. If one does, the failure pointer is connected to that child. If the chain ends at NULL, no suffix of the current string is a prefix of any word, and the failure pointer goes to root. The first jump position that has such a child gives the maximum length: the failure chain visits candidate positions in order of decreasing depth, so any later position that also has a child for the current character is shallower and spells a shorter string.
- This proves that this method of building failure pointers is correct.
Step three: Match
- Finally, we use the AC automaton to find which words appear in the main string. At each step there are two cases: (1) the current character matches, meaning the current node has an edge for that character; follow the edge to the next node and advance the main-string pointer to the next character; (2) the current character does not match; move to the node that the current node's failure pointer points to and try again, stopping once the pointer reaches root. Repeat these two cases until the end of the main string is reached.
- For example, take the main string yasherhs. For i=0,1 there is no corresponding path in the trie, so nothing happens. For i=2,3,4 the pointer p walks down to the lower-left node e. The count of node e is 1, so count increases by 1, and the count of node e is set to -1 to mark that this word has already been found and must not be counted again. The search then continues at the node that the failure pointer of e points to, the e node on the right, whose count is also 1, and so on until temp points to root and the while loop exits. Over these steps count increases by 2, for the words she and he. When i=5, p has no child r, so the loop that follows failure pointers moves p to its failure node, the e node on the right, and p then advances to the r node. The r node has a count of 1, so count increases by 1 (the word her), and the inner loop runs until temp points to root. For i=6,7 no further word is found, and the matching process ends with count = 3.
- AC automaton time complexity: building the trie and the failure pointers takes time proportional to the total length of the pattern strings, sum(L(Pi)) (times the alphabet size, 26 here). Matching takes O(L(T)) for the main string T, plus the failure-chain visits, which the cnt = -1 marking limits to about one per trie node. Overall this is O(L(T) + sum(L(Pi))).
Match code:
    int query(Node *root)
    {
        // i indexes the main string s; p is the current node in the trie
        int i, v, count = 0;
        Node *p = root;
        int len = strlen(s);
        for (i = 0; i < len; i++) {
            v = s[i] - 'a';
            // follow failure pointers until some node has a child for s[i]
            while (p->next[v] == NULL && p != root)
                p = p->fail;
            p = p->next[v];              // move to the matching child
            if (p == NULL)               // no node has this character: restart at root
                p = root;
            Node *temp = p;
            // walk the failure chain to collect every word ending at position i
            while (temp != root) {
                if (temp->cnt >= 0) {    // node not visited yet
                    count += temp->cnt;  // cnt starts at 0, so only word ends add to count
                    temp->cnt = -1;      // mark as visited
                } else {
                    break;               // already visited, and so is the rest of its chain
                }
                temp = temp->fail;
            }
        }
        return count;
    }
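Putting the three steps together, the following self-contained sketch runs the whole pipeline on the example. It is one possible arrangement under stated assumptions, not the author's full template: input is lowercase only, the text is passed as a parameter instead of the global `s`, and `new_node`, `insert`, `build_fail`, `free_trie` and `count_keywords` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    struct Node *next[26];
    struct Node *fail;
    int cnt;
} Node;

static Node *new_node(void)
{
    Node *p = calloc(1, sizeof(Node));
    assert(p != NULL);
    return p;
}

/* Step one: insert a lowercase keyword into the trie. */
static void insert(Node *root, const char *w)
{
    Node *p = root;
    for (size_t i = 0; w[i] != '\0'; i++) {
        int v = w[i] - 'a';
        if (p->next[v] == NULL)
            p->next[v] = new_node();
        p = p->next[v];
    }
    p->cnt++;
}

/* Step two: BFS that sets every failure pointer. */
static void build_fail(Node *root, size_t max_nodes)
{
    Node **queue = malloc(max_nodes * sizeof *queue);
    assert(queue != NULL);
    size_t head = 0, tail = 0;
    root->fail = NULL;
    queue[tail++] = root;
    while (head != tail) {
        Node *cur = queue[head++];
        for (int i = 0; i < 26; i++) {
            Node *child = cur->next[i];
            if (child == NULL)
                continue;
            Node *p = cur->fail;
            while (p != NULL && p->next[i] == NULL)
                p = p->fail;
            child->fail = (p == NULL) ? root : p->next[i];
            queue[tail++] = child;
        }
    }
    free(queue);
}

/* Step three: count inserted keywords that occur in text (each node counted once). */
static int query(Node *root, const char *text)
{
    int count = 0;
    Node *p = root;
    for (size_t i = 0; text[i] != '\0'; i++) {
        int v = text[i] - 'a';
        while (p != root && p->next[v] == NULL)
            p = p->fail;
        if (p->next[v] != NULL)
            p = p->next[v];
        for (Node *t = p; t != root && t->cnt >= 0; t = t->fail) {
            count += t->cnt;
            t->cnt = -1;                 /* mark as visited */
        }
    }
    return count;
}

static void free_trie(Node *p)
{
    for (int i = 0; i < 26; i++)
        if (p->next[i] != NULL)
            free_trie(p->next[i]);
    free(p);
}

/* Hypothetical wrapper: build the automaton, search once, clean up. */
static int count_keywords(const char *const *words, int nwords, const char *text)
{
    size_t total = 1;                    /* root plus at most one node per letter */
    Node *root = new_node();
    for (int i = 0; i < nwords; i++) {
        insert(root, words[i]);
        total += strlen(words[i]);
    }
    build_fail(root, total);
    int found = query(root, text);
    free_trie(root);
    return found;
}
```

Called with the five example words and the text yasherhs, `count_keywords` returns 3, matching the walkthrough above (she, he, her). Because `query` overwrites `cnt` with -1, an automaton built this way can only be searched once; a reusable version would need a separate visited flag.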
For the full template code of this example, see this blog post: http://blog.csdn.net/liu940204/article/details/51345954
That happily concludes this introduction to the AC automaton for now. To be continued ...
AC Automata Algorithms and templates