AC automaton (Aho-Corasick) string algorithm

Source: Internet
Author: User


Recently I have been studying string algorithms. Brute-force (BF) matching is easy to understand, but it easily exceeds the time limit, so I wanted to learn other string algorithms to improve on it. I recently studied the AC automaton. I have gained some experience with it, but I still find parts of it confusing, and I would appreciate any advice.

I. Principle of the AC automaton:

The Aho-Corasick automaton was developed at Bell Labs in 1975 and is one of the best-known multi-pattern matching algorithms. A common example: given N words and an article of m characters, find out how many of the words appear in the article. To understand the AC automaton, you first need basic knowledge of the dictionary tree (Trie) and the KMP pattern matching algorithm.
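To make the problem statement concrete, here is a minimal brute-force baseline in C++. It is only a sketch: the function name `count_words_naive` is illustrative and does not come from the original code. It scans the text once per word with `std::string::find`, so the work grows with the number of words times the text length, which is the cost the AC automaton is meant to avoid.

```cpp
#include <string>
#include <vector>

// Brute-force baseline: counts how many of the given words occur
// at least once in the text. Each word triggers its own scan of the text.
int count_words_naive(const std::vector<std::string>& words,
                      const std::string& text) {
    int cnt = 0;
    for (const std::string& w : words) {
        if (text.find(w) != std::string::npos) {
            ++cnt;  // w appears somewhere in text
        }
    }
    return cnt;
}
```

For the words he, she, his, hers and the text ushers, this returns 3, because his does not occur.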

II. Steps for implementing the AC automaton algorithm (three steps)

Storage data structure of the AC automaton

const int MAXN = 10000000;
struct node
{
    int count;       // number of words that end at this node (0 if none)
    node *next[26];  // the 26 child pointers of each Trie node
    node *fail;      // failure pointer
};
node *q[MAXN];       // queue used by the BFS that builds the failure pointers
char keyword[55];    // input word (pattern string)
char str[1000010];   // main string to be searched
int head, tail;      // head and tail indices of the queue

 

1. Construct a Trie tree

 

First, we need to build a Trie. This is not an ordinary Trie, though: it has some special properties.

There are three important pointers: p, p->fail, and temp.

1. The pointer p points to the node of the most recently matched character. If p points to root, the character sequence matched so far is empty. (root is the entry of the Trie and carries no character of its own.)

2. The pointer p->fail is the failure pointer of p. It points to another node that carries the same character as p; more precisely, to the node whose path from root is the longest proper suffix of p's path that also exists in the Trie. If there is no such node, it points to root.

3. The pointer temp is a test pointer (my own name for it, chosen to be easy to understand). When a fail pointer is being built, it is used to find the node that matches p's character. During scanning it is the most useful pointer and also the hardest to understand.

A node in the Trie corresponds to a sequence s[1...m], and p points to the character s[m]. If the next character mismatches, that is, p->next[s[m+1]] == NULL, we follow the failure pointer to another node (p->fail), whose sequence is s[i...m]. If the mismatch persists, we keep jumping in the same way until the sequence is empty or a match occurs. During this process the value of p keeps changing, but the character on the node that p points to stays the same. Each jump lands on the longest suffix of the current sequence that is also present in the Trie. In addition, because that sequence runs from root to a node, it may be the prefix of some words.

Consider again what moving p means. If p mismatches at character s[m+1] (that is, p->next[s[m+1]] == NULL), then no word has the prefix s[1...m+1]. If p's failure pointer points to root, no non-empty proper suffix of the current sequence is the prefix of any word. If it does not point to root, the sequence s[i...m] is the prefix of some word, so we jump to p's failure pointer and try to extend the prefix s[i...m] with s[m+1].

For a matched sequence s[1...m], s[i...m] may be a suffix that is itself a word, and s[1...j] may be a prefix that is itself a word, so s[1...m] may contain words. At this point p points to a matched character and must not be moved. So we set temp = p and use temp to test in turn whether s[1...m], s[i...m], and so on are words.
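The claim that p->fail lands on the longest suffix that is still a path in the Trie can be checked by brute force on a small word list. The sketch below is only illustrative: it represents each Trie node by the string spelled on its path from root (an assumption made for readability, not how the article stores nodes), and the names `trie_nodes` and `fail_of` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collect every prefix of every word: each one is a node of the Trie.
// The empty string stands for root.
std::set<std::string> trie_nodes(const std::vector<std::string>& words) {
    std::set<std::string> nodes;
    nodes.insert("");
    for (const std::string& w : words)
        for (std::size_t len = 1; len <= w.size(); ++len)
            nodes.insert(w.substr(0, len));
    return nodes;
}

// Failure target of node s, straight from the definition: the longest
// proper suffix of s that is itself a node. Returns "" (root) if no
// non-empty proper suffix qualifies.
std::string fail_of(const std::set<std::string>& nodes, const std::string& s) {
    assert(nodes.count(s) > 0);  // s must be a node of the Trie
    for (std::size_t i = 1; i < s.size(); ++i) {  // i = start of the suffix
        std::string suf = s.substr(i);
        if (nodes.count(suf) > 0) return suf;     // first hit is the longest
    }
    return "";
}
```

For the words he, she, his, hers, `fail_of` maps "she" to "he" and "hers" to "s", while "h" maps to the empty string, that is, to root.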

[Figure: the constructed Trie]

Implementation Code:

 

void insert(char *word, node *root)
{
    int index, len;
    node *p = root, *newnode;
    len = strlen(word);
    for (int i = 0; i < len; i++)
    {
        index = word[i] - 'a';   // words are assumed to be lowercase letters
        if (!p->next[index])     // this character's node does not exist yet
        {
            // initialize newnode and add it to the Trie
            newnode = (struct node *)malloc(sizeof(struct node));
            for (int j = 0; j < 26; j++)
                newnode->next[j] = 0;
            newnode->count = 0;
            newnode->fail = 0;
            p->next[index] = newnode;
        }
        p = p->next[index];      // move the pointer down one level
    }
    p->count++;                  // mark the end of a word: count + 1
}


 

2. Constructing the failure pointers

The construction of the failure pointers can be summarized in one sentence: let the letter on the current node be x; follow the failure pointers starting from its parent until reaching a node that also has a child with letter x, then point the current node's failure pointer to that child. If we get all the way to root without finding one, point the failure pointer to root.

There are two rules:

  1. The failure pointer of each child of root points to root.

  2. For any other node with character x, start at the node that its parent's failure pointer points to and keep following failure pointers until a node with a child for character x is found; the failure pointer then points to that child. If none is found, it points to root.


     

    Implementation Code:

     

void build_ac_automation(node *root)
{
    head = 0;
    tail = 1;
    q[head] = root;
    node *temp, *p;
    while (head < tail)
    {
        temp = q[head++];
        for (int i = 0; i < 26; i++)
        {
            if (temp->next[i])  // only process children that actually exist
            {
                // the failure pointer of a node directly under root points to root
                if (temp == root)
                    temp->next[i]->fail = root;
                else
                {
                    // Follow the failure pointers, starting from the parent's,
                    // until some node has a next[i] child, then point this
                    // node's failure pointer to that child. If nothing is
                    // found even at root, point the failure pointer to root.
                    // This relies on root->fail being NULL.
                    p = temp->fail;  // temp is the parent of the node
                    while (p)
                    {
                        if (p->next[i])
                        {
                            temp->next[i]->fail = p->next[i];
                            break;
                        }
                        p = p->fail;
                    }
                    if (!p)
                        temp->next[i]->fail = root;
                }
                // the children of every processed node are enqueued,
                // until the queue is empty
                q[tail++] = temp->next[i];
            }
        }
    }
}
       


     


3. Pattern matching process

Starting from the root node, each character read moves us down along the automaton. If the character read has no corresponding branch, we follow the failure path repeatedly. If the failure path leads all the way to the root node and there is still no branch, we skip this character and process the next one. Because the AC automaton always tracks the longest suffix of the text read so far, after each character it also follows the failure path up to the root node to check for words that end there. This detects all the patterns.

Search steps:

1. Start the search from the root node.

2. Take the first character of the keyword to be searched, select the corresponding subtree based on that character, and continue the search in that subtree.

3. In that subtree, take the second character of the keyword and again select the corresponding subtree.

4. Repeat this process for the remaining characters.

5. When all the characters of the keyword have been consumed at some node, read the information attached to that node to complete the search.
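The five search steps above amount to an ordinary Trie lookup. A minimal C++ sketch follows. It stores nodes in a `std::vector` and links them by index instead of using `malloc` and raw pointers, which is a substitution made for brevity, and the names `Trie`, `TrieNode` and `contains` are illustrative rather than taken from the article.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

struct TrieNode {
    int count = 0;               // number of words ending at this node
    std::array<int, 26> next{};  // child indices; 0 means "no child"
};

struct Trie {
    std::vector<TrieNode> nodes;  // nodes[0] is root and is never a child

    Trie() : nodes(1) {}

    void insert(const std::string& word) {
        int p = 0;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');  // assumption: lowercase letters only
            int idx = c - 'a';
            if (nodes[p].next[idx] == 0) {
                nodes[p].next[idx] = static_cast<int>(nodes.size());
                nodes.emplace_back();
            }
            p = nodes[p].next[idx];
        }
        ++nodes[p].count;
    }

    // Steps 1-5: start at root, follow one child per character, and
    // read the information stored on the node reached at the end.
    bool contains(const std::string& word) const {
        int p = 0;
        for (char c : word) {
            if (c < 'a' || c > 'z') return false;
            int child = nodes[p].next[c - 'a'];
            if (child == 0) return false;  // no subtree for this character
            p = child;
        }
        return nodes[p].count > 0;
    }
};
```

After inserting he and hers, `contains("he")` and `contains("hers")` are true, while `contains("her")` is false because her is only a prefix, not an inserted word.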

When the main string is matched against the Trie and the current character cannot be matched at the current node, matching should continue from the node that the current node's failure pointer points to.

There are two cases during matching:

1. The current character matches: there is an edge from the current node for the target character. In this case we simply follow that edge to the next node, and the main string pointer moves on to the next character.

2. The current character does not match: we go to the node that the current node's failure pointer points to and continue matching there. This ends when a match is found or the pointer reaches root.

Repeat these two cases until the main string ends.
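The two cases above, together with the failure-pointer rules, can be put together in a compact form. The following is a sketch under stated assumptions, not the article's template: it uses an index-based `std::vector` of nodes and `std::queue` instead of `malloc` and a global array, it assumes lowercase letters, and it tracks visited nodes in a separate `seen` vector instead of overwriting `count` with -1, so `query` can be called more than once. The name `AhoCorasick` is illustrative.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

struct AhoCorasick {
    struct Node {
        std::array<int, 26> next;  // child indices; -1 means "no child"
        int fail = 0;              // failure link (index of another node)
        int count = 0;             // number of words ending at this node
        Node() { next.fill(-1); }
    };
    std::vector<Node> t;           // t[0] is root

    AhoCorasick() : t(1) {}

    void insert(const std::string& word) {
        int p = 0;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');  // assumption: lowercase letters only
            int idx = c - 'a';
            if (t[p].next[idx] == -1) {
                t[p].next[idx] = static_cast<int>(t.size());
                t.emplace_back();
            }
            p = t[p].next[idx];
        }
        ++t[p].count;
    }

    // BFS over the Trie; each node's link is derived from its parent's link.
    void build() {
        std::queue<int> q;
        q.push(0);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            for (int i = 0; i < 26; ++i) {
                int v = t[u].next[i];
                if (v == -1) continue;
                if (u == 0) {
                    t[v].fail = 0;  // rule 1: children of root
                } else {
                    int p = t[u].fail;  // rule 2: start at the parent's link
                    while (p != 0 && t[p].next[i] == -1) p = t[p].fail;
                    t[v].fail = (t[p].next[i] != -1) ? t[p].next[i] : 0;
                }
                q.push(v);
            }
        }
    }

    // Counts how many inserted words occur in text (each word once).
    int query(const std::string& text) const {
        std::vector<bool> seen(t.size(), false);
        int cnt = 0, p = 0;
        for (char c : text) {
            if (c < 'a' || c > 'z') { p = 0; continue; }  // outside the alphabet
            int idx = c - 'a';
            while (p != 0 && t[p].next[idx] == -1) p = t[p].fail;  // case 2
            if (t[p].next[idx] != -1) p = t[p].next[idx];          // case 1
            // Walk the failure chain to collect every word ending here.
            for (int v = p; v != 0 && !seen[v]; v = t[v].fail) {
                seen[v] = true;
                cnt += t[v].count;
            }
        }
        return cnt;
    }
};
```

With the words she, he, say, shr, her inserted and `build()` called, `query("yasherhs")` returns 3 (she, he and her occur; say and shr do not).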

         

        Implementation Code:

         

int query(node *root)  // similar to the KMP algorithm
{
    // i indexes the main string; p is the current node in the Trie
    int i, cnt = 0, index, len = strlen(str);
    node *p = root;
    for (i = 0; i < len; i++)
    {
        index = str[i] - 'a';  // the main string is assumed to be lowercase letters
        // follow the failure pointers until a node with a child for str[i]
        // is found or root is reached
        while (!p->next[index] && p != root)
        {
            p = p->fail;
        }
        p = p->next[index];  // if found, p points to that node
        if (!p)              // null pointer: the character is not matched
        {
            p = root;        // return to root; the next search starts from root
        }
        node *temp = p;
        // after matching this node, follow the failure pointers to check
        // whether other words end here as well
        while (temp != root)
        {
            if (temp->count >= 0)  // the node has not been visited yet
            {
                // add the number of words ending here to cnt; a node that
                // does not end a word has count 0 and adds nothing
                cnt += temp->count;
                temp->count = -1;  // mark as visited
            }
            else
                break;             // already visited: leave the loop
            temp = temp->fail;     // continue with the next candidate node
        }
    }
    return cnt;
}

         

         

III. AC automaton template

// The header names were lost in the source; these are the three that the code needs.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define kind 26
const int MAXN = 10000000;
struct node
{
    int count;       // number of words that end at this node (0 if none)
    node *next[26];  // the 26 child pointers of each Trie node
    node *fail;      // failure pointer
};
node *q[MAXN];       // queue used by the BFS that builds the failure pointers
char keyword[55];    // input word (pattern string)
char str[1000010];   // main string to be searched
int head, tail;      // head and tail indices of the queue
node *root;

void insert(char *word, node *root)
{
    int index, len;
    node *p = root, *newnode;
    len = strlen(word);
    for (int i = 0; i < len; i++)
    {
        index = word[i] - 'a';
        if (!p->next[index])  // this character's node does not exist yet
        {
            // initialize newnode and add it to the Trie
            newnode = (struct node *)malloc(sizeof(struct node));
            for (int j = 0; j < 26; j++)
                newnode->next[j] = 0;
            newnode->count = 0;
            newnode->fail = 0;
            p->next[index] = newnode;
        }
        p = p->next[index];   // move the pointer down one level
    }
    p->count++;               // mark the end of a word: count + 1
}

void build_ac_automation(node *root)
{
    head = 0;
    tail = 1;
    q[head] = root;
    node *temp, *p;
    while (head < tail)
    {
        temp = q[head++];
        for (int i = 0; i < 26; i++)
        {
            if (temp->next[i])  // only process children that actually exist
            {
                if (temp == root)  // children of root fail to root
                    temp->next[i]->fail = root;
                else
                {
                    // follow the failure pointers from the parent's until a
                    // node with a next[i] child is found; otherwise use root
                    p = temp->fail;  // temp is the parent of the node
                    while (p)
                    {
                        if (p->next[i])
                        {
                            temp->next[i]->fail = p->next[i];
                            break;
                        }
                        p = p->fail;
                    }
                    if (!p)
                        temp->next[i]->fail = root;
                }
                q[tail++] = temp->next[i];  // enqueue the child
            }
        }
    }
}

int query(node *root)  // similar to the KMP algorithm
{
    // i indexes the main string; p is the current node in the Trie
    int i, cnt = 0, index, len = strlen(str);
    node *p = root;
    for (i = 0; i < len; i++)
    {
        index = str[i] - 'a';
        while (!p->next[index] && p != root)  // follow the failure pointers
        {
            p = p->fail;
        }
        p = p->next[index];
        if (!p)        // not matched: restart from root
        {
            p = root;
        }
        node *temp = p;
        while (temp != root)  // collect the words that end at this position
        {
            if (temp->count >= 0)  // not visited yet
            {
                cnt += temp->count;
                temp->count = -1;  // mark as visited
            }
            else
                break;             // already visited
            temp = temp->fail;
        }
    }
    return cnt;
}

int main()
{
    int i, t, n, ans;
    scanf("%d", &t);
    while (t--)
    {
        root = (struct node *)malloc(sizeof(struct node));
        for (int j = 0; j < 26; j++)
            root->next[j] = 0;
        root->fail = 0;
        root->count = 0;
        scanf("%d", &n);
        getchar();
        // The source text breaks off inside this loop. The rest of main() is
        // a reconstruction based on the variables declared above
        // (keyword, str, ans) and may differ from the original.
        for (i = 0; i < n; i++)
        {
            scanf("%54s", keyword);
            insert(keyword, root);
        }
        build_ac_automation(root);
        scanf("%1000009s", str);
        ans = query(root);
        printf("%d\n", ans);
    }
    return 0;
}
