[Pattern matching] Multi-pattern matching (a prefix tree implementation of the AC algorithm)


From: http://blog.csdn.net/sun2043430/article/details/8832496

Preface

Code for this article:

http://download.csdn.net/detail/sun2043430/5286986

Steps for implementing the AC algorithm with a prefix tree

  1. Create a prefix tree
  2. Set the failure node for each node
  3. Collect all matching pattern strings for each node
  4. Search the target string for matches

For more on how string matching with an AC automaton works, see Wikipedia:

http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm


In practice, we build a prefix tree from the pattern strings, then read the target string one character at a time, walking from the root of the tree toward the leaf nodes as characters match. If a mismatch occurs along the way, we jump to the node designated for that mismatch. Whenever a pattern string is matched, we report it. This gives only a rough overall impression, so let's work through a simple example step by step.


Step 1: Create a Prefix Tree

For example, suppose we have five pattern strings to search for:

"uuidi"
"ui"
"idi"
"idk"
"di"

The prefix tree built from them is as follows:


(Figure 1)

The root node is empty and contains no character; its ID is 0. We read the pattern strings in order, add each character of each pattern string to the tree, and number the new nodes sequentially. The numbers are shown in red to the right of each node. Nodes with a yellow background (nodes 5, 6, 9, 10, and 12) mark the end of a pattern string (hereinafter called end nodes). If two pattern strings have a common prefix, they share the nodes of that prefix. For example, "uuidi" and "ui" share the prefix "u", and "idi" and "idk" share the prefix "id".

Starting from the root, the children of a node show which characters can be matched at that node. For example, the root has three children (nodes 1, 7, and 11), so the root can match three characters: u, i, and d. If the current character of the target string is one of these three, we move to the corresponding child, and the next match continues from there. Matching proceeds this way until a mismatch occurs. On a mismatch, we do not jump back to the root and start over (that would force backtracking in the target string); instead, as in the KMP algorithm, we jump to a failure node. For example, if the target string is "uuidk", we pass through nodes 1, 2, 3, and 4 of the tree above (matching "uuid"). The only child of node 4 is node 5, which holds i and cannot match the k in the target string. In this case we should jump from node 4 to node 8. We call node 8 the failure node of node 4.

Every node in the tree needs a failure node that says where to jump to continue matching when a mismatch occurs.
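The construction described in this step can be sketched roughly as follows. This is a minimal illustration, not the author's code: the names TrieNode and buildTrie are hypothetical, and storing children in a std::map is just one possible choice.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical node layout, for illustration only.
struct TrieNode {
    std::map<char, int> children;  // character -> ID of the child node
    bool isEnd = false;            // true if a pattern string ends at this node
};

// Builds the prefix tree. Node IDs are assigned in creation order; node 0 is
// the empty root.
std::vector<TrieNode> buildTrie(const std::vector<std::string>& patterns) {
    std::vector<TrieNode> nodes(1);  // start with the root only
    for (const std::string& p : patterns) {
        int cur = 0;
        for (char c : p) {
            if (!nodes[cur].children.count(c)) {
                // No shared prefix node yet: create one with the next ID.
                nodes.emplace_back();
                nodes[cur].children[c] = static_cast<int>(nodes.size()) - 1;
            }
            cur = nodes[cur].children[c];
        }
        nodes[cur].isEnd = true;  // mark the end node of this pattern
    }
    return nodes;
}
```

Calling buildTrie with "uuidi", "ui", "idi", "idk", "di" in that order should reproduce the numbering used in this article: 13 nodes including the root, with nodes 5, 6, 9, 10, and 12 as end nodes.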


Step 2: Set the failure node for each node

As the example above shows, when node 4 fails to match we should jump to node 8 and continue from there. This is the same principle as in the KMP algorithm, where a mismatch causes a jump to the position recorded in the next array, so that the target string is never backtracked. In KMP, the next array can be computed iteratively. With multiple pattern strings, however, the failure node has to be found by searching from the root of the tree. Still, the failure node plays the same role as a next value in KMP, and the search principle is the same: find the longest prefix of any pattern string that matches a suffix of the string matched so far at the mismatch position.

Take node 4 above as an example. The string that leads to node 4 is "uuid", and its proper suffixes are "uid", "id", and "d". We look for the longest of these suffixes that is also a prefix in the tree. First we try "uid", starting from the root (because we are looking for a prefix); no path matches. Next we try "id", which matches along nodes 7 and 8. Therefore we set the failure node of node 4 to node 8. If no suffix matches any prefix, the failure node is the root.
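The suffix search just described can be written down directly on strings, without a tree. The following is only a sketch of the idea: longestSuffixPrefix is a hypothetical helper, and it tries the suffixes from longest to shortest exactly as in the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the longest proper suffix of `path` (the string leading to a node)
// that is also a prefix of some pattern string. An empty result means that
// the failure node is the root.
std::string longestSuffixPrefix(const std::string& path,
                                const std::vector<std::string>& patterns) {
    // start = 1 skips the whole string, so only proper suffixes are tried,
    // longest first.
    for (std::size_t start = 1; start < path.size(); ++start) {
        const std::string suffix = path.substr(start);
        for (const std::string& p : patterns) {
            // True when p begins with suffix.
            if (p.compare(0, suffix.size(), suffix) == 0) return suffix;
        }
    }
    return "";
}
```

For the path "uuid" and the five pattern strings of this article, the function tries "uid" (no match) and then returns "id", which corresponds to node 8.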

Note 1: Readers who have studied the KMP algorithm may notice that, when looking for the longest suffix, KMP also checks whether the following characters are the same: if the children of the jump target are the same as the children of the current node, the jump is bound to mismatch again. That is true, but we still keep things simple here: we only check whether the prefix and suffix match, regardless of whether the children are the same. The reason is explained later.

For the 12 nodes, the failure node corresponding to each node is as follows:


(Figure 2)

In Figure 2, each node is connected to its failure node by a green dotted line (green suggests a fast-track shortcut, and the dotted line suggests that the link is hidden and not easy to spot).

The failure node of a node K is never deeper than K itself, because the path from the root to the failure node spells a prefix that equals a proper suffix of the path to K, and a proper suffix is shorter than the whole path.

In addition, each node K has exactly one failure node. By definition, the failure node is the longest prefix that matches a suffix at the mismatch position, and there can be only one longest match. Two different candidates of the same length cannot exist: both would be prefixes equal to the same suffix, so by the way the prefix tree is built their branches would coincide.
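Because a failure node is always shallower than the node it belongs to, the failure nodes can be filled in level by level, reusing the failure node of the parent. The sketch below takes that breadth-first approach, which is an assumption on my part: the code accompanying this article is described as recursive, and the names AcNode and buildWithFailure are hypothetical.

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical node layout, for illustration only.
struct AcNode {
    std::map<char, int> children;  // character -> ID of the child node
    int fail = 0;                  // ID of the failure node (0 = root)
};

std::vector<AcNode> buildWithFailure(const std::vector<std::string>& patterns) {
    // Build the prefix tree; IDs follow creation order, root = 0.
    std::vector<AcNode> nodes(1);
    for (const std::string& p : patterns) {
        int cur = 0;
        for (char c : p) {
            if (!nodes[cur].children.count(c)) {
                nodes.emplace_back();
                nodes[cur].children[c] = static_cast<int>(nodes.size()) - 1;
            }
            cur = nodes[cur].children[c];
        }
    }
    // Children of the root keep fail = 0; deeper nodes are handled in
    // breadth-first order so that every shallower failure node is ready.
    std::queue<int> pending;
    for (const auto& kv : nodes[0].children) pending.push(kv.second);
    while (!pending.empty()) {
        const int cur = pending.front();
        pending.pop();
        for (const auto& kv : nodes[cur].children) {
            // Follow the parent's failure chain until a node can extend by
            // the same character, or the root is reached.
            int f = nodes[cur].fail;
            while (f != 0 && !nodes[f].children.count(kv.first)) f = nodes[f].fail;
            auto it = nodes[f].children.find(kv.first);
            nodes[kv.second].fail = (it != nodes[f].children.end()) ? it->second : 0;
            pending.push(kv.second);
        }
    }
    return nodes;
}
```

For the five pattern strings of this article, this yields the links discussed in the text, for example node 4 to node 8, node 3 to node 6, and the chain 5, 9, 12, 7.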

With the prefix tree built and every failure node set, one important task remains. Look at the prefix tree above. When we arrive at node 3, the string "ui" spelled by nodes 2 and 3 already matches a pattern string, but node 3 is not an end node, so we have no way of reporting to the caller that a pattern string has been matched. Likewise, when we arrive at node 5, we have matched not only "uuidi" but also "idi" and "di". To solve this, we need to collect the matching pattern strings for every node.


Step 3: Collect all matching pattern strings for each node

Collecting all matching pattern strings for each node is actually easy. Look at Figure 2. At node 3 we should report the pattern string "ui", and we can see that the failure node of node 3 is node 6. So the matching pattern strings of a node can be obtained by following its failure node: if the failure node of node K is an end node, arriving at K means a pattern string has been matched. Now look at node 5. Node 5 is itself an end node, so it has its own pattern string. Its failure node is node 9, which is also an end node, so besides its own pattern string ("uuidi"), node 5 also matches the pattern string of node 9 ("idi"). The failure node of node 9 is node 12, again an end node, so node 5 also matches the pattern string of node 12 ("di"). We continue in this way until the failure node is the root; every end node encountered along the way is a pattern string that can be matched.
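The walk along the failure chain can be sketched as below. To keep the example short, the failure nodes and end-node pattern strings of the example tree are written out as plain tables taken from the discussion in this article; collectMatches, kFail, and kEndPattern are hypothetical names.

```cpp
#include <string>
#include <vector>

// fail[k] is the failure node of node k; endPattern[k] is the pattern string
// that ends at node k, or "" if k is not an end node.
std::vector<std::string> collectMatches(int node,
                                        const std::vector<int>& fail,
                                        const std::vector<std::string>& endPattern) {
    std::vector<std::string> out;
    // Visit the node itself, then its failure node, and so on until the root.
    for (int k = node; k != 0; k = fail[k]) {
        if (!endPattern[k].empty()) out.push_back(endPattern[k]);
    }
    return out;
}

// Tables for the example tree (index = node ID, node 0 = root).
const std::vector<int> kFail = {0, 0, 1, 6, 8, 9, 7, 0, 11, 12, 0, 0, 7};
const std::vector<std::string> kEndPattern = {
    "", "", "", "", "", "uuidi", "ui", "", "", "idi", "idk", "", "di"};
```

With these tables, collectMatches(5, kFail, kEndPattern) visits nodes 5, 9, 12, and 7 and returns "uuidi", "idi", "di", while collectMatches(3, kFail, kEndPattern) returns only "ui".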

In the code, I use a std::vector container to store all matching pattern strings of a node. We can now also answer the question raised in Note 1: why don't we check whether a node and its failure node have the same children? Take node 8 in Figure 2. We computed its failure node to be node 11. Node 8 has two children, nodes 9 and 10, so a mismatch at node 8 means that the current character of the target string is neither i (node 9) nor k (node 10). The only child of node 11, node 12, is i, so after jumping to node 11 through the failure node the match is bound to fail again. We still set the failure node of node 8 to node 11, because skipping a node such as node 11 could make us miss matching pattern strings. For example, the failure node of node 5 is node 9, the failure node of node 9 is node 12, and the failure node of node 12 is node 7. If we set the failure node of node 5 directly to node 7 on the grounds that node 5 has no children, we would miss nodes 9 and 12 when collecting the matching pattern strings.

Consider a more extreme case, the pattern string set "aaaa", "aaa", "aa", "a". If we required that the children of node K's failure node not all be contained in the children of K (the counterpart of the KMP refinement that compares the characters following the prefix and the suffix when computing the next array), some matches would go unreported when searching the target string "aaaaaaaaaaaaa".

In the node listing, depth is the depth of a node in the tree, and the node with id = 0 is the root. children lists the child nodes of the current node, and match pattern lists the pattern strings that can be matched at that node.

Where:

Node 3 match pattern: "ui"
Node 5 match pattern: "uuidi" "idi" "di"
Node 6 match pattern: "ui"
Node 9 match pattern: "idi" "di"
Node 10 match pattern: "idk"
Node 12 match pattern: "di"


Step 4: Search and match the target string

Once the three steps above are done, we can search the target string. This is a simple linear scan from start to end, with no backtracking in the target string.

During the search we keep track of the current node in the tree (curnode). Initially, the current node is the root.

Start with the first character of the target string and try to match it against the children of the root. If it matches none of them, advance to the next character of the target string and again look for a match among the children of the root.

If a matching child is found, advance one character in the target string and set curnode to the matched child. If a later character does not match, jump to the failure node of curnode and continue matching from there.

Every time we move to a child node in the tree, check that node's matching pattern strings. If there are any, report them to the caller.
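Putting the four steps together, a compact matcher might look like the sketch below. It is not the code that accompanies this article: the class name Matcher and its members are hypothetical, the failure nodes are set breadth-first rather than recursively, and each node copies the match list of its failure node instead of walking the chain at search time.

```cpp
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

class Matcher {
public:
    explicit Matcher(const std::vector<std::string>& patterns) : nodes_(1) {
        // Step 1: build the prefix tree (node 0 is the root).
        for (const std::string& p : patterns) {
            int cur = 0;
            for (char c : p) {
                if (!nodes_[cur].children.count(c)) {
                    nodes_.emplace_back();
                    nodes_[cur].children[c] = static_cast<int>(nodes_.size()) - 1;
                }
                cur = nodes_[cur].children[c];
            }
            nodes_[cur].matches.push_back(p);
        }
        // Steps 2 and 3, level by level: when a node is taken from the queue,
        // its failure node is shallower and already has its full match list.
        std::queue<int> pending;
        for (const auto& kv : nodes_[0].children) pending.push(kv.second);
        while (!pending.empty()) {
            const int cur = pending.front();
            pending.pop();
            const std::vector<std::string>& inherited = nodes_[nodes_[cur].fail].matches;
            nodes_[cur].matches.insert(nodes_[cur].matches.end(),
                                       inherited.begin(), inherited.end());
            for (const auto& kv : nodes_[cur].children) {
                int f = nodes_[cur].fail;
                while (f != 0 && !nodes_[f].children.count(kv.first)) f = nodes_[f].fail;
                auto it = nodes_[f].children.find(kv.first);
                nodes_[kv.second].fail = (it != nodes_[f].children.end()) ? it->second : 0;
                pending.push(kv.second);
            }
        }
    }

    // Step 4: one linear scan. Returns (index of the last matched character,
    // pattern string) pairs in the order they are found.
    std::vector<std::pair<int, std::string>> search(const std::string& text) const {
        std::vector<std::pair<int, std::string>> hits;
        int cur = 0;
        for (int i = 0; i < static_cast<int>(text.size()); ++i) {
            const char c = text[i];
            // On a mismatch, follow failure nodes instead of backtracking.
            while (cur != 0 && !nodes_[cur].children.count(c)) cur = nodes_[cur].fail;
            auto it = nodes_[cur].children.find(c);
            if (it != nodes_[cur].children.end()) cur = it->second;
            for (const std::string& p : nodes_[cur].matches) hits.emplace_back(i, p);
        }
        return hits;
    }

private:
    struct Node {
        std::map<char, int> children;      // character -> ID of the child node
        int fail = 0;                      // ID of the failure node
        std::vector<std::string> matches;  // pattern strings reported on arrival
    };
    std::vector<Node> nodes_;
};
```

Copying the match lists at build time trades some memory for a simpler search loop; walking the failure chain during the search would work as well.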


Code Description

There are two implementations. One is linked from the Wikipedia entry "Aho-Corasick string matching algorithm":

http://sourceforge.net/projects/multifast/

It is written in C in a procedural style and contains extensive English comments. Without understanding the principle first, the code is hard to follow.


The other is C++ code that I wrote, wrapped in a class so that it is easy to call. Admittedly, I did not add many comments, so readers may need to work through some details on their own. The examples and figures in this article all come from my implementation.

The call that builds the tree:

  bool ctrie::Create(const my_pattern pattern[], int ncount)

Inside this function, ctrie::setfailure() is called to set the failure node of every node, and ctrie::setmatchpattern(node *pnode) is called to set the matching pattern strings of every node.

Because the tree is a multi-way tree, both functions recurse from the root down to the leaf nodes.

Once the prefix tree is built, we can search a target string for the pattern strings. For example, searching the target string "Hello uuididkidid" for the five pattern strings shown in Figure 1 gives the following result:

Index (tens):  0         1         2         3
Index (ones):  0123456789012345678901234567890123456789
Target string: Hello uuididkidid
Match pattern at 8:
"ui"
Match pattern at 10:
"uuidi"
"idi"
"di"
Match pattern at 12:
"idk"
Match pattern at 15:
"idi"
"di"

