Principle and implementation of the AC (Aho-Corasick) automaton, a multi-pattern string matching algorithm


Introduction

This article presents the blogger's own understanding of the AC automaton, explained mainly through examples with accompanying figures. The code in the implementation section is commented in detail, which will hopefully make it easier to follow. The AC automaton is mainly used for multi-pattern string matching and is essentially an extension of the KMP algorithm to a tree. The article first describes how the AC automaton works and then, on that basis, implements a simple AC automaton in Java.

Discussion is welcome; if you find errors, please point them out.

If you reprint this article, please cite the source: http://www.cnblogs.com/nullzx/

1. Application scenario: multi-pattern string matching

Consider the following problem: given a text string, we want to find the number and positions of the occurrences of several target strings target1, target2, ... in it. For example, find all positions of the target string set {"nihao", "hao", "hs", "hsr"} in the text "sdmfhsgnshejfgnihaofhsrnihao". The straightforward approach is to search the text for each target string separately and record every occurrence. This solves the problem, but it is inefficient when the text is long and there are many target strings. To improve efficiency, the well-known multi-pattern string matching algorithm, the AC (Aho-Corasick) automaton, was invented at Bell Labs in 1975. The AC automaton is built on a trie (also called a dictionary tree) and borrows the core idea of the KMP pattern matching algorithm. In fact, you can think of the KMP algorithm as an AC automaton in which every node has only one child.
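For comparison, the straightforward approach described above can be sketched as follows. This is only a minimal baseline, not part of the original article: the class name `NaiveMultiSearch` and the method `findAll` are hypothetical, and empty target strings are simply skipped by assumption. It scans the text once per target with `String.indexOf`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not from the article): scans the text once per target. */
class NaiveMultiSearch {
    static Map<String, List<Integer>> findAll(String text, List<String> targets) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String target : targets) {
            List<Integer> positions = new ArrayList<>();
            // Assumption: an empty target is skipped and gets an empty position list.
            int from = target.isEmpty() ? -1 : text.indexOf(target);
            // indexOf returns -1 when there is no further occurrence.
            while (from >= 0) {
                positions.add(from);
                // Advance by one so overlapping occurrences are also found.
                from = text.indexOf(target, from + 1);
            }
            result.put(target, positions);
        }
        return result;
    }
}
```

For the example above, `findAll` reports "nihao" at indices 14 and 23, "hao" at 16 and 25, "hs" at 4 and 20, and "hsr" at 20. The text is traversed once for every target, which is the cost the AC automaton avoids.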

2. The AC automaton and how it works

2.1 A first look at the AC automaton

The AC automaton is built on a trie. Unlike a plain trie, each node has, besides its pointers (or references) to its children, a fail pointer. The fail pointer gives the state (node) the automaton should move to when the input character matches none of the children of the current node (note: the children, not the node itself). The role of the fail pointer is analogous to that of the next array in the KMP algorithm.

Let's look at an AC automaton built from the target string set {abd, abdk, abchijn, chnit, ijabdf, ijaij}.

The figure shows the constructed AC automaton. The root node stores no character, and the fail pointer of the root is null. Dashed lines show where the fail pointers point, and every node that represents the last character of a target string is drawn as a red circle; we call such a node the endpoint of that string. Every node actually has a fail pointer, but for readability this article follows the convention that fail dashed lines pointing to the root are not drawn.
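The node layout just described can be sketched as below. This is an illustrative sketch, not the article's implementation: the names `TrieNode`, `TrieSketch`, `build` and `countDescendants` are hypothetical, and a `HashMap` is assumed for the child references. Only the trie part is built here; the fail references are left unset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical node: child references, a fail reference and an endpoint marker. */
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    TrieNode fail;   // stays null for the root; not filled in by this sketch
    String word;     // non-null only if this node is the endpoint of a target string
}

class TrieSketch {
    /** Inserts every target into a fresh trie and returns the root. */
    static TrieNode build(List<String> targets) {
        TrieNode root = new TrieNode();   // the root stores no character
        for (String target : targets) {
            TrieNode curr = root;
            for (char ch : target.toCharArray()) {
                // Reuse the child for ch if it exists, otherwise create it.
                curr = curr.children.computeIfAbsent(ch, k -> new TrieNode());
            }
            curr.word = target;           // mark the endpoint of this target
        }
        return root;
    }

    /** Counts the nodes below the given node (the node itself is not counted). */
    static int countDescendants(TrieNode node) {
        int count = 0;
        for (TrieNode child : node.children.values()) {
            count += 1 + countDescendants(child);
        }
        return count;
    }
}
```

For the target set {abd, abdk, abchijn, chnit, ijabdf, ijaij} the root gets three children (a, c and i), and shared prefixes such as "ab" and "ija" are stored only once.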

From this AC automaton we can see an important property: the fail pointer of each node represents the longest common part between all suffixes of the character sequence from the root to that node and all prefixes in the entire target string set (that is, the entire trie). The sequence itself is not counted as one of its suffixes here.

For example, take the character sequence "ijabd" from the root to the node 'd' of the target string "ijabdf". Among all suffixes of "ijabd" and all prefixes in the target string set {abd, abdk, abchijn, chnit, ijabdf, ijaij}, the longest common part is "abd". So the fail pointer of this d node (the d in the string "ijabdf") points to the node for the last character of the sequence "abd".
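The property can be checked directly from its definition with a brute-force helper. This is only a sketch for illustration (the class `FailByDefinition` and the method `failTarget` are hypothetical names); the automaton itself does not compute fail pointers this way.

```java
import java.util.List;

class FailByDefinition {
    /**
     * Returns the longest proper suffix of path that is a prefix of some target,
     * or "" when no such suffix exists (the fail pointer then leads to the root).
     */
    static String failTarget(String path, List<String> targets) {
        // Start at 1 so the whole path itself is never a candidate.
        for (int start = 1; start < path.length(); start++) {
            String suffix = path.substring(start);
            for (String target : targets) {
                if (target.startsWith(suffix)) {
                    return suffix; // suffixes are tried longest first
                }
            }
        }
        return "";
    }
}
```

With the target set above, `failTarget("ijabd", targets)` returns "abd", which agrees with the example: the fail pointer of the d in "ijabdf" leads to the d of "abd".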

2.2 How the AC automaton runs

1) Point the current-node pointer at the root of the AC automaton, i.e. curr = root.

2) Read the next character from the text string.

3) Look among the children of the current node for one that matches the character.

If one is found: point curr at that child. Then check whether the node curr now points to, or the node its fail pointer points to, is the endpoint of a target string; if so, record the start index of the match in the result collection for that string (start index = current index - string length + 1). Continue with step 2.

If none is found: go to step 4.

4) If curr.fail == null (curr is the root, so no target string has a prefix that ends at the current input character; this amounts to resetting the state machine), set curr = root and go to step 2;

otherwise, point curr at the node curr.fail points to and go to step 3.
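The steps above can be sketched as a short search loop. This is a sketch under stated assumptions, not the article's implementation: the names `AcSearchSketch`, `Node`, `build` and `find` are hypothetical, children are kept in a `HashMap`, and `build` is a compact helper that prepares the trie and fail pointers. One deliberate difference from step 3: when reporting matches, the sketch walks the whole chain of fail pointers rather than only curr and curr.fail, so that targets nested more than one level deep (for example "d" inside "cd" inside "bcd") are reported too.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class AcSearchSketch {
    static class Node {
        final Map<Character, Node> children = new HashMap<>();
        Node fail;    // null only for the root
        String word;  // non-null if this node ends a target string
    }

    /** Builds the trie and its fail pointers breadth-first; kept brief here. */
    static Node build(List<String> targets) {
        Node root = new Node();
        for (String t : targets) {
            Node n = root;
            for (char ch : t.toCharArray()) {
                n = n.children.computeIfAbsent(ch, k -> new Node());
            }
            n.word = t;
        }
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node p = queue.remove();
            for (Map.Entry<Character, Node> e : p.children.entrySet()) {
                Node f = p.fail;
                // Climb until a node has a child for this character, or we pass the root.
                while (f != null && !f.children.containsKey(e.getKey())) {
                    f = f.fail;
                }
                e.getValue().fail = (f == null) ? root : f.children.get(e.getKey());
                queue.add(e.getValue());
            }
        }
        return root;
    }

    /** Steps 1 to 4: returns, for each target, the start indices of its occurrences. */
    static Map<String, List<Integer>> find(Node root, List<String> targets, String text) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String t : targets) {
            result.put(t, new ArrayList<>());
        }
        Node curr = root;                                    // step 1
        int i = 0;
        while (i < text.length()) {
            Node child = curr.children.get(text.charAt(i));  // steps 2 and 3
            if (child != null) {
                curr = child;
                // Report every endpoint on the fail chain (wider than step 3, see above).
                for (Node n = curr; n != root; n = n.fail) {
                    if (n.word != null) {
                        result.get(n.word).add(i - n.word.length() + 1);
                    }
                }
                i++;
            } else if (curr.fail == null) {
                i++;                                         // step 4 at the root: skip this character
            } else {
                curr = curr.fail;                            // step 4: retry the same character
            }
        }
        return result;
    }
}
```

Calling `find(build(targets), targets, "abchnijabdfk")` with the target set from section 2.1 records "abd" at index 7 and "ijabdf" at index 5, and leaves the other lists empty.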

Now let's work through a concrete example. Initially the current node is the root, and we assume the text string is text = "abchnijabdfk".

The solid curve in the figure traces how the current-node pointer moves during the whole search, and the label next to each node is the text character read at that node. For example, when the current pointer is at the root and the character 'a' is read, the pointer moves to node a; then the character 'b' is read and the automaton moves to node b; and so on. In the figure, the final state of the AC automaton happens to be back at the root.

Note the moment when the pointer is at node b (the curve passes through two b nodes; we mean the second one, the 'b' of the target string "ijabdf") and the character at index 9 of the text ('d') is read. Node b has a child that matches the input character d (in fact its only child), so the current-node pointer moves to that node d. The fail pointer of this node d points to the endpoint of the string "abd" (drawn as a red circle), so we have found the target string "abd" once. The dashed line in the figure marks this check; the state itself does not move to the d node of "abd".

After all characters of the text have been read, we have found the target string "abd" once, starting at index 7 of the text, and the target string "ijabdf" once, starting at index 5.

3. Constructing the AC automaton: method and principle

3.1 Basic construction method

First we insert all target strings into the trie. Then we traverse the trie breadth-first and, for every child of every node, find the node its fail pointer should point to.

Determining where a fail pointer points works the same way as building the next array in the KMP algorithm. The procedure is as follows.

1) Set the fail pointer of every child of the root to the root, then enqueue those children in order.

2) If the queue is not empty:

2.1) Dequeue a node and call it curr. Let failTo be the node that curr's fail pointer points to, i.e. failTo = curr.fail.

2.2) A. For a child curr.child[i], check whether failTo is non-null and has a child for the same character, i.e. whether failTo.child[i] exists.

It exists: curr.child[i].fail = failTo.child[i].

It does not exist: check whether failTo == null.

Yes: curr.child[i].fail = root.

No: set failTo = failTo.fail and repeat 2.2) A.

B. Enqueue curr.child[i]. When every child of curr has been handled, go back to step 2).

If the queue is empty: done.
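The procedure above can be sketched as follows. This is an illustrative sketch rather than the article's implementation: `FailBuildSketch`, `Node` and `failTable` are hypothetical names, and each node additionally stores the path from the root (an assumption made purely so the result can be printed and inspected).

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

class FailBuildSketch {
    static class Node {
        final String path;  // characters from the root to this node (for display only)
        final Map<Character, Node> children = new TreeMap<>();
        Node fail;
        Node(String path) { this.path = path; }
    }

    /** Returns a map from each node's path to the path of the node its fail pointer targets. */
    static Map<String, String> failTable(List<String> targets) {
        Node root = new Node("");
        for (String t : targets) {                      // insert all targets into the trie
            Node n = root;
            for (char ch : t.toCharArray()) {
                Node next = n.children.get(ch);
                if (next == null) {
                    next = new Node(n.path + ch);
                    n.children.put(ch, next);
                }
                n = next;
            }
        }
        Map<String, String> table = new HashMap<>();
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.children.values()) {     // step 1
            child.fail = root;
            table.put(child.path, "");
            queue.add(child);
        }
        while (!queue.isEmpty()) {                      // step 2
            Node curr = queue.remove();                 // step 2.1
            for (Map.Entry<Character, Node> e : curr.children.entrySet()) {
                Node failTo = curr.fail;
                // Step 2.2 A: climb fail pointers until a node has a child for this character.
                while (failTo != null && !failTo.children.containsKey(e.getKey())) {
                    failTo = failTo.fail;
                }
                Node child = e.getValue();
                child.fail = (failTo == null) ? root : failTo.children.get(e.getKey());
                table.put(child.path, child.fail.path);
                queue.add(child);                       // step 2.2 B
            }
        }
        return table;
    }
}
```

For the target set from section 2.1, `failTable(targets).get("ijabd")` is "abd", and an empty string as the value means the fail pointer leads to the root. Because nodes are processed breadth-first, the fail pointer of a node is always set before any of its children are handled.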

3.2 Understanding the construction principle through an example

The fail pointers are resolved in breadth-first (level-order) order. This means that by the time we resolve the fail pointer of a child of the current node, the fail pointer of the current node itself already points to the correct position.

To make the point clear, we stress once more: "the fail pointer of each node represents the longest common part between all suffixes of the character sequence from the root to that node and all prefixes in the entire target string set (that is, the entire trie)."

In the example shown in the figure, we want to find where the fail pointer of node y, a child of node x1, should point. We know that x1.fail points to x2. By the meaning of the fail pointer of x1, the character sequences inside the red solid ellipses are necessarily equal and represent the longest common part. By the meaning of y.fail, if x2 has a child that represents the same character as node y, then y.fail should point to that child.

What if x2 has no child for the character represented by node y? Since x2.fail points to x3, by the meaning of x2.fail the character sequences inside the green boxes are necessarily equal. Clearly, if x3 has a child with the same character as node y, then y.fail points to that child.

If x3 has no such child either, we repeat this step until some node xi has a null fail pointer, which means we have reached the root at the top; in that case we simply set y.fail = root.

The essence of the construction is this: given the longest common part already known for the current node, determine the longest common part for its child. This is entirely analogous to how the KMP algorithm computes its next array.

3.2.1 Determining where the fail pointer of node h points

Suppose we want to determine where the fail pointer of h, the child of node c in the figure, should point. Every node in the figure should have a dashed fail line, but as agreed in this article, fail lines pointing to the root are not drawn.

The left figure shows the state before h.fail is determined; the right figure shows the state after.

In the left figure, the fail pointers of the nodes inside the solid blue line have already been determined. How do we find where h.fail should point? The fail pointer of node c (the parent of h) is already known: it points to the longest common part of all prefixes in the trie and all suffixes ("bc" and "c") of the character sequence 'a' 'b' 'c'. We now want the longest common part of all prefixes in the target string set and all suffixes of the character sequence 'a' 'b' 'c' 'h'. The node that c.fail points to has a child h, so h.fail should point to that child. The right figure shows the situation after h.fail has been determined.

3.2.2 Determining where i.fail points

The left figure shows the state before i.fail is determined; the right figure shows the state after.

When we determine where i.fail points, h.fail (h is the parent of i in the figure) already points to the correct position. That is, we already know that the longest common part of all prefixes in the target string set and all suffixes of the character sequence 'a' 'b' 'c' 'h' is the prefix 'c' 'h' in the trie. The node that h.fail points to has no child i (its only child is n). The longest prefix in the trie that matches a suffix of the sequence 'c' 'h' is given by the fail pointer of h.fail, which points to the root (by this blog's drawing convention, that dashed line is not drawn). The root does have a child that represents the character i, so i.fail points to it, as shown in the right figure.

Now that i.fail is known, try working out on paper where j.fail points, to deepen your understanding of how the AC automaton is built.

4. Java implementation of the AC automaton

package datastruct;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map.Entry;

public class AhoCorasickAutomation {
    /* The AC automaton in this example only handles English (ASCII) strings, so the array length is 128. */
    private static final int ASCII = 128;

    /* Root node of the AC automaton; the root stores no character information. */
    private Node root;

    /* The collection of target strings to search for. */
    private List<String> target;

    /* Search result: the key is a target string, the value lists the positions where it occurs in the text. */
    private HashMap<String, List<Integer>> result;

    /* Static nested class for a node of the AC automaton; a node does not store its own character. */
    private static class Node {
        /* If the node is an endpoint, i.e. the path from the root to it spells a target string,
         * then str != null and str holds that string. */
        String str;

        /* ASCII == 128, so this is effectively a 128-ary tree. */
        Node[] table = new Node[ASCII];

        /* Next node to try when no child of this node matches the current text character. */
        Node fail;

        public boolean isWord() {
            return str != null;
        }
    }

    /* target is the set of target strings to search for. */
    public AhoCorasickAutomation(List<String> target) {
        root = new Node();
        this.target = target;
        buildTrieTree();
        build_AC_FromTrie();
    }

    /* Build the trie from the target strings. */
    private void buildTrieTree() {
        for (String targetStr : target) {
            Node curr = root;
            for (int i = 0; i < targetStr.length(); i++) {
                char ch = targetStr.charAt(i);
                if (curr.table[ch] == null) {
                    curr.table[ch] = new Node();
                }
                curr = curr.table[ch];
            }
            /* Mark the node for the last character of each target string as an endpoint. */
            curr.str = targetStr;
        }
    }

    /* Build the AC automaton from the trie; this is equivalent to building the next array in KMP. */
    private void build_AC_FromTrie() {
        /* Queue used for the breadth-first traversal. */
        LinkedList<Node> queue = new LinkedList<Node>();

        /* Handle the children of the root separately. */
        for (Node x : root.table) {
            if (x != null) {
                /* The fail pointer of every child of the root points to the root. */
                x.fail = root;
                /* Enqueue every child of the root. */
                queue.addLast(x);
            }
        }

        while (!queue.isEmpty()) {
            /* Determine the fail pointers of all children of the dequeued node. */
            Node p = queue.removeFirst();
            for (int i = 0; i < p.table.length; i++) {
                if (p.table[i] != null) {
                    /* Enqueue the child. */
                    queue.addLast(p.table[i]);
                    /* Start looking from p.fail. */
                    Node failTo = p.fail;
                    while (true) {
                        /* We went past the root without finding a match. */
                        if (failTo == null) {
                            p.table[i].fail = root;
                            break;
                        }
                        /* A common prefix exists. */
                        if (failTo.table[i] != null) {
                            p.table[i].fail = failTo.table[i];
                            break;
                        } else {
                            /* Keep looking further up. */
                            failTo = failTo.fail;
                        }
                    }
                }
            }
        }
    }

    /* Find all target strings in a text string. */
    public HashMap<String, List<Integer>> find(String text) {
        /* Create the object that stores the result. */
        result = new HashMap<String, List<Integer>>();
        for (String s : target) {
            result.put(s, new LinkedList<Integer>());
        }

        Node curr = root;
        int i = 0;
        while (i < text.length()) {
            /* Current character of the text string. */
            char ch = text.charAt(i);
            /* Compare the text character with the characters in the AC automaton. */
            if (curr.table[ch] != null) {
                /* On a match, the automaton moves to the next state. */
                curr = curr.table[ch];
                if (curr.isWord()) {
                    result.get(curr.str).add(i - curr.str.length() + 1);
                }
                /* Easy to overlook: the middle of one target string may contain another target string,
                 * so even if the current node is not an endpoint, the path to it may end with a target.
                 * Note: only curr.fail itself is checked; endpoints further along the fail chain are not reported. */
                if (curr.fail != null && curr.fail.isWord()) {
                    result.get(curr.fail.str).add(i - curr.fail.str.length() + 1);
                }
                /* Advance the index to the next character of the text string. */
                i++;
            } else {
                /* On a mismatch, move to the next state that should be compared. */
                curr = curr.fail;
                /* Nothing was found even at the root: no text fragment ending with ch is a prefix
                 * of any target string. Reset the state machine and compare the next character. */
                if (curr == null) {
                    curr = root;
                    i++;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> target = new ArrayList<String>();
        target.add("abcdef");
        target.add("abhab");
        target.add("bcd");
        target.add("cde");
        target.add("cdfkcdf");

        String text = "bcabcdebcedfabcdefababkabhabk";

        AhoCorasickAutomation aca = new AhoCorasickAutomation(target);
        HashMap<String, List<Integer>> result = aca.find(text);

        System.out.println(text);
        for (Entry<String, List<Integer>> entry : result.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}

Running the program gives the following output. We can see that "bcd" occurs twice in the text string, at indices 3 and 13, and so on.

bcabcdebcedfabcdefababkabhabk
bcd: [3, 13]
cdfkcdf: []
cde: [4, 14]
abcdef: [12]
abhab: [23]