The idea behind multi-pattern string matching with the AC automaton

Source: Internet
Author: User

The standard KMP algorithm matches a single pattern string, that is, it looks for occurrences of one pattern in a text (the main string). But suppose several pattern strings are given at once: how do we count the occurrences of all of them in the text?

Building on KMP, a naive approach is to enumerate the pattern strings and run KMP once for each, accumulating the count as we go. With n pattern strings and a text of length m, the whole algorithm costs roughly O(n*m). This is only a rough estimate: it ignores the time needed to compute the auxiliary array (the next array in KMP). The cost is still too high, and the approach fails to carry over KMP's core idea of making full use of information that is already known. This article introduces a cleverer algorithm, the AC (Aho-Corasick) automaton, which completes multi-pattern matching in linear time: once the automaton has been built, scanning the text takes O(m).
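The naive approach described above can be sketched as follows. This is a minimal illustration rather than a reference implementation; the function names build_next, kmp_count and naive_multi_count are chosen here for the example and do not come from the original article.

```python
def build_next(pattern):
    """KMP auxiliary array: nxt[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    nxt = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = nxt[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        nxt[i] = k
    return nxt


def kmp_count(text, pattern):
    """Count occurrences of one pattern in text (overlaps allowed)."""
    if not pattern:
        return 0
    nxt = build_next(pattern)
    count = 0
    k = 0  # number of pattern characters currently matched
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = nxt[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = nxt[k - 1]  # keep going so overlapping matches are found
    return count


def naive_multi_count(text, patterns):
    """Run KMP once per pattern: roughly O(n * m) for n patterns."""
    return sum(kmp_count(text, p) for p in patterns)
```

Every pattern triggers a full pass over the text, which is exactly the repeated work the AC automaton is meant to avoid.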

The following is a brief introduction to the idea behind the AC automaton for solving the multi-pattern string matching problem.

Recall the idea behind KMP: before matching, we extract as much information as possible from the pattern string, namely its next array, and this is what makes the scan over the text so much cheaper. Multi-pattern matching should follow the same idea: we should fully exploit the information contained in the set of pattern strings.

To save space, it is natural to think of the dictionary tree (trie), a structure covered earlier: we build a dictionary tree from the pattern strings and then scan the text linearly against it.
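A minimal sketch of building such a dictionary tree is shown below. The class name TrieNode, the field end_count and the helper count_nodes are assumptions made for this illustration; the article does not prescribe a particular layout.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.end_count = 0   # number of patterns that end at this node


def build_trie(patterns):
    """Insert every pattern; shared prefixes share the same nodes."""
    root = TrieNode()
    for p in patterns:
        node = root
        for ch in p:
            node = node.children.setdefault(ch, TrieNode())
        node.end_count += 1  # mark the last character as a terminal node
    return root


def count_nodes(node):
    """Total number of nodes in the tree, root included."""
    return 1 + sum(count_nodes(c) for c in node.children.values())
```

For the patterns he, she, his and hers, the prefixes h and he are stored only once, which is where the space saving comes from.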

Recall the matching process in KMP once more. Once we have the next array, we slide the pattern string along the text to carry out the match. Thanks to the next array, after a mismatch the pattern can be shifted to the right as far as possible while still guaranteeing that no match is missed. The AC automaton differs in the details, but the idea is the same. In practice, the text is now matched against the dictionary tree built from the pattern strings. When constructing the dictionary tree, we mark the node for the last character of each pattern string as a terminal node; whenever the text reaches a terminal node while walking the dictionary tree, one pattern string has been matched.

So what remains is to optimize what happens after a mismatch. Suppose the text interval str[i]~str[j] has been matched and the next character fails. The best course is to find the longest proper suffix of str[i]~str[j] that is also a prefix of some pattern string, that is, a path starting at the root of the dictionary tree, and to continue matching from the end of that path. To achieve this step of "moving as far right in the text as possible while guaranteeing that no match is missed", we attach a "jump on match failure" pointer to every node of the dictionary tree (referred to below as the fail pointer). It records the position in the dictionary tree we should jump to when matching fails at that node.

As you can see, multi-pattern matching with the AC automaton is in fact very similar to the KMP algorithm.

So the problem to solve here, just like computing the next array in KMP, is how to compute the fail pointers in the dictionary tree.

Suppose we are computing the fail pointer of node VI. We first go to the node VP that the fail pointer of VI's parent points to. From the definition of the fail pointer it is easy to see that:

(1) If the current node VP has a child VJ whose character matches that of VI, then the fail pointer of VI must point to VJ.

(2) If no child of the current node VP matches VI, then we move to the node that VP's fail pointer points to and check whether one of its children matches VI. If not, we repeat the operation; if we reach the root and still find no match, the fail pointer of VI points to the root.
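Rules (1) and (2) can be sketched as a breadth-first pass over the dictionary tree, so that the fail pointer of a parent is always known before its children are processed. The breadth-first order, the choice to let the root's fail pointer point to itself, and the names Node, build_fail and find are assumptions of this sketch; the variables vi and vp mirror VI and VP in the text.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.fail = None     # fail pointer, filled in by build_fail
        self.end_count = 0   # number of patterns ending here


def build_trie(patterns):
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())
        node.end_count += 1
    return root


def build_fail(root):
    """Set the fail pointer of every node, level by level."""
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root            # depth-1 nodes always fall back to root
        queue.append(child)
    while queue:
        parent = queue.popleft()
        for ch, vi in parent.children.items():
            vp = parent.fail         # start from the parent's fail target
            # Rule (2): no matching child, so follow fail pointers upward.
            while vp is not root and ch not in vp.children:
                vp = vp.fail
            # Rule (1): a matching child VJ exists; otherwise use the root.
            vi.fail = vp.children.get(ch, root)
            queue.append(vi)


def find(root, s):
    """Helper for inspection: return the node reached by spelling s."""
    node = root
    for ch in s:
        node = node.children[ch]
    return node
```

With the patterns he, she, his and hers, the node for she ends up with its fail pointer on the node for he, since he is the longest proper suffix of she that is also a path from the root.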

Case (2) looks recursive. Why is that, and why is it correct? It follows from the definition of the fail pointer. The careful reader will have noticed that the fail pointer essentially describes a common prefix and suffix: for a matched string str[i]~str[j], after a mismatch we take its longest proper suffix that is also a prefix in the dictionary tree (say str[p]~str[j]) and try to append the character a that caused the mismatch. But that extension may not exist in the dictionary tree. Do we then restart the match from scratch? Of course not. We find the longest proper suffix of str[p]~str[j] that is also a prefix in the dictionary tree (say str[k]~str[j]), try to append the character a again, and repeat the operation ...
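To make the sequence of candidates concrete, the brute-force sketch below lists, from longest to shortest, every proper suffix of a matched string that is also a prefix of some pattern. These are the positions that repeatedly following fail pointers would visit. It deliberately avoids the automaton itself; the function name fallback_candidates and the example patterns are made up for this illustration.

```python
def fallback_candidates(matched, patterns):
    """Proper suffixes of `matched` that are prefixes of some pattern,
    ordered from longest to shortest."""
    # Every non-empty prefix of every pattern is a path from the root.
    prefixes = {p[:k] for p in patterns for k in range(1, len(p) + 1)}
    return [matched[i:] for i in range(1, len(matched))
            if matched[i:] in prefixes]


print(fallback_candidates("abcab", ["abcab", "bcab", "cab", "ab"]))
# prints ['bcab', 'cab', 'ab', 'b']
```

Each candidate is a suffix of the one before it, which is why trying them in turn and appending the mismatched character a never skips a possible match.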

This is a brief description of how the AC automaton solves the problem.
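Putting the pieces together, one possible end-to-end sketch for counting all pattern occurrences in a text follows. Two details here go beyond what the article states and are assumptions of this sketch: each node stores a hits value that adds up the patterns ending at the node and along its fail chain (so that a pattern such as he is still counted while the walk is inside she), and empty patterns are skipped. The names Node, build_automaton and count_matches are illustrative.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.fail = None     # fail pointer
        self.end_count = 0   # patterns ending exactly at this node
        self.hits = 0        # patterns ending here or along the fail chain


def build_automaton(patterns):
    root = Node()
    for p in patterns:
        if not p:
            continue                      # assumption: ignore empty patterns
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())
        node.end_count += 1
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root
        child.hits = child.end_count
        queue.append(child)
    while queue:                          # breadth-first: parents before children
        parent = queue.popleft()
        for ch, vi in parent.children.items():
            vp = parent.fail
            while vp is not root and ch not in vp.children:
                vp = vp.fail
            vi.fail = vp.children.get(ch, root)
            # The fail target is shallower, so its hits value is already final.
            vi.hits = vi.end_count + vi.fail.hits
            queue.append(vi)
    return root


def count_matches(text, patterns):
    """Total number of occurrences of all patterns in text."""
    root = build_automaton(patterns)
    node = root
    total = 0
    for ch in text:
        # On a mismatch, fall back along fail pointers instead of restarting.
        while node is not root and ch not in node.children:
            node = node.fail
        node = node.children.get(ch, root)
        total += node.hits
    return total


print(count_matches("ushers", ["he", "she", "his", "hers"]))  # prints 3
```

The text is scanned once from left to right and the position in the text never moves backward, which mirrors how KMP avoids re-reading the text.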

As you can see, it follows the same line of thinking as the KMP algorithm. This is a reminder to learn by analogy, a pattern that shows up clearly in many places (single, double, and triple integrals; functions of one variable and of several variables; the distribution of a single random variable and the joint distribution of several). Many ideas are first established in one dimension and then extended to two dimensions and beyond.
