Deep understanding of the fast pattern matching algorithm (KMP)

Source: Internet
Author: User

Anyone who uses a computer probably knows that most software with text-editing features (Word, for example) offers a Ctrl+F shortcut. It provides the "Find", "Replace", and "Replace All" functions. This is a typical application of pattern matching: searching for a string within a body of text.

1. Pattern Matching
The pattern matching model is as follows: we are given two strings S and P. S is called the target string and contains n characters. P is called the pattern string and contains m characters, with m <= n. The search for P starts from a given position in S (usually the first position of S). If P is found, its position in the target string is returned (that is, the index in S of the first character of P). If P does not occur in S, -1 is returned. This is the definition of pattern matching. Let's look at how to implement a pattern matching algorithm.

2. Simple pattern matching
The simple pattern matching algorithm is easy to understand. The general idea is as follows: starting from the first character S[0] of S, compare the characters of P with the characters of S one by one. If S[0] = P[0] && ... && S[m-1] = P[m-1], the match succeeds, no further work is needed, and index 0 is returned. If at some step S[i] != P[i], the remaining characters of P need not be compared, because this alignment cannot succeed. The comparison then restarts with the second character of S against the first character of P, and continues in the same way until either S[m] = P[m-1] (a match at index 1) or some i is found with S[i+1] != P[i]. And so on. If the start position in S reaches n-m and no alignment has succeeded, then P does not occur in S. (Think about why n-m is emphasized here: a later start position would leave fewer than m characters in S.) The implementation is straightforward; for details, refer to the internal implementation of the strstr function.
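The procedure described above can be sketched in C roughly as follows. This is a minimal illustration, not the actual strstr implementation; the function name naive_search and its parameter layout are assumptions made for this example.

```c
/* Naive search (illustrative sketch; the name naive_search is hypothetical).
 * Returns the index of the first occurrence of pattern p (length m)
 * in target s (length n), or -1 if p does not occur in s. */
int naive_search(const char *s, int n, const char *p, int m)
{
    /* Only start positions 0..n-m leave room for all m pattern characters. */
    for (int start = 0; start + m <= n; start++) {
        int i = 0;
        /* Compare character by character until a mismatch or the end of p. */
        while (i < m && s[start + i] == p[i]) {
            i++;
        }
        if (i == m) {
            return start;   /* every character of p matched */
        }
    }
    return -1;              /* no start position produced a full match */
}
```

For instance, naive_search("abcabd", 6, "abd", 3) should return 3. Note that after each failed alignment the sketch moves the start position by exactly one and discards everything learned from the previous comparisons.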

3. Fast pattern matching algorithm (KMP)
The main reason simple pattern matching is inefficient is that characters are compared repeatedly: each alignment is tried with no reference to the previous one. In fact, the result of the previous comparison can be reused, and this idea leads to fast pattern matching. In simple pattern matching the start position in the target string S advances one step at a time, but there is no need to fix the step size at 1.
Suppose the current state of the match is as follows: S[t] S[t+1] ... S[t+j] has matched P[0] P[1] ... P[j] exactly, and the characters now being compared, S[t+j+1] and P[j+1], differ. Where in S should the next comparison start? Simple pattern matching would restart from S[t+1] and compare it with P[0]. Fast pattern matching does not: it stays at S[t+j+1] and compares it with P[k+1]. What is k? It is the largest value less than j for which P[0] P[1] ... P[k] is identical to P[j-k] ... P[j]; we write k = next[j]. Since P[j-k] ... P[j] has already matched S[t+j-k] ... S[t+j], P[0] P[1] ... P[k] matches S[t+j-k] ... S[t+j] as well, so the next pair of characters to compare is S[t+j+1] and P[k+1]. For example, with P = abcabd and a mismatch at P[5], we have j = 4 and next[4] = 1 (the prefix ab equals the suffix ab), so the comparison resumes at P[2]. The position in S never moves backward and P does not restart from index 0, which is why KMP is fast.
Now comes the key question: how is k obtained? If computing k were expensive, the method would not be worthwhile. In fact, k depends only on the pattern string P, and m values of k are needed, one for each position j, with k = next[j]. They therefore need to be computed only once and stored in the next array, and the time required is linear in m. Let's see how to calculate the values of the next array, that is, k.
Calculate next[] by induction: set next[0] = -1. Assuming next[j] = k is known, obtain next[j+1] as follows.
(1) If P[k+1] = P[j+1], then clearly next[j+1] = k + 1. If P[k+1] != P[j+1], then next[j+1] < k + 1, so we look for the largest h < k such that P[0] P[1] ... P[h] = P[j-h] ... P[j] = P[k-h] ... P[k]. In other words, h = next[k]. This is an iterative process: the values already computed are reused to compute the later ones.
(2) If no such h exists (h = -1), only P[0] remains to be compared with P[j+1]. If they are equal, next[j+1] = 0; otherwise no proper prefix of P[0] P[1] ... P[j+1] is also its suffix, so next[j+1] = -1.
(3) If such an h exists, test whether P[h+1] and P[j+1] are equal. If they are, next[j+1] = h + 1; if not, repeat the step with next[h]. The computation of next[j+1] ends as soon as an equal pair is found or the value is determined to be -1.
Let's look at the implementation code:
int next[20] = {0}; // fixed capacity: the pattern may be at most 20 characters long
// Note: the result is stored in the global array next, which holds m values of k. If next[j] = k, then
// str[0] str[1] ... str[k] == str[j-k] str[j-k+1] ... str[j], with k < j as large as possible (k = -1 if there is none).
// When des[t+j+1] and pat[j+1] fail to match, the next comparison is between des[t+j+1] and pat[next[j]+1].
void Next(char str[], int len)
{
    next[0] = -1;
    for (int j = 1; j < len; j++)
    {
        int i = next[j - 1];
        while (str[j] != str[i + 1] && i >= 0) // iteration process: fall back to a shorter prefix
        {
            i = next[i];
        }
        if (str[j] == str[i + 1])
        {
            next[j] = i + 1;
        }
        else
        {
            next[j] = -1;
        }
    }
}
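
To make the table concrete, here is a minimal self-contained variant of the same computation that writes into a caller-supplied buffer instead of the fixed global array. The name build_next and its signature are assumptions made for this sketch; the logic follows the induction described above.

```c
/* Illustrative sketch: build_next is a hypothetical name, not from the article.
 * Fills next[0..m-1] for pattern p of length m, where next[j] = k means
 * p[0..k] == p[j-k..j] with k < j as large as possible, or -1 if none.
 * The caller must provide room for m ints. */
void build_next(const char *p, int m, int *next)
{
    if (m <= 0) {
        return;                  /* nothing to compute for an empty pattern */
    }
    next[0] = -1;
    for (int j = 1; j < m; j++) {
        int i = next[j - 1];
        /* Fall back through shorter prefixes until one can be extended. */
        while (i >= 0 && p[j] != p[i + 1]) {
            i = next[i];
        }
        next[j] = (p[j] == p[i + 1]) ? i + 1 : -1;
    }
}
```

For the pattern "ababaca" this should produce the values -1, -1, 0, 1, 2, -1, 0. Passing the buffer in explicitly avoids the 20-character limit of the global array, at the cost of the caller managing the storage.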

Now that the k values are saved in the next array, the KMP algorithm itself can be implemented:
// des is the target string, pat is the pattern string, and len1 and len2 are their respective lengths.
// Returns the index in des of the first occurrence of pat, or -1 if there is none.
int kmp(char des[], int len1, char pat[], int len2)
{
    Next(pat, len2); // build the next table for the pattern
    int p = 0, s = 0;
    while (p < len2 && s < len1)
    {
        if (pat[p] == des[s])
        {
            p++; s++;
        }
        else
        {
            if (p == 0)
            {
                s++; // the first character failed to match, so move on to the next character of des
            }
            else
            {
                p = next[p - 1] + 1; // use the failure function to decide how far pat should fall back
            }
        }
    }
    if (p < len2) // des was exhausted before pat matched completely
    {
        return -1;
    }
    return s - len2;
}

Time complexity:
The Next function takes O(m) time and the matching loop of the KMP algorithm takes O(n) time, so the time complexity of the entire algorithm is O(n + m).
Space complexity:
The next array introduces O(m) additional space.
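
One way to see the difference in practice is to count character comparisons. The sketch below is an illustration only: count_naive and count_kmp are hypothetical helpers written for this example, and count_kmp counts only the comparisons in the matching loop, not those spent building the table.

```c
#include <stdlib.h>

/* Character comparisons made by the naive search (hypothetical helper). */
long count_naive(const char *s, int n, const char *p, int m)
{
    long cmp = 0;
    for (int start = 0; start + m <= n; start++) {
        int i = 0;
        while (i < m) {
            cmp++;                          /* one comparison of s[start+i] with p[i] */
            if (s[start + i] != p[i]) break;
            i++;
        }
        if (i == m) break;                  /* full match found, stop searching */
    }
    return cmp;
}

/* Character comparisons made by the KMP matching loop (hypothetical helper).
 * Returns 0 for an empty pattern or if the table cannot be allocated. */
long count_kmp(const char *s, int n, const char *p, int m)
{
    if (m <= 0) return 0;
    int *next = malloc((size_t)m * sizeof *next);
    if (next == NULL) return 0;

    /* Build the next table exactly as in the induction above. */
    next[0] = -1;
    for (int j = 1; j < m; j++) {
        int i = next[j - 1];
        while (i >= 0 && p[j] != p[i + 1]) i = next[i];
        next[j] = (p[j] == p[i + 1]) ? i + 1 : -1;
    }

    long cmp = 0;
    int pi = 0, si = 0;
    while (pi < m && si < n) {
        cmp++;                              /* one comparison of p[pi] with s[si] */
        if (p[pi] == s[si]) { pi++; si++; }
        else if (pi == 0)   { si++; }
        else                { pi = next[pi - 1] + 1; }
    }
    free(next);
    return cmp;
}
```

On an input such as the target "aaaaaaaaaa" with the pattern "aaab", the naive count grows with the product of the two lengths, while the KMP matching loop stays within about twice the target length, because each comparison either advances the position in the target or shortens the matched prefix.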

4. An interview question that uses KMP
Given two strings s1 and s2, determine whether s2 is contained in some string obtained by cyclically shifting s1. For example, if s1 = AABCD and s2 = CDAA, the answer is true, because s1 can be cyclically shifted to CDAAB. If s1 = ABCD and s2 = ACBD, the answer is false.
Analysis: it is not hard to see that every string obtained by cyclically shifting s1 is a substring of s1s1 (s1 concatenated with itself). So if s2 is contained in some cyclic shift of s1, s2 must be a substring of s1s1, and the KMP algorithm can be used to search for it.
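
A minimal sketch of this idea in C follows. The function name contained_in_rotation is an assumption made for this example, and the standard library function strstr stands in for the substring search; a KMP routine like the one above could be substituted for it.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch (hypothetical name). Returns 1 if s2 occurs in some
 * cyclic shift of s1, and 0 otherwise. Also returns 0 if memory for the
 * doubled string cannot be allocated. */
int contained_in_rotation(const char *s1, const char *s2)
{
    size_t n = strlen(s1);
    if (strlen(s2) > n) {
        return 0;                     /* s2 is longer than any shift of s1 */
    }
    char *doubled = malloc(2 * n + 1);
    if (doubled == NULL) {
        return 0;
    }
    memcpy(doubled, s1, n);           /* first copy of s1 */
    memcpy(doubled + n, s1, n + 1);   /* second copy, including the '\0' */

    /* strstr is used here in place of KMP for brevity. */
    int found = strstr(doubled, s2) != NULL;
    free(doubled);
    return found;
}
```

With the examples from the question, contained_in_rotation("AABCD", "CDAA") should return 1 and contained_in_rotation("ABCD", "ACBD") should return 0. The length check matters: without it, a string longer than s1 could be found in s1s1 even though no single shift of s1 can contain it.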
