[Algorithm series 26] KMP algorithm of string matching

Source: Internet
Author: User

1. Brief introduction

The KMP algorithm is an improved string matching algorithm discovered by D. E. Knuth, J. H. Morris, and V. R. Pratt, hence the name Knuth-Morris-Pratt algorithm (KMP for short). Its key idea is to use the information gained from a failed match to reduce the number of comparisons between the pattern string and the main string, which makes matching fast.

2. KMP algorithm based on the partial match table

For example, given the string "BBC ABCDAB ABCDABCDABDE", I want to know whether it contains the search string "ABCDABD".

Step 1: The first character of the string "BBC ABCDAB ABCDABCDABDE" is compared with the first character of the search string "ABCDABD". Because B does not match A, the search string shifts one position to the right.

Step 2: B and A again do not match, so the search string shifts one more position to the right.

Step 3: This continues until a character of the string matches the first character of the search string.

Step 4: Then the next character of the string is compared with the next character of the search string; they also match.

Step 5: This continues until a character of the string differs from the corresponding character of the search string.

Step 6: At this point, the most natural response is to shift the whole search string one position to the right and compare again from its first character. This works but is inefficient, because the "search position" is moved back to characters that have already been compared. (This is the brute-force matching covered earlier in [Algorithm series 10].)
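The brute-force approach of Step 6 can be sketched as follows. This is only a minimal illustration, not code from the earlier article; the function name bruteForceSearch and the choice to return every match position are assumptions made for this example.

```cpp
#include <string>
#include <vector>

// Brute-force matching (hypothetical helper for illustration):
// try every alignment of `pattern` against `text` and restart the
// comparison from the first pattern character each time.
// Returns the 0-based start index of every match.
std::vector<int> bruteForceSearch(const std::string& text,
                                  const std::string& pattern) {
    std::vector<int> matches;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > n) return matches;

    for (int start = 0; start + m <= n; ++start) {
        int j = 0;
        // Compare character by character from the start of the pattern.
        while (j < m && text[start + j] == pattern[j]) ++j;
        if (j == m) matches.push_back(start);
        // Match or mismatch, the pattern shifts by exactly one position,
        // so text characters already compared are compared again.
    }
    return matches;
}
```

For the example string and the search string "ABCDABD", this returns the single position 15. The worst case needs about n * m character comparisons, which is the cost KMP avoids.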

Step 7: A basic fact is that when the space fails to match D, you already know that the preceding six characters are "ABCDAB". The idea of the KMP algorithm is to exploit this known information: instead of moving the "search position" back to characters that have already been compared, keep moving it forward. This improves efficiency.

Step 8: How is this done? A partial match table can be computed for the search string. How the table is produced is explained later; for now it is enough to know how to use it. For "ABCDABD" the table is:

    Search string:       A  B  C  D  A  B  D
    Partial match value: 0  0  0  0  1  2  0

Step 9: When the space fails to match D, the preceding six characters "ABCDAB" have been matched. The table shows that the last matched character, B, has a "partial match value" of 2, so the shift distance is given by the following formula:

Shift distance = number of matched characters - partial match value of the last matched character

Because 6 - 2 = 4, the search string shifts 4 positions to the right.

Step 10: Because the space and C do not match, the search string must shift again. Now the number of matched characters is 2 ("AB"), and the corresponding "partial match value" is 0. So the shift distance is 2 - 0 = 2, and the search string shifts 2 positions to the right.

Step 11: Because the space and A do not match, the search string shifts one more position to the right.

Step 12: Compare character by character until C and D are found not to match. The shift distance is 6 - 2 = 4, so the search string shifts another 4 positions to the right.

Step 13: Compare character by character up to the last character of the search string. Everything matches, so the search is complete.

To continue searching (that is, to find all matches), the shift distance is 7 - 0 = 7, so the search string shifts 7 positions to the right; the steps are not repeated here.
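The walk-through above can be sketched in code as a loop that applies the shift formula after every alignment. This is a minimal sketch under stated assumptions: the function name searchWithShifts is made up for this example, and the partial match table is passed in by the caller (for "ABCDABD" it is 0, 0, 0, 0, 1, 2, 0) rather than computed here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: search driven by the shift formula
//   shift = matched characters - partial match value of the last matched character.
// Assumes pmt.size() == pattern.size(). Returns all 0-based match positions.
std::vector<int> searchWithShifts(const std::string& text,
                                  const std::string& pattern,
                                  const std::vector<int>& pmt) {
    std::vector<int> matches;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > n) return matches;

    int start = 0;    // text index aligned with pattern[0]
    int matched = 0;  // pattern characters already known to match here
    while (start + m <= n) {
        while (matched < m && text[start + matched] == pattern[matched]) {
            ++matched;
        }
        if (matched == m) matches.push_back(start);
        if (matched == 0) {
            ++start;  // first character failed: shift by one position
            continue;
        }
        const int keep = pmt[matched - 1];  // prefix that stays matched
        start += matched - keep;            // the shift formula
        matched = keep;                     // never re-compare these characters
    }
    return matches;
}
```

On the example string this performs the same shifts as Steps 9 to 13 (4, 2, 1, 4, and finally 7) and reports the match at position 15.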

3. Partial match table

Here is how the partial match table is produced. First, you need to understand two concepts: prefixes and suffixes.

    • The prefixes of a string are all of its leading substrings that do not include the last character;
    • The suffixes of a string are all of its trailing substrings that do not include the first character.

To illustrate, the string "ABCD" has the prefixes "A", "AB", "ABC" and the suffixes "BCD", "CD", "D".
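A minimal sketch that enumerates both sets for a given string may make the two definitions concrete. The function names properPrefixes and properSuffixes are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// All leading substrings that exclude the last character, shortest first.
std::vector<std::string> properPrefixes(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t len = 1; len < s.size(); ++len) {
        out.push_back(s.substr(0, len));
    }
    return out;
}

// All trailing substrings that exclude the first character, longest first.
std::vector<std::string> properSuffixes(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t start = 1; start < s.size(); ++start) {
        out.push_back(s.substr(start));
    }
    return out;
}
```

A one-character string has neither prefixes nor suffixes under these definitions, so both functions return an empty list for it.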

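The partial match value is the length of the longest element common to the prefixes and the suffixes.

Take "ABCDABD" as an example.

    • "A" has no prefixes and no suffixes, so the common element length is 0;
    • "AB" has prefixes [A] and suffixes [B]; the common element length is 0;
    • "ABC" has prefixes [A, AB] and suffixes [BC, C]; the common element length is 0;
    • "ABCD" has prefixes [A, AB, ABC] and suffixes [BCD, CD, D]; the common element length is 0;
    • "ABCDA" has prefixes [A, AB, ABC, ABCD] and suffixes [BCDA, CDA, DA, A]; the common element is "A", length 1;
    • "ABCDAB" has prefixes [A, AB, ABC, ABCD, ABCDA] and suffixes [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", length 2;
    • "ABCDABD" has prefixes [A, AB, ABC, ABCD, ABCDA, ABCDAB] and suffixes [BCDABD, CDABD, DABD, ABD, BD, D]; the common element length is 0.

This gives the following partial match table:

    Search string:       A  B  C  D  A  B  D
    Partial match value: 0  0  0  0  1  2  0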

The essence of "partial match" is that the head and the tail of a string sometimes repeat. For example, "ABCDAB" contains "AB" twice, so its "partial match value" is 2 (the length of "AB"). When the search string shifts, moving the first "AB" 4 positions to the right (string length - partial match value) brings it to the position of the second "AB".
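The partial match values can be computed directly from the definition, by testing candidate lengths from longest to shortest for each leading part of the search string. The sketch below does exactly that; it favors clarity over speed, and the function name partialMatchTable is an assumption made for this example.

```cpp
#include <string>
#include <vector>

// Definition-based computation (not the fast one): table[i] is the length
// of the longest string that is both a prefix and a suffix of
// pattern[0..i], excluding pattern[0..i] itself.
std::vector<int> partialMatchTable(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> table(m, 0);
    for (int i = 0; i < m; ++i) {
        // Try the longest candidate first; the first hit is the maximum.
        for (int len = i; len > 0; --len) {
            // Compare pattern[0..len-1] with pattern[i-len+1..i].
            if (pattern.compare(0, len, pattern, i + 1 - len, len) == 0) {
                table[i] = len;
                break;
            }
        }
    }
    return table;
}
```

For "ABCDABD" this yields 0, 0, 0, 0, 1, 2, 0.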

4. KMP algorithm based on the next array

The matching process above shows that the crux of the problem is finding the maximum length of a matching prefix and suffix in the search string. Once the maximum common prefix/suffix length of the substring before each character of the search string is known, matching can be based on it. This maximum length is exactly what the next array records.

4.1 Deriving the next array from the partial match table

As described above, we already know the maximum common prefix/suffix length for each leading part of the search string "ABCDABD". These values form its partial match table: 0, 0, 0, 0, 1, 2, 0.

With the partial match table, the shift distance can be calculated with the following formula:

Shift distance = number of matched characters - partial match value of the last matched character

When matching with the partial match table and this formula, notice that on a mismatch the mismatched character itself is never considered; we always look at the "partial match value" of the character before it. This leads to the next array.

Comparing the next array with the partial match table shows that the next array is the partial match table shifted one position to the right, with the first element set to -1. For "ABCDABD" the next array is -1, 0, 0, 0, 0, 1, 2. Computing the next array is therefore simple: find the maximum matching prefix/suffix length for each position, shift all values one position to the right, and set the first element to -1. (You can also compute the next value of a character directly: it is the length of the longest matching prefix and suffix in the substring before that character.)

The formula for the shift distance of the search string becomes:

Shift distance = position of the mismatched character - next value of the mismatched character

The two formulas are essentially the same: the (0-based) position of the mismatched character equals the number of matched characters, and the next value of the mismatched character equals the partial match value of the character before it. It is just another way of saying the same thing.
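The relationship between the two tables and the updated formula can be sketched as follows. Both function names are assumptions made for this example, and positions are 0-based.

```cpp
#include <cstddef>
#include <vector>

// Shift the partial match table one position to the right and put -1 in
// front; the last partial match value is dropped.
std::vector<int> nextFromPartialMatchTable(const std::vector<int>& pmt) {
    std::vector<int> next(pmt.size());
    if (next.empty()) return next;
    next[0] = -1;
    for (std::size_t j = 1; j < pmt.size(); ++j) {
        next[j] = pmt[j - 1];
    }
    return next;
}

// Shift distance = position of the mismatched character - its next value.
// Assumes 0 <= j < next.size().
int shiftOnMismatch(int j, const std::vector<int>& next) {
    return j - next[j];
}
```

With the table 0, 0, 0, 0, 1, 2, 0 this produces -1, 0, 0, 0, 0, 1, 2. A mismatch at position 6 then gives a shift of 6 - 2 = 4, the same result as the first formula, and a mismatch at position 0 gives 0 - (-1) = 1, which is why the first element is set to -1.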

4.2 Computing the next array recursively

For a given string p, the next array means the following: for k = next[j], the prefix p[0...k-1] matches the suffix p[j-k...j-1] of p[0...j-1], with k as large as possible and k < j. A brute-force method for computing next follows directly from this definition. Its complexity is O(n^2) or worse, where n is the length of p.
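A brute-force computation that follows the definition literally might look like the sketch below. The function name bruteForceNext is an assumption made for this example; the convention next[0] = -1 is the one used throughout this article.

```cpp
#include <string>
#include <vector>

// For each j, try k = j-1, j-2, ..., 1 and keep the first (largest) k
// for which p[0..k-1] equals p[j-k..j-1]. If none fits, next[j] stays 0.
std::vector<int> bruteForceNext(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> next(m, 0);
    if (m == 0) return next;
    next[0] = -1;
    for (int j = 1; j < m; ++j) {
        for (int k = j - 1; k > 0; --k) {
            if (p.compare(0, k, p, j - k, k) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}
```

Every candidate k costs a full substring comparison, which is why the recursive method that follows is preferred.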

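Consider a different approach. We know that next[0] = -1 and next[1] = 0.
Suppose k = next[j], so that p[0...k-1] = p[j-k...j-1]. There are two cases when computing next[j+1]:

    • If p[k] == p[j], then p[0...k] = p[j-k...j], so next[j+1] = k+1 = next[j]+1.
    • If p[k] != p[j], this can be seen as another string matching problem in which both the main string and the pattern string are p. When the match fails, how should k move? Clearly k = next[k], after which the comparison is repeated.

    #include <string>

    // Fills next[0..T.size()-1]; the caller must provide enough room.
    void GetNext(const std::string& T, int next[]) {
        int size = static_cast<int>(T.size());
        if (size == 0) return;
        next[0] = -1;
        int k = -1;  // candidate prefix length (-1 means "start over")
        int j = 0;   // next[0..j] has been computed so far
        while (j < size - 1) {
            // T[k] is the prefix character, T[j] the suffix character
            if (k == -1 || T[k] == T[j]) {
                ++k;
                ++j;
                next[j] = k;
            } else {
                // Backtracking: fall back to the next shorter prefix
                k = next[k];
            }
        }
    }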
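The article stops at building the next array. As a hedged sketch of how the search itself might use it, the code below applies the rule "on a mismatch at pattern position j, continue with j = next[j]", which is the shift formula of section 4.1 expressed through indices. The names buildNext and kmpSearch are assumptions made for this example; buildNext repeats the recurrence above but returns a std::vector.

```cpp
#include <string>
#include <vector>

// Same recurrence as GetNext, returning the array instead of filling one.
std::vector<int> buildNext(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> next(m, 0);
    if (m == 0) return next;
    next[0] = -1;
    int k = -1;
    int j = 0;
    while (j < m - 1) {
        if (k == -1 || p[k] == p[j]) {
            ++k;
            ++j;
            next[j] = k;
        } else {
            k = next[k];
        }
    }
    return next;
}

// Returns the 0-based index of the first match, or -1 if there is none.
// An empty pattern is treated as matching at index 0 (a choice made here).
int kmpSearch(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return 0;
    const std::vector<int> next = buildNext(pattern);

    int i = 0;  // position in text; it never moves backwards
    int j = 0;  // position in pattern
    while (i < n && j < m) {
        if (j == -1 || text[i] == pattern[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];  // shift the pattern by j - next[j] positions
        }
    }
    return (j == m) ? i - m : -1;
}
```

Because i only ever increases, the search takes time proportional to the length of the text plus the length of the pattern. On the example string, kmpSearch finds "ABCDABD" at position 15.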

