An Intensive Reading of the KMP Algorithm

Source: Internet
Author: User

After reading many explanations of the KMP algorithm on the Internet, I was still confused. I finally understood it by going back to Yan Weimin's Data Structures textbook, and I am sharing what I learned here.

If you have not worked with algorithms for a long time, start with the most basic pattern matching algorithm, the BF (brute-force) algorithm. If you already know it, skip straight to the KMP section.

1. BF algorithm

The basic idea is to compare the pattern string T with the main string S character by character, starting from a given position in S. If the characters are equal, the comparison continues with the following characters one by one; otherwise, matching restarts from the first character of T and from the character after the previous starting position in S. In effect there are two nested loops: the outer loop tries each starting position in the main string (at most S.len of them), and the inner loop compares at most T.len characters of the pattern. The worst-case time complexity of this algorithm is therefore O(S.len * T.len).

The matching process of the BF algorithm is shown in Figure 1:

Figure 1

The code is as follows:

int index(String S, String T, int pos)
{
    // Return the position of the first occurrence of T in the main string S
    // at or after the pos-th character; return 0 if there is none.
    // Requires T to be non-empty and 1 <= pos <= S.len. Strings are 1-based.
    int i = pos, j = 1;
    while (i <= S.len && j <= T.len)
    {
        if (S[i] == T[j]) { ++i; ++j; }   // continue comparing subsequent characters
        else { i = i - j + 2; j = 1; }    // back up the pointer and start matching again
    }
    if (j > T.len) return i - T.len;
    else return 0;
}
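To make the O(S.len * T.len) bound concrete, here is a minimal runnable C sketch of the same brute-force loop that also counts character comparisons. It is only an illustration under a few assumptions: it works on ordinary 0-based C strings instead of the 1-based strings above, and the function name bf_index_counted and the comparisons parameter are made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Brute-force search that also counts character comparisons.
 * Returns the 1-based position of the first occurrence of t in s,
 * or 0 if there is none. Name and counter parameter are illustrative. */
int bf_index_counted(const char *s, const char *t, long *comparisons)
{
    size_t slen = strlen(s), tlen = strlen(t);
    size_t i = 0, j = 0;                 /* 0-based offsets into s and t */

    assert(tlen > 0);                    /* the pattern must not be empty */
    *comparisons = 0;
    while (i < slen && j < tlen) {
        ++*comparisons;
        if (s[i] == t[j]) { ++i; ++j; }  /* keep comparing */
        else { i = i - j + 1; j = 0; }   /* back up i, restart the pattern */
    }
    return (j == tlen) ? (int)(i - tlen) + 1 : 0;
}
```

With S = "aaaaaaaaab" and T = "aaab", the call returns 7 after 28 comparisons: each of the 7 starting positions costs 4 comparisons. This kind of repeated work is what the KMP algorithm in the next section avoids.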


2. KMP Algorithm

In fact, pattern matching can be completed in O(S.len + T.len) time. The improvement is that when a character comparison fails during a match, the pointer i is not moved back; instead, the "partial match" already obtained is used to slide the pattern as far to the right as possible before comparing again. For example:


Figure 2

Review the matching process in Figure 1. In the third pass, after the mismatch at i = 7, j = 5, the BF algorithm starts again from i = 4, j = 1. On inspection, the comparisons at i = 4 and j = 1, i = 5 and j = 1, and i = 6 and j = 1 are all unnecessary. The partial match obtained in the third pass already tells us that the 4th, 5th and 6th characters of the main string must be 'b', 'c' and 'a'. Because the first character of the pattern T is 'a', the outcome of comparing it with each of these three characters is known in advance. It is enough to slide the pattern three characters to the right and continue with the comparison at i = 7, j = 2. Similarly, after the mismatch in the first pass, it is enough to slide the pattern two characters to the right and continue with the comparison at i = 3, j = 1. As Figure 2 shows, the pointer i never moves backward during the whole matching process.

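That is the principle. But after a mismatch, how do we know how far to slide the pattern string T? This is answered by the next function, which is defined as follows:

    next[j] = 0                                                    if j = 1
    next[j] = max{ k | 1 < k < j and T[1..k-1] = T[j-k+1..j-1] }   if this set is not empty
    next[j] = 1                                                    otherwise

From this definition, the next function values of the following pattern string can be derived: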


There are many different explanations of next on the Internet. An improved next function is discussed later, but it is best to start with the most basic one. I think the best way to really understand something is to build it up from nothing.

next[j] is the position in the pattern of the character that should be compared next with the current main-string character when the j-th character of the pattern mismatches the corresponding character of the main string. For example, when T[4] (j = 4) fails to match S[4], the character at position j = next[j] = 2 should be compared with S[4] next. Why? Look at the characters before position j = 4 (online explanations usually talk about "prefixes" and "suffixes" here). The character just before it is T[3] = 'a'. Since T[1] = 'a' and T[3] = S[3], there is no need to compare T[1] with S[3]; we can directly compare T[2] with S[4], and this is how the pattern slides. The meaning of next is now becoming clearer. The value of next[j] is determined by T[1..j-1] and, as shown below, can be computed from next[j-1]. It is the length of the longest proper prefix of T[1..j-1] that is also a suffix of T[1..j-1] (both read from left to right, not reversed), plus 1. In particular, the value of next[j] does not depend on T[j]. With the next values, the KMP algorithm can be implemented.

int index_kmp(String S, String T, int pos)
{
    // pos is not essential to the algorithm itself. Yan's textbook simply generalizes
    // the function so that matching starts from the pos-th character of the main string S.
    // next[] is assumed to already hold the next function values of T.
    int i = pos, j = 1;
    while (i <= S.len && j <= T.len)
    {
        if (j == 0 || S[i] == T[j]) { ++i; ++j; }   // continue comparing subsequent characters
        else j = next[j];                           // slide the pattern to the right
    }
    if (j > T.len) return i - T.len;                // match found
    else return 0;
}
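Before turning to the efficient computation, the next values can be checked straight from the "longest prefix that is also a suffix, plus 1" description. The following C sketch does exactly that by brute force. It is a deliberately naive illustration, not the textbook's method; the function name next_by_definition is made up here, and it uses a 0-based C string for the pattern while keeping next[] 1-based as in the text.

```c
#include <assert.h>
#include <string.h>

/* Fill next[1..len] for pattern t straight from the definition:
 * next[1] = 0, otherwise next[j] = 1 + (length of the longest proper
 * prefix of T[1..j-1] that is also a suffix of T[1..j-1]).
 * next must have room for len + 1 ints; index 0 is unused. */
void next_by_definition(const char *t, int next[])
{
    int len = (int)strlen(t);

    assert(len > 0);
    next[1] = 0;
    for (int j = 2; j <= len; ++j) {
        next[j] = 1;                          /* no common prefix/suffix */
        for (int k = j - 1; k > 1; --k) {     /* try the largest k first */
            /* T[1..k-1] is t[0..k-2] in C; T[j-k+1..j-1] starts at offset j-k */
            if (memcmp(t, t + (j - k), (size_t)(k - 1)) == 0) {
                next[j] = k;
                break;
            }
        }
    }
}
```

For the pattern "abaabcac" (an example chosen here because it agrees with the T[4] case discussed above), this yields next[1..8] = 0 1 1 2 2 3 1 2.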

Computing the next values by hand is easy enough, but the code that does it looked awkward to me at first. It is based on a recurrence (each value is derived from earlier ones), which is not quite the same as ordinary recursion. It really is an elegant solution!

Here the next function values are derived by recurrence.

By definition:

next[1] = 0

Suppose next[j] = k, which means that the pattern string satisfies T[1..k-1] = T[j-k+1..j-1], with k the largest such value. Then there are two cases for next[j+1]:

(1) If T[k] = T[j], then T[1..k] = T[j-k+1..j], so next[j+1] = k + 1 = next[j] + 1.

(2) If T[k] != T[j], then T[1..k] != T[j-k+1..j]. In this case, computing next is itself a pattern matching problem in which the pattern string serves as both the main string and the pattern. Slide the pattern to the right so that its next[k]-th character is compared with the j-th character of the main string. Let k' = next[k]. If T[j] = T[k'], then immediately before the (j+1)-th character of the main string there is a longest substring of length k' that equals the substring of the pattern that starts at its first character and has length k'.

That is, T[1..k'] = T[j-k'+1..j], and next[j+1] = k' + 1 = next[k] + 1.

Similarly, if T[j] != T[k'], continue sliding the pattern to the right, to position next[k'], and so on, until T[j] matches some character of the pattern, or no k' (1 < k' < j) satisfies the equality, in which case next[j+1] = 1.

The algorithm is implemented as follows (it closely resembles the KMP algorithm itself):

void get_next(String T, int next[])
{
    // Compute the next function values of the pattern T and store them in next[1..T.len].
    int i = 1, j = 0;
    next[1] = 0;
    while (i < T.len)
    {
        if (j == 0 || T[i] == T[j]) { ++i; ++j; next[i] = j; }
        else j = next[j];
    }
}
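Putting the two routines together, a self-contained C version might look like the sketch below. It is an adaptation under stated assumptions, not the textbook code: it takes ordinary 0-based C strings, so s[i - 1] and t[j - 1] play the roles of S[i] and T[j], and the names get_next_c and kmp_index, as well as the -1 return value on allocation failure, are choices made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Recurrence from the text on a C string: t[j - 1] plays the role of T[j]. */
static void get_next_c(const char *t, int len, int next[])
{
    int i = 1, j = 0;

    next[1] = 0;
    while (i < len) {
        if (j == 0 || t[i - 1] == t[j - 1]) { ++i; ++j; next[i] = j; }
        else j = next[j];
    }
}

/* Returns the 1-based position of the first occurrence of t in s at or
 * after position pos, 0 if there is none, or -1 if allocation fails. */
int kmp_index(const char *s, const char *t, int pos)
{
    int slen = (int)strlen(s), tlen = (int)strlen(t);
    int i = pos, j = 1;
    int *next;

    assert(tlen > 0 && pos >= 1);
    next = malloc(((size_t)tlen + 1) * sizeof *next);   /* index 0 unused */
    if (next == NULL) return -1;
    get_next_c(t, tlen, next);

    while (i <= slen && j <= tlen) {
        if (j == 0 || s[i - 1] == t[j - 1]) { ++i; ++j; }
        else j = next[j];                /* i never moves backward */
    }
    free(next);
    return (j > tlen) ? i - tlen : 0;
}
```

For instance, kmp_index("ababcabcacbab", "abcac", 1) returns 6. These two strings are an example chosen to agree with the i = 7, j = 5 mismatch described earlier.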


The next function can be optimized further, and the optimization is again motivated by a concrete problem. Take, for example, the main string S = 'aaabaaaab' and the pattern string T = 'aaaab'.

When S[4] = 'b' and T[4] = 'a' are not equal, next[4] = 3 points to T[3], next[3] = 2 points to T[2], and so on, so S[4] is compared in turn with T[3], T[2] and T[1], even though they are all equal to T[4] and must fail as well. This is a lot of repeated work. In fact, the pattern can slide all the way to the right in one step, so that S[5] is compared directly with T[1]. In terms of the definition above: when S[i] != T[j] and k = next[j], if T[j] = T[k], there is no need to compare S[i] with T[k]; S[i] should be compared with T[next[k]] instead, and so on, until a position k' is reached with T[j] != T[k'], or the value drops to 0. The implementation code is as follows:

void get_nextval(String T, int nextval[])
{
    // Compute the improved next function values of the pattern T and store them in nextval[1..T.len].
    int i = 1, j = 0;
    nextval[1] = 0;
    while (i < T.len)
    {
        if (j == 0 || T[i] == T[j])
        {
            ++i; ++j;
            // The difference from get_next is this check of whether the two characters are the same.
            if (T[i] != T[j]) nextval[i] = j;   // different: use j, as in the basic next
            else nextval[i] = nextval[j];       // same: reuse the value already computed for position j
        }
        else j = nextval[j];
    }
}
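As a quick check of the improved function, here is a runnable C sketch of the same loop on a 0-based C string, where t[j - 1] plays the role of T[j]. The name get_nextval_c is an assumption of this example, and the caller is assumed to provide an array with room for strlen(t) + 1 ints.

```c
#include <assert.h>
#include <string.h>

/* Improved next values for pattern t, stored in nextval[1..len]
 * (index 0 is unused, so nextval needs room for len + 1 ints). */
void get_nextval_c(const char *t, int nextval[])
{
    int len = (int)strlen(t);
    int i = 1, j = 0;

    assert(len > 0);
    nextval[1] = 0;
    while (i < len) {
        if (j == 0 || t[i - 1] == t[j - 1]) {
            ++i; ++j;
            if (t[i - 1] != t[j - 1]) nextval[i] = j;  /* characters differ */
            else nextval[i] = nextval[j];              /* equal: skip ahead */
        } else {
            j = nextval[j];
        }
    }
}
```

For the pattern "aaaab" this gives nextval[1..5] = 0 0 0 0 4, whereas the basic next values are 0 1 2 3 4. A mismatch at any of the first four positions therefore moves on to the next main-string character at once instead of retrying T[3], T[2] and T[1].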

If you find any errors, please point them out.




