KMP Algorithm: A Detailed Explanation of the next Array

Source: Internet
Author: User

There are many ways to find a substring in a parent string, and KMP is the most common improved algorithm. When a mismatch occurs during matching, it can skip ahead several characters instead of restarting from the very next position, which speeds up the search.

As we will see, the algorithm pays off for substrings that have some symmetry. When the substring is symmetric, we need to look back at the characters already matched to find content that can be matched again.

 

The KMP algorithm uses an array that some call the prefix array and others call the next array. Each substring has a fixed next array, which determines how many characters can be skipped when a mismatch occurs. It also describes the degree of symmetry of the substring: the higher the degree of symmetry, the larger the values, and the more chances there are to re-match against earlier content.

Computing the next array is the key to the KMP algorithm, but it is not easy to understand, so I will explain it in plain words here. Most explanations are full of mathematical derivations, which are painful to read. This article is for readers who do not like formulas but still want to understand the KMP algorithm.

 

1. Here is an example showing the next array of a substring. The substring is highly symmetric, so its next values are relatively large.

Position i        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Prefix next[i]    0  0  0  0  1  2  3  1  2  3  4  5  6  7  4  0
Substring         A  G  C  T  A  G  C  A  G  C  T  A  G  C  T  G

Note: The symmetry discussed below is not mirror symmetry around a center, but symmetry of character blocks, where the block at the start is repeated at the end. For example, it is not abccba, but abcabc.

(1) Find symmetric strings one by one.

This is simple. We loop over the substring and look at its first 1 character, its first 2 characters, its first 3 characters, and so on, up to position 15 (the whole substring).

The first character, a, has nothing to be symmetric with, so its degree of symmetry is 0.

The first two characters, ag, are not symmetric, so the value is also 0.

Likewise, positions 0 to 3 all have the value 0.

For the first five characters, agcta, the string starts and ends with a, so the degree of symmetry is 1. For the first six, agctag, the leading ag matches the trailing ag, so the degree of symmetry is 2.

The next value can also be defined as follows.

First, you need to understand two concepts: "prefix" and "suffix". The prefixes of a string are all of its leading substrings that exclude the last character. The suffixes are all of its trailing substrings that exclude the first character.

For example, for the string "bread" the prefixes are b, br, bre, brea and the suffixes are read, ead, ad, d.

  

The "next value" is the length of the longest element common to the prefixes and the suffixes. Take "ABCDABD" as an example:

- The prefixes and suffixes of "A" are both empty sets, so the length of the common element is 0;

- The prefixes of "AB" are [A] and the suffixes are [B], so the length of the common element is 0;

- The prefixes of "ABC" are [A, AB] and the suffixes are [BC, C], so the length of the common element is 0;

- The prefixes of "ABCD" are [A, AB, ABC] and the suffixes are [BCD, CD, D], so the length of the common element is 0;

- The prefixes of "ABCDA" are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]. The common element is "A", so the length is 1;

- The prefixes of "ABCDAB" are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]. The common element is "AB", so the length is 2;

- The prefixes of "ABCDABD" are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D], so the length of the common element is 0.
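
The definition above can be checked directly with a brute-force routine. The following is only a minimal sketch, not the KMP method itself: the function name brute_force_next is made up for illustration, and it simply tries every candidate length from longest to shortest for each leading piece of the pattern.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (name chosen for this sketch): fills next[i] with the
   length of the longest prefix of pattern[0..i] that is also a suffix of it,
   excluding the whole piece itself. Slow on purpose: it follows the
   definition literally instead of reusing earlier results. */
void brute_force_next(const char *pattern, int next[])
{
    assert(pattern != NULL && next != NULL);
    int len = (int)strlen(pattern);
    for (int i = 0; i < len; i++) {
        next[i] = 0;
        /* k is a candidate length; it must be shorter than the piece (i + 1). */
        for (int k = i; k > 0; k--) {
            /* Compare the first k characters with the last k characters. */
            if (memcmp(pattern, pattern + (i + 1 - k), (size_t)k) == 0) {
                next[i] = k;
                break;
            }
        }
    }
}
```

Called on "ABCDABD" with an int array of at least 7 elements, it should reproduce the values listed above: 0, 0, 0, 0, 1, 2, 0.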


Keep this definition in mind. Now, how can we turn it into a program?

Follow these rules:

A. When the degree of symmetry of the previous character is 0, you only need to compare the current character with the first character of the substring. This is easy to understand: a value of 0 means nothing before it is symmetric, so after adding one character, at most the current character can be symmetric with the first one. For example, in agcta the value at t is 0, so the symmetry of the a after it depends only on whether it equals the first character.

 

B. This reasoning generalizes beyond the case of 0. If the next value of the previous character is 1, we compare the current character with the second character of the substring. The value 1 means the previous character equals the first character; if the current character also equals the second, two characters are symmetric and the degree of symmetry is 2. For example, in agctag the next value of the second-to-last character, a, is 1, which means it is symmetric with the first a. We then compare the last g with the second character, g. They are equal, so the degree of symmetry accumulates to 2.

 

C. By the same reasoning, as long as the characters keep matching, the value keeps accumulating. This should not be hard to follow; if it is, I have failed as a writer.

Of course, the symmetry cannot continue this smoothly forever. If the next comparison fails, we cannot inherit the previous symmetry. That only shows the symmetry is not as long as before; it does not mean there is no symmetry at all. In this case we have to reconsider, and this is the difficult part.

 

(2) Looking back for symmetry

Here we cannot inherit from the previous step, but we still want to find the degree of symmetry. The crudest way is to write a helper function that finds the maximum symmetry of the string from scratch. There are many ways to write it; for example, take the end of the current string and compare it against the beginning, checking whether the characters stay equal, until the start of the substring is reached. This is the slowest approach. KMP is optimized because the string has a regular structure that we can exploit.

Here we still use the table above as an example.

Positions i = 0 to 14 are shown below. The brackets are added only to illustrate the point:

(a g c t a g c) (a g c t a g c) t

The degrees of symmetry of the characters just before the final t are 1, 2, 3, 4, 5, 6, 7: the second-to-last character, c, has 7 characters symmetric with the start, so its value is 7. The final t, however, cannot extend that symmetry, so its degree of symmetry has to be computed afresh.

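First, we state two facts.

1. If t has any symmetry, its degree must be smaller than that of the preceding c, so we need to look for a shorter symmetry. This needs little explanation: if it were larger, t would simply have inherited the symmetry above.

2. A shorter symmetry must come from a sub-symmetry inside the symmetric block, and t must immediately follow that sub-symmetry.

In the example, the symmetric block agctagc (value 7) itself has a sub-symmetry of length 3 (agc), which is the next value at position 6. The character that follows this sub-symmetry, at position 3, is t, which equals the current t, so the value at position 14 is 3 + 1 = 4.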
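
To make the fallback through sub-symmetries concrete, here is a small sketch that computes one next value and records every candidate length it compares along the way. The function name next_with_trace and its parameters are assumptions made for this illustration; they are not part of the original algorithm listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (name chosen for this sketch): given next[0..i-1]
   already filled in, compute the next value for position i and store it in
   *result. Every candidate length k that gets compared is written to
   tried[], which must have room for at least i + 1 entries.
   Returns the number of candidates recorded. */
int next_with_trace(const char *pattern, const int next[], int i,
                    int tried[], int *result)
{
    assert(pattern != NULL && next != NULL && i >= 1);
    int count = 0;
    int k = next[i - 1];            /* first try to inherit the previous symmetry */
    tried[count++] = k;
    while (k != 0 && pattern[i] != pattern[k]) {
        k = next[k - 1];            /* fall back to the sub-symmetry inside the block */
        tried[count++] = k;
    }
    *result = (pattern[i] == pattern[k]) ? k + 1 : 0;
    return count;
}
```

For the table's substring AGCTAGCAGCTAGCTG and i = 14, the candidates tried are 7 and then 3, and the result is 4, which agrees with the table.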

 

From the reasoning above, we obtain the following algorithm for computing the next (prefix) array.

#include <string.h>

void SetPrefix(const char *Pattern, int prefix[])
{
    int len = (int)strlen(Pattern); // length of the Pattern string (strlen used in place of CharLen)
    if (len == 0)
        return; // nothing to fill in for an empty pattern

    prefix[0] = 0;

    for (int i = 1; i < len; i++)
    {
        int k = prefix[i - 1];

        // Keep falling back while a sub-symmetry exists. k == 0 means no sub-symmetry is left.
        // Pattern[i] != Pattern[k] means the character after the symmetric block
        // differs from the current character, so we fall back again.
        while (Pattern[i] != Pattern[k] && k != 0)
            k = prefix[k - 1]; // fall back to the sub-symmetry

        if (Pattern[i] == Pattern[k]) // found a sub-symmetry, or inherited the symmetry directly; either way add 1
            prefix[i] = k + 1;
        else
            prefix[i] = 0; // every sub-symmetry was tried, so the new character has no symmetry
    }
}
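
For completeness, here is one way the prefix array might be used in the matching step itself. This is a hedged sketch rather than the author's code: the names build_prefix and kmp_find are invented for illustration, and the search loop uses the same fallback idea as the construction above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Builds the prefix (next) array with the same fallback rule as above. */
static void build_prefix(const char *pattern, int len, int prefix[])
{
    prefix[0] = 0;
    for (int i = 1; i < len; i++) {
        int k = prefix[i - 1];
        while (k != 0 && pattern[i] != pattern[k])
            k = prefix[k - 1];
        prefix[i] = (pattern[i] == pattern[k]) ? k + 1 : 0;
    }
}

/* Hypothetical helper (name chosen for this sketch): returns the index of the
   first occurrence of pattern in text, or -1 if there is none. Note that -1
   is also returned if memory allocation fails. */
int kmp_find(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    if (m == 0)
        return 0;                   /* an empty pattern matches at the start */

    int *prefix = (int *)malloc((size_t)m * sizeof *prefix);
    if (prefix == NULL)
        return -1;
    build_prefix(pattern, m, prefix);

    int result = -1;
    int k = 0;                      /* number of pattern characters matched so far */
    for (int i = 0; i < n; i++) {
        /* On a mismatch, reuse the symmetry instead of moving back in text. */
        while (k != 0 && text[i] != pattern[k])
            k = prefix[k - 1];
        if (text[i] == pattern[k])
            k++;
        if (k == m) {               /* whole pattern matched, ending at i */
            result = i - m + 1;
            break;
        }
    }
    free(prefix);
    return result;
}
```

The index i into the text never decreases; only k falls back through the prefix array, and that is where the speedup over restarting the comparison comes from.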

Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
