Classical Algorithm Research Series, Part 6: How to Thoroughly Understand the KMP Algorithm


Authors: July, saturnma. Date: January 1, 2011

-----------------------

References: Data Structures (C edition), compiled by Li Yunqing, and Introduction to Algorithms
Note: A sequel on the KMP algorithm will be written later. This article has been kept, not deleted, to preserve the continuity of the classical algorithm research series.

Introduction:
In text editing, we often need to find every position at which a particular character or pattern occurs in a piece of text.
This is the string matching problem.
This article starts with a simple string matching algorithm and works up to the KMP algorithm, so that you can understand KMP from beginning to end.

Let's look at how Introduction to Algorithms defines the string matching problem:
Assume that the text is an array T[1..n] of length n and that the pattern is an array P[1..m] of length m <= n.
Further assume that the elements of P and T are characters drawn from a finite alphabet Σ.

 


As an example of the string matching problem, the goal is to find all occurrences of the pattern P = abaa in the text T = abcabaabcaabac.
The pattern occurs only once in the text, at shift s = 3. A shift at which the pattern occurs, such as s = 3 here, is called a valid shift.

 

I. Simple String Matching Algorithm

The simple string matching algorithm uses a loop to find all valid shifts.
The loop checks the condition P[1..m] = T[s + 1..s + m] for each of the n - m + 1 possible values of s.

NAIVE-STRING-MATCHER(T, P)
1 n ← length[T]
2 m ← length[P]
3 for s ← 0 to n - m
4     do if P[1..m] = T[s + 1..s + m]
      // For each of the n - m + 1 possible shifts s, the loop that compares corresponding characters may run up to m times.
5         then print "Pattern occurs with shift" s

(Figure: the simple string matching algorithm operating on text T = acaabc and pattern P = aab.)
In line 4 of the pseudocode above, the loop that compares corresponding characters may run up to m times for each of the n - m + 1 possible values of the shift s.
Therefore, in the worst case, the running time of the simple string matching algorithm is O((n - m + 1)m).

 

--------------------------------

Next, here is a concrete example together with a runnable program.
Suppose the target string is banananobano and the pattern to be matched is nano.

The matching process is simple: first compare the pattern with the first character of the target string.
If the characters are the same, compare the next pair. If they differ, shift the pattern one position to the right
and compare every character of the pattern again. The algorithm runs as follows.
// index denotes the position in the target at which each matching attempt starts.

#include <iostream>
#include <string>
using namespace std;

// Returns the index of the first occurrence of pattern in target, or -1 if there is none.
int match(const string& target, const string& pattern)
{
    int target_length = target.size();
    int pattern_length = pattern.size();
    int target_index = 0;
    int pattern_index = 0;
    while (target_index < target_length && pattern_index < pattern_length)
    {
        if (target[target_index] == pattern[pattern_index])
        {
            ++target_index;
            ++pattern_index;
        }
        else
        {
            // Mismatch: roll target_index back to one past the start of this attempt.
            target_index -= (pattern_index - 1);
            pattern_index = 0;
        }
    }
    if (pattern_index == pattern_length)
    {
        return target_index - pattern_length;
    }
    else
    {
        return -1;
    }
}

int main()
{
    cout << match("banananobano", "nano") << endl;
    return 0;
}

// The running result is 4.

 

The complexity of the algorithm above is O(pattern_length * target_length).
Where is the time mainly wasted?
Look at index = 2: three characters have been matched, but the 4th does not match. At this point the matched character sequence is nan.

If we now shift the pattern one position to the right, the part of nan that is still aligned with the pattern is an, which cannot match because the pattern starts with n.
Shifting one more position to the right leaves n, which does match the start of the pattern.

If we know the properties of the pattern in advance, we do not need to roll back target_index every time a matching attempt fails.
This rollback wastes a lot of time. If we compute the properties of the pattern in advance,
then on a mismatch the pattern can be moved directly to the next position at which a match is possible,
skipping the alignments that cannot match.
In the example above, after the mismatch at index = 2 the pattern can be moved directly to index = 4.
The KMP algorithm starts from this idea.

 

II. KMP Algorithm

1. The overlay function (overlay_function)

The overlay function describes a property of the pattern: for every prefix of the pattern (every contiguous substring that starts at its left end), it records how much that prefix overlaps with itself.
For example, consider the following string: abaabcaba

Because counting starts from 0, an overlay value of 0 means that one character matches. Whether to count from 0 or from 1 is a matter of preference;

adjust it as you see fit. A value of -1 means that there is no overlap. What is an overlap? Here is a mathematical definition. For the sequence

$a_0 a_1 \dots a_{j-1} a_j$

find a k (with k < j) such that

$a_0 a_1 \dots a_{k-1} a_k = a_{j-k} a_{j-k+1} \dots a_{j-1} a_j$

and such that no larger k satisfies this condition. In other words, we look for the largest possible k for which the prefix of the pattern ending at index k equals the suffix ending at index j.
k must be as large as possible because, if a larger k exists and we choose a smaller one that also satisfies the condition,
then on a mismatch the pattern is shifted further to the right than it should be, an alignment at which a match is still possible is skipped, and we may miss a valid match.

For example, in the following sequence (figure omitted),

a mismatch occurs at the part marked in red. The correct choice is k = 1, which shifts the pattern four positions to the right. Choosing k = 0 would shift the pattern five positions and produce a wrong result.
The overlay function can be computed recursively. Suppose the overlay value of the prefix of pattern ending at index j is k, that is,

$a_0 a_1 \dots a_{k-1} a_k = a_{j-k} a_{j-k+1} \dots a_{j-1} a_j$

Then for the prefix of pattern ending at index j + 1 there are two cases:
(1) pattern[k + 1] = pattern[j + 1]: then overlay(j + 1) = k + 1 = overlay(j) + 1.
(2) pattern[k + 1] ≠ pattern[j + 1]: then the overlay can only be found within the first k + 1 characters of pattern. Let h = overlay(k). If pattern[h + 1] = pattern[j + 1], then overlay(j + 1) = h + 1; otherwise repeat step (2) with h in place of k. If h has reached -1 and pattern[0] ≠ pattern[j + 1], then overlay(j + 1) = -1.

 

The following code computes the overlay function:

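#include <iostream>
#include <string>
using namespace std;

// Computes the overlay function of pattern and prints one value per line.
void compute_overlay(const string& pattern)
{
    const int pattern_length = pattern.size();
    int* overlay_function = new int[pattern_length];
    int index;
    overlay_function[0] = -1;
    for (int i = 1; i < pattern_length; ++i)
    {
        index = overlay_function[i - 1];
        // Store previous fail position k to index;

        while (index >= 0 && pattern[i] != pattern[index + 1])
        {
            index = overlay_function[index];
        }
        if (pattern[i] == pattern[index + 1])
        {
            overlay_function[i] = index + 1;
        }
        else
        {
            overlay_function[i] = -1;
        }
    }
    // The source listing is cut off at this point. Everything below is an
    // assumed completion: print the values, free the array, run the example.
    for (int i = 0; i < pattern_length; ++i)
    {
        cout << overlay_function[i] << endl;
    }
    delete[] overlay_function;
}

int main()
{
    compute_overlay("abaabcaba"); // example string from the text above
    return 0;
}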
