Classical Algorithm Research Series: 6. How to Thoroughly Understand the KMP Algorithm


6. A Thorough Understanding of the KMP Algorithm, from Start to Finish

 

Authors: July, Saturnma. Date: January 1, 2011

-----------------------

References: Data Structure (C Language Edition), edited by Li Yunqing, and Introduction to Algorithms.
Author's statement: July holds the copyright to this series of 24 classic algorithms. For more information, see the source.

Introduction:
In text editing, we often need to find where a specific string or pattern occurs in a piece of text.
This gives rise to the string matching problem.
This article starts with the naive string matching algorithm, passes through the Rabin-Karp algorithm, and finally arrives at the KMP algorithm, with the aim of giving you a thorough understanding of KMP from start to finish.

Let's look at how Introduction to Algorithms defines the string matching problem:
Assume that the text is an array T[1..n] of length n, and that the pattern is an array P[1..m] of length m <= n.
It is further assumed that the elements of P and T are characters drawn from a finite alphabet Σ.

To illustrate the problem: the goal is to find all occurrences of the pattern P = abaa in the text T = abcabaabcaabac.
The pattern occurs only once in the text, at shift s = 3. The shift s = 3 is a valid shift.

 

I. Naive String Matching Algorithm

The naive string matching algorithm uses a loop to find all valid shifts.
The loop checks the condition P[1..m] = T[s+1..s+m] for each of the n-m+1 possible values of s.

NAIVE-STRING-MATCHER(T, P)
1 n ← length[T]
2 m ← length[P]
3 for s ← 0 to n - m
4     do if P[1..m] = T[s+1..s+m]
      // For each of the n-m+1 possible shifts s, the implicit loop comparing corresponding characters may run up to m times.
5         then print "Pattern occurs with shift" s

[Figure: operation of the naive string matcher for the text T = acaabc and the pattern P = aab.]
In line 4 of the pseudocode above, for each of the n-m+1 possible values of the shift s, the implicit loop that compares corresponding characters may execute up to m times.
Therefore, in the worst case, the running time of the naive string matching algorithm is O((n-m+1)m).
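
The pseudocode above translates almost directly into C++. The following is a minimal sketch rather than code from the referenced textbook: the function name naive_all_shifts is hypothetical, it uses 0-based indexing (a returned shift s means pattern[0..m-1] equals text[s..s+m-1]), and it assumes that an empty pattern should report no shifts.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original article): returns every valid
// shift s, 0-based, at which pattern occurs in text.
std::vector<int> naive_all_shifts(const std::string& text, const std::string& pattern)
{
    std::vector<int> shifts;
    const std::size_t n = text.size();
    const std::size_t m = pattern.size();
    if (m == 0 || m > n)
    {
        return shifts;  // assumption: an empty pattern reports no shifts
    }
    for (std::size_t s = 0; s + m <= n; ++s)  // the n-m+1 candidate shifts
    {
        std::size_t k = 0;
        while (k < m && text[s + k] == pattern[k])  // at most m comparisons
        {
            ++k;
        }
        if (k == m)
        {
            shifts.push_back(static_cast<int>(s));
        }
    }
    return shifts;
}
```

Called with the text abcabaabcaabac and the pattern abaa from the introduction, this returns the single shift 3. The two nested loops make the O((n-m+1)m) worst case visible: the outer loop runs n-m+1 times and the inner loop up to m times.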

 

--------------------------------

Next, here is a concrete example together with a runnable program.
Suppose the target string is banananobano and the pattern to match is nano.

The matching process is very simple. First compare the pattern with the first character of the target string.
If they are the same, compare the next character. If they differ, shift the pattern one position to the right
and compare each character of the pattern again. The algorithm runs as follows
(index denotes the position in the target string at which each matching attempt starts).

#include <iostream>
#include <string>
using namespace std;

// Returns the index of the first occurrence of pattern in target, or -1.
int match(const string& target, const string& pattern)
{
    int target_length = target.size();
    int pattern_length = pattern.size();
    int target_index = 0;
    int pattern_index = 0;
    while (target_index < target_length && pattern_index < pattern_length)
    {
        if (target[target_index] == pattern[pattern_index])
        {
            ++target_index;
            ++pattern_index;
        }
        else
        {
            // Mismatch: roll target_index back to one past the start of this attempt.
            target_index -= (pattern_index - 1);
            pattern_index = 0;
        }
    }
    if (pattern_index == pattern_length)
    {
        return target_index - pattern_length;
    }
    else
    {
        return -1;
    }
}

int main()
{
    cout << match("banananobano", "nano") << endl;
    return 0;
}

// The running result is 4.

 

The complexity of the preceding algorithm is O(pattern_length * target_length).
Where is the time mainly wasted?
Look at index = 2: we have matched three characters, but the fourth does not match. At this point the matched character sequence is nan.

If we now shift the pattern one position to the right, what remains of the matched sequence nan is an, which cannot match the start of the pattern.
Shift one more position to the right and what remains of nan is n, which can match.

If we know the properties of the pattern in advance, we do not need to roll target_index back every time a match fails.
This rollback wastes a lot of time. If we precompute the properties of the pattern,
then on a mismatch the pattern can be moved directly to the next position where a match is possible,
skipping the positions that cannot match.
In the example above, after the mismatch at index = 2 the pattern can be moved directly to index = 4.
The KMP algorithm starts from this idea.
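
The observation about nan can be checked mechanically. The sketch below is only an illustration of that reasoning, not part of KMP itself: the name smallest_viable_shift is hypothetical, and it assumes matched is at most the length of the pattern. It tries each shift by brute force and keeps the first one at which the already-matched text still lines up with the start of the pattern.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: after the first `matched` characters of pattern have
// matched and the next one failed, return the smallest shift d >= 1 at which
// the already-matched text can still line up with the start of the pattern.
int smallest_viable_shift(const std::string& pattern, std::size_t matched)
{
    for (std::size_t d = 1; d < matched; ++d)
    {
        // The matched text equals pattern[0..matched-1]. After shifting by d,
        // its tail pattern[d..matched-1] must equal the pattern's head of the
        // same length.
        if (pattern.compare(0, matched - d, pattern, d, matched - d) == 0)
        {
            return static_cast<int>(d);
        }
    }
    // No partial overlap survives: skip past everything matched so far.
    return static_cast<int>(matched == 0 ? 1 : matched);
}
```

For the pattern nano with three matched characters, the function returns 2, which is the jump from index = 2 to index = 4 described above. It depends only on the pattern, never on the target, which is why the answer can be precomputed.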

 

II. KMP Algorithm

1. Overlay Function (overlay_function)

The overlay function captures the properties of the pattern: it describes the self-overlap of every prefix of the pattern, that is, of every contiguous substring starting at its left end.
For example, consider the following string: abaabcaba

Because counting starts from 0, an overlay value of 0 means that one character overlaps. Whether to count from 0 or from 1 is a matter of preference,
so adjust it as you like. A value of -1 indicates that there is no overlap. What is an overlap? Here is a mathematical definition. For the sequence

a_0 a_1 ... a_{j-1} a_j

find k such that

a_0 a_1 ... a_{k-1} a_k = a_{j-k} a_{j-k+1} ... a_{j-1} a_j

and such that no larger k satisfies this condition. In other words, we want the largest possible k (with k < j) for which the first k+1 characters of the sequence match its last k+1 characters.
The reason k must be as large as possible is this: if a larger k exists and we pick a smaller k that also satisfies the condition,
then on a mismatch we shift the pattern farther to the right than we should, keeping fewer of the already-matched positions, and we may miss a possible match.
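
The definition can be turned into code literally, which is useful for checking a faster implementation. This is a minimal sketch under the convention used here (0-based indices, -1 for no overlap). The name overlay_by_definition is hypothetical, and the routine is deliberately slow because it tries candidates from the largest k downward.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: computes the overlay value of every prefix straight
// from the definition, trying the largest candidate k first.
std::vector<int> overlay_by_definition(const std::string& pattern)
{
    const int length = static_cast<int>(pattern.size());
    std::vector<int> overlay(length, -1);  // -1 means "no overlap"
    for (int j = 0; j < length; ++j)
    {
        for (int k = j - 1; k >= 0; --k)  // k < j excludes the trivial overlap
        {
            // Compare a_0..a_k with a_{j-k}..a_j (k + 1 characters each).
            if (pattern.compare(0, k + 1, pattern, j - k, k + 1) == 0)
            {
                overlay[j] = k;
                break;  // the first hit is the largest k
            }
        }
    }
    return overlay;
}
```

For the string abaabcaba this yields -1 -1 0 0 1 -1 0 1 2. Stopping at the first hit is what enforces the "largest k" requirement.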

For example, consider the following sequence.

Suppose the mismatch occurs at the part marked in red. The correct result is k = 1, so the pattern is shifted right by four positions. If k = 0 were chosen instead, the pattern would be shifted right by five positions, which is an error.
The overlay function can be computed recursively. Suppose the overlay value for the prefix of the pattern ending at index j is k, that is, overlay(j) = k and

a_0 a_1 ... a_{k-1} a_k = a_{j-k} a_{j-k+1} ... a_{j-1} a_j

Then for the prefix ending at index j+1 there are two cases:
(1) pattern[k+1] == pattern[j+1]. Then overlay(j+1) = k + 1 = overlay(j) + 1.
(2) pattern[k+1] != pattern[j+1]. Then the overlap can only be found within the first k+1 characters of the pattern. Let h = overlay(k). If pattern[h+1] == pattern[j+1], then overlay(j+1) = h + 1. Otherwise repeat step (2) with h in place of k. If the candidates run out (the value reaches -1 and pattern[0] != pattern[j+1]), then overlay(j+1) = -1.

 

The following code computes the overlay function:

#include <iostream>
#include <string>
using namespace std;

// Computes and prints the overlay value of every prefix of pattern.
void compute_overlay(const string& pattern)
{
    const int pattern_length = pattern.size();
    if (pattern_length == 0)
    {
        return;  // nothing to compute for an empty pattern
    }
    int* overlay_function = new int[pattern_length];
    int index;
    overlay_function[0] = -1;
    for (int i = 1; i < pattern_length; ++i)
    {
        // Start from the overlay value k of the previous prefix.
        index = overlay_function[i - 1];

        // Fall back to shorter overlaps until one can be extended by pattern[i].
        while (index >= 0 && pattern[i] != pattern[index + 1])
        {
            index = overlay_function[index];
        }
        if (pattern[i] == pattern[index + 1])
        {
            overlay_function[i] = index + 1;
        }
        else
        {
            overlay_function[i] = -1;
        }
    }
    for (int i = 0; i < pattern_length; ++i)
    {
        cout << overlay_function[i] << endl;
    }
    delete[] overlay_function;
}

int main()
{
    string pattern = "abaabcaba";
    compute_overlay(pattern);
    return 0;
}

The running result is:

-1
-1
0
0
1
-1
0
1
2

-------------------------------------

 

2. KMP Algorithm
With the overlay function, implementing the KMP algorithm is straightforward. We still match from left to right, but when a mismatch occurs we do not move target_index back. The part of the target that has already matched is mirrored in the pattern itself, so it is enough to change pattern_index.

When a mismatch occurs after pattern[0..j] has matched, you only need to shift the pattern right by j - overlay(j) positions.

If pattern_index == 0 at the mismatch, the first character of the pattern did not match.
In this case, increment target_index by 1, which moves one position to the right.
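
The shift rule can be tried out on its own. The sketch below is an illustration under the same convention (-1 for no overlap). The helper names overlay_table and shift_on_mismatch are hypothetical, and shift_on_mismatch assumes 0 < pattern_index <= pattern.size().

```cpp
#include <string>
#include <vector>

// Overlay values with the convention used here (-1 means "no overlap").
std::vector<int> overlay_table(const std::string& pattern)
{
    std::vector<int> overlay(pattern.size(), -1);
    for (int i = 1; i < static_cast<int>(pattern.size()); ++i)
    {
        int index = overlay[i - 1];
        while (index >= 0 && pattern[i] != pattern[index + 1])
        {
            index = overlay[index];
        }
        if (pattern[i] == pattern[index + 1])
        {
            overlay[i] = index + 1;
        }
    }
    return overlay;
}

// Hypothetical helper: for a mismatch at pattern_index > 0, returns how far
// the pattern slides right, i.e. j - overlay(j) with j = pattern_index - 1.
int shift_on_mismatch(const std::string& pattern, int pattern_index)
{
    const std::vector<int> overlay = overlay_table(pattern);
    const int j = pattern_index - 1;  // index of the last matched character
    return j - overlay[j];
}
```

For the pattern nano with a mismatch at pattern_index = 3, j = 2 and overlay(2) = 0, so the pattern slides right by 2. Matching then resumes at pattern_index = overlay(j) + 1 = 1 without touching target_index.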

 

[Figure: execution trace of the KMP algorithm, with the steps performed by KMP marked in red.]

 

The KMP algorithm completes the match for any pair of strings in O(n + m) time, where n is the length of the target and m is the length of the pattern.

Finally, here is a C++ implementation of the KMP algorithm:

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Returns the index of the first occurrence of pattern in target, or -1.
int kmp_find(const string& target, const string& pattern)
{
    const int target_length = target.size();
    const int pattern_length = pattern.size();
    if (pattern_length == 0)
    {
        return 0;  // treat an empty pattern as matching at position 0
    }
    // A vector releases its memory on every return path.
    vector<int> overlay_value(pattern_length);
    overlay_value[0] = -1;
    int index = 0;
    for (int i = 1; i < pattern_length; ++i)
    {
        index = overlay_value[i - 1];
        while (index >= 0 && pattern[index + 1] != pattern[i])
        {
            index = overlay_value[index];
        }
        if (pattern[index + 1] == pattern[i])
        {
            overlay_value[i] = index + 1;
        }
        else
        {
            overlay_value[i] = -1;
        }
    }
    // Match algorithm start
    int pattern_index = 0;
    int target_index = 0;
    while (pattern_index < pattern_length && target_index < target_length)
    {
        if (target[target_index] == pattern[pattern_index])
        {
            ++target_index;
            ++pattern_index;
        }
        else if (pattern_index == 0)
        {
            ++target_index;
        }
        else
        {
            // Keep target_index; resume after the longest overlap.
            pattern_index = overlay_value[pattern_index - 1] + 1;
        }
    }
    if (pattern_index == pattern_length)
    {
        return target_index - pattern_index;
    }
    else
    {
        return -1;
    }
}

int main()
{
    string source = "annbcdanacadsannannabnna";
    string pattern = "annacanna";
    cout << kmp_find(source, pattern) << endl;
    return 0;
}
// The running result is -1.
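
The function above stops at the first match. A natural variation, sketched here as an assumption rather than as part of the original program, reports every match position. The name kmp_find_all is hypothetical, and an empty pattern is assumed to report no matches. After a full match, the search resumes from the overlay value of the whole pattern, so overlapping occurrences are also found.

```cpp
#include <string>
#include <vector>

// Hypothetical variant of kmp_find that reports every match position.
std::vector<int> kmp_find_all(const std::string& target, const std::string& pattern)
{
    std::vector<int> positions;
    const int target_length = static_cast<int>(target.size());
    const int pattern_length = static_cast<int>(pattern.size());
    if (pattern_length == 0)
    {
        return positions;  // assumption: an empty pattern reports no matches
    }

    // Overlay table, same convention as above: -1 means "no overlap".
    std::vector<int> overlay(pattern_length, -1);
    for (int i = 1; i < pattern_length; ++i)
    {
        int index = overlay[i - 1];
        while (index >= 0 && pattern[index + 1] != pattern[i])
        {
            index = overlay[index];
        }
        if (pattern[index + 1] == pattern[i])
        {
            overlay[i] = index + 1;
        }
    }

    int pattern_index = 0;
    int target_index = 0;
    while (target_index < target_length)
    {
        if (target[target_index] == pattern[pattern_index])
        {
            ++target_index;
            ++pattern_index;
            if (pattern_index == pattern_length)
            {
                positions.push_back(target_index - pattern_length);
                // Keep the longest overlap so overlapping matches are found.
                pattern_index = overlay[pattern_length - 1] + 1;
            }
        }
        else if (pattern_index == 0)
        {
            ++target_index;
        }
        else
        {
            pattern_index = overlay[pattern_index - 1] + 1;
        }
    }
    return positions;
}
```

For the target aaaa and the pattern aa this returns the positions 0, 1 and 2. For banananobano and nano it returns the single position 4.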

 

III. Origins of the KMP Algorithm
How did something as ingenious as KMP come about, and why did it take three people to devise it? In fact, even without the KMP algorithm, people could still arrive at an equally efficient algorithm for string matching. That algorithm is ultimately equivalent to KMP, but it does not start from the overlay function, nor directly from the internal principle of matching, and the function it computes is complex and hard to understand. However, once that function is found, matching against the same pattern is as efficient as with KMP. Strictly speaking, the function found by that algorithm should not be called an overlay function, because overlap is never considered while computing it.

After all this talk, what is this method? It is the well-known deterministic finite automaton (DFA). The grammars that a DFA can recognize are type-3 grammars, also called regular grammars. Since a DFA can recognize regular languages, recognizing a fixed string is certainly no problem (a fixed string is a special case of a regular expression). There are complete algorithms for constructing a DFA, which are not described here. Using a DFA to recognize a fixed string is rare in practice, however: a DFA can recognize far more general regular expressions, and using the general DFA construction just to recognize a fixed string costs too much.

The value of the KMP algorithm is that, starting from the characteristics of string matching, it cleverly uses the overlay function, which captures the properties of the pattern, to quickly build what is in effect the DFA for recognizing the string. As a result, high-school mathematics is enough to understand an algorithm like KMP. Designing it from scratch, however, requires considerable mathematical depth.
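
To make the comparison concrete, here is a minimal sketch of an automaton built for one fixed pattern. It uses the common textbook construction for single-pattern matching automata, not the general regular-expression-to-DFA construction mentioned above. The name dfa_find is hypothetical, the alphabet is assumed to be the 256 possible byte values, and an empty pattern is assumed to match at position 0.

```cpp
#include <array>
#include <string>
#include <vector>

// Sketch: build the matching DFA for one fixed pattern and run it over the
// target. Returns the index of the first match, or -1.
int dfa_find(const std::string& target, const std::string& pattern)
{
    const int m = static_cast<int>(pattern.size());
    if (m == 0)
    {
        return 0;  // assumption: an empty pattern matches at position 0
    }

    // dfa[state][c] = next state after reading byte c in `state`;
    // state j means "the last j bytes read equal pattern[0..j-1]".
    // The vector constructor zero-initializes every transition.
    std::vector<std::array<int, 256>> dfa(m);
    dfa[0][static_cast<unsigned char>(pattern[0])] = 1;
    int restart = 0;  // state reached after feeding pattern[1..j-1]
    for (int j = 1; j < m; ++j)
    {
        const unsigned char c = static_cast<unsigned char>(pattern[j]);
        dfa[j] = dfa[restart];      // mismatch transitions: copy the restart state
        dfa[j][c] = j + 1;          // match transition
        restart = dfa[restart][c];  // advance the restart state
    }

    int state = 0;
    for (int i = 0; i < static_cast<int>(target.size()); ++i)
    {
        state = dfa[state][static_cast<unsigned char>(target[i])];
        if (state == m)
        {
            return i - m + 1;
        }
    }
    return -1;
}
```

The scan reads each target character exactly once and never backs up, just like KMP. The price is a table of m × 256 transitions, whereas the overlay function stores only m values.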
 

OK.
