KMP comparison algorithm for two strings
Assume there are two strings x and y with len(x) > len(y), and we want to find y in x.
The simplest method is to align the two strings and compare the characters at corresponding positions one by one. If the comparison reaches the last position of y, the match succeeds. If a comparison fails, y is shifted one position to the right and matching restarts from the beginning of y.
Let string x be dababeabafdababcg and string y be ababc.
The comparison proceeds as follows (| marks the current comparison position in x):
The first character does not match.
    |
    dababeabafdababcg
    ababc
Shift y one position to the right, and move the comparison position back to the start of y.
     |
    dababeabafdababcg
     ababc
Four characters match in a row, then the next one does not.
         |
    dababeabafdababcg
     ababc
Shift y one position to the right, and move the comparison position back to the start of y.
      |
    dababeabafdababcg
      ababc
Repeat the process ...
                    |
    dababeabafdababcg
               ababc
y is matched completely in x, and the search ends.
Undoubtedly, this method is too crude. Its time complexity is as high as O(mn), where m and n are the lengths of the two strings. The biggest problem is that a large number of comparisons are repeated.
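The simple method can be sketched in Go as follows. This is only a minimal illustration: the function name naiveSearch is an assumption made for this sketch, and the input strings are the ones used in the diagrams.

```go
package main

import "fmt"

// naiveSearch (hypothetical name) returns the index of the first occurrence
// of y in x, or -1 if there is none. After every mismatch it shifts y one
// position to the right and restarts the comparison from the start of y.
func naiveSearch(x, y string) int {
	for start := 0; start+len(y) <= len(x); start++ {
		j := 0
		for j < len(y) && x[start+j] == y[j] {
			j++
		}
		if j == len(y) {
			return start
		}
	}
	return -1
}

func main() {
	fmt.Println(naiveSearch("dababeabafdababcg", "ababc")) // prints 11
}
```

The inner loop starts from j = 0 for every alignment, which is exactly where the repeated comparisons come from.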
The KMP algorithm was proposed to solve this problem. Its central idea is that the part already matched does not need to be compared again. Instead, string y should be moved several positions at once to speed up the matching.
How can this be done? KMP gives the following answer:
The matched part is known and depends only on string y, so string y can be moved directly to the next matching position.
What is the next matching position?
Assume that the part of string x already matched against string y is abcab.
Obviously, shifting y by one or by two positions cannot give a match:
    abcab...
     abcab...

    abcab...
      abcab...
Only the following shift, by three positions, can:
    abcab...
       abcab...
Obviously, the new matching sequence (ab) is a proper suffix of the old matching sequence and a prefix of string y, while the old matching sequence is itself a prefix of string y.
That is to say, based on the part already matched, we can rule out a large number of positions. The KMP algorithm is built on this principle.
Rule 1
If the two characters at the current comparison position are the same, the comparison position is shifted one place to the right.
Rule 2
If the two characters at the current comparison position are different, then:
If no matching sequence exists yet, the comparison position is shifted one place to the right, and string y is shifted one place to the right as well.
If a matching sequence already exists, string y is moved to the next matching position while the comparison position stays where it is.
For example:
    |
    dababeabafdababcg
    ababc
The characters do not match. The comparison position and string y are both shifted one place to the right.
     |
    dababeabafdababcg
     ababc
The characters match. The comparison position is shifted one place to the right.
      |
    dababeabafdababcg
     ababc
Four characters match in a row, then the next one does not.
         |
    dababeabafdababcg
     ababc
Character mismatch. Move string y to the next matching position.
         |
    dababeabafdababcg
       ababc
Still no match. Continue to move string y.
         |
    dababeabafdababcg
         ababc
Still no match, and no matching sequence exists. The comparison position and string y are both shifted one place to the right.
          |
    dababeabafdababcg
          ababc
The characters match. The comparison position is shifted one place to the right.
           |
    dababeabafdababcg
          ababc
Three characters match in a row, then the next one does not.
             |
    dababeabafdababcg
          ababc
Character mismatch. Move string y to the next matching position.
             |
    dababeabafdababcg
            ababc
Still no match. Continue to move string y.
             |
    dababeabafdababcg
             ababc
The characters do not match, and no matching sequence exists. The comparison position and string y are both shifted one place to the right.
              |
    dababeabafdababcg
              ababc
The characters do not match, and no matching sequence exists. The comparison position and string y are both shifted one place to the right.
               |
    dababeabafdababcg
               ababc
The characters match. The comparison position is shifted one place to the right.
                |
    dababeabafdababcg
               ababc
Five characters match in a row and the end of string y is reached. The match is successful.
                    |
    dababeabafdababcg
               ababc
As you can see, in the KMP algorithm the comparison position never moves backwards, and the same position is compared far fewer times than in the simple algorithm, so the comparison is quite fast.
The key to the KMP algorithm is computing the next matching sequence from the existing matching sequence.
This can be done using string y alone, because the next matching sequence depends on nothing outside the existing matching sequence.
In fact, an existing matching sequence may have several next-level matching sequences that meet the requirements (that is, each is both a proper suffix of the existing matching sequence and a prefix of string y). Without doubt we should select the longest one, since it corresponds to the smallest shift and therefore skips no possible match.
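To make the selection concrete, the sketch below lists by brute force every candidate next-level matching sequence for a given matched part, longest first. The function name candidates and the sample strings ababa and ababac are assumptions chosen for illustration, they are not part of the original example.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates (hypothetical helper) returns, longest first, every non-empty
// proper suffix of matched that is also a prefix of y. It assumes that
// matched itself is a prefix of y.
func candidates(matched, y string) []string {
	var out []string
	for k := len(matched) - 1; k >= 1; k-- {
		suffix := matched[len(matched)-k:]
		if strings.HasPrefix(y, suffix) {
			out = append(out, suffix)
		}
	}
	return out
}

func main() {
	fmt.Println(candidates("ababa", "ababac")) // prints [aba a]
	fmt.Println(candidates("abab", "ababc"))   // prints [ab]
}
```

The first element of the result is the one KMP uses. The shorter candidates are reached later, by applying the same step again to the shorter sequence.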
This key step of the KMP algorithm, computing the next matching sequence from the existing one, can be worked out in advance, because we know that the next matching sequence must be a prefix of string y. In fact, we only need to record the length of the next matching sequence. This is the true face of the next array. (The next array values introduced elsewhere may differ from those described here, but they are really just a mathematical transformation of the next matching length.)
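As one illustration of such a transformation, the sketch below computes the widely used "prefix function" and shifts it by one index, which gives an array indexed the way this article describes (entry i refers to the first i characters of the string). The names prefixFunction and shifted are assumptions for this sketch, and other texts may use yet another convention.

```go
package main

import "fmt"

// prefixFunction computes pi, where pi[i] is the length of the longest
// proper prefix of s[:i+1] that is also a suffix of s[:i+1].
func prefixFunction(s string) []int {
	pi := make([]int, len(s))
	for i := 1; i < len(s); i++ {
		j := pi[i-1]
		for j > 0 && s[i] != s[j] {
			j = pi[j-1]
		}
		if s[i] == s[j] {
			j++
		}
		pi[i] = j
	}
	return pi
}

// shifted converts pi to the indexing used in this article:
// n[0] = 0 and n[i] = pi[i-1] for i >= 1.
func shifted(pi []int) []int {
	n := make([]int, len(pi))
	for i := 1; i < len(pi); i++ {
		n[i] = pi[i-1]
	}
	return n
}

func main() {
	pi := prefixFunction("ababc")
	fmt.Println(pi)          // prints [0 0 1 2 0]
	fmt.Println(shifted(pi)) // prints [0 0 0 1 2]
}
```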
There are many ways to compute the next array, for example the simple brute-force method. In fact, there is a neat way to compute it easily. It is similar to the KMP comparison process itself, and is also similar in spirit to Ukkonen's algorithm for suffix tree construction.
The Go code is as follows:

    // next builds the next array of s. Member n[i] is the length of the
    // longest sequence that is both a proper suffix of s[:i] and a prefix of s.
    // A proper suffix here is a non-empty suffix shorter than the string itself.
    // If no such sequence exists, the value is 0.
    func next(s string) []int {
        L := len(s)
        n := make([]int, L)
        // Start from n[2]; n[0] and n[1] are always 0.
        for i, j := 2, 0; i < L; {
            // All characters before this position are known to match.
            if s[i-1] == s[j] {
                j++
                n[i] = j
                i++
                continue
            }
            if j == 0 {
                // The allocated n is zero-initialized, so n[i] is already 0.
                i++
                continue
            }
            // Fall back; this works like a suffix pointer.
            j = n[j]
        }
        return n
    }
Each loop iteration compares only one character! This is because the earlier comparisons have already ensured that the preceding part matches. If the comparison succeeds, we only need to extend the length of the next match by one. If it fails, we do not need to search from the beginning for the start of the next match, because the entries computed so far tell us the next matching position.
As a result, both building the next array and the comparison itself are very fast in the KMP algorithm!
The complete code of the KMP algorithm is as follows:

    // KMP string search: returns the index of the first occurrence of r in s,
    // or -1 if r does not occur in s.
    func KMP(s, r string) int {
        L := len(r)
        // N[i] is the length of the longest sequence that is both a proper
        // suffix of r[:i] and a prefix of r. A proper suffix here is a
        // non-empty suffix shorter than the string itself. If no such
        // sequence exists, the value is 0.
        N := func(s string) []int {
            n := make([]int, L)
            // Start from n[2]; n[0] and n[1] are always 0.
            for i, j := 2, 0; i < L; {
                // All characters before this position are known to match.
                if s[i-1] == s[j] {
                    j++
                    n[i] = j
                    i++
                    continue
                }
                if j == 0 {
                    // The allocated n is zero-initialized, so n[i] is already 0.
                    i++
                    continue
                }
                // Fall back; this works like a suffix pointer.
                j = n[j]
            }
            return n
        }(r)
        // Search. Stop when r is fully matched or no longer fits in the rest of s.
        i, j := 0, 0
        for i+L <= j+len(s) && j < L {
            if s[i] == r[j] {
                i, j = i+1, j+1
                continue
            }
            if j == 0 {
                i++
            } else {
                j = N[j]
            }
        }
        if j == L {
            return i - L
        }
        return -1
    }
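To compare the two approaches on the example strings, the sketch below counts character comparisons for the simple method and for KMP. It re-implements both loops with a counter; the names nextTable, countNaive and countKMP are assumptions made for this sketch, and it is meant as a rough check rather than a benchmark.

```go
package main

import "fmt"

// nextTable mirrors the next array described above: n[i] is the length of
// the longest proper suffix of r[:i] that is also a prefix of r.
func nextTable(r string) []int {
	n := make([]int, len(r))
	for i, j := 2, 0; i < len(r); {
		if r[i-1] == r[j] {
			j++
			n[i] = j
			i++
		} else if j == 0 {
			i++
		} else {
			j = n[j]
		}
	}
	return n
}

// countNaive returns the match index (or -1) and the number of character
// comparisons made by the simple shift-by-one method.
func countNaive(s, r string) (int, int) {
	cmp := 0
	for start := 0; start+len(r) <= len(s); start++ {
		j := 0
		for j < len(r) {
			cmp++
			if s[start+j] != r[j] {
				break
			}
			j++
		}
		if j == len(r) {
			return start, cmp
		}
	}
	return -1, cmp
}

// countKMP returns the match index (or -1) and the number of character
// comparisons made by the KMP search loop.
func countKMP(s, r string) (int, int) {
	n := nextTable(r)
	cmp := 0
	i, j := 0, 0
	for j < len(r) && i-j+len(r) <= len(s) {
		cmp++
		if s[i] == r[j] {
			i, j = i+1, j+1
		} else if j == 0 {
			i++
		} else {
			j = n[j]
		}
	}
	if j == len(r) {
		return i - len(r), cmp
	}
	return -1, cmp
}

func main() {
	s, r := "dababeabafdababcg", "ababc"
	idx, c := countNaive(s, r)
	fmt.Println("naive:", idx, c)
	idx, c = countKMP(s, r)
	fmt.Println("kmp:  ", idx, c)
}
```

Both functions should report the same index. The KMP counter can never exceed twice the length of s, because every comparison either advances the comparison position or moves y to the right.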
The KMP algorithm is a very delicate algorithm, yet its code is very simple. Although it may be slower than the BM (Boyer-Moore) algorithm, it is indeed much more elegant.