KMP comparison algorithm for two strings

Last Update:2014-09-23 Source: Internet

Author: User

Assume there are two strings X and Y to be matched, with len(X) > len(Y), and we want to find where Y occurs in X.

We know that the simplest method is to align the two strings and then compare the characters at corresponding positions in sequence. If the comparison reaches the last position of Y, the match is successful. If a character fails to match, Y is shifted one position to the right and the comparison restarts from the beginning of Y.

Let string X be dababeabafdababcg and string Y be ababc.

The comparison proceeds as follows (the | marks the current comparison position):

The first character does not match.

    |
    dababeabafdababcg
    ababc

Shift Y one position to the right, and move the comparison position back to the start of Y.

     |
    dababeabafdababcg
     ababc

Four characters in a row match, then the next one does not.

         |
    dababeabafdababcg
     ababc

Shift Y one position to the right, and move the comparison position back to the start of Y.

      |
    dababeabafdababcg
      ababc

Repeat the process until:

                   |
    dababeabafdababcg
               ababc

Y completely matches X and the search ends.

Undoubtedly, this method is naive. Its time complexity is as high as O(mn), where m and n are the lengths of the two strings. The biggest problem is that a large number of repeated comparisons are performed.
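The simple method described above can be sketched in Go as follows. This is only an illustration under stated assumptions: the function name naiveSearch and the comparison counter are not part of the original article and are added here to make the repeated work visible.

```go
package main

import "fmt"

// naiveSearch (hypothetical name) returns the first index at which y occurs
// in x, or -1, together with the number of character comparisons performed.
func naiveSearch(x, y string) (pos, comparisons int) {
	// Try every alignment of y against x, from left to right.
	for start := 0; start+len(y) <= len(x); start++ {
		j := 0
		for j < len(y) {
			comparisons++
			if x[start+j] != y[j] {
				break // mismatch: give up on this alignment
			}
			j++
		}
		if j == len(y) {
			return start, comparisons
		}
	}
	return -1, comparisons
}

func main() {
	pos, n := naiveSearch("dababeabafdababcg", "ababc")
	fmt.Println("position:", pos, "comparisons:", n)
}
```

After every mismatch the inner loop starts again from the beginning of y, which is exactly the repeated work that KMP avoids.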

The KMP algorithm was proposed to solve this problem. Its central idea is that the part already matched does not need to be compared again. Instead, string Y should be moved several positions at once to speed up the matching.

How can this be done? KMP provides the following answer:

The matched part is known and depends only on string Y, so string Y can be moved directly to the next matching position.

What is the next matching position?

Assume that the part of string X and string Y matched so far is abcab. Obviously, shifting Y by one or two positions cannot work:

    abcab...
     abcab...

    abcab...
      abcab...

Neither of these lines up. Only the following does:

    abcab...
       abcab...

Obviously, the new matching sequence is a proper suffix of the old matching sequence and a prefix of string Y, while the old matching sequence is itself a prefix of string Y.

That is to say, based on the part already matched, we can exclude a large number of positions. The KMP algorithm is built on this principle.
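To make the idea of a "next matching position" concrete, the following sketch lists every non-empty proper suffix of the matched part that is also a prefix of Y. The function name candidates and the sample pattern abcabd are assumptions made for illustration only; the article itself does not say what follows abcab in Y.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates (hypothetical name) lists every non-empty proper suffix of
// matched that is also a prefix of y, longest first. Each entry is a
// possible next matching sequence; every other shift can be excluded.
func candidates(matched, y string) []string {
	var out []string
	for k := len(matched) - 1; k > 0; k-- {
		suffix := matched[len(matched)-k:]
		if strings.HasPrefix(y, suffix) {
			out = append(out, suffix)
		}
	}
	return out
}

func main() {
	// For the matched part abcab, the only candidate is "ab",
	// which corresponds to shifting Y by three positions.
	fmt.Println(candidates("abcab", "abcabd"))
	// A pattern with more repetition has several candidates.
	fmt.Println(candidates("aaaa", "aaaab"))
}
```

An empty result means that no matching sequence remains, so Y has to move past the current comparison position entirely.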

Rule 1
If the two characters at the current comparison position are the same, the comparison position moves one place to the right.
Rule 2
If the two characters at the current comparison position are different, then:

If no matching sequence exists, the comparison position moves one place to the right, and string Y is shifted one place to the right as well.
If a matching sequence already exists, string Y is moved to the next matching position.

For example:

    |
    dababeabafdababcg
    ababc

The characters do not match. The comparison position and string Y are both shifted one place to the right.

     |
    dababeabafdababcg
     ababc

The characters match. The comparison position is shifted one place to the right.

      |
    dababeabafdababcg
     ababc

Four characters in a row match, then the next one does not.

         |
    dababeabafdababcg
     ababc

Character mismatch. Move string Y to the next matching position.

         |
    dababeabafdababcg
       ababc

Still no match. Continue to move string Y.

         |
    dababeabafdababcg
         ababc

Still no match, and no matching sequence remains. Both the comparison position and string Y are shifted one place to the right.

          |
    dababeabafdababcg
          ababc

The characters match. The comparison position is shifted one place to the right.

           |
    dababeabafdababcg
          ababc

Three characters in a row match, then the next one does not.

             |
    dababeabafdababcg
          ababc

Character mismatch. Move string Y to the next matching position.

             |
    dababeabafdababcg
            ababc

Still no match. Continue to move string Y.

             |
    dababeabafdababcg
             ababc

The characters do not match, and no matching sequence remains. Both the comparison position and string Y are shifted one place to the right.

              |
    dababeabafdababcg
              ababc

The characters do not match, and no matching sequence exists. Both the comparison position and string Y are shifted one place to the right.

               |
    dababeabafdababcg
               ababc

The characters match. The comparison position is shifted one place to the right.

                |
    dababeabafdababcg
               ababc

Five characters in a row match and the end of string Y is reached. The match is successful.

                   |
    dababeabafdababcg
               ababc

As you can see, in the KMP algorithm the comparison position never moves backward, and the number of comparisons at the same position is significantly lower than in the simple algorithm, so the comparison is quite fast.
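The claim that the comparison position never moves backward can be checked with a small instrumented search. The sketch below is an assumption-laden illustration rather than the article's own code: nextTable and kmpTrace are hypothetical names, and the trace slice exists only to record which index of X each comparison looks at.

```go
package main

import "fmt"

// nextTable (hypothetical name): n[i] is the length of the longest non-empty
// proper suffix of y[:i] that is also a prefix of y, or 0 if there is none.
func nextTable(y string) []int {
	n := make([]int, len(y))
	for i, j := 2, 0; i < len(y); {
		switch {
		case y[i-1] == y[j]:
			j++
			n[i] = j
			i++
		case j == 0:
			i++
		default:
			j = n[j]
		}
	}
	return n
}

// kmpTrace (hypothetical name) returns the match position, or -1, and the
// index of x examined by each comparison, in the order they were made.
func kmpTrace(x, y string) (pos int, visited []int) {
	n := nextTable(y)
	i, j := 0, 0
	for i < len(x) && j < len(y) {
		visited = append(visited, i)
		switch {
		case x[i] == y[j]: // Rule 1: advance the comparison position
			i, j = i+1, j+1
		case j == 0: // Rule 2, no matching sequence: shift both
			i++
		default: // Rule 2, matching sequence exists: move y only
			j = n[j]
		}
	}
	if j == len(y) {
		return i - j, visited
	}
	return -1, visited
}

func main() {
	pos, visited := kmpTrace("dababeabafdababcg", "ababc")
	fmt.Println("position:", pos)
	fmt.Println("comparisons:", len(visited))
	fmt.Println("indices examined:", visited)
}
```

For the example strings, the recorded indices repeat at positions 5 and 9 (where Y is moved twice) but never decrease, which matches the walkthrough above.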

The key to the KMP algorithm is calculating the next matching sequence from the existing matching sequence. This can be done using string Y alone, because the next matching sequence does not depend on anything outside the existing matching sequence.

In fact, from the existing matching sequence you can obtain several next-level matching sequences that meet the requirement (that is, each is both a proper suffix of the existing matching sequence and a prefix of string Y), but there is no doubt that we should select the longest one.

This key step, calculating the next matching sequence from the existing matching sequence, can be done in advance. Because we know that the next matching sequence must be a prefix of string Y, we only need to record its length. This is the true face of the next array. (The values of the next array given in other introductions may differ from what I describe here, but they are just a mathematical transformation of the next matching length.)

There are also many ways to calculate the next array, for example a simple brute-force method. In fact, there is a neat way to calculate it easily. This method is similar to the KMP comparison process itself, and it is also similar in spirit to the Ukkonen algorithm for building suffix trees.

The Go code is as follows:

    // next returns a table n in which n[i] is the length of the longest
    // sequence that is both a proper suffix of s[:i] and a prefix of s.
    // A proper suffix here is a non-empty suffix shorter than the string
    // itself. If no such sequence exists, the value is 0.
    func next(s string) []int {
        l := len(s)
        n := make([]int, l)
        // Start from n[2]; n[0] and n[1] are always 0.
        for i, j := 2, 0; i < l; {
            // All characters before this point are known to match.
            if s[i-1] == s[j] {
                j++
                n[i] = j
                i++
                continue
            }
            if j == 0 {
                // n was zero-initialized by make, so n[i] is already 0.
                i++
                continue
            }
            // Fall back to a shorter candidate; this acts like a suffix pointer.
            j = n[j]
        }
        return n
    }

We compare only one character in each loop iteration! This is possible because the earlier comparisons have already ensured that the preceding part matches. If the comparison succeeds, we only need to extend the length of the current match by one. If it fails, we do not need to start from the beginning to find where the next match starts, because the previous results already tell us the next matching position.
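For comparison, the "simple method" of computing the table can be sketched as a brute-force search over all proper suffixes. The name naiveNext is an assumption introduced here, and the function is only meant as a slow reference for checking the values in the table.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveNext (hypothetical name) computes the table by brute force: for each
// prefix s[:i] it tries every non-empty proper suffix, longest first, and
// records the length of the first one that is also a prefix of s.
func naiveNext(s string) []int {
	n := make([]int, len(s))
	for i := 2; i < len(s); i++ {
		for k := i - 1; k > 0; k-- {
			if strings.HasPrefix(s, s[i-k:i]) {
				n[i] = k
				break
			}
		}
	}
	return n
}

func main() {
	fmt.Println(naiveNext("ababc")) // prints [0 0 0 1 2]
}
```

For Y = ababc the entries 1 and 2 correspond to the matched parts aba and abab, whose next matching sequences are a and ab.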

As a result, both constructing the next array and performing the comparison are very fast in the KMP algorithm!

The complete code of the KMP algorithm is as follows:

    // KMP string search algorithm: returns the index of the first
    // occurrence of r in s, or -1 if r does not occur in s.
    func KMP(s, r string) int {
        l := len(r)
        // n[i] is the length of the longest sequence that is both a proper
        // suffix of r[:i] and a prefix of r, or 0 if no such sequence exists.
        n := func(s string) []int {
            n := make([]int, l)
            // Start from n[2]; n[0] and n[1] are always 0.
            for i, j := 2, 0; i < l; {
                // All characters before this point are known to match.
                if s[i-1] == s[j] {
                    j++
                    n[i] = j
                    i++
                    continue
                }
                if j == 0 {
                    // n was zero-initialized by make, so n[i] is already 0.
                    i++
                    continue
                }
                // Fall back to a shorter candidate; this acts like a suffix pointer.
                j = n[j]
            }
            return n
        }(r)
        // Search: i is the comparison position in s, j the position in r.
        i, j := 0, 0
        // i-j is where r is currently aligned. The bound is inclusive so that
        // a match ending at the last character of s is still found.
        for i+l <= j+len(s) && j < l {
            if s[i] == r[j] {
                i, j = i+1, j+1
                continue
            }
            if j == 0 {
                i++
            } else {
                j = n[j]
            }
        }
        if j == l {
            return i - l
        }
        return -1
    }
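One way to gain some confidence in a routine like this is to compare its results with strings.Index from the Go standard library on a few inputs, including a match that ends exactly at the last character. The sketch below uses a compact restatement of the search under the hypothetical name kmpIndex; the test cases other than the article's example are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// kmpIndex (hypothetical name) is a compact KMP search that returns the
// index of the first occurrence of r in s, or -1.
func kmpIndex(s, r string) int {
	l := len(r)
	// Build the next table for r.
	n := make([]int, l)
	for i, j := 2, 0; i < l; {
		if r[i-1] == r[j] {
			j++
			n[i] = j
			i++
		} else if j == 0 {
			i++
		} else {
			j = n[j]
		}
	}
	// Scan s without ever moving i backward.
	i, j := 0, 0
	for i < len(s) && j < l {
		if s[i] == r[j] {
			i, j = i+1, j+1
		} else if j == 0 {
			i++
		} else {
			j = n[j]
		}
	}
	if j == l {
		return i - l
	}
	return -1
}

func main() {
	cases := [][2]string{
		{"dababeabafdababcg", "ababc"},
		{"ab", "ab"},
		{"aaaab", "aab"},
		{"abc", "abd"},
	}
	for _, c := range cases {
		got, want := kmpIndex(c[0], c[1]), strings.Index(c[0], c[1])
		fmt.Println(c[0], c[1], got, got == want)
	}
}
```

Agreement on a handful of cases is not a proof of correctness, but it catches boundary mistakes such as missing a match at the very end of the text.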

The KMP algorithm is a very subtle algorithm, yet its code is very simple. Although it may be slower than the Boyer-Moore (BM) algorithm, it is indeed much more elegant.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
