KMP Algorithm for String Matching

Last Update:2016-05-23 Source: Internet

Author: User

Tags bitwise


1. Preface:

LeetCode problem 28, Implement strStr(), is a string-matching problem. String matching is one of the basic tasks in computing, so this post and the next one summarize the relevant algorithms.

2. Brute-force algorithm

The brute-force approach works as follows. Suppose the text string S has been matched up to position i and the pattern string P up to position j. Then:

If the current characters match (that is, S[i] == P[j]), then i++ and j++, and matching continues with the next character;

If they do not match (that is, S[i] != P[j]), set i = i - (j - 1) and j = 0. In other words, after every failed match, i backtracks to the position just after the start of the current attempt, and j is reset to 0.

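#include <string>
using namespace std;

class Solution {
public:
    int strStr(string haystack, string needle) {
        if (needle.empty()) return 0;
        int n = haystack.size(), m = needle.size();
        int i = 0;
        while (i < n) {
            int tagi = i;  // scan position in haystack for the attempt starting at i
            int j = 0;     // scan position in needle, reset for every attempt
            // Check the bounds first, then compare the characters.
            while (tagi < n && j < m && haystack[tagi] == needle[j]) {
                tagi++;
                j++;
            }
            if (j == m) {
                return tagi - m;  // equals i, the start of the match
            }
            i++;
        }
        return -1;
    }
};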
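The two rules above can also be written as a single loop that backtracks i exactly as described (i = i - (j - 1), j = 0). The following is a minimal sketch; the function name bruteForceSearch is illustrative and does not come from the original solution.

```cpp
#include <string>

// Brute-force search following the two rules literally.
// Returns the index of the first match of p in s, or -1 if there is none.
// (bruteForceSearch is a hypothetical name used only for this sketch.)
int bruteForceSearch(const std::string& s, const std::string& p) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(p.size());
    int i = 0;  // position in the text
    int j = 0;  // position in the pattern
    while (i < n && j < m) {
        if (s[i] == p[j]) {
            // Rule 1: characters match, advance both positions.
            ++i;
            ++j;
        } else {
            // Rule 2: mismatch, i goes back to one past the start of
            // this attempt and j is reset to 0.
            i = i - (j - 1);
            j = 0;
        }
    }
    // The whole pattern was consumed, so the match started at i - j.
    return j == m ? i - j : -1;
}
```

For example, bruteForceSearch("hello", "ll") returns 2. In the worst case every attempt re-reads up to m characters, which is the inefficiency that KMP removes.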

3. KMP algorithm principle

This section is reproduced from Ruan Yi Feng.

First, the first character of the string "BBC ABCDAB ABCDABCDABDE" is compared with the first character of the search term "ABCDABD". Because B does not match A, the search term moves one position to the right.

Because B again does not match A, the search term moves right once more.

This continues until a character in the string is the same as the first character of the search term.

The next character of the string is then compared with the next character of the search term, and they match as well.

This continues until a character in the string differs from the corresponding character of the search term.

At this point, the most natural response is to move the search term one position to the right and compare again from its first character. This works, but it is inefficient, because the "search position" moves back to locations that have already been compared.

One basic fact is that when the space does not match D, you already know that the previous six characters are "ABCDAB". The idea of the KMP algorithm is to take advantage of this known information: instead of moving the "search position" back over characters that have already been compared, it keeps moving forward, which improves efficiency.

How is this done? A partial match table can be computed for the search term. For "ABCDABD" the table is 0, 0, 0, 0, 1, 2, 0, one value per character. How the table is produced is explained later; for now it is enough to know how to use it.

When the space fails to match D, the previous six characters "ABCDAB" have already been matched. The table shows that the last matched character, B, has a "partial match value" of 2, so the number of positions to move is given by the following formula:

Positions to move = number of matched characters - corresponding partial match value

Because 6 - 2 equals 4, the search term moves 4 positions to the right.
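The formula can be sketched as a small helper. This is only an illustration: shiftDistance is a hypothetical name, and the table is assumed to hold one partial match value per pattern character (0, 0, 0, 0, 1, 2, 0 for "ABCDABD").

```cpp
#include <cassert>
#include <vector>

// Positions to move = matched characters - partial match value of the
// last matched character. `table[k]` is assumed to be the partial match
// value of the first k + 1 pattern characters.
// (shiftDistance is a hypothetical helper name.)
int shiftDistance(int matched, const std::vector<int>& table) {
    // With zero matched characters the formula does not apply;
    // the search term simply moves by one position.
    assert(matched >= 1 && matched <= static_cast<int>(table.size()));
    return matched - table[matched - 1];
}
```

With the table for "ABCDABD", shiftDistance(6, table) is 4, shiftDistance(2, table) is 2, and shiftDistance(7, table) is 7, which are the shifts used in the walkthrough.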

Because the space does not match C, the search term keeps moving right. At this point the number of matched characters is 2 ("AB"), and the corresponding "partial match value" is 0. So the positions to move = 2 - 0 = 2, and the search term moves 2 positions to the right.

Because the space does not match A, the search term moves one more position to the right.

Comparison proceeds character by character until C is found not to match D. So the positions to move = 6 - 2 = 4, and the search term moves another 4 positions to the right.

Comparison proceeds character by character until the last character of the search term, where a complete match is found and the search is finished. To continue searching (that is, to find all matches), the positions to move = 7 - 0 = 7, so the search term moves 7 positions to the right; this is not repeated here.
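The walkthrough can be replayed in code by recording every alignment of the search term against the string. The sketch below is an assumption-laden illustration, not the original author's code: Trace and alignmentTrace are hypothetical names, and the partial match table is passed in ready-made.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Every alignment (start index in the text) that was tried, plus the
// index of the first full match (-1 if there is none).
struct Trace {
    std::vector<int> starts;
    int match = -1;
};

// Replays the shift-based description: on a mismatch the search term
// moves by (matched - partial match value), or by 1 if nothing matched.
Trace alignmentTrace(const std::string& text, const std::string& pattern,
                     const std::vector<int>& table) {
    assert(table.size() == pattern.size());
    Trace trace;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return trace;  // nothing to align
    int start = 0;    // current alignment of the search term
    int matched = 0;  // characters already known to match at `start`
    while (start + m <= n) {
        trace.starts.push_back(start);
        while (matched < m && text[start + matched] == pattern[matched]) {
            ++matched;
        }
        if (matched == m) {
            trace.match = start;
            break;
        }
        if (matched == 0) {
            ++start;
        } else {
            const int keep = table[matched - 1];  // partial match value
            start += matched - keep;              // positions to move
            matched = keep;  // these characters need no re-comparison
        }
    }
    return trace;
}
```

For the string "BBC ABCDAB ABCDABCDABDE", the search term "ABCDABD" and the table 0, 0, 0, 0, 1, 2, 0, the recorded alignments are 0, 1, 2, 3, 4, 8, 10, 11, 15, and the match is found at index 15. The jumps 4 to 8, 8 to 10 and 11 to 15 are the shifts of 4, 2 and 4 described above.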

Here is how the partial match table is produced.

First, two concepts need to be understood: prefix and suffix. The "prefixes" of a string are all of its leading substrings that do not include the last character; the "suffixes" are all of its trailing substrings that do not include the first character.
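As a quick illustration of the two definitions, the following sketch enumerates both sets for a given string. The function names properPrefixes and properSuffixes are hypothetical and chosen only for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// All leading substrings of s that do not include its last character.
std::vector<std::string> properPrefixes(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t len = 1; len < s.size(); ++len) {
        out.push_back(s.substr(0, len));
    }
    return out;
}

// All trailing substrings of s that do not include its first character.
std::vector<std::string> properSuffixes(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t start = 1; start < s.size(); ++start) {
        out.push_back(s.substr(start));
    }
    return out;
}
```

For "ABCDA" this gives the prefixes A, AB, ABC, ABCD and the suffixes BCDA, CDA, DA, A. A single-character string has neither.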

The partial match value is the length of the longest element shared by the prefixes and the suffixes. Take "ABCDABD" as an example:

- The prefixes and suffixes of "A" are both empty, so the common element length is 0;

- The prefixes of "AB" are [A] and the suffixes are [B]; the common element length is 0;

- The prefixes of "ABC" are [A, AB] and the suffixes are [BC, C]; the common element length is 0;

- The prefixes of "ABCD" are [A, AB, ABC] and the suffixes are [BCD, CD, D]; the common element length is 0;

- The prefixes of "ABCDA" are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]; the common element is "A", with length 1;

- The prefixes of "ABCDAB" are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", with length 2;

- The prefixes of "ABCDABD" are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D]; the common element length is 0.
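The list above can be reproduced directly from the definition. The sketch below is deliberately naive (it re-checks every candidate length for every leading substring) and is meant only to confirm the values, not to replace the next[] generation shown later. The names partialMatchValue and partialMatchTable are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest string that is both a prefix and a suffix of s,
// excluding s itself. Tries the longest candidate first.
int partialMatchValue(const std::string& s) {
    const int n = static_cast<int>(s.size());
    for (int len = n - 1; len >= 1; --len) {
        // Compare the first `len` characters with the last `len` characters.
        if (s.compare(0, len, s, n - len, len) == 0) {
            return len;
        }
    }
    return 0;
}

// One partial match value per leading substring of the pattern.
std::vector<int> partialMatchTable(const std::string& pattern) {
    std::vector<int> table;
    for (std::size_t end = 1; end <= pattern.size(); ++end) {
        table.push_back(partialMatchValue(pattern.substr(0, end)));
    }
    return table;
}
```

For "ABCDABD" this yields 0, 0, 0, 0, 1, 2, 0, in agreement with the seven cases listed above.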

The essence of "partial match" is that sometimes the beginning and the end of a string repeat. For example, "ABCDAB" contains "AB" twice, so its "partial match value" is 2 (the length of "AB"). When the search term moves, shifting the first "AB" right by 4 positions (string length - partial match value) brings it to the position of the second "AB".

4. Algorithm implementation

4.1 next[] array generation algorithm

vector<int> next(needle.size(), 0);
int k = 0;
for (int i = 1; i < needle.size(); i++) {
    // Stage 3: everything before this position matched, but this character does not.
    // Fall back to shorter matching sequences via next[k-1] until one fits or k reaches 0.
    while (k > 0 && (needle[k] != needle[i])) {
        k = next[k - 1];
    }
    // Stage 1: no repeated match has been found yet; just move on to the next i.
    // Stage 2: a repeated match is found; k++.
    if (needle[k] == needle[i]) {
        ++k;
    }
    next[i] = k;
}

In the code above, the next[] array stores the matching table. It is initialized as an array whose length equals that of the pattern string (needle), with every value defaulting to 0. Generating the matching table for a string can be divided into three stages:

1. No repeated match has been found yet: keep traversing forward with i++, and the table entries keep the default value 0.

2. A repeated match is found: k++.

3. The characters before the current one matched, but the current characters do not match. Repeatedly apply k = next[k-1] to find a shorter matching string.

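For the third stage, the fallback is k = next[k-1] because the first k characters of needle are known to match the k characters ending at position i-1. When needle[k] != needle[i], the next candidate is the longest prefix of those k characters that is also their suffix, and its length is exactly next[k-1].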
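To make the third stage more concrete, the sketch below wraps the generation loop in a function and adds a helper that lists the values k passes through when it keeps falling back. Both names, buildNext and fallbackChain, are hypothetical and used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Same three-stage loop as above, wrapped in a function.
std::vector<int> buildNext(const std::string& needle) {
    std::vector<int> next(needle.size(), 0);
    int k = 0;  // length of the currently matched prefix
    for (std::size_t i = 1; i < needle.size(); ++i) {
        while (k > 0 && needle[k] != needle[i]) {
            k = next[k - 1];  // stage 3: fall back to a shorter match
        }
        if (needle[k] == needle[i]) {
            ++k;  // stage 2: the match grows by one character
        }
        next[i] = k;  // stage 1 leaves this at 0
    }
    return next;
}

// Values of k visited by repeating k = next[k - 1] until k reaches 0.
std::vector<int> fallbackChain(const std::vector<int>& next, int k) {
    assert(k >= 0 && k <= static_cast<int>(next.size()));
    std::vector<int> chain{k};
    while (k > 0) {
        k = next[k - 1];
        chain.push_back(k);
    }
    return chain;
}
```

For "ABABAB" the table is 0, 0, 1, 2, 3, 4, and falling back from k = 4 visits 4, 2, 0: the candidates "ABAB", "AB" and the empty string. For "AABAAAB" the table is 0, 1, 0, 1, 2, 2, 3; the entry at index 5 is reached through one fallback from k = 2 to k = 1.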

4.2 KMP Complete algorithm

#include <string>
#include <vector>
using namespace std;

class Solution {
public:
    int strStr(string haystack, string needle) {
        if (needle.empty()) return 0;
        if (haystack.empty()) return -1;
        // Preprocessing: build the next[] table.
        vector<int> next(needle.size(), 0);
        int k = 0;
        for (int i = 1; i < needle.size(); i++) {
            // Stage 3: everything before this position matched, but this character does not.
            // Fall back to shorter matching sequences until one fits or k reaches 0.
            while (k > 0 && (needle[k] != needle[i])) {
                k = next[k - 1];
            }
            // Stage 1: no repeated match has been found yet; just move on to the next i.
            // Stage 2: a repeated match is found; k++.
            if (needle[k] == needle[i]) {
                ++k;
            }
            next[i] = k;
        }
        int j = 0;  // number of needle characters matched so far
        // Match.
        for (int i = 0; i < haystack.size(); i++) {
            // Stage 1: no equal character found; i++, j stays 0.
            // Stage 2: equal characters found; i++, j++.
            // Stage 3: j == needle.size(), the match is found; return i - j + 1 (see the note at the return).
            // Stage 4: j > 0 and the characters differ; j = next[j-1] with i unchanged, then compare again.
            //          Repeat until they are equal (go to stage 2) or j == 0 (go to stage 1).
            // Stage 5: i == haystack.size(), nothing found; return -1.
            while ((haystack[i] != needle[j]) && j > 0) {
                j = next[j - 1];
            }
            if (needle[j] == haystack[i]) {
                j++;
            }
            if (j == needle.size()) {
                // j has already been incremented past the last matched index, while i
                // is only incremented at the end of the for loop, hence the + 1.
                return i - j + 1;
            }
        }
        return -1;
    }
};
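The walkthrough in section 3 mentions that the search can continue after a match to find all occurrences. One way to sketch this, assuming the same next[] table, is to fall back with j = next[j-1] after each match instead of returning. The names buildNext and kmpFindAll are hypothetical, and the sketch returns an empty list for an empty pattern as a simplifying assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// next[] table, built as in section 4.1.
std::vector<int> buildNext(const std::string& needle) {
    std::vector<int> next(needle.size(), 0);
    int k = 0;
    for (std::size_t i = 1; i < needle.size(); ++i) {
        while (k > 0 && needle[k] != needle[i]) k = next[k - 1];
        if (needle[k] == needle[i]) ++k;
        next[i] = k;
    }
    return next;
}

// Start indices of all (possibly overlapping) occurrences of pattern in text.
std::vector<int> kmpFindAll(const std::string& text, const std::string& pattern) {
    std::vector<int> hits;
    if (pattern.empty()) return hits;  // assumption: no hits for an empty pattern
    const std::vector<int> next = buildNext(pattern);
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    int j = 0;  // number of pattern characters matched so far
    for (int i = 0; i < n; ++i) {
        while (j > 0 && text[i] != pattern[j]) {
            j = next[j - 1];
        }
        if (text[i] == pattern[j]) {
            ++j;
        }
        if (j == m) {
            hits.push_back(i - m + 1);
            // Keep the longest prefix that is also a suffix of the match,
            // so overlapping occurrences are not missed.
            j = next[j - 1];
        }
    }
    return hits;
}
```

For example, kmpFindAll("ABABAB", "ABAB") returns the indices 0 and 2, and the first hit always agrees with std::string::find for a non-empty pattern.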


This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
