KMP Substring Search Algorithm

Source: Internet
Author: User

The KMP algorithm is an improved string matching algorithm discovered by D. E. Knuth, J. H. Morris and V. R. Pratt, hence the name Knuth-Morris-Pratt algorithm (KMP for short, from their initials).

The key to the KMP algorithm is to use the information from a partial match to minimize the number of comparisons between the pattern string and the main string, which makes matching fast.

Before introducing KMP, consider the naive solution, the simplest brute-force method. It finds the substring by exhaustive comparison: whenever the substring fails to match at the current position of the target, matching restarts from the beginning of the substring at the next position, regardless of the characters that have already been compared, so it is very inefficient.

Here is the implementation code for the naïve solution:

#include <cstring>   // strlen

int sub_str_index(const char* s, const char* p)   // s is the source string, p is the pattern (the substring)
{
    int ret = -1;                          // return value; -1 means not found
    int sl = static_cast<int>(strlen(s));
    int pl = static_cast<int>(strlen(p));
    int len = sl - pl;                     // how much longer the source is than the substring; limits the start
                                           // positions so a comparison never runs past the end of the source

    for (int i = 0; (ret < 0) && (i <= len); i++)   // continue while nothing is found and the start position is in range
    {
        bool equal = true;                 // records whether the current alignment still matches

        for (int j = 0; equal && (j < pl); j++)     // continue while equal is true and the substring is not exhausted
        {
            equal = (s[i + j] == p[j]);    // true if the source character matches the substring character
        }

        ret = (equal ? i : -1);            // on success, the subscript of the first matched source character; otherwise -1
    }

    return ret;
}
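To make the cost of the naive method concrete, here is a small sketch that counts character comparisons. The helper name naive_count and the test strings are assumptions made for illustration; they are not part of the original code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: runs the naive search and returns how many
// character comparisons were made before it stopped.
long naive_count(const std::string& s, const std::string& p)
{
    long comparisons = 0;

    if (p.empty() || p.size() > s.size())
    {
        return comparisons;                // nothing to compare
    }

    for (std::size_t i = 0; i + p.size() <= s.size(); i++)
    {
        std::size_t j = 0;

        while (j < p.size())
        {
            comparisons++;                 // one comparison of s[i + j] with p[j]
            if (s[i + j] != p[j])
            {
                break;                     // mismatch: restart at the next start position
            }
            j++;
        }

        if (j == p.size())
        {
            break;                         // full match found
        }
    }

    return comparisons;
}
```

For the 10-character source "AAAAAAAAAB" and the pattern "AAAB", this sketch performs 28 comparisons, because almost every start position re-compares characters that were already examined.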

Next, the KMP algorithm:

The KMP algorithm uses information that is already known to avoid redundant comparisons.

This is Ruan Yifeng's explanation, which I find very good, for reference: http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html

In addition, Huang Haojie on YouTube also explains it well; have a look if that is convenient for you.

Here is a picture from Ruan Yifeng's blog with some changes of mine, used as an example:

After several comparisons, the D of the substring and the space of the source string mismatch here. Moving the substring back one position at a time would be wasteful: the characters already matched show that the next three alignments cannot succeed, so the substring can be moved back 4 positions directly. This skips three futile matching attempts and improves efficiency.

So how do we know how many positions to move, and how is that number calculated?

This is where the partial match table of the KMP algorithm comes in. For each character of the substring, the table records the length of the longest string that is both a suffix ending at that character and a prefix starting at the first character of the substring. This value tells how many of the already matched characters can be reused after a mismatch, which gives the following formula:

Shift distance = number of matched characters - partial match value of the last matched character
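The formula can be sketched as a one-line function. The function name shift_distance is hypothetical, and the sample table assumes the pattern ABCDABD from the referenced blog post (table 0 0 0 0 1 2 0), not the modified picture used in this article.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the original article.
// Shift distance = matched characters - partial match value of the
// last matched character. `matched` must be at least 1; when nothing
// has matched, the pattern simply moves by one position.
int shift_distance(const std::vector<int>& pmt, std::size_t matched)
{
    return static_cast<int>(matched) - pmt[matched - 1];
}
```

With that assumed table and 6 matched characters, the last matched character has the partial match value 2, so the shift is 6 - 2 = 4, which agrees with the 4 positions mentioned above.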

The prefixes of a string are all of its leading parts that exclude the last character; the suffixes are all of its trailing parts that exclude the first character. For example, for the substring in the example above:

Matching starts with the first character of the substring, moves on to the second after a success, and so on, so a mismatch can occur at any position. The partial match table must therefore hold a partial match value for every character of the substring.
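The prefix and suffix definition can be checked by brute force. The sketch below is only an illustration of the definition (the function name is an assumption); it is much slower than the table construction described later.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the longest string that is both a
// proper prefix and a proper suffix of t, found by trying every length.
int longest_common_prefix_suffix(const std::string& t)
{
    // A proper prefix or suffix is shorter than t itself, so start at size - 1.
    for (std::size_t len = (t.size() > 0 ? t.size() - 1 : 0); len > 0; len--)
    {
        // Compare the prefix t[0, len) with the suffix t[size - len, size).
        if (t.compare(0, len, t, t.size() - len, len) == 0)
        {
            return static_cast<int>(len);
        }
    }

    return 0;   // no common prefix and suffix
}
```

For example, "ABCDAB" has the prefix AB and the suffix AB in common, so the result is 2, while "ABCD" shares nothing and gives 0.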

Take a look at the picture below

The match value is initialized to 0. Each character is compared with the substring character whose subscript equals the previous character's partial match value. Because that value always records the longest match so far, the comparison only needs to continue from there. The value counts from 1 while subscripts start from 0, so the value used as a subscript points exactly at the character that follows the already matched prefix.

Subscript 0, character A: a single character has no prefix or suffix to compare, so the value is set directly to 0.

Subscript 1, character B: it does not match the character at subscript 0, so the value is also 0.

Subscript 2, character C: it does not match the character at subscript 0, and BC does not match AB, so the value is also 0.

Subscript 3, character D: it does not match the character at subscript 0, CD does not match AB, and BCD does not match ABC, so the value is 0.

Subscript 4, character A: it matches the character at subscript 0, so the value increases by 1. No longer combination matches, so the partial match value of this fifth character stays at 1.

Subscript 5, character B: because the character at subscript 4 matched, B is compared with the next character, the B at subscript 1. The match succeeds, so the value is the previous character's value plus 1, namely 2. No longer combination matches, so the final result is 2.

Subscript 6, character E: the previous character's partial match value is 2, so E is compared with the character at subscript 2 (C), which does not match. The candidate value is therefore reset to the partial match value of the character at subscript 2-1 (B), namely 0. E does not match the character at subscript 0 (A) either, so the partial match value stays 0.

And so on: the character at subscript 7 matches, so the value increases by 1, ...

Finally, subscript 13, character A: the previous character (B) has the partial match value 6, so A is compared with the character at subscript 6, which does not match. The candidate value is reset to the partial match value of the character at subscript 6-1, namely 2. A does not match the character at subscript 2 (C) either, so the value is reset again to the partial match value of the character at subscript 2-1, namely 0. A matches the character at subscript 0, so the partial match value becomes 1, and the final result is 1.
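The fallback steps in this walkthrough can be traced with a short sketch. The function name fallback_chain is hypothetical, and the pattern "ABCDABEABCDABA" used below is an assumption: it is merely one pattern consistent with the values described above, since the original picture is not available.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tracer: returns the candidate lengths tried while the
// partial match value of p[i] is computed, starting with the previous
// character's value and following each reset.
std::vector<int> fallback_chain(const std::string& p, std::size_t i)
{
    std::vector<int> pmt(p.size(), 0);
    std::vector<int> chain;
    int ll = 0;                                // current longest length

    for (std::size_t k = 1; k <= i && k < p.size(); k++)
    {
        if (k == i)
        {
            chain.push_back(ll);               // starting candidate
        }

        while (ll > 0 && p[ll] != p[k])
        {
            ll = pmt[ll - 1];                  // reset to a shorter candidate
            if (k == i)
            {
                chain.push_back(ll);
            }
        }

        if (p[ll] == p[k])
        {
            ll++;                              // match: extend by one character
        }

        pmt[k] = ll;
    }

    return chain;
}
```

Under that assumed pattern, the character at subscript 13 tries the candidates 6, 2 and 0 in turn, and the character at subscript 6 tries 2 and then 0, which mirrors the resets described in the walkthrough.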

Many people online call this partial match table the next array. Following my teacher's code, I use an int pointer to heap memory whose length equals the length of the substring; each element is an int, so it is equivalent to an int array.

#include <cstdlib>   // malloc, free
#include <cstring>   // strlen

int* make_pmt(const char* p)   // builds the partial match table for p
{
    int len = static_cast<int>(strlen(p));
    int* ret = static_cast<int*>(malloc(sizeof(int) * len));   // heap space that holds the partial match table

    if ((ret != NULL) && (len > 0))   // proceed only if the allocation succeeded and the substring is not empty
    {
        int ll = 0;    // ll ==> longest length, the current partial match value

        ret[0] = 0;    // the first character cannot match anything, so write 0 directly

        for (int i = 1; i < len; i++)
        {
            while ((ll > 0) && (p[ll] != p[i]))   // part of the substring matched, then a mismatch occurred
            {
                ll = ret[ll - 1];   // reset ll to the partial match value of the character at subscript ll - 1;
                                    // subscripts start at 0, hence the -1 (note: lowercase ll, not the number 11)
            }

            if (p[ll] == p[i])
            {
                ll++;               // on a match, ll increases by 1
            }

            ret[i] = ll;            // store ll in the table before the next iteration
        }
    }

    return ret;
}

Here is the KMP function:

int String::kmp(const char* s, const char* p)
{
    int ret = -1;
    int sl = static_cast<int>(strlen(s));
    int pl = static_cast<int>(strlen(p));
    int* pmt = make_pmt(p);   // partial match table of the substring; it lives on the heap and must be freed

    if ((pmt != NULL) && (0 < pl) && (pl <= sl))   // search only if the table was built, the substring is not
    {                                              // empty, and it is no longer than the source string
        for (int i = 0, j = 0; i < sl; i++)        // i walks the source string, j counts the matched substring characters
        {
            while ((j > 0) && (s[i] != p[j]))      // some characters matched, but the next one fails
            {
                j = pmt[j - 1];   // shift the substring in one step so that its reusable prefix lines up with
                                  // the source, instead of the brute-force shift by a single position
            }

            if (s[i] == p[j])     // on a match, the number of matched characters increases by 1
            {
                j++;
            }

            if (j == pl)          // all substring characters matched: success
            {
                ret = i + 1 - pl; // i points at the last matched source character; subtracting the substring
                                  // length and adding 1 gives the subscript of the first matched character
                break;
            }
        }
    }

    free(pmt);

    return ret;
}
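For readers who want to try the search without the String class or manual memory management, here is a self-contained sketch using std::string and std::vector. The name kmp_find is an assumption; the logic follows the two functions above, including the choice to return -1 for an empty pattern.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical standalone variant of the search above.
// Returns the subscript of the first occurrence of p in s, or -1.
int kmp_find(const std::string& s, const std::string& p)
{
    if (p.empty() || p.size() > s.size())
    {
        return -1;
    }

    // Build the partial match table.
    std::vector<std::size_t> pmt(p.size(), 0);
    for (std::size_t i = 1, ll = 0; i < p.size(); i++)
    {
        while (ll > 0 && p[ll] != p[i])
        {
            ll = pmt[ll - 1];                  // fall back to a shorter prefix
        }
        if (p[ll] == p[i])
        {
            ll++;
        }
        pmt[i] = ll;
    }

    // Scan the source once; j counts the matched pattern characters.
    for (std::size_t i = 0, j = 0; i < s.size(); i++)
    {
        while (j > 0 && s[i] != p[j])
        {
            j = pmt[j - 1];                    // reuse the already matched prefix
        }
        if (s[i] == p[j])
        {
            j++;
        }
        if (j == p.size())
        {
            return static_cast<int>(i + 1 - p.size());
        }
    }

    return -1;
}
```

Searching "BBC ABCDAB ABCDABCDABDE" for "ABCDABD" (the strings from the referenced blog post) returns 15 with this sketch. The source index i never moves backwards, which is where the speed-up over the naive method comes from.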
