Analysis and optimization of KMP string matching algorithm

Source: Internet
Author: User

Simple String Matching algorithm description

The most common use of a string matching algorithm is finding a given piece of text in a document. The text to be found is called the pattern string, and the string being searched is called the lookup string.

To better understand the KMP algorithm, let's first look at the naïve matching algorithm. When a character of the pattern string fails to match, the naïve algorithm slides the pattern string one position to the right along the lookup string and compares again from the beginning of the pattern string.
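The naïve procedure described above can be sketched in a few lines. This is only an illustration under assumptions: the function name naive_match and its pos parameter (the subscript at which the search starts) are chosen here for the example and are not part of the original article.

```cpp
#include <cassert>
#include <string>

// Naive matching: try every alignment of `pattern` against `text`,
// starting at subscript `pos`, always comparing from the pattern's first character.
// Returns the 0-based subscript of the first match, or -1 if there is none.
int naive_match(const std::string& text, const std::string& pattern, int pos) {
    assert(pos >= 0);  // the search start must be a valid, non-negative subscript
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    for (int start = pos; start + m <= n; ++start) {
        int k = 0;
        while (k < m && text[start + k] == pattern[k]) ++k;
        if (k == m) return start;  // every pattern character matched
        // Mismatch: slide the pattern one position to the right and start over.
    }
    return -1;
}
```

In the worst case every alignment re-compares almost the whole pattern string, which is the redundancy KMP is designed to avoid.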

Analysis of KMP string matching algorithm

KMP is short for the Knuth-Morris-Pratt algorithm. Its idea is that when a position of the pattern string mismatches, the pattern string does not have to restart from its beginning: an earlier position of the pattern string can be aligned with the current position of the lookup string, and the comparison continues directly from there, without moving backwards in the lookup string. Following this line of thinking, the question becomes: when a position of the pattern string mismatches, which earlier position of the pattern string should be aligned with the current position of the lookup string? That earlier position is called the backtracking position. It is usually written next, and it is defined as:

\[
\mathit{next}[i] =
\begin{cases}
0, & i = 1 \\
\max\{\, k \mid 1 < k < i,\ p_1 \cdots p_{k-1} = p_{i-k+1} \cdots p_{i-1} \,\}, & \text{if such a } k \text{ exists} \\
1, & \text{otherwise}
\end{cases}
\]

String subscripts in this formula start from 1. The formula says that if the part of the pattern string before a given position (not including that position) begins and ends with the same substring, that is, it self-matches, then a mismatch at that position can be handled by aligning the character that follows the leading substring with the mismatched position of the lookup string. For example, the part of abcaba before the last a is abcab, which both begins and ends with ab, so a mismatch at the last a lets us slide the pattern string until c is aligned with the position the last a occupied. This removes the redundant comparisons that the naïve algorithm makes against the part of the pattern string before the mismatch position: the naïve algorithm has to move abcaba three times before c reaches that position. When the part of the pattern string before a position has no such common leading and trailing substring, a mismatch at that position aligns the beginning of the pattern string directly with it.
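To make the formula concrete, the following sketch evaluates it literally, with 1-based positions, for a pattern such as abcaba. It is a brute-force illustration rather than the efficient construction, and the function name next_by_definition is an assumption made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Computes next[1..n] directly from the definition, using 1-based positions.
// Element 0 of the returned vector is unused and left at 0.
std::vector<int> next_by_definition(const std::string& p) {
    const int n = static_cast<int>(p.size());
    std::vector<int> next(n + 1, 0);
    for (int i = 2; i <= n; ++i) {
        next[i] = 1;  // the "otherwise" case of the formula
        for (int k = 2; k < i; ++k) {
            // Compare p1..p(k-1) with p(i-k+1)..p(i-1); both have length k-1.
            // 1-based position x is stored at p[x-1].
            if (p.compare(0, k - 1, p, i - k, k - 1) == 0) {
                next[i] = k;  // k only grows, so the largest match is kept
            }
        }
        assert(next[i] < i);  // the backtracking position lies before i
    }
    return next;
}
```

For abcaba this gives next[1..6] = 0, 1, 1, 1, 2, 3. The last value, 3, is the position of c, which agrees with the sliding described above.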

That describes the algorithm, but how can we show that it is correct? The answer is both troublesome and simple. It is troublesome because I have no formal proof to offer, of the kind written in mathematics. It is simple because picturing how the pattern string slides during matching is enough to convince me that the algorithm is correct, so I will not give a proof here.
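Short of a formal proof, one way to gain some confidence is to compare KMP against a trusted search on every small input. The sketch below does that for all strings over the alphabet {a, b} up to a given length, using std::string::find as the reference. It is an empirical check, not a proof, and the names build_next, kmp_find, all_strings and kmp_agrees_with_find are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Backtracking table in 0-based form (next[0] = -1).
std::vector<int> build_next(const std::string& m) {
    const int last = static_cast<int>(m.size()) - 1;
    std::vector<int> next(m.size(), -1);
    int i = -1, j = 0;
    while (j < last) {
        if (i < 0 || m[i] == m[j]) { ++i; ++j; next[j] = i; }
        else i = next[i];
    }
    return next;
}

// KMP search from the start of s; returns the first match subscript or -1.
int kmp_find(const std::string& s, const std::string& m) {
    const int s_len = static_cast<int>(s.size());
    const int m_len = static_cast<int>(m.size());
    const std::vector<int> next = build_next(m);
    int i = 0, j = 0;
    while (i < m_len && j < s_len) {
        if (i < 0 || s[j] == m[i]) { ++i; ++j; }
        else i = next[i];  // j stays put, only the pattern position backtracks
    }
    return i == m_len ? j - i : -1;
}

// All strings over {a, b} with length 0..max_len.
std::vector<std::string> all_strings(int max_len) {
    std::vector<std::string> out{""};
    for (std::size_t idx = 0; idx < out.size(); ++idx) {
        const std::string base = out[idx];  // copy: push_back may reallocate
        if (static_cast<int>(base.size()) < max_len) {
            out.push_back(base + 'a');
            out.push_back(base + 'b');
        }
    }
    return out;
}

// True if KMP and std::string::find agree on every text/pattern pair.
bool kmp_agrees_with_find(int max_len) {
    const std::vector<std::string> words = all_strings(max_len);
    for (const std::string& text : words) {
        for (const std::string& pat : words) {
            const std::size_t pos = text.find(pat);
            const int expected =
                (pos == std::string::npos) ? -1 : static_cast<int>(pos);
            if (kmp_find(text, pat) != expected) return false;
        }
    }
    return true;
}
```

Calling kmp_agrees_with_find(6) compares the two searches on every pair of strings of length at most 6 over {a, b}; a false result would point to a bug in the table or in the matching loop.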

Next, the complete code for the KMP algorithm is given.

    #include <iomanip>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    void get_next(const string& m, vector<int>& next);
    int kmp_match(const string& s, const string& m, int pos);

    int main() {
        string s = "abcdefghabcdefghhiijiklmabc";
        string t = "hhiij";
        int pos = kmp_match(s, t, 3);  // start searching at subscript 3
        cout << "\n" << pos << endl;
        return 0;
    }

    // Fill next from the pattern string m. The caller creates it as
    // vector<int> next(m.size(), -1). The first character has subscript 0.
    void get_next(const string& m, vector<int>& next) {
        int i = -1, j = 0;
        int m_last = static_cast<int>(m.size()) - 1;  // subscript of the last character
        // A while loop (rather than do-while) keeps patterns of length 0 or 1
        // from writing past the end of next.
        while (j < m_last) {
            if ((i < 0) || (m[i] == m[j])) {
                i++;
                j++;
                next[j] = i;
            } else {
                i = next[i];
            }
            // Trace of the construction.
            cout << "i=" << right << setw(3) << i
                 << " j=" << right << setw(3) << j
                 << " next[" << j << "] =" << right << setw(3) << next[j] << endl;
        }
    }

    // Search for m in s starting at subscript pos.
    // Returns the subscript of the first match, or -1 if there is none.
    int kmp_match(const string& s, const string& m, int pos) {
        int j = pos, i = 0;  // the first character has subscript 0
        int s_len = static_cast<int>(s.size());
        int m_len = static_cast<int>(m.size());
        if (pos < 0 || (s_len - pos) < m_len) return -1;
        vector<int> next(m.size(), -1);
        get_next(m, next);
        while (i < m_len && j < s_len) {
            if (i < 0 || s[j] == m[i]) {
                ++i;
                ++j;
            } else {
                i = next[i];  // j is unchanged, i backtracks
            }
        }
        if (i == m_len) return j - i;  // match succeeded
        else return -1;
    }

Optimization of KMP string matching algorithm

Look again at the matching example above with the pattern string ebaeb. The 2nd comparison is unnecessary, and the algorithm could skip directly to the 3rd. This comparison has a distinguishing feature: after the pattern string slides, the pattern character taking part in the comparison is the same as the one in the previous comparison (both are b). This is what happens when the backtracking position of the pattern string is computed with the formula above.

Counting subscripts from 0, as the code does, the formula gives index 1 as the backtracking position for a mismatch at index 4 of the pattern string, and the character at the backtracking position is the same as the character at the mismatch position (both are b). So when the last b of the pattern string mismatches against d, next[4] = 1 means that the character at index 1 (the 2nd character), which is again b, is compared with d next. But a b has just been compared with d and failed, so this comparison is bound to fail as well. The cause is that not only the part before the mismatch position can self-match; the part of the pattern string that includes the mismatch position can self-match too. When the substring ending at the mismatch position is also equal to a leading substring of the pattern string, the backtracking position of the mismatch position can be taken directly from the backtracking position of the last character of that leading substring. For the string ebaeb, next[4] is simply set to next[1]. The optimized next function is then:

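\[
\mathit{next}[i] =
\begin{cases}
0, & i = 1 \\
k, & 1 \le k < i,\ p_1 \cdots p_{k-1} = p_{i-k+1} \cdots p_{i-1} \text{ and } p_k \ne p_i \\
\mathit{next}[k], & 1 \le k < i,\ p_1 \cdots p_k = p_{i-k+1} \cdots p_i
\end{cases}
\]

Subscripts start from 1 here. In both cases k is the largest value for which p_1 ... p_{k-1} = p_{i-k+1} ... p_{i-1} holds; k = 1 always satisfies this condition, because both substrings are then empty.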

Next, the improved code for the next array is given.

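    // Fill next from the pattern string m. The caller creates it as
    // vector<int> next(m.size(), -1). The first character has subscript 0.
    void get_next(const string& m, vector<int>& next) {
        int i = -1, j = 0;
        int m_inx = static_cast<int>(m.size()) - 1;  // subscript of the last character
        // A while loop (rather than do-while) keeps patterns of length 0 or 1
        // from writing past the end of next.
        while (j < m_inx) {
            if ((i < 0) || (m[i] == m[j])) {
                i++;
                j++;
                if (m[i] != m[j])
                    next[j] = i;        // characters differ: backtrack to i as before
                else
                    next[j] = next[i];  // same character: reuse the backtracking position of i
            } else {
                i = next[i];
            }
        }
    }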
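To see the effect of the optimization on the pattern ebaeb, the two constructions can be placed side by side. The sketch below is an illustration only: the function name build_next and its optimized flag are assumptions made for this example, and subscripts start from 0 with next[0] = -1, as in the code above.

```cpp
#include <string>
#include <vector>

// Builds the backtracking table (0-based, next[0] = -1).
// With optimized == true, a position whose backtracking character equals its
// own character takes over that character's table entry instead.
std::vector<int> build_next(const std::string& m, bool optimized) {
    const int last = static_cast<int>(m.size()) - 1;
    std::vector<int> next(m.size(), -1);
    int i = -1, j = 0;
    while (j < last) {
        if (i < 0 || m[i] == m[j]) {
            ++i;
            ++j;
            next[j] = (optimized && m[i] == m[j]) ? next[i] : i;
        } else {
            i = next[i];
        }
    }
    return next;
}
```

For ebaeb, build_next("ebaeb", false) yields -1, 0, 0, 0, 1 and build_next("ebaeb", true) yields -1, 0, 0, -1, 0. The last entry drops from 1 to 0, which is next[1], so the repeated comparison of b against d is skipped.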
