String Matching: Simple Matching & KMP Algorithm

**Introduction**

String pattern matching is a common operation. It means looking for a given pattern string in a text (the text, or str). The text is generally large, while the pattern string is short. Typical examples include text editing and DNA analysis. When editing text, the text is usually a paragraph or an article and the pattern is often a word; to replace a given word you must find it throughout the whole article, so efficiency matters a great deal.

**Simple pattern matching algorithm**
The simplest approach, and the easiest to think of, is simple (naive) matching. Put simply, we compare the pattern string with the main string character by character, from left to right or from right to left. First, compare the first character of the pattern string with the first character of the main string. If they are equal, go on to compare the next pair of characters. If they are not equal, move the pattern string one position to the right and start comparing from the head of the pattern string again, and so on.

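This amounts to enumeration: every substring of the main string that has the same length as the pattern string is compared against it. The method clearly uses no insight about the strings. For example, with main string "aab" and pattern string "ab", the comparison at position 0 fails on the second character ("a" versus "b"), so the pattern string moves one position to the right and matches at position 1.

The steps above are easy to understand, and they can be coded in many ways. Here is one of them: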

    #include <string.h>

    /*
     * Simple pattern matching algorithm. Matching direction: front to back.
     * On success, the subscript in the main string is returned (only the
     * position of the first successful match). Otherwise -1 is returned.
     * If either the main string or the pattern string is NULL, -1 is returned.
     */
    int naiveStringMatching(const char *T, const char *P)
    {
        if (T && P) {
            int i, j, lenT, lenP;
            lenT = strlen(T);
            lenP = strlen(P);
            // A pattern string longer than the main string cannot match
            if (lenP > lenT)
                return -1;
            i = 0;
            while (i <= lenT - lenP) {
                j = 0;
                if (T[i] == P[j]) {
                    j++;
                    // Pointer form: while (j < lenP && *(T + i + j) == *(P + j)) j++;
                    while (j < lenP && T[i + j] == P[j])
                        j++;
                    // Reached the end of the pattern string: match found
                    if (j == lenP)
                        return i;
                }
                i++;
            }
            // Reaching this point means nothing matched
            return -1;
        }
        return -1;
    }

Since the C++ string type is sometimes used instead, here is a version for it:

    #include <string>
    using std::string;

    int naiveStringMatching(const string &T, const string &P)
    {
        int i, j, lenT, lenP;
        lenT = T.length();
        lenP = P.length();
        // An empty string, or a pattern string longer than the main string,
        // cannot match
        if (lenT == 0 || lenP == 0 || lenP > lenT)
            return -1;
        i = 0;
        while (i <= lenT - lenP) {
            j = 0;
            if (T[i] == P[j]) {
                j++;
                while (j < lenP && T[i + j] == P[j])
                    j++;
                if (j == lenP)
                    return i;
            }
            i++;
        }
        return -1;
    }

Do not underestimate the code above. Although it is not efficient, it is still worth mastering. Code and algorithms are optimized step by step, and every optimization starts from a simple case.

The time complexity of simple matching is easy to analyze. In the best case, the match succeeds at the very first alignment, and only strlen(P) comparisons (the length of the pattern string) are needed. In the worst case, every alignment is compared in full and the match is found only at the last substring of the main string that has the same length as the pattern string, for a total of (strlen(T) - strlen(P) + 1) * strlen(P) comparisons. The worst-case complexity is therefore O(strlen(T) * strlen(P)).
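To make the best and worst cases concrete, here is a minimal sketch that adds a comparison counter to the naive matcher. The function name `naiveMatchCounted` and the sample strings are illustrative assumptions and are not part of the original code.

```cpp
#include <string>

// Hypothetical helper: naive left-to-right matching that also counts how many
// character comparisons were made. Returns the subscript of the first match
// in T, or -1 if there is none; the count is written to `comparisons`.
int naiveMatchCounted(const std::string &T, const std::string &P, long &comparisons)
{
    comparisons = 0;
    int lenT = static_cast<int>(T.length());
    int lenP = static_cast<int>(P.length());
    if (lenT == 0 || lenP == 0 || lenP > lenT)
        return -1;
    for (int i = 0; i <= lenT - lenP; i++) {
        int j = 0;
        while (j < lenP) {
            comparisons++;          // one comparison of T[i + j] with P[j]
            if (T[i + j] != P[j])
                break;              // mismatch: try the next alignment
            j++;
        }
        if (j == lenP)
            return i;               // the whole pattern string matched
    }
    return -1;
}
```

With T = "abcdef" and P = "abc" the match is found at the first alignment after 3 comparisons. With T = "aaaaaaaaab" and P = "aaab" every one of the 7 alignments costs 4 comparisons, which gives (10 - 4 + 1) * 4 = 28.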

**KMP Algorithm**

The KMP algorithm is a string matching algorithm. Its efficiency comes from the fact that, when a match fails at some position, it uses the results of the matching done so far to resume from a suitable position in the pattern string instead of starting over each time. It was designed by Knuth, Morris, and Pratt, hence the name KMP.

**What KMP improves**

Looking back at the simple matching algorithm above, every mismatch is handled with i++; j = 0. Visually, this moves the pattern string one position to the right along the main string and starts a new round of comparison from the first character of the pattern string. That is the geometric meaning of i++; j = 0.

Suppose that, when a mismatch occurs, the subscript in the main string is i and the subscript in the pattern string is j. Then we can be sure that j characters have already been matched successfully, as figure (a) shows:

    T[i-j]  T[i-j+1]  ...  T[i-2]  T[i-1]  T[i]
    P[0]    P[1]      ...  P[j-2]  P[j-1]  P[j]        Figure (a)

In figure (a), every column except the last has matched. The matched subscripts of the pattern string are 0, 1, ..., j-1, which is exactly j characters.

KMP makes full use of the information already gained from the matched characters. On a mismatch, i stays unchanged and j = next[j], where next[j] <= j-1. In other words, position next[j] of the pattern string is now compared with position i of the main string. The pattern string therefore moves j - next[j] >= 1 positions to the right relative to the main string, which is generally more efficient than the single-position move of simple matching. This is the improvement KMP makes.

For T[i] of the main string to be compared directly with P[next[j]] of the pattern string, skipping the first next[j] characters of the pattern string, the following must hold:

    T[i-next[j]]  T[i-next[j]+1]  ...  T[i-2]        T[i-1]        T[i]
    P[0]          P[1]            ...  P[next[j]-2]  P[next[j]-1]  P[next[j]]      Figure (b)

That is, the characters of the pattern string at subscripts 0 to next[j]-1 are already known to match.

**Two arguments**

(1) Figure (a) makes clear that T[i-j] ... T[i-1] and P[0] ... P[j-1] are exactly the same, so the former can be treated as the latter.

(2) In figure (b), P[0] ... P[next[j]-1] is the first next[j] consecutive characters (a prefix) of P[0] ... P[j-1]. T[i-next[j]] ... T[i-1], which by the previous conclusion equals P[j-next[j]] ... P[j-1], is the last next[j] consecutive characters (a suffix) of P[0] ... P[j-1]. The two are exactly the same (they match).

**Prefix and suffix**

This brings in the concepts of prefix and suffix. Here is an example:

The string "abc"

has the prefixes "a", "ab", "abc"

and the suffixes "c", "bc", "abc".

That should need little explanation; the example makes clear what prefix and suffix mean.

**Meaning of next[j]**

With the concepts of prefix and suffix, we can state what next[j] actually means: next[j] is the maximum length of a matching prefix and suffix of the j characters P[0] ... P[j-1] of the pattern string.

Here "matching" simply means that the prefix and the suffix are equal. Why the longest? For two reasons:

(i) Intuitively, the larger next[j] is, the fewer remaining characters need to be matched and verified.

(ii) In essence, only the maximum is safe: a smaller value moves the pattern string further to the right and could skip a possible match. (Think about this!)

Note that only non-trivial prefixes and suffixes are considered here; the trivial ones are meaningless for this purpose. (The trivial prefixes and suffixes are the empty string and the string itself. All others are non-trivial.)
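The definition can be checked directly with a brute-force sketch. The helper name `nextByDefinition` is an assumption made for illustration; it only covers j >= 1 and simply tries every non-trivial length from the longest down.

```cpp
#include <string>

// Hypothetical helper: next[j] computed straight from its definition, i.e.
// the length of the longest non-trivial prefix of P[0..j-1] that is also a
// suffix of P[0..j-1]. Precondition: 1 <= j <= P.length().
int nextByDefinition(const std::string &P, int j)
{
    // Try the longest candidate first; k == j (the string itself) and
    // k == 0 (the empty string) are the trivial cases and are excluded.
    for (int k = j - 1; k >= 1; k--) {
        // Compare P[0..k-1] with P[j-k..j-1]
        if (P.compare(0, k, P, j - k, k) == 0)
            return k;
    }
    return 0;   // no non-trivial prefix equals a suffix
}
```

For the pattern string "ababca", j = 4 looks at "abab", whose longest matching prefix and suffix is "ab", so the result is 2. This brute-force version is only meant to pin down the definition; it is much slower than necessary.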

One more thing to understand: the next array is determined by the pattern string alone and has nothing to do with the main string!

To compute the next array by hand, first find, for each character, the length of the longest matching prefix and suffix of the substring that ends at that character. The next value of a character is then the length found in this first step for the previous character. The first character has no previous character, so its next value is set to -1 directly. Let's look at a specific example, the longest prefix and suffix length table of the pattern string "ababca":

| Pattern string | a | b | a | b | c | a |
|---|---|---|---|---|---|---|
| Longest prefix and suffix length | 0 | 0 | 1 | 2 | 0 | 1 |

The next array follows from the preceding table.

| Subscript | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| next | -1 | 0 | 0 | 1 | 2 | 0 |

Intuitively, the next array is the longest prefix and suffix length table shifted one position to the right: the rightmost length is dropped and the vacated leftmost slot is filled with -1.
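The right shift can be written down in a few lines. This is only a sketch; the function name `nextFromLengthTable` and the use of `std::vector` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: turns the longest prefix and suffix length table into
// the next array by shifting it one position to the right. The last length
// is dropped and the first slot is filled with -1.
std::vector<int> nextFromLengthTable(const std::vector<int> &lengths)
{
    std::vector<int> next(lengths.size());
    if (next.empty())
        return next;
    next[0] = -1;                        // the first character has no predecessor
    for (std::size_t j = 1; j < lengths.size(); j++)
        next[j] = lengths[j - 1];        // take the length found one position earlier
    return next;
}
```

Applied to the table 0, 0, 1, 2, 0, 1 of "ababca", this yields -1, 0, 0, 1, 2, 0.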

**Recursive solution of the next array**

When the pattern string is very short, computing the next array by hand is no problem. But manual computation is not a real solution; the array has to be computed by machine. In fact, the next array can be solved recursively, which is also the hardest part to understand.

(i) Initially, next[0] = -1.

(ii) Suppose next[j] = k, that is, P[0] ... P[k-1] = P[j-k] ... P[j-1] (*), where '=' means that the characters at corresponding positions match. next[j+1] is then solved in two cases:

If P[k] = P[j], then next[j+1] = k + 1 = next[j] + 1. The reason is clear from (*): if P[k] equals P[j], the longest matching prefix and suffix simply grows by one character.

If P[k] is not equal to P[j], update k = next[k]. If now P[k] = P[j], then next[j+1] = k + 1; otherwise repeat this step. If k reaches -1, no prefix can be extended and next[j+1] = 0.

We use the next array calculated by hand above for the following test: the main string is "cadabababcacadda" and the pattern string is "ababca". The next array is the same as before: next[] = {-1, 0, 0, 1, 2, 0}
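The recursion above can be sketched as follows. This is one possible formulation, assuming `std::string` and `std::vector`; the name `buildNext` is hypothetical, and the loop keeps k equal to next[j] at all times.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the recursion: next[0] = -1, and next[j + 1] is
// derived from k = next[j] by comparing P[k] with P[j].
std::vector<int> buildNext(const std::string &P)
{
    int lenP = static_cast<int>(P.length());
    std::vector<int> next(lenP);
    if (lenP == 0)
        return next;
    next[0] = -1;
    int j = 0, k = -1;                  // invariant: k == next[j]
    while (j < lenP - 1) {
        if (k == -1 || P[k] == P[j]) {
            // Either P[k] == P[j] and the prefix/suffix grows by one, or
            // k fell back to -1 and nothing can be extended (next[j + 1] = 0).
            j++;
            k++;
            next[j] = k;
        } else {
            k = next[k];                // try the next shorter prefix/suffix
        }
    }
    return next;
}
```

For "ababca" the loop falls back twice at j = 4 (k goes 2, 0, -1) before setting next[5] = 0, and the result is -1, 0, 0, 1, 2, 0, the same as the array computed by hand.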

**Code**

    #include <cstdlib>
    #include <cstring>
    #include <iomanip>
    #include <iostream>
    using namespace std;

    /* Set the values of the next array based on the pattern string P.
       P must be non-empty and next must hold strlen(P) ints. */
    void setNext(const char *P, int *next)
    {
        int j, k, lenP;
        lenP = strlen(P);
        j = 1;
        next[0] = -1;
        while (j < lenP) {
            k = next[j - 1];
            // Fall back while P[j-1] != P[k]
            while (k >= 0 && P[j - 1] != P[k])
                k = next[k];
            if (k < 0)
                next[j] = 0;
            else
                next[j] = k + 1;
            j++;
        }
    }

    /*
     * String pattern matching: KMP algorithm
     * (i) T is the main string, a string constant or a char array ending
     *     with '\0', for example "abc" or {'a', 'b', 'c', '\0'}
     * (i) P is the pattern string, in the same form as the main string
     * (i) next is the next array of P
     * (o) on success, the subscript of the first successful match in the
     *     main string is returned; otherwise -1
     */
    int KMP(const char *T, const char *P, const int *next)
    {
        if (T && P) {
            // lenT is the main string length, lenP is the pattern string length
            int lenT, lenP;
            lenT = strlen(T), lenP = strlen(P);
            // A pattern string longer than the main string cannot match
            if (lenT < lenP)
                return -1;
            int i, j;
            i = j = 0;
            // i must be allowed to reach the end of the main string: stopping
            // at lenT - lenP would miss matches that end after that position
            while (i < lenT && j < lenP) {
                if (j == -1 || T[i] == P[j]) {
                    i++;
                    j++;
                } else  // mismatch: i stays, the pattern string slides right
                    j = next[j];
            }
            if (j == lenP)
                return i - lenP;  // start of the match in the main string
            else
                return -1;
        }
        return -1;
    }

    void print(int *array, int n)
    {
        if (array && n > 0) {
            int i;
            for (i = 0; i < n; i++)
                cout << setw(4) << array[i];
            cout << endl;
        }
    }

    int main()
    {
        cout << "*** string pattern matching: KMP algorithm *** by David ***" << endl;
        char T[] = "cadababcacadda";
        char P[] = "ababca";
        cout << "Main string" << endl;
        cout << T << endl;
        cout << "Pattern string" << endl;
        cout << P << endl;
        int n = strlen(P);
        int *next = new int[n];
        setNext(P, next);
        cout << "print next array" << endl;
        print(next, n);
        cout << "Use the KMP algorithm for pattern matching" << endl;
        int index = KMP(T, P, next);
        if (index == -1)
            cout << "cannot match!" << endl;
        else
            cout << "matched successfully! The matched subscript is " << index << endl;
        delete[] next;
        system("pause");  // Windows only: keeps the console window open
        return 0;
    }

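To see how the saved comparisons add up, here is a sketch of the same matching loop over `std::string` with a comparison counter. The name `kmpCounted`, the `std::vector` parameter and the sample strings are assumptions made for illustration; the next array must follow the convention used in this article (next[0] = -1).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: KMP matching that counts character comparisons.
// `next` is the next array of P with next[0] == -1. Returns the subscript of
// the first match in T, or -1; the count is written to `comparisons`.
int kmpCounted(const std::string &T, const std::string &P,
               const std::vector<int> &next, long &comparisons)
{
    comparisons = 0;
    int lenT = static_cast<int>(T.length());
    int lenP = static_cast<int>(P.length());
    if (lenP == 0 || lenP > lenT || next.size() != P.length())
        return -1;
    int i = 0, j = 0;
    while (i < lenT && j < lenP) {
        if (j == -1) {          // nothing left to fall back to:
            i++;                // move on to the next character of T
            j = 0;
            continue;
        }
        comparisons++;          // one comparison of T[i] with P[j]
        if (T[i] == P[j]) {
            i++;
            j++;
        } else {
            j = next[j];        // i stays; the pattern string slides right
        }
    }
    return (j == lenP) ? i - lenP : -1;
}
```

For T = "aaaaaaaaab" and P = "aaab" with next = {-1, 0, 1, 2}, this sketch finds the match at subscript 6 after 16 comparisons, while the worst-case formula for simple matching, (strlen(T) - strlen(P) + 1) * strlen(P), gives 28 for the same input.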

If you reprint this article, please credit the source. Original address: http://blog.csdn.net/zhangxiangdavaid/article/details/35569257
