Analysis of the KMP string matching algorithm based on finite state automata

Source: Internet
Author: User

I. Definition

1) Main string S = s1 s2 ... sn, a string of n characters.

2) Pattern string T = t1 t2 ... tm, a string of m characters.

3) The string matching problem: given the main string S and the pattern string T, if S contains T, return the position where T first appears; otherwise, return -1.
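As a quick illustration of the contract in 3), the sketch below uses Java's built-in `String.indexOf`, which follows the same convention of returning -1 when there is no occurrence. The class and method names are illustrative and not from the article. Note that positions are 0-based here.

```java
// Illustrative only: shows the expected results of the string matching problem
// using the standard library, before any algorithm is discussed.
class ProblemStatement {
    // Returns the 0-based position of the first occurrence of t in s, or -1.
    static int firstOccurrence(String s, String t) {
        return s.indexOf(t);
    }

    public static void main(String[] args) {
        System.out.println(firstOccurrence("xxababcab", "ababcab")); // prints 2
        System.out.println(firstOccurrence("abc", "d"));             // prints -1
    }
}
```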

II. The common approach

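int j;
for (int i = 0; i + m <= n; i++) {
    // Compare T against the window of S that starts at i.
    for (j = 0; j < m && S[i + j] == T[j]; j++);
    if (j == m) return i;   // every character of T matched
}
return -1;   // T does not occur in S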
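To make the cost of this approach visible, here is a minimal runnable sketch of the same loop with a comparison counter added. The class name `NaiveSearch` and the `comparisons` field are assumptions made for illustration; they are not part of the original code.

```java
// Sketch of the brute-force search, instrumented to count character comparisons.
class NaiveSearch {
    static long comparisons = 0;

    // Returns the 0-based index of the first occurrence of t in s, or -1.
    static int indexOf(char[] s, char[] t) {
        int n = s.length, m = t.length;
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (s[i + j] != t[j]) break;  // mismatch: give up on this window
                j++;
            }
            if (j == m) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] s = "aaaaaaaaab".toCharArray();  // 10 characters
        char[] t = "aaab".toCharArray();
        System.out.println(indexOf(s, t));      // prints 6
        System.out.println(comparisons);        // prints 28
    }
}
```

For a 10-character main string the loop performs 28 comparisons, because every failed window is rescanned from its second character.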

III. Finite state automata

A finite state automaton (FA) can be defined as a 5-tuple: the input alphabet, the set of states, the initial state, the set of accepting states, and the state transition function.

The regular expressions we commonly use are equivalent to finite state automata: the languages that finite state automata can recognize are exactly the regular languages.

String matching is essentially the process of simulating a finite state automaton: construct an FA that accepts exactly the strings containing T. If the input string S is accepted by the FA, then S contains T.
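As a rough sketch of such an FA, the class below builds the full transition table directly from the definition: state q means that the first q characters of T have been matched, and reading c leads to the longest prefix of T that is a suffix of what has been seen. This is a deliberately simple construction, not the KMP one, and the class name `ContainsAutomaton` and the restriction to characters below 256 are assumptions made for illustration.

```java
import java.util.Arrays;

// A deterministic automaton that accepts exactly the strings containing t.
// Assumption: every character of t and of the input has a code below 256.
class ContainsAutomaton {
    final int[][] delta;  // delta[state][symbol] -> next state
    final int accept;     // the single accepting state, t.length()

    ContainsAutomaton(String t) {
        int m = t.length();
        accept = m;
        delta = new int[m + 1][256];
        for (int q = 0; q < m; q++) {
            for (int c = 0; c < 256; c++) {
                String seen = t.substring(0, q) + (char) c;
                // Longest prefix of t that is also a suffix of the text seen so far.
                int k = Math.min(m, q + 1);
                while (k > 0 && !seen.endsWith(t.substring(0, k))) k--;
                delta[q][c] = k;
            }
        }
        // Once t has been seen, the input is accepted whatever follows.
        Arrays.fill(delta[m], m);
    }

    boolean accepts(String s) {
        int q = 0;
        for (int i = 0; i < s.length(); i++) q = delta[q][s.charAt(i)];
        return q == accept;
    }
}
```

With T = "ababcab", `new ContainsAutomaton("ababcab").accepts("xxabababcabyy")` is true and `accepts("ababcaa")` is false. Each input character is read exactly once.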

Suppose T = "ababcab"

The FA simulated by the common approach above can be described as follows:

Each time the inner loop starts, the state is 0. If the next character is 'a', the state changes to 1, that is, j++. This continues until j = 7, at which point the string "ababcab" has been recognized and the accepting state 7 is reached. Conversely, if in state 2 the next input character is not 'a', the match fails and we must return to the starting state. In the code, this means leaving the inner loop, executing i++, and entering the inner loop again with j = 0, that is, starting over from state 0.

However, the FA above is not an FA in the general sense, and it is very inefficient, for two reasons:

1) Every time a match fails, the state returns to 0, that is, j = 0.

2) Every time a match fails, the main string is rolled back by up to j characters: reading resumes at i + 1 even though the input has already been read up to i + j.

These two rollbacks correspond to the inner and the outer loop respectively. For example, suppose the match fails when j = 2. By then we have read up to character i + 2, but after leaving the inner loop only i++ is executed, so reading resumes from character i + 1. Characters i + 1 and i + 2 have already been read, and that work should not be repeated. In addition, state 2 should remember the matching history: state 2 stands for the path 0 -> 1 -> 2, which means that "ab" has already been matched.

How can we use the matching information we already have to speed up the computation? There are two starting points:

1) When a match fails, the state returns to the most recent equivalent state. For example, the equivalent state of state 3 is state 1, because state 1 means that "a" has been matched and state 3 means that "aba" has been matched. Of course, states 1 and 3 are not completely equivalent. State 3 can be treated as equivalent to state 1 only when the next input is not 'b': in that case the full match of "aba" is no longer useful and at most its last 'a' still counts, and since 0 -> 1 also represents a matched "a", state 3 can be replaced by state 1.

2) The main string never needs to be rolled back, that is, the outer loop is no longer needed.

First, assume that every state has two transitions. One is taken when the matching character is read and moves to the next state, that is, i++ and j++. The other is an empty jump (reading ε) back to the initial state 0, that is, j = 0; it is used only on a mismatch. Because it is an empty jump, no input is consumed and i stays unchanged. This is reasonable: since S[i] != T[j], S[i] is still available for the subsequent matching.
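The two transitions just described can be sketched as a single loop over the main string. The class name `ResetToZero` is an assumption made for illustration, and the sketch also assumes that a mismatch in state 0 consumes the character, since otherwise the loop would never advance.

```java
// Sketch of the intermediate automaton: on a mismatch, jump back to state 0
// without consuming input. The main string is never rolled back.
class ResetToZero {
    static int indexOf(char[] s, char[] t) {
        int i = 0, j = 0;
        while (i < s.length && j < t.length) {
            if (s[i] == t[j]) { i++; j++; }  // matching character: advance both
            else if (j == 0) { i++; }        // state 0 consumes the character (0 -> 0)
            else { j = 0; }                  // empty jump: reset the state, keep s[i]
        }
        return j == t.length ? i - j : -1;
    }

    public static void main(String[] args) {
        char[] t = "ababcab".toCharArray();
        System.out.println(indexOf("xxababcab".toCharArray(), t)); // prints 2
        System.out.println(indexOf("abababcab".toCharArray(), t)); // prints -1
    }
}
```

The second call shows the limit of always resetting to state 0: "ababcab" does occur in "abababcab" at position 2, but the reset throws away the already matched "ab" and the occurrence is missed. Falling back to an "equivalent" state instead of state 0 is therefore needed for correct results, not only for speed.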

Based on the above analysis, we obtain the following FA:

Analysis:

1) In state 0, if the input is not 'a', the transition is 0 -> 0, and it is not an empty jump: the character is consumed. If state 0 kept taking empty jumps, the program would make no progress and fall into an endless loop.

2) For any state k, a mismatch means falling back to an earlier state. So far that earlier state has always been the initial state. To improve efficiency, let it fall back to the most recent "equivalent" state instead. How do we compute that state?

Define the "equivalent" state function next[k]. Suppose we already know next[k-1], the equivalent state of the previous state k-1. If T[next[k-1]] = T[k-1], then the transition next[k-1] -> next[k-1] + 1 has the same condition as the transition k-1 -> k, so we can define next[k] = next[k-1] + 1. Put concretely, next[k-1] being the equivalent state of k-1 means that characters 0, ..., next[k-1]-1 are the same as characters k-1-next[k-1], ..., k-2. If, in addition, character next[k-1] is the same as character k-1, then characters 0, ..., next[k-1] are the same as characters k-1-next[k-1], ..., k-1, which is exactly next[k] = next[k-1] + 1. This is a recursive process: if T[next[k-1]] != T[k-1], go on to check whether T[next[next[k-1]]] = T[k-1], and so on.

The next function can then be defined as follows:

If k = 0, next[k] = -1. /* The value -1 serves as the termination condition of the recursion below, and it also avoids an empty jump in state 0. */

If k > 0 and T[k-1] = T[next[k-1]], then next[k] = next[k-1] + 1. Otherwise, recursively check whether T[k-1] = T[next[next[k-1]]], and so on. If the recursion reaches -1 without finding a match, next[k] = 0.
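As a cross-check of this definition, the sketch below computes the same table by brute force: for k > 0, next[k] is the length of the longest proper prefix of the first k characters of T that is also a suffix of them. The class name `NextByDefinition` is an assumption made for illustration; this is a slow reference version, not the recursive computation described above.

```java
import java.util.Arrays;

// Brute-force reference for the next table, straight from its meaning.
class NextByDefinition {
    static int[] next(String t) {
        int m = t.length();
        int[] next = new int[m];
        for (int k = 0; k < m; k++) {
            if (k == 0) { next[0] = -1; continue; }  // sentinel for state 0
            String matched = t.substring(0, k);       // what state k has matched
            int len = k - 1;                           // longest proper candidate
            while (len > 0 && !matched.endsWith(t.substring(0, len))) len--;
            next[k] = len;
        }
        return next;
    }

    public static void main(String[] args) {
        // prints [-1, 0, 0, 1, 2, 0, 1]
        System.out.println(Arrays.toString(next("ababcab")));
    }
}
```

For T = "ababcab" this gives next[3] = 1, which matches the earlier observation that state 3 falls back to state 1.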

The Java implementation of the program is as follows:

int index(char src[], char pattern[])
{
    int[] next = getnext(pattern);
    int i = 0, j = 0;
    while (i < src.length && j < pattern.length) {
        if (j == -1) { // mismatch in the start state: move on to the next character
            i++;
            j = 0;
        }
        else if (pattern[j] == src[i]) { // state transition
            i++;
            j++;
        }
        else j = next[j]; // only the state j changes, i stays unchanged: an empty jump
    }
    if (j == pattern.length) return i - j;
    else return -1;
}

int[] getnext(char pattern[])
{
    int[] next = new int[pattern.length];
    if (pattern.length == 0) return next; // empty pattern: there is no state to fill in
    next[0] = -1;
    for (int i = 1; i < next.length; i++) {
        next[i] = 0; // initialized to 0
        int j = i - 1;
        while (next[j] >= 0) { // follow next recursively; -1 is the termination condition, reached when j = 0
            if (pattern[i - 1] == pattern[next[j]]) {
                next[i] = next[j] + 1;
                break;
            }
            else j = next[j];
        }
    }
    return next;
}
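To try the algorithm end to end, here is a minimal self-contained sketch. It restates index and getnext in a slightly more compact form inside a wrapper class; the class name `KmpDemo`, the `static` modifiers, and the sample inputs are assumptions made for illustration.

```java
// Self-contained demo of the KMP search with the next table.
class KmpDemo {
    // next[k]: state to fall back to from state k on a mismatch; next[0] = -1.
    static int[] getnext(char[] pattern) {
        int[] next = new int[pattern.length];
        if (pattern.length == 0) return next;
        next[0] = -1;
        for (int i = 1; i < next.length; i++) {
            int j = next[i - 1];
            // Follow the fallback chain until pattern[i - 1] can extend a prefix.
            while (j >= 0 && pattern[i - 1] != pattern[j]) j = next[j];
            next[i] = j + 1;
        }
        return next;
    }

    // Returns the 0-based index of the first occurrence of pattern in src, or -1.
    static int index(char[] src, char[] pattern) {
        int[] next = getnext(pattern);
        int i = 0, j = 0;
        while (i < src.length && j < pattern.length) {
            if (j == -1) { i++; j = 0; }                     // leave state 0 by consuming a character
            else if (pattern[j] == src[i]) { i++; j++; }     // state transition
            else j = next[j];                                // empty jump, i unchanged
        }
        return j == pattern.length ? i - j : -1;
    }

    public static void main(String[] args) {
        char[] t = "ababcab".toCharArray();
        System.out.println(index("abababcab".toCharArray(), t)); // prints 2
        System.out.println(index("ababcaa".toCharArray(), t));   // prints -1
    }
}
```

The first call finds the occurrence at position 2 without ever moving i backwards: after the mismatch in state 4, the state falls back to next[4] = 2 and matching continues from the same input character.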
