String Matching-KMP Algorithm

Source: Internet
Author: User

Finite Automaton
A finite automaton M is a quintuple (Q, q0, A, Σ, delta), where:

Q is a finite set of states,
q0 ∈ Q is the initial state,
A ⊆ Q is the set of accepting states,
Σ is a finite input alphabet,
delta is a function from Q × Σ to Q, called the transition function.
Two related functions are used below:

Phi(w) is the state M ends in after scanning the string w. Phi is defined recursively: Phi(ε) = q0, Phi(wa) = delta(Phi(w), a),
σ(x) is the length of the longest prefix of P that is also a suffix of x.
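
The definition of σ can be made concrete with a few lines of C. This is only a brute-force sketch written for illustration; the function name sigma and its signature are assumptions, not part of the original article.

```c
#include <assert.h>
#include <string.h>

/* sigma(P, x): length of the longest prefix of P that is also a suffix of x.
 * Brute force: try every candidate length from longest to shortest. */
size_t sigma(const char *P, const char *x)
{
    assert(P != NULL && x != NULL);
    size_t m = strlen(P);
    size_t n = strlen(x);
    size_t k = m < n ? m : n;   /* a candidate prefix cannot be longer than x */

    for (; k > 0; k--) {
        /* compare P[0..k-1] with the last k characters of x */
        if (strncmp(P, x + (n - k), k) == 0)
            return k;
    }
    return 0;                   /* only the empty prefix matches */
}
```

For example, with P = "ababaca", sigma(P, "ababab") evaluates to 4, because "abab" is the longest prefix of P that ends "ababab".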
String Matching Automaton
Let's first review the naive algorithm. Two strings are given: the pattern string P and the text T to be searched.

i  0  1  2  3  4  5  6  7  8  9  10
P  a  b  a  b  a  c  a
T  a  b  a  b  a  b  a  c  a  b  a
The first attempt aligns P with T at i = 0, and the scan fails at i = 5. The naive algorithm then moves P to i = 1 and starts comparing all over again. This is the step that can be improved.

At i = 5, observe in the table that P[0..3] = T[2..5]. If we simply continue by comparing T[5 + 1] with P[3 + 1], there is no need to go back and rescan from i = 1 or i = 2. This greatly improves efficiency: the matching phase takes only O(n) time, where n is the length of T.

Here P[0..3] is a prefix of P and T[2..5] is a suffix of T5, the part of the text read so far (T[0..5]). In this case σ(T5) = 4, the length of P[0..3]. So if every state transition of the automaton preserves the following invariant:

Phi(Ti) = σ(Ti)

then the final match is guaranteed to be correct. Here is a brief argument:

By the definition of Phi, Phi(Ti a) = delta(Phi(Ti), a), where a is any character.

Let q = Phi(Ti) = σ(Ti), and let Pq be the prefix of P that state q stands for. Pq is the longest prefix of P that is a suffix of Ti, so σ(Ti a) = σ(Pq a). To keep the invariant we therefore need Phi(Ti a) = σ(Pq a).

Combining the two gives delta(Phi(Ti), a) = σ(Pq a), that is, the transition function is delta(q, a) = σ(Pq a). With it you can build the correct state transition diagram and then match the string.

In words: the state q of the automaton records the longest prefix of P that is a suffix of Ti. As long as this condition holds after every step, the algorithm is correct. A detailed mathematical proof can be found in Introduction to Algorithms.
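
The construction delta(q, a) = σ(Pq a) can be sketched directly in C. This is a minimal illustration, not the author's code: the names build_automaton and automaton_match, the 256-symbol alphabet and the 63-character pattern limit are all assumptions. In this sketch a state q is the number of pattern characters matched so far (0 to m), so the accepting state is m.

```c
#include <assert.h>
#include <string.h>

#define ALPHABET 256   /* assumption: one table column per byte value */
#define MAX_M    63    /* assumption: patterns of at most 63 characters */

/* delta[q][a] = sigma(Pq a), where Pq is the first q characters of P */
static int delta[MAX_M + 1][ALPHABET];

void build_automaton(const char *P)
{
    size_t m = strlen(P);
    assert(m > 0 && m <= MAX_M);

    for (size_t q = 0; q <= m; q++) {
        for (int a = 0; a < ALPHABET; a++) {
            /* largest k such that P[0..k-1] is a suffix of Pq followed by a */
            size_t k = (q + 1 < m) ? q + 1 : m;
            while (k > 0) {
                /* last character must be a; the other k-1 must end Pq */
                if ((unsigned char)P[k - 1] == a &&
                    strncmp(P, P + (q - (k - 1)), k - 1) == 0)
                    break;
                k--;
            }
            delta[q][a] = (int)k;
        }
    }
}

/* Returns the index in T where the first occurrence of P starts, or -1. */
long automaton_match(const char *P, const char *T)
{
    size_t m = strlen(P);
    int q = 0;                                  /* Phi(empty string) = q0 */

    build_automaton(P);
    for (size_t i = 0; T[i] != '\0'; i++) {
        q = delta[q][(unsigned char)T[i]];      /* one state transition */
        if ((size_t)q == m)                     /* accepting state reached */
            return (long)(i + 1 - m);
    }
    return -1;
}
```

Building the table this way is deliberately naive and slow; it is only meant to mirror the formula. The scan itself does one table lookup per text character.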

KMP Algorithm
The KMP algorithm does not build a finite automaton. Instead it computes a prefix function, called the prefix array (next) here. The pattern P is first matched against itself to obtain the prefix array. For each prefix of P, the array stores the same kind of value that σ provides in the automaton: the longest proper prefix of P that is also a suffix of that prefix. Computing it takes time linear in the length of P, which is much cheaper than building the automaton's full transition table.

Preprocessing
Given the pattern P:

i     0  1  2  3  4  5  6  7  8  9
P     a  b  a  b  a  b  a  b  c  a
next  0  0  1  2  3  4  5  6  0  1
Here Pi denotes the prefix P[0..i] of P, and next[i] is the length of the longest proper prefix of P that is also a suffix of Pi.

When i = 0:

P0 is a single character and has no proper prefix that is also a suffix, so next[0] = 0;

When i = 1:

Comparing P1 with P, P[1] = b differs from P[0] = a, so next[1] = 0;

When i = 2:

P[2] = a equals P[0] = a, so next[2] = 1;

When i = 3:

Comparing P3 with P, P[3] = b equals P[1] = b, which extends the previous match, so next[3] = 2;

The rest of the next array is obtained the same way. Algorithm descriptions usually index arrays from 1, but in code arrays start at subscript 0, so each value of the next array above is reduced by one. next[i] = -1 then means that no prefix matches. For the code, the table therefore looks like this:

i      0   1   2   3   4   5   6   7   8   9
P      a   b   a   b   a   b   a   b   c   a
next  -1  -1   0   1   2   3   4   5  -1   0
When i = 0, initialize next[0] = -1;

When i = 1, (P[next[0] + 1] = a) != (P[1] = b), so next[1] = -1;

When i = 2, (P[next[1] + 1] = a) == (P[2] = a), so next[2] = 0;

When i = 3, (P[next[2] + 1] = b) == (P[3] = b), so next[3] = 1;

...

From this it is easy to see what the next array does: for each prefix Pi it records the longest proper prefix of P that is also a suffix of Pi. Let q be the current candidate, starting from q = next[i - 1]. The test P[q + 1] == P[i] checks whether that prefix can be extended by one more character. There are two cases:

P[q + 1] == P[i]: the match grows by one, so next[i] = q + 1, and we continue with the next i.
P[q + 1] != P[i]: we do not restart the comparison from P[0]. Instead we fall back to q = next[q] and test P[q + 1] against P[i] again. This is safe because P[0..next[q]] is both a prefix of P and a suffix of P[0..q].
The following C code computes the next array:

    void get_next(char *P, int next[], int len)
    {
        printf("len = %d\n", len);
        next[0] = -1;
        int q = -1;
        int i;
        for (i = 1; i < len; i++) {
            while (q > -1 && P[q + 1] != P[i]) { /* check whether P[q + 1] equals P[i] */
                q = next[q]; /* if not, keep falling back to the next shorter prefix that is also a suffix */
            }
            if (P[q + 1] == P[i]) q++; /* if they are equal, extend the match and continue */
            next[i] = q;
        }
    }
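
One way to sanity-check a next array is to compute it straight from the definition and compare. The sketch below does that by brute force; the name brute_next is hypothetical, and it assumes the 0-based convention used in the code above, where -1 means that no prefix matches.

```c
#include <assert.h>
#include <string.h>

/* Brute-force reference for the 0-based next array:
 * next[i] is the index of the last character of the longest proper prefix
 * of P that is also a suffix of P[0..i], or -1 if there is none. */
void brute_next(const char *P, int next[], int len)
{
    assert(P != NULL && next != NULL && len >= 0);
    for (int i = 0; i < len; i++) {
        next[i] = -1;
        /* try prefix lengths from the longest proper one (i) down to 1 */
        for (int k = i; k >= 1; k--) {
            if (strncmp(P, P + (i + 1 - k), (size_t)k) == 0) {
                next[i] = k - 1;
                break;
            }
        }
    }
}
```

For the pattern ababababca this yields -1 -1 0 1 2 3 4 5 -1 0. Patterns such as aab, where a partial match of one character has to fall all the way back to -1, are a useful extra test case.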
Matching
Once the next array has been computed, you can perform the string matching. The matching procedure is very similar to the procedure for computing next. The complete code is as follows:

    /*************************************************************************
        > File Name: KMP.c
        > Author: mr_zys
        > Mail: 247629929@163.com
        > Created Time: Thursday, October 09, 2014, 17:48:30
     ************************************************************************/

    #include <stdio.h>
    #include <string.h>
    #define maxn 100
    int next[maxn];
    char P[maxn], T[maxn];

    /* Fill next[0..len-1] for pattern P; -1 means no prefix matches. */
    void get_next(char *P, int next[], int len)
    {
        printf("len = %d\n", len);
        next[0] = -1;
        int q = -1;
        int i;
        for (i = 1; i < len; i++) {
            while (q > -1 && P[q + 1] != P[i]) { /* check whether P[q + 1] equals P[i] */
                q = next[q]; /* if not, keep falling back to the next shorter prefix that is also a suffix */
            }
            if (P[q + 1] == P[i]) q++; /* if they are equal, extend the match and continue */
            next[i] = q;
        }
    }

    /* Print the start index of every occurrence of P in T (uses the global next). */
    void KMP(char *P, char *T)
    {
        int len_P = strlen(P);
        int len_T = strlen(T);
        int j = -1; /* index of the last pattern character matched so far */
        int i;
        for (i = 0; i < len_T; i++) {
            while (j > -1 && T[i] != P[j + 1]) {
                j = next[j]; /* mismatch: fall back without moving i */
            }
            if (P[j + 1] == T[i]) {
                j++;
                // printf("%d %d\n", j, i);
            }
            if (j == len_P - 1) {
                printf("match starts at %d\n", i - len_P + 1);
                j = next[j]; /* continue, so overlapping matches are found too */
            }
        }
    }

    int main()
    {
        printf("input the string P:\n");
        scanf("%99s", P); /* width keeps the input inside the maxn-sized buffer */
        printf("input the string T:\n");
        scanf("%99s", T);
        printf("%s\n", P);
        get_next(P, next, strlen(P));
        int i;
        for (i = 0; i < (int)strlen(P); i++) {
            printf("(%d)", next[i]);
        }
        printf("\n");
        KMP(P, T);
        return 0;
    }
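
The listing above prints its results, which makes it awkward to reuse or test. A possible variant, sketched here under stated assumptions, collects the match positions in an array instead. The name kmp_search, its parameters and the MAX_PATTERN limit are hypothetical and not part of the original program.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATTERN 100   /* assumption: same bound as maxn in the listing */

/* Stores up to max_out starting indices of occurrences of P in T in out[]
 * and returns the total number of occurrences found. */
int kmp_search(const char *P, const char *T, int out[], int max_out)
{
    int next[MAX_PATTERN];
    int m = (int)strlen(P);
    int n = (int)strlen(T);
    int count = 0;

    assert(m > 0 && m <= MAX_PATTERN);

    /* prefix array, 0-based, with -1 meaning that no prefix matches */
    next[0] = -1;
    for (int i = 1, q = -1; i < m; i++) {
        while (q > -1 && P[q + 1] != P[i])
            q = next[q];
        if (P[q + 1] == P[i])
            q++;
        next[i] = q;
    }

    /* scan the text; j is the index of the last matched pattern character */
    for (int i = 0, j = -1; i < n; i++) {
        while (j > -1 && P[j + 1] != T[i])
            j = next[j];
        if (P[j + 1] == T[i])
            j++;
        if (j == m - 1) {
            if (count < max_out)
                out[count] = i - m + 1;
            count++;
            j = next[j];   /* keep going to find overlapping matches */
        }
    }
    return count;
}
```

Called with P = "aa" and T = "aaaa", it reports three occurrences, at indices 0, 1 and 2, which shows that overlapping matches are kept.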
Some of the explanations above may be unclear; corrections are welcome!