Introduction to Algorithms, Chapter 32 in detail: string matching, finite automata, and the KMP algorithm

Last Update:2018-07-26 Source: Internet

Author: User



I skipped a few chapters in the middle to read some that I thought would be easier to understand first, and they turned out to be genuinely hard. I have not read many other algorithm books, but I think Introduction to Algorithms is a good one, even if its proofs and problem discussions are a bit wordy. It does not just throw the most efficient algorithm at you. It starts from the simplest algorithm and works up to more efficient ones, which gives the reader a clearer grasp of the problem, and efficient algorithms are often built on simple ones. String matching is a case in point: the naive algorithm, then the finite automaton method, then the KMP algorithm. The Rabin-Karp algorithm still makes me a little dizzy, so let's set it aside for now.

The string matching problem is easy to state: given a pattern string P and a text T to search, find whether P occurs in T, and return the offset in T of each match.

Let the length of P be m, the length of T be n, and the alphabet be Σ.

1. The naive matching algorithm compares P against T starting from the first character of T. On a mismatch it shifts P one position to the right and compares again from the start of P, so every position in T gets a chance to be compared with P from the beginning. It is not efficient: there are n-m+1 positions in T to try, and in the worst case each one costs m character comparisons, so the worst-case running time is (n-m+1)*m.
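The naive algorithm described above can be sketched as follows. This is only an illustration: the name naiveMatch and the choice to return 0-based shifts in a vector are assumptions of this sketch, not something taken from the book.

```cpp
#include <string>
#include <vector>

// Return every 0-based shift s at which p occurs in t.
// Worst case: (n - m + 1) * m character comparisons.
std::vector<int> naiveMatch(const std::string& t, const std::string& p) {
    std::vector<int> shifts;
    int n = static_cast<int>(t.size());
    int m = static_cast<int>(p.size());
    for (int s = 0; s + m <= n; ++s) {        // try each of the n - m + 1 shifts
        int j = 0;
        while (j < m && t[s + j] == p[j]) {   // compare from the start of P
            ++j;
        }
        if (j == m) {                         // all m characters agreed
            shifts.push_back(s);
        }
    }
    return shifts;
}
```

For example, naiveMatch("abababacaba", "ababaca") finds the single shift 2.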

The way to improve on this is to filter out the invalid offsets.

2. The automaton method uses a precomputed transition function to rule out all invalid offsets in one step, although the cost of building the transition function can be quite large. Below is a brief explanation of the automaton. I could follow the proof for the automaton algorithm fairly well; the proof for the KMP algorithm later on was beyond me.

For the definition of an automaton, see a textbook; any book on compiler principles has it. A finite automaton is M = (Q, q0, A, Σ, δ).

The automaton built here is constructed from the pattern P.

The state set Q has m+1 states, 0, 1, 2, ..., m; q0 = 0 is the start state and A = {m} is the accepting state. δ maps a state p and a character a to a state q: δ(p, a) = q. For string matching we define this transition function as δ(p, a) = σ(P_p a), where P_p is the prefix of P of length p.

σ() is the suffix function: σ(x) is the length of the longest prefix of the pattern P that is also a suffix of the string x. That is a bit of a mouthful, so here is an example:

P: a b a b a c a

σ(abdcaba) = 3, because for x = a b d c a b a

the suffix aba of x is the longest prefix of P that is a suffix of x, and its length is 3.
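To make the definition concrete, here is a minimal brute-force sketch of the suffix function. The name suffixFunction and its two-argument form (pattern first, then x) are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <string>

// sigma(x): length of the longest prefix of p that is also a suffix of x.
int suffixFunction(const std::string& p, const std::string& x) {
    int k = static_cast<int>(std::min(p.size(), x.size()));
    for (; k > 0; --k) {
        // compare P_k = p[0..k) with the last k characters of x
        if (x.compare(x.size() - k, k, p, 0, k) == 0) {
            return k;
        }
    }
    return 0;   // the empty prefix is a suffix of every string
}
```

With P = ababaca, suffixFunction(P, "abdcaba") gives 3, the value from the example.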

So what is this definition for? Set aside for now how the transition function is computed and look at how the automaton works: it starts in state q0, reads characters, and moves according to the transition function. Here the input to the automaton is the text T, read one character at a time: T[1], T[2], T[3], ..., T[i]. Let's see what state the automaton is in after reading T_i under the transition function we defined. Define φ(T_i) as the state the automaton is in after the string T_i has been read.

Now we prove φ(T_i) = σ(T_i). Once this is proved, it is clear how the automaton works.

PS: T_i is the string made of the first i characters of T; T[i] is the i-th character of T.

Induct on i. For i = 0, T_0 is the empty string, σ(empty) = 0 and φ(T_0) = 0, so the claim holds: this is the initial state, with no characters read yet.

Assume φ(T_i) = σ(T_i) and prove φ(T_{i+1}) = σ(T_{i+1}). Let state p be φ(T_i) and character a be T[i+1].

φ(T_{i+1}) = φ(T_i a)        (by the definition of T_{i+1}: the state after reading T[i+1])

           = δ(φ(T_i), a)    (by the definition of φ)

           = δ(p, a)         (by the definition of p, which is φ(T_i))

           = σ(P_p a)        (by the definition of the transition function δ)

           = σ(T_i a)        (explained below)

           = σ(T_{i+1})      (since T_{i+1} = T_i a)

For the step σ(P_p a) = σ(T_i a): by the hypothesis, p = φ(T_i) = σ(T_i), so P_p is the longest prefix of P that is a suffix of T_i:

T_i a:  T[1] T[2] ... T[i-p+1] ... T[i-1] T[i] a

P_p a:                P[1]     ... P[p-1] P[p] a

The two rows agree on their last p characters, and the appended character a is obviously the same in both. A prefix of P that is a suffix of T_i a can reach at most one character beyond the p matched ones, so it fits inside P_p a, and the two σ values are equal.

With φ(T_i) = σ(T_i) we can see that a match has been found exactly when the state is m after reading T_i. So the automaton works correctly: start in state 0, read a character, move to the next state according to the transition function, and each time check whether the new state is m. If it is m, we have a match.
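The claim φ(T_i) = σ(T_i) can be checked on small inputs with a sketch like the one below. It computes each transition δ(q, a) = σ(P_q a) on demand rather than from a table, which is slow but follows the definitions literally. The names sigma and automatonStates are made up for this illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// sigma(x) for pattern p: length of the longest prefix of p that is a suffix of x.
int sigma(const std::string& p, const std::string& x) {
    int k = static_cast<int>(std::min(p.size(), x.size()));
    while (k > 0 && x.compare(x.size() - k, k, p, 0, k) != 0) {
        --k;
    }
    return k;
}

// phi(T_i) for i = 1..n: the state after each character of t has been read.
std::vector<int> automatonStates(const std::string& t, const std::string& p) {
    std::vector<int> states;
    int q = 0;                                 // phi(T_0) = 0
    for (char a : t) {
        q = sigma(p, p.substr(0, q) + a);      // delta(q, a) = sigma(P_q a)
        states.push_back(q);
    }
    return states;
}
```

Each entry of automatonStates(T, P) should equal sigma applied to the corresponding prefix of T, and an entry equal to m marks the end of a match.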

As for computing the transition function, the code here uses a brute-force search straight from the definition; see the code for details. It can be improved using the KMP-related method described later.

The proof in Introduction to Algorithms left me disoriented: it presents a transition function out of nowhere and then shows through the proof that it works. I think the automaton method would be less confusing if it started from automata theory, then arrived at the transition function, and then proved it correct.

The implementation:

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
// status[q][c] is δ(q, 'a' + c); the alphabet Σ is taken to be 'a'..'z'.
vector<vector<unsigned int>> status;

// Build the automaton for pattern p directly from the definition. There are m+1
// states (0..m); 0 is the start state and m the accepting state. δ(q, a) = σ(P_q a) = k
// means P[1..k] is the longest prefix of P that is a suffix of P[1..q]a.
// Four nested loops: O(m^3 |Σ|). This can be improved to O(m |Σ|).
void ComputeTransitionFunction(const string& p) {
    int m = p.size();
    status.assign(m + 1, vector<unsigned int>(26, 0));
    for (int q = 0; q != m + 1; ++q) {         // states 0..m, m+1 iterations
        for (int j = 0; j != 26; ++j) {        // try each of the 26 characters
            int k = min(m, q + 1);             // δ(q, a) is at most min(m, q+1)
            while (k != 0) {                   // at k == 0 the entry stays 0
                int i;
                for (i = k; i != 0; --i) {     // is P[1..k] a suffix of P[1..q]a?
                    if (i == k) {              // P[k] is compared with a
                        if (p[i - 1] != char(j + 97)) break;
                    } else {                   // the rest with the end of P[1..q]
                        if (p[i - 1] != p[q - (k - i)]) break;
                    }
                }
                if (i == 0) { status[q][j] = k; break; }  // i reached 0: this k works
                --k;                           // otherwise try a smaller k
            }
        }
    }
}

// Once the automaton is built, matching is a simple O(n) scan.
// The text is assumed to contain only the lowercase letters 'a'..'z'.
void FiniteAutomationMachine(const string& t, int m) {
    int n = t.size();
    int q = 0;                                 // initial state
    for (int i = 0; i != n; ++i) {             // read one character
        q = status[q][t[i] - 97];              // take the transition
        if (q == m) cout << i - m + 1 << endl; // state m means a match
    }
}

int main() {
    /* Generates test.txt: 1000 random lowercase letters, 50 per line.
    ofstream outfile("test.txt");
    srand(time(0));
    for (int i = 1; i != 1001; ++i) {
        int n = rand() % 26;
        outfile << char(97 + n);
        if (i % 50 == 0) outfile << endl;
    }
    */
    ifstream infile("test.txt");               // the text to search is read from a file
    string t;
    char x;
    while (infile >> x) t.push_back(x);
    string p = "fsdfsadsadfadsf";              // pattern P
    // The original timed this with the author's own MyTimer.h (not shown).
    auto start = chrono::steady_clock::now();
    ComputeTransitionFunction(p);
    FiniteAutomationMachine(t, p.size());
    auto end = chrono::steady_clock::now();
    cout << chrono::duration_cast<chrono::microseconds>(end - start).count() << "us" << endl;
    return 0;
}

Reviewing what you have learned deepens understanding, but it takes a lot of time. When I had time, I wrote up KMP as well.

The KMP algorithm uses a prefix function to avoid useless offsets. It does not do the job as thoroughly as the automaton's transition function, but it achieves the same goal, and the preprocessing time drops to O(m).

I read the proof of the method for computing the prefix function and the proof of KMP's correctness several times without following them, so I can only understand the algorithm directly from the code.

1. The prefix function Pai is computed by comparing the pattern P with itself.

Pai[q] is the maximum length of a prefix of P that is also a proper suffix of P[1..q].
For example, with P = a b a b a c a:

P[1..5]  a b a b a
P            a b a b a c a
The longest overlap is a b a, of length 3,
so Pai[5] = 3.
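The definition of Pai can be turned into code directly. The sketch below is a deliberately slow brute-force version meant only to mirror the definition, not the O(m) method KMP uses; the name prefixFunctionByDefinition and the unused index 0 are choices made for this illustration.

```cpp
#include <string>
#include <vector>

// pai[q] for q = 1..m: the largest k < q such that P[1..k] is a suffix of P[1..q].
// Index 0 is unused so that indices agree with the 1-based notation of the text.
std::vector<int> prefixFunctionByDefinition(const std::string& p) {
    int m = static_cast<int>(p.size());
    std::vector<int> pai(m + 1, 0);
    for (int q = 1; q <= m; ++q) {
        for (int k = q - 1; k > 0; --k) {              // proper prefixes, longest first
            if (p.compare(0, k, p, q - k, k) == 0) {   // P[1..k] equals the last k chars of P[1..q]
                pai[q] = k;
                break;
            }
        }
    }
    return pai;
}
```

For P = ababaca this gives Pai[1..7] = 0, 0, 1, 2, 3, 0, 1, which includes Pai[5] = 3 from the example.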
What is this good for? When scanning the text T, this information lets us skip the useless offsets.
  1 2 3 4 5 6 7 8
T a b a b a a b c
P a b a b a c a

Suppose we have reached the 6th character and it is a mismatch.
The naive algorithm simply advances P to start at position 2. The automaton, in state 5, reads the character a and goes back to state 1, which amounts to aligning P with position 6, and moves straight on to the 7th character. KMP first ignores the character a just read and advances the offset by 5 - Pai[5] = 2, so the invalid alignment at position 2 is skipped, and then compares the 6th character of T with character P[3+1]:
  1 2 3 4 5 6 7 8
T a b a b a a b c
P     a b a b a c a
They differ. Since Pai[3] = 1, advance again, still comparing the 6th character of T, now with P[1+1]:
  1 2 3 4 5 6 7 8
T a b a b a a b c
P         a b a b a c a
They differ again, and now Pai[1] = 0,
which is equivalent to cutting T off at the 6th character and starting the match with P from scratch.
So you can see that the automaton gets there in one step, because it also takes the mismatched character into account, but that work is done in advance. KMP uses the Pai function to work out the information it needs on the fly, achieves the same effect as the automaton, and saves a lot of preprocessing.
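The sequence of fallbacks in the example (5, then Pai[5] = 3, then Pai[3] = 1, then Pai[1] = 0) can be reproduced with a small helper. This is only a sketch: fallbackChain is a hypothetical name, and the Pai table is passed in rather than computed, with index 0 unused.

```cpp
#include <string>
#include <vector>

// Values of q that KMP tries when it is in state q (q characters of p matched)
// and the next text character is c. It stops at the first q where p[q] == c,
// or at q == 0. pai is 1-based: pai[0] is unused.
std::vector<int> fallbackChain(const std::string& p, const std::vector<int>& pai,
                               int q, char c) {
    std::vector<int> tried{q};
    while (q > 0 && p[q] != c) {   // P[q+1] != c: fall back
        q = pai[q];
        tried.push_back(q);
    }
    return tried;
}
```

With P = ababaca, Pai = {_, 0, 0, 1, 2, 3, 0, 1}, q = 5 and c = 'a', the chain is 5, 3, 1, 0, the same steps as in the diagrams.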

I can only understand the computation of the prefix function through the code; as for the proof, let it go with the wind.

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
// Compute the prefix function pai[1..m]; amortized analysis gives O(m).
// Before pai[q] is computed, k == pai[q-1]: P[1..k] is a suffix of P[1..q-1].
//   1. If k == 0, compare P[1] with P[q].
//   2. If k > 0 and P[k+1] != P[q], fall back to k = pai[k] and try again.
//   3. If k > 0 and P[k+1] == P[q], the match simply grows by one: ++k.
void ComputePrefixFunction(const string& p, vector<int>& pai) {
    int m = p.size();
    pai.assign(m + 1, 0);                    // pai[q] < q, so pai[1] = 0
    int k = 0;
    for (int q = 2; q != m + 1; ++q) {
        while (k > 0 && p[k] != p[q - 1]) k = pai[k];   // case 2
        if (p[k] == p[q - 1]) ++k;                      // cases 1 and 3
        pai[q] = k;
    }
}

// With the prefix function available, matching is fairly easy.
void KMPMatch(const string& t, const string& p) {
    int m = p.size();
    int n = t.size();
    vector<int> pai;
    ComputePrefixFunction(p, pai);
    int q = 0;                               // characters of P matched so far
    for (int i = 1; i != n + 1; ++i) {       // scan T from first to last character, O(n)
        // q > 0 and P[q+1] != T[i]: fall back to the longest prefix of P that is a suffix of P[1..q]
        while (q > 0 && p[q] != t[i - 1]) q = pai[q];
        // P[q+1] == T[i]: one more character matched; otherwise q stays 0 and we move on
        if (p[q] == t[i - 1]) ++q;
        if (q == m) {                        // all of P matched, ending at T[i]
            cout << i - m << endl;
            q = pai[q];                      // re-align and look for further matches
        }
    }
}

int main() {
    ifstream infile("test.txt");
    string t;
    char x;
    while (infile >> x) t.push_back(x);
    // string p = "ewqw";
    string p = "ababaca";
    // The original timed this with the author's own MyTimer.h (not shown).
    auto start = chrono::steady_clock::now();
    KMPMatch(t, p);
    auto end = chrono::steady_clock::now();
    cout << chrono::duration_cast<chrono::microseconds>(end - start).count() << "us" << endl;
    return 0;
}
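The prefix function can also bring the cost of building the automaton down from O(m^3 |Σ|) to O(m |Σ|), using the fact that δ(q, a) = δ(Pai[q], a) whenever q = m or P[q+1] differs from a. The following is a sketch of that idea, not the code from this article: the name buildAutomaton, the returned table type, and the restriction to the alphabet 'a'..'z' are all assumptions.

```cpp
#include <array>
#include <string>
#include <vector>

// delta[q][c] is the next state from state q on character 'a' + c.
// The alphabet is assumed to be the 26 lowercase letters.
std::vector<std::array<int, 26>> buildAutomaton(const std::string& p) {
    int m = static_cast<int>(p.size());
    std::vector<int> pai(m + 1, 0);                  // prefix function, pai[1..m]
    for (int q = 2, k = 0; q <= m; ++q) {
        while (k > 0 && p[k] != p[q - 1]) k = pai[k];
        if (p[k] == p[q - 1]) ++k;
        pai[q] = k;
    }
    std::vector<std::array<int, 26>> delta(m + 1);   // value-initialized to 0
    for (int q = 0; q <= m; ++q) {
        for (int c = 0; c < 26; ++c) {
            if (q < m && p[q] == 'a' + c) {
                delta[q][c] = q + 1;                 // the next pattern character matches
            } else if (q > 0) {
                delta[q][c] = delta[pai[q]][c];      // pai[q] < q, so that row is already filled
            }                                        // q == 0 and no match: stay in state 0
        }
    }
    return delta;
}
```

Each table entry is filled in constant time, so the whole table costs O(m |Σ|) after the O(m) prefix function. For P = ababaca, state 5 goes to state 1 on a, to state 4 on b, and to state 6 on c.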

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
