4 string matching algorithms: BF, naive, Rabin-Karp, finite automaton, KMP (Part 1)

Source: Internet
Author: User
Tags modulus

String matching has always been a fairly basic topic, and we studied the KMP algorithm in the undergraduate data structures course. KMP is among the most efficient of these algorithms, but it is a little difficult to understand. So in this post I introduce four matching algorithms step by step, all of which are also covered in Introduction to Algorithms. I will implement all four in C++ for reference. The table below lists the preprocessing time and matching time of each algorithm. I hope my explanation is clear; if something is unclear or wrong, please leave a comment.

String matching algorithms and their preprocessing and matching times

Algorithm                   Preprocessing time   Matching time
Naive                       0                    O((n-m+1)m)
Rabin-Karp                  Θ(m)                 O((n-m+1)m)
Finite automaton            O(m|Σ|)              Θ(n)
KMP (Knuth-Morris-Pratt)    Θ(m)                 Θ(n)

===BF algorithm (naïve pattern matching)======================================================

Before introducing the four algorithms above, let's start, as all the textbooks do, with the BF algorithm (brute-force algorithm).

The idea is simple: try every starting position in the S string and compare the T string against it character by character. The time complexity is O(m*n). Let's look at the code first:

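char *strstr(const char *str, const char *target)   /* needs <stddef.h> for NULL; rename if <string.h> is also included */
{
    if (!*target) return (char *)str;
    char *p1 = (char *)str;

    while (*p1)
    {
        char *p1begin = p1;           /* remember where this attempt started */
        char *p2 = (char *)target;
        while (*p1 && *p2 && (*p1 == *p2))
        {
            p1++;
            p2++;
        }
        if (!*p2) return p1begin;     /* the whole target matched */
        p1 = p1begin + 1;             /* otherwise retry from the next position */
    }
    return NULL;
}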

Figure: a picture that I drew to illustrate the code above.

Lines 10 to 14 are the key code: they compare the characters and advance both pointers with ++ for as long as the characters match, until a mismatch occurs or the end of the T string is reached. Finally, the code checks whether the T string has reached its end. If it has, the S string contains the T string, and the saved pointer is returned.

Time complexity of the algorithm: there are two nested while loops, so O(m*n).

The BF algorithm is a naïve algorithm, and the pseudo-code of the naïve algorithm in Introduction to Algorithms looks like this:

n = T.length
m = P.length
for s = 0 to n - m
    if P[1..m] == T[s+1..s+m]
        print "Pattern occurs with shift" s

What does this mean?

In fact, the loop only tries n-m+1 shifts. Take the picture as an example: n-m = 12, so the for loop runs from 0 to 12 and tries only 13 shifts, that is, n-m+1. If the 13th attempt also fails, T does not occur in the S string and we return NULL. If every attempt fails at its first character, the cost is O(n-m+1). In the worst case, however, each of the n-m+1 shifts needs up to m character comparisons, so the time complexity is O((n-m+1)*m); this corresponds to the two nested while loops above.
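To make the shift counting concrete, here is a minimal C++ sketch of the naïve matcher from the pseudo-code. The function name `naive_match` and the comparison counter are my own additions for illustration, not part of the original code, and shifts are reported 0-based.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration: returns every valid shift s (0-based)
// at which pattern p occurs in text t, and stores the number of character
// comparisons performed in `comparisons`.
std::vector<std::size_t> naive_match(const std::string& t, const std::string& p,
                                     std::size_t& comparisons) {
    std::vector<std::size_t> shifts;
    comparisons = 0;
    const std::size_t n = t.size(), m = p.size();
    if (m == 0 || m > n) return shifts;
    for (std::size_t s = 0; s + m <= n; ++s) {   // n - m + 1 candidate shifts
        std::size_t j = 0;
        while (j < m) {                          // at most m comparisons per shift
            ++comparisons;
            if (t[s + j] != p[j]) break;
            ++j;
        }
        if (j == m) shifts.push_back(s);
    }
    return shifts;
}
```

With the text "aaaa" and the pattern "aa", every one of the n-m+1 = 3 shifts needs m = 2 comparisons, so the counter reaches (n-m+1)*m = 6, which is the worst case described above.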

Therefore, I conclude that the BF algorithm is one form of the naïve algorithm. We call this kind of algorithm the naïve pattern matching algorithm.

Now, if all the characters in the pattern T are different, is there a way to reduce the naïve algorithm to O(n)? The answer is yes. (The time complexities here are easy to work out; if they are unclear, consult an introductory reference book.)

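#include <string.h>

// Valid only when all characters of target are different.
int StrStr1(const char *str, const char *target)
{
    int i, j;
    int n = strlen(str);
    int m = strlen(target);
    if (m == 0) return true;
    for (i = 0, j = 0; i != n; i++)
    {
        if (str[i] == target[j])
        {
            j++;
        }
        else
        {
            // restart, but str[i] may itself begin a new match
            j = (str[i] == target[0]) ? 1 : 0;
        }
        if (j == m)
            return true;
    }
    return false;
}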
This is exercise 32.1-2 in chapter 32 of Introduction to Algorithms, third edition. But I think there is more to consider: if I want to know where the match is in the S string, that is, to return a pointer into the S string, the method above is not enough, because it only reports whether such a string exists.
How can it be improved?
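char *strstr1(const char *str, const char *target)   // needs <string.h>; target's characters must all differ
{
    int i, j;
    int n = strlen(str);
    int m = strlen(target);
    char *p1 = (char *)str;
    if (m == 0) return p1;
    for (i = 0, j = 0; i != n; i++)
    {
        if (str[i] == target[j]) {
            j++;
        } else {
            j = (str[i] == target[0]) ? 1 : 0;   // str[i] may itself begin a new match
        }
        if (j == m) return (p1 + i - j + 1);     // returns the address of the location
    }
    return NULL;                                 // not found
}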

Git: Code Download: brute_force.c

===Rabin-Karp=================================================

The Rabin-Karp algorithm is more complex, because it involves some mathematics: strings are converted to numbers, and modular arithmetic is used as well. Trying to understand it from the description alone will probably leave you dizzy. So let's first read a blog post, and then practice on a problem (POJ 1200). The post and the answer follow:

The Rabin-Karp string search algorithm is a relatively fast string search algorithm with an average search time of O(n). It is based on using hashing to compare strings. Rabin-Karp is not very commonly used for string matching, but it is quite practical. Unless you are particularly unlucky, it performs well, although the worst case may require O((n-m+1)*m) running time (n and m are the lengths of the parent string and the pattern). On average it does better.

Why is the naïve string matching algorithm slow? Because it is too forgetful: information from the previous comparison could be reused in the next one, but the naïve algorithm throws it away and starts over each time, which wastes time. Using that information naturally improves the speed.

The algorithm is not easy to explain in the abstract, so here is an example (see the example in Introduction to Algorithms). Let E denote the number of letters in the alphabet. In this example the alphabet is {0,1,2,3,4,5,6,7,8,9}, so E is 10; with the lowercase English letters as the alphabet, E would be 26; and so on. Comparing two strings requires examining their characters one by one, which takes time proportional to their length, whereas two numbers can be compared in a single step. So we first convert the pattern (the string to be matched) to a number (this is not the only benefit of converting to a number). In this example we map the characters '0'~'9' to the digits 0~9. For example, "423" is converted to 3+E*(2+E*4). If the value is too large, we choose a large prime, take the value modulo that prime, and use the result as the value of the string. Then we convert the parent string the same way: take its first m characters, compute their value as above, and compare the two values. If they do not match, we move on to the next position. How is that done?

For example, the pattern is "423" and the parent string is "324232". The first step compares the values 423 and 324; they are not equal, so the next step should compare 423 with 242. How do we reuse the information from the previous step? Subtract 300 from 324 to get 24, multiply by E (here 10) to get 240, and add 2: that gives 242. As an expression, the new value is A(i+1) = (E*(A(i) - S[i]*h) + S[i+m]) mod P, where P is the large prime we selected, S[i] is the first character of the current window of the parent string, S[i+m] is the character that enters the window, A(i) is the current value (324 in this example), and h is the weight of the highest digit of the current value. For 324, h = 100, the weight of the 3; formally, h = E^(m-1) mod P. Of course, because of the modulo operation, two equal values do not guarantee that the strings are really equal, so we must check carefully (with a naïve string comparison). If the values are not equal, the position can be ruled out directly, and we continue to the next step.
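The rolling update can be checked on the "324232" example with a short C++ sketch. The helper name `window_values` and the choice of 997 as the prime are my own assumptions for illustration; 997 is larger than every 3-digit value, so the results equal the plain window values.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: values of every length-m window of the digit string s,
// in base E modulo the prime P, computed with the rolling update.
std::vector<long long> window_values(const std::string& s, std::size_t m,
                                     long long E, long long P) {
    std::vector<long long> out;
    if (m == 0 || s.size() < m) return out;
    long long h = 1;                          // h = E^(m-1) mod P
    for (std::size_t k = 1; k < m; ++k) h = (h * E) % P;
    long long a = 0;                          // first window, by Horner's rule
    for (std::size_t k = 0; k < m; ++k) a = (a * E + (s[k] - '0')) % P;
    out.push_back(a);
    for (std::size_t i = 0; i + m < s.size(); ++i) {
        // drop the leading digit, shift one place left, append the next digit
        a = (E * (a - (s[i] - '0') * h) + (s[i + m] - '0')) % P;
        if (a < 0) a += P;                    // % can yield a negative value in C++
        out.push_back(a);
    }
    return out;
}
```

Calling `window_values("324232", 3, 10, 997)` yields 324, 242, 423, 232, so the pattern value 423 is found at the third window without recomputing any window from scratch.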

Answer:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

char str[1000000];
bool hash[16000000] = {false};
int ansi[256] = {0};   // table size is garbled in the source; 256 covers every unsigned char value

int main() {
    int n, nc, ans = 0;
    scanf("%d%d%s", &n, &nc, str);
    for (char *s = str; *s; ++s) {   // *s not s
        ansi[(unsigned char)*s] = 1; // if the letter appears, assign a value of 1
    }
    int cnt = 0;
    for (int i = 0; i < 256; ++i) {
        if (ansi[i]) ansi[i] = cnt++; // numbering starting from 0
    }
    int len = strlen(str);
    for (int i = 0; i < len - n + 1; ++i) {
        int key = 0;
        for (int j = 0; j < n; ++j) {
            key = key * nc + ansi[(unsigned char)str[i + j]]; // convert to a base-nc number
            // printf("%d\n", ansi[(unsigned char)str[i + j]]);
        }
        // printf("key=%d\n", key);
        if (!hash[key]) {            // first time this substring value is seen
            ans++;
            hash[key] = true;
        }
    }
    printf("%d\n", ans);
    return 0;
}

If you worked through the problem carefully, you now have a basic understanding of the algorithm. Next we analyze the pseudo-code and some key points from Introduction to Algorithms. Let's take a look at the figure in the book.

The figure works modulo 13: 31415 mod 13 is 7, 14152 mod 13 is 8, and 67399 mod 13 is also 7. However, 31415 and 67399 are not the same number, so 67399 is a spurious hit. The steps for computing the modulus are shown clearly in part (c) of the figure, so I will not repeat them here. When two values have the same modulus, we must still compare the substring with the pattern to decide whether they are really the same. If they are, it is a true hit; otherwise it is a spurious hit.

The preprocessing time is Θ(m), because converting the pattern and the first window to numbers modulo q takes Θ(m) time (as can be seen in the pseudo-code). After that, only n-m+1 values are compared. Whenever two values are equal, an m-character comparison is made. So the worst-case time complexity is O(m*(n-m+1)).

Let's take a look at the pseudo-code:

n = len(T);
m = len(P);
h = d^(m-1) mod q;        // mod denotes the remainder operation
p = 0;
t0 = 0;
for i = 1 to m            // preprocessing, Θ(m) time
    p = (d*p + P[i]) mod q;
    t0 = (d*t0 + T[i]) mod q;
for s = 0 to n - m        // start searching inside string T
    if p == ts
        if P[1..m] == T[s+1..s+m]
            print "Pattern occurs with shift" s
    if s < n - m
        ts+1 = (d*(ts - T[s+1]*h) + T[s+m+1]) mod q   // the more important step: get the next value
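The pseudo-code can be turned into a small runnable C++ sketch for digit strings. The names `RkResult` and `rabin_karp` are hypothetical, shifts are 0-based, and the sample text in the note below is the one I recall from the book's figure, so treat it as an assumption rather than a quotation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct RkResult {
    std::vector<std::size_t> shifts;  // valid shifts (0-based)
    int spurious_hits = 0;            // values matched but the strings differed
};

// Sketch of a Rabin-Karp matcher for digit strings: d is the radix, q the modulus.
RkResult rabin_karp(const std::string& T, const std::string& P,
                    long long d, long long q) {
    RkResult r;
    const std::size_t n = T.size(), m = P.size();
    if (m == 0 || m > n) return r;
    long long h = 1;                                   // h = d^(m-1) mod q
    for (std::size_t i = 1; i < m; ++i) h = (h * d) % q;
    long long p = 0, t = 0;
    for (std::size_t i = 0; i < m; ++i) {              // preprocessing, Theta(m)
        p = (d * p + (P[i] - '0')) % q;
        t = (d * t + (T[i] - '0')) % q;
    }
    for (std::size_t s = 0; s + m <= n; ++s) {
        if (p == t) {                                  // candidate: verify it
            if (T.compare(s, m, P) == 0) r.shifts.push_back(s);
            else ++r.spurious_hits;
        }
        if (s + m < n) {                               // roll to the next window
            t = (d * (t - (T[s] - '0') * h) + (T[s + m] - '0')) % q;
            if (t < 0) t += q;                         // keep the value in [0, q)
        }
    }
    return r;
}
```

For the text "2359023141526739921", the pattern "31415", d = 10 and q = 13, this sketch reports one valid shift, 6, and one spurious hit (the window 67399), matching the values 7, 8 and 7 discussed above.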

===================

Note: the finite automaton algorithm and KMP will be covered in the next part, so please stay tuned. If you repost this article, please credit the source.
