String _1 2016.4.27

Source: Internet
Author: User

I. Strings and string matching


How can we detect and extract a local feature, itself given as a string, from string data?

This kind of operation falls under string pattern matching, or string matching for short


In general:

Given a text string T (|T| = n) and a pattern string P (|P| = m) over the same alphabet:

Determine whether T contains a substring equal to P

If it does (a match), report the starting position of that substring in T


The lengths n and m are usually both large, with n much larger than m, i.e. 2 << m << n


For example, if:

T = "Now is the time for all good people to come"

P = "people"

then the matching position is 29 (counting from 0).
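As a quick sanity check of this example, the position can be reproduced with the C++ standard library. This is only a sketch under stated assumptions: the helper name find_pattern is made up for illustration, positions are counted from 0, and the text string is assumed to read "Now is the time for all good people to come" with a lower-case pattern "people".

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the article): returns the 0-based start of
// the first occurrence of P in T, or -1 if P does not occur in T.
int find_pattern(const std::string& T, const std::string& P) {
    assert(!P.empty());  // the problem statement assumes a non-empty pattern
    const std::string::size_type pos = T.find(P);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

Calling find_pattern("Now is the time for all good people to come", "people") yields 29. The comparison is case-sensitive, so "People" with a capital P is not found.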


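II. Brute-force matching

The most direct and intuitive approach


Idea: align P with T at each candidate position, from left to right, and compare the characters one by one. On a mismatch, shift P one character to the right and start the next round.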




    #include <cstring>  // strlen

    bool match(char* P, char* T) {  // brute-force string matching (version 1)
        int n = strlen(T);          // length of the text string
        int m = strlen(P);          // length of the pattern string
        int i = 0, j = 0;
        while (j < m && i < n) {    // compare characters one by one, left to right
            if (T[i] == P[j]) {     // on a match
                ++i;                // move on to the next pair of characters
                ++j;
            } else {                // text string rolls back, pattern string resets
                i -= (j - 1);
                j = 0;
            }
        }
        if (i - j < n - m + 1) {    // match if i - j < n - m + 1; otherwise no match
            return true;
        }
        return false;
    }


    bool match(char* P, char* T) {  // brute-force string matching (version 2)
        int n = strlen(T);          // length of the text string
        int m = strlen(P);          // length of the pattern string
        int i, j;
        for (i = 0; i < n - m + 1; ++i) {   // align P with T[i], starting from the first character
            for (j = 0; j < m; ++j) {       // compare the corresponding characters one by one
                if (T[i + j] != P[j]) {
                    break;                  // mismatch: shift the pattern right by one and try again
                }
            }
            if (j == m) {
                break;                      // matching substring found
            }
        }
        if (i < n - m + 1) {        // match if i < n - m + 1; otherwise no match
            return true;
        }
        return false;
    }


Best case: a match is confirmed after a single round of comparisons

#comparisons = m = O(m)

Worst case: every round gets as far as the last character of P before failing, and this happens in every round
Comparisons per round = (m - 1) successes + 1 failure = m
Number of rounds = n - m + 1
In general m << n, so overall #comparisons = m × (n - m + 1) = O(n × m)
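The two bounds can be checked by counting comparisons directly. The sketch below instruments a version-2 style loop. The function name and the test strings (a text of zeros against the pattern "0001") are assumptions chosen to trigger the worst case, not part of the original article.

```cpp
#include <cassert>
#include <string>

// Hypothetical instrumented variant of brute-force matching (version 2 style):
// returns the number of character comparisons performed until the first match
// is found or all n - m + 1 alignments have been tried.
long long count_brute_force_comparisons(const std::string& T, const std::string& P) {
    assert(!P.empty());
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    long long comparisons = 0;
    for (int i = 0; i + m <= n; ++i) {   // one round per alignment of P with T[i]
        int j = 0;
        while (j < m) {
            ++comparisons;
            if (T[i + j] != P[j]) break; // mismatch ends this round
            ++j;
        }
        if (j == m) break;               // match found, stop counting
    }
    return comparisons;
}
```

With T made of ten '0' characters and P = "0001" (n = 10, m = 4), every one of the n - m + 1 = 7 rounds costs m = 4 comparisons, 28 in total. With T = "0001000000" the first round already matches after 4 comparisons.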


III. KMP algorithm

(1) Idea

To use the brute-force algorithm in large-scale applications, its efficiency has to be improved

A good place to start is an analysis of the worst case


A little observation shows that the problem lies in the large number of partial matches:

In each round of m comparisons, only the last one can fail; yet once it fails, the character pointers of both the text string and the pattern string are rolled back, and the next attempt starts from scratch


In fact, these repeated character comparisons are unnecessary

The characters involved were already compared successfully in the previous round, so we know everything about them.


So how can this information be used to make the matching algorithm more efficient?


Memory = Experience = Predictive Power

After the previous round of comparisons, we know for certain that the substring T[i - j, i) consists entirely of '0'

Keeping this in mind, we can predict:

In the next round, immediately after the rollback, the first j - 1 comparisons are bound to succeed.


Therefore we can simply keep i unchanged, set j = j - 1, and continue comparing

The next round then needs only 1 comparison, saving j - 1 comparisons in total!


"Keep i unchanged and set j = j - 1" can be understood as "shift P one cell to the right relative to T, then resume comparing at the position of the previous mismatch"

In fact, the technique generalizes:

Using the information (memory) provided by earlier successful comparisons, we can not only avoid rolling back the text string's character pointer, but also shift the pattern string to the right as far as possible (experience)


A general example

Let us examine a more general example.



Suppose this round of comparisons stops at the mismatch T[i] = 'E' ≠ 'O' = P[4]. With i kept unchanged, by how many units should the pattern string P be shifted to the right?

Is it necessary to try the shifts one unit at a time?


It is not hard to see that shifting by one or two units is futile in this case.

In fact, the previous comparisons guarantee that

T[i - 4, i) = P[0, 4) = "REGR"

If a match is to be achieved around this position, then at least the characters of T immediately to the left of T[i] must all be matched

For example, this is the case when P[0] is aligned with T[i - 1]


Further, since i - 1 is the leftmost position at which such an alignment is possible, we can shift P right by 4 - 1 = 3 units (equivalently, keep i unchanged and set j = 1) and then continue comparing
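The claim that shifts of one and two units are futile while a shift of three works can be checked mechanically. The sketch below assumes a pattern "REGROUP", chosen only because it is consistent with P[0, 4) = "REGR" and P[4] = 'O'; the full pattern is not given in the text. The helper name first_feasible_shift is also made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: after P[0, j) has matched T[i - j, i), returns the
// smallest shift s (1 <= s <= j) for which the part of P that still overlaps
// the matched text, P[0, j - s), agrees with it, i.e. equals P[s, j).
int first_feasible_shift(const std::string& P, int j) {
    assert(0 < j && j <= static_cast<int>(P.size()));
    for (int s = 1; s < j; ++s) {
        const std::size_t overlap = static_cast<std::size_t>(j - s);
        if (P.compare(0, overlap, P, static_cast<std::size_t>(s), overlap) == 0) {
            return s;  // first shift that is not ruled out by what we remember
        }
    }
    return j;  // nothing survives: P moves entirely past the matched part
}
```

For the assumed pattern "REGROUP" and j = 4, shifts 1 ("REG" vs "EGR") and 2 ("RE" vs "GR") are rejected and the function returns 3, which corresponds to keeping i unchanged and setting j = 4 - 3 = 1.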


(2) Next table



In general, suppose the previous round of comparisons ended at T[i] ≠ P[j]

Following the idea above, the pointer i does not have to roll back; instead, T[i] is aligned with P[t] and the next round starts from there

So what exactly should t be?


The previous round of comparisons has established that the matched range is:

P[0, j) = T[i - j, i)

Thus, for a suitably right-shifted pattern string P to match some substring of T that includes T[i], a necessary condition is:

P[0, t) = T[i - t, i) = P[j - t, j)

That is, within P[0, j), the proper prefix of length t must be identical to the proper suffix of length t, so t must come from the set:


N(P, j) = { 0 ≤ t < j | P[0, t) = P[j - t, j) }


In general, this set may contain several such t

Note, however, that the possible values of t are determined solely by the pattern string P and the mismatch position j of the previous round, and have nothing to do with the text string T!


If the next round starts by comparing T[i] with P[t], this is equivalent to shifting P to the right by j - t cells, so the larger t is, the smaller the shift

Therefore, to make sure that the alignment position of P against T (pointer i) never moves backwards and that no possible match is missed, the largest t in the set N(P, j) should be chosen

In other words, when several right shifts are worth trying, the prudent choice is the one with the shortest distance


So, if we let


next[j] = max(N(P, j))


then whenever P[j] and T[i] mismatch, we can align P[next[j]] with T[i] and continue the next round of comparisons from that position


Since the set N(P, j) depends only on the pattern string P and the mismatch position j, and not on the text string, its largest element next[j] has the same property


Therefore, for any pattern string P, the value next[j] can be precomputed for every position j and stored in a table for repeated lookup later

That is how "memory" is turned into "predictive power."
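To make the definition concrete, the set N(P, j) can be enumerated literally, without any cleverness. The sketch below is a deliberately naive illustration; the function names and the sample pattern "ABABAC" are assumptions and do not appear in the original.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: all t with 0 <= t < j and P[0, t) == P[j - t, j),
// listed in increasing order (a direct transcription of the set N(P, j)).
std::vector<int> candidate_set(const std::string& P, int j) {
    assert(0 < j && j <= static_cast<int>(P.size()));
    std::vector<int> result;
    for (int t = 0; t < j; ++t) {
        const std::size_t len = static_cast<std::size_t>(t);
        if (P.compare(0, len, P, static_cast<std::size_t>(j - t), len) == 0) {
            result.push_back(t);
        }
    }
    return result;
}

// next[j] taken straight from the definition: the largest element of N(P, j).
// The set is never empty for j > 0, because t = 0 always qualifies.
int next_by_definition(const std::string& P, int j) {
    return candidate_set(P, j).back();
}
```

For the assumed pattern "ABABAC" and j = 5, the prefix "ABABA" has self-matching lengths 0, 1 and 3, so the set contains several values and next[5] is the largest of them, 3. This brute-force enumeration is far slower than the linear-time construction and is meant only to illustrate the definition.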




(3) KMP algorithm


Knuth and his student Pratt, and independently Morris, invented this algorithm at almost the same time

They later published it jointly, and it is named after the initials of their surnames


Here we assume that the next table of the pattern string P can be constructed by build_next()

Compared with the brute-force matching algorithm (version 1), only the else branch that handles a mismatch is different, and that difference is the essence of the KMP algorithm


    bool match(char* P, char* T) {  // KMP main algorithm (version to be improved)
        build_next(P);              // fill the global next[] table for P
        int n = strlen(T);          // length of the text string
        int m = strlen(P);          // length of the pattern string
        int i = 0, j = 0;
        while (j < m && i < n) {    // compare characters one by one, left to right
            if (j < 0 || T[i] == P[j]) {  // on a match, or if P has moved past its left end
                                          // (the order of the two tests must not be swapped)
                ++i;                // move on to the next pair of characters
                ++j;
            } else {
                j = next[j];        // shift the pattern right (the text pointer does not roll back)
            }
        }
        if (i - j < n - m + 1) {    // match if i - j < n - m + 1; otherwise no match
            return true;
        }
        return false;
    }



(4) next[0] = -1

It is easy to see that whenever j > 0, we have 0 ∈ N(P, j)

So N(P, j) is non-empty, which guarantees that taking the maximum is well defined

Conversely, if j = 0, then even if the set N(P, j) can be defined, it must be empty


In that case, how should next[j = 0] be defined?


Consider the matching process again

If the very first pair of characters in a round mismatches, P should simply be shifted right by one character before the next round starts

So we may imagine an extra character P[-1] attached to the left of P[0] that matches any character.

In terms of its actual effect, this is entirely equivalent to setting next[0] = -1


A well-chosen sentinel can

- simplify the code

- unify the reasoning

(5) next[j + 1]
So, given next[0, j], how can next[j + 1] be computed recursively?

Is there an efficient method?


By the role of the next table, the first candidate for next[j + 1] is next[j] + 1; if P[j] ≠ P[next[j]], the following candidates are, in order, next[next[j]] + 1, next[next[next[j]]] + 1, ...


Therefore, by repeatedly replacing t with next[t] (that is, setting t = next[t]), these candidates can be traversed in order of priority

As soon as P[j] and P[t] are found to match (counting the wildcard at P[t = -1] as a match), set next[j + 1] = t + 1

Since next[t] < t always holds, t strictly decreases during this process

And even when t drops to 0, the process is bound to stop at the wildcard next[0] = -1, so t never underflows

Thus the correctness of the algorithm is guaranteed
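The order in which candidates are visited can be made visible by listing the successive values of t. The sketch below assumes a next table is already available as a std::vector<int> with next[0] = -1; the function name candidate_chain and the sample table (the one belonging to the assumed pattern "ABABAC") are illustrative only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: the values of t that may be visited while computing
// next[j + 1], namely next[j], next[next[j]], ... down to the sentinel -1.
// Whichever t first satisfies P[j] == P[t] (or t == -1) gives next[j + 1] = t + 1.
std::vector<int> candidate_chain(const std::vector<int>& next, int j) {
    assert(0 <= j && j < static_cast<int>(next.size()));
    std::vector<int> chain;
    int t = next[j];
    while (t >= 0) {
        chain.push_back(t);
        t = next[t];      // next[t] < t, so t strictly decreases
    }
    chain.push_back(-1);  // the wildcard position always terminates the chain
    return chain;
}
```

For the table {-1, 0, 0, 1, 2, 3} and j = 5 the chain is 3, 1, 0, -1: strictly decreasing, and guaranteed to stop at the sentinel.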

(6) Construct next table

    // next[] is assumed to be a global int array with at least strlen(P) entries
    void build_next(char* P) {  // construct the next table of pattern string P
        int m = strlen(P);
        int j = 0;                  // "main" string pointer
        int t = next[0] = -1;       // pattern string pointer
        while (j < m - 1) {
            if (t < 0 || P[j] == P[t]) {  // match
                ++j;
                ++t;
                next[j] = t;        // this statement can be improved
            } else {                // mismatch
                t = next[t];
            }
        }
    }

As can be seen, the construction algorithm for the next table is almost identical to the KMP algorithm itself.


Indeed, by the analysis above, the construction is exactly the pattern string matching against itself, so it is no surprise that the two algorithms look alike
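Putting the two pieces together, a self-contained variant might look like the sketch below. It departs from the listings above in a few ways that are assumptions of this example, not the author's code: it uses std::string and a local std::vector instead of char* and a global array, the names build_next_table and kmp_find are made up, and it returns the 0-based match position (or -1) rather than a bool.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Next table of P (unimproved version), built by matching P against itself.
std::vector<int> build_next_table(const std::string& P) {
    assert(!P.empty());
    const int m = static_cast<int>(P.size());
    std::vector<int> next(m);
    int j = 0;
    int t = next[0] = -1;              // sentinel: P[-1] acts as a wildcard
    while (j < m - 1) {
        if (t < 0 || P[j] == P[t]) {   // match (or wildcard)
            ++j;
            ++t;
            next[j] = t;
        } else {                       // mismatch: try the next candidate
            t = next[t];
        }
    }
    return next;
}

// Returns the 0-based start of the first match of P in T, or -1 if none.
int kmp_find(const std::string& T, const std::string& P) {
    const std::vector<int> next = build_next_table(P);
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    int i = 0, j = 0;
    while (j < m && i < n) {
        if (j < 0 || T[i] == P[j]) {   // the j < 0 test must come first
            ++i;
            ++j;
        } else {
            j = next[j];               // shift P right; i never moves back
        }
    }
    return (j == m) ? i - j : -1;
}
```

For instance, kmp_find("Now is the time for all good people to come", "people") returns 29, and searching for a pattern that does not occur returns -1.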



(7) Performance analysis

For this, consider the variables i and j that serve as character pointers in the KMP algorithm


Let k = 2i - j and examine how k changes while the KMP algorithm runs. It is not hard to see that:

k strictly increases with every iteration of the while loop


Indeed, there are only two cases, corresponding to the if-else branches inside the while loop:

If the if branch is taken, i and j both increase by one, so k = 2i - j increases by one

If the else branch is taken, i stays unchanged but j necessarily decreases after the assignment j = next[j], so k = 2i - j again increases


Over the whole run of the algorithm:

Initially i = j = 0, so k = 0; when the algorithm ends, i ≤ n and j ≥ 0, so k ≤ 2n


Since the integer k starts at 0, grows by at least one per iteration and never exceeds 2n at the end, the while loop executes at most 2n iterations

Moreover, the body of the while loop contains no loops or calls, so each iteration takes only O(1) time

Therefore, not counting the time needed to construct the next table, the KMP algorithm itself runs in O(n) time


Since the next table construction has essentially the same flow as the KMP algorithm, the same analysis shows that it needs only O(m) time


In conclusion, the overall running time of the KMP algorithm is O(n + m)
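The 2n bound on the number of loop iterations can also be observed empirically. The following sketch counts the iterations of the main loop; the function name kmp_loop_iterations and the test inputs are assumptions made for illustration, and the next table is built with the unimproved construction.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical instrumented KMP: returns how many times the body of the
// main while loop is executed when searching for P in T.
long long kmp_loop_iterations(const std::string& T, const std::string& P) {
    assert(!P.empty());
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());

    // Build the (unimproved) next table.
    std::vector<int> next(m);
    int t = next[0] = -1;
    int q = 0;  // position in P during construction
    while (q < m - 1) {
        if (t < 0 || P[q] == P[t]) {
            ++q;
            ++t;
            next[q] = t;
        } else {
            t = next[t];
        }
    }

    // Run the main loop, counting iterations.
    long long iterations = 0;
    int i = 0, j = 0;
    while (j < m && i < n) {
        ++iterations;  // each iteration raises k = 2i - j by at least one
        if (j < 0 || T[i] == P[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];
        }
    }
    return iterations;
}
```

On a text of 1000 '0' characters and the pattern "0001", an input that drives brute-force matching to its worst case, the count stays between n = 1000 and 2n = 2000.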


(8) Continue to improve

Although the KMP algorithm above already guarantees linear running time, there is still room for improvement in some cases


Consider the pattern string P = "000010"

Suppose that, in the KMP algorithm, the previous round of comparisons was interrupted by the mismatch T[i] = '1' ≠ '0' = P[3]

Then, following the next table, the KMP algorithm will align P[2], P[1] and P[0] with T[i] in turn and compare each of them


All three comparisons report a mismatch.

So, are these three failed comparisons a coincidence?

And can they be avoided?


In fact, even if the comparison between P[3] and T[i] is necessary, the three that follow are not

Their failure was a foregone conclusion


This is easy to see once we note that P[3] = P[2] = P[1] = P[0] = '0'

Since the previous comparison already found T[i] ≠ P[3], comparing T[i] against further characters equal to P[3] just repeats the same mistake and is futile


Memory = Lesson = Predictive Power


In terms of algorithmic strategy, the real effect of the next table is to turn memory into predictive power by using the "experience" provided by earlier successful comparisons.


In practice, however, the earlier comparisons offer more than that: there was also one failed comparison, which serves as a "lesson", but so far it has been overlooked


That failure actually gave us an extremely important piece of information, T[i] ≠ P[3], which we have not put to use.

The original algorithm performs the three unnecessary comparisons precisely because it fails to learn from this lesson


Improvement


To bring this kind of "negative" information into the next table, simply change the definition of the set N(P, j) to:


N(P, j) = { 0 ≤ t < j | P[0, t) = P[j - t, j) and P[t] ≠ P[j] }


That is, besides being a self-matching length, t must also satisfy the requirement that the current pair of characters differs before it can be included in the set N(P, j) and become a candidate for the next table entry


    void build_next(char* P) {  // construct the next table of pattern string P (improved version)
        int m = strlen(P);
        int j = 0;                  // "main" string pointer
        int t = next[0] = -1;       // pattern string pointer
        while (j < m - 1) {
            if (t < 0 || P[j] == P[t]) {  // match
                ++j;
                ++t;
                next[j] = (P[j] != P[t] ? t : next[t]);  // note the difference from the unimproved version
            } else {                // mismatch
                t = next[t];
            }
        }
    }

The only difference between the improved algorithm and the original is that, each time a proper prefix and a proper suffix of length t within P[0, j) are found to match, it additionally checks whether P[j] equals P[t]

Only if P[j] ≠ P[t] is t assigned to next[j]; otherwise next[t] is assigned instead


The improved next table construction also needs only O(m) time
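The effect of the improvement on the pattern "000010" can be seen by building both tables side by side. The sketch below folds the two constructions into one function with a flag. The name build_next_variant and the bool parameter are assumptions of this example, not part of the article's code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical combined builder: with improved == false it produces the
// original next table; with improved == true a candidate t is stored only
// if P[j] != P[t], and next[t] is stored otherwise.
std::vector<int> build_next_variant(const std::string& P, bool improved) {
    assert(!P.empty());
    const int m = static_cast<int>(P.size());
    std::vector<int> next(m);
    int j = 0;
    int t = next[0] = -1;
    while (j < m - 1) {
        if (t < 0 || P[j] == P[t]) {
            ++j;
            ++t;
            next[j] = (!improved || P[j] != P[t]) ? t : next[t];
        } else {
            t = next[t];
        }
    }
    return next;
}
```

For P = "000010" the original table is {-1, 0, 1, 2, 3, 0}, while the improved one is {-1, -1, -1, -1, 3, -1}. After a mismatch at P[3], the improved table jumps straight to -1 and so skips the three doomed comparisons against P[2], P[1] and P[0].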


Example: BJFUOJ 1553, Yi Biao


    #include <iostream>
    #include <cstdio>
    #include <cstring>
    #include <cctype>
    using namespace std;

    const int maxn = + 10;  // the base value before "+ 10" is missing in the source
    char T[maxn * maxn];
    char P[maxn * maxn];
    int Next[maxn * maxn];
    int n, m;
    int T_len, P_len;

    void build_next(char* P);
    bool match(char* P, char* T);

    // strlwr() is non-standard, so lower-case the string portably instead
    void to_lower(char* s) {
        for (; *s; ++s) {
            *s = tolower((unsigned char)*s);
        }
    }

    int main() {
        // freopen("In.txt", "r", stdin);
        while (scanf("%d%d", &n, &m) != EOF) {
            T_len = n * m;
            for (int i = 0; i < T_len; ++i) {
                cin >> T[i];
            }
            T[T_len] = '\0';  // terminate T so that stale data from earlier cases is ignored
            to_lower(T);
            scanf("%s", P);
            P_len = strlen(P);
            to_lower(P);
            if (match(P, T)) {
                printf("yes\n");
            } else {
                printf("no\n");
            }
        }
        return 0;
    }

    void build_next(char* P) {  // improved next table
        int j = 0;
        int t = Next[0] = -1;
        while (j < P_len - 1) {
            if (t < 0 || P[j] == P[t]) {
                ++j;
                ++t;
                Next[j] = (P[j] != P[t] ? t : Next[t]);
            } else {
                t = Next[t];
            }
        }
    }

    bool match(char* P, char* T) {  // KMP main algorithm
        build_next(P);
        int i = 0, j = 0;
        while (j < P_len && i < T_len) {
            if (j < 0 || T[i] == P[j]) {
                ++i;
                ++j;
            } else {
                j = Next[j];
            }
        }
        if (i - j < T_len - P_len + 1) {
            return true;
        }
        return false;
    }



