I. Strings and string matching
How can a local feature, given as a string, be detected and extracted from string data?
This kind of operation is known as string pattern matching, or string matching for short
In general terms:
For any text string T (|T| = n) and pattern string P (|P| = m) over the same alphabet:
Determine whether T contains a substring identical to P
If it does (a match), report the starting position of that substring in T
The lengths n and m are usually both large, with n much the larger, i.e. 2 << m << n
For example, if:
T = "Now is the time for all good people to come"
P = "people"
then the matching position is 29 (counting from 0).
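II. Brute-force matching
The most direct and intuitive approach
Idea: slide the pattern along the text from left to right, one character at a time, comparing character by character at each alignment until a match is found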
#include <cstring>

bool match(const char* P, const char* T) {  // brute-force string matching (version 1)
    int n = (int)strlen(T);        // length of the text string
    int m = (int)strlen(P);        // length of the pattern string
    int i = 0, j = 0;
    while (j < m && i < n) {       // compare character by character, left to right
        if (T[i] == P[j]) {        // on a match
            ++i;                   // move on to the next pair of characters
            ++j;
        } else {                   // text pointer rolls back, pattern pointer resets
            i -= (j - 1);
            j = 0;
        }
    }
    if (i - j < n - m + 1) {       // i - j < n - m + 1 means a match; otherwise no match
        return true;
    }
    return false;
}
#include <cstring>

bool match(const char* P, const char* T) {  // brute-force string matching (version 2)
    int n = (int)strlen(T);        // length of the text string
    int m = (int)strlen(P);        // length of the pattern string
    int i, j;
    for (i = 0; i < n - m + 1; ++i) {   // starting at each text position in turn,
        for (j = 0; j < m; ++j) {       // compare with the pattern character by character
            if (T[i + j] != P[j]) {     // on a mismatch, shift the whole pattern right by
                break;                  // one character and start another round
            }
        }
        if (j == m) {                   // matching substring found
            break;
        }
    }
    if (i < n - m + 1) {           // i < n - m + 1 means a match; otherwise no match
        return true;
    }
    return false;
}
Best case: a match is confirmed after only one round of comparisons
#comparisons = m = O(m)
Worst case: every round compares all the way to the last character of P before failing, and this happens round after round
Comparisons per round = (m - 1) successes + 1 failure = m
Number of rounds = n - m + 1
In general m << n, so overall #comparisons = m × (n - m + 1) = O(n × m)
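To make the two extremes concrete, here is a minimal sketch that counts character comparisons for the nested-loop variant of brute-force matching. The helper name count_brute_force_comparisons and the use of std::string are assumptions made for illustration; they are not part of the original code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the character comparisons made by brute-force
// matching (nested-loop variant) until the first match is found or every
// alignment has been tried.
long long count_brute_force_comparisons(const std::string& T, const std::string& P) {
    const std::size_t n = T.size(), m = P.size();
    long long comparisons = 0;
    if (m > n) return 0;                       // no alignment fits at all
    for (std::size_t i = 0; i + m <= n; ++i) { // one round per alignment
        std::size_t j = 0;
        while (j < m) {
            ++comparisons;                     // one comparison of T[i + j] and P[j]
            if (T[i + j] != P[j]) break;       // mismatch ends this round
            ++j;
        }
        if (j == m) break;                     // whole pattern matched at alignment i
    }
    return comparisons;
}
```

With T made of ten '0's followed by '1' (n = 11) and P = "0001" (m = 4), each of the n - m + 1 = 8 rounds costs m = 4 comparisons, so the function returns 32. With T = "0001000" and the same P, the match is confirmed in the first round after 4 comparisons.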
III. The KMP algorithm
(1) Idea
The brute-force algorithm is too slow for large-scale applications, so it needs to be improved
To that end, we start by analyzing the worst case
A little observation shows that the problem lies in the large number of partial matches:
In each round of m comparisons, only the last one can fail; once it does, the character pointers of both the text string and the pattern string are rolled back, and the next attempt starts from scratch
In fact, these repeated character comparisons are unnecessary
Since these characters were compared successfully in the previous round, we already know everything about them.
So, how can this information be used to make the matching algorithm more efficient?
Memory = Experience = Predictive power
After the previous round of comparisons (in a worst-case instance such as a pattern of the form "00...01"), we know for certain that the substring T[i - j, i) consists entirely of '0'
Remembering this property, we can predict:
In the round of comparisons immediately after the rollback, the first j - 1 comparisons are bound to succeed.
Therefore we can directly keep i unchanged, set j = j - 1, and continue comparing
The next round then needs only 1 comparison, a saving of j - 1 comparisons!
Keeping i unchanged and setting j = j - 1 can be understood as shifting P one cell to the right relative to T and then resuming the comparison at the position of the previous mismatch.
In fact, this technique can be generalized:
Using the information (memory) provided by previous successful comparisons, we can not only avoid rolling back the text pointer but also shift the pattern string as far to the right as possible (experience)
A general example
Let's examine a more general example.
Suppose this round of comparisons stops at the mismatch T[i] = 'E' ≠ 'O' = P[4]. Keeping i unchanged, how many cells should the pattern string P be shifted to the right?
Must it be shifted one cell at a time?
It is easy to see that shifting by one or two cells is futile in this case.
In fact, the earlier comparisons guarantee that
T[i - 4, i) = P[0, 4) = "REGR"
If a match is to be achieved in this neighbourhood, then at least the characters immediately to the left of T[i] must all match
For example, this is the case when P[0] is aligned with T[i - 1]
Further, since i - 1 is the leftmost position at which such an alignment is possible, we can shift P to the right by 4 - 1 = 3 cells (equivalently, keep i unchanged and set j = 1) and then continue comparing
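The claim that shifts of one or two cells are futile while a shift of three is worth trying can be checked mechanically. The sketch below tests, for each shift s, whether the shifted pattern agrees with the text already known to equal P[0, j). The function name smallest_feasible_shift is hypothetical, and the example uses only the part of the pattern quoted above ("REGR" followed by 'O').

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper. After a mismatch at P[j] (1 <= j <= P.size()), the text
// just before the mismatch equals P[0, j). Shifting P right by s cells is only
// worth trying if the overlap agrees: P[0, j - s) == P[s, j).
// Returns the smallest such s in [1, j]; s == j (empty overlap) always works.
std::size_t smallest_feasible_shift(const std::string& P, std::size_t j) {
    for (std::size_t s = 1; s < j; ++s) {
        // Compare the prefix of length j - s with the substring starting at s.
        if (P.compare(0, j - s, P, s, j - s) == 0) return s;
    }
    return j;
}
```

For P = "REGRO" and j = 4, shifts 1 and 2 fail ("REG" vs "EGR", "RE" vs "GR") and shift 3 succeeds ("R" vs "R"), so the function returns 3, in line with the 4 - 1 = 3 derived above.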
(2) The next table
In general, suppose the previous round of comparisons ended with T[i] ≠ P[j]
Following the idea above, pointer i need not roll back; instead, T[i] is aligned with P[t] and the next round begins
So what exactly should t be?
After the previous round of comparisons, the range known to match is:
P[0, j) = T[i - j, i)
Hence, if shifting P to the right is to produce a complete match with some substring of T (one that includes T[i]), a necessary condition is:
P[0, t) = T[i - t, i) = P[j - t, j)
That is, within P[0, j), the proper prefix of length t must be identical to the proper suffix of length t, so t must come from the set:
N(P, j) = { 0 ≤ t < j | P[0, t) = P[j - t, j) }
In general, this set may contain several such t
Note, however, that the admissible values of t are determined solely by the pattern string P and the position j at which the previous round failed, and have nothing to do with the text string T!
If the next round starts by comparing T[i] with P[t], this amounts to shifting P to the right by j - t cells: the larger t is, the shorter the shift
Therefore, to guarantee that the alignment of P against T (pointer i) never moves backwards while no possible match is missed, the largest t in N(P, j) should be chosen
That is, when several right shifts are worth trying, the prudent choice is the one that moves the shortest distance
So, if we let
next[j] = max(N(P, j))
then once P[j] and T[i] mismatch, we can align P[next[j]] with T[i] and continue the next round of comparisons from this position
Since the set N(P, j) depends only on the pattern string P and the mismatch position j, and not on the text string, its largest element next[j] must have the same property
Therefore, for any pattern string P, we can preprocess every position j, compute the corresponding next[j], and organize the values into a table for repeated lookup later
That is, turn "memory" into "predictive power."
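The definitions of N(P, j) and next[j] can be turned into code almost word for word. The sketch below is a deliberately naive, definition-driven version meant only for checking small examples by hand; the names candidate_set and next_by_definition are hypothetical, and returning -1 for an empty set is a convention assumed here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: all t with 0 <= t < j and P[0, t) == P[j - t, j),
// listed in increasing order. Requires 0 <= j <= P.size().
std::vector<int> candidate_set(const std::string& P, int j) {
    std::vector<int> result;
    for (int t = 0; t < j; ++t) {
        // Prefix of length t against the suffix of P[0, j) of length t.
        if (P.compare(0, t, P, j - t, t) == 0) result.push_back(t);
    }
    return result;
}

// Hypothetical helper: the largest element of N(P, j), or -1 (an assumed
// convention) when the set is empty, which happens only for j == 0.
int next_by_definition(const std::string& P, int j) {
    const std::vector<int> N = candidate_set(P, j);
    return N.empty() ? -1 : N.back();
}
```

For P = "ababa" and j = 4, the set is {0, 2}, so next[4] = 2 and the pattern shifts by j - t = 2 cells. For P = "REGRO" and j = 4, the set is {0, 1}, so next[4] = 1. This version takes quadratic time per position and is not how the table is built in practice.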
(3) The KMP algorithm
Knuth and his student Pratt, and independently Morris, invented this algorithm at almost the same time
They later published it jointly, and the algorithm is named after the initials of their surnames
Here we assume that the next table of the pattern string P can be constructed by build_next()
Compared with brute-force matching (version 1), only the else branch that handles a mismatch is different, and this is the essence of the KMP algorithm
// Assumes a global int array next[] that build_next() fills in for P.
bool match(char* P, char* T) {     // KMP main algorithm (version to be improved)
    build_next(P);
    int n = (int)strlen(T);        // length of the text string
    int m = (int)strlen(P);        // length of the pattern string
    int i = 0, j = 0;
    while (j < m && i < n) {       // compare character by character, left to right
        if (j < 0 || T[i] == P[j]) {   // on a match, or if P has moved past its left end
                                       // (the order of the two tests must not be swapped)
            ++i;                   // move on to the next pair of characters
            ++j;
        } else {
            j = next[j];           // shift the pattern right (the text pointer does not roll back)
        }
    }
    if (i - j < n - m + 1) {       // i - j < n - m + 1 means a match; otherwise no match
        return true;
    }
    return false;
}
(4) next[0] = -1
It is easy to see that whenever j > 0, we must have 0 ∈ N(P, j)
In that case N(P, j) is non-empty, which guarantees that taking the maximum is well defined
Conversely, if j = 0, then even if the set N(P, j) can be defined, it must be empty
In this case, how should next[j = 0] be defined?
Look back at the matching process:
If the very first pair of characters in a round mismatches, P should simply be shifted right by one character before the next round begins
So we may imagine "attaching" a character P[-1] to the left of P[0] that matches any character.
In effect, this is entirely equivalent to setting next[0] = -1
Clever use of a sentinel can:
- simplify the code
- unify the reasoning
(5) next[j + 1]
So, given next[0, j], how can next[j + 1] be computed recursively?
Is there an efficient method?
By the definition of the next table, the first candidate for next[j + 1] is next[j] + 1, which is valid when P[j] = P[next[j]]; failing that, the subsequent candidates are, in order, next[next[j]] + 1,
next[next[next[j]]] + 1, ...
Therefore, starting from t = next[j] and repeatedly replacing t with next[t] (that is, setting t = next[t]), these candidates can be traversed in order of priority
As soon as P[j] and P[t] match (including the wildcard match with P[t = -1]), set next[j + 1] = t + 1
Since next[t] < t always holds, t strictly decreases during this process
Moreover, even if t drops to 0, the process must stop at the wildcard given by next[0] = -1, so there is no underflow
Thus the correctness of the algorithm is guaranteed
(6) Constructing the next table
// Assumes a global int array next[] with room for at least strlen(P) entries.
void build_next(char* P) {         // build the next table of pattern string P
    int m = (int)strlen(P);
    int j = 0;                     // pointer into the "main" string
    int t = next[0] = -1;          // pointer into the pattern string
    while (j < m - 1) {
        if (t < 0 || P[j] == P[t]) {   // match
            ++j;
            ++t;
            next[j] = t;           // this statement can be improved ...
        } else {                   // mismatch
            t = next[t];
        }
    }
}
As can be seen, the algorithm that builds the next table is almost identical to the KMP algorithm itself.
In fact, by the analysis above, this construction is exactly the pattern string being matched against itself, so it is no surprise that the two algorithms look alike
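For experimenting with the construction outside the surrounding program, the same loop can be written as a self-contained function. This is a minimal sketch: returning a std::vector<int> instead of filling a global array, and the local name next_table, are choices made here for illustration.

```cpp
#include <string>
#include <vector>

// Builds the (unimproved) next table of P with the same loop as above,
// but returns it as a vector instead of writing to a global array.
std::vector<int> build_next(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> next_table(m, -1);   // next_table[0] = -1 is the sentinel
    int j = 0;                            // pointer into the "main" string
    int t = -1;                           // pointer into the pattern string
    while (j < m - 1) {
        if (t < 0 || P[j] == P[t]) {      // match (or wildcard at t == -1)
            ++j;
            ++t;
            next_table[j] = t;
        } else {                          // mismatch: fall back to the next candidate
            t = next_table[t];
        }
    }
    return next_table;
}
```

For P = "chinchilla" the function returns -1 0 0 0 0 1 2 3 0 0, and for P = "000010" it returns -1 0 1 2 3 0.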
(7) Performance analysis
To bound the running time, focus on the variables i and j used as character pointers in the KMP algorithm
Let k = 2i - j and examine how k changes as the KMP algorithm runs; it is easy to see that:
k strictly increases with every iteration of the while loop
In fact, there are only two cases, corresponding to the if-else branches inside the while loop:
If the if branch is taken, i and j both increase by one, so k = 2i - j increases
Conversely, if the else branch is taken, i stays unchanged but j must decrease after the assignment j = next[j], so k = 2i - j again increases
Over the whole run of the algorithm:
It starts with i = j = 0, i.e. k = 0; when it ends, i ≤ n and j ≥ 0, so k ≤ 2n
Throughout, the integer k starts at 0 and grows by at least 1 per iteration while never exceeding 2n, so the while loop runs for at most 2n iterations
Moreover, the loop body contains no further loops or calls, so each iteration takes only O(1) time
Therefore, not counting the time needed to build the next table, the KMP algorithm itself runs in O(n) time
Since the construction of the next table follows essentially the same flow as the KMP algorithm, the same analysis shows that building the table takes only O(m) time
In conclusion, the overall running time of the KMP algorithm is O(n + m)
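The 2n bound on the number of loop iterations can be observed directly by instrumenting the main loop. The sketch below is an illustration under assumptions: the struct KmpResult, the function names, and the choice to return the match position (or -1) instead of a bool are all introduced here and are not part of the original code.

```cpp
#include <string>
#include <vector>

// Hypothetical result type: where the first match starts (-1 if none)
// and how many iterations the main while loop executed.
struct KmpResult {
    int position;
    long long iterations;
};

// Unimproved next table, built with the same loop as in the text.
static std::vector<int> build_next_table(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> next_table(m, -1);
    int j = 0, t = -1;
    while (j < m - 1) {
        if (t < 0 || P[j] == P[t]) { ++j; ++t; next_table[j] = t; }
        else { t = next_table[t]; }
    }
    return next_table;
}

// KMP main loop with an iteration counter added.
KmpResult kmp_search_counted(const std::string& T, const std::string& P) {
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    const std::vector<int> next_table = build_next_table(P);
    int i = 0, j = 0;
    long long iterations = 0;
    while (j < m && i < n) {
        ++iterations;                     // k = 2i - j grows by at least 1 each time
        if (j < 0 || T[i] == P[j]) { ++i; ++j; }
        else { j = next_table[j]; }       // text pointer i never moves back
    }
    const int position = (j == m) ? i - j : -1;
    return KmpResult{position, iterations};
}
```

For T = "abcabcabd" (n = 9) and P = "abcabd", the search reports position 3 after 10 iterations, within the bound 2n = 18. For a text of twenty '0's and P = "0001" there is no match, and the loop runs 37 times, still below 2n = 40.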
(8) Further improvement
Although the KMP algorithm above already guarantees linear running time, there is still room for improvement in some cases
Consider the pattern string P = "000010"
Suppose that, during the KMP algorithm, the previous round of comparisons was interrupted by the mismatch T[i] = '1' ≠ '0' = P[3]
Then, following the next table, the KMP algorithm will align P[2], P[1] and P[0] with T[i] in turn and compare them
All three comparisons report a mismatch.
So, is the failure of these three comparisons accidental?
Further, can these comparisons be avoided?
In fact, even though the comparison between P[3] and T[i] is necessary, the three that follow are not
Their failure was a foregone conclusion
This is easy to see once we note that P[3] = P[2] = P[1] = P[0] = '0'
Since the previous comparison has already found T[i] ≠ P[3], comparing T[i] with further characters identical to P[3] merely repeats the same mistake and is bound to be futile
Memory = Lessons = Predictive power
In terms of algorithmic strategy, the real effect of introducing the next table is to help us turn memory into predictive power by using the "experience" provided by previous successful comparisons.
In practice, however, the comparisons made so far include more than these: they also include the failed ones, which are just as valuable as "lessons" but have been overlooked until now
Past failed comparisons actually give us an extremely important piece of information, namely T[i] ≠ P[3], which unfortunately we have not used effectively.
The original algorithm performs the three unnecessary comparisons that follow precisely because it fails to learn this lesson fully
Improvement
To bring this kind of "negative" information into the next table, simply change the definition of the set N(P, j) to:
N(P, j) = { 0 ≤ t < j | P[0, t) = P[j - t, j) and P[t] ≠ P[j] }
That is, besides being a self-matching length, t must also satisfy the requirement that the current pair of characters differs before it can be admitted to the set N(P, j) as a candidate for the next table entry
// Assumes a global int array next[] with room for at least strlen(P) entries.
void build_next(char* P) {         // build the next table of pattern string P (improved version)
    int m = (int)strlen(P);
    int j = 0;                     // pointer into the "main" string
    int t = next[0] = -1;          // pointer into the pattern string
    while (j < m - 1) {
        if (t < 0 || P[j] == P[t]) {   // match
            ++j;
            ++t;
            next[j] = (P[j] != P[t] ? t : next[t]);   // note how this differs from the unimproved version
        } else {                   // mismatch
            t = next[t];
        }
    }
}
The only difference between the improved algorithm and the original is that, each time a proper prefix and a proper suffix of length t in P[0, j) are found to match, it further checks whether P[j] equals P[t]
Only if P[j] ≠ P[t] is t assigned to next[j]; otherwise next[t] is assigned instead
The improved algorithm for building the next table also takes only O(m) time
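The effect of the improvement on the example P = "000010" can be checked with a small experiment that builds both tables and counts character comparisons during a search. This is a sketch under assumptions: the boolean flag improved, the function names, and the comparison counter are introduced here for illustration only.

```cpp
#include <string>
#include <vector>

// Builds the next table of P; with improved == true, an entry t is replaced
// by next_table[t] whenever P[j] == P[t], as in the improved version above.
std::vector<int> build_next_table(const std::string& P, bool improved) {
    const int m = static_cast<int>(P.size());
    std::vector<int> next_table(m, -1);
    int j = 0, t = -1;
    while (j < m - 1) {
        if (t < 0 || P[j] == P[t]) {
            ++j;
            ++t;
            next_table[j] = (improved && P[j] == P[t]) ? next_table[t] : t;
        } else {
            t = next_table[t];
        }
    }
    return next_table;
}

// Runs the KMP main loop with the given table and counts how many times
// a text character is actually compared with a pattern character.
long long count_kmp_comparisons(const std::string& T, const std::string& P,
                                const std::vector<int>& next_table) {
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    int i = 0, j = 0;
    long long comparisons = 0;
    while (j < m && i < n) {
        bool advance = (j < 0);           // wildcard sentinel: no comparison needed
        if (!advance) {
            ++comparisons;
            advance = (T[i] == P[j]);
        }
        if (advance) { ++i; ++j; } else { j = next_table[j]; }
    }
    return comparisons;
}
```

For P = "000010" the original table is -1 0 1 2 3 0 and the improved table is -1 -1 -1 -1 3 -1. Searching T = "0001000010" takes 13 comparisons with the original table and 10 with the improved one; the difference is exactly the three futile comparisons of T[3] = '1' against P[2], P[1] and P[0].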
An exercise: BJFUOJ 1553 (Yi Biao)
#include <cctype>
#include <cstdio>
#include <cstring>
#include <iostream>

const int maxn = + 10;             // the leading digits of this bound are missing in the source
char T[maxn * maxn];
char P[maxn * maxn];
int next[maxn * maxn];
int n, m;
int t_len, p_len;

void build_next(char* P);
bool match(char* P, char* T);

// Lower-case the first len characters in place (portable stand-in for the non-standard strlwr)
void to_lower(char* s, int len) {
    for (int k = 0; k < len; ++k) {
        s[k] = (char)std::tolower((unsigned char)s[k]);
    }
}

int main() {
    // freopen("in.txt", "r", stdin);
    while (scanf("%d%d", &n, &m) != EOF) {
        t_len = n * m;
        for (int i = 0; i < t_len; ++i) {
            std::cin >> T[i];      // read the n x m table, skipping whitespace
        }
        to_lower(T, t_len);
        scanf("%s", P);
        p_len = (int)strlen(P);
        to_lower(P, p_len);
        if (match(P, T)) {
            printf("yes\n");
        } else {
            printf("no\n");
        }
    }
    return 0;
}

void build_next(char* P) {         // improved next table
    int j = 0;
    int t = next[0] = -1;
    while (j < p_len - 1) {
        if (t < 0 || P[j] == P[t]) {
            ++j;
            ++t;
            next[j] = (P[j] != P[t] ? t : next[t]);
        } else {
            t = next[t];
        }
    }
}

bool match(char* P, char* T) {     // KMP main algorithm
    build_next(P);
    int i = 0, j = 0;
    while (j < p_len && i < t_len) {
        if (j < 0 || T[i] == P[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];
        }
    }
    if (i - j < t_len - p_len + 1) {
        return true;
    }
    return false;
}
String _1 2016.4.27