I believe we have all searched for text under Linux. For example, when we use VIM to look for a word or a paragraph in a text file, Linux responds quickly with the result, which is handy and fast!
So, have you ever wondered how Linux correctly finds the string we need in a huge amount of text? This is where pattern matching algorithms come in!
1. Pattern matching
What is pattern matching?
- Pattern matching is the operation of locating a substring P (the pattern string) in a main string T (the target string). It is also known as string matching.
Suppose we have two strings: T (target string) and P (pattern string). Locating the pattern string P in the target string T is called pattern matching.
There are two kinds of results for pattern matching:
- Successful match: a substring equal to P is found in the target string T, and the starting position of P in T is returned;
- Unsuccessful match: -1 is returned.
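As a minimal sketch of these two outcomes, the C++ standard library's `std::string::find` returns the starting index of the first occurrence, or `std::string::npos` when there is none. The wrapper name `locate` is hypothetical and is introduced here only to map `npos` to the -1 convention used in this article.

```cpp
#include <string>

// Hypothetical helper (not part of the article's code):
// returns the index in T where P first occurs, or -1 if P does not occur.
int locate(const std::string &T, const std::string &P)
{
    std::string::size_type pos = T.find(P);   // standard library search
    return (pos == std::string::npos) ? -1 : static_cast<int>(pos);
}
```

Under these assumptions, `locate("ababcabcacbab", "abcac")` gives the starting position of the match, while a pattern that never occurs gives -1.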
There are many pattern matching algorithms, such as BF, KMP, BM, RK, and Sunday, and each works differently. Here we focus on the BF and KMP algorithms because they are the most commonly used.
2. BF algorithm
BF, the Brute-Force algorithm, is also known as the naive matching algorithm. It is simple, but not very efficient!
1). Algorithmic thinking
Basic idea:
- Compare the first character of the target string T with the first character of the pattern string P.
- If they are equal, compare the second character of T with the second character of P, and so on.
- If they are not equal, restart: compare the character after the current starting position in T with the first character of P.
- Repeat the steps above until the match succeeds or the target string T ends.
[Figure: flowchart of the BF algorithm]
For example, let T = "ababcabcacbab" and P = "abcac". The matching process is as follows:
- Step 1: Compare the main string T and the substring P character by character from the start. At position 2, T[2] = 'a' and P[2] = 'c' are not equal, so record where each comparison ended and go to Step 2.
- Step 2: Move the starting position in T forward by one and compare T with P again from the beginning of P, just as in Step 1.
- Step 3: Every attempt restarts P at index 0, while the new starting position in T depends on where the previous attempt ended. T sometimes has to "backtrack", that is, move back from the position where the last comparison ended. For example, Step 1 ends at position 2 and Step 2 starts at 1; Step 3 ends at position 6 and Step 4 starts at 3.
- Step 4: In general, after a mismatch the index i of the main string T and the index j of the substring P are reset as i = i - j + 1 and j = 0.
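The steps above can be traced with a short sketch. It is written with an explicit start index, which is equivalent to the i = i - j + 1 rule, and it records where each attempt starts and where it stops in T. The function name `bfTrace` and the output parameter `attempts` are hypothetical names chosen for this illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Runs brute-force matching and records one (start, stop) pair per attempt:
// start = where the attempt began in T,
// stop  = index in T of the mismatch (or one past the match on success).
// Returns the index of the first match, or -1.
int bfTrace(const std::string &T, const std::string &P,
            std::vector<std::pair<int, int>> &attempts)
{
    int n = static_cast<int>(T.length());
    int m = static_cast<int>(P.length());
    for (int start = 0; start + m <= n; ++start)
    {
        int j = 0;                                // P always restarts at index 0
        while (j < m && T[start + j] == P[j])
            ++j;
        attempts.push_back({start, start + j});
        if (j == m)
            return start;                         // whole pattern matched
    }
    return -1;
}
```

For T = "ababcabcacbab" and P = "abcac", the first attempt is recorded as (0, 2) and the third as (2, 6), which matches the end positions 2 and 6 described in the steps; the attempt after each of them starts at 1 and 3 respectively, so T is backtracked.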
2). Code implementation
#include <string>
using std::string;

/*-----------------------------------------------------------------------------
 * Function: BF - find the first occurrence of P in T
 * Input:    Pattern string P, Target string T
 * Output:   If matched: the index of the first matched character
 *           else: -1
 *---------------------------------------------------------------------------*/
int BF(const string &T, const string &P)
{
    int tlen = (int)T.length();
    int plen = (int)P.length();
    int j = 0, i = 0, ret = 0;
    while ((j < plen) && (i < tlen))
    {
        if (P[j] == T[i])   // characters are equal, so keep going
        {
            i++;
            j++;            // move on to the next character of both strings
        }
        else
        {
            i = i - j + 1;  // mismatch: restart from the next position in T
            j = 0;          // and from the beginning of P
        }
    }
    if (j >= plen)          // whole pattern matched (also covers a match at the very end of T)
        ret = i - plen;     // index of the first matched character
    else
        ret = -1;
    return ret;
}
3). Efficiency analysis
Efficiency analysis is mainly about time complexity and space complexity. Here the space complexity is low, so we set it aside and look at the time complexity.
Time complexity is usually analyzed for the worst case. For the BF algorithm, the worst case looks like this:
T = "ggggggggk", P = "ggk"
Suppose the match succeeds on the i-th attempt. Each of the first i-1 attempts fails only after m comparisons (m is the length of the pattern string P), which gives (i-1)m comparisons, and the successful i-th attempt also needs m comparisons, so the total is i*m comparisons.
For a main string T of length n, i ranges from 1 to n-m+1. Assume the match succeeds at each position with the same probability p_i = 1/(n-m+1). In the worst case, the average number of comparisons C_max can then be expressed as:
$$C_{max} = \sum_{i=1}^{n-m+1} p_i \cdot (i \cdot m) = \frac{m}{n-m+1} \sum_{i=1}^{n-m+1} i = \frac{m(n-m+2)}{2}$$
Generally n >> m, so the time complexity of BF is O(m*n).
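To make the i*m count concrete, here is a minimal sketch that counts the character comparisons BF performs. The function name `bfComparisons` is a hypothetical name for this illustration; the loop follows the same i = i - j + 1 rule as the BF code above.

```cpp
#include <string>

// Counts how many character comparisons brute-force matching performs
// before it finds the first match or reaches the end of T.
long bfComparisons(const std::string &T, const std::string &P)
{
    int n = static_cast<int>(T.length());
    int m = static_cast<int>(P.length());
    long count = 0;
    int i = 0, j = 0;
    while (i < n && j < m)
    {
        ++count;                    // one comparison of T[i] with P[j]
        if (T[i] == P[j])
        {
            ++i;
            ++j;
        }
        else
        {
            i = i - j + 1;          // backtrack in T
            j = 0;                  // restart P
        }
    }
    return count;
}
```

With eight 'g' characters followed by 'k' as the target (n = 9) and "ggk" as the pattern (m = 3), the match succeeds on attempt i = n-m+1 = 7, so `bfComparisons(std::string(8, 'g') + "k", "ggk")` counts 7 * 3 = 21 comparisons.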
3. KMP algorithm
The BF algorithm has to backtrack after every mismatch, which leads to a high time complexity. Is there a more efficient pattern matching algorithm?
The answer is yes: the KMP algorithm.
1). Terminology
Before explaining the algorithm, the following terms must be made clear, otherwise the algorithm cannot be understood:
- Target string T: the large body of text to be searched
- Pattern string P: the string we need to find
- String prefix: any leading part of the string that excludes the last character. For example, the prefixes of "ABCD" are "A", "AB", and "ABC", but not "ABCD"
- String suffix: any trailing part of the string that excludes the first character. For example, the suffixes of "ABCD" are "D", "CD", and "BCD", but not "ABCD"
- Prefix-suffix match length k: the length of the longest prefix that is equal to a suffix. For example, "ABCAB" has k = 2, because the prefix "AB" equals the suffix "AB".
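The prefix-suffix match length k can be computed directly from these definitions. The following is a minimal sketch that simply tries every candidate length from longest to shortest; the function name `prefixSuffixLen` is hypothetical, and this is not how KMP itself computes the value (KMP uses the faster next-array construction shown later).

```cpp
#include <string>

// Length of the longest prefix of s (excluding s itself) that is also
// a suffix of s (excluding s itself). Returns 0 if there is none.
int prefixSuffixLen(const std::string &s)
{
    int len = static_cast<int>(s.length());
    for (int k = len - 1; k > 0; --k)
    {
        // compare the first k characters with the last k characters
        if (s.compare(0, k, s, len - k, k) == 0)
            return k;
    }
    return 0;
}
```

For example, `prefixSuffixLen("ABCD")` is 0 because no prefix of "ABCD" equals one of its suffixes, while `prefixSuffixLen("ABCA")` is 1 because the prefix "A" equals the suffix "A".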
2). Algorithmic thinking
The core idea of the KMP algorithm is that, after a partial match fails, the position in the main string is never moved back to a character that has already been compared (no backtracking). Instead, matching continues forward, using what the previous comparisons have already revealed.
The concept is rather abstract, so we explain it with an example, using the same strings as before (T = "ababcabcacbab", P = "abcac"):
Step 2: the characters keep matching until a mismatch occurs at T[6].
Step 3: the position in T does not backtrack and remains at T[6] (the KMP algorithm stipulates that the target string T never backtracks; the position where the last attempt ended is where the next one starts).
The index in P starts at 1 instead of 0 for the following reason:
In Step 2, T[5] = 'a' was already compared and found equal to P[3]. Since P[0] == P[3], we already know that P[0] equals T[5], so comparing them again is unnecessary. More precisely, the part of P before its Step 2 end position P[4], namely P[0..3], has prefix-suffix match length k = 1, so P[k] is aligned with T[6], the position where the last comparison in the main string ended.
From the above analysis, the key step in the KMP algorithm is to find the prefix-suffix match length k of the part of the substring P that precedes its end position.
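The example can be replayed with a short sketch that records every mismatch: the position i in T, the position j in P, and the position in P where matching resumes. It assumes the example strings are T = "ababcabcacbab" and P = "abcac"; the names `Shift` and `kmpTrace` are hypothetical and exist only for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical record of one mismatch: i is the index in T, j is the index
// in P, and nextJ is where P resumes (the prefix-suffix match length, or -1).
struct Shift { int i; int j; int nextJ; };

// KMP search that logs each mismatch. Returns the match index or -1.
int kmpTrace(const std::string &T, const std::string &P, std::vector<Shift> &shifts)
{
    int n = static_cast<int>(T.length());
    int m = static_cast<int>(P.length());

    // Build the next array: next[0] = -1, next[j] = match length of P[0..j-1].
    std::vector<int> next(m + 1, 0);
    next[0] = -1;
    int a = 0, k = -1;
    while (a < m - 1)
    {
        if (k == -1 || P[a] == P[k]) { ++a; ++k; next[a] = k; }
        else                         { k = next[k]; }
    }

    int i = 0, j = 0;
    while (i < n && j < m)
    {
        if (j == -1 || T[i] == P[j]) { ++i; ++j; }   // i only ever moves forward
        else
        {
            shifts.push_back({i, j, next[j]});
            j = next[j];                             // only P is repositioned
        }
    }
    return (j >= m) ? i - m : -1;
}
```

Under these assumptions the sketch logs two mismatches, (i = 2, j = 2, resume at 0) and (i = 6, j = 4, resume at 1). The second entry is the situation described above: T stays at T[6] and P resumes at index 1.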
[Table: prefix and suffix analysis of the pattern string P = "ABCABCA", including the prefix-suffix match length k]
For every position of P at which an attempt can end, we can work out in advance where in P the next attempt should start:
- j is the position in P where the current attempt ended (the mismatch position);
- next[j] is the position in the pattern string P where the next attempt starts
PS: next[j] is the prefix-suffix match length k of the part of P before position j
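According to the above discussion, we can write down the formula for next[j]:
$$next[j] = \begin{cases} -1, & j = 0 \\ \max\{\, k \mid 0 < k < j,\ P[0 \ldots k-1] = P[j-k \ldots j-1] \,\}, & \text{if such a } k \text{ exists} \\ 0, & \text{otherwise} \end{cases}$$
Here, -1 is a marker: it indicates that the target string moves on to its next position, while the pattern string P restarts from position 0.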
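The definition of next[j] can be evaluated literally: next[0] is -1, and for j > 0 it is the largest k with 0 < k < j such that the first k characters of P equal the k characters ending just before position j, or 0 if no such k exists. The sketch below does exactly that; the function name `nextByDefinition` is hypothetical, and this direct evaluation is slower than the incremental construction used in the KMP code, so it is meant only as a way to check values by hand.

```cpp
#include <string>
#include <vector>

// Evaluates next[j] straight from its definition for every position j of P.
std::vector<int> nextByDefinition(const std::string &P)
{
    int m = static_cast<int>(P.length());
    std::vector<int> next(m, 0);
    for (int j = 0; j < m; ++j)
    {
        if (j == 0)
        {
            next[j] = -1;           // marker value for the first position
            continue;
        }
        for (int k = j - 1; k > 0; --k)
        {
            // does P[0..k-1] equal P[j-k..j-1]?
            if (P.compare(0, k, P, j - k, k) == 0)
            {
                next[j] = k;        // largest such k wins
                break;
            }
        }
    }
    return next;
}
```

For the pattern string "ABCABCA", this yields the values -1, 0, 0, 0, 1, 2, 3 for j = 0 to 6.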
If the above is not clear, it does not matter: just remember the next[j] function, because everything else is built on it!
3). Code implementation
#include <string>
#include <vector>
using std::string;

/*-----------------------------------------------------------------------------
 * Function: getNext - build the next array of the pattern string P
 * Input:    pattern string P, array next (room for at least max(1, P.length()) ints)
 * Output:   next[0] = -1; next[j] = prefix-suffix match length of P[0..j-1]
 *---------------------------------------------------------------------------*/
void getNext(const string &P, int next[])
{
    int plen = (int)P.length();
    int j = 0;      // index into the pattern string P
    int k = -1;     // prefix-suffix match length of the pattern string P
    next[0] = -1;   // initial value
    while (j < plen - 1)    // stop before writing past next[plen - 1]
    {
        if ((k == -1) || (P[j] == P[k]))   // start of P, or the characters match
        {
            j++;
            k++;
            next[j] = k;
        }
        else        // mismatch: j stays where it is, k falls back to next[k]
            k = next[k];
    }
}

/*-----------------------------------------------------------------------------
 * Function: KMP - find the first occurrence of P in T
 * Input:    Pattern string P, Target string T
 * Output:   If matched: the index of the first matched character
 *           else: -1
 *---------------------------------------------------------------------------*/
int KMP(const string &T, const string &P)
{
    int plen = (int)P.length();
    int tlen = (int)T.length();
    std::vector<int> next(plen + 1, 0);
    int i = 0;      // index into the target string T
    int j = 0;      // index into the pattern string P
    int ret = 0;
    getNext(P, next.data());   // get the next array of the pattern string P
    // Use the signed plen here, not P.length(): j can be -1, and comparing it
    // with an unsigned length would convert -1 to a huge value and end the loop.
    while ((i < tlen) && (j < plen))
    {
        if ((j == -1) || (P[j] == T[i]))   // j == -1 means P restarts from its beginning
        {
            i++;
            j++;
        }
        else
        {
            j = next[j];    // T does not backtrack; only P is repositioned
        }
    }
    if (j >= plen)
        ret = i - plen;
    else
        ret = -1;
    return ret;
}
4). Efficiency analysis
Because the target string never backtracks in the KMP algorithm and is scanned sequentially, the worst-case time complexity of KMP is O(m+n),
where m is the length of the pattern string P and n is the length of the target string T.