In my previous article, "BM algorithm details", there was a major gap: I did not give an efficient algorithm for computing the good-suffix jump table of the pattern. For some reason the paper by Robert S. Boyer and J Strother Moore does not give such an algorithm either. The brute-force algorithm takes O(N^3) time, which greatly compromises the practicality of the BM algorithm. In fact, the suffix jump table of a pattern string can be computed in linear time.

Before introducing that algorithm, I would like to recommend an authoritative book on string processing: Algorithms on Strings, Trees and Sequences, by Dan Gusfield. The book covers almost every string processing technique of practical value today, including, of course, the BM and KMP algorithms. The content of this article is derived from that book. Be warned, however, that the book is difficult, and understanding it thoroughly takes real effort.
In my two articles on the KMP and BM algorithms, I mentioned a key issue: prefixes and suffixes of the pattern that occur again inside the pattern itself (prefix/suffix self-containment). The jump tables of both the KMP algorithm and the BM algorithm are directly tied to it. Here we need to introduce a concept, Zi(S), where S is the pattern string. For a pattern string S[1...n] and a position i > 1, Zi(S) is the length of the longest substring S[i...j] that satisfies S[i...j] = S[1...j-i+1]. This sounds mysterious, but it is simply the length of the longest substring starting at position i that is also a prefix of S. For S = aabcaabxaaz, we have
- Z5(S) = 3: (aab)c(aab)xaaz
- Z6(S) = 1: (a)abca(a)bxaaz
- Z7(S) = Z8(S) = 0, because Zi(S) = 0 whenever S[i] != S[1]
- Z9(S) = 2: (aa)bcaabx(aa)z
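The definition can be checked directly with a brute-force computation. The following is only a minimal sketch written for this purpose, not the linear-time algorithm: the function name `z_naive` is made up for illustration, the array is 0-based (so `z[i - 1]` holds the text's Zi(S)), and `z[0]` is set to 0 as a convention.

```c
#include <assert.h>
#include <stddef.h>

/* Brute-force Z values taken straight from the definition.
 * Hypothetical helper, 0-based: z[i] is the length of the longest
 * substring starting at s[i] that is also a prefix of s.
 * z[0] is set to 0 by convention. Worst case O(n^2) comparisons. */
void z_naive(const char *s, size_t n, size_t z[])
{
    assert(s != NULL && z != NULL);
    for (size_t i = 0; i < n; ++i) {
        size_t len = 0;
        if (i > 0) {
            /* Extend the match while s[len] agrees with s[i + len]. */
            while (i + len < n && s[len] == s[i + len]) {
                ++len;
            }
        }
        z[i] = len;
    }
}
```

Called on "aabcaabxaaz" with n = 11, this gives `z[4] == 3`, `z[5] == 1`, `z[6] == z[7] == 0` and `z[8] == 2`, matching Z5(S), Z6(S), Z7(S), Z8(S) and Z9(S) in the list above.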
From Z5(S) = 3 above we know that S[5...7] = S[1...3] and S[5...8] != S[1...4]. We call S[5...7] a Z-block of the string S. In general, if Zi(S) != 0, the Z-block at i starts at position i and ends at position i + Zi(S) - 1. A string may contain several Z-blocks, and Z-blocks may overlap. We then define two values, li and ri: among all Z-blocks that start at or before position i, ri is the largest right endpoint, and li is the left endpoint of the Z-block that ends there. For example, when two Z-blocks both contain position i, only the one that reaches further right (marked a in the original figure) supplies the actual values of li and ri. By the definition of a Z-block, S[li...ri] = S[1...ri-li+1].
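To make li and ri concrete, here is a small sketch that recovers them from an already computed Z array by scanning every Z-block that starts at or before i. The name `rightmost_zblock` and the found/not-found return value are assumptions made for this illustration; indices are 0-based, so they are one less than the positions used in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: among the Z-blocks starting at positions 1..i
 * (0-based), find the one with the largest right endpoint.
 * Returns 1 and stores its endpoints in *l and *r, or returns 0 when
 * no Z-block starts at or before i. */
int rightmost_zblock(const size_t z[], size_t i, size_t *l, size_t *r)
{
    assert(z != NULL && l != NULL && r != NULL);
    int found = 0;
    for (size_t j = 1; j <= i; ++j) {
        if (z[j] == 0) {
            continue;                 /* no Z-block starts at j */
        }
        size_t end = j + z[j] - 1;    /* right endpoint of the block at j */
        if (!found || end > *r) {
            *l = j;
            *r = end;
            found = 1;
        }
    }
    return found;
}
```

For S = aabcaabxaaz the 0-based Z array is {0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0}. Asking for i = 5 (position 6 in the text) yields l = 4 and r = 6, that is, the Z-block S[5...7] in the text's numbering. The linear-time algorithm never rescans like this; it carries l and r along as it goes.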
Now for the main question: given Z1(S), ..., Zi(S) together with li and ri, how do we compute Zi+1(S)? Write l = li, r = ri, k = i + 1, and k' = k - l + 1 = i - li + 2. When k <= r, k' is the position inside the prefix S[1...r-l+1] that corresponds to k.
1. k <= r, and the Z-block at k' lies strictly inside S[1...r-l+1], that is, k' + Zk'(S) - 1 < r - l + 1. Because S[l...r] = S[1...r-l+1], any question about the interval S[l...r] can be moved to the interval S[1...r-l+1], where k' is the position corresponding to k. The known quantity Zk'(S) settles the matter: the match starting at k' stops at a mismatch before the end of S[1...r-l+1], and the same mismatch occurs inside S[l...r]. Hence Zk(S) = Zk'(S).
2. k <= r, and the Z-block at k' reaches or passes the end of S[1...r-l+1], that is, k' + Zk'(S) - 1 >= r - l + 1. Again we move the problem from S[l...r] to S[1...r-l+1]. For Zk(S) we already know that the first r-k+1 characters agree with S[1...r-k+1], because S[k...r] = S[k'...r-l+1] = S[1...r-k+1] (the regions marked beta in the original figure). Whether the characters after S[r] extend this into a longer prefix match can only be found by comparing. So we skip the part that is already known and compare S[r+1], S[r+2], ... against S[r-k+2], S[r-k+3], ... until a mismatch occurs. The last matching position is the new right endpoint ri+1, so Zk(S) = ri+1 - k + 1, and at the same time li+1 is updated to i + 1.
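3. k > r. The previously computed Z-blocks do not help. We compare S[k], S[k+1], ... directly against S[1], S[2], ... and find the smallest q >= k such that S[q] != S[q-k+1] (taking q = n + 1 if the match runs to the end of S). Then Zk(S) = q - k. If Zk(S) > 0 we also update li+1 = i + 1 and ri+1 = q - 1; otherwise li+1 and ri+1 keep the values of li and ri.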
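The three cases can be sketched as code that keeps the text's 1-based positions. This is an illustrative transcription under a few assumptions, not a tuned implementation: the name `z_by_cases` is hypothetical, `z` and `which` must each hold n + 1 elements (index 0 is unused), and `which[k]` records which case handled position k, purely for tracing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the three cases with 1-based positions.
 * s is an ordinary C string of length n, so S[p] in the text is s[p - 1].
 * z[k] receives Zk(S); which[k] receives the case number (1, 2 or 3). */
void z_by_cases(const char *s, size_t n, size_t z[], int which[])
{
    assert(s != NULL && z != NULL && which != NULL);
    size_t l = 0, r = 0;              /* r == 0: no Z-block found yet */
    if (n >= 1) {
        z[1] = 0;                     /* Z1 is 0 by convention */
        which[1] = 0;
    }
    for (size_t k = 2; k <= n; ++k) {
        if (k > r) {
            /* Case 3: compare S[k...] against S[1...] from scratch. */
            size_t q = k;
            while (q <= n && s[q - 1] == s[q - k]) {
                ++q;
            }
            z[k] = q - k;
            which[k] = 3;
            if (z[k] > 0) {
                l = k;
                r = q - 1;
            }
        } else {
            size_t kp = k - l + 1;    /* k' in the text */
            size_t beta = r - k + 1;  /* length of S[k...r] */
            if (z[kp] < beta) {
                /* Case 1: the block at k' ends inside S[1...r-l+1]. */
                z[k] = z[kp];
                which[k] = 1;
            } else {
                /* Case 2: S[k...r] is known to match; continue at S[r+1]. */
                size_t q = r + 1;
                while (q <= n && s[q - 1] == s[q - k]) {
                    ++q;
                }
                z[k] = q - k;
                which[k] = 2;
                l = k;
                r = q - 1;
            }
        }
    }
}
```

The test `z[kp] < beta` is the condition k' + Zk'(S) - 1 < r - l + 1 from case 1, rearranged. On the string "aaaa", position 2 is handled by case 3 (Z2 = 3, l = 2, r = 4), and positions 3 and 4 both fall into case 2, giving Z3 = 2 and Z4 = 1.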
Handling these three cases position by position fills in every Zi(S) of S[1...n] in linear time, because every successful character comparison moves r one step to the right and every position causes at most one failed comparison. Take the pattern string S = "aabaabcaxaabaabcy". Its Zi(S) values are as follows (Z1(S) is set to 0 by convention).

| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S[i] | a | a | b | a | a | b | c | a | x | a | a | b | a | a | b | c | y |
| Zi(S) | 0 | 1 | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 7 | 1 | 0 | 3 | 1 | 0 | 0 | 0 |

Consider Z12(S). By the time it is computed, Z1(S) through Z11(S) are all known, and l = 10, r = 16: the Z-block S[10...16] is the rightmost Z-block so far, and it contains S[12]. Because S[10...16] = S[1...7], Z12(S) is closely related to Z3(S), with k' = 12 - 10 + 1 = 3. We find Z3(S) = 0, so k' + Z3(S) - 1 = 2 < 7. This is the first case, and therefore Z12(S) = Z3(S) = 0.

Now consider Z10(S). When it is computed, the rightmost Z-block is S[8...8], so l = 8, r = 8. Because 10 > 8, this is the third case: we compare from S[10] against the prefix of S and find that S[10...16] equals the prefix of length 7, while S[17] = y differs from S[8] = a. So Z10(S) = 7, and we update l = 10, r = 16.
When computing the Zi(S) values, the second case comes up rarely, but it is also the part of the computation that is easiest to get wrong.
The following is my own implementation of the algorithm for computing the Z array.
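    /* Computes the Z array of pattern[0..length-1] into zvalues[0..length-1].
     * All indices are 0-based; [l, r] is the rightmost Z-block found so far. */
    void ZBlock(const char* pattern, unsigned int length, unsigned int zvalues[])
    {
        unsigned int i, j, k;
        unsigned int l, r;

        if (length == 0) {
            return;                 /* nothing to compute, zvalues[0] may not exist */
        }

        l = r = 0;
        zvalues[0] = 0;
        for (i = 1; i < length; ++i) {
            if (i >= r) {
                /* Case 3: no Z-block helps, compare from scratch.
                 * i == r is also handled here; it costs at most one
                 * repeated comparison. */
                j = 0;
                k = i;
                zvalues[i] = 0;
                while (k < length && pattern[j] == pattern[k]) {
                    ++j;
                    ++k;
                }
                if (k != i) {
                    l = i;
                    r = k - 1;
                    zvalues[i] = k - i;
                }
            } else if (zvalues[i - l] >= r - i + 1) {
                /* Case 2: pattern[i..r] is known to match the prefix,
                 * so continue comparing right after r. */
                j = r - i + 1;
                k = r + 1;
                while (k < length && pattern[j] == pattern[k]) {
                    ++j;
                    ++k;
                }
                l = i;
                r = k - 1;
                zvalues[i] = k - i;
            } else {
                /* Case 1: the mirrored Z-block ends inside the prefix copy. */
                zvalues[i] = zvalues[i - l];
            }
        }
    }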
Because C strings are indexed from 0, the indices in the code are shifted down by one relative to the description above.
In theory, the Z-block algorithm completely solves the problem of computing prefix self-containment, and it describes the process more clearly than the next-table construction of the KMP algorithm does. Given the Z array of the pattern string, both the next table of the KMP algorithm and the good-suffix table of the BM algorithm can be computed efficiently and intuitively.
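As one illustration of the KMP side of this claim, the sketch below derives the classic prefix function from a Z array in linear time. Several things here are assumptions rather than the author's method: the name `prefix_from_z` is hypothetical, indices are 0-based, and `pi[i]` is the length of the longest proper prefix of the pattern that is also a suffix of `pattern[0..i]`, which may be offset from the next table used in the earlier KMP article.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: build the prefix function pi[] from a 0-based
 * Z array z[] of the same pattern (z[0] == 0, z[i] <= n - i).
 * A Z-block of length z[i] starting at i says that the prefix of
 * length j ends at position i + j - 1 for every j in 1..z[i]. */
void prefix_from_z(const size_t z[], size_t n, size_t pi[])
{
    assert(z != NULL && pi != NULL);
    for (size_t i = 0; i < n; ++i) {
        pi[i] = 0;
    }
    for (size_t i = 1; i < n; ++i) {
        /* Walk the block from its right end. Blocks are visited in
         * order of increasing start, so the first value written to a
         * position is the largest; once a filled position is met, the
         * rest of the block was filled by an earlier block too. */
        for (size_t j = z[i]; j > 0; --j) {
            if (pi[i + j - 1] > 0) {
                break;
            }
            pi[i + j - 1] = j;
        }
    }
}
```

For the pattern aabcaabxaaz, whose 0-based Z array is {0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0}, this produces pi = {0, 1, 0, 0, 1, 2, 3, 0, 1, 2, 0}. Each position of `pi` is written at most once, so the conversion is linear in the pattern length.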