At first I found Luo's paper and template very hard to follow. Later, with the help of some study notes posted on Baidu Tieba, I worked out what the code for the doubling algorithm (sometimes rendered as the "multiplication algorithm") is doing, so I annotated Luo's doubling-algorithm code line by line. I hope this helps others who are learning suffix arrays.
Link to the Baidu Tieba post: http://tieba.baidu.com/f?kz=754580296
int wa[maxn], wb[maxn], wv[maxn], ws[maxn];
int cmp(int *r, int a, int b, int l)
{ return r[a] == r[b] && r[a + l] == r[b + l]; }
// As the paper points out, a 0 is appended at the end of the input. If r[a] == r[b] (in practice y[a] == y[b]), the two length-j strings starting at a and b have the same rank, so neither of them contains the trailing 0 (the 0 occurs only once, so a string containing it cannot tie with any other string). The second halves therefore start no later than the position of the 0, which means a + l and b + l are at most n - 1 and the array access stays in bounds.
// The parameter n of da is the number of characters in the string, including the 0 appended at the end; the figures in the paper do not show this trailing 0.
// The parameter m of da is the range of the character values and serves as the bucket count for the radix sort. If the original sequence consists of letters, 128 can be used directly; if it consists of integers, m can be the largest value plus 1.
void da(int *r, int *sa, int n, int m)
{
int i, j, p, *x = wa, *y = wb, *t;
// The next four loops radix-sort the single characters (that is, the strings of length 1). If it is not clear why this sorts them, simulating it by hand with pen and paper helps.
for (i = 0; i < m; i++) ws[i] = 0;
// x[] is meant to hold the rank of each suffix. No true rank is stored here: x[] is only ever compared later on, so the raw character values, which already reflect the relative order, are enough.
for (i = 0; i < n; i++) ws[x[i] = r[i]]++;
for (i = 1; i < m; i++) ws[i] += ws[i - 1];
// i runs downward from n - 1 so that, among equal strings, the one that starts earlier is placed first.
for (i = n - 1; i >= 0; i--) sa[--ws[x[i]]] = i;
// In the loop below, p is the number of distinct ranks. Once p reaches n, the relative order of all the strings is settled.
// j is the length of the strings being merged: each round merges two strings of length j into one string of length 2 * j. A string that runs past the end of the input needs slightly different wording, but the idea is the same.
// m is again the value range for the radix sort; it is set to p after every round.
for (j = 1, p = 1; p < n; j *= 2, m = p)
{
// The next two loops sort by the second key.
// From the figure in the paper, the second key of positions n - j to n - 1 is 0, so when sorting by the second key they all come first.
for (p = 0, i = n - j; i < n; i++) y[p++] = i;
// The remaining second keys are nonzero and come from the previous round's result: for every sa[i] >= j, the rank of string sa[i] is the second key of string sa[i] - j. (Here "string k" means the string that starts at position k, not the string of lexicographic rank k.) Scanning sa[] in order visits those ranks in increasing order, so the remaining elements come out already sorted by second key.
for (i = 0; i < n; i++) if (sa[i] >= j) y[p++] = sa[i] - j;
// After the sort by second key, y[] holds the string start positions in second-key order.
// Fetch the first key of each string (as noted earlier, x[] holds the ranks, which are the first keys) into wv[] for use below.
for (i = 0; i < n; i++) wv[i] = x[y[i]];
// The next four loops sort by the first key.
for (i = 0; i < m; i++) ws[i] = 0;
for (i = 0; i < n; i++) ws[wv[i]]++;
for (i = 1; i < m; i++) ws[i] += ws[i - 1];
// i runs downward from n - 1 for the same reason as before. Note that the value stored is y[i], because y[i] holds the start position of the string.
for (i = n - 1; i >= 0; i--) sa[--ws[wv[i]]] = y[i];
// The next two lines compute the ranks after merging. The new ranks belong in x[], but computing them needs the previous round's ranks, which are also in x[]. Reading from x[] while writing into x[] would not work, so the old contents have to be moved to another array first. Swapping the pointers x and y does this without copying anything: the old ranks are now reachable through y[].
// x[] then receives the new rank of each string. Recall that when sa[] was computed, equal strings were ordered with the earlier one first; when ranks are computed, equal strings must receive the same rank, otherwise p would reach n and the loop would stop too early.
for (t = x, x = y, y = t, p = 1, x[sa[0]] = 0, i = 1; i < n; i++)
    x[sa[i]] = cmp(y, sa[i - 1], sa[i], j) ? p - 1 : p++;
}
return;
}
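The four-loop pattern that da uses twice can be isolated as a stable counting sort. The sketch below is only an illustration: the function name `counting_sort_indices` and its parameters are my own and do not appear in Luo's template. It shows why filling `out` while walking i downward from n - 1 keeps equal keys in their original order.

```c
/* Hypothetical helper, not part of Luo's template.
   key[i] must lie in [0, m). cnt needs room for m ints, out for n ints.
   On return, out lists the indices 0..n-1 ordered by key; indices with
   equal keys stay in increasing order (the sort is stable). */
void counting_sort_indices(const int *key, int n, int m, int *cnt, int *out)
{
    int i;
    for (i = 0; i < m; i++) cnt[i] = 0;
    for (i = 0; i < n; i++) cnt[key[i]]++;         /* histogram of the keys */
    for (i = 1; i < m; i++) cnt[i] += cnt[i - 1];  /* cnt[v] = number of keys <= v */
    /* cnt[v] is one past the last slot of bucket v. Walking i downward
       fills each bucket from its end, so the larger index lands later. */
    for (i = n - 1; i >= 0; i--) out[--cnt[key[i]]] = i;
}
```

For the keys {2, 1, 2, 0, 1} with m = 3 this yields the index order 3, 1, 4, 0, 2: both ties (indices 1 and 4, indices 0 and 2) keep the smaller index first.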
// The key to computing height[] in linear time is a property of h[], where h[i] = height[rank[i]]: h[i] >= h[i - 1] - 1. What follows explains where this inequality comes from.
// The proof in the paper was hazy to me at first, but I eventually worked it out. To restate the goal: for the i-th suffix, let j = sa[rank[i] - 1], that is, j is the string ranked immediately before i. By definition, the longest common prefix of i and j is height[rank[i]]. We want a lower bound on height[rank[i]], and what has to be shown is that it is at least height[rank[i - 1]] - 1.
// Now for the proof.
// Let k be the string ranked immediately before string i - 1 in lexicographic order. (Here and below, "string k" means the string that starts at position k, not the string of lexicographic rank k.) Note that k is not necessarily i - 2: k is the string that precedes i - 1 in lexicographic order, not the string that starts one position before i - 1 in the original text.
// By the definition of height[], the longest common prefix of string k and string i - 1 is height[rank[i - 1]]. We now look at how string k + 1 relates to string i.
// Case 1: the first character of string k differs from the first character of string i - 1. Then string k + 1 may rank before string i or after it, but that does not matter: height[rank[i - 1]] is 0, so whatever height[rank[i]] is, height[rank[i]] >= height[rank[i - 1]] - 1 holds, that is, h[i] >= h[i - 1] - 1.
// Case 2: the first character of string k equals the first character of string i - 1. String k + 1 is string k with its first character removed, and string i is string i - 1 with its first character removed, so string k + 1 must rank before string i; otherwise k could not rank before i - 1, a contradiction. Moreover, since the longest common prefix of strings k and i - 1 is height[rank[i - 1]], the longest common prefix of strings k + 1 and i is height[rank[i - 1]] - 1.
// This does not yet finish case 2. Among the strings ranked before string i, which one shares the longest common prefix with string i? It is the one ranked immediately before it, sa[rank[i] - 1]. So the longest common prefix of sa[rank[i]] and sa[rank[i] - 1] is at least that of strings k + 1 and i, namely height[rank[i - 1]] - 1. Hence height[rank[i]] >= height[rank[i - 1]] - 1, that is, h[i] >= h[i - 1] - 1.
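The inequality can also be checked by brute force on small inputs. The sketch below is my own illustration, not code from the paper: `lcp_len` and `h_property_holds` are hypothetical names, the suffix array is built naively by insertion sort with `strcmp`, no trailing 0 is appended, and the input is assumed to be at most 64 characters long.

```c
#include <string.h>

/* Length of the longest common prefix of two C strings. */
static int lcp_len(const char *a, const char *b)
{
    int k = 0;
    while (a[k] != '\0' && a[k] == b[k]) k++;
    return k;
}

/* Hypothetical checker, not part of Luo's template.
   Fills h[i] with the LCP of suffix i and the suffix ranked just before
   it (0 for the smallest suffix). Returns 1 if h[i] >= h[i-1] - 1 for
   every i in 1..n-1, 0 if it fails, -1 if s is longer than 64 chars. */
int h_property_holds(const char *s, int *h)
{
    int sa[64], rk[64];
    int n = (int)strlen(s), i, t;
    if (n > 64) return -1;
    /* Naive suffix array: insert suffix i into the sorted prefix sa[0..i-1]. */
    for (i = 0; i < n; i++) {
        for (t = i; t > 0 && strcmp(s + sa[t - 1], s + i) > 0; t--)
            sa[t] = sa[t - 1];
        sa[t] = i;
    }
    for (i = 0; i < n; i++) rk[sa[i]] = i;
    for (i = 0; i < n; i++)
        h[i] = rk[i] ? lcp_len(s + i, s + sa[rk[i] - 1]) : 0;
    for (i = 1; i < n; i++)
        if (h[i] < h[i - 1] - 1) return 0;
    return 1;
}
```

For "aabaaaab" this gives h = 3, 2, 1, 0, 3, 2, 1, 0: each value drops by at most 1 from its predecessor, which is exactly what the inequality allows.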
// With this proved, the code below is easier to follow.
int rank[maxn], height[maxn];
void calheight(int *r, int *sa, int n)
{
int i, j, k = 0;
for (i = 1; i <= n; i++) rank[sa[i]] = i; // lexicographic rank of each string
// Each pass of the outer loop assigns k to height[rank[i]]. i runs from 0 to n - 1, so height[] is filled in the order height[rank[0]], ..., height[rank[n - 1]], not in index order.
// In the inner loop, k still holds the previous result. If k is 0, it stays 0, and strings i and j are compared from their first characters. If k is not 0, the proof above guarantees a common prefix of length at least k - 1, so k is decremented and the comparison resumes after the first k - 1 characters.
for (i = 0; i < n; height[rank[i++]] = k)
    for (k ? k-- : 0, j = sa[rank[i] - 1]; r[i + k] == r[j + k]; k++);
return;
}
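calheight relies on the appended 0 to stop the character comparison. As a contrast, here is a variant that works on a plain string with no trailing 0 and uses explicit bounds checks instead. This is not Luo's code: the name `lcp_no_sentinel`, the 0-based indexing, and the caller-supplied `rk` and `lcp` buffers are all assumptions of this sketch.

```c
/* Hypothetical variant, not part of Luo's template.
   s has n characters and no trailing 0 is required. sa is a 0-indexed
   suffix array of s. rk and lcp must each have room for n ints.
   On return, lcp[0] = 0 and lcp[t] is the longest common prefix of
   the suffixes starting at sa[t - 1] and sa[t]. */
void lcp_no_sentinel(const char *s, const int *sa, int n, int *rk, int *lcp)
{
    int i, j, k = 0;
    for (i = 0; i < n; i++) rk[sa[i]] = i;
    for (i = 0; i < n; i++) {
        if (rk[i] == 0) {      /* smallest suffix: nothing ranked before it */
            lcp[0] = 0;
            k = 0;
            continue;
        }
        j = sa[rk[i] - 1];     /* suffix ranked immediately before suffix i */
        while (i + k < n && j + k < n && s[i + k] == s[j + k]) k++;
        lcp[rk[i]] = k;
        if (k > 0) k--;        /* h[i + 1] >= h[i] - 1, so reuse k - 1 */
    }
}
```

For "banana" with sa = 5, 3, 1, 0, 4, 2, the result is lcp = 0, 1, 3, 0, 0, 2. The cost of dropping the trailing 0 is the two extra comparisons in the while condition.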
// A final note on calling da and calheight. In Luo's original program the calls look like the two lines below, which makes it clear that the int n of da and the int n of calheight are not the same quantity: da receives the length including the appended 0, calheight the length without it. The valid entries of the height array are height[1] to height[n], with height[1] = 0, because sa[0] is the appended 0, so the longest common prefix of sa[1] and sa[0] is naturally 0.
da(r, sa, n + 1, 128);
calheight(r, sa, n);
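To make the n versus n + 1 convention concrete, here is a compact self-contained sketch that repeats da and calheight and wraps the two calls. The wrapper `build`, the global arrays `r` and `sa`, and the value 1024 for maxn are my own assumptions and are not taken from Luo's program; maxn has to be at least max(n + 1, 128) because ws[] is indexed both by character value and by rank.

```c
#include <string.h>

#define maxn 1024  /* assumed bound: needs maxn >= max(strlen(s) + 1, 128) */

int wa[maxn], wb[maxn], wv[maxn], ws[maxn];
int rank[maxn], height[maxn];
int r[maxn], sa[maxn];

int cmp(int *r, int a, int b, int l)
{ return r[a] == r[b] && r[a + l] == r[b + l]; }

void da(int *r, int *sa, int n, int m)
{
    int i, j, p, *x = wa, *y = wb, *t;
    for (i = 0; i < m; i++) ws[i] = 0;
    for (i = 0; i < n; i++) ws[x[i] = r[i]]++;
    for (i = 1; i < m; i++) ws[i] += ws[i - 1];
    for (i = n - 1; i >= 0; i--) sa[--ws[x[i]]] = i;
    for (j = 1, p = 1; p < n; j *= 2, m = p) {
        for (p = 0, i = n - j; i < n; i++) y[p++] = i;
        for (i = 0; i < n; i++) if (sa[i] >= j) y[p++] = sa[i] - j;
        for (i = 0; i < n; i++) wv[i] = x[y[i]];
        for (i = 0; i < m; i++) ws[i] = 0;
        for (i = 0; i < n; i++) ws[wv[i]]++;
        for (i = 1; i < m; i++) ws[i] += ws[i - 1];
        for (i = n - 1; i >= 0; i--) sa[--ws[wv[i]]] = y[i];
        for (t = x, x = y, y = t, p = 1, x[sa[0]] = 0, i = 1; i < n; i++)
            x[sa[i]] = cmp(y, sa[i - 1], sa[i], j) ? p - 1 : p++;
    }
}

void calheight(int *r, int *sa, int n)
{
    int i, j, k = 0;
    for (i = 1; i <= n; i++) rank[sa[i]] = i;
    for (i = 0; i < n; height[rank[i++]] = k)
        for (k ? k-- : 0, j = sa[rank[i] - 1]; r[i + k] == r[j + k]; k++);
}

/* Hypothetical wrapper, not in Luo's program: copies s into r[], appends
   the 0, and makes the two calls. Returns n = strlen(s), or -1 if s does
   not fit in the arrays or contains a byte outside 1..127. */
int build(const char *s)
{
    int n = (int)strlen(s), i;
    if (n + 1 > maxn) return -1;
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 128) return -1;
        r[i] = c;
    }
    r[n] = 0;                 /* the appended 0, smaller than every character */
    da(r, sa, n + 1, 128);    /* n + 1: the appended 0 is counted */
    calheight(r, sa, n);      /* n: the appended 0 is not counted */
    return n;
}
```

Calling `build("aabaaaab")` returns 8 and leaves sa[0] = 8 (the appended 0), sa[1..8] = 3, 4, 5, 0, 6, 1, 7, 2, and height[1..8] = 0, 3, 2, 3, 1, 2, 0, 1.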