Suffix array Summary

Source: Internet
Author: User

Suffix arrays are often called a secret weapon for string processing. However, many people only use ready-made templates, which misses the essence of the algorithm. To learn an algorithm properly, we should understand how it works and implement it ourselves; with algorithms in particular, grasping the underlying idea matters most. A year ago I only used other people's templates. Recently I sat down, studied the suffix array carefully, and wrote my own template.

I mostly followed a set of lecture slides, and of course Baidu was indispensable too. Let's start with the basic concepts. (Much of what follows draws on those teaching slides.)

Basic definitions:

Substring (note: a substring is not the same thing as the string itself)

The substring S[i..j] of a string S, with i ≤ j, is the string formed by S[i], S[i+1], ..., S[j] in order.

Suffix

A suffix is a special substring that runs from some position i to the end of the whole string. The suffix of S starting at its i-th character is written suffix(i), that is, suffix(i) = S[i..len(S)].

Suffix array (sa[i] stores the starting index of the i-th smallest suffix)

The suffix array sa is a one-dimensional array holding a permutation sa[1], sa[2], ..., sa[n] of 1..n such that suffix(sa[i]) < suffix(sa[i+1]) for 1 ≤ i < n. In other words, after sorting the n suffixes of S in ascending order, the starting positions of the sorted suffixes are placed into sa in that order.

Rank array (rank[i] stores the rank of suffix(i))

rank[i] stores the position of suffix(i) among all suffixes sorted in ascending order.

Note: rank is the sort key (this is the central point of the sorting described below).
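
To make the definitions concrete, here is a minimal sketch that builds sa in the most direct way: sort the starting positions by comparing the suffixes themselves. It is not the doubling algorithm described later, only a slow reference. The function name naiveSuffixArray is made up for this illustration, and indices are 0-based (as in the template code), not 1-based as in the definitions above.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Illustrative reference only (hypothetical helper, not part of the template):
// sort the starting positions 0..n-1 by comparing whole suffixes.
// Roughly O(n^2 log n) in the worst case, so only suitable for small inputs.
std::vector<int> naiveSuffixArray(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // sa = 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // Lexicographic comparison of suffix(a) and suffix(b)
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}
```

For the string abab the sorted suffixes are ab, abab, b, bab, so the function returns 2, 0, 3, 1.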

Algorithm goal:

Compute the sa array and the rank array of the string.

It is easy to see that sa and rank are inverses of each other, that is, sa[rank[i]] = i

and rank[sa[i]] = i (so once we have one of them, the other can be computed in O(n)).

Note: this conclusion only holds once sorting is complete,

whereas the definitions of sa and rank apply at every stage.

The reason is that during the intermediate rounds suffixes with equal prefixes share the same rank, while at the end no two suffixes have the same rank.
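
The inverse relation can be sketched in a few lines. The helper name rankFromSa is made up for this illustration; it assumes a finished, 0-based suffix array.

```cpp
#include <vector>

// Hypothetical helper: invert a finished suffix array in O(n).
// After the loop, rank[sa[i]] == i and sa[rank[i]] == i for every i.
std::vector<int> rankFromSa(const std::vector<int>& sa) {
    std::vector<int> rank(sa.size());
    for (int i = 0; i < (int)sa.size(); ++i) {
        rank[sa[i]] = i;  // the suffix starting at sa[i] is the i-th smallest
    }
    return rank;
}
```

The template's getHeight starts with the same loop (rk[sa[i]] = i).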

Basic algorithm flow

• Let the current sort length be H. suffix(i, H) denotes the first H characters of suffix(i) (or the whole suffix if it is shorter than H).
• First sort suffix(i, H) for H = 1 (0 ≤ i < S.length).
• Double H. Using the rank array obtained in the previous round (length H/2) as the key, sort the substrings of length H with the first H/2 characters as the first keyword and the last H/2 characters as the second keyword.
• Because the length doubles each round, at most log n rounds are needed. To reach O(n log n) overall, each round must run in O(n), and the usual O(n) sort is counting sort.

Counting sort (see http://baike.baidu.com/view/1209480.htm if you are unfamiliar with it) is the core of the code. There is another important reason for using it: it is a stable sort, so the order established by the second keyword is preserved among elements whose first keywords are equal. As described above, for the doubled length H we use the ranks from the H/2 round, with the last H/2 characters as the second keyword. So we first order the suffixes by their second half, then build a new array whose index order is the second-keyword order and whose values are the ranks of the first H/2 characters, that is, the first keyword. Sorting this array directly sorts by the first H/2 characters, and ties are broken by index, which is exactly the second keyword.
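
The flow above can be sketched with std::sort on (first keyword, second keyword) pairs in place of counting sort. This swaps in a simpler but slower technique, O(n log^2 n) rather than O(n log n), purely to show the doubling idea. The names doublingSuffixArray, half and key are made up for this sketch; half plays the role of H/2.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Illustrative prefix doubling with std::sort (not the counting-sort template).
// 0-based indices; returns the suffix array of s.
std::vector<int> doublingSuffixArray(const std::string& s) {
    const int n = (int)s.size();
    std::vector<int> sa(n), rk(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    // Length 1: the rank of a suffix is simply its first character.
    for (int i = 0; i < n; ++i) rk[i] = (unsigned char)s[i];
    for (int half = 1;; half *= 2) {
        // Key for length 2*half: (rank of first half, rank of second half).
        // A suffix with no second half gets -1 so that it sorts first.
        auto key = [&](int i) {
            return std::make_pair(rk[i], i + half < n ? rk[i + half] : -1);
        };
        std::sort(sa.begin(), sa.end(),
                  [&](int a, int b) { return key(a) < key(b); });
        // Reassign ranks: suffixes with equal keys share a rank.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i) {
            tmp[sa[i]] = tmp[sa[i - 1]] + (key(sa[i - 1]) < key(sa[i]) ? 1 : 0);
        }
        rk = tmp;
        // Stop as soon as all ranks are distinct.
        if (rk[sa[n - 1]] == n - 1) break;
    }
    return sa;
}
```

For abab the first round already separates all four suffixes: the keys are (a,b), (b,a), (a,b) and (b,-1), then the second round compares the resulting ranks pairwise and yields 2, 0, 3, 1.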
Below we follow the code to simulate by hand how the sa array of the string abab is built. See the code for the details:

• Sort by H = 1
// cnt is the auxiliary array for counting sort, k holds the first keyword (the current ranks), id holds the suffix indices ordered by the second keyword, r is the newly built key array laid out in second-keyword order, w stores the string, and sa stores the suffix array
int *k = rk, *id = height, *r = res, *cnt = wa;
// Counting sort by the first character
rep(i, up) cnt[i] = 0;
rep(i, len) cnt[k[i] = w[i]]++;
rep(i, up) cnt[i + 1] += cnt[i];
for (int i = len - 1; i >= 0; i--) { sa[--cnt[k[i]]] = i; }


Compute the second keyword (think about why a 0 is appended to the end of the w array when it is built)

// Here d is H/2. The last d suffixes have no second half, so they come first in second-keyword order
for (int i = len - d; i < len; i++) id[p++] = i;
// id holds the suffixes in second-keyword order: the suffix at sa[i] is the second half of the suffix starting at sa[i] - d
rep(i, len) if (sa[i] >= d) id[p++] = sa[i] - d;
// Build the new array to sort: first keywords laid out in second-keyword order
rep(i, len) r[i] = k[id[i]];

Sort the new array

// Stable counting sort of r (the first keyword); scanning from the back keeps the second-keyword order among equal first keywords
rep(i, up) cnt[i] = 0;
rep(i, len) cnt[r[i]]++;
rep(i, up) cnt[i + 1] += cnt[i];
for (int i = len - 1; i >= 0; i--) { sa[--cnt[r[i]]] = id[i]; }
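
The stability that this step relies on can be seen in isolation with a small sketch. The helper name countingSortIndices is made up for this illustration; it assumes every key lies in the range 0..up-1 and uses the same backward scan as the snippet above.

```cpp
#include <vector>

// Hypothetical helper: stable counting sort.
// Returns the indices 0..n-1 ordered by key[i]; indices with equal keys
// keep their original relative order. Assumes 0 <= key[i] < up.
std::vector<int> countingSortIndices(const std::vector<int>& key, int up) {
    std::vector<int> cnt(up, 0), order(key.size());
    for (int k : key) ++cnt[k];                         // histogram of keys
    for (int v = 1; v < up; ++v) cnt[v] += cnt[v - 1];  // cnt[v] = number of keys <= v
    // Scan from the back so that, among equal keys, later indices land later.
    for (int i = (int)key.size() - 1; i >= 0; --i) {
        order[--cnt[key[i]]] = i;
    }
    return order;
}
```

With keys 1, 0, 1, 0 the result is 1, 3, 0, 2: the two indices with key 0 (1 and 3) and the two with key 1 (0 and 2) each stay in their original order. In the suffix array code the "original order" is the second-keyword order stored in id, which is why a single pass sorts by both keywords.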

Compute the new keyword (that is, the compressed ranks after sorting by length H)

// After the swap, r holds the old ranks and k receives the new ones; adjacent suffixes in sa get the same rank only if both halves match
swap(k, r);
p = 0;
k[sa[0]] = p++;
rep(i, len - 1) {
    if (sa[i] + d < len && sa[i + 1] + d < len && r[sa[i]] == r[sa[i + 1]] && r[sa[i] + d] == r[sa[i + 1] + d])
        k[sa[i + 1]] = p - 1;
    else
        k[sa[i + 1]] = p++;
}

Repeating the steps above yields the sa and rank arrays. What are they for? Computing the height array!
height[i] is the length of the longest common prefix of suffix(sa[i]) and suffix(sa[i-1]). How height is constructed can be understood from the code.
The full template code is given below.

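#include <algorithm>
#include <cstring>

#define rep(i,n) for(int i = 0; i < n; i++)

// std:: is spelled out instead of "using namespace std;" so that the global size cannot clash with std::size in C++17
const int size = 200005, INF = 1 << 30;
int rk[size], sa[size], height[size], w[size], wa[size], res[size];

// Builds sa for w[0..len-1] by prefix doubling; every value in w must be smaller than up
void getSa(int len, int up) {
    int *k = rk, *id = height, *r = res, *cnt = wa;
    // Counting sort by the first character
    rep(i, up) cnt[i] = 0;
    rep(i, len) cnt[k[i] = w[i]]++;
    rep(i, up) cnt[i + 1] += cnt[i];
    for (int i = len - 1; i >= 0; i--) { sa[--cnt[k[i]]] = i; }
    int d = 1, p = 0;
    while (p < len) {
        // Order the suffixes by the second keyword
        for (int i = len - d; i < len; i++) id[p++] = i;
        rep(i, len) if (sa[i] >= d) id[p++] = sa[i] - d;
        // Stable counting sort by the first keyword
        rep(i, len) r[i] = k[id[i]];
        rep(i, up) cnt[i] = 0;
        rep(i, len) cnt[r[i]]++;
        rep(i, up) cnt[i + 1] += cnt[i];
        for (int i = len - 1; i >= 0; i--) { sa[--cnt[r[i]]] = id[i]; }
        // Reassign ranks for length 2 * d
        std::swap(k, r);
        p = 0;
        k[sa[0]] = p++;
        rep(i, len - 1) {
            if (sa[i] + d < len && sa[i + 1] + d < len && r[sa[i]] == r[sa[i + 1]] && r[sa[i] + d] == r[sa[i + 1] + d])
                k[sa[i + 1]] = p - 1;
            else
                k[sa[i + 1]] = p++;
        }
        if (p >= len) return;  // all ranks are distinct, sorting is complete
        d *= 2, up = p, p = 0;
    }
}

// Rebuilds rk from sa, then fills height[i] with the LCP length of suffix(sa[i]) and suffix(sa[i-1])
void getHeight(int len) {
    rep(i, len) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, p = 0; i < len - 1; i++) {
        int j = sa[rk[i] - 1];
        while (i + p < len && j + p < len && w[i + p] == w[j + p]) { p++; }
        height[rk[i]] = p;
        p = std::max(0, p - 1);
    }
}

// Copies s into w, appends a 0 as the smallest character, and returns the length including that 0
int getSuffix(char s[]) {
    int len = std::strlen(s), up = 0;
    for (int i = 0; i < len; i++) {
        w[i] = s[i];
        up = std::max(up, w[i]);
    }
    w[len++] = 0;
    getSa(len, up + 1);
    getHeight(len);
    return len;
}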
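
When testing a template like this, it can help to cross-check height against a direct computation from its definition. The sketch below is such a brute-force check. The name bruteForceHeight is made up for this illustration, and it works on a plain 0-based suffix array of the original string, without the appended 0.

```cpp
#include <string>
#include <vector>

// Hypothetical checker: height[i] = length of the longest common prefix of
// suffix(sa[i-1]) and suffix(sa[i]), with height[0] = 0.
// Compares character by character, so O(n^2) in the worst case.
std::vector<int> bruteForceHeight(const std::string& s, const std::vector<int>& sa) {
    const int n = (int)s.size();
    std::vector<int> height(sa.size(), 0);
    for (int i = 1; i < (int)sa.size(); ++i) {
        int a = sa[i - 1], b = sa[i], p = 0;
        while (a + p < n && b + p < n && s[a + p] == s[b + p]) ++p;
        height[i] = p;
    }
    return height;
}
```

For abab with sa = 2, 0, 3, 1 this gives 0, 2, 0, 1. Because the template appends a 0 to the string, its sa and height arrays carry one extra leading entry for that shortest suffix, so the values to compare start at index 1 there.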

Finally, here are a few practice problems.
• POJ 2774: longest common substring of two strings, an introductory problem.
• POJ 1743: longest repeated substring without overlap.
  • Hint: be careful with the check inside the binary search. This problem is a little special.
• POJ 3294: longest substring that occurs in more than half of the strings.
  • Hint: the key is determining how many different strings occur within a group.
• POJ 3261: longest substring repeated at least K times, overlaps allowed.
  • Hint: if you can solve the two problems above, this one should be easy. You can also try a monotonic stack.
• POJ 2758: suffix array + RMQ.
  • Hint: the difficulty here is not the RMQ but writing the code and implementing the query algorithm.


