Suffix Array Learning Notes (Detailed)

Last Update:2014-12-06 Source: Internet

Author: User

I stared at suffix arrays for I don't know how many days, and I finally understand them!

The hardest part is that the same quantity is sometimes represented by an array subscript and sometimes by an array value, which is deadly confusing.

I traced the algorithm by hand I don't know how many times before it made sense. I suggest beginners also run it by hand a few times, but pay attention to what each array means, otherwise the exercise is useless.

Meaning of the arrays:

s[]: The input string. During preprocessing a 0 character is appended to the end.

sa[]: Its subscript is the suffix rank; its value is the start position of the suffix with that rank.

x[] = t[]: Stores the first-keyword rank. Note: its value is the rank (its subscript is the start position). Initially it is simply the ASCII codes of the characters, which are already in dictionary order.

y[] = t2[]: Its subscript is the second-keyword rank; its value is the start position of the matching first keyword. The second keyword is taken directly from sa[], so the two are closely related.

c[]: The counting array for the radix sort. Initially it holds the number of occurrences of each character. Its later use is tied to how radix sort works, so it is worth learning radix sort first.
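To make the subscript/value roles of sa[] concrete, here is a minimal brute-force sketch. It is not the doubling algorithm from these notes: it simply sorts the suffix start positions by comparing the suffixes directly, so it is only useful as a reference for small inputs. The names naive_sa and inverse are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Brute-force reference, roughly O(n^2 log n): sort the start positions by
// comparing whole suffixes. Result plays the role of sa[]:
// subscript = rank, value = start position.
std::vector<int> naive_sa(const std::string& text) {
    std::string s = text;
    s.push_back('\0');                      // the appended 0 from the notes
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);     // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Inverse view: subscript = start position, value = rank.
std::vector<int> inverse(const std::vector<int>& sa) {
    std::vector<int> rank(sa.size());
    for (int i = 0; i < (int)sa.size(); i++) {
        assert(sa[i] >= 0 && sa[i] < (int)sa.size());
        rank[sa[i]] = i;
    }
    return rank;
}
```

For banana, naive_sa gives 6 5 3 1 0 4 2 (the sentinel suffix at position 6 has rank 0), and inverse turns that into 4 3 6 2 5 1 0, the rank of each start position.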

One thing must be noted: the second keyword comes from the sa[] array, but the first keyword does not! Many people get confused here, because the diagram in the paper is a schematic and not a picture of what the code actually does.

P.S. To save time and space, the code avoids copying t[] into a new intermediate array. Instead it swaps the pointer x (which points to t[]) with the pointer y (which points to t2[]). At that moment the old contents of t2[] are no longer needed.
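The pointer exchange can be seen in isolation in the small sketch below. The buffers buf_a and buf_b, the size LEN, and the function swap_roles_demo are hypothetical names used only here; the point is just that swapping two pointers costs O(1) and moves no array data.

```cpp
#include <algorithm>  // std::swap
#include <cassert>

const int LEN = 4;                 // hypothetical size for this illustration
int buf_a[LEN] = {1, 2, 3, 4};     // stands in for t[]  (current ranks)
int buf_b[LEN] = {0, 0, 0, 0};     // stands in for t2[] (scratch space)

// Returns true if, after the swap, y reads the old rank data and x points at
// the scratch buffer, while neither buffer has been copied or modified.
bool swap_roles_demo() {
    int *x = buf_a, *y = buf_b;
    assert(x != y);
    std::swap(x, y);               // only the two pointers change
    return y == buf_a && x == buf_b && y[2] == 3 && buf_a[2] == 3;
}
```

After the swap, y is read as "the old x[]" and x can be overwritten with the new ranks, which is why the previous contents of the scratch buffer do not matter.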

Here is code for understanding the suffix array, with intermediate output added:

#include <cstdio>
#include <cstring>
#include <algorithm>

// No "using namespace std;" here: with it, the global array rank[] below can
// clash with std::rank on modern compilers.

// N bounds the input length plus the sentinel. The original value was lost;
// 1005 is an assumed placeholder.
const int N = 1005, M = 130;
char s[N];
// c[] needs max(N, M) slots: after the first pass, m becomes the number of
// distinct ranks, which can be as large as n.
int sa[N], t[N], t2[N], c[N > M ? N : M], n;
int rank[N], high[N];

#define DBG
#ifdef DBG
int db[N];
// Prints the inverse of f[]: for each position, the subscript at which it appears in f[].
void debug(int *f) {
    for (int i = 0; i < n; i++) {
        db[f[i]] = i;
    }
    std::printf("%3d", db[0]);
    for (int i = 1; i < n; i++) {
        std::printf("%3d", db[i]);
    }
    std::puts("]");
}
#endif

// True if the suffixes ranked i-1 and i have the same (first, second) keyword pair.
bool cmp(int *y, int i, int k) {
    return y[sa[i - 1]] == y[sa[i]] && y[sa[i - 1] + k] == y[sa[i] + k];
}

void build(int m) {
    int i, *x = t, *y = t2;
    // Radix sort on single characters.
    for (i = 0; i < m; i++) c[i] = 0;
    for (i = 0; i < n; i++) c[x[i] = s[i]]++;
    for (i = 1; i < m; i++) c[i] += c[i - 1];
    for (i = n - 1; i >= 0; i--) sa[--c[x[i]]] = i;
#ifdef DBG
    std::printf("sa get:[");
    debug(sa);
    std::puts("");
#endif
    for (int k = 1, p; k <= n; k <<= 1, m = p) {
        p = 0;
        // The subscript of y[] is the second-keyword rank, taken directly from sa[];
        // the value of y[] is the position of the matching first keyword.
        for (i = n - k; i < n; i++) y[p++] = i;
        for (i = 0; i < n; i++) if (sa[i] >= k) y[p++] = sa[i] - k;
#ifdef DBG
        std::printf("Gain y:[");
        debug(y);
        std::printf("Look x:{");
        std::printf("%3d", x[0]);
        for (i = 1; i < n; i++) {
            std::printf("%3d", x[i]);
        }
        std::puts("}");
#endif
        // The value of x[] is the first-keyword rank.
        // Merge by the values of x[] and the subscripts of y[]; the new rank becomes the subscript of sa[].
        for (i = 0; i < m; i++) c[i] = 0;
        for (i = 0; i < n; i++) c[x[y[i]]]++;
        for (i = 1; i < m; i++) c[i] += c[i - 1];
        for (i = n - 1; i >= 0; i--) sa[--c[x[y[i]]]] = y[i];
#ifdef DBG
        std::printf("sa get:[");
        debug(sa);
        std::puts("");
#endif
        // Walk the old x[] in the order given by sa[] to compute the new x[].
        std::swap(x, y);
        p = 1;
        x[sa[0]] = 0;  // sa[0] is always the appended character 0, which has rank 0
        for (i = 1; i < n; i++) {
            x[sa[i]] = cmp(y, i, k) ? p - 1 : p++;
        }
        // Pruning: x[] no longer contains equal values, so sa[] is final.
        if (p >= n) break;
    }
}

void get_high() {
    int k = 0;
    for (int i = 0; i < n; i++) rank[sa[i]] = i;
    // Stop at n-2: the sentinel at position n-1 has rank 0 and no predecessor in sa[].
    for (int i = 0; i < n - 1; i++) {
        if (k) k--;
        int j = sa[rank[i] - 1];
        while (s[i + k] == s[j + k]) k++;
        high[rank[i]] = k;
    }
}

void pr() {
    std::printf("The Rank is:\n");
    std::printf("%d", rank[0]);
    for (int i = 1; i < n - 1; i++) std::printf(" %d", rank[i]);
    std::puts("");
}

int main() {
    // The field width matches the assumed N above (N - 1 characters plus the terminator).
    if (std::scanf("%1004s", s) != 1) return 0;
    n = std::strlen(s) + 1;  // n counts the appended 0 as well
    int maxi = 0;
    for (int i = 0; i < n; i++) {
        maxi = maxi > s[i] ? maxi : s[i];
    }
    s[n - 1] = 0;
    build(maxi + 1);
    get_high();
#ifdef DBG
    pr();
#endif
    return 0;
}

Based on this code, enter some data to test and study the intermediate output carefully.

Recommended Data:

abaab

aabaaaab

banana

Next is the process of running it by hand:

In the figure, square brackets mean that the numbers inside are subscripts, and curly braces mean that they are values. Every entry lines up one-to-one with the red numbers in the first row.

For now, ignore how the first keyword is computed.

Following the program above, fill in the values in the figure one at a time; filling them in yourself is how it starts to make sense. (The values of x[] are given directly in the figure. Note that each x[] is computed at the end of the previous radix-sort round.)

The initial sa[] comes directly from the occurrence counts of each character and is easy to work out by hand. This completes the radix sort on one character.
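The one-character round can be sketched on its own as a plain counting sort. This is a minimal version using std::vector instead of the global arrays; the function name first_pass and its parameters are assumptions made for this illustration, and s is expected to already end with the appended 0.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One counting-sort pass over single characters.
// s: text that already ends with '\0'; m: one more than the largest code.
// Fills x (the key of each position) and returns sa for prefix length 1.
std::vector<int> first_pass(const std::string& s, int m, std::vector<int>& x) {
    int n = (int)s.size();
    std::vector<int> c(m, 0), sa(n);
    x.assign(n, 0);
    for (int i = 0; i < n; i++) {
        x[i] = (unsigned char)s[i];
        assert(x[i] < m);
        c[x[i]]++;                                   // occurrences of each character
    }
    for (int i = 1; i < m; i++) c[i] += c[i - 1];    // c[v] = how many keys are <= v
    for (int i = n - 1; i >= 0; i--) sa[--c[x[i]]] = i;  // right to left keeps ties stable
    return sa;
}
```

For banana plus the sentinel, the counts are one 0, three a, one b and two n, and the pass returns 6 1 3 5 0 2 4: the sentinel first, then the three a positions in left-to-right order, then b, then the two n positions.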

The blue numbers are the second keywords, which are extracted directly from sa[]. The yellow arrows mark the suffixes that have no second keyword; these are ranked first, from left to right starting at 0, and only after filling them in do you extract the remaining second keywords. Again: although the figure draws connecting lines, the first keyword is not the number in the sa[] array!
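Extracting the second keywords can be written as a small standalone function. This is a sketch with std::vector; the name second_key_order is made up here, and the result corresponds to y[] (subscript = second-keyword rank, value = start position).

```cpp
#include <cassert>
#include <vector>

// Orders all start positions by their second keyword, i.e. by the rank of
// the suffix that starts k characters later, using only the previous sa[].
std::vector<int> second_key_order(const std::vector<int>& sa, int k) {
    int n = (int)sa.size();
    assert(k >= 1 && k <= n);
    std::vector<int> y;
    y.reserve(n);
    // Positions n-k .. n-1 have no second keyword: they come first, left to right.
    for (int i = n - k; i < n; i++) y.push_back(i);
    // Walking sa[] in rank order yields the remaining second keywords in sorted
    // order; position sa[i] is the second keyword of position sa[i] - k.
    for (int i = 0; i < n; i++)
        if (sa[i] >= k) y.push_back(sa[i] - k);
    return y;
}
```

For banana with the one-character result sa = 6 1 3 5 0 2 4 and k = 1, this gives y = 6 5 0 2 4 1 3: position 6 has no second keyword, then position 5 (whose second keyword is the sentinel), and so on.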

Then merge x[] with the freshly filled y[] (green text) to compute sa[]. This is the radix sort on two characters.
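The merge itself is one more stable counting sort, this time keyed on x[] and visiting positions in y[] order. A minimal sketch follows; the function name merge_pass is an assumption, and m must be larger than every value in x.

```cpp
#include <cassert>
#include <vector>

// Stable counting sort of the positions listed in y (already ordered by second
// keyword) using x[position] as the key (first keyword). Returns the new sa[].
std::vector<int> merge_pass(const std::vector<int>& x,
                            const std::vector<int>& y, int m) {
    int n = (int)x.size();
    assert((int)y.size() == n);
    std::vector<int> c(m, 0), sa(n);
    for (int i = 0; i < n; i++) c[x[y[i]]]++;
    for (int i = 1; i < m; i++) c[i] += c[i - 1];
    // Right to left: among equal first keywords, the one later in y[]
    // (larger second keyword) receives the larger rank.
    for (int i = n - 1; i >= 0; i--) sa[--c[x[y[i]]]] = y[i];
    return sa;
}
```

For banana, take x = 2 1 3 1 3 1 0 (0 for the sentinel, 1 for a, 2 for b, 3 for n; the real first round uses ASCII codes, which sort the same way) and y = 6 5 0 2 4 1 3. The result is sa = 6 5 1 3 0 2 4, the order by the first two characters, with the ties an/an and na/na left in y[] order.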

Next, keep doubling to complete the radix sort on four characters. (If you are confused about why the lines still connect only two numbers, I suggest reading the paper.)

Finally, the eight-character sort is actually unnecessary: at this point the new x[] array (the second line of the last row) contains no duplicate ranks, and the first keyword takes priority, so sa[] is already determined. You can add pruning here and break out of the loop.

How to get the x[] array:

Each time sa[] has been obtained, compute the new x[] by walking the old x[] in the rank order given by sa[] (that is, sa[1] through sa[n-1]; sa[0] is always the appended 0 and gets rank 0). If a suffix has exactly the same keyword pair as the one ranked just before it (this is what the cmp() function checks), it gets the same rank (p-1); otherwise it gets the next new rank.

With that in mind, go fill in the x[] array!
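The re-ranking step can be sketched as a standalone function. The name rerank and the vector-based signature are assumptions for this illustration; it relies on the appended 0 being unique, which guarantees that a + k and b + k stay in range whenever the first keywords are equal.

```cpp
#include <cassert>
#include <vector>

// Computes the new ranks from the freshly sorted sa[] and the old ranks.
// Returns p, the number of distinct ranks; p == n means sa[] is final.
int rerank(const std::vector<int>& sa, const std::vector<int>& old_x, int k,
           std::vector<int>& new_x) {
    int n = (int)sa.size();
    assert(n >= 1 && (int)old_x.size() == n);
    new_x.assign(n, 0);
    int p = 1;
    new_x[sa[0]] = 0;                       // the sentinel suffix has rank 0
    for (int i = 1; i < n; i++) {
        int a = sa[i - 1], b = sa[i];
        // Same keyword pair as the previous suffix? The second comparison is
        // evaluated only when the first keywords match.
        bool same = old_x[a] == old_x[b] && old_x[a + k] == old_x[b + k];
        new_x[b] = same ? p - 1 : p++;
    }
    return p;
}
```

For banana after the two-character sort, sa = 6 5 1 3 0 2 4 and the old x = 2 1 3 1 3 1 0 with k = 1 give the new x = 3 2 4 2 4 1 0 and p = 5. Positions 1 and 3 (both an) share rank 2, positions 2 and 4 (both na) share rank 4, and since 5 is less than n = 7 another doubling round is still needed.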

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
