Get a complete understanding of the suffix array

Source: Internet
Author: User
Tags first string

What is a suffix array? First, you need to know what a suffix is.

Take the string abcdef. Then abcdef, bcdef, cdef, def, ef and f are its suffixes. In other words, a suffix is the substring that starts at some letter and runs through the last letter of the string.
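As a small illustration of the definition, a suffix of a C string can be represented by a pointer into the string itself. The helper names suffix_at and print_suffixes below are made up for this sketch.

```c
#include <stdio.h>
#include <string.h>

/* Returns the suffix of s that starts at index i.
   No copy is made: the suffix shares storage with s. */
const char *suffix_at(const char *s, size_t i) {
    return s + i;
}

/* Prints every suffix of s, one per line, with its start index. */
void print_suffixes(const char *s) {
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        printf("%zu: %s\n", i, suffix_at(s, i));
}
```

Calling print_suffixes("abcdef") lists the six suffixes from abcdef down to f.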

What can a suffix array do? I won't cover that here; I assume you already know what suffix arrays are used for.

I had read many write-ups on suffix arrays before. The code is only twenty or thirty lines long, but I could not find a blog post that explains it from beginning to end.

After studying it on and off for a month, I finally have a deeper understanding of the doubling algorithm (that is just its name; there is no need to get hung up on why it is called that).

This is the original code:

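    /* MAXN must be defined by the caller as an upper bound on both n and m. */
    int wa[MAXN], wb[MAXN], wv[MAXN], ws[MAXN];

    /* True if the two suffixes agree on both halves of the 2*l-long prefix. */
    int cmp(int *r, int a, int b, int l) {
        return r[a] == r[b] && r[a + l] == r[b + l];
    }

    /* r: input of length n with values in [0, m); sa: output suffix array.
       cmp reads r[a + l], so r normally ends with a unique smallest value
       (for example 0) that is counted in n. */
    void da(int *r, int *sa, int n, int m) {
        int i, j, p, *x = wa, *y = wb, *t;
        /* Radix sort on the first character. */
        for (i = 0; i < m; i++) ws[i] = 0;
        for (i = 0; i < n; i++) ws[x[i] = r[i]]++;
        for (i = 1; i < m; i++) ws[i] += ws[i - 1];
        for (i = n - 1; i >= 0; i--) sa[--ws[x[i]]] = i;
        for (j = 1, p = 1; p < n; j <<= 1, m = p) {
            /* y: suffixes ordered by second keyword. */
            for (p = 0, i = n - j; i < n; i++) y[p++] = i;
            for (i = 0; i < n; i++) if (sa[i] >= j) y[p++] = sa[i] - j;
            /* Stable radix sort of y by first keyword. */
            for (i = 0; i < n; i++) wv[i] = x[y[i]];
            for (i = 0; i < m; i++) ws[i] = 0;
            for (i = 0; i < n; i++) ws[wv[i]]++;
            for (i = 1; i < m; i++) ws[i] += ws[i - 1];
            for (i = n - 1; i >= 0; i--) sa[--ws[wv[i]]] = y[i];
            /* Recompute the ranks x from the new sa. */
            for (t = x, x = y, y = t, p = 1, x[sa[0]] = 0, i = 1; i < n; i++)
                x[sa[i]] = cmp(y, sa[i - 1], sa[i], j) ? p - 1 : p++;
        }
    }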

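To understand the code above, you first need to know what radix sort is (see the Baidu Baike entry on radix sort).

Assuming you already understand radix sort, let's walk through the code above.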
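For reference, one pass of radix sort is a stable counting sort. The sketch below follows the same four-loop pattern that opens da; the function name counting_sort_indices and its parameter names are chosen for this illustration and do not appear in the original code.

```c
/* Stable counting sort of the indices 0..n-1 by key.
   key[i] must lie in [0, m); count must have room for m ints.
   On return, order[0..n-1] lists the indices sorted by key,
   with equal keys kept in increasing index order. */
void counting_sort_indices(const int *key, int n, int m, int *count, int *order) {
    int i;
    for (i = 0; i < m; i++) count[i] = 0;                 /* clear buckets */
    for (i = 0; i < n; i++) count[key[i]]++;              /* bucket sizes */
    for (i = 1; i < m; i++) count[i] += count[i - 1];     /* bucket end positions */
    for (i = n - 1; i >= 0; i--) order[--count[key[i]]] = i; /* fill from the back */
}
```

Walking the input from the back in the last loop is what keeps the sort stable, which matters later when a second pass has to preserve the order produced by the first.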

There are also a few definitions you need to know first.

Suffix array (sa[i] holds the starting position of the suffix ranked i-th)

The suffix array SA is a one-dimensional array that holds a permutation sa[1], sa[2], ..., sa[n] of 1..n such that Suffix(sa[i]) < Suffix(sa[i+1]) for 1 ≤ i < n. That is, the n suffixes of S are sorted from smallest to largest, and the starting positions of the sorted suffixes are stored in SA in order.

Rank array (rank[i] holds the rank of Suffix(i))

rank[i] holds the rank of Suffix(i) among all suffixes, sorted from smallest to largest.

To summarize: sa[i] = j means that the suffix ranked i-th from smallest to largest is the one starting at index j.

rank[i] = j means that the suffix starting at index i is ranked j-th from smallest to largest.

rank tells you what place you came in; sa tells you who came in a given place.
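Because sa and rank are inverse permutations of each other, either one can be built from the other in a single pass. The sketch below uses 0-based indices (the definitions above are 1-based), and the function name build_rank is made up for this example.

```c
/* Fills rank so that rank[sa[i]] == i for every i in [0, n).
   sa must be a permutation of 0..n-1. */
void build_rank(const int *sa, int n, int *rank) {
    for (int i = 0; i < n; i++)
        rank[sa[i]] = i;   /* the suffix that came i-th has rank i */
}
```

For the string banana, whose 0-based suffix array is 5 3 1 0 4 2, this gives the ranks 3 2 5 1 4 0.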

The following diagram shows the idea of the algorithm above, but it made me dizzy when I first looked at it.

Let's take it step by step.

Whatever the algorithm, we should start from a brute-force approach. Suppose we simply want to rank all suffixes of a string by size. What would you do?

Compare them with two nested for loops. This is definitely slow.

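    #include <string.h>

    int rank[MAXN];  /* MAXN as in the code above */

    /* Brute force: the rank of suffix i is the number of suffixes smaller than it.
       Ranks are 0-based; each of the len * len comparisons may scan the string. */
    void smpstr(const char *str, int len) {
        for (int i = 0; i < len; i++) {
            int smaller = 0;
            for (int j = 0; j < len; j++) {
                if (strcmp(str + j, str + i) < 0) smaller++;
            }
            rank[i] = smaller;
        }
    }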

Given the special structure of suffixes, we can change the way we compare.

First, for convenience, subtract 'a'-1 from every letter, so that a becomes 1, b becomes 2, and so on. Here I only consider strings made up entirely of lowercase letters.

Suppose the string is aabaaaab.

Next, merge each pair of adjacent letters into one integer: 11 12 21 11 11 11 12 20 (the last letter has no neighbor, so it is padded with 0).

These merged integers can then be sorted with radix sort, since each one has exactly two digits. You might ask:

Isn't the letter z minus 'a'-1 greater than 10? Doesn't that make a three-digit number? No: treat 'z' - ('a'-1) = 26 as a single digit rather than as the two digits 2 and 6.

It is just like hexadecimal, where 15 is not treated as two digits but written as the single digit F. If you like, you can write 26 as Z.
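The encoding described above can be sketched as follows. Treating each letter as one digit in base 27 is an assumption of this example (any base larger than 26 would do), and the names letter_code and pair_keys are invented for the sketch.

```c
#include <string.h>

/* Maps 'a'..'z' to 1..26; 0 is reserved for "past the end of the string". */
int letter_code(char c) {
    return c - 'a' + 1;
}

/* key[i] is the two-digit base-27 number formed by s[i] and s[i+1].
   The last letter has no neighbor, so its second digit is 0.
   s must contain only lowercase letters; key needs strlen(s) entries. */
void pair_keys(const char *s, int *key) {
    int n = (int)strlen(s);
    for (int i = 0; i < n; i++) {
        int first = letter_code(s[i]);
        int second = (i + 1 < n) ? letter_code(s[i + 1]) : 0;
        key[i] = first * 27 + second;
    }
}
```

With this encoding, comparing two keys as integers gives the same result as comparing the two-letter strings, which is exactly what radix sort needs.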

Now let me explain why this matters: why are the letters merged two by two into a single number?

First, every suffix runs to the last letter of the string.

So the pair that starts at position 0 is the first two letters of suffix 0, the pair that starts at position 1 is the first two letters of suffix 1, and so on.

By analogy, this means I have split the string into groups of two.

Sorting these two-letter groups as integers with radix sort gives the ranks 1 2 4 1 1 1 2 3.

To explain: the first pair, 11, ranks first, and the second pair, 12, ranks second.

Have you noticed that the first two letters of suffix 0 through suffix 7 are now all ranked? That is because the first pair, 11, is the first two letters of suffix 0, the second pair, 12, is the first two letters of suffix 1, and so on.
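The ranking of the two-letter groups can be reproduced with a small counting pass. This is only a sketch: the base-27 encoding, the 1-based ranks with equal pairs sharing a rank, and the names two_letter_key and rank_pairs are assumptions of the example rather than part of the original code.

```c
#include <string.h>

#define BASE 27  /* assumed: digits 1..26 for 'a'..'z', 0 for "past the end" */

/* Two-letter key of the suffix starting at i, as a two-digit base-BASE number. */
static int two_letter_key(const char *s, int n, int i) {
    int first = s[i] - 'a' + 1;
    int second = (i + 1 < n) ? s[i + 1] - 'a' + 1 : 0;
    return first * BASE + second;
}

/* rank[i] becomes the 1-based rank of the first two letters of suffix i.
   Suffixes whose first two letters are equal receive the same rank.
   s must contain only lowercase letters. */
void rank_pairs(const char *s, int *rank) {
    int seen[BASE * BASE];
    int n = (int)strlen(s);
    memset(seen, 0, sizeof seen);
    for (int i = 0; i < n; i++)
        seen[two_letter_key(s, n, i)] = 1;        /* mark keys that occur */
    for (int k = 1; k < BASE * BASE; k++)
        seen[k] += seen[k - 1];                   /* distinct keys <= k */
    for (int i = 0; i < n; i++)
        rank[i] = seen[two_letter_key(s, n, i)];
}
```

For aabaaaab this produces 1 2 4 1 1 1 2 3: the pair 11 is the smallest, then 12, then 20, then 21.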

Now that we have compared the first two letters of all the suffixes, we move on to what comes after them. How do we compare the part after the first two letters? Since every two-letter group has already been compared, we can reuse the result above.

So now we sort 1121 1211 2111 1111 1112 1120 1200 2000. Each four-digit number splits into two groups of two letters; 1121, for example, splits into the two parts 11 and 21, which serve as the two digits for radix sort.
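One way to picture this step is to combine the rank of a suffix's first two letters with the rank of its next two letters into a single integer. The sketch below assumes the two-letter ranks are already known and 1-based; the name four_letter_keys and the choice of n + 1 as the base are made up for the illustration.

```c
/* rank2[i] is the 1-based rank of the first two letters of suffix i.
   key[i] combines rank2[i] (first keyword) with rank2[i + 2] (second keyword).
   Suffixes with fewer than three letters get 0 as their second keyword. */
void four_letter_keys(const int *rank2, int n, int *key) {
    int base = n + 1;  /* ranks never exceed n, so the two parts cannot collide */
    for (int i = 0; i < n; i++) {
        int second = (i + 2 < n) ? rank2[i + 2] : 0;
        key[i] = rank2[i] * base + second;
    }
}
```

For aabaaaab, taking the two-letter ranks as 1 2 4 1 1 1 2 3, comparing the resulting keys orders the suffixes by their first four letters. The real algorithm never builds such combined integers; it sorts by the second keyword and then by the first, which is what the rest of this walkthrough explains.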

Wait: have you noticed that the ranking above is related to both the first keyword and the second keyword?

The order by second keyword comes straight from the previous ranking, because the previous ranking already sorted every two-letter group, and each second keyword is one of those groups.

What about the relationship with the first keyword? Notice that the sequence of second keywords is the sequence of first keywords with the entries at the front (starting with 11) removed and 00 filled in at the end.

So I can do this: the 00 entries go straight to the front, and the relative order of everything else stays unchanged (think about it).
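This observation corresponds to the two loops at the top of the main loop in da. A standalone sketch, with the made-up name order_by_second_key and an extra guard for j larger than n that the original code does not have:

```c
/* sa lists the suffix start positions sorted by their first j characters.
   On return, y lists all n start positions sorted by second keyword,
   that is, by characters j .. 2j-1 of each suffix. */
void order_by_second_key(const int *sa, int n, int j, int *y) {
    int p = 0;
    int start = (n - j < 0) ? 0 : n - j;
    /* Suffixes shorter than j + 1 have an empty second keyword: they go first. */
    for (int i = start; i < n; i++) y[p++] = i;
    /* The second keyword of suffix sa[i] - j is the first keyword of suffix
       sa[i], so walking sa in order yields the remaining suffixes in order. */
    for (int i = 0; i < n; i++)
        if (sa[i] >= j) y[p++] = sa[i] - j;
}
```

No comparison is needed here: the previous round's sa already contains the order of every second keyword.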

Here is a vivid example. A lot of people are standing in a queue, at first in no particular order. A security guard asks them to line up from shortest to tallest.

After sorting, everyone has their own position. Then the guard leaves and the queue goes back to its original, unsorted state, except that the person who originally stood at the very front is replaced by a dwarf who must be the shortest of all. The guard comes back and asks for the line to be sorted again. The dwarf obviously goes to the very front. Then the guard calls out the person who ranked first in the last sorting: if that person is still in the queue, they line up next; if not (they were the one who was replaced), the guard simply moves on and calls whoever ranked next in the last sorting, and so on.
