Get a complete understanding of the suffix array

Source: Internet
Author: User

What is a suffix array? First of all, you need to know what a suffix is.

For example, take the string abcdef. Then abcdef, bcdef, cdef, def, ef and f are its suffixes: a suffix is a substring that starts at some letter and runs all the way to the last letter. (So bcd is not a suffix, because it does not reach the final f.)

What can a suffix array be used for? I'm not going to cover that here, because it is not the focus of this article. This article mainly explains how to write the code for a suffix array.

Reasons for writing this article

I had read a lot about suffix arrays before. The code is only twenty or thirty lines long, but I never found a blog post that explains it from beginning to end.

Perhaps I just didn't find the right one.

After studying on and off for a month, I finally have a deeper understanding of the doubling algorithm (it is only a name, so there is no need to worry about why it is called that).

This is the original code

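int wa[MAXN], wb[MAXN], wv[MAXN], ws[MAXN];

int cmp(int *r, int a, int b, int l) {
    return r[a] == r[b] && r[a + l] == r[b + l];
}

void da(int *r, int *sa, int n, int m) {
    int i, j, p, *x = wa, *y = wb, *t;
    /* first round: radix sort the suffixes by their first letter */
    for (i = 0; i < m; i++) ws[i] = 0;
    for (i = 0; i < n; i++) ws[x[i] = r[i]]++;
    for (i = 1; i < m; i++) ws[i] += ws[i - 1];
    for (i = n - 1; i >= 0; i--) sa[--ws[x[i]]] = i;
    /* each later round doubles the compared length j */
    for (j = 1, p = 1; p < n; j <<= 1, m = p) {
        /* y: suffixes ordered by the second keyword */
        for (p = 0, i = n - j; i < n; i++) y[p++] = i;
        for (i = 0; i < n; i++)
            if (sa[i] >= j) y[p++] = sa[i] - j;
        /* radix sort by the first keyword, keeping the order of y */
        for (i = 0; i < n; i++) wv[i] = x[y[i]];
        for (i = 0; i < m; i++) ws[i] = 0;
        for (i = 0; i < n; i++) ws[wv[i]]++;
        for (i = 1; i < m; i++) ws[i] += ws[i - 1];
        for (i = n - 1; i >= 0; i--) sa[--ws[wv[i]]] = y[i];
        /* swap x and y, then compute the new ranks in x */
        for (t = x, x = y, y = t, p = 1, x[sa[0]] = 0, i = 1; i < n; i++)
            x[sa[i]] = cmp(y, sa[i - 1], sa[i], j) ? p - 1 : p++;
    }
}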
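To try the routine above, one commonly used calling convention (an assumption here, since the article does not spell it out) is to map the letters to 1..26, append a 0 that is smaller than every letter, and pass the length including that 0. The sketch below repeats da so that it compiles on its own. MAXN = 1000 and the wrapper name suffix_array_of are illustrative choices, not part of the original code.

```c
#include <assert.h>
#include <string.h>

#define MAXN 1000  /* assumed capacity for this sketch */

int wa[MAXN], wb[MAXN], wv[MAXN], ws[MAXN];

int cmp(int *r, int a, int b, int l) {
    return r[a] == r[b] && r[a + l] == r[b + l];
}

void da(int *r, int *sa, int n, int m) {
    int i, j, p, *x = wa, *y = wb, *t;
    for (i = 0; i < m; i++) ws[i] = 0;
    for (i = 0; i < n; i++) ws[x[i] = r[i]]++;
    for (i = 1; i < m; i++) ws[i] += ws[i - 1];
    for (i = n - 1; i >= 0; i--) sa[--ws[x[i]]] = i;
    for (j = 1, p = 1; p < n; j <<= 1, m = p) {
        for (p = 0, i = n - j; i < n; i++) y[p++] = i;
        for (i = 0; i < n; i++)
            if (sa[i] >= j) y[p++] = sa[i] - j;
        for (i = 0; i < n; i++) wv[i] = x[y[i]];
        for (i = 0; i < m; i++) ws[i] = 0;
        for (i = 0; i < n; i++) ws[wv[i]]++;
        for (i = 1; i < m; i++) ws[i] += ws[i - 1];
        for (i = n - 1; i >= 0; i--) sa[--ws[wv[i]]] = y[i];
        for (t = x, x = y, y = t, p = 1, x[sa[0]] = 0, i = 1; i < n; i++)
            x[sa[i]] = cmp(y, sa[i - 1], sa[i], j) ? p - 1 : p++;
    }
}

/* Hypothetical wrapper: writes the 0-based suffix array of a lowercase
 * string s into sa_out[0..n-1] and returns n. */
int suffix_array_of(const char *s, int *sa_out) {
    static int r[MAXN], sa[MAXN];
    int n = (int)strlen(s);
    assert(n + 1 <= MAXN);
    for (int i = 0; i < n; i++) {
        assert(s[i] >= 'a' && s[i] <= 'z');
        r[i] = s[i] - 'a' + 1;            /* 'a' -> 1 ... 'z' -> 26 */
    }
    r[n] = 0;                             /* sentinel, smaller than any letter */
    da(r, sa, n + 1, 27);                 /* values lie in 0..26, so m = 27 */
    for (int i = 0; i < n; i++)
        sa_out[i] = sa[i + 1];            /* sa[0] is the sentinel itself */
    return n;
}
```

For the string aabaaaab, suffix_array_of fills sa_out with 3 4 5 0 6 1 7 2, that is, the suffix starting at index 3 (aaaab) is the smallest and the one starting at index 2 (baaaab) is the largest.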

To understand the code above, you first need to know what radix sort is (see the Baidu Baike entry on radix sort).
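As a quick refresher, here is a minimal sketch of radix sort on two-digit decimal keys. It is only an illustration: the names counting_pass and radix_sort_two_digit and the choice of base 10 are assumptions made for this example. The three loops inside counting_pass (count, prefix sums, backwards placement) have the same shape as the ws loops in the code above.

```c
#include <assert.h>

#define RADIX 10  /* assumed base for this illustration */

/* One stable counting pass: order a[0..n-1] by the digit (a[i] / div) % RADIX. */
static void counting_pass(const int *a, int *out, int n, int div) {
    int count[RADIX] = {0};
    for (int i = 0; i < n; i++)
        count[(a[i] / div) % RADIX]++;
    /* prefix sums: count[d] = number of keys whose digit is <= d */
    for (int d = 1; d < RADIX; d++)
        count[d] += count[d - 1];
    /* walk backwards so that equal digits keep their relative order */
    for (int i = n - 1; i >= 0; i--)
        out[--count[(a[i] / div) % RADIX]] = a[i];
}

/* Radix sort for keys in 0..99: sort by the low digit, then by the high digit.
 * tmp must have room for n ints. */
void radix_sort_two_digit(int *a, int *tmp, int n) {
    assert(n >= 0);
    counting_pass(a, tmp, n, 1);      /* ones digit */
    counting_pass(tmp, a, n, RADIX);  /* tens digit */
}
```

Sorting by the less significant digit first only works because each pass is stable, which is why the last loop runs from the end of the array.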

Assuming you already understand radix sort, let's work through the code above.

Before that, there are two concepts you need to know.

Suffix array (sa[i] holds the starting position of the i-th smallest suffix). The formal definition quoted below can be skipped.

The suffix array SA is a one-dimensional array that holds a permutation sa[1], sa[2], ..., sa[n] of 1..n such that Suffix(sa[i]) < Suffix(sa[i+1]) for 1 ≤ i < n. That is, the n suffixes of S are sorted from smallest to largest, and the starting positions of the sorted suffixes are stored in SA in order.

Rank array (rank[i] holds the rank of each suffix).

rank[i] is the rank that the suffix starting at index i has among all suffixes when they are sorted from smallest to largest.

To summarize: sa[i] = j means that the i-th suffix in sorted order is the one that starts at index j.

rank[i] = j means that the suffix starting at index i is ranked j-th in sorted order.

rank tells you what place you came in; sa tells you who came in a given place (just remember this).
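Because the two arrays answer opposite questions, each can be computed from the other. A minimal sketch, assuming 0-based indices and ranks (the quoted definition uses 1-based ones) and a hypothetical helper name build_rank:

```c
#include <assert.h>

/* Hypothetical helper: fill rank[] from a 0-based suffix array sa[] of length n. */
void build_rank(const int *sa, int *rank, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        /* sa[i] answers "who is in place i", so that suffix's place is i */
        rank[sa[i]] = i;
    }
}
```

For the string aabaaaab, whose 0-based suffix array is 3 4 5 0 6 1 7 2, this gives rank = 3 5 7 0 1 2 4 6.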

The following diagram shows the idea behind the algorithm above, but it made me dizzy when I first looked at it.

Let's take it step by step.

First of all, whatever the final algorithm is, let's start with the brute-force approach. Suppose we simply want to compare all the suffixes of a string with one another (comparing suffixes means ordinary lexicographic string comparison, which you should already know). What would you do?

Got it!

Compare them with two nested for loops. But this algorithm is definitely slow.

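#include <string.h>

/* rank[] is a global int array; rank[i] = number of suffixes smaller than suffix i */
void smpstr(const char *str, int len) {
    for (int i = 0; i < len; i++) {
        int k = 0;
        for (int j = 0; j < len; j++) {
            if (strcmp(str + i, str + j) > 0) k++;
        }
        rank[i] = k;
    }
}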

  

Given the special nature of suffixes, we will change the way we compare.

Why special? Because all the suffixes of a string are related to one another.

For example, take the string abcdef.

The suffixes bcdef and cdef are closely related: the latter is exactly the former with its first letter removed.

So how do we make use of this relationship? Please keep reading.

First of all, for convenience, we subtract 'a' - 1 from every letter, so that a becomes 1, b becomes 2, and so on. Here I only consider strings made up of lowercase letters.

Suppose the string is aabaaaab.

Next, combine each pair of adjacent numbers into a single integer.

The resulting sequence of merged integers is then sorted with radix sort, treating each integer as a number with a fixed count of digits. About the number of digits, you might ask:

Isn't 'z' minus 'a' - 1 greater than 10? Wouldn't that give a 3-digit number? No: 'z' - ('a' - 1) = 26 is treated as a single digit, not as the two digits 2 and 6.

It is just like hexadecimal, where 15 is not treated as two digits but is written as the single digit F. If you like, you can write 26 as the digit Z.
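One way to keep every letter a single digit is to work in base 27, with digit 0 reserved for "past the end of the string". The sketch below is an assumption about how the merge could be coded, not the article's own code; the names BASE and pair_keys are made up for the illustration.

```c
#include <assert.h>
#include <string.h>

#define BASE 27  /* digits 0..26: 0 = past the end, 1..26 = 'a'..'z' */

/* Hypothetical helper: key[i] packs the letters at positions i and i + 1
 * into one two-digit number in base BASE. key must have room for strlen(s) ints. */
void pair_keys(const char *s, int *key) {
    int n = (int)strlen(s);
    for (int i = 0; i < n; i++) {
        assert(s[i] >= 'a' && s[i] <= 'z');               /* lowercase only */
        int hi = s[i] - ('a' - 1);                         /* 'a' -> 1 ... 'z' -> 26 */
        int lo = (i + 1 < n) ? s[i + 1] - ('a' - 1) : 0;   /* pad the last pair with 0 */
        key[i] = hi * BASE + lo;
    }
}
```

For aabaaaab the digit pairs are 11 12 21 11 11 11 12 20, which in base 27 are the integers 28 29 55 28 28 28 29 54. Comparing these integers is the same as comparing the two-letter groups.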

Now let me explain why this matters: why do we combine the letters two by two into a single number?

First, every suffix runs to the last character of the string.

Second, the suffixes overlap: the second and third letters of suffix 0 are the first two letters of suffix 1.

By the same reasoning at every position, this means the string is split into groups of two.

Sorting these two-letter groups with radix sort gives each group a rank.

To explain: the first group, 11, is ranked first, and the second group, 12, is ranked second.

So notice that the first two letters of every suffix, from suffix 0 to suffix 7, have now been ranked, because the first group, 11, is the first two letters of the first suffix and the second group, 12, is the first two letters of the second suffix.
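The ranking of the two-letter groups can be sketched with a bucket table, in the spirit of the counting pass of radix sort. This is a minimal illustration under the assumption that each group has been packed into an integer below 27 * 27 (letters mapped to 1..26, 0 for past the end); the names MAX_KEY and rank_pair_keys are hypothetical.

```c
#include <assert.h>

#define MAX_KEY 729  /* 27 * 27 possible two-digit keys in base 27 */

/* Hypothetical helper: rank_out[i] = 1-based rank of key[i] among the
 * distinct key values; equal keys share the same rank. */
void rank_pair_keys(const int *key, int *rank_out, int n) {
    int seen[MAX_KEY] = {0};
    for (int i = 0; i < n; i++) {
        assert(key[i] >= 0 && key[i] < MAX_KEY);
        seen[key[i]] = 1;                 /* mark which keys occur */
    }
    /* prefix sums: seen[k] becomes the number of distinct keys <= k */
    for (int k = 1; k < MAX_KEY; k++)
        seen[k] += seen[k - 1];
    for (int i = 0; i < n; i++)
        rank_out[i] = seen[key[i]];
}
```

For aabaaaab the groups 11 12 21 11 11 11 12 20 receive the ranks 1 2 4 1 1 1 2 3: every 11 is ranked first, every 12 second, 20 third and 21 fourth.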

Not sure what that means? Look at the picture.

Okay, now that we have compared the first two letters of all the suffixes, the next step is to compare the letters that come after the first two. Because every group of two letters has already been compared, I can reuse those results, as the picture shows: after merging, 1121 is the first four letters of the first suffix and 1211 is the first four letters of the second suffix.

Let's merge once more. This time each number has eight digits, which is exactly the length of the string, and it can still be compared with radix sort. But if the string has length 10,000, then there are 10,000 suffixes, each up to 10,000 letters long, so the final merged numbers also have up to 10,000 digits each: 10,000 × 10,000 digits in total. We cannot allocate that much memory. So can we reduce the size of each merged number?

Look at the picture first.

First, all we ultimately want from the suffix array is the ranking of the suffixes, so whether a key is written as 1112 or 11, as 1221 or 24, doesn't matter.

I only need to keep them in the right relative order. For example, Xiao Ming scored 100 points, Xiao Hong scored 89 points and Xiao Gang scored 55 points.

Xiao Ming comes first, and that is all we need to know.
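The idea of replacing a long key by its small rank after every merge can be sketched as one "doubling step". This is an illustration only: it sorts with the standard qsort for brevity instead of radix sort, and the name double_step and its signature are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper for one doubling step.
 * x[i]   : 1-based rank of the first j letters of suffix i
 * out[i] : 1-based rank of the first 2 * j letters of suffix i
 * Returns the number of distinct ranks in out. */
int double_step(const int *x, int *out, int n, int j) {
    assert(n > 0 && j > 0);
    long long *key = malloc((size_t)n * sizeof *key);
    long long *sorted = malloc((size_t)n * sizeof *sorted);
    assert(key != NULL && sorted != NULL);
    for (int i = 0; i < n; i++) {
        int second = (i + j < n) ? x[i + j] : 0;        /* 0 = ran off the end */
        key[i] = (long long)x[i] * (n + 1) + second;    /* two digits in base n + 1 */
        sorted[i] = key[i];
    }
    qsort(sorted, (size_t)n, sizeof *sorted, cmp_ll);
    int m = 1;                                          /* drop duplicate keys */
    for (int i = 1; i < n; i++)
        if (sorted[i] != sorted[m - 1]) sorted[m++] = sorted[i];
    for (int i = 0; i < n; i++) {                       /* small rank replaces big key */
        int r = 0;
        while (sorted[r] != key[i]) r++;
        out[i] = r + 1;
    }
    free(key);
    free(sorted);
    return m;
}
```

Starting from the two-letter ranks 1 2 4 1 1 1 2 3 of aabaaaab and calling double_step with j = 2 gives 4 6 8 1 2 3 5 7. All eight ranks are now different, so the suffixes are fully sorted, and no key ever needed more than two small digits.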

So now we sort 1121 1211 2111 1111 1112 1120 1200 2000 by splitting each number into two groups: the first two letters form one group and the last two letters form another. For example, the four digits 1121 are radix sorted as the two parts 11 and 21.

Wait, have you noticed that the ranking we obtained above is related to both the first keyword and the second keyword? That is:

The ranking we already have is the ranking of the second keywords. Why? Because that ranking is exactly the result of sorting the second keywords.

So what is the relationship with the first keywords? Notice that the second keywords are just the first keywords with the leading 11 removed and a 00 added at the end.

To give a vivid example: a lot of people are standing in a queue in no particular order, and a security guard asks them to line up from shortest to tallest.

After lining up, everyone has a position. Then the guard leaves and the queue goes back to its original, unordered state, except that the person who originally stood at the front has gone and a dwarf, who is certainly the shortest, has arrived. The guard comes back and asks everyone to line up again. The dwarf must stand at the very front. The guard then calls out the people in the order of the previous line-up, starting with whoever was first. If that person is no longer there, the guard simply moves on and calls the next person in the previous ranking; otherwise that person takes the next place in the line.

The above story corresponds to the following code

for (p = 0, i = n - j; i < n; i++)
    y[p++] = i;
for (i = 0; i < n; i++)
    if (sa[i] >= j)
        y[p++] = sa[i] - j;
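To see these two loops in isolation, they can be wrapped in a small function. This is a sketch under assumptions: the name order_by_second_key and its signature are hypothetical, and sa is taken to hold the suffixes already sorted by their first j letters.

```c
#include <assert.h>

/* Hypothetical helper mirroring the two loops above.
 * sa : start positions of the suffixes, sorted by their first j letters
 * y  : on return, the suffixes ordered by their second keyword, that is,
 *      by the j letters that follow the first j */
void order_by_second_key(const int *sa, int *y, int n, int j) {
    assert(j > 0 && j <= n);
    int p = 0;
    /* the last j suffixes have no second keyword: they go first (the "dwarfs") */
    for (int i = n - j; i < n; i++)
        y[p++] = i;
    /* walk the previous ranking; suffix sa[i] is the second keyword of suffix sa[i] - j */
    for (int i = 0; i < n; i++)
        if (sa[i] >= j)
            y[p++] = sa[i] - j;
    assert(p == n);  /* every suffix was placed exactly once */
}
```

For aabaaaab with j = 1, sorting by the first letter alone (ties kept in position order) gives sa = 0 1 3 4 5 6 2 7, and the function produces y = 7 0 2 3 4 5 1 6: suffix 7 has nothing after its first letter, suffixes 0 2 3 4 5 are followed by an a, and suffixes 1 and 6 are followed by a b. No extra sorting pass is needed because the previous ranking is reused.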
