"Suffix array"

Source: Internet
Author: User

A suffix array is a powerful tool for working with strings. ———— Ro

• Preface

Among the suffix tree, the suffix automaton and the suffix array, after weighing code length against practicality, the author chose to learn the suffix array. As usual, this article gives Rice Cake's own understanding of the suffix array in as simple and approachable a way as possible.

• LCP: introducing the problem

LCP (Longest Common Prefix) is the longest common prefix of two strings. Here is a question: given a string and many queries, each query (a, b) asks for the length of the LCP of the suffix of length a and the suffix of length b. Two methods:

① Brute force: for each query, scan the two suffixes side by side to get the LCP. O(N) per query.

② Suffix array: for each query, perform a ___ operation to get the LCP. O(log N) per query.

To solve this problem, the following sections explain the suffix array and, along the way, work out the solution.
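Method ① can be sketched in a few lines. This is a minimal illustration, assuming suffixes are identified by their 1-based starting position (the numbering the rest of the article uses); the function name lcp_brute is hypothetical.

```cpp
#include <cassert>
#include <string>

// Brute-force LCP of the suffixes of s that start at 1-based positions a and b.
// Each query costs O(n) character comparisons.
int lcp_brute(const std::string& s, int a, int b) {
    const int n = static_cast<int>(s.size());
    assert(1 <= a && a <= n && 1 <= b && b <= n);
    int len = 0;
    // Advance while both suffixes still have characters and those characters agree.
    while (a + len <= n && b + len <= n && s[a + len - 1] == s[b + len - 1]) {
        ++len;
    }
    return len;
}
```

With q queries this costs O(q * n) in total, which is the bound the suffix array approach is meant to beat.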

• Sorting suffixes: building the suffix array

When you meet a new algorithm, you want to know what the new gadget looks like. Here is a simple example that builds a suffix array for a concrete string:

[Example] Take the string s = "abcacabc".

First, give each suffix a number: the index of its left end in s (for example, the suffix of length 2, "bc", has number 7). Sorting all suffixes in dictionary order from small to large then gives:

  rank 1: abc       (pos 6)
  rank 2: abcacabc  (pos 1)
  rank 3: acabc     (pos 4)
  rank 4: bc        (pos 7)
  rank 5: bcacabc   (pos 2)
  rank 6: c         (pos 8)
  rank 7: cabc      (pos 5)
  rank 8: cacabc    (pos 3)

Do you see the suffix array? Let's go through it step by step. Every suffix of the original string is taken out and sorted in dictionary order from small to large. Each suffix carries two labels: rank and pos. rank is its place in dictionary order (how many-th smallest it is), and pos is the suffix number described above. Reading the numbers off in sorted order into an array sa[], so that sa[i] is the number of the i-th smallest suffix, gives the suffix array. Here sa[] = {6, 1, 4, 7, 2, 8, 5, 3}.
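The definition can be turned into code directly by sorting the suffix numbers with an ordinary comparison sort. This is only a sketch to make the definition concrete, not the construction the article develops later; the names suffix_less and build_sa_naive are hypothetical, and the returned array is 1-based with index 0 unused.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// True if the suffix starting at 1-based position a sorts before the one at b.
bool suffix_less(const std::string& s, int a, int b) {
    const int n = static_cast<int>(s.size());
    assert(1 <= a && a <= n && 1 <= b && b <= n);
    return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
}

// Naive suffix array: sa[i] is the number of the i-th smallest suffix (sa[0] unused).
std::vector<int> build_sa_naive(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n + 1, 0);
    std::iota(sa.begin() + 1, sa.end(), 1);  // suffix numbers 1..n
    std::sort(sa.begin() + 1, sa.end(),
              [&s](int a, int b) { return suffix_less(s, a, b); });
    return sa;
}
```

Each comparison may scan O(n) characters, so this costs O(n^2 log n) in the worst case; it is useful mainly for checking faster constructions on small inputs.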

But the suffix array is sold as a bundle: it comes with two more arrays whose usefulness should not be underestimated, rank[] and height[].

rank[] here is indexed differently from the rank label in the picture: rank[] is the inverse array of sa[]. Comparing the two definitions makes this clear:

  ① sa[i]: the number of the suffix that is i-th smallest

  ② rank[i]: the rank of the suffix numbered i

height[] is the highlight. Its definition: height[i] is the LCP length of the i-th smallest suffix and the (i-1)-th smallest suffix. This may give the impression that height[] must be slow to build (for example by brute-force preprocessing), but as shown later, all height[i] can be computed in O(n).
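Both definitions can be checked on the example with a direct, unoptimized sketch. The function names build_rank and build_height_naive are hypothetical; arrays are 1-based with index 0 unused, and height[1] is taken to be 0 because the smallest suffix has no predecessor (an assumption of this sketch).

```cpp
#include <cassert>
#include <string>
#include <vector>

// rank[] is the inverse permutation of sa[]: rank[sa[i]] = i.
std::vector<int> build_rank(const std::vector<int>& sa) {
    const int n = static_cast<int>(sa.size()) - 1;
    std::vector<int> rank(n + 1, 0);
    for (int i = 1; i <= n; ++i) {
        assert(1 <= sa[i] && sa[i] <= n);
        rank[sa[i]] = i;
    }
    return rank;
}

// Brute-force height[]: height[i] = LCP of the i-th and (i-1)-th smallest suffixes.
std::vector<int> build_height_naive(const std::string& s, const std::vector<int>& sa) {
    const int n = static_cast<int>(s.size());
    assert(static_cast<int>(sa.size()) == n + 1);
    std::vector<int> height(n + 1, 0);
    for (int i = 2; i <= n; ++i) {
        const int a = sa[i - 1], b = sa[i];
        int len = 0;
        while (a + len <= n && b + len <= n && s[a + len - 1] == s[b + len - 1]) ++len;
        height[i] = len;
    }
    return height;
}
```

For s = "abcacabc" this yields rank[1..8] = {2, 5, 8, 3, 7, 1, 4, 6} and height[1..8] = {0, 3, 1, 0, 2, 0, 1, 2}. The naive height costs O(n^2) in the worst case.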

• Solving the LCP problem: the suffix array in action

Back to the initial question: how do we get the LCP length of any two suffixes with a suffix array? If the two suffixes are adjacent in rank, the answer is simply height[i], where i is the larger of the two ranks. But what if they are not adjacent? For example, ask for the LCP length of the suffixes numbered 4 and 6:

We find that the LCP of 4 and 6 equals the minimum of LCP(6, 1) and LCP(1, 4). In general, the LCP of the suffixes numbered i and j, with rank[i] < rank[j], is the minimum of height[] over the ranks rank[i]+1 through rank[j] in the sorted suffix sequence. (Don't forget the definition of height!) In the example, suffixes 6, 1 and 4 have ranks 1, 2 and 3, so LCP(4, 6) = min(height[2], height[3]) = min(3, 1) = 1.

The advantage is that each query becomes a range-minimum query, so it can be answered quickly with an RMQ structure. Although the construction of sa[] and height[] has not been covered yet, we can already see why this method is superior: sa[] takes O(n log n) time to build, height[] takes O(n), and with RMQ on top the whole preprocessing is O(n log n), after which each query is fast. This easily beats brute force. A small reminder: it is important to keep rank (place in sorted order) and number (starting position) apart.
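One possible RMQ structure is a sparse table, sketched here under the assumption that height[] and rank[] are already available as 1-based vectors with index 0 unused. The article does not say which RMQ structure it has in mind; a sparse table answers each query in O(1) after O(n log n) preprocessing, while a segment tree would give the O(log N) per query quoted earlier. The struct name HeightRmq is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sparse table over height[1..n] for range-minimum queries.
struct HeightRmq {
    std::vector<std::vector<int>> table;  // table[j][i] = min of height[i .. i + 2^j - 1]
    std::vector<int> rank;                // rank[i] = rank of the suffix numbered i

    HeightRmq(const std::vector<int>& height, const std::vector<int>& rank_in)
        : rank(rank_in) {
        const int n = static_cast<int>(height.size()) - 1;
        table.push_back(height);
        for (int j = 1; (1 << j) <= n; ++j) {
            const int half = 1 << (j - 1);
            std::vector<int> row(n + 1, 0);
            for (int i = 1; i + (1 << j) - 1 <= n; ++i) {
                row[i] = std::min(table[j - 1][i], table[j - 1][i + half]);
            }
            table.push_back(row);
        }
    }

    // LCP of the suffixes numbered a and b (1-based, a != b).
    int lcp(int a, int b) const {
        assert(a != b);
        const int lo = std::min(rank[a], rank[b]) + 1;  // height[lo..hi] spans the gap
        const int hi = std::max(rank[a], rank[b]);
        int j = 0;
        while ((1 << (j + 1)) <= hi - lo + 1) ++j;      // largest j with 2^j <= hi - lo + 1
        return std::min(table[j][lo], table[j][hi - (1 << j) + 1]);
    }
};
```

For the example string, constructing HeightRmq from height[1..8] = {0, 3, 1, 0, 2, 0, 1, 2} and rank[1..8] = {2, 5, 8, 3, 7, 1, 4, 6} and calling lcp(4, 6) returns 1, matching the hand computation.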

• Radix sort: O(n log n) preprocessing of the sa array
Although this is only preprocessing, it is important and has many details to understand (Rujia writes in his book: "There's a lot of detail in the code implementation ...").

Radix sort is similar to bucket sort. Let c[i] be the number of elements equal to i, then accumulate: c[i] += c[i-1]. Now enumerate the input in reverse order: c[a[j]] is the position of a[j] in the sorted order, after which c[a[j]]-- is applied. For example: given a sequence of n positive integers, none exceeding 100, output the elements sorted from small to large. Working through this small problem is a quick way to understand radix sort.
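A minimal sketch of that exercise follows. The function name counting_sort and the default bound of 100 come from the example problem, not from the article's own code; the three loops correspond to the count, accumulate and reverse-placement steps described above.

```cpp
#include <cassert>
#include <vector>

// Counting sort for positive integers not exceeding max_value.
std::vector<int> counting_sort(const std::vector<int>& a, int max_value = 100) {
    std::vector<int> c(max_value + 1, 0);
    for (int v : a) {
        assert(1 <= v && v <= max_value);
        ++c[v];                                             // c[v] = number of elements equal to v
    }
    for (int v = 1; v <= max_value; ++v) c[v] += c[v - 1];  // c[v] = number of elements <= v
    std::vector<int> sorted(a.size());
    for (int j = static_cast<int>(a.size()) - 1; j >= 0; --j) {
        // c[a[j]] is the 1-based final position of a[j]; the output vector is 0-based,
        // so decrement first. Scanning in reverse keeps equal elements in input order.
        sorted[--c[a[j]]] = a[j];
    }
    return sorted;
}
```

The reverse scan is what makes the sort stable, and stability is exactly the property the two-keyword sort in the suffix array construction relies on.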

Note that preprocessing the suffix array (the sa array) needs more than radix sort; it also relies on two further ideas:

Idea one: there is no need to compare every character. To compare two suffixes, start from the first position; as soon as a differing position is found, the order is decided. So the preprocessing goes like this: first compare only the first character of every suffix and rank them; if some ranks are tied (same first character), compare the second character as well, and so on, until no two ranks are equal. At that point the order of all suffixes is resolved and the loop can stop.

Idea two: the method at the end of idea one can be optimized by doubling. All suffixes first compare their first 1 character; if ranks are tied, compare the first 2 characters, then the first 4, the first 8, and so on. This works because a comparison of length 2k can be assembled from two results of length k: the rank of the first k characters of suffix i, and the rank of the first k characters of suffix i+k. Note that each round is therefore a two-keyword sort. The following shows part of the ordering for the sample string:

You can see that the sa array in the diagram is not yet correct, because the three suffixes at sa[1], sa[2], sa[3] should be tied rather than given three distinct places. But this is only temporary: remember that the loop ends only when no ranks are tied, that is, when rank has no repeated element. The image above is the initialization; the next step is to let the preprocessing run. Since the first character alone is not enough to separate the suffixes, compare the first 2 characters:

It seems the whole preprocessing can be done directly on rank. But how do you write a radix sort on two keywords? The trick: for a two-keyword sort, first sort by the lower-priority (second) keyword, then stably sort by the first keyword. So we first get the suffixes in order of their second keyword (the second element of each pair in the diagram), and then radix sort by the first keyword.

To order by the second keyword we can use the current provisional sa[] directly. sa is already sorted, so walking sa in index order and placing the corresponding suffix numbers into an array yields the second-keyword order; this array is then sorted by the first keyword. The position of each suffix after this sort gives its new rank. Be sure to distinguish rank from sa! (This paragraph is probably the hardest to understand; the code walkthrough below explains it in a different way.)

At the end of each round we obtain a new sa and a new rank. Since the compared length k doubles each round, there are O(log n) rounds of O(n) work, so the algorithm runs in O(n log n).
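Before adding radix sort, the doubling idea can be tried on its own with std::sort doing the two-keyword sort. This is a simplified sketch, not the article's method: it runs in O(n log^2 n) rather than O(n log n), and the function name build_sa_doubling is hypothetical. Arrays are 1-based with index 0 unused.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Prefix doubling with a comparison sort in place of radix sort.
std::vector<int> build_sa_doubling(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n + 1, 0), rank(n + 1, 0), next_rank(n + 1, 0);
    std::iota(sa.begin() + 1, sa.end(), 1);
    // Initial rank: the first character only (+1 so that 0 can mean "past the end").
    for (int i = 1; i <= n; ++i) rank[i] = static_cast<unsigned char>(s[i - 1]) + 1;

    int p = 0;  // number of distinct ranks after the latest round
    for (int k = 1; k <= n && p < n; k <<= 1) {
        // Key of suffix i: (rank of its first k characters, rank of the next k, or 0).
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + k <= n ? rank[i + k] : 0);
        };
        std::sort(sa.begin() + 1, sa.end(),
                  [&](int a, int b) { return key(a) < key(b); });

        // Re-rank: suffixes with equal keys stay tied until a later round.
        p = 0;
        for (int i = 1; i <= n; ++i) {
            if (i == 1 || key(sa[i - 1]) != key(sa[i])) ++p;
            next_rank[sa[i]] = p;
        }
        rank.swap(next_rank);
    }
    assert(p == n);  // every suffix ends with a distinct rank
    return sa;
}
```

Replacing std::sort with two stable counting passes (second keyword, then first keyword) is what brings each round down to O(n).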

The detailed code for preprocessing sa is the SA() routine in the final code at the end of this article (try to understand it; the parts that are unclear are analyzed next).

There, s[i] is the original string, k means that the current round compares the first 2*k characters, y[] is the array used for the two-keyword sort mentioned above, and x[] is the working rank array that gets updated at the end of each round. (For convenience the last part swaps x and y directly, so their meanings are swapped too.)

Step-by-step analysis:

The code before the while loop is the initialization: it computes the rank of each suffix and the corresponding sa[] when only the first character is compared.

Inside the while loop:

① Use sa[] to sort by the second keyword:

The first line handles the suffixes in the interval [n-k+1, n]: they have no second keyword, i.e. their pair is (key1, 0), because we are now comparing 2*k characters. They therefore come first.

The second line uses the order of the sa array to put the second keywords in order. Note why it is sa[i]-k: just as in the picture, the second keyword of each pair comes from position +k in the old rank array, so when walking sa we subtract k to get the suffix whose pair holds that second keyword.

② Radix sort the y[] array, already ordered by the second keyword, by the first keyword:

The first three lines are the standard radix sort. The last line produces the new sa array.

③ Update x in the order of the new sa (that is, update rank):

Note that y here is actually the old x[] (the old rank). The condition on the fourth line means that if two neighbouring suffixes in the new order have the same pair of keywords, they get the same rank and p must not be incremented. p counts the number of distinct ranks; when it equals n, the while loop can exit.

In summary, the difficulty of this part is that sa and rank are preprocessed at the same time. Because the two arrays are inverses of each other, the order of array subscripts is easy to confuse.
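The three steps can be collected into one standalone function. This sketch follows the SA() routine of the final program and keeps its names x, y, c, m, k and p, with these assumptions of its own: the input is a 0-based std::string, the initial key range m is 256 (byte value plus one), and x and y are padded with zeros so that reading x[i + k] past the end yields 0. The function name build_sa_radix is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Prefix doubling with radix sort: returns sa[1..n] (sa[0] unused).
std::vector<int> build_sa_radix(const std::string& str) {
    const int n = static_cast<int>(str.size());
    int m = 256;  // largest key value in the first pass (assumed byte alphabet)
    std::vector<int> sa(n + 1, 0), c(std::max(m, n) + 1, 0);
    std::vector<int> x(2 * n + 2, 0), y(2 * n + 2, 0);  // zero padding past index n

    // Initialization: counting sort on the first character only.
    for (int i = 1; i <= n; ++i) c[x[i] = static_cast<unsigned char>(str[i - 1]) + 1]++;
    for (int i = 2; i <= m; ++i) c[i] += c[i - 1];
    for (int i = n; i >= 1; --i) sa[c[x[i]]--] = i;

    int p = 0;
    for (int k = 1; k <= n && p < n; k <<= 1) {
        // Step 1: list the suffixes in order of their second keyword.
        p = 0;
        for (int i = n - k + 1; i <= n; ++i) y[++p] = i;  // no second keyword: first
        for (int i = 1; i <= n; ++i)
            if (sa[i] > k) y[++p] = sa[i] - k;            // the rest follow the order of sa

        // Step 2: stable counting sort of y[] by the first keyword x[].
        for (int i = 1; i <= m; ++i) c[i] = 0;
        for (int i = 1; i <= n; ++i) c[x[y[i]]]++;
        for (int i = 2; i <= m; ++i) c[i] += c[i - 1];
        for (int i = n; i >= 1; --i) sa[c[x[y[i]]]--] = y[i];

        // Step 3: rebuild the ranks in the order of the new sa; y now holds the old ranks.
        std::swap(x, y);
        p = 1;
        x[sa[1]] = 1;
        for (int i = 2; i <= n; ++i) {
            const bool same = y[sa[i - 1]] == y[sa[i]] &&
                              y[sa[i - 1] + k] == y[sa[i] + k];
            x[sa[i]] = same ? p : ++p;
        }
        m = p;  // ranks now lie in 1..p
    }
    assert(p == n);  // all ranks distinct on exit
    return sa;
}
```

Setting m = p after each round keeps the counting arrays as small as the number of distinct ranks, which is why c only needs max(256, n) + 1 entries.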

• Finding the pattern: O(n) preprocessing of the height array

After the jumble of radix-sort preprocessing, the linear recurrence for the height array looks fresh and elegant. First the conclusion: if height[rank[i]] is computed in increasing order of the suffix number i, the preprocessing runs in linear time. This rests on the following fact (note the definition of rank):

height[rank[i]] >= height[rank[i-1]] - 1

The proof is as follows:

The suffix numbered i is the suffix numbered (i-1) with its first character removed. Now consider the dictionary order. Let x be the suffix ranked immediately before suffix i-1; their LCP is height[rank[i-1]]. If this value is at most 1 the inequality is trivial, so assume it is larger. Remove the first character of x to obtain another suffix, x+1. Since x and suffix i-1 agree on their first height[rank[i-1]] characters, x+1 and suffix i agree on their first height[rank[i-1]]-1 characters, and x+1 still sorts before suffix i. Every suffix ranked between x+1 and suffix i shares at least that many leading characters with suffix i, in particular the one ranked immediately before suffix i. So height[rank[i]] cannot be less than height[rank[i-1]]-1.

The code for this part is the Height() routine in the final code (height is abbreviated to h, rank to r).
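A standalone sketch of the same recurrence, assuming a 0-based std::string, a 1-based sa with index 0 unused, and height[1] = 0 by convention. The function name build_height is hypothetical; the explicit bounds checks replace the terminator character that the final program relies on.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Linear-time height[] using height[rank[i]] >= height[rank[i-1]] - 1.
std::vector<int> build_height(const std::string& s, const std::vector<int>& sa) {
    const int n = static_cast<int>(s.size());
    assert(static_cast<int>(sa.size()) == n + 1);
    std::vector<int> rank(n + 1, 0), height(n + 1, 0);
    for (int i = 1; i <= n; ++i) rank[sa[i]] = i;

    int k = 0;  // LCP length carried over from suffix i-1
    for (int i = 1; i <= n; ++i) {
        if (rank[i] == 1) {  // the smallest suffix has no predecessor
            k = 0;
            continue;
        }
        if (k > 0) --k;                 // lower bound given by the inequality
        const int j = sa[rank[i] - 1];  // suffix ranked just before suffix i
        while (i + k <= n && j + k <= n && s[i + k - 1] == s[j + k - 1]) ++k;
        height[rank[i]] = k;
    }
    return height;
}
```

k drops by at most 1 per iteration and never exceeds n, so the inner while loop performs O(n) character comparisons in total.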

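[Final Code] (based on HDU 1403):

    #include <cstdio>
    #include <cstring>
    #include <algorithm>
    #define go(i,a,b) for (int i = a; i <= b; i++)
    #define ro(i,a,b) for (int i = a; i >= b; i--)
    using namespace std;

    const int N = 200003;
    char s[N];
    int c[N], m, n, t1[N], t2[N], sa[N], h[N], r[N], mid;

    // Build sa[1..n] by prefix doubling with radix sort.
    void SA() {
        int *x = t1, *y = t2, k = 1, p = 0;
        t1[n + 1] = t2[n + 1] = 0;  // clear stale ranks left by an earlier, longer test case
        go(i, 1, m) c[i] = 0;
        go(i, 1, n) c[x[i] = s[i]]++;
        go(i, 2, m) c[i] += c[i - 1];
        ro(i, n, 1) sa[c[x[i]]--] = i;
        while (k <= n && p < n) {
            p = 0;
            go(i, n - k + 1, n) y[++p] = i;                 // no second keyword
            go(i, 1, n) if (sa[i] > k) y[++p] = sa[i] - k;  // second keyword, in sa order
            go(i, 1, m) c[i] = 0;
            go(i, 1, n) c[x[y[i]]]++;
            go(i, 2, m) c[i] += c[i - 1];
            ro(i, n, 1) sa[c[x[y[i]]]--] = y[i];            // new sa
            swap(x, y);                                     // y is now the old rank
            p = x[sa[1]] = 1;
            go(i, 2, n) x[sa[i]] = y[sa[i - 1]] == y[sa[i]] && y[sa[i - 1] + k] == y[sa[i] + k] ? p : ++p;
            m = p;
            k <<= 1;
        }
    }

    // Build h[] in O(n) using h[r[i]] >= h[r[i-1]] - 1.
    void Height() {
        int k = 0, j;
        go(i, 1, n) r[sa[i]] = i;
        go(i, 1, n) {
            k -= k > 0;
            j = sa[r[i] - 1];  // sa[0] == 0 and s[0] == 0, so the smallest suffix gets h = 0
            while (s[i + k] == s[j + k]) k++;
            h[r[i]] = k;
        }
    }

    int main() {
        while (~scanf("%s", s + 1)) {
            m = 128;  // assumption: the original value was lost; any bound >= 'z' works
            n = strlen(s + 1);
            s[mid = n + 1] = 'a' - 1;  // separator smaller than every lowercase letter
            scanf("%s", s + n + 2);
            n = strlen(s + 1);
            SA();
            Height();
            int Max = 0;
            // Count a neighbouring pair only if its suffixes start in different input strings.
            // Comparing sides directly avoids the int overflow of (mid-sa[i])*(mid-sa[i-1]).
            go(i, 2, n) if (h[i] > Max && (sa[i] < mid) != (sa[i - 1] < mid)) Max = h[i];
            printf("%d\n", Max);
        }
        return 0;
    } // Paul_guderian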

Rice Cake's summary:

This article focuses on finding a good way to understand the whole process of building a suffix array. The suffix array can replace the suffix tree, and in most problems it can replace the suffix automaton as well, so it is a preferred method for this family of problems. In addition, the sa array can be built in O(n) with the DC3 algorithm, but the constant factor, the memory use and the coding complexity all become larger. This article does not cover the classic example problems for suffix arrays; Rice Cake suggests reading Ro's paper on those classic problems to make up for this. I sincerely hope that every OIer who visits gains something, chases the dream step by step, and reaches the peak.

But in Beijing, Shanghai, Guangzhou, Shenzhen, a day at midnight suddenly woke up, like the fate woke up,
It says you can't just go through your whole life ... ———————— "You were a teenager."
