Getting Started with suffix arrays

Last Update:2017-07-03 Source: Internet

Author: User

Basic Introduction:

http://www.nocow.cn/index.php/%E5%90%8E%E7%BC%80%E6%95%B0%E7%BB%84

Applications: compiled from the paper "Suffix Arrays: A Powerful Tool for String Processing"

2.1, the longest common prefix

We first introduce some properties of the suffix array.

Height array: define height[i] as the length of the longest common prefix of suffix(sa[i-1]) and suffix(sa[i]), that is, of the two suffixes that are adjacent in sorted order. For two suffixes j and k, assume without loss of generality that rank[j]<rank[k]. The following property then holds:

The longest common prefix of suffix(j) and suffix(k) is the minimum of height[rank[j]+1], height[rank[j]+2], height[rank[j]+3], ..., height[rank[k]].

For example, for the string "aabaaaab", the longest common prefix of the suffix "abaaaab" and the suffix "aaab" is shown in Figure 4 of the source paper. Here "aaab" has rank 2 and "abaaaab" has rank 6, so the answer is min(height[3], height[4], height[5], height[6]) = min(2, 3, 1, 2) = 1.

So how can the height values be computed efficiently?

If height[2], height[3], ..., height[n] are computed one by one in that order, the worst-case time complexity is O(n^2), because this does not exploit any property of the string. Define h[i]=height[rank[i]], that is, the longest common prefix of suffix(i) and the suffix ranked immediately before it.

The h array has the following property:

h[i] ≥ h[i-1]-1

Proof:

Let suffix(k) be the suffix ranked immediately before suffix(i-1). Their longest common prefix is h[i-1]. Then suffix(k+1) is ranked before suffix(i) (this requires h[i-1]>1; if h[i-1]≤1 the inequality holds trivially), and the longest common prefix of suffix(k+1) and suffix(i) is h[i-1]-1. Every suffix ranked between suffix(k+1) and suffix(i) shares at least that many characters with suffix(i), so the longest common prefix of suffix(i) and the suffix ranked immediately before it is at least h[i-1]-1. Computing h[1], h[2], ..., h[n] in this order and using this property reduces the time complexity to O(n).

Implementation details:

In the implementation there is no need to store the h array; it is enough to compute the values in the order h[1], h[2], ..., h[n]. Code:

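int rank[MAXN], height[MAXN];   /* MAXN: upper bound on n+1, assumed to be defined elsewhere */

/* Assumed convention (as in the source paper): r[0..n-1] is the text and
   r[n]=0 is a sentinel smaller than every character; sa[0]=n is the
   sentinel suffix and sa[1..n] are the real suffixes in sorted order. */
void calheight(int *r, int *sa, int n)
{
    int i, j, k = 0;
    for (i = 1; i <= n; i++) rank[sa[i]] = i;
    /* Visit suffixes in text order; k carries over and drops by at most 1. */
    for (i = 0; i < n; height[rank[i++]] = k)
        for (k ? k-- : 0, j = sa[rank[i] - 1]; r[i + k] == r[j + k]; k++);
    return;
}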

Example 1: Longest common prefix

Given a string, answer queries for the longest common prefix of any two of its suffixes.

Algorithm Analysis:

As shown above, the longest common prefix of two suffixes reduces to the minimum over a range of the height array.

This is an RMQ (Range Minimum Query) problem (if you are unfamiliar with RMQ, please consult other material). With O(n log n) preprocessing, each query can be answered in O(1). So for this problem the preprocessing time is O(n log n) and each query takes O(1). If the RMQ problem is preprocessed in O(n) time, the preprocessing for this problem can also be done in O(n).
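The reduction above can be sketched in C as follows. This is only an illustration under stated assumptions: arrays are 0-based (so height[0] is unused and set to 0), a naive qsort-based suffix sort stands in for a real suffix array construction, and the names MAXN, LOGN, lcp_init and lcp_query are hypothetical. The RMQ structure is a sparse table, which gives the O(n log n) preprocessing and O(1) query mentioned in the text.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 1024                 /* hypothetical bound on the text length */
#define LOGN 11                   /* table levels needed for MAXN (2^10 = 1024) */

static const char *text;          /* text whose suffixes are being sorted */
static int n;                     /* length of text */
static int sa[MAXN], rk[MAXN], height[MAXN];
static int st[LOGN][MAXN];        /* st[p][i] = min(height[i .. i + 2^p - 1]) */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* Build sa, rk, height and the sparse table for s (strlen(s) <= MAXN assumed). */
void lcp_init(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);   /* naive suffix sort */
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {              /* O(n) height pass */
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
    for (int i = 0; i < n; i++) st[0][i] = height[i];
    for (int p = 1; (1 << p) <= n; p++)
        for (int i = 0; i + (1 << p) <= n; i++)
            st[p][i] = min_int(st[p - 1][i], st[p - 1][i + (1 << (p - 1))]);
}

/* Longest common prefix of the suffixes starting at 0-based positions a and b. */
int lcp_query(int a, int b) {
    if (a == b) return n - a;
    int lo = rk[a], hi = rk[b];
    if (lo > hi) { int t = lo; lo = hi; hi = t; }
    lo++;                         /* the range starts one past the smaller rank */
    int p = 0;
    while ((1 << (p + 1)) <= hi - lo + 1) p++;
    return min_int(st[p][lo], st[p][hi - (1 << p) + 1]);
}
```

With the string "aabaaaab", the suffixes "abaaaab" and "aaab" start at 0-based positions 1 and 4, and lcp_query(1, 4) returns 1.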

2.2, problems on a single string

A common approach to this type of problem is to first compute the suffix array and the height array, and then solve the problem using the height array.

2.2.1, repeated substrings

Repeated substring: if a string R occurs at least twice in a string L, R is called a repeated substring of L.

Example 2: Longest repeated substring (overlap allowed)

Given a string, find the longest repeated substring. The two occurrences may overlap.

Algorithm Analysis:

This problem is a simple application of the suffix array. The answer is just the maximum value in the height array.

Finding the longest repeated substring is equivalent to finding the maximum longest common prefix over all pairs of suffixes.

The longest common prefix of any two suffixes is the minimum over a segment of the height array, so it cannot exceed the maximum value in the height array. Hence the length of the longest repeated substring is the maximum value in the height array. The time complexity of this step is O(n).
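A minimal C sketch of this observation, under the following assumptions: 0-based arrays, a naive qsort-based suffix sort in place of a real construction (so the whole sketch is not O(n)), and hypothetical names MAXN, build and longest_repeat.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound on the text length */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* Length of the longest substring occurring at least twice (overlap allowed). */
int longest_repeat(const char *s) {
    build(s);
    int best = 0;
    for (int i = 1; i < n; i++)
        if (height[i] > best) best = height[i];
    return best;
}
```

For "banana" this returns 3, corresponding to "ana", whose two occurrences overlap.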

Example 3: Longest non-overlapping repeated substring (pku1743)

Given a string, find the longest repeated substring whose two occurrences do not overlap.

Algorithm Analysis:

This problem is slightly harder than the previous one. First binary search on the answer, which turns the problem into a decision problem: are there two identical substrings of length k that do not overlap?

The key to solving it is again the height array. Split the sorted suffixes into groups so that within each group every height value between adjacent suffixes is at least k. For example, for the string "aabaaaab" and k=2 the suffixes are divided into 4 groups, as shown in Figure 5 of the source paper.

Clearly, two suffixes whose longest common prefix is at least k must lie in the same group.

So for each group it suffices to check whether the difference between the maximum and minimum sa values in the group is at least k.

If some group satisfies this, such substrings exist; otherwise they do not. The time complexity of the whole procedure is O(n log n). Grouping suffixes by height value, as done here, is a frequently used technique (it reappears in several later examples), so it is worth studying carefully.
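The binary search and the grouping check described above might look like this in C. It is a sketch under assumptions: 0-based arrays, a naive qsort-based suffix sort instead of a real construction, and hypothetical names MAXN, build, has_disjoint_repeat and longest_disjoint_repeat. It works on a plain string and does not include the input transformation that pku1743 itself needs.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound on the text length */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* Decision step: is there a group (adjacent heights >= k) whose largest and
   smallest start positions are at least k apart? */
static int has_disjoint_repeat(int k) {
    int mn = sa[0], mx = sa[0];
    for (int i = 1; i < n; i++) {
        if (height[i] >= k) {             /* sa[i] joins the current group */
            if (sa[i] < mn) mn = sa[i];
            if (sa[i] > mx) mx = sa[i];
            if (mx - mn >= k) return 1;
        } else {                          /* sa[i] starts a new group */
            mn = mx = sa[i];
        }
    }
    return 0;
}

/* Length of the longest substring with two non-overlapping occurrences. */
int longest_disjoint_repeat(const char *s) {
    build(s);
    int lo = 0, hi = n / 2;               /* length 0 is always feasible */
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (has_disjoint_repeat(mid)) lo = mid; else hi = mid - 1;
    }
    return lo;
}
```

For "banana" the result is 2 ("an" or "na"), because the two occurrences of "ana" overlap; for "aabaaaab" it is 3 ("aab").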

Example 4: Longest substring repeated at least k times, overlap allowed (pku3261)

Given a string, find the longest substring that occurs at least k times. The k occurrences may overlap.

Algorithm Analysis:

This is almost the same as the previous problem: binary search on the answer, then split the suffixes into groups. The difference is that the check is now whether some group contains at least k suffixes. If so, there are k identical substrings satisfying the condition; otherwise there are none. The time complexity of this procedure is O(n log n).
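A C sketch of this variant follows, with the same caveats as before: 0-based arrays, a naive qsort-based suffix sort standing in for a real construction, and hypothetical names MAXN, build, has_k_repeat and longest_k_repeat. The parameter k is assumed to be at least 2; the sketch treats k <= 1 as a trivial case.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound on the text length */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* Decision step: does some group (adjacent heights >= len) hold >= k suffixes? */
static int has_k_repeat(int len, int k) {
    int count = 1;                        /* size of the current group */
    for (int i = 1; i < n; i++) {
        if (height[i] >= len) {
            if (++count >= k) return 1;
        } else {
            count = 1;
        }
    }
    return 0;
}

/* Length of the longest substring occurring at least k times (k >= 2 assumed). */
int longest_k_repeat(const char *s, int k) {
    build(s);
    if (k <= 1) return n;                 /* the whole string occurs once */
    int lo = 0, hi = n > 0 ? n - 1 : 0;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (has_k_repeat(mid, k)) lo = mid; else hi = mid - 1;
    }
    return lo;
}
```

For "banana", k=2 gives 3 ("ana") and k=3 gives 1 ("a").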

2.2.2, number of substrings

Example 5: Number of distinct substrings (spoj694, spoj705)

Given a string, count its distinct substrings.

Algorithm Analysis:

Every substring is a prefix of some suffix, so the original problem is equivalent to counting the distinct prefixes over all suffixes.

Add the suffixes in the order suffix(sa[1]), suffix(sa[2]), suffix(sa[3]), ..., suffix(sa[n]). Each newly added suffix(sa[k]) has n-sa[k]+1 prefixes, of which height[k] coincide with prefixes of the preceding suffixes. So suffix(sa[k]) contributes n-sa[k]+1-height[k] new distinct substrings, and the answer to the original problem is the sum of these contributions. The time complexity of this procedure is O(n).
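The summation can be sketched in C as below. With 0-based positions the contribution of the suffix at rank i becomes n-sa[i]-height[i]. As in the other sketches, the qsort-based suffix sort is only a stand-in for a real construction, and MAXN, build and count_distinct are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound on the text length */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* Number of distinct non-empty substrings of s. */
long long count_distinct(const char *s) {
    build(s);
    long long total = 0;
    for (int i = 0; i < n; i++)
        total += (n - sa[i]) - height[i];  /* prefixes not seen at earlier ranks */
    return total;
}
```

For "banana" the height values sum to 6, so the count is 21-6=15.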

2.2.3, palindromic substrings

Palindromic substring: if a substring R of a string L reads the same when written in reverse, R is called a palindromic substring of L.

Example 6: Longest palindromic substring (ural1297)

Given a string, find its longest palindromic substring.

Algorithm Analysis:

Enumerate every position and compute the longest palindromic substring centered there. There are two cases: the palindrome has odd length or even length. Both cases reduce to the longest common prefix of a suffix of the string and a suffix of the reversed string. Concretely, append the reversed string to the original string, separated by a special character.

This turns the problem into finding the longest common prefix of two suffixes of the new string, as shown in Figure 6 of the source paper.

The time complexity of this procedure is O(n log n). If the RMQ problem is preprocessed in O(n) time, the time complexity can be reduced to O(n).
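The construction can be sketched in C as follows. Several simplifications are assumptions of this sketch, not part of the method above: the separator is '#' and must not occur in the input, the suffix sort is a naive qsort, and each longest common prefix is found by a plain scan over the height range instead of an RMQ structure, so the sketch is quadratic. The names MAXN, build, lcp_scan and longest_palindrome are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound; needs 2*strlen(s)+1 < MAXN */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];
static char buf[MAXN];            /* s + '#' + reverse(s) */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* LCP of two different suffixes a and b: minimum of the height range. */
static int lcp_scan(int a, int b) {
    int lo = rk[a], hi = rk[b];
    if (lo > hi) { int t = lo; lo = hi; hi = t; }
    int best = height[lo + 1];
    for (int r = lo + 2; r <= hi; r++)
        if (height[r] < best) best = height[r];
    return best;
}

/* Length of the longest palindromic substring of s ('#' must not occur in s). */
int longest_palindrome(const char *s) {
    int len = (int)strlen(s);
    if (len == 0) return 0;
    memcpy(buf, s, (size_t)len);
    buf[len] = '#';
    for (int i = 0; i < len; i++) buf[len + 1 + i] = s[len - 1 - i];
    buf[2 * len + 1] = '\0';
    build(buf);
    int best = 1;
    for (int i = 0; i < len; i++) {
        /* s[i] sits at index 2*len - i inside the reversed copy */
        int odd = 2 * lcp_scan(i, 2 * len - i) - 1;       /* centered on s[i] */
        if (odd > best) best = odd;
        if (i > 0) {
            int even = 2 * lcp_scan(i, 2 * len - i + 1);  /* between s[i-1], s[i] */
            if (even > best) best = even;
        }
    }
    return best;
}
```

For "banana" the odd case centered on index 3 finds a common prefix of length 3, giving the palindrome "anana" of length 5.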

2.2.4, continuous repeated substrings

Continuous repeated string: if a string L is obtained by repeating a string S R times, L is called a continuous repeated string, and R is its number of repetitions.

Example 7: Continuous repeated substring (pku2406)

Given a string L that is known to be obtained by repeating some string S R times, find the maximum value of R.

Algorithm Analysis:

The approach is simple: enumerate the length k of the string S and check whether it works.

To check, test whether the length of L is divisible by k and whether the longest common prefix of suffix(1) and suffix(k+1) equals n-k. Since suffix(1) is fixed in every longest common prefix query, full RMQ preprocessing is unnecessary. It is enough to compute, for each position in the height array, the minimum of the values between that position and height[rank[1]]. The time complexity of the whole procedure is O(n).
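A C sketch of this check follows. It is 0-based, so suffix(1) and suffix(k+1) of the text become the suffixes at positions 0 and k. The array lcp0 (a hypothetical name, like MAXN, build and max_repetitions) stores the longest common prefix of every suffix with suffix 0, filled by two scans outward from rk[0]. The naive qsort suffix sort means the sketch as a whole is not O(n).

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound on the text length */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];
static int lcp0[MAXN];            /* lcp0[p] = LCP of suffix p and suffix 0 */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* Largest R such that s is some string repeated R times (0 for empty input). */
int max_repetitions(const char *s) {
    build(s);
    if (n == 0) return 0;
    int cur = n;
    for (int r = rk[0] + 1; r < n; r++) {  /* suffixes ranked after suffix 0 */
        if (height[r] < cur) cur = height[r];
        lcp0[sa[r]] = cur;
    }
    cur = n;
    for (int r = rk[0]; r >= 1; r--) {     /* suffixes ranked before suffix 0 */
        if (height[r] < cur) cur = height[r];
        lcp0[sa[r - 1]] = cur;
    }
    for (int k = 1; k < n; k++)            /* the smallest valid k gives the largest R */
        if (n % k == 0 && lcp0[k] == n - k) return n / k;
    return 1;
}
```

For "ababab" the first valid length is k=2, since the suffix at position 2 shares 4=6-2 characters with the whole string, so the result is 3.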

Example 8: Continuous repeated substring with the most repetitions (spoj687, pku3693)

Given a string, find the continuous repeated substring with the largest number of repetitions.

Algorithm Analysis:

First enumerate the length L, then determine the maximum number of consecutive repetitions that a substring of length L can achieve. One repetition is always possible, so only the case of at least 2 repetitions needs to be considered.

Suppose a substring of length L repeats at least twice consecutively in the original string, and call the repeated region S. Then S must contain two adjacent characters among r[0], r[L], r[L*2], r[L*3], .... So it suffices to look at each pair r[L*i] and r[L*(i+1)] and see how far the match between them extends forward and backward. If the total matched length is K, the substring repeats K/L+1 times here.

Finally take the maximum over all cases, as shown in Figure 7 of the source paper.

There are n lengths L to enumerate, and each one takes O(n/L) time. Therefore the time complexity of the entire procedure is O(n/1+n/2+n/3+...+n/n) = O(n log n).

2.3, problems on two strings

A common approach to this type of problem is to concatenate the two strings first, then compute the suffix array and the height array, and then use the height array to solve the problem.

2.3.1, common substrings

Common substring: if a string L occurs in both string A and string B, L is called a common substring of A and B.

Example 9: Longest common substring (pku2774, ural1517)

Given two strings A and B, find their longest common substring.

Algorithm Analysis:

Every substring of a string is a prefix of one of its suffixes. So the longest common substring of A and B equals the maximum longest common prefix between a suffix of A and a suffix of B.

Enumerating all pairs of suffixes of A and B is obviously inefficient. Since what is needed is the longest common prefix between suffixes of A and suffixes of B, first append the second string to the first, separated by a character that occurs in neither, and compute the suffix array of the new string. Then look for a pattern in that suffix array.

Take A="aaaba", B="abaa" as an example, as shown in Figure 8 of the source paper.

So is the maximum over all height values the answer? Not necessarily! The two suffixes may come from the same original string, so height[i] qualifies only when suffix(sa[i-1]) and suffix(sa[i]) come from different strings. The largest qualifying value is the answer. Let |A| and |B| be the lengths of A and B. Computing the suffix array and the height array of the new string takes O(|A|+|B|) time, and so does finding the maximum height value over adjacent suffixes that come from different strings. So the time complexity of the whole procedure is O(|A|+|B|). This matches the lower bound, so it is a very good algorithm.
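The adjacent-pair scan can be sketched in C as below. The assumptions: '#' is used as the separator and occurs in neither input, the suffix sort is a naive qsort rather than a linear-time construction, and MAXN, build and longest_common_substring are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound; needs |A|+|B|+1 < MAXN */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];
static char buf[MAXN];            /* A + '#' + B */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* Length of the longest common substring of a and b ('#' occurs in neither). */
int longest_common_substring(const char *a, const char *b) {
    int la = (int)strlen(a), lb = (int)strlen(b);
    memcpy(buf, a, (size_t)la);
    buf[la] = '#';
    memcpy(buf + la + 1, b, (size_t)lb + 1);   /* copies the terminator too */
    build(buf);
    int best = 0;
    for (int i = 1; i < n; i++) {
        int prev_in_a = sa[i - 1] < la;        /* the '#' suffix counts as not-A; */
        int cur_in_a = sa[i] < la;             /* its height values are 0 anyway  */
        if (prev_in_a != cur_in_a && height[i] > best) best = height[i];
    }
    return best;
}
```

For A="aaaba" and B="abaa" the result is 3, corresponding to "aba".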

2.3.2, number of substrings

Example 10: Number of common substrings of length at least k (pku3415)

Given two strings A and B, count the common substrings of length at least k (identical substrings at different positions are counted separately).

Sample 1:

A="xx", B="xx", k=1: the number of common substrings of length at least k is 5.

Sample 2:

A="aababaa", B="abaabaa", k=2: the number of common substrings of length at least k is 22.

Algorithm Analysis:

The basic idea is to compute the longest common prefix of every suffix of A with every suffix of B, and for each pair whose longest common prefix has length at least k, add the part that counts, namely that length minus k plus 1.

First concatenate the two strings, separated by a character that occurs in neither. After grouping by height value, the remaining task is to quickly sum the longest common prefixes between the suffixes within each group.

Scan each group once. Whenever a suffix of B is encountered, count how many common substrings of length at least k it forms with the preceding suffixes of A; the suffixes of A are maintained efficiently with a monotonic stack. Then repeat the scan with the roles of A and B swapped.

The details are left to the reader.
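One possible way to fill in those details is sketched below in C. It is not necessarily the solution the original author had in mind. The stack holds pairs (value, count): count earlier suffixes whose longest common prefix with the current position is value, and total is the running sum of (value-k+1)*count. Assumptions: k is at least 1, '#' occurs in neither string, the suffix sort is a naive qsort, and the names MAXN, build, side, pass and count_common are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound; needs |A|+|B|+1 < MAXN */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];
static char buf[MAXN];            /* A + '#' + B */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* 0 for a suffix of A, 1 for a suffix of B, -1 for the separator suffix. */
static int side(int pos, int la) { return pos < la ? 0 : (pos > la ? 1 : -1); }

/* Sum over pairs where a suffix of side `earlier` is ranked before a suffix
   of the other side, of max(0, lcp - k + 1). */
static long long pass(int la, int k, int earlier) {
    static int val[MAXN], cnt[MAXN];      /* monotonic stack, values increasing */
    int top = 0;
    long long total = 0, ans = 0;
    for (int i = 1; i < n; i++) {
        if (height[i] < k) { top = 0; total = 0; continue; }   /* new group */
        int c = 0;
        if (side(sa[i - 1], la) == earlier) { c = 1; total += height[i] - k + 1; }
        while (top > 0 && val[top - 1] >= height[i]) {         /* cap older LCPs */
            top--;
            total -= (long long)(val[top] - height[i]) * cnt[top];
            c += cnt[top];
        }
        val[top] = height[i]; cnt[top] = c; top++;
        if (side(sa[i], la) == 1 - earlier) ans += total;
    }
    return ans;
}

/* Number of common substrings of a and b with length >= k, by position. */
long long count_common(const char *a, const char *b, int k) {
    int la = (int)strlen(a), lb = (int)strlen(b);
    memcpy(buf, a, (size_t)la);
    buf[la] = '#';
    memcpy(buf + la + 1, b, (size_t)lb + 1);
    build(buf);
    return pass(la, k, 0) + pass(la, k, 1);
}
```

On the two samples above, count_common("xx", "xx", 1) gives 5 and count_common("aababaa", "abaabaa", 2) gives 22.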

2.4, problems on multiple strings

A common approach to this type of problem is to concatenate all the strings first, compute the suffix array and the height array, and then use the height array to solve the problem.

This may require binary search on the answer.

Example 11: Longest substring occurring in at least k strings (pku3294)

Given n strings, find the longest substring that occurs in at least k of them.

Algorithm Analysis:

Concatenate the n strings, separating them with pairwise distinct characters that occur in none of the strings, and compute the suffix array. Then binary search on the answer: as in Example 3, split the suffixes into groups and check whether the suffixes of some group come from at least k of the original strings.

The time complexity of this procedure is O(n log n).
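A C sketch of this grouping check follows. The assumptions are specific to the sketch: the separators are the bytes 1, 2, 3, ..., so every input character must be larger than the number of strings; k is at least 2; the number of strings is at most MAXS; and the suffix sort is a naive qsort. The names MAXN, MAXS, build, owner, feasible and longest_in_k_strings are hypothetical, and the sketch returns only the length of the answer.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 2048                 /* hypothetical bound on the joined length */
#define MAXS 16                   /* hypothetical bound on the number of strings */

static const char *text;
static int n;
static int sa[MAXN], rk[MAXN], height[MAXN];
static char buf[MAXN];            /* strings joined by separator bytes 1, 2, ... */
static int owner[MAXN];           /* index of the source string, -1 for separators */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Naive suffix sort plus the O(n) height pass; 0-based, height[0] = 0. */
static void build(const char *s) {
    text = s;
    n = (int)strlen(s);
    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    for (int i = 0; i < n; i++) rk[sa[i]] = i;
    height[0] = 0;
    for (int i = 0, k = 0; i < n; i++) {
        if (rk[i] == 0) { k = 0; continue; }
        int j = sa[rk[i] - 1];
        while (s[i + k] != '\0' && s[i + k] == s[j + k]) k++;
        height[rk[i]] = k;
        if (k > 0) k--;
    }
}

/* Does some group (adjacent heights >= len) touch at least k source strings? */
static int feasible(int len, int k, int cnt) {
    int stamp[MAXS];              /* last group in which each string was seen */
    int group = 0, distinct = 0;
    for (int j = 0; j < cnt; j++) stamp[j] = -1;
    for (int i = 0; i < n; i++) {
        if (i == 0 || height[i] < len) { group++; distinct = 0; }
        int o = owner[sa[i]];
        if (o >= 0 && stamp[o] != group) {
            stamp[o] = group;
            if (++distinct >= k) return 1;
        }
    }
    return 0;
}

/* Length of the longest substring found in at least k of the cnt strings. */
int longest_in_k_strings(const char **strs, int cnt, int k) {
    int m = 0, longest = 0;
    for (int j = 0; j < cnt; j++) {
        int len = (int)strlen(strs[j]);
        if (len > longest) longest = len;
        for (int i = 0; i < len; i++) { owner[m] = j; buf[m++] = strs[j][i]; }
        owner[m] = -1;
        buf[m++] = (char)(j + 1);          /* unique separator after string j */
    }
    buf[m] = '\0';
    build(buf);
    int lo = 0, hi = longest;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (feasible(mid, k, cnt)) lo = mid; else hi = mid - 1;
    }
    return lo;
}
```

For the strings "abcd", "bcde" and "cdef", the result is 2 with k=3 ("cd") and 3 with k=2 ("bcd" or "cde").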

Example 12: Longest substring occurring at least twice without overlap in every string (spoj220)

Given n strings, find the longest substring that occurs at least twice, without overlapping, in each of them.

Algorithm Analysis:

The approach resembles the previous problem: first concatenate the n strings, separated by pairwise distinct characters that occur in none of the strings, and compute the suffix array.

Then binary search on the answer and group the suffixes. For the check, see whether some group contains at least two suffixes from each of the original strings such that, within each original string, the difference between the maximum and minimum starting positions of those suffixes is at least the current answer (this checks that the occurrences do not overlap; if the problem has no non-overlap requirement, skip this check).

The time complexity of this procedure is O(n log n).

Example 13: Longest substring occurring, possibly reversed, in every string (PKU3294)

Given n strings, find the longest substring that occurs in every string, either as is or reversed.

Algorithm Analysis:

This problem differs from the previous one only in that occurrences in the reversed string must also be checked. In fact, this does not make the problem harder.

Just write each string in reverse after the original, separated by a character that is distinct from all others and occurs in none of the strings. Then concatenate all n strings, again separated by pairwise distinct characters that occur in none of the strings, and compute the suffix array. Then binary search on the answer and group the suffixes.

For the check, see whether some group has a suffix in each original string or in its reversed copy. The time complexity of this procedure is O(n log n).
