Data Structures: Suffix Arrays


1. Overview

A suffix array is a powerful tool for solving string problems. It is easier to implement and consumes less memory than a suffix tree. In practical applications, suffix arrays are often used to solve complex problems related to strings.

Most of this article is excerpted from the references listed in section 6.

2. Suffix array

2.1 Several concepts

(1) The suffix array sa is a one-dimensional array that holds a permutation sa[1], sa[2], ..., sa[n] of 1..n such that suffix(sa[i]) < suffix(sa[i+1]) for 1 ≤ i < n. That is, the n suffixes of s are sorted in ascending order, and their starting positions are stored in sa in that order. Here suffix(i) denotes the substring s[i..n], the suffix of the string s that starts at the i-th character.

(2) Rank array: rank[i] stores the rank of suffix(i) among all suffixes sorted in ascending order.

Simply put, the suffix array answers "who is ranked i-th?" and the rank array answers "what is your rank?". It is easy to see that the two arrays are inverses of each other: sa[rank[i]] = i and rank[sa[i]] = i.

(3) Height array: height[i] is defined as the length of the longest common prefix of suffix(sa[i-1]) and suffix(sa[i]), that is, of the two suffixes with adjacent ranks.

(4) h[i] = height[rank[i]], the length of the longest common prefix of suffix(i) and the suffix ranked immediately before it.

(5) LCP(i, j): for integers i and j between 1 and n, define LCP(i, j) = lcp(suffix(sa[i]), suffix(sa[j])), the length of the longest common prefix of the i-th and j-th suffixes in the suffix array. Here, for two strings u and v, lcp(u, v) = max{k | the first k characters of u and v are equal}: compare u and v character by character from the start, and lcp(u, v) is the number of consecutive equal characters, called the length of the longest common prefix of the two strings.
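The definitions above can be checked on a small example. The following is a minimal sketch, not an efficient construction: it sorts the suffixes directly, and it uses 0-based positions and ranks rather than the 1-based ones used in the text. The function name `build_naive` is chosen here for illustration.

```python
def build_naive(s):
    """Return (sa, rank, height) using 0-based positions and ranks."""
    n = len(s)
    # sa[r] = start position of the suffix with rank r (naive sort of all suffixes).
    sa = sorted(range(n), key=lambda i: s[i:])
    # rank is the inverse permutation of sa.
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    # height[r] = LCP length of the suffixes ranked r-1 and r; height[0] = 0.
    height = [0] * n
    for r in range(1, n):
        a, b, k = sa[r - 1], sa[r], 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        height[r] = k
    return sa, rank, height


sa, rank, height = build_naive("aabaaaab")
print(sa)      # [3, 4, 5, 0, 6, 1, 7, 2]
print(rank)    # [3, 5, 7, 0, 1, 2, 4, 6]
print(height)  # [0, 3, 2, 3, 1, 2, 0, 1]
```

The array h from definition (4) is then `[height[rank[i]] for i in range(len(s))]`.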

2.2 Several properties

(1) For i < j, LCP(i, j) = min{height[k] | i+1 ≤ k ≤ j}. That is, computing LCP(i, j) is equivalent to finding the minimum of the elements of the one-dimensional array height in the range i+1 to j.

Proof omitted.
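Property (1) can be verified by brute force on a small string. This is only a sketch with 0-based ranks; the helper names `common_prefix_len`, `suffix_arrays` and `lcp_of_ranks` are assumptions made for the example, and a real implementation would answer the range minimum with a sparse table or similar structure instead of `min` over a slice.

```python
def common_prefix_len(u, v):
    """Length of the longest common prefix of two strings."""
    k = 0
    while k < min(len(u), len(v)) and u[k] == v[k]:
        k += 1
    return k


def suffix_arrays(s):
    """Naive sa and height (0-based), for illustration only."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])
    height = [0] * n
    for r in range(1, n):
        height[r] = common_prefix_len(s[sa[r - 1]:], s[sa[r]:])
    return sa, height


def lcp_of_ranks(height, i, j):
    """LCP of the suffixes ranked i and j (i < j) as a range minimum of height."""
    return min(height[i + 1 : j + 1])


s = "aabaaaab"
sa, height = suffix_arrays(s)
# Compare the range minimum with a direct character comparison for every pair.
for i in range(len(s)):
    for j in range(i + 1, len(s)):
        assert lcp_of_ranks(height, i, j) == common_prefix_len(s[sa[i]:], s[sa[j]:])
print(lcp_of_ranks(height, 0, 3))  # 2
```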

(2) For i > 1 and rank[i] > 1, h[i] ≥ h[i-1] - 1 always holds.

Proof: Let suffix(k) be the suffix ranked immediately before suffix(i-1); their longest common prefix has length h[i-1]. Then suffix(k+1) is ranked before suffix(i) (this requires h[i-1] > 1; if h[i-1] ≤ 1 the inequality holds trivially), and the longest common prefix of suffix(k+1) and suffix(i) has length h[i-1] - 1. By property (1), the suffix ranked immediately before suffix(i) shares at least as long a prefix with suffix(i) as any earlier-ranked suffix does, so h[i] is at least h[i-1] - 1.

By computing h[1], h[2], ..., h[n] in this order and using this property of the h array, the time complexity can be reduced to O(n).

3. Suffix Array implementation

This section gives efficient algorithms for computing sa, rank, height and h.

(1) Computing the rank array rank and the suffix array sa

Using the prefix doubling algorithm, the rank array is computed first, and the suffix array sa is then obtained from it in O(n) time. In prefix doubling, the substrings of length 2^k starting at each character are sorted and their rank values are computed. k starts at 0 and increases by 1 each round. Once 2^k exceeds n, the substring of length 2^k starting at each character is the same as the corresponding suffix. These substrings are then all distinct, so no two rank values are equal, and the rank values are the final result. Each round uses the rank values of the substrings of length 2^(k-1) from the previous round: a substring of length 2^k is represented by a pair of keys, the ranks of its two halves of length 2^(k-1), and a radix sort on these pairs gives the rank values of the substrings of length 2^k. Take the string "aabaaaab" as an example: in each round, x and y denote the two keys that represent a substring of length 2^k.
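The doubling rounds described above can be sketched as follows. This is an illustrative version under stated assumptions, not the reference implementation from the cited papers: it uses 0-based positions, it uses Python's built-in comparison sort in place of the radix sort (so each round costs O(n log n) rather than O(n)), and it stops as soon as all ranks are distinct. The name `suffix_array_doubling` is chosen here.

```python
def suffix_array_doubling(s):
    """Prefix doubling; returns (sa, rank) with 0-based positions and ranks."""
    n = len(s)
    if n == 0:
        return [], []
    rank = [ord(c) for c in s]  # ranks of substrings of length 1
    sa = list(range(n))
    half = 1                    # current ranks describe substrings of this length
    while True:
        # Key pair (x, y): rank of the first half and rank of the second half.
        # y is -1 when the second half starts past the end of the string.
        def key(i):
            return (rank[i], rank[i + half] if i + half < n else -1)

        sa.sort(key=key)  # comparison sort standing in for the radix sort
        # Re-rank: suffixes with equal key pairs share the same rank.
        new_rank = [0] * n
        for r in range(1, n):
            new_rank[sa[r]] = new_rank[sa[r - 1]] + (key(sa[r]) != key(sa[r - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct, so the order is final
            break
        half *= 2
    return sa, rank


print(suffix_array_doubling("aabaaaab")[0])  # [3, 4, 5, 0, 6, 1, 7, 2]
```

Replacing the sort with a two-pass counting sort on (x, y) gives the O(n log n) total time that the text refers to.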

(2) Computing the array h

Let i loop from 1 to n and compute h[i] as follows:

If rank[i] = 1, then h[i] = 0. The number of character comparisons is 0.

If i = 1 or h[i-1] ≤ 1, compare suffix(i) and suffix(sa[rank[i]-1]) directly from the first character until the characters differ, which gives h[i]. The number of character comparisons is h[i] + 1, which is at most h[i] - h[i-1] + 2.

Otherwise i > 1, rank[i] > 1 and h[i-1] > 1. By property (2), suffix(i) and suffix(sa[rank[i]-1]) agree on at least their first h[i-1] - 1 characters, so the comparison can start from character h[i-1] and continue until the characters differ, which gives h[i]. The number of character comparisons is h[i] - h[i-1] + 2.

Summing these bounds over all i, the h[i] - h[i-1] terms telescope, so the total number of comparisons is O(n) and the whole algorithm runs in O(n) time.
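The three cases above collapse into a short loop. The sketch below assumes 0-based positions and ranks and a suffix array `sa` that has already been computed; the function name `height_array` is chosen here. The variable `h` plays the role of h[i-1] carried from one position to the next.

```python
def height_array(s, sa):
    """Compute height from s and its suffix array in O(n), 0-based."""
    n = len(s)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    height = [0] * n
    h = 0  # value of h for the previous position
    for i in range(n):
        if rank[i] == 0:
            h = 0          # the first-ranked suffix has no predecessor
            continue
        j = sa[rank[i] - 1]  # suffix ranked immediately before suffix(i)
        if h > 0:
            h -= 1         # property (2): the first h-1 characters already match
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        height[rank[i]] = h
    return height


print(height_array("aabaaaab", [3, 4, 5, 0, 6, 1, 7, 2]))  # [0, 3, 2, 3, 1, 2, 0, 1]
```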

4. Suffix Array application

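4.1 Problems on a single string

(1) Longest repeated substring, overlap allowed. Given a string, find the longest substring that occurs at least twice; the two occurrences may overlap.

Analysis: the answer is simply the maximum value in the height array. Any repeated substring is a common prefix of two suffixes, and by property (1) the longest common prefix of any two suffixes is a minimum of height values, so it cannot exceed the maximum of height.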

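As a minimal sketch of problem (1), the function below returns one longest repeated substring by taking the maximum adjacent-suffix LCP. It sorts suffixes naively for brevity and uses 0-based positions; the name `longest_repeated_substring` is chosen here.

```python
def longest_repeated_substring(s):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix sort, illustration only
    best_len, best_pos = 0, 0
    for r in range(1, n):
        a, b, k = sa[r - 1], sa[r], 0
        # k becomes height[r], the LCP of the suffixes ranked r-1 and r.
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        if k > best_len:
            best_len, best_pos = k, b
    return s[best_pos:best_pos + best_len]


print(longest_repeated_substring("aabaaaab"))  # aaa
print(longest_repeated_substring("banana"))    # ana
```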
(2) Longest repeated substring, no overlap. Given a string, find the longest substring that occurs at least twice, where the two occurrences must not overlap.

Analysis: binary search on the answer first, which turns the problem into a decision problem: determine whether there are two identical substrings of length k that do not overlap. The key to solving this is again the height array. Divide the sorted suffixes into groups such that the height values between adjacent suffixes within each group are all at least k. For example, for the string "aabaaaab" and k = 2, the suffixes fall into 4 groups: {aaaab, aaab, aab, aabaaaab}, {ab, abaaaab}, {b} and {baaaab}.

It is easy to see that two suffixes whose longest common prefix is at least k must lie in the same group. So for each group it suffices to check whether the difference between the maximum and minimum sa values of its suffixes is at least k. If some group satisfies this, such substrings exist; otherwise they do not. The time complexity of the whole procedure is O(n log n).
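The binary search and grouping can be sketched as follows. The helper names `build`, `has_nonoverlapping_repeat` and `longest_nonoverlapping_repeat` are assumptions made for this example, positions are 0-based, and the suffix array is built naively, so only the decision step reflects the O(n) scan described above.

```python
def build(s):
    """Naive suffix array and height array (0-based), for illustration only."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])
    height = [0] * n
    for r in range(1, n):
        a, b, k = sa[r - 1], sa[r], 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        height[r] = k
    return sa, height


def has_nonoverlapping_repeat(sa, height, k):
    """Is there a group (adjacent heights >= k) whose sa values span at least k?"""
    lo = hi = sa[0]
    for r in range(1, len(sa)):
        if height[r] >= k:
            lo, hi = min(lo, sa[r]), max(hi, sa[r])
            if hi - lo >= k:
                return True
        else:
            lo = hi = sa[r]  # a new group starts here
    return False


def longest_nonoverlapping_repeat(s):
    """Length of the longest substring with two non-overlapping occurrences."""
    sa, height = build(s)
    low, high = 0, len(s) // 2
    while low < high:  # binary search on the answer
        mid = (low + high + 1) // 2
        if has_nonoverlapping_repeat(sa, height, mid):
            low = mid
        else:
            high = mid - 1
    return low


print(longest_nonoverlapping_repeat("aabaaaab"))  # 3
```

For "aabaaaab" the answer 3 corresponds to "aab" at positions 0 and 5.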

(3) Longest substring repeated at least k times, overlap allowed. Given a string, find the longest substring that occurs at least k times; the occurrences may overlap.

Analysis: binary search on the answer first, then divide the suffixes into groups as before. The difference is that here we check whether some group contains at least k suffixes. If so, there are k identical substrings that satisfy the condition; otherwise there are none. The time complexity of this procedure is O(n log n).

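A sketch of problem (3), under the assumptions that k ≥ 2 and that a naive suffix sort is acceptable for illustration. The names `build` and `longest_repeated_k_times` are chosen here.

```python
def build(s):
    """Naive suffix array and height array (0-based), for illustration only."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])
    height = [0] * n
    for r in range(1, n):
        a, b, k = sa[r - 1], sa[r], 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        height[r] = k
    return sa, height


def longest_repeated_k_times(s, k):
    """Length of the longest substring occurring at least k times (k >= 2)."""
    _, height = build(s)

    def some_group_has_k(length):
        size = 1  # number of suffixes in the current group
        for r in range(1, len(height)):
            size = size + 1 if height[r] >= length else 1
            if size >= k:
                return True
        return False

    low, high = 0, max(len(s) - 1, 0)
    while low < high:  # binary search on the answer
        mid = (low + high + 1) // 2
        if some_group_has_k(mid):
            low = mid
        else:
            high = mid - 1
    return low


print(longest_repeated_k_times("aabaaaab", 2))  # 3
print(longest_repeated_k_times("aabaaaab", 3))  # 2
```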
(4) Longest palindromic substring. Given a string, find its longest palindromic substring.

Analysis: write the whole string reversed after the original string, separated by a special character. The problem then becomes finding the longest common prefix of certain pairs of suffixes of the new string: for each candidate center, one suffix starts in the original string and the other starts at the mirrored position in the reversed copy.

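The idea can be sketched as below. Several details are assumptions of this example rather than something the text specifies: the separator is "\x00" and is assumed not to occur in the input, odd and even centers are handled separately, LCP queries use a sparse table over height, and the suffix array is built naively. The name `longest_palindrome` is chosen here.

```python
def longest_palindrome(s):
    """Longest palindromic substring via LCP queries on s + sep + reversed(s)."""
    n = len(s)
    if n == 0:
        return ""
    t = s + "\x00" + s[::-1]  # separator assumed absent from s
    m = len(t)
    sa = sorted(range(m), key=lambda i: t[i:])  # naive sort, illustration only
    rank = [0] * m
    for r, p in enumerate(sa):
        rank[p] = r
    height = [0] * m
    for r in range(1, m):
        a, b, k = sa[r - 1], sa[r], 0
        while a + k < m and b + k < m and t[a + k] == t[b + k]:
            k += 1
        height[r] = k
    # Sparse table: table[j][i] = min(height[i .. i + 2^j - 1]).
    table = [height]
    j = 1
    while (1 << j) <= m:
        prev, half = table[-1], 1 << (j - 1)
        table.append([min(prev[i], prev[i + half]) for i in range(m - (1 << j) + 1)])
        j += 1

    def lcp(p, q):
        """LCP of the suffixes of t starting at distinct positions p and q."""
        lo, hi = sorted((rank[p], rank[q]))
        lo += 1  # minimum over height[lo+1 .. hi]
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    best_start, best_len = 0, 1
    for i in range(n):
        r = lcp(i, 2 * n - i)          # odd length, centered at s[i]
        if 2 * r - 1 > best_len:
            best_start, best_len = i - r + 1, 2 * r - 1
        if i > 0:
            r = lcp(i, 2 * n - i + 1)  # even length, centered between i-1 and i
            if 2 * r > best_len:
                best_start, best_len = i - r, 2 * r
    return s[best_start:best_start + best_len]


print(longest_palindrome("aabaaaab"))  # baaaab
```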
(5) Consecutively repeated substring. Given a string L of length n that is known to be some string S repeated R times, find the maximum value of R.

Analysis: enumerate the length k of the string S and check whether it works. To check, first see whether the length n of L is divisible by k, then see whether the longest common prefix of suffix(1) and suffix(k+1) equals n-k. Since suffix(1) is fixed in every longest common prefix query, there is no need to do the full preprocessing for the RMQ problem; it is enough to compute, for each position in the height array, the minimum of the values between that position and height[rank[1]]. The time complexity of the whole procedure is O(n).

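A sketch of this check with 0-based positions, so suffix(1) and suffix(k+1) in the text become positions 0 and k. The name `max_repetitions` and the array `to_first` are assumptions of this example; the suffix array is built naively, so only the part after it is linear.

```python
def max_repetitions(s):
    """Largest R such that s is some string repeated R times."""
    n = len(s)
    if n == 0:
        return 0
    sa = sorted(range(n), key=lambda i: s[i:])  # naive sort, illustration only
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    height = [0] * n
    for r in range(1, n):
        a, b, k = sa[r - 1], sa[r], 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        height[r] = k
    # to_first[p] = LCP of suffix(p) and suffix(0), by running minima of height
    # scanned outward from rank[0] in both directions.
    to_first = [0] * n
    to_first[0] = n
    cur = n
    for r in range(rank[0] - 1, -1, -1):
        cur = min(cur, height[r + 1])
        to_first[sa[r]] = cur
    cur = n
    for r in range(rank[0] + 1, n):
        cur = min(cur, height[r])
        to_first[sa[r]] = cur
    # The smallest valid period length k gives the largest repetition count.
    for k in range(1, n + 1):
        if n % k == 0 and (k == n or to_first[k] == n - k):
            return n // k


print(max_repetitions("abababab"))  # 4
```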
(6) Consecutively repeated substring with the most repetitions. Given a string, find the substring with the largest number of consecutive repetitions.

Analysis: first enumerate the length L, then determine how many times a substring of length L can repeat consecutively. One occurrence is always possible, so only the case of at least 2 repetitions needs to be considered. Suppose some substring S of length L is repeated at least twice in a row in the original string r. Then the repeated run must cover two adjacent characters among r[0], r[L], r[2L], r[3L], .... So it suffices to look at each pair of characters r[L*i] and r[L*(i+1)] and see how far they can be matched forward and backward. If the total matched length is K, the substring repeats K/L + 1 times in a row. Finally take the maximum.

Enumerating the length L takes n steps, and the computation for each L takes n/L time. Therefore the time complexity of the whole procedure is O(n/1 + n/2 + n/3 + ... + n/n) = O(n log n).

4.2 Problems on two strings

(1) Longest common substring. Given two strings A and B, find their longest common substring.

Analysis: write the second string after the first, separated by a character that appears in neither, and compute the suffix array of the new string. The answer is the maximum of height[i] over all i for which suffix(sa[i-1]) and suffix(sa[i]) do not come from the same original string.

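A minimal sketch of this approach. It assumes "\x00" occurs in neither input and uses it as the separator; the names `build` and `longest_common_substring` are chosen here, and the suffix array is built naively.

```python
def build(s):
    """Naive suffix array and height array (0-based), for illustration only."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])
    height = [0] * n
    for r in range(1, n):
        a, b, k = sa[r - 1], sa[r], 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        height[r] = k
    return sa, height


def longest_common_substring(a, b):
    """One longest common substring of a and b."""
    t = a + "\x00" + b  # separator assumed absent from both strings
    sa, height = build(t)
    best_len, best_pos = 0, 0
    for r in range(1, len(t)):
        p, q = sa[r - 1], sa[r]
        # Only adjacent suffixes that start in different original strings count.
        if (p < len(a)) != (q < len(a)) and height[r] > best_len:
            best_len, best_pos = height[r], q
    return t[best_pos:best_pos + best_len]


print(longest_common_substring("abcde", "xbcdy"))  # bcd
```

Because the separator occurs only once, a common prefix of two suffixes from different strings can never extend across it.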
(2) Number of common substrings of length at least k. Given two strings A and B, count their common substrings of length at least k (identical substrings at different positions are counted separately).

Analysis: the basic idea is to compute the length of the longest common prefix between every suffix of A and every suffix of B, and to add up the parts where that length is at least k (a longest common prefix of length x ≥ k contributes x - k + 1). First connect the two strings, separated by a character that appears in neither. After grouping by height values, the remaining task is to quickly compute the sum of the longest common prefixes between the suffixes within each group. Scan each group once: every time a suffix of B is encountered, count how many common substrings of length at least k it forms with the preceding suffixes of A; the suffixes of A need to be maintained efficiently with a monotonic stack. Then do the same again with the roles of A and B swapped.
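The monotonic stack scan can be sketched as below. This is one possible arrangement under stated assumptions: k ≥ 1, "\x00" occurs in neither string, each stack entry stores an LCP value together with the number of source suffixes that currently have that value, and the suffix array is built naively. The names `build`, `count_common_substrings` and `one_pass` are chosen here.

```python
def build(s):
    """Naive suffix array and height array (0-based), for illustration only."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])
    height = [0] * n
    for r in range(1, n):
        a, b, k = sa[r - 1], sa[r], 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        height[r] = k
    return sa, height


def count_common_substrings(a, b, k):
    """Count triples (i, j, length) with length >= k and a[i:i+length] == b[j:j+length]."""
    t = a + "\x00" + b  # separator assumed absent from both strings
    sa, height = build(t)
    na = len(a)

    def one_pass(is_source, is_target):
        answer, total, stack = 0, 0, []  # stack of (lcp value, number of suffixes)
        for r in range(1, len(t)):
            if height[r] < k:  # group boundary: forget everything seen so far
                total, stack = 0, []
                continue
            count = 0
            if is_source(sa[r - 1]):  # the previous suffix joins with LCP height[r]
                count = 1
                total += height[r] - k + 1
            # Earlier suffixes cannot share more than height[r] with this one.
            while stack and stack[-1][0] >= height[r]:
                value, c = stack.pop()
                total -= c * (value - height[r])
                count += c
            stack.append((height[r], count))
            if is_target(sa[r]):
                answer += total
        return answer

    in_a = lambda p: p < na
    in_b = lambda p: p > na
    return one_pass(in_a, in_b) + one_pass(in_b, in_a)


print(count_common_substrings("xx", "xx", 1))  # 5
```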

4.3 Problems on more than two strings

(1) Longest substring appearing in at least k strings. Given n strings, find the longest substring that appears in at least k of them.

Analysis: concatenate the n strings, separating them with pairwise distinct characters that do not appear in any of the strings, and compute the suffix array. Then binary search on the answer: divide the suffixes into groups and check whether the suffixes of some group come from at least k different original strings. The time complexity of this procedure is O(n log n).

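A sketch of this procedure, assuming k ≥ 2. To get pairwise distinct separators without reserving characters, this example works on a list of integers: separator number idx gets the value idx and every real character is shifted above all separators. That encoding, the naive suffix sort, and the names `longest_in_at_least_k`, `owner` and `find` are assumptions of the example.

```python
def longest_in_at_least_k(strings, k):
    """One longest substring that occurs in at least k of the strings (k >= 2)."""
    n = len(strings)
    t, owner = [], []  # owner[p] = index of the string that position p belongs to
    for idx, w in enumerate(strings):
        t.extend(ord(c) + n for c in w)
        owner.extend([idx] * len(w))
        t.append(idx)  # unique separator after each string
        owner.append(-1)
    m = len(t)
    sa = sorted(range(m), key=lambda i: t[i:])  # naive sort, illustration only
    height = [0] * m
    for r in range(1, m):
        a, b, c = sa[r - 1], sa[r], 0
        while a + c < m and b + c < m and t[a + c] == t[b + c]:
            c += 1
        height[r] = c

    def find(length):
        """Start position of a substring of this length in >= k strings, or -1."""
        seen = set()
        for r in range(m):
            if height[r] < length:
                seen = set()  # a new group starts at rank r
            if owner[sa[r]] >= 0:
                seen.add(owner[sa[r]])
            if len(seen) >= k:
                return sa[r]
        return -1

    low, high, best_pos = 0, max(map(len, strings), default=0), 0
    while low < high:  # binary search on the answer
        mid = (low + high + 1) // 2
        pos = find(mid)
        if pos >= 0:
            low, best_pos = mid, pos
        else:
            high = mid - 1
    return "".join(chr(v - n) for v in t[best_pos:best_pos + low])


print(longest_in_at_least_k(["abab", "baba", "xaby"], 3))  # ab
```

Problems (2) and (3) below reuse the same grouping and change only the per-group check.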
(2) Longest substring that appears at least twice, without overlap, in every string. Given n strings, find the longest substring that appears at least twice in each string, where the occurrences within a string do not overlap.

Analysis: the approach is similar to the previous problem. Concatenate the n strings, separating them with pairwise distinct characters that do not appear in any of the strings, and compute the suffix array. Then binary search on the answer and divide the suffixes into groups. To check a candidate answer, see whether there is a group of suffixes that occurs at least twice in every original string and for which, within every original string, the difference between the maximum and minimum starting positions of the suffixes is at least the current answer (this tests for non-overlap; if the problem does not require non-overlap, skip this test). The time complexity of this procedure is O(n log n).

(3) Longest substring that appears in every string either as is or reversed. Given n strings, find the longest substring that appears in each string either directly or after being reversed.

Analysis: the difference from the previous problems is that occurrences in the reversed string must also be considered. In fact, this does not make the problem harder. Write each string reversed after itself, separated by a character that does not appear in any string, then concatenate all n of these strings, again separated by pairwise distinct characters that do not appear in any string, and compute the suffix array. Then binary search on the answer and divide the suffixes into groups. To check a candidate answer, see whether there is a group of suffixes that appears in every original string or its reversed copy. The time complexity of this procedure is O(n log n).

5. Summary

The suffix array can be viewed as the leaves of the suffix tree laid out in an array from left to right, so a suffix array cannot do anything that a suffix tree cannot. One can even say that without LCP, the range of applications of suffix arrays is quite narrow. Combined with the LCP function, however, suffix arrays are very powerful and can accomplish most of the tasks that suffix trees can, because the LCP function in effect gives the lowest common ancestor of any two leaf nodes. Readers can explore this topic further on their own.

6. References

(1) Xu Zhilei, IOI2004 National Training Team paper, "Suffix Array"

(2) Luo Suiqian, IOI2004 National Training Team paper, "Suffix Array: A Powerful Tool for Handling Strings"


Original article. When reproducing it, please credit: reproduced from Dong's blog

Link to this article: http://dongxicheng.org/structure/suffix-array/
