Data structures: the suffix array

1. Overview

The suffix array is a powerful tool for solving string problems. It is easier to implement and uses less memory than a suffix tree. In practice, it is often used to solve complex string problems.

Most of this article is excerpted from references [1] and [2].

2. Suffix array

2.1 Several concepts

(1) The suffix array sa is a one-dimensional array holding a permutation sa[1], sa[2], ..., sa[n] of 1...n such that suffix(sa[i]) < suffix(sa[i+1]) for 1 ≤ i < n. In other words, the n suffixes of s are sorted in ascending order and their starting positions are stored in sa in that order. Here suffix(i) denotes the string s[i..n], that is, the suffix of s that starts at the i-th character.

(2) The rank array: rank[i] holds the rank of suffix(i) among all suffixes sorted in ascending order.

Put simply, the suffix array answers "which suffix is ranked i-th?", while the rank array answers "what is the rank of suffix i?". It is easy to see that the suffix array and the rank array are inverses of each other.

(3) The height array: height[i] is defined as the length of the longest common prefix of suffix(sa[i-1]) and suffix(sa[i]), that is, of the two suffixes with adjacent ranks.

(4) h[i] = height[rank[i]], that is, the length of the longest common prefix of suffix(i) and the suffix ranked immediately before it.

(5) LCP(i, j): for integers i, j between 1 and n, define LCP(i, j) = lcp(suffix(sa[i]), suffix(sa[j])), the length of the longest common prefix of the i-th and j-th suffixes in the suffix array. Here the function lcp(u, v) = max{ i | u and v agree on their first i characters }: compare the corresponding characters of u and v from the beginning, and the maximum number of leading characters that are equal is the length of the longest common prefix of the two strings.
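
The definitions above can be made concrete with a minimal sketch. It builds sa, rank, height and h for "aabaaaab" by plain sorting, which is far slower than the algorithm in section 3 and is only meant to mirror the definitions. The function names are hypothetical, and all indices are 0-based here, so every position and rank is one less than in the text.

```python
def lcp_len(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def suffix_structures(s):
    """Naively build sa, rank, height and h for s (all 0-based)."""
    n = len(s)
    # sa: start positions of the suffixes in ascending order.
    sa = sorted(range(n), key=lambda i: s[i:])
    # rank is the inverse permutation of sa.
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    # height[r]: LCP of the suffixes ranked r-1 and r; height[0] is 0 by convention.
    height = [0] * n
    for r in range(1, n):
        height[r] = lcp_len(s[sa[r - 1]:], s[sa[r]:])
    # h[i]: LCP of suffix i and the suffix ranked just before it.
    h = [height[rank[i]] for i in range(n)]
    return sa, rank, height, h


sa, rank, height, h = suffix_structures("aabaaaab")
print(sa)      # [3, 4, 5, 0, 6, 1, 7, 2]
print(rank)    # [3, 5, 7, 0, 1, 2, 4, 6]
print(height)  # [0, 3, 2, 3, 1, 2, 0, 1]
print(h)       # [3, 2, 1, 0, 3, 2, 1, 0]
```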

2.2 Several properties

(1) For i < j, LCP(i, j) = min{ height[k] | i+1 ≤ k ≤ j }. That is, computing LCP(i, j) is equivalent to a range-minimum query over elements i+1 to j of the one-dimensional array height.

Proof omitted.

(2) For i > 1 and rank[i] > 1, we have h[i] ≥ h[i-1] - 1.

Proof: Let suffix(k) be the suffix ranked immediately before suffix(i-1); their longest common prefix has length h[i-1]. Then suffix(k+1) is ranked before suffix(i) (this requires h[i-1] > 1; if h[i-1] ≤ 1, the inequality holds trivially), and the longest common prefix of suffix(k+1) and suffix(i) has length h[i-1] - 1. Therefore the longest common prefix of suffix(i) and the suffix ranked immediately before it has length at least h[i-1] - 1. By computing h[1], h[2], ..., h[n] in this order and using this property of the h array, the time complexity can be reduced to O(n).
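
Both properties can be checked by brute force on small inputs. The sketch below is only a sanity check, not a proof. The function name check_properties is hypothetical, indices are 0-based, and the arrays are built naively.

```python
def check_properties(s):
    """Return True if properties (1) and (2) hold for the string s."""
    n = len(s)

    def lcp_len(u, v):
        k = 0
        while k < len(u) and k < len(v) and u[k] == v[k]:
            k += 1
        return k

    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    height = [0] * n
    for r in range(1, n):
        height[r] = lcp_len(s[sa[r - 1]:], s[sa[r]:])
    h = [height[rank[i]] for i in range(n)]

    # Property (1): LCP of the suffixes ranked i and j (i < j)
    # equals the minimum of height over ranks i+1..j.
    for i in range(n):
        for j in range(i + 1, n):
            if lcp_len(s[sa[i]:], s[sa[j]:]) != min(height[i + 1:j + 1]):
                return False

    # Property (2): h[i] >= h[i-1] - 1 whenever suffix i is not the smallest.
    for i in range(1, n):
        if rank[i] > 0 and h[i] < h[i - 1] - 1:
            return False
    return True


print(check_properties("aabaaaab"))  # True
```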

3. Suffix Array implementation

This section gives efficient algorithms for computing sa, rank, height and h.

(1) Calculate rank array rank and suffix array sa

The doubling algorithm first computes the rank array, after which the suffix array sa is obtained in O(n) time. The doubling method sorts the substrings of length 2^k that start at each character and computes their ranks, that is, the rank values. k starts from 0 and is incremented by 1 each round; once 2^k is greater than n, the substrings of length 2^k starting at each character are effectively all the suffixes. At that point all these substrings have been fully compared, so no two rank values are equal, and the rank values are the final result. Each round reuses the ranks of the strings of length 2^(k-1) from the previous round: a string of length 2^k is represented by a pair of keys, the ranks of its two halves of length 2^(k-1), and one radix sort on these pairs yields the ranks of the strings of length 2^k. Take the string "aabaaaab" as an example (figure omitted); x and y are the two keys that represent a string of length 2^k.
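
The doubling idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the references: it uses Python's built-in comparison sort on the key pairs instead of a radix sort, so it runs in O(n log^2 n) rather than O(n log n). The function name build_sa_doubling is hypothetical and indices are 0-based.

```python
def build_sa_doubling(s):
    """Suffix array by prefix doubling; returns (sa, rank), both 0-based."""
    n = len(s)
    if n == 0:
        return [], []
    rank = [ord(c) for c in s]   # initial ranks: order by the first character
    sa = list(range(n))
    length = 1                   # ranks currently reflect the first `length` characters
    while True:
        # Key pair (x, y) for the substring of length 2*length starting at i:
        # x = rank of its first half, y = rank of its second half (-1 if past the end).
        def key(i):
            return (rank[i], rank[i + length] if i + length < n else -1)

        sa.sort(key=key)
        # Re-rank: equal key pairs get equal ranks.
        new_rank = [0] * n
        for r in range(1, n):
            new_rank[sa[r]] = new_rank[sa[r - 1]] + (key(sa[r]) != key(sa[r - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:   # all ranks distinct: every suffix is separated
            break
        length *= 2
    return sa, rank


sa, rank = build_sa_doubling("aabaaaab")
print(sa)    # [3, 4, 5, 0, 6, 1, 7, 2]
print(rank)  # [3, 5, 7, 0, 1, 2, 4, 6]
```

Swapping the comparison sort for two counting-sort passes on (x, y) recovers the radix-sort variant described above.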

(2) Compute array h

Let i loop from 1 to n and compute h[i] in sequence as follows:

If rank[i] = 1, then h[i] = 0. The number of characters compared is 0.

If i = 1 or h[i-1] ≤ 1, compare suffix(i) and suffix(sa[rank[i]-1]) directly from the first character until the characters differ, which gives h[i]. The number of characters compared is h[i]+1, which is no more than h[i]-h[i-1]+2.

Otherwise i > 1, rank[i] > 1 and h[i-1] > 1. By property (2), suffix(i) and suffix(sa[rank[i]-1]) agree on at least their first h[i-1]-1 characters, so the comparison can start from character h[i-1] and continue until a character differs, which gives h[i]. The number of characters compared is h[i]-h[i-1]+2.

Summing these bounds over all i shows that the whole algorithm runs in O(n) time.
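
The three cases above collapse into one loop that carries the previous value of h forward. A minimal sketch, assuming 0-based indices and a suffix array that is already available (the function name build_height is hypothetical):

```python
def build_height(s, sa):
    """Compute height in O(n) from s and its suffix array sa (0-based)."""
    n = len(s)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    height = [0] * n
    k = 0                          # holds h[i-1] when iteration i starts
    for i in range(n):
        if rank[i] == 0:           # smallest suffix: no predecessor, h[i] = 0
            k = 0
            continue
        j = sa[rank[i] - 1]        # suffix ranked immediately before suffix i
        if k > 0:
            k -= 1                 # property (2): the first h[i-1]-1 characters already match
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        height[rank[i]] = k        # k is now h[i]
    return height


s = "aabaaaab"
sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive sa, for the demo only
print(build_height(s, sa))  # [0, 3, 2, 3, 1, 2, 0, 1]
```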

4. Applications of the suffix array

4.1 Single-string problems

(1) Longest repeated substring, overlap allowed. Given a string, find the longest repeated substring, where the two occurrences may overlap.

Analysis: The answer is simply the maximum value in the height array.

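As a rough sketch of this observation, the code below scans adjacent suffixes in sorted order and keeps the longest common prefix seen. The suffix array is built naively for brevity, the function name is hypothetical, and when several substrings tie, the one met first in sorted order is returned.

```python
def longest_repeated_substring(s):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])
    best_len, best_start = 0, 0
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        k = 0                      # k becomes height[r]
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        if k > best_len:
            best_len, best_start = k, b
    return s[best_start:best_start + best_len]


print(longest_repeated_substring("aabaaaab"))  # aaa
print(longest_repeated_substring("banana"))    # ana
```
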
(2) Longest repeated substring, no overlap. Given a string, find the longest repeated substring, where the two occurrences must not overlap.

Analysis: Binary search on the answer, which turns the problem into a decision problem: are there two identical, non-overlapping substrings of length k? The key is the height array. Split the sorted suffixes into groups such that the height values between adjacent suffixes inside each group are all at least k. For example, for the string "aabaaaab" and k = 2, the suffixes fall into 4 groups: {aaaab, aaab, aab, aabaaaab}, {ab, abaaaab}, {b}, {baaaab}.

It is easy to see that two suffixes whose longest common prefix is at least k must be in the same group. So for each group, it suffices to check whether the difference between the maximum and minimum sa values in the group is at least k. If some group satisfies this, such substrings exist; otherwise they do not. The time complexity of the whole procedure is O(n log n).
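
The binary search plus grouping can be sketched as below. The suffix array and height array are built naively here, so the O(n log n) bound applies only to the search itself. Function names are hypothetical and positions are 0-based.

```python
def longest_nonoverlapping_repeat(s):
    """Length of the longest substring with two non-overlapping occurrences."""
    n = len(s)
    if n == 0:
        return 0
    sa = sorted(range(n), key=lambda i: s[i:])
    height = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        k = 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        height[r] = k

    def feasible(k):
        # Walk the groups whose internal height values are all >= k and
        # track the smallest and largest start position in the current group.
        lo = hi = sa[0]
        for r in range(1, n):
            if height[r] >= k:
                lo, hi = min(lo, sa[r]), max(hi, sa[r])
                if hi - lo >= k:       # far enough apart: no overlap
                    return True
            else:
                lo = hi = sa[r]        # a new group starts here
        return False

    lo, hi = 0, n // 2                 # two disjoint copies need 2*k <= n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if feasible(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


print(longest_nonoverlapping_repeat("aabaaaab"))  # 3
print(longest_nonoverlapping_repeat("aaaa"))      # 2
```

For "aabaaaab" the answer 3 comes from "aab", which starts at 0-based positions 0 and 5.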

(3) Longest substring repeated at least k times, overlap allowed. Given a string, find the longest substring that occurs at least k times, where the k occurrences may overlap.

Analysis: Binary search on the answer, then split the suffixes into groups as before. The difference is that here we check whether some group contains at least k suffixes. If so, there are k identical substrings that satisfy the condition; otherwise there are none. The time complexity of this procedure is O(n log n).

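A minimal sketch of this variant, again with a naively built suffix array and a hypothetical function name. A group of k suffixes shows up as a run of k-1 consecutive height values that are all at least the candidate length.

```python
def longest_substring_at_least_k_times(s, k):
    """Length of the longest substring occurring at least k times (overlap allowed)."""
    n = len(s)
    if n == 0:
        return 0
    if k <= 1:
        return n                      # the whole string occurs once
    sa = sorted(range(n), key=lambda i: s[i:])
    height = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        c = 0
        while a + c < n and b + c < n and s[a + c] == s[b + c]:
            c += 1
        height[r] = c

    def feasible(length):
        run = 1                       # size of the current group
        for r in range(1, n):
            run = run + 1 if height[r] >= length else 1
            if run >= k:
                return True
        return False

    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if feasible(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


print(longest_substring_at_least_k_times("aabaaaab", 3))  # 2
print(longest_substring_at_least_k_times("aabaaaab", 2))  # 3
```
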
(4) Longest palindromic substring. Given a string, find the longest palindromic substring.

Analysis: Append the reversed string to the original string, separated by a special character. The problem then becomes finding the longest common prefix of certain pairs of suffixes of this new string.
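
One way to read this reduction: for each possible center, the part of the palindrome to the right is a prefix of a suffix of the original string, and the part to the left is a prefix of a suffix of the reversed copy. The sketch below assumes the separator "\0" does not occur in the input, builds the arrays naively, and answers each LCP query with a plain minimum over height instead of an RMQ structure. All names are hypothetical.

```python
def longest_palindrome(s, sep="\0"):
    """Longest palindromic substring via LCP queries on s + sep + reversed(s)."""
    n = len(s)
    if n == 0:
        return ""
    t = s + sep + s[::-1]
    m = len(t)
    sa = sorted(range(m), key=lambda i: t[i:])
    rank = [0] * m
    for r, start in enumerate(sa):
        rank[start] = r
    height = [0] * m
    for r in range(1, m):
        i, j = sa[r - 1], sa[r]
        k = 0
        while i + k < m and j + k < m and t[i + k] == t[j + k]:
            k += 1
        height[r] = k

    def lcp(i, j):
        # Property (1): minimum of height between the two ranks (i != j here).
        lo, hi = sorted((rank[i], rank[j]))
        return min(height[lo + 1:hi + 1])

    best_start, best_len = 0, 1
    for i in range(n):
        # s[i] sits at index 2*n - i inside the reversed copy.
        r = lcp(i, 2 * n - i)              # odd length, centered at i
        if 2 * r - 1 > best_len:
            best_start, best_len = i - r + 1, 2 * r - 1
        if i > 0:
            r = lcp(i, 2 * n - i + 1)      # even length, centered between i-1 and i
            if 2 * r > best_len:
                best_start, best_len = i - r, 2 * r
    return s[best_start:best_start + best_len]


print(longest_palindrome("babad"))  # bab
print(longest_palindrome("cbbd"))   # bb
```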

(5) Continuous repeating substring. Given a string L that is known to be obtained by repeating some string S R times, find the maximum value of R.

Analysis: Enumerate the length k of S and check whether it works. To check, first see whether the length n of L is divisible by k, then see whether the longest common prefix of suffix(1) and suffix(k+1) equals n-k. Since suffix(1) is fixed in every longest-common-prefix query, the RMQ problem does not need full preprocessing: it is enough to compute the minimum between each position in the height array and height[rank[1]]. The time complexity of the whole procedure is O(n).

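The check can be sketched as follows. Because one side of every query is the whole string, a single sweep outward from its rank gives all the needed LCP values. The suffix array is built naively here, so only the part after construction is linear. Names are hypothetical and indices are 0-based, so suffix(1) and suffix(k+1) in the text are suffix 0 and suffix k in the code.

```python
def max_repetitions(s):
    """Largest R such that s equals some string repeated R times."""
    n = len(s)
    if n == 0:
        return 0
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    height = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        c = 0
        while a + c < n and b + c < n and s[a + c] == s[b + c]:
            c += 1
        height[r] = c

    # lcp0[r] = LCP of suffix 0 with the suffix of rank r,
    # obtained by running minima of height on both sides of rank[0].
    p = rank[0]
    lcp0 = [0] * n
    lcp0[p] = n
    cur = n
    for r in range(p - 1, -1, -1):
        cur = min(cur, height[r + 1])
        lcp0[r] = cur
    cur = n
    for r in range(p + 1, n):
        cur = min(cur, height[r])
        lcp0[r] = cur

    # The smallest valid period length k gives the largest R.
    for k in range(1, n + 1):
        if n % k == 0 and (k == n or lcp0[rank[k]] == n - k):
            return n // k


print(max_repetitions("abababab"))   # 4
print(max_repetitions("abcabcabc"))  # 3
print(max_repetitions("abcd"))       # 1
```
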
(6) Continuous repeating substring with the most repetitions. Given a string, find the continuous repeating substring with the largest number of repetitions.

Analysis: First enumerate the length L, then find how many times a substring of length L can repeat consecutively. Repeating once is always possible, so only the case of at least 2 repetitions needs to be considered. Suppose a substring S of length L appears at least 2 times consecutively in the original string. Then this run must cover two adjacent characters among r[0], r[L], r[L*2], r[L*3], ... So it is enough to look at how far the characters r[L*i] and r[L*(i+1)] can be matched forward and backward. Call this total length K; then there are K/L+1 consecutive repetitions. Finally, take the maximum over all cases.

Enumerating the length L takes n iterations, and each one costs O(n/L). Therefore the time complexity of the entire procedure is O(n/1 + n/2 + n/3 + ... + n/n) = O(n log n).
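
The sampling idea can be sketched as below. Note the substitution: the forward and backward matches are done by direct character comparison rather than by suffix-array LCP queries with RMQ, so this sketch does not reach the O(n log n) bound; it only illustrates why checking positions 0, L, 2L, ... is enough. The function name is hypothetical.

```python
def max_consecutive_repetitions(s):
    """Largest number of times some substring repeats consecutively in s."""
    n = len(s)
    best = 1 if n else 0
    for L in range(1, n):                  # candidate length of the repeated unit
        for p in range(0, n - L, L):       # sample positions 0, L, 2L, ...
            q = p + L
            f = 0                          # forward match of s[p:] and s[q:]
            while q + f < n and s[p + f] == s[q + f]:
                f += 1
            b = 0                          # backward match ending just before p and q
            while p - b - 1 >= 0 and s[p - b - 1] == s[q - b - 1]:
                b += 1
            # The run with period L around p has length L + f + b.
            best = max(best, (f + b) // L + 1)
    return best


print(max_consecutive_repetitions("aabaaaab"))    # 4
print(max_consecutive_repetitions("xababab"))     # 3
print(max_consecutive_repetitions("abcabcabcx"))  # 3
```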

4.2 Two-string problems

(1) The longest common substring. Given two strings A and B, find the longest common substring.

Analysis: Append the second string to the first, separated by a character that occurs in neither, and build the suffix array of the new string. The answer is the maximum height[i] over all i for which suffix(sa[i-1]) and suffix(sa[i]) come from different original strings.
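
A minimal sketch of this approach, assuming the separator "\0" occurs in neither input and building the suffix array naively. The function name is hypothetical; with ties, the candidate met first in sorted order is returned.

```python
def longest_common_substring(a, b, sep="\0"):
    """Longest substring common to a and b, via the suffix array of a + sep + b."""
    t = a + sep + b
    n, na = len(t), len(a)
    sa = sorted(range(n), key=lambda i: t[i:])
    best = ""
    for r in range(1, n):
        i, j = sa[r - 1], sa[r]
        # Only adjacent suffixes that start in different original strings count.
        if (i < na) == (j < na):
            continue
        k = 0                      # k becomes height[r]
        while i + k < n and j + k < n and t[i + k] == t[j + k]:
            k += 1
        if k > len(best):
            best = t[j:j + k]
    return best


print(longest_common_substring("xabcdy", "zbcdw"))   # bcd
print(longest_common_substring("banana", "ananas"))  # anana
```

The common prefix can never run across the separator, because the separator occurs only once in the combined string.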

(2) The number of common substrings of length at least k. Given two strings A and B, find the number of common substrings of length at least k (identical substrings at different positions are counted separately).

Analysis: The basic idea is to compute the length of the longest common prefix between every suffix of A and every suffix of B, and to add up the contributions of all pairs whose longest common prefix has length at least k. First, concatenate the two strings, separated by a character that appears in neither. After grouping by height value, the remaining task is to quickly compute the sum of the longest common prefixes between the suffixes in each group. Scan each group once: whenever a suffix of B is encountered, count how many common substrings of length at least k it forms with the suffixes of A seen before it, where the suffixes of A are maintained efficiently with a monotonic stack. Then do the same with the roles of A and B swapped.
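
The scan with a monotonic stack can be sketched as follows. A pair of suffixes with longest common prefix p ≥ k contributes p - k + 1 substrings. The sketch assumes k ≥ 1 and that the separator "\0" occurs in neither string, and it builds the suffix array naively. All names are hypothetical.

```python
def count_common_substrings(a, b, k, sep="\0"):
    """Count pairs (substring of a, substring of b) that are equal and have length >= k."""
    t = a + sep + b
    n, na = len(t), len(a)
    sa = sorted(range(n), key=lambda i: t[i:])
    height = [0] * n
    for r in range(1, n):
        i, j = sa[r - 1], sa[r]
        c = 0
        while i + c < n and j + c < n and t[i + c] == t[j + c]:
            c += 1
        height[r] = c

    def one_pass(is_source, is_target):
        total = 0
        stack = []      # (lcp value, number of source suffixes sharing it), increasing
        running = 0     # sum of (lcp - k + 1) over the source suffixes on the stack
        for r in range(1, n):
            if height[r] < k:          # group boundary: forget everything
                stack.clear()
                running = 0
                continue
            cnt = 0
            if is_source(sa[r - 1]):   # the previous suffix joins with lcp height[r]
                cnt = 1
                running += height[r] - k + 1
            # Earlier suffixes cannot keep an lcp larger than height[r].
            while stack and stack[-1][0] >= height[r]:
                v, c = stack.pop()
                running -= c * (v - height[r])
                cnt += c
            stack.append((height[r], cnt))
            if is_target(sa[r]):
                total += running
        return total

    in_a = lambda p: p < na
    in_b = lambda p: p > na
    # Pass 1 pairs each suffix of b with earlier suffixes of a; pass 2 swaps the roles.
    return one_pass(in_a, in_b) + one_pass(in_b, in_a)


print(count_common_substrings("xx", "xx", 1))  # 5
```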

4.3 Problems on more than two strings

(1) Longest substring that appears in at least k strings. Given n strings, find the longest substring that appears in at least k of them.

Analysis: Concatenate the n strings, separating them with distinct characters that do not appear in any of the strings, and build the suffix array. Then binary search on the answer: split the suffixes into groups and check whether the suffixes of some group come from at least k different original strings. The time complexity of this procedure is O(n log n).

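A minimal sketch of this procedure. To get distinct separators without assuming anything about the input alphabet, the strings are mapped to a list of integers: separator number idx for string idx, and ord(c) + n for ordinary characters. The arrays are built naively, the function name is hypothetical, and 1 ≤ k ≤ n is assumed.

```python
def longest_substring_in_k_strings(strings, k):
    """Longest substring that occurs in at least k of the given strings."""
    m = len(strings)
    t, origin = [], []      # origin[p] = (string index, offset), or None for a separator
    for idx, s in enumerate(strings):
        for off, c in enumerate(s):
            t.append(ord(c) + m)
            origin.append((idx, off))
        t.append(idx)       # unique separator, smaller than every real character
        origin.append(None)
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])
    height = [0] * n
    for r in range(1, n):
        i, j = sa[r - 1], sa[r]
        h = 0
        while i + h < n and j + h < n and t[i + h] == t[j + h]:
            h += 1
        height[r] = h

    def witness(length):
        """Return a substring of this length found in >= k strings, or None."""
        seen = set()
        for r in range(n):
            if r == 0 or height[r] < length:
                seen = set()                     # a new group starts at rank r
            src = origin[sa[r]]
            # Ignore separators and suffixes too short to hold `length` characters.
            if src is not None and len(strings[src[0]]) - src[1] >= length:
                seen.add(src[0])
                if len(seen) >= k:
                    idx, off = src
                    return strings[idx][off:off + length]
        return None

    lo, hi = 0, max((len(s) for s in strings), default=0)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        w = witness(mid)
        if w is not None:
            lo, best = mid, w
        else:
            hi = mid - 1
    return best


print(longest_substring_in_k_strings(["abcdefg", "bcdefgh", "cdefghi"], 3))  # cdefg
```
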
(2) Longest substring that appears at least twice, without overlap, in every string. Given n strings, find the longest substring that has at least two non-overlapping occurrences in each string.

Analysis: The approach is similar to the previous problem. First concatenate the n strings, separating them with distinct characters that do not appear in any of the strings, and build the suffix array. Then binary search on the answer and group the suffixes. To check a candidate answer, see whether there is a group whose suffixes appear at least twice in each of the original strings and, within each original string, whether the difference between the maximum and minimum starting positions of those suffixes is at least the current answer (this checks that the occurrences do not overlap; if the problem does not forbid overlap, this check can be skipped). The time complexity of this procedure is O(n log n).

(3) Longest substring that appears, either as is or reversed, in every string. Given n strings, find the longest substring that appears in each of them either directly or in reversed form.

Analysis: The only difference here is that we must also check for occurrences in the reversed strings, which does not really make the problem harder. Write each string followed by its reverse, separated by a character that does not appear in any string, then concatenate the n results, again separated by distinct characters that do not appear in any string, and build the suffix array. Then binary search on the answer and group the suffixes. To check a candidate answer, see whether there is a group whose suffixes appear in every original string or in its reversed copy. The time complexity of this procedure is O(n log n).

5. Summary

A suffix array can be viewed as the leaf nodes of a suffix tree laid out in an array from left to right, so what a suffix array can do never exceeds what a suffix tree can do. One could even say that without LCP, the range of application of the suffix array is very narrow. Combined with the LCP function, however, the suffix array is powerful enough to handle most tasks a suffix tree can handle, because the LCP function in effect gives the lowest common ancestor of any two leaf nodes. Readers can explore this further on their own.

6. References

[1] Xu Zhilei, IOI2004 national training team paper, "Suffix Array"

[2] Luo Suiqian, IOI2009 national training team paper, "Suffix Array: A Powerful Tool for Processing Strings"


Original article. When reproducing, please credit: reprinted from Dong's Blog

Link to this article: http://dongxicheng.org/structure/suffix-array/
