In string processing, the suffix tree and the suffix array are both powerful tools. The suffix tree is well known, whereas the suffix array is rarely covered in material published in China. In fact, the suffix array is a very elegant alternative to the suffix tree: it is easier to program, it can implement many of the suffix tree's features with a time complexity that is not much worse, and it occupies much less space. It is fair to say that in informatics competitions suffix arrays are more practical than suffix trees. This article therefore introduces the basic concepts of the suffix array, how to construct it, and how to construct the accompanying longest common prefix array, and finally discusses applications of suffix arrays through some examples.
Basic Concepts
First, some necessary definitions:
Character set: a character set Σ is a set with a total order. That is, any two different elements α and β of Σ can be compared: either α < β or β < α (that is, α > β). The elements of Σ are called characters.
String: a string S is an array of n characters arranged in sequence. n is the length of S, written len(S). The i-th character of S is written S[i].
Substring: the substring S[i..j] of S, i ≤ j, is the segment of S from position i to position j, that is, the string formed by S[i], S[i+1], ..., S[j] in order.
Suffix: a suffix is the special substring that runs from some position i to the end of the string. The suffix of S starting at position i is written Suffix(S, i), that is, Suffix(S, i) = S[i..len(S)].
String comparison: comparing strings normally means comparing them in dictionary order. For two strings u and v, let i start from 1 and compare u[i] with v[i] in turn. If they are equal, increase i by 1. Otherwise, if u[i] < v[i] then u < v, and if u[i] > v[i] then u > v (that is, v < u), and the comparison ends. If i > len(u) or i > len(v) is reached with no result, then u < v if len(u) < len(v), u = v if len(u) = len(v), and u > v if len(u) > len(v).
By this definition, two suffixes u and v of S with different starting positions can never compare equal, because the necessary condition for u = v, namely len(u) = len(v), cannot hold.
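The comparison rule above can be written out directly. The following is a minimal sketch, assuming Python strings and 0-based indexing (the text uses 1-based positions); the function name compare is illustrative and not part of the original article.

```python
def compare(u, v):
    """Return -1, 0 or 1 as u < v, u == v or u > v in dictionary order."""
    i = 0
    # Walk both strings in step until a character differs or one runs out.
    while i < len(u) and i < len(v):
        if u[i] != v[i]:
            return -1 if u[i] < v[i] else 1
        i += 1
    # No difference found: the shorter string is the smaller one.
    if len(u) == len(v):
        return 0
    return -1 if len(u) < len(v) else 1
```

Two suffixes of the same string always have different lengths, so for them the function never returns 0.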
In what follows, we fix a character set Σ and a string S with len(S) = n and S[n] = '$'. That is, S ends with a special character '$' that is smaller than every character in Σ. Apart from S[n], every character of S belongs to Σ. For this fixed S, the suffix starting at position i is written simply as Suffix(i), omitting the parameter S.
Suffix array: the suffix array SA is a one-dimensional array holding a permutation SA[1], SA[2], ..., SA[n] of 1..n such that Suffix(SA[i]) < Suffix(SA[i+1]) for 1 ≤ i < n. In other words, the n suffixes of S are sorted in ascending order and their starting positions are placed into SA in that order.
Rank array: the rank array Rank is the inverse of SA, that is, if SA[i] = j then Rank[j] = i. It is easy to see that Rank[i] stores the position of Suffix(i) among all suffixes in ascending order.
Construction
How do we construct a suffix array? The most direct method is to treat the suffixes of S as ordinary strings and sort them in ascending order with a general string sorting method.
This method is clumsy: it ignores the close relationships among the suffixes, so it cannot be very efficient. Even with multi-key quicksort, which suits string sorting well, the worst-case time complexity is still O(n^2), which does not meet our needs.
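For reference, the direct method and the SA/Rank relationship can be sketched in a few lines. This is only an illustration, assuming Python and 0-based positions (so SA holds values 0..n-1 rather than 1..n); the function names are hypothetical.

```python
def suffix_array_naive(s):
    """Sort the suffix start positions by comparing whole suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_from_sa(sa):
    """Inverse permutation of SA: rank[sa[r]] = r."""
    rank = [0] * len(sa)
    for r, start in enumerate(sa):
        rank[start] = r
    return rank


s = "banana$"            # '$' sorts below the letters in ASCII
sa = suffix_array_naive(s)
rank = rank_from_sa(sa)
print(sa)    # [6, 5, 3, 1, 0, 4, 2]
print(rank)  # [4, 3, 6, 2, 5, 1, 0]
```

Each comparison inside the sort may scan O(n) characters, which is where the poor worst case comes from.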
The following describes the doubling algorithm, which makes full use of the relationships between suffixes and reduces the worst-case time complexity of constructing a suffix array to O(n log n).
For a string u, define its k-prefix u^k as u[1..k] if len(u) ≥ k, and as u itself otherwise.
Define the k-prefix comparison relations <_k, =_k and ≤_k as follows. For two strings u and v:
u <_k v if and only if u^k < v^k
u =_k v if and only if u^k = v^k
u ≤_k v if and only if u^k ≤ v^k
Intuitively, these relations compare only the first k characters of the two strings in dictionary order. In particular, when testing for less-than or greater-than it does not matter if one string is shorter than k characters, as long as the result is decided before k characters have been compared.
From these definitions we obtain the following important properties:
Property 1.1: for k ≥ n, Suffix(i) <_k Suffix(j) is equivalent to Suffix(i) < Suffix(j).
Property 1.2: Suffix(i) =_{2k} Suffix(j) is equivalent to Suffix(i) =_k Suffix(j) and Suffix(i+k) =_k Suffix(j+k).
Property 1.3: Suffix(i) <_{2k} Suffix(j) is equivalent to Suffix(i) <_k Suffix(j), or (Suffix(i) =_k Suffix(j) and Suffix(i+k) <_k Suffix(j+k)).
One issue arises here: when i+k > n or j+k > n, Suffix(i+k) or Suffix(j+k) is not well defined. This case needs no special handling, however. It means that Suffix(i) or Suffix(j) has length at most k, so its k-prefix ends with '$', and the k-prefix comparison therefore cannot come out equal. The first k characters already decide the order, and the rest of the expression can be ignored. This is one purpose of ending S with '$'.
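Properties 1.2 and 1.3 can be sanity-checked by brute force on a small string. The sketch below is an illustration only, assuming Python, 0-based positions, and the convention that a suffix starting past the end of the string is the empty string; check_properties is a hypothetical helper, not something from the original article.

```python
def check_properties(s):
    """Brute-force check of Properties 1.2 and 1.3 for every i, j and k."""
    n = len(s)

    def pre(i, k):
        # k-prefix of the suffix starting at 0-based position i;
        # past the end of s this is the empty string.
        return s[i:i + k]

    for k in range(1, n + 1):
        for i in range(n):
            for j in range(n):
                eq_k = pre(i, k) == pre(j, k)
                lt_k = pre(i, k) < pre(j, k)
                # Property 1.2: equality of 2k-prefixes.
                if (pre(i, 2 * k) == pre(j, 2 * k)) != (
                    eq_k and pre(i + k, k) == pre(j + k, k)
                ):
                    return False
                # Property 1.3: strict order of 2k-prefixes.
                if (pre(i, 2 * k) < pre(j, 2 * k)) != (
                    lt_k or (eq_k and pre(i + k, k) < pre(j + k, k))
                ):
                    return False
    return True


print(check_properties("banana$"))  # True
```

The check runs in cubic time or worse, so it is only meant for convincing oneself on short inputs.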
Define the k-suffix array SA_k as a permutation SA_k[1], SA_k[2], ..., SA_k[n] of 1..n such that Suffix(SA_k[i]) ≤_k Suffix(SA_k[i+1]) for 1 ≤ i < n. That is, all suffixes are sorted in ascending order under the k-prefix comparison, and their starting positions are placed into SA_k in that order.
Define the k-rank array Rank_k, where Rank_k[i] is the rank of Suffix(i) in ascending order under the k-prefix comparison, that is, 1 plus the number of j with Suffix(j) <_k Suffix(i). Given SA_k, Rank_k is easily obtained in O(n) time.
Suppose SA_k and Rank_k are known. Then SA_{2k} and Rank_{2k} are easy to obtain: by Properties 1.2 and 1.3, each 2k-prefix comparison can be expressed by a constant number of k-prefix comparisons, and the Rank_k array gives a way to evaluate <_k and =_k in constant time:
Suffix(i) <_k Suffix(j) if and only if Rank_k[i] < Rank_k[j]
Suffix(i) =_k Suffix(j) if and only if Rank_k[i] = Rank_k[j]
Comparing Suffix(i) and Suffix(j) under the 2k-prefix relation therefore takes constant time, so sorting all suffixes under ≤_{2k} is no different from an ordinary sort: each Suffix(i) effectively has a primary key Rank_k[i] and a secondary key Rank_k[i+k]. With an O(n log n) sort such as quicksort, constructing SA_{2k} from SA_k and Rank_k takes O(n log n). A smarter approach is radix sort, which takes O(n).
Once SA_{2k} is known, Rank_{2k} can be built from it in O(n) time. So deriving SA_{2k} and Rank_{2k} from SA_k and Rank_k takes O(n) time in total.
One problem remains: how to construct SA_1 and Rank_1. This is simple: <_1, =_1 and ≤_1 just compare the first characters of the strings, so sorting the suffixes by first character yields SA_1. Using quicksort, for example, this takes O(n log n), so SA_1 and Rank_1 can be obtained in O(n log n) time.
From SA_1 and Rank_1 we obtain SA_2 and Rank_2 in O(n) time, and likewise SA_4 and Rank_4 in another O(n). Continuing in this way we obtain SA_2 and Rank_2, SA_4 and Rank_4, SA_8 and Rank_8, ..., up to SA_m and Rank_m, where m = 2^k and m ≥ n. By Property 1.1, SA_m is the same as SA. This takes about log n passes of O(n) each, so the suffix array SA and the rank array Rank can be computed in O(n log n) time.
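The doubling scheme can be sketched as follows. This is a simplified illustration, not the article's exact procedure: it assumes Python and 0-based positions, uses the built-in comparison sort on the key pair (Rank_k[i], Rank_k[i+k]) instead of radix sort (so each round costs O(n log n) and the total is O(n log^2 n)), and uses -1 as the secondary key when i+k runs past the end, which plays the role of the '$' terminator. The function name is hypothetical.

```python
def suffix_array_doubling(s):
    """Return (sa, rank) for s, doubling the compared prefix length each round."""
    n = len(s)
    if n == 0:
        return [], []
    # Round k = 1: character codes are order-preserving stand-ins for Rank_1.
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        def key(i):
            # Primary key Rank_k[i], secondary key Rank_k[i + k];
            # -1 stands for "no characters left" and sorts first.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)                      # this is SA_2k
        new_rank = [0] * n
        for j in range(1, n):
            # Equal keys share a rank; a different key gets the next rank.
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank                       # this is Rank_2k (0-based)
        if rank[sa[-1]] == n - 1:             # all ranks distinct: fully sorted
            return sa, rank
        k *= 2


sa, rank = suffix_array_doubling("banana$")
print(sa)  # [6, 5, 3, 1, 0, 4, 2]
```

Replacing the comparison sort with a two-pass radix sort on the key pair is what brings each round down to O(n), as described in the text.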
Longest common prefix
We can now compute the suffix array SA of a string S in O(n log n) time. SA alone already supports many tasks, such as pattern matching in O(m log n) time, where m and n are the lengths of the pattern and of the text being searched. To unlock the full power of the suffix array, however, we need an auxiliary tool: the longest common prefix (LCP).
For two strings u and v, define lcp(u, v) = max{i | u =_i v}, that is, the number of leading characters on which u and v agree, known as the length of the longest common prefix of the two strings.
For positive integers i and j, define LCP(i, j) = lcp(Suffix(SA[i]), Suffix(SA[j])), where i and j are integers from 1 to n. LCP(i, j) is the length of the longest common prefix of the i-th and j-th suffixes in the suffix array.
LCP has two obvious properties:
Property 2.1: LCP(i, j) = LCP(j, i)
Property 2.2: LCP(i, i) = len(Suffix(SA[i])) = n - SA[i] + 1
Thanks to these two properties, when computing LCP(i, j) we only need to consider i < j: if i > j we can swap i and j, and if i = j we can output n - SA[i] + 1 directly.
Computing LCP(i, j) straight from the definition, by comparing corresponding characters in turn, is clearly inefficient, taking O(n) time. Appropriate preprocessing is needed to reduce the cost of each LCP computation.
Careful analysis shows that the LCP function has a very useful property:
LCP Theorem: for i < j, LCP(i, j) = min{LCP(k-1, k) | i+1 ≤ k ≤ j}.
To prove the LCP Theorem, we first prove the LCP Lemma:
LCP Lemma: for any 1 ≤ i < j < k ≤ n, LCP(i, k) = min{LCP(i, j), LCP(j, k)}.
Proof: let p = min{LCP(i, j), LCP(j, k)}, so that LCP(i, j) ≥ p and LCP(j, k) ≥ p.
Let Suffix(SA[i]) = u, Suffix(SA[j]) = v, Suffix(SA[k]) = w.
Since u and v agree on their first LCP(i, j) characters, u =_p v; similarly v =_p w.
Hence Suffix(SA[i]) =_p Suffix(SA[k]), that is, LCP(i, k) ≥ p. (1)
Now suppose LCP(i, k) = q > p. Then
u[1] = w[1], u[2] = w[2], ..., u[q] = w[q].
But min{LCP(i, j), LCP(j, k)} = p means that u[p+1] ≠ v[p+1] or v[p+1] ≠ w[p+1].
Let u[p+1] = x, v[p+1] = y, w[p+1] = z. Clearly x ≤ y ≤ z, because u < v < w and the three strings agree on their first p characters. From p < q we get p+1 ≤ q, so x = z, and therefore x = y = z, which contradicts u[p+1] ≠ v[p+1] or v[p+1] ≠ w[p+1].
Therefore q > p is impossible, that is, LCP(i, k) ≤ p. (2)
From (1) and (2), LCP(i, k) = p = min{LCP(i, j), LCP(j, k)}, which proves the LCP Lemma.
The LCP Theorem can then be proved as follows:
When j - i = 1 or j - i = 2, it clearly holds.
Suppose the LCP Theorem holds when j - i = m. When j - i = m + 1,
by the LCP Lemma, LCP(i, j) = min{LCP(i, i+1), LCP(i+1, j)}.
Because j - (i+1) = m, LCP(i+1, j) = min{LCP(k-1, k) | i+2 ≤ k ≤ j}, so when j - i = m + 1 we still have
LCP(i, j) = min{LCP(i, i+1), min{LCP(k-1, k) | i+2 ≤ k ≤ j}} = min{LCP(k-1, k) | i+1 ≤ k ≤ j}.
By mathematical induction, the LCP Theorem holds.
The LCP Theorem immediately yields:
LCP Corollary: for i ≤ j < k, LCP(j, k) ≥ LCP(i, k).
Define a one-dimensional array height with height[i] = LCP(i-1, i) for 1 < i ≤ n, and set height[1] = 0.
By the LCP Theorem, LCP(i, j) = min{height[k] | i+1 ≤ k ≤ j}. Computing LCP(i, j) is thus equivalent to finding the minimum of the elements of height with indices from i+1 to j. With the height array fixed, this is the classic RMQ (range minimum query) problem.
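The identity LCP(i, j) = min{height[k] | i+1 ≤ k ≤ j} can be checked on a small example. The sketch below assumes Python and 0-based ranks, builds the suffix array by plain sorting, and fills height by direct character comparison, so it illustrates the identity rather than an efficient method; the helper names are hypothetical.

```python
def lcp_str(u, v):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


s = "banana$"
sa = sorted(range(len(s)), key=lambda i: s[i:])
# height[r] = LCP of the suffixes ranked r-1 and r; height[0] = 0.
height = [0] + [lcp_str(s[sa[r - 1]:], s[sa[r]:]) for r in range(1, len(s))]


def lcp_by_height(i, j):
    """LCP of the suffixes ranked i and j (0-based, i < j) via the LCP Theorem."""
    return min(height[i + 1:j + 1])


print(height)               # [0, 0, 1, 3, 0, 0, 2]
print(lcp_by_height(1, 3))  # 1, the common prefix "a" of "a$" and "anana$"
```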
For the RMQ problem, a segment tree or a static sorted tree gives O(n log n) preprocessing and O(log n) per query. Better still is the standard RMQ algorithm, which preprocesses in O(n) time and answers each query in constant time.
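As a concrete illustration of constant-time range minimum queries, here is a sparse table. Note that this is a different, simpler technique than the O(n)-preprocessing algorithm mentioned above: it needs O(n log n) time and space to build, but still answers each query in O(1). The sketch assumes Python and 0-based inclusive index ranges; the function names are hypothetical.

```python
def build_sparse_table(a):
    """table[j][i] holds min(a[i : i + 2**j])."""
    n = len(a)
    table = [list(a)]
    j = 1
    while (1 << j) <= n:
        prev = table[-1]
        half = 1 << (j - 1)
        # A block of length 2**j is the union of two blocks of length 2**(j-1).
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << j) + 1)])
        j += 1
    return table


def range_min(table, lo, hi):
    """Minimum of a[lo..hi] inclusive, assuming 0 <= lo <= hi < len(a)."""
    j = (hi - lo + 1).bit_length() - 1   # largest j with 2**j <= length
    # Two overlapping blocks of length 2**j cover the whole range.
    return min(table[j][lo], table[j][hi - (1 << j) + 1])


height = [0, 0, 1, 3, 0, 0, 2]           # height array of "banana$"
table = build_sparse_table(height)
print(range_min(table, 2, 3))            # 1
```

With 0-based ranks i < j, the longest common prefix of the suffixes ranked i and j is then range_min(table, i + 1, j).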
For a fixed string S, the height array is obviously fixed too. If we can compute the height array efficiently, then after RMQ preprocessing each LCP(i, j) can be computed in constant time. So only one problem remains: how to compute the height array as efficiently as possible.
As with the construction of the suffix array, we should not treat the n suffixes as unrelated ordinary strings, but exploit the relationships among them. The following property is very useful.
For convenience, let h[i] = height[Rank[i]], that is, height[i] = h[SA[i]]. The h array satisfies:
Property 3: for i > 1 and Rank[i] > 1, h[i] ≥ h[i-1] - 1.
To prove Property 3 we first establish two facts.
Let i < n and j < n, and suppose Suffix(i) and Suffix(j) satisfy lcp(Suffix(i), Suffix(j)) > 1. Then the following two facts hold:
Fact 1: Suffix(i) < Suffix(j) is equivalent to Suffix(i+1) < Suffix(j+1).
Fact 2: lcp(Suffix(i+1), Suffix(j+1)) = lcp(Suffix(i), Suffix(j)) - 1.
This may look surprising, but it is quite natural: lcp(Suffix(i), Suffix(j)) > 1 means that Suffix(i) and Suffix(j) share the same first character. Call it α. Then Suffix(i) is α followed by Suffix(i+1), and Suffix(j) is α followed by Suffix(j+1). When Suffix(i) and Suffix(j) are compared, the first characters are equal, so the result is decided by comparing Suffix(i+1) with Suffix(j+1). Hence Fact 1 holds. Fact 2 is proved similarly.
Property 3 can now be proved:
When h[i-1] ≤ 1, the conclusion is immediate, because h[i] ≥ 0 ≥ h[i-1] - 1.
When h[i-1] > 1, that is, height[Rank[i-1]] > 1, we know Rank[i-1] > 1, because height[1] = 0.
Let j = i-1 and k = SA[Rank[j] - 1]. Clearly Suffix(k) < Suffix(j).
We have h[i-1] = lcp(Suffix(k), Suffix(j)) > 1 and Suffix(k) < Suffix(j), so:
by Fact 2, lcp(Suffix(k+1), Suffix(i)) = h[i-1] - 1;
by Fact 1, Rank[k+1] < Rank[i], that is, Rank[k+1] ≤ Rank[i] - 1.
Therefore, by the LCP Corollary,
LCP(Rank[i] - 1, Rank[i]) ≥ LCP(Rank[k+1], Rank[i])
= lcp(Suffix(k+1), Suffix(i))
= h[i-1] - 1
Since h[i] = height[Rank[i]] = LCP(Rank[i] - 1, Rank[i]), we finally obtain h[i] ≥ h[i-1] - 1.
Using Property 3, we let i run from 1 to n and compute h[i] in turn as follows:
If Rank[i] = 1, then h[i] = 0. The number of character comparisons is 0.
If i = 1 or h[i-1] ≤ 1, compare Suffix(i) with Suffix(SA[Rank[i] - 1]) from the first character until a mismatch is found, which gives h[i]. The number of character comparisons is h[i] + 1, which is no more than h[i] - h[i-1] + 2.
Otherwise i > 1, Rank[i] > 1 and h[i-1] > 1. By Property 3, Suffix(i) and Suffix(SA[Rank[i] - 1]) agree on at least their first h[i-1] - 1 characters, so the comparison can start from character h[i-1] and continue until a mismatch is found, which gives h[i]. The number of character comparisons is h[i] - h[i-1] + 2.
Let SA[1] = p. Position p costs no comparisons, and the bounds h[i] - h[i-1] + 2 for the remaining positions telescope, so the total number of character comparisons is linear in n. That is, the whole algorithm runs in O(n) time.
Once the h array is known, the relation height[i] = h[SA[i]] gives the height array in O(n) time. The height array can therefore be computed in O(n) time overall.
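The procedure above can be sketched compactly. The version below is an illustration under a few assumptions: Python, 0-based positions and ranks, and a single running variable h that carries h[i-1] from one iteration to the next and is written straight into height[Rank[i]], instead of materializing the h array first. The function name is hypothetical.

```python
def height_array(s, sa):
    """height[r] = LCP of the suffixes ranked r-1 and r; height[0] = 0."""
    n = len(s)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    height = [0] * n
    h = 0                                  # value of h[i-1] from the last round
    for i in range(n):
        if rank[i] == 0:                   # smallest suffix: no predecessor
            h = 0
            continue
        j = sa[rank[i] - 1]                # start of the suffix ranked just before
        if h > 0:
            h -= 1                         # Property 3: h[i] >= h[i-1] - 1
        # Extend the match from where the guaranteed common prefix ends.
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        height[rank[i]] = h
    return height


s = "banana$"
sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive SA, for brevity
print(height_array(s, sa))                         # [0, 0, 1, 3, 0, 0, 2]
```

Because h drops by at most 1 per iteration and never exceeds n, the inner loop performs O(n) comparisons over the whole run.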
Combined with the RMQ method, after preprocessing in O(n) time and space, LCP(i, j) can be computed for any (i, j) in constant time.
Because lcp(Suffix(i), Suffix(j)) = LCP(Rank[i], Rank[j]), we can obtain the longest common prefix of any two suffixes of S in constant time. This is one of the main reasons why the suffix array handles so many string problems efficiently.
Applications of suffix arrays
The following two examples illustrate how suffix arrays are used.
Example 1: pattern matching with multiple patterns
Given a fixed text S of length n, patterns P of length m are then supplied one at a time, and for each we must return a match of P in S or report that there is none. A match is a position i with 1 ≤ i ≤ n-m+1 such that S[i..(i+m-1)] = P, that is, Suffix(i) =_m P.
With a single pattern, the best algorithm is KMP, with time complexity O(n+m). With many patterns, we should preprocess S so that each pattern takes less time to match. The simplest preprocessing is to build the suffix array of S (after appending '$' to S). Each search then becomes a binary search in SA for the suffix with the longest common prefix with P, followed by a check whether that longest common prefix has length m. Comparing P with one suffix costs O(m), because up to m characters may be compared in the worst case, and binary search makes O(log n) such comparisons, so the total is O(m log n). The cost of each match thus drops from O(n+m) to O(m log n), a large improvement. But we are still not satisfied. As mentioned above, LCP adds to the power of the suffix array, so let us try to apply it to this problem.
Consider the original binary search algorithm, which consists of the following steps:
Step 1: set left = 1, right = n, max_match = 0.
Step 2: set mid = (left + right) / 2 (where "/" denotes integer division).
Step 3: compare Suffix(SA[mid]) with P character by character to find the length r of their longest common prefix and determine their order. If r > max_match, set max_match = r and ans = mid.
Step 4: if Suffix(SA[mid]) < P, set left = mid + 1; if Suffix(SA[mid]) > P, set right = mid - 1; if Suffix(SA[mid]) = P, go to Step 6.
Step 5: if left ≤ right, go to Step 2; otherwise go to Step 6.
Step 6: if max_match = m, output ans; otherwise output "no match".
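The plain O(m log n) search can be sketched as below. This is a simplified variant of the steps above rather than a literal transcription: instead of tracking max_match and ans, it compares P against the m-prefix of each probed suffix and stops as soon as they are equal. It assumes Python and 0-based positions, and the function name is hypothetical.

```python
def find_pattern(s, sa, p):
    """Return a 0-based position where p occurs in s, or -1 if there is none."""
    m = len(p)
    left, right = 0, len(sa) - 1
    while left <= right:
        mid = (left + right) // 2
        start = sa[mid]
        window = s[start:start + m]        # m-prefix of Suffix(SA[mid])
        if window == p:
            return start
        if window < p:
            left = mid + 1                 # the match, if any, is further right
        else:
            right = mid - 1                # the match, if any, is further left
    return -1


s = "banana$"
sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive SA, for brevity
print(find_pattern(s, sa, "nan"))  # 2
print(find_pattern(s, sa, "nab"))  # -1
```

The search is valid because the m-prefixes of the suffixes appear in non-decreasing order along SA. Each probe still costs up to O(m) character comparisons, which is exactly the cost that the LCP-based refinement aims to reduce.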
Attention quickly focuses on Step 3. If we can avoid comparing Suffix(SA[mid]) and P from the very first character each time, the complexity may drop further. As with the height array earlier, we use previously obtained longest common prefixes as a "head start" for the comparison, avoiding redundant character comparisons.
Before comparing Suffix(SA[mid]) with P, we compute LCP(mid, ans) in constant time and then compare LCP(mid, ans) with max_match. Scenario 1: LCP(mid, ans) [...]
[...] k + 1, T[i-r'..i-1] and T[i+1..i+r'] are not mirror images of each other about T[i], so r can be at most k. The process of increasing r is called expanding to both sides. Each successful expansion lengthens the odd-length palindromic substring centered on T[i] by 2, and the maximum value that r reaches determines the length, 2r + 1, of the longest odd-length palindromic substring centered on T[i]. Let len(T) = m. If the maximum expansion is found by comparing corresponding characters one by one, up to m - 1 characters may be compared. Since every position must be tried as a center, the overall complexity can reach O(m^2) in the worst case, which is not ideal.
We now optimize the core part of the algorithm: computing the maximum expansion to both sides around a single center position.
Append a special character '#' to the end of T, specified to be different from every character of T. Then reverse T to obtain T' and append it after '#'. Finally append the special character '$' after T', specified to be smaller than every preceding character. Call the concatenated string S. Every character of T has an identical counterpart in T'. In terms of S, S[1..m] is T and S[m+2..2m+1] is T', and for each S[i] (1 ≤ i ≤ m) the character symmetric to it about '#' is S[2m-i+2]. Likewise, for any substring S[i..j] (1 ≤ i ≤ j ≤ m) of T there is a substring that mirrors it about '#', namely S[2m-j+2..2m-i+2].
Now take a position S[i] of T as the center. If the expansion to both sides reaches i-r and i+r, then S[i-r..i-1] and S[i+1..i+r] are mirror images of each other. S[i] has a symmetric character S[2m-i+2] in T'; let i' = 2m-i+2. Then S[i-r..i-1] also has a mirror-image substring S[i'+1..i'+r] in T'.
[Figure: S = banana#ananab$, with T = banana and T' = ananab, marking position i in T and its mirror position i' = 2m-i+2 in T'.]
So S[i+1..i+r] and S[i'+1..i'+r] are both mirror images of S[i-r..i-1], and hence S[i+1..i+r] = S[i'+1..i'+r]. Because S[i] = S[i'], we have S[i..i+r] = S[i'..i'+r], that is, Suffix(i) =_{r+1} Suffix(i'). It follows that r = lcp(Suffix(i), Suffix(i')) - 1.
This reasoning still has a gap: it shows only that lcp(Suffix(i), Suffix(i')) - 1 is an upper bound on r, not that it is the maximum value of r. We must also show that, given the longest common prefix of Suffix(i) and Suffix(i'), a corresponding palindrome centered on i can be found in T. The proof is similar to the reasoning above. The only point to note is that the special character '#' is what prevents lcp(Suffix(i), Suffix(i')) from exceeding what the actual maximum r allows. The proof is left to the reader.
In short, finding the maximum expansion from T[i] to both sides is equivalent to computing lcp(Suffix(i), Suffix(i')), which takes constant time using the suffix array and LCP machinery described earlier, provided we first spend O(n log n) time building the suffix array and the height array and doing the RMQ preprocessing. Here n = len(S) = 2m + 2.
The length of the longest palindromic substring centered on each T[i] can now be found in constant time. We enumerate i from 1 to m, find the longest palindrome for each center in turn, and record the maximum length, which is the required length of the longest odd-length palindromic substring. Since each center takes constant time, this step costs O(m) in total. The whole algorithm therefore runs in O(n log n + m) = O(2m log(2m) + m) = O(mlogm), much better than the earlier quadratic algorithm.
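Putting the pieces together, the whole method can be sketched end to end. The code below is an illustration under stated assumptions, not a tuned implementation: it uses Python with 0-based positions (so the mirror of position i is 2m - i rather than 2m-i+2), builds the suffix array by plain sorting for brevity, and answers LCP queries with a sparse table over the height array. It also assumes that '#' and '$' do not occur in T. The function name is hypothetical.

```python
def longest_odd_palindrome(t):
    """Length of the longest odd-length palindromic substring of t.

    Assumes '#' and '$' do not occur in t.
    """
    m = len(t)
    if m == 0:
        return 0
    s = t + "#" + t[::-1] + "$"
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])     # naive SA, for brevity
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    # height[r] = LCP of the suffixes ranked r-1 and r, computed in O(n).
    height = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        if h > 0:
            h -= 1
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        height[rank[i]] = h

    # Sparse table over height for O(1) range-minimum queries.
    table = [height]
    level = 1
    while (1 << level) <= n:
        prev, half = table[-1], 1 << (level - 1)
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << level) + 1)])
        level += 1

    def lcp(a, b):
        """LCP of the suffixes starting at distinct positions a and b."""
        lo, hi = sorted((rank[a], rank[b]))
        lo += 1                                     # min over height[lo+1 .. hi]
        k = (hi - lo + 1).bit_length() - 1
        return min(table[k][lo], table[k][hi - (1 << k) + 1])

    best = 0
    for i in range(m):
        mirror = 2 * m - i                          # position of t[i] inside T'
        best = max(best, 2 * lcp(i, mirror) - 1)    # r = lcp - 1, length 2r + 1
    return best


print(longest_odd_palindrome("banana"))  # 5, from "anana"
```

Swapping the naive sort for the doubling construction gives the O(mlogm) bound stated above; the per-center work is already constant.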
Comparison between suffix arrays and suffix trees
After the two examples above, the reader should have some appreciation of the power of suffix arrays. Another data structure, the suffix tree, can also be used for these problems. How do the suffix array and the suffix tree differ, and how are they related? Let us compare them.
First, the suffix array is easier to understand and to program than the suffix tree. It needs no pointer manipulation, so it is also easier to debug.
Second, the suffix array occupies less space than the suffix tree. Space complexity was not mentioned in the analysis above, so here is a brief account. The suffix array SA and the rank array Rank each need n integers, and the step that derives SA_{2k} from Rank_k uses two auxiliary one-dimensional arrays of n integers each, reused from round to round. The whole algorithm needs only these four one-dimensional arrays and a constant number of auxiliary variables, so the total space is 4n integers. A suffix tree typically has close to 2n nodes. Each node usually stores two integers (at least one, even with some tricks) and needs two pointers (assuming the child-sibling representation), so the total space is roughly 4n pointers and 2n integers (or at least n integers). Other representations of the tree need even more space. The suffix array is thus clearly smaller than the suffix tree.
Finally, compare their time complexity.
First, character sets Σ are divided into three types according to their size |Σ|:
If |Σ| is a constant, Σ is called a constant alphabet.
If |Σ| is bounded by a polynomial in the length n of S, Σ is called an integer alphabet.
If |Σ| has no size restriction, Σ is called a general alphabet.
Clearly, a constant alphabet is a kind of integer alphabet, and an integer alphabet is a kind of general alphabet. The complexity of constructing a suffix array does not depend on the character set, because the algorithm works directly on a general alphabet. For the usual methods of constructing a suffix tree, the child-sibling representation gives a time complexity of O(n * |Σ|), which is clearly inefficient for integer and general alphabets and is unsuitable even for a constant alphabet when |Σ| is large. The remedy is to store the child pointers in a balanced binary tree, which brings the complexity to O(n * log|Σ|). So the suffix tree has a speed advantage over the suffix array in some cases, but not a pronounced one. For strings with small |Σ|, however, the speed advantage of the suffix tree is considerable, especially for the common 0-1 strings.
A suffix array can be viewed as the leaves of a suffix tree read from left to right and placed in an array, so what a suffix array can do never goes beyond what a suffix tree can do. One could even say that without LCP the suffix array has a very narrow range of applications. Combined with the LCP function, however, the suffix array is very powerful and can accomplish most of the tasks a suffix tree can, because the LCP function in effect gives the nearest common ancestor of any two leaves. Interested readers can explore this on their own.
The suffix tree and the suffix array are both excellent data structures for string processing, and neither can be called strictly better than the other. We should choose flexibly whichever suits the situation at hand. Algorithms and data structures are inert; the people who use them are the real protagonists. Mastering the classical algorithms and data structures and applying them appropriately to bring out their full strength is the greatest wisdom in informatics research and competition, and also the charm of the competition.