Suffix array, prefix Array

Source: Internet
Author: User

Suffix Array

A suffix array is an array of all suffixes of a text string, sorted in lexicographic order from smallest to largest. For details, see Liu Rujia's Algorithm Competition Training Guide.
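To make the definition concrete, here is a minimal sketch that builds a suffix array by sorting the suffix start positions directly. The function name `naive_suffix_array` is only illustrative, and this is not the doubling algorithm given later in the article: it is far slower and is meant only to show what the array contains.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Returns sa, where sa[i] is the start index of the i-th smallest suffix.
// Naive comparison sort, for illustration only (hypothetical helper).
std::vector<int> naive_suffix_array(const std::string& text) {
    assert(!text.empty());
    std::vector<int> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&text](int a, int b) {
        // Compare suffix a with suffix b lexicographically.
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}
```

For the text "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so the result is 5 3 1 0 4 2.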

The Aho-Corasick (AC) automaton can match a text against multiple patterns, and so can the suffix array. So what is the difference between them?

The AC automaton needs all patterns in advance; the text (which may arrive online) is then matched against them. That is, the patterns must be fully known beforehand, while the text to be matched can be supplied dynamically.

The suffix array instead needs the entire text in advance, while the patterns can be supplied dynamically, one at a time. In practice the patterns to be queried are often unknown ahead of time (as in a search engine). To look up a phrase (the pattern), preprocess the text once to compute its suffix array, then binary-search the suffix array with the phrase (possible because all suffixes are sorted in lexicographic order). This locates an occurrence of the pattern in O(m log n) time, where n is the text length and m is the pattern length; an O(m + log n) algorithm is introduced later. (Running KMP for each query costs O(n + m), which is too expensive when the text length n is much greater than the pattern length m.)
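The O(m log n) lookup can be sketched as follows. This is a minimal illustration, assuming the suffix array `sa` of `text` has already been computed and holds 0-based start positions without a terminator; the function name `find_occurrence` is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the start index of one occurrence of pattern in text, or -1.
// sa must be the suffix array of text (sa[i] = start of the i-th smallest suffix).
int find_occurrence(const std::string& text, const std::vector<int>& sa,
                    const std::string& pattern) {
    assert(sa.size() == text.size());
    int lo = 0, hi = static_cast<int>(sa.size()) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        // Compare the first |pattern| characters of suffix sa[mid] with
        // pattern: O(m) per probe, O(log n) probes.
        int cmp = text.compare(sa[mid], pattern.size(), pattern);
        if (cmp == 0) return sa[mid];
        if (cmp < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}
```

With text "banana" and sa = 5 3 1 0 4 2, searching for "nan" returns 2 and searching for "xyz" returns -1.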

The code for the suffix array is as follows:

Suffix array (commented version):

    #include <cstdio>
    #include <cstring>
    #include <algorithm>
    using namespace std;

    const int maxn = 20000 + 1000;

    struct SuffixArray {
        // The original string followed by '\0': the original characters occupy
        // s[0..n-2], and s[n-1] is the manually appended '\0'.
        char s[maxn];
        // Suffix array: sa[i] = j means the suffix with lexicographic rank i is
        // suffix j, where both i and j range over [0, n-1].
        int sa[maxn];
        // Rank array: rank[i] = j means suffix i has lexicographic rank j.
        int rank[maxn];
        int height[maxn];
        // Scratch arrays backing x and y.
        int t1[maxn], t2[maxn];
        // Counting-sort buckets: c[i] = j means j keys are <= i.
        int c[maxn];
        // Length of the string including the appended '\0', so n >= 2 in
        // general. Do not use n = 1.
        int n;

        // m must be greater than the int value of every character in s[].
        void build_sa(int m) {
            int i, *x = t1, *y = t2;
            // Sort the suffixes by their first character only, producing x[]
            // and sa[]. Here x[i] is the key of character i (a rank that may
            // contain ties): x[1] == x[3] == 2 means characters 1 and 3 are
            // identical. sa[] never contains ties: even then sa[1] = 1 and
            // sa[2] = 3 are possible, i.e. equal characters still occupy
            // distinct positions in the order.
            for (i = 0; i < m; i++) c[i] = 0;
            for (i = 0; i < n; i++) c[x[i] = s[i]]++;
            // Now turn c[i] into the number of keys <= i.
            for (i = 1; i < m; i++) c[i] += c[i - 1];
            // Order of the suffixes by their length-1 prefix.
            for (i = n - 1; i >= 0; i--) sa[--c[x[i]]] = i;

            // Loop invariant: x[] ranks each suffix by its first k characters
            // (characters [0, k-1]) and sa[] orders the suffixes by that same
            // prefix. From these we derive y[], the order by characters
            // [k, 2k-1]; combining x[] (first key) with y[] (second key) gives
            // sa[] for prefixes of length 2k, and from that sa[] and the old
            // ranks we get x[] for length 2k. x[] may contain ties, sa[] never
            // does: x[1] == x[4] == 2 means the substrings starting at 1 and
            // at 4 share their first k characters and both have rank 2 (the
            // best rank is 0). Once x[] holds n distinct values, all suffixes
            // are fully sorted.
            for (int k = 1; k <= n; k <<= 1) {
                // Build y: y[p] = i means suffix i is p-th by the second key,
                // i.e. by characters [k, 2k-1].
                int p = 0;
                // Suffixes n-k .. n-1 have no character at offset k, so their
                // (empty) second key sorts first.
                for (i = n - k; i < n; i++) y[p++] = i;
                // For the others, the second-key order of suffix j is the
                // first-key order of suffix j+k, which sa[] already provides.
                for (i = 0; i < n; i++) if (sa[i] >= k) y[p++] = sa[i] - k;
                // Counting sort by the first key x[], visiting suffixes in
                // second-key order y[], yields sa[] for prefixes of length 2k.
                for (i = 0; i < m; i++) c[i] = 0;
                for (i = 0; i < n; i++) c[x[y[i]]]++;
                for (i = 1; i < m; i++) c[i] += c[i - 1];
                for (i = n - 1; i >= 0; i--) sa[--c[x[y[i]]]] = y[i];
                // Swap so that y holds the old ranks and x receives the ranks
                // for prefixes of length 2k.
                swap(x, y);
                // p counts the distinct values in the new x[].
                p = 1;
                x[sa[0]] = 0;
                // If y[sa[i]] == y[sa[i-1]], both suffixes share their first k
                // characters; the unique trailing '\0' then guarantees
                // sa[i] + k <= n-1, so the second key exists and the index
                // below stays in range.
                for (i = 1; i < n; i++)
                    x[sa[i]] = y[sa[i - 1]] == y[sa[i]] && y[sa[i - 1] + k] == y[sa[i] + k] ? p - 1 : p++;
                if (p >= n) break;
                m = p;
            }
        }

        // See Liu Rujia's Training Guide, p. 222, for details.
        // height[i] is the length of the longest common prefix (LCP) of
        // suffix sa[i-1] and suffix sa[i], i.e. of the suffixes ranked i-1
        // and i, so only indices [1, n-1] of height[] are valid.
        void build_height() {
            int i, j, k = 0;
            for (i = 0; i < n; i++) rank[sa[i]] = i;
            for (i = 0; i < n; i++) {
                // The rank-0 suffix has no predecessor; skip it so that
                // sa[-1] is never read.
                if (rank[i] == 0) { k = 0; continue; }
                if (k) k--;
                j = sa[rank[i] - 1];
                while (s[i + k] == s[j + k]) k++;
                height[rank[i]] = k;
            }
        }
    } sa;
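As a cross-check on what the height array means, the following sketch computes it straight from the definition by comparing neighbouring suffixes character by character. It is quadratic in the worst case and is not the linear-time build_height() routine above. The name `height_by_definition` is hypothetical, and the sketch assumes a 0-based suffix array of a text without the appended '\0', so its ranks are shifted by one relative to the struct.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// height[i] = length of the longest common prefix of the suffixes ranked
// i-1 and i; height[0] has no predecessor and is left at 0.
std::vector<int> height_by_definition(const std::string& text,
                                      const std::vector<int>& sa) {
    assert(sa.size() == text.size());
    std::vector<int> height(sa.size(), 0);
    for (std::size_t i = 1; i < sa.size(); ++i) {
        std::size_t a = sa[i - 1], b = sa[i];
        std::size_t k = 0;
        // Extend the match while both suffixes still have characters left.
        while (a + k < text.size() && b + k < text.size() &&
               text[a + k] == text[b + k]) {
            ++k;
        }
        height[i] = static_cast<int>(k);
    }
    return height;
}
```

For "banana" with sa = 5 3 1 0 4 2 this gives 0 1 3 0 0 2: for example, the suffixes ranked 1 and 2 are "ana" and "anana", which share 3 characters.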

Suffix array RMQ version:

    #include <cstdio>
    #include <cstring>
    #include <algorithm>
    using namespace std;

    const int maxn = 1000000 + 100;

    struct SuffixArray {
        char s[maxn];
        int sa[maxn], rank[maxn], height[maxn];
        int t1[maxn], t2[maxn], c[maxn], n;
        int dmin[maxn][20];

        void build_sa(int m) {
            int i, *x = t1, *y = t2;
            for (i = 0; i < m; i++) c[i] = 0;
            for (i = 0; i < n; i++) c[x[i] = s[i]]++;
            for (i = 1; i < m; i++) c[i] += c[i - 1];
            for (i = n - 1; i >= 0; i--) sa[--c[x[i]]] = i;
            for (int k = 1; k <= n; k <<= 1) {
                int p = 0;
                for (i = n - k; i < n; i++) y[p++] = i;
                for (i = 0; i < n; i++) if (sa[i] >= k) y[p++] = sa[i] - k;
                for (i = 0; i < m; i++) c[i] = 0;
                for (i = 0; i < n; i++) c[x[y[i]]]++;
                for (i = 1; i < m; i++) c[i] += c[i - 1];
                for (i = n - 1; i >= 0; i--) sa[--c[x[y[i]]]] = y[i];
                swap(x, y);
                p = 1, x[sa[0]] = 0;
                for (i = 1; i < n; i++)
                    x[sa[i]] = y[sa[i - 1]] == y[sa[i]] && y[sa[i - 1] + k] == y[sa[i] + k] ? p - 1 : p++;
                if (p >= n) break;
                m = p;
            }
        }

        void build_height() {
            int i, j, k = 0;
            for (i = 0; i < n; i++) rank[sa[i]] = i;
            for (i = 0; i < n; i++) {
                // The rank-0 suffix has no predecessor; skip it so that
                // sa[-1] is never read.
                if (rank[i] == 0) { k = 0; continue; }
                if (k) k--;
                j = sa[rank[i] - 1];
                while (s[i + k] == s[j + k]) k++;
                height[rank[i]] = k;
            }
        }

        // Sparse table over height[]: dmin[i][j] = min of height[i .. i + 2^j - 1].
        void initMin() {
            for (int i = 1; i <= n; i++) dmin[i][0] = height[i];
            for (int j = 1; (1 << j) <= n; j++)
                for (int i = 1; i + (1 << j) - 1 <= n; i++)
                    dmin[i][j] = min(dmin[i][j - 1], dmin[i + (1 << (j - 1))][j - 1]);
        }

        // Minimum of height[L..R].
        int RMQ(int L, int R) {
            int k = 0;
            while ((1 << (k + 1)) <= R - L + 1) k++;
            return min(dmin[L][k], dmin[R - (1 << k) + 1][k]);
        }

        // Longest common prefix of suffix i and suffix j (requires i != j).
        int LCP(int i, int j) {
            int L = rank[i], R = rank[j];
            if (L > R) swap(L, R);
            L++;  // note: height[L] pairs rank L-1 with rank L
            return RMQ(L, R);
        }
    } sa;
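The range-minimum part can be looked at in isolation. The sketch below is a small vector-based sparse table, assuming 0-based indices and inclusive query bounds; the type name `SparseTableMin` is hypothetical and the layout differs from the fixed-size dmin[][] array above, but the doubling idea is the same.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct SparseTableMin {
    // table[j][i] = minimum of values[i .. i + 2^j - 1].
    std::vector<std::vector<int>> table;

    explicit SparseTableMin(const std::vector<int>& values) {
        assert(!values.empty());
        const int n = static_cast<int>(values.size());
        table.push_back(values);
        for (int j = 1; (1 << j) <= n; ++j) {
            std::vector<int> row(n - (1 << j) + 1);
            for (int i = 0; i + (1 << j) <= n; ++i) {
                // A block of length 2^j is two adjacent blocks of length 2^(j-1).
                row[i] = std::min(table[j - 1][i], table[j - 1][i + (1 << (j - 1))]);
            }
            table.push_back(std::move(row));
        }
    }

    // Minimum of values[l..r], inclusive; requires l <= r.
    int query(int l, int r) const {
        assert(l <= r);
        int k = 0;
        while ((1 << (k + 1)) <= r - l + 1) ++k;
        // Two overlapping blocks of length 2^k cover [l, r].
        return std::min(table[k][l], table[k][r - (1 << k) + 1]);
    }
};
```

With the height values 0 1 3 0 0 2 of "banana", the LCP of the suffixes ranked 1 and 2 is query(2, 2) = 3, and the LCP of the suffixes ranked 1 and 5 is query(2, 5) = 0.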

The following describes how to use the suffix array's O(1) LCP(i, j) operation to locate a pattern string of length m in O(m + log n) time. The algorithm is still binary search at heart: compare the suffix at rank mid with the pattern T, then continue in [L, mid-1] or [mid+1, R]. However, there is no need to compare suffix mid and T from the beginning each time. We use ans to store the rank of the suffix that best matches T so far, and max_match to store the length of the longest common prefix of suffix ans and T. LCP(ans, mid) then greatly reduces the number of character comparisons between suffix mid and T. The details follow (taken from Xu Zhilei's paper "Suffix Arrays"):

Multi-pattern string matching

Given a fixed string S of length n to be matched against, and then a pattern string P of length m on each query, return one match of P in S or report that matching failed. A match is a position i with 1 ≤ i ≤ n-m+1 such that S[i..(i+m-1)] = P, that is, the length-m prefix of Suffix(i) equals P.

We know that with a single pattern string the best choice is the KMP algorithm, with time complexity O(n + m). With multiple pattern strings, however, it is worth preprocessing S so that each pattern takes less time to match. The simplest preprocessing is to build the suffix array of S (after appending '$' to S); each search then becomes a binary search in SA for the suffix that has the longest common prefix with P, followed by a check of whether that longest common prefix has length m.

Comparing P with one suffix costs O(m), because up to m characters may be compared in the worst case. The binary search makes O(log n) such comparisons, so the total cost is O(m log n). The cost of each match therefore drops from O(n + m) to O(m log n), a large improvement when n is much greater than m.

However, this still does not satisfy us. As mentioned above, LCP can increase the power of suffix arrays, so let's try to apply it to this problem.

We analyze the original binary search algorithm, which consists of the following steps:

Step 1: Set left = 1, right = n, and max_match = 0.

Step 2: Set mid = (left + right) / 2 (here "/" denotes integer division).

Step 3: Compare Suffix(SA[mid]) and P character by character from the start, find the length r of their longest common prefix, and determine their order. If r > max_match, set max_match = r and ans = mid.

Step 4: If Suffix(SA[mid]) < P, set left = mid + 1; if Suffix(SA[mid]) > P, set right = mid - 1; if Suffix(SA[mid]) = P (that is, P is a prefix of the suffix, r = m), go to Step 6.

Step 5: If left ≤ right, go to Step 2; otherwise go to Step 6.

Step 6: If max_match = m, output ans; otherwise output "no match".
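The six steps can be sketched as a single function. This is a minimal illustration, assuming 0-based ranks, a precomputed suffix array `sa` of a text without a terminator, and a loop that runs while left ≤ right so that a range holding a single rank is still examined; the name `search_steps` is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the rank ans of a suffix that has pattern as a prefix, or -1.
int search_steps(const std::string& text, const std::vector<int>& sa,
                 const std::string& pattern) {
    assert(sa.size() == text.size() && !pattern.empty());
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    int left = 0, right = n - 1, max_match = 0, ans = -1;  // Step 1
    while (left <= right) {                                // Step 5
        int mid = left + (right - left) / 2;               // Step 2
        int start = sa[mid];
        // Step 3: compare from the first character every time.
        int r = 0;
        while (r < m && start + r < n && text[start + r] == pattern[r]) ++r;
        if (r > max_match) { max_match = r; ans = mid; }
        // Step 4: a full match ends the search; otherwise the first
        // differing character (or a suffix that ran out) decides the order.
        if (r == m) break;
        bool suffix_is_smaller =
            start + r == n ||
            static_cast<unsigned char>(text[start + r]) <
                static_cast<unsigned char>(pattern[r]);
        if (suffix_is_smaller) left = mid + 1;
        else right = mid - 1;
    }
    return max_match == m ? ans : -1;                      // Step 6
}
```

With text "banana" and sa = 5 3 1 0 4 2, searching for "nan" returns rank 5 (the suffix "nana"), while "band" returns -1 after matching only three characters.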

Attention quickly focuses on Step 3: if we can avoid comparing Suffix(SA[mid]) and P from the beginning each time, the complexity may be reduced further.

As with the height array computation earlier, we use the longest common prefix already obtained as the "base" for the comparison, avoiding redundant character comparisons.

Before comparing Suffix(SA[mid]) and P, we compute LCP(mid, ans) in constant time, and then compare LCP(mid, ans) with max_match:

Case 1: LCP(mid, ans) < max_match. Then the longest common prefix of Suffix(SA[mid]) and P has length exactly LCP(mid, ans), so r = LCP(mid, ans) in Step 3 is known immediately. Comparing character r + 1 of the two strings (they cannot be equal) then decides the order of Suffix(SA[mid]) and P. The number of character comparisons in this case is 1.

Case 2: LCP(mid, ans) ≥ max_match. Then the first max_match characters of Suffix(SA[mid]) and Suffix(SA[ans]) are identical, so Suffix(SA[mid]) and P also agree on their first max_match characters. The comparison can therefore start at character max_match + 1. The final r is at least the old max_match, and the number of character comparisons is r - max_match + 1. After Step 3 executes, max_match equals r.

Let Δmax be the amount by which max_match increases when Step 3 executes. In Case 1, Δmax = 0 and the number of character comparisons is 1 = Δmax + 1; in Case 2, Δmax = r - max_match and the number of character comparisons is r - max_match + 1, which is again Δmax + 1. In summary, each execution of Step 3 performs Δmax + 1 character comparisons.

The total number of character comparisons is the sum of all Δmax values plus the number of times Step 3 executes. The sum of all Δmax values is the final value of max_match, which does not exceed len(P) = m, and Step 3 executes O(log n) times, so the total number of character comparisons is O(m + log n). The complexity of the whole algorithm is of the same order as the number of character comparisons, namely O(m + log n).
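The LCP-accelerated search can be sketched as below. To keep the example short, the constant-time LCP query is passed in as a callable `lcp_of_ranks(a, b)` returning the longest common prefix of the suffixes ranked a and b; in a full implementation this would be backed by the height array and RMQ. That parameter, the 0-based ranks, and the name `search_with_lcp` are assumptions of this sketch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Returns the rank of a suffix that has pattern as a prefix, or -1.
int search_with_lcp(const std::string& text, const std::vector<int>& sa,
                    const std::string& pattern,
                    const std::function<int(int, int)>& lcp_of_ranks) {
    assert(sa.size() == text.size() && !pattern.empty());
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    int left = 0, right = n - 1, max_match = 0, ans = -1;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        int start = sa[mid];
        int r = 0;           // characters of suffix sa[mid] known to match
        bool extend = true;  // whether further comparisons can raise r
        if (ans >= 0) {
            int common = lcp_of_ranks(mid, ans);
            if (common < max_match) { r = common; extend = false; }  // Case 1
            else r = max_match;                                      // Case 2
        }
        if (extend) {
            while (r < m && start + r < n && text[start + r] == pattern[r]) ++r;
        }
        if (r > max_match) { max_match = r; ans = mid; }
        if (r == m) break;
        // One final comparison at offset r decides the order.
        bool suffix_is_smaller =
            start + r == n ||
            static_cast<unsigned char>(text[start + r]) <
                static_cast<unsigned char>(pattern[r]);
        if (suffix_is_smaller) left = mid + 1;
        else right = mid - 1;
    }
    return max_match == m ? ans : -1;
}
```

Since max_match never decreases and every probe spends one comparison beyond the characters it adds to max_match, the sketch follows the Δmax + 1 accounting above, provided lcp_of_ranks really runs in constant time.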

At this point the problem is solved. After O(n log n) preprocessing (building the suffix array and rank array, computing the height array, and RMQ preprocessing), each pattern string of length m can be matched in O(m + log n) time.


Luo Suiqian's paper "Suffix Arrays: A Powerful Tool for String Processing" is strongly recommended for understanding the application scenarios of suffix arrays.

Suffix array applications

POJ 1743 Musical Theme (suffix array): find two identical substrings that do not overlap.

POJ 3261 Milk Patterns (suffix array): find the substring that occurs at least K times, where occurrences may overlap.

SPOJ 694 Distinct Substrings (suffix array): count the distinct substrings of a string; the string length is <= 1000.

SPOJ 705 New Distinct Substrings (suffix array): count the distinct substrings of a string; the string length is <= 50000.

URAL 1297 Palindrome (longest palindromic substring: suffix array): find the longest palindromic substring.

POJ 2406 Power Strings (suffix array): the suffix array approach times out; using KMP directly is faster.

POJ 2774 Long Message (suffix array: longest common substring): compute the length of the longest common contiguous substring of two strings.

URAL 1517 Freedom of Choice (suffix array: longest common contiguous substring): compute the length of the longest common contiguous substring of two strings.

POJ 3294 Life Forms (suffix array): find the longest common contiguous substring shared by more than half of n strings. If there are several, output them in lexicographic order.

SPOJ 220 Relevant Phrases of Annihilation (suffix array): find the maximum length of a substring that appears at least twice, without overlapping, in each string.

 

 
