"Goto" suffix array

Source: Internet
Author: User

Transferred from: http://www.cppblog.com/superKiki/archive/2010/05/15/115421.html

The implementation of the suffix array

This section introduces two implementations of the suffix array, the doubling algorithm and the DC3 (Difference Cover) algorithm, and compares the two. Some readers may find these algorithms hard to understand, and their implementations harder still. This section therefore gives concise and efficient code after introducing each algorithm: the doubling algorithm takes only 25 lines, and the DC3 algorithm only 40 lines.

1.1. Basic definition

  Substring: the substring r[i..j], i ≤ j, of a string r is the part of r from position i to position j, that is, the sequence r[i], r[i+1], ..., r[j].

  Suffix: a suffix is a special substring that starts at some position i and runs to the end of the whole string. The suffix of the string r starting at the i-th character is written suffix(i), that is, suffix(i) = r[i..len(r)].

  Size comparison: strings are compared in the usual "dictionary order". For two strings u and v, let i start from 1 and compare u[i] with v[i]. If u[i] = v[i], increase i by 1. Otherwise, if u[i] < v[i] then u < v, and if u[i] > v[i] then u > v (that is, v < u), and the comparison ends. If i > len(u) or i > len(v) is reached without a result, then u < v if len(u) < len(v), u = v if len(u) = len(v), and u > v if len(u) > len(v).
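The comparison rule above can be sketched as a small function. This is only an illustration: the name `compare_strings` is hypothetical, and the code uses Python's 0-based indexing instead of starting i at 1.

```python
def compare_strings(u, v):
    """Return -1 if u < v, 0 if u == v, 1 if u > v (dictionary order)."""
    # Walk both strings in step and stop at the first differing character.
    for a, b in zip(u, v):
        if a != b:
            return -1 if a < b else 1
    # One string ran out without a difference: the shorter one is smaller.
    return (len(u) > len(v)) - (len(u) < len(v))


print(compare_strings("aab", "aabaaaab"))  # "aab" is a proper prefix, so it is smaller
```

Python's built-in `<` on strings follows the same rule, so this function is only useful for seeing the definition spelled out.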

It follows from this definition that two suffixes u and v of s that start at different positions can never compare equal, because the necessary condition for u = v, namely len(u) = len(v), cannot be satisfied.

  Suffix array: the suffix array sa is a one-dimensional array holding a permutation sa[1], sa[2], ..., sa[n] of 1..n such that suffix(sa[i]) < suffix(sa[i+1]) for 1 ≤ i < n. In other words, the n suffixes of s are sorted from smallest to largest, and their starting positions are stored in sa in that order.

  Rank array: rank[i] is the rank of suffix(i) among all suffixes sorted from smallest to largest.

Simply put, the suffix array answers "who is ranked i-th?" and the rank array answers "what is your rank?". It is easy to see that the suffix array and the rank array are inverses of each other.

Let the length of the string be n. To make comparisons easier, a character that does not appear in the string and is smaller than every character in it can be appended to the end. Once the rank array is known, any two suffixes can be compared in O(1) time. Once either the suffix array or the rank array is known, the other can be computed in O(n) time. Comparing two suffixes directly takes at most n character comparisons, that is, the comparison is decided by the n-th character at the latest.
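To make these definitions concrete, here is a minimal sketch that builds the suffix array by sorting the suffixes directly, inverts it into the rank array in O(n), and then compares two suffixes in O(1). The function names are hypothetical, positions are 0-based (the definitions above count from 1), and the direct sort is far slower than the algorithms described in the following sections.

```python
def naive_suffix_array(s):
    """Sort the starting positions by comparing whole suffixes (slow, for illustration)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_from_sa(sa):
    """Invert the suffix array in O(n): rank[sa[k]] = k."""
    rank = [0] * len(sa)
    for order, start in enumerate(sa):
        rank[start] = order
    return rank


def suffix_less(rank, i, j):
    """Compare suffix(i) and suffix(j) in O(1) using the rank array."""
    return rank[i] < rank[j]


s = "aabaaaab"
sa = naive_suffix_array(s)
rank = rank_from_sa(sa)
print(sa)    # [3, 4, 5, 0, 6, 1, 7, 2]
print(rank)  # [3, 5, 7, 0, 1, 2, 4, 6]
print(suffix_less(rank, 3, 0))  # True: "aaaab" < "aabaaaab"
```

Adding 1 to every entry gives the 1-based arrays used in the text: sa = 4, 5, 6, 1, 7, 2, 8, 3 and rank = 4, 6, 8, 1, 2, 3, 5, 7.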

1.2. Doubling algorithm

The main idea of the doubling algorithm is to sort, by repeated doubling, the substrings of length 2^k that start at each character, and to compute their rank values. k starts at 0 and increases by 1 each round. Once 2^k exceeds n, the substrings of length 2^k starting at each character are equivalent to the suffixes themselves. By then all of these substrings have been told apart, that is, no two rank values are equal, so the rank values at that point are the final result. Each round reuses the rank values of the substrings of length 2^(k-1) from the previous round: a substring of length 2^k is represented by the ranks of its two halves of length 2^(k-1), used as a two-part key, and a radix sort on these keys yields the rank values of the substrings of length 2^k. Take the string "aabaaaab" as an example. The whole procedure is shown in Figure 2, where x and y are the two keys representing a substring of length 2^k.
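The doubling idea can be sketched as follows. This is not the 25-line implementation the text refers to: it is an illustrative version that uses Python's built-in sort on the (x, y) key pairs in place of a radix sort, so it runs in O(n log^2 n) rather than O(n log n). The function name and the use of -1 for "past the end of the string" are assumptions of this sketch, and positions are 0-based.

```python
def suffix_array_doubling(s):
    """Return (sa, rank) for s, built by prefix doubling with a comparison sort."""
    n = len(s)
    if n == 0:
        return [], []
    # Round k = 0: rank each substring of length 1 by its character code.
    rank = [ord(c) for c in s]
    sa = list(range(n))
    length = 1  # current ranks describe substrings of this length
    while True:
        def key(i):
            # x = rank of the first half, y = rank of the second half.
            # A second half that starts past the end sorts before everything.
            return (rank[i], rank[i + length] if i + length < n else -1)

        sa.sort(key=key)
        # Re-rank: equal keys share a rank, each new key gets the next rank.
        new_rank = [0] * n
        for idx in range(1, n):
            prev, cur = sa[idx - 1], sa[idx]
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: sorting is finished
            break
        length *= 2
    return sa, rank


sa, rank = suffix_array_doubling("aabaaaab")
print(sa)    # [3, 4, 5, 0, 6, 1, 7, 2]
print(rank)  # [3, 5, 7, 0, 1, 2, 4, 6]
```

Replacing `sa.sort(key=key)` with a two-pass radix sort over y and then x is what brings each round down to O(n).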

1.3. DC3 algorithm

The DC3 algorithm is divided into 3 steps:

  (1) Split the suffixes into two parts and sort the first part.

The suffixes are divided into two parts: the first part consists of the suffixes suffix(k) with k modulo 3 not equal to 0, and the second part of the suffixes suffix(k) with k modulo 3 equal to 0. First, all suffixes whose starting position modulo 3 is not 0 are sorted, that is, suffix(1), suffix(2), suffix(4), suffix(5), suffix(7), ... To do this, concatenate suffix(1) and suffix(2); if the length of either suffix is not a multiple of 3, pad its end with 0s until it is. Then group the characters three at a time, radix sort the groups, and "merge" each group of three characters into a single new character. The suffix array of this new string is then computed recursively, as shown in Figure 3. Once the sa of the new string is known, the sa of all suffixes of the original string whose starting position modulo 3 is not 0 can be computed from it. Note that the original string must end with a character that is smaller than all others and does not appear earlier in the string for the result to be correct (the reader is invited to think about why).

(2) Use the result of (1) to sort the second part of the suffixes.

The remaining suffixes are those whose starting position modulo 3 equals 0. Each of them can be viewed as one character followed by a suffix whose rank was already computed in (1), so a single radix sort is enough to sort them.
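Step (2) can be sketched as sorting on a two-part key: the first character, then the rank from step (1) of the suffix that follows it. The function name, the dictionary `rank12`, and the example data are assumptions of this sketch, and `sorted` stands in for the radix sort. Positions are 0-based.

```python
def sort_mod0_suffixes(r, rank12):
    """Sort the suffixes starting at positions i with i % 3 == 0.

    rank12 maps each position p with p % 3 != 0 to the rank of suffix(p)
    among the suffixes sorted in step (1).
    """
    def key(i):
        # suffix(i) = r[i] + suffix(i+1); a missing suffix(i+1) sorts first.
        return (r[i], rank12.get(i + 1, -1))

    return sorted(range(0, len(r), 3), key=key)


# "aabaaaab" encoded as a = 1, b = 2, followed by the terminator 0.
r = [1, 1, 2, 1, 1, 1, 1, 2, 0]
# Ranks of the suffixes at positions 8, 4, 5, 1, 7, 2, as step (1) would produce for r.
rank12 = {8: 0, 4: 1, 5: 2, 1: 3, 7: 4, 2: 5}
print(sort_mod0_suffixes(r, rank12))  # [3, 0, 6]
```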

  (3) Merge the results of (1) and (2) to complete the sorting of all suffixes.

This merge is the same as the merge step of merge sort: each step compares two suffixes. There are two cases. The first case is the comparison of suffix(3*i) and suffix(3*j+1), which can be written as:

suffix(3*i) = r[3*i] + suffix(3*i+1)
suffix(3*j+1) = r[3*j+1] + suffix(3*j+2)

Since suffix(3*i+1) and suffix(3*j+2) both start at positions that are not multiples of 3, their comparison can be obtained quickly from the result of (1). The second case is the comparison of suffix(3*i) and suffix(3*j+2), which can be written as:

suffix(3*i) = r[3*i] + r[3*i+1] + suffix(3*i+2)
suffix(3*j+2) = r[3*j+2] + r[3*j+3] + suffix(3*(j+1)+1)

By the same token, the comparison of suffix(3*i+2) and suffix(3*(j+1)+1) can be obtained quickly from the result of (1). So every comparison can be done efficiently, which is also why the characters were grouped three at a time earlier rather than two at a time.
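The two comparison cases can be sketched as a merge routine. This is an illustration under stated assumptions rather than the 40-line implementation mentioned earlier: the function name `merge_dc3`, the dictionary `rank12`, the value -1 for anything past the end of the string, and the example data are all hypothetical. Positions are 0-based.

```python
def merge_dc3(r, sa0, sa12, rank12):
    """Merge the sorted i % 3 == 0 suffixes (sa0) with the sorted i % 3 != 0 suffixes (sa12)."""
    n = len(r)

    def ch(i):
        return r[i] if i < n else -1   # past the end sorts before everything

    def rk(i):
        return rank12.get(i, -1)       # rank among the i % 3 != 0 suffixes

    def mod0_is_smaller(i, j):
        if j % 3 == 1:
            # Case 1: one character, then two suffixes already ranked in step (1).
            return (ch(i), rk(i + 1)) < (ch(j), rk(j + 1))
        # Case 2: two characters, then two suffixes already ranked in step (1).
        return (ch(i), ch(i + 1), rk(i + 2)) < (ch(j), ch(j + 1), rk(j + 2))

    merged, a, b = [], 0, 0
    while a < len(sa0) and b < len(sa12):
        if mod0_is_smaller(sa0[a], sa12[b]):
            merged.append(sa0[a])
            a += 1
        else:
            merged.append(sa12[b])
            b += 1
    return merged + sa0[a:] + sa12[b:]


# "aabaaaab" encoded as a = 1, b = 2, followed by the terminator 0.
r = [1, 1, 2, 1, 1, 1, 1, 2, 0]
sa0 = [3, 0, 6]                 # result of step (2) for r
sa12 = [8, 4, 5, 1, 7, 2]       # result of step (1) for r
rank12 = {p: k for k, p in enumerate(sa12)}
print(merge_dc3(r, sa0, sa12, rank12))  # [8, 3, 4, 5, 0, 6, 1, 7, 2]
```

Each comparison looks at no more than two characters and one pair of ranks, so the whole merge is linear in the length of the string.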