"Classic data structure" suffix array

Source: Internet
Author: User

  Transferred from: http://www.acmerblog.com/suffix-array-6150.html

In string processing, the suffix tree and the suffix array are both very powerful tools. The suffix tree is the better known of the two, while the suffix array is rarely covered in domestic (Chinese-language) material. In fact, the suffix array is a clever alternative to the suffix tree: it is easier to implement, it supports many of the suffix tree's functions with only slightly worse time complexity, and it uses much less space.

The suffix array is a sorted array of all suffixes of a string. A suffix is a substring that starts at some position i and runs to the end of the whole string. The suffix of a string r starting at the i-th character is written Suffix(i), that is, Suffix(i) = r[i..len(r)].

Example:

String: "banana"

All of its suffixes, and the same suffixes sorted in dictionary order:

All suffixes        Sorted
0 banana            5 a
1 anana             3 ana
2 nana              1 anana
3 ana               0 banana
4 na                4 na
5 a                 2 nana

So the suffix array SA of "banana" is: {5, 3, 1, 0, 4, 2}

Rank array: rank[i] holds the rank of the suffix starting at position i among all sorted suffixes. It is the inverse of SA: if sa[r] = i then rank[i] = r. Simply put, the suffix array answers "which suffix is ranked r-th?", and the rank array answers "what is the rank of suffix i?".
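
As a minimal sketch of this inverse relationship, the rank array can be filled in one pass over SA. The function name buildRank is illustrative and does not come from the original article.

```cpp
#include <vector>

// Build rank[] as the inverse permutation of sa[]:
// sa[r] == i (suffix i is in sorted position r)  <=>  rank[i] == r.
std::vector<int> buildRank(const std::vector<int> &sa)
{
    std::vector<int> rank(sa.size());
    for (int r = 0; r < (int)sa.size(); r++)
        rank[sa[r]] = r;
    return rank;
}
```

For "banana", with SA = {5, 3, 1, 0, 4, 2}, this gives rank = {3, 2, 5, 1, 4, 0}.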

Construction Algorithm

There are two main algorithms for constructing the suffix array: the prefix-doubling algorithm and the DC3 algorithm. The prefix-doubling algorithm described by Xu Zhilei is the one referred to here; its complexity is O(n log n).

For the details of constructing the suffix array, see Xu Zhilei's 2004 National Training Team paper.
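
To give a feel for the prefix-doubling idea, here is a rough sketch. It is a simplification, not the implementation from the paper: it sorts with std::sort in every round, so it runs in O(n log^2 n) rather than O(n log n), which would require a radix sort. The function name buildSuffixArrayDoubling is made up for this example.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Prefix doubling with comparison sorting (O(n log^2 n)).
// In round k, rank[i] is the rank of the first k characters of suffix i;
// the pair (rank[i], rank[i + k]) then orders the first 2k characters.
std::vector<int> buildSuffixArrayDoubling(const std::string &s)
{
    int n = s.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;

    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++)
        rank[i] = (unsigned char)s[i];

    for (int k = 1;; k *= 2)
    {
        // A suffix shorter than k + 1 characters has an empty second half,
        // which sorts before everything else (-1).
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < n ? rank[a + k] : -1;
            int rb = b + k < n ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);

        // Re-rank: suffixes with equal keys share the same rank.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;

        // Stop once every suffix has a distinct rank.
        if (rank[sa[n - 1]] == n - 1) break;
    }
    return sa;
}
```

Each round doubles the number of characters that the ranks account for, so at most about log2(n) rounds are needed.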

  Only the most straightforward algorithm is given here: collect all the suffixes and then sort them. Each comparison of two suffixes can take O(n) time, so the worst case is O(n^2 log n).

  

// A simple construction algorithm for the suffix array
#include <iostream>
#include <cstring>
#include <algorithm>
#include <vector>
using namespace std;

// Represents a suffix; index is the starting position of the suffix
struct suffix
{
    int index;
    char *suff;
};

// Compare two suffixes in dictionary order
bool cmp(const suffix &a, const suffix &b)
{
    return strcmp(a.suff, b.suff) < 0;
}

// Construct the suffix array for txt (the caller must delete[] the result)
int *buildSuffixArray(char *txt, int n)
{
    // One entry per suffix; std::vector is used because
    // variable-length arrays are not standard C++
    vector<suffix> suffixes(n);

    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].suff = (txt + i);
    }

    // Sort the suffixes
    sort(suffixes.begin(), suffixes.end(), cmp);

    // Record which suffix ends up in each position
    int *suffixArr = new int[n];
    for (int i = 0; i < n; i++)
        suffixArr[i] = suffixes[i].index;

    return suffixArr;
}

// Print an array
void printArr(int arr[], int n)
{
    for (int i = 0; i < n; i++)
        cout << arr[i] << " ";
    cout << endl;
}

int main()
{
    char txt[] = "banana";
    int n = strlen(txt);
    int *suffixArr = buildSuffixArray(txt, n);
    cout << "Following is suffix array for " << txt << endl;
    printArr(suffixArr, n);
    delete[] suffixArr;
    return 0;
}

Output:

  

Following is suffix array for banana
5 3 1 0 4 2

How do I use a suffix array to match a string?

Let us return to the classic string matching problem: how do we find a pattern string in a text? With the suffix array, we can use binary search. The algorithm is as follows:

  

// Uses buildSuffixArray() and the headers from the construction listing above
void search(char *pat, char *txt, int *suffArr, int n)
{
    int m = strlen(pat);

    // Binary search over the sorted suffixes
    int l = 0, r = n - 1;
    while (l <= r)
    {
        // See whether 'pat' is a prefix of the middle suffix
        int mid = l + (r - l) / 2;
        int res = strncmp(pat, txt + suffArr[mid], m);

        if (res == 0)
        {
            cout << "Pattern found at index " << suffArr[mid];
            return;
        }
        if (res < 0) r = mid - 1;
        else l = mid + 1;
    }
    cout << "Pattern not found";
}

int main()
{
    char txt[] = "banana"; // text
    char pat[] = "nan";    // pattern string

    // Construct the suffix array
    int n = strlen(txt);
    int *suffArr = buildSuffixArray(txt, n);

    // Search txt for an occurrence of pat
    search(pat, txt, suffArr, n);
    delete[] suffArr;
    return 0;
}

The complexity of the search algorithm above is O(m log n), where m is the pattern length and n is the text length. There are more efficient search algorithms based on the suffix array, which will be discussed later.

Applications of the suffix array

First define the height array: height[i] is the length of the longest common prefix of Suffix(sa[i-1]) and Suffix(sa[i]), that is, the longest common prefix of two suffixes that are adjacent in rank.
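
The article does not say how height is computed. One common approach is Kasai's linear-time algorithm; the sketch below is one possible version of it, assuming sa is already a valid suffix array of s. The function name buildHeight is illustrative, and height[0] is set to 0 by convention because the first suffix has no predecessor.

```cpp
#include <string>
#include <vector>

// Kasai-style computation of the height array in O(n).
// height[r] = LCP length of the suffixes at sorted positions r-1 and r.
std::vector<int> buildHeight(const std::string &s, const std::vector<int> &sa)
{
    int n = s.size();
    std::vector<int> rank(n), height(n, 0);
    for (int r = 0; r < n; r++)
        rank[sa[r]] = r;

    int h = 0; // LCP length carried over from the previous suffix
    for (int i = 0; i < n; i++)
    {
        if (rank[i] == 0) { h = 0; continue; } // no predecessor in SA
        int j = sa[rank[i] - 1];               // suffix ranked just before suffix i
        while (i + h < n && j + h < n && s[i + h] == s[j + h])
            h++;
        height[rank[i]] = h;
        if (h > 0) h--; // the next suffix shares at least h - 1 characters
    }
    return height;
}
```

For "banana" with SA = {5, 3, 1, 0, 4, 2}, the result is height = {0, 1, 3, 0, 0, 2}.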

Example 1: Longest common prefix
Given a string, answer queries for the longest common prefix of any two of its suffixes.
Solution: Use rank to find the ranks i and j (i < j) of the two suffixes; the answer is the minimum of height[i+1], ..., height[j]. (This can be optimized with RMQ.)
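
A minimal sketch of this query, assuming rank and height have already been computed for the string. It scans the range directly, so each query costs O(n); an RMQ structure such as a sparse table would replace the std::min_element call. The name lcpOfSuffixes is invented for the example.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Length of the longest common prefix of the suffixes starting at a and b.
int lcpOfSuffixes(const std::vector<int> &rank,
                  const std::vector<int> &height, int a, int b)
{
    if (a == b)
        return (int)rank.size() - a; // a suffix compared with itself

    int i = rank[a], j = rank[b];
    if (i > j) std::swap(i, j);

    // Minimum of height[i+1 .. j]
    return *std::min_element(height.begin() + i + 1, height.begin() + j + 1);
}
```

With the "banana" arrays rank = {3, 2, 5, 1, 4, 0} and height = {0, 1, 3, 0, 0, 2}, the suffixes "anana" (position 1) and "ana" (position 3) have a longest common prefix of length 3.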

Example 2: Longest repeated substring (non-overlapping) (POJ 1743)
Solution: Binary search on the length. For a candidate length len, split the sorted suffixes into groups so that within each group every adjacent height value is >= len. If in some group the difference between the maximum and minimum sa values is >= len, then a repeated substring of length len that does not overlap exists.
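
The grouping check and the binary search might look roughly like this. The sketch assumes sa and height are already available for the string and only covers the plain "longest non-overlapping repeat" problem, not the extra transformations that POJ 1743 itself needs. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Is there a substring of length len occurring twice without overlap?
bool hasNonOverlappingRepeat(const std::vector<int> &sa,
                             const std::vector<int> &height, int len)
{
    int n = sa.size();
    if (n == 0) return false;

    int lo = sa[0], hi = sa[0]; // min and max start position in the current group
    for (int i = 1; i < n; i++)
    {
        if (height[i] >= len)
        {
            // Same group: all members share a prefix of length >= len
            lo = std::min(lo, sa[i]);
            hi = std::max(hi, sa[i]);
            if (hi - lo >= len) return true;
        }
        else
        {
            lo = hi = sa[i]; // start a new group
        }
    }
    return false;
}

// Binary search on the answer; it can be at most n / 2.
int longestNonOverlappingRepeat(const std::vector<int> &sa,
                                const std::vector<int> &height)
{
    int lo = 0, hi = (int)sa.size() / 2;
    while (lo < hi)
    {
        int mid = (lo + hi + 1) / 2;
        if (hasNonOverlappingRepeat(sa, height, mid)) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}
```

The binary search is valid because the property is monotone: if a length works, every shorter length works too.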

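Example 3: Longest repeated substring (overlap allowed)
Solution: The answer is the maximum value in the height array. The problem is equivalent to finding the largest longest-common-prefix over all pairs of suffixes, and that maximum is always attained by two suffixes that are adjacent in SA.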
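
As a small illustration, assuming sa and height are already known for the string s, the substring itself can be recovered from the position of the maximum. The function name longestRepeatedSubstring is chosen for this example only.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns one longest substring that occurs at least twice (overlap allowed),
// or an empty string if no character repeats.
std::string longestRepeatedSubstring(const std::string &s,
                                     const std::vector<int> &sa,
                                     const std::vector<int> &height)
{
    if (height.empty()) return "";
    int best = std::max_element(height.begin(), height.end()) - height.begin();
    return s.substr(sa[best], height[best]);
}
```

For "banana" the maximum height is 3, and the corresponding substring is "ana" (occurring at positions 1 and 3, overlapping).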

Example 4: Longest substring repeated at least k times (overlap allowed) (POJ 3261)
Solution: Binary search on the length. For a candidate length len, group the sorted suffixes by height >= len as before. If some group contains at least k suffixes, then a substring of length len that occurs at least k times exists.
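
A possible sketch of this check, assuming the height array is given and k >= 2 (for k = 1 the whole string is trivially the answer). Only height is needed, because group membership depends on the adjacent LCP values alone. The function names are illustrative.

```cpp
#include <vector>

// Does some substring of length len occur at least k times (k >= 2)?
bool hasSubstringRepeatedK(const std::vector<int> &height, int len, int k)
{
    int groupSize = 1; // number of suffixes in the current group
    for (int i = 1; i < (int)height.size(); i++)
    {
        groupSize = (height[i] >= len) ? groupSize + 1 : 1;
        if (groupSize >= k) return true;
    }
    return false;
}

// Binary search on the length; returns 0 if nothing occurs k times.
int longestSubstringRepeatedK(const std::vector<int> &height, int k)
{
    int lo = 0, hi = (int)height.size();
    while (lo < hi)
    {
        int mid = (lo + hi + 1) / 2;
        if (hasSubstringRepeatedK(height, mid, k)) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}
```

For "banana" (height = {0, 1, 3, 0, 0, 2}) this gives 3 for k = 2 ("ana") and 1 for k = 3 ("a").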

Example 5: Longest palindromic substring (URAL 1297)
Given a string, a substring that reads the same forwards and backwards is called a palindromic substring. Find the longest one.
Solution: Enumerate each position and compute the longest palindromic substring centered there (note that odd and even lengths must both be considered). Append the reversed string to the original string, separated by $. The problem then becomes finding the longest common prefix of two suffixes of the new string.

Example 6: Longest common substring (POJ 2774)
Given two strings S1 and S2, find the longest common substring of S1 and S2.
Solution: Append S2 to S1, separated by $. The problem then becomes finding the longest common prefix of two suffixes. Note that the answer is not simply the maximum value in height: only positions where sa[i-1] and sa[i] belong to different strings count, so the two suffixes must not both come from S1 or both come from S2.
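
The following sketch puts the pieces together for two strings. It makes several simplifying assumptions: the character '$' occurs in neither input, the suffix array is built by naive sorting (fine for short inputs only), and height is computed with a Kasai-style loop. All function names are made up for the example.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort start positions by comparing the suffixes directly.
std::vector<int> buildSA(const std::string &s)
{
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// height[r] = LCP length of the suffixes at sorted positions r-1 and r.
std::vector<int> buildHeight(const std::string &s, const std::vector<int> &sa)
{
    int n = s.size(), h = 0;
    std::vector<int> rank(n), height(n, 0);
    for (int r = 0; r < n; r++) rank[sa[r]] = r;
    for (int i = 0; i < n; i++)
    {
        if (rank[i] == 0) { h = 0; continue; }
        int j = sa[rank[i] - 1];
        while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
        height[rank[i]] = h;
        if (h > 0) h--;
    }
    return height;
}

// Longest common substring of s1 and s2 (assumes neither contains '$').
std::string longestCommonSubstring(const std::string &s1, const std::string &s2)
{
    std::string t = s1 + '$' + s2;
    int split = s1.size(); // suffixes starting before split belong to s1
    std::vector<int> sa = buildSA(t);
    std::vector<int> height = buildHeight(t, sa);

    int best = 0, start = 0;
    for (int i = 1; i < (int)t.size(); i++)
    {
        bool prevInS1 = sa[i - 1] < split;
        bool currInS1 = sa[i] < split;
        // Only adjacent suffixes from different strings count
        if (prevInS1 != currInS1 && height[i] > best)
        {
            best = height[i];
            start = sa[i];
        }
    }
    return t.substr(start, best);
}
```

Because '$' appears exactly once in the concatenation, no common prefix can extend across it, so every height value counted here is a genuine common substring of the two inputs.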

Example 7: Number of common substrings of length at least k (POJ 3415)
Given two strings S1 and S2, count the common substrings of S1 and S2 whose length is at least k (identical substrings at different positions are counted separately).
Solution: Concatenate the two strings, separated by $. Scan the suffixes in sorted order; each time a suffix of S2 is encountered, count how many common substrings of length at least k it forms with the S1 suffixes before it. The S1 suffixes are maintained with a monotonic stack. Then repeat the scan with the roles of S1 and S2 swapped.

Example 8: Longest substring that appears in at least k strings (POJ 3294)
Given n strings, find the longest substring that appears in at least k of the n strings.
Solution: Concatenate the n strings, separated by $ (or other characters that occur in none of the strings). Binary search on the length; for a candidate length len, group the suffixes by height >= len and check whether the suffixes in some group come from at least k of the original strings.

  Related articles:

1. http://www.geeksforgeeks.org/suffix-array-set-1-introduction/