Use the suffix array to obtain the longest common substring

Last Update:2014-09-03 Source: Internet

Author: User

Tags bcbc


　　Summary: This article discusses the time complexity of the main algorithms for the longest common substring problem and then proposes an algorithm based on the suffix array with time complexity O(n^2 * log n) and space complexity O(n). Although its time complexity is worse than that of the dynamic programming and suffix tree algorithms, it is simple to code, easy to understand, and suitable for a quick implementation.

First of all, note that LCS usually refers to the Longest Common Subsequence (see p. 223 of Introduction to Algorithms, 3rd edition), not the longest common substring.

The longest common substring is the longest string that occurs as a substring of both the text string and the pattern string. For example, for text = "abcbcedf" and pattern = "ebcbcdf", the longest common substring is "bcbc" and its length is 4.

There are many ways to find the longest common substring, including brute-force search, dynamic programming, suffix arrays, and suffix trees. This article focuses on the suffix array method; the other methods can be looked up online.

　　Brute Force Search

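#include <string.h>

// Brute force: try every pair of start positions (i, j) and extend the match.
int enum_longestCommonSubstring(const char *text, const char *pattern)
{
    if (!text || !pattern) return 0;            // null pointer
    int tlen = (int)strlen(text), plen = (int)strlen(pattern);
    if (0 == tlen || 0 == plen) return 0;       // empty string
    int maxLEN = 0, i = 0, j = 0, ofs = 0;
    // Stop early once the remaining characters cannot reach maxLEN.
    for (i = 0; i < tlen && (tlen - i >= maxLEN); ++i)
        for (j = 0; j < plen && (plen - j >= maxLEN); ++j)
            if (*(text + i) == *(pattern + j))
            {
                ofs = 1;
                // Extend the match relative to the current start positions i and j.
                while ((i + ofs) < tlen && (j + ofs) < plen && *(text + i + ofs) == *(pattern + j + ofs))
                {   ++ofs;   }
                if (ofs > maxLEN) maxLEN = ofs;  // update
            }
    return maxLEN;
}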

Let m be the length of the text string and n the length of the pattern string. The brute-force search has time complexity O(m * n * min(m, n)) and space complexity O(1). Using the KMP algorithm for the substring matching can improve the efficiency.

　　Dynamic Programming

　　Dynamic programming finds the longest common substring in O(m * n) time. The optimized dynamic programming algorithm reduces the space complexity to O(min(m, n)).

See http://www.cnblogs.com/ider/p/longest-common-substring-problem-optimization.html
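The dynamic programming idea can be sketched roughly as follows. This is a minimal illustration, not the code from the linked article: the function name lcsDynamicProgramming is made up here, and the single rolling row over the shorter string is one common way to keep the extra space at O(min(m, n)).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Length of the longest common substring of a and b (hypothetical helper).
// After processing row i, dp[j] is the length of the longest common suffix
// of rows[0..i) and cols[0..j).
int lcsDynamicProgramming(const std::string &a, const std::string &b)
{
    // Put the shorter string on the column axis so dp has min(m, n) + 1 cells.
    const std::string &rows = (a.size() >= b.size()) ? a : b;
    const std::string &cols = (a.size() >= b.size()) ? b : a;

    std::vector<int> dp(cols.size() + 1, 0);
    int best = 0;
    for (std::size_t i = 1; i <= rows.size(); ++i) {
        // Walk j downwards so that dp[j - 1] still holds the previous row's value.
        for (std::size_t j = cols.size(); j >= 1; --j) {
            if (rows[i - 1] == cols[j - 1]) {
                dp[j] = dp[j - 1] + 1;
                best = std::max(best, dp[j]);
            } else {
                dp[j] = 0;
            }
        }
    }
    return best;
}
```

Calling lcsDynamicProgramming("abcbcedf", "ebcbcdf") on the example strings from the beginning of the article gives 4.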

　　Suffix Array

　Using a sorted suffix array to find the longest common substring:

1. Concatenate the text string and the pattern string to obtain a new string X.

2. Store all suffixes of X in an array sa. (With text length m and pattern length n, step 2 has time complexity O(m + n).)

3. Sort sa.

4. Compute the length of the longest common prefix of each pair of adjacent suffixes in sa (time complexity O((m + n) * min(m, n))).

Note: To avoid reporting the longest repeated substring of a single string, one of the two suffixes compared in step 4 must come from the text string and the other from the pattern string. Steps 1 and 2 therefore need to record a tag with each suffix saying which string it came from.

Luo Suiqian's paper "Suffix Arrays: A Powerful Tool for String Processing" describes how to sort the suffix array with radix sort, which gives a sorting time complexity of O((m + n) * log(m + n)). The overall time complexity of suffix array + radix sort is therefore O((m + n) * min(m, n)) (step 4 dominates). However, that method is complicated and hard to master. Here I propose an algorithm that sorts the suffix array with the standard library sort instead. Its sorting time complexity is O(min(m, n) * (m + n) * log(m + n)), so the overall time complexity is O(min(m, n) * (m + n) * log(m + n)) (step 3 dominates), and the space complexity is O(m + n). The time complexity of suffix array + standard library sort is higher than that of suffix array + radix sort, but because the sorting only needs the standard library sort plus strcmp, the code is simple and the algorithm is easier to understand. The code is as follows:

 1 #include<stdio.h>
 2 #include<iostream>
 3 #include<string.h>
 4 #include<algorithm>
 5 using namespace std;
 6 int suffixArrayQsort_longestCommonSubstring(const char *text,const char *pattern)
 7 {
 8     if(!text || !pattern)  return 0;     //null pointer
 9     int tlen=strlen(text),plen=strlen(pattern),i,j;
10     if(0==tlen || 0==plen) return 0; //empty string
11
12     enum ATTRIB{TEXT,PATTERN};
13     struct absInfo
14     {
15         const char *head;  //start of this suffix
16         ATTRIB attr;  //tag: which string the suffix came from
17         int len;      //length of this suffix
18         absInfo():head(NULL),attr(TEXT),len(0){}
19         absInfo(const char *phead,ATTRIB attrib,int length):head(phead),attr(attrib),len(length){}
20         bool operator < (const absInfo &b) const  //const, so const objects can be compared too
21         {
22             return  strcmp(head,b.head)<0;
23         }
24         static void display(const absInfo &a)
25         {
26             printf("size:%d type:%-7s    ",a.len, (a.attr==TEXT?"TEXT":"PATTERN") );
27             printf("%s\n",a.head);
28         }
29     }*sa;
30
31     //step 2:build the suffix array
32     sa=new absInfo[tlen+plen];
33     for(i=0;i<tlen;++i)
34     {
35         sa[i].head=text+i;
36         sa[i].attr=TEXT;
37         sa[i].len=tlen-i;
38     }
39     for(j=0;j<plen;++j)
40     {
41         sa[j+tlen].head=pattern+j;
42         sa[j+tlen].attr=PATTERN;
43         sa[j+tlen].len=plen-j;
44     }
45
46     //step 3:use sort() to sort the sa
47     puts("before sort, the sa is:"); for_each(sa,sa+tlen+plen,absInfo::display);
48     sort(sa,sa+tlen+plen);
49     puts("after sort, the sa is:"); for_each(sa,sa+tlen+plen,absInfo::display);
50
51     //step 4:compare adjacent suffixes that come from different strings
52     int maxLEN=0,rec=0;
53     for(i=0;i<tlen+plen-1;i++)
54     {
55         if(sa[i].attr==sa[i+1].attr) continue;  //same origin, skip
56         if(sa[i].len<=maxLEN || sa[i+1].len<=maxLEN) continue;  //pruning: too short to beat maxLEN
57         rec=0;
58         while(rec<sa[i].len && rec<sa[i+1].len && *(sa[i].head+rec)==*(sa[i+1].head+rec) )
59           ++rec;
60         if(rec>maxLEN)  maxLEN=rec; //update
61     }
62     //release memory resource and return
63     delete [] sa; sa=NULL;
64     return maxLEN;
65 }

Note: 1. The len field in the absInfo structure is not strictly required; it is set only so that the search can be pruned at line 56 of the code.

2. With a slight modification, the code can also return the common substring itself (for example, "bcbc"). The len field of absInfo and the value of maxLEN can also be used to compute, in O(1) time, the positions of the common substring in the text string and in the pattern string.
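One possible way to make the modification described in note 2 is sketched below. It is an illustration under stated assumptions rather than the author's code: the LcsResult type, the Suffix struct, and the function name are hypothetical, and std::vector with a lambda comparator replaces the raw array. The start position of a suffix is recovered in O(1) as the string length minus the suffix's len field.

```cpp
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical result type, not part of the original code.
struct LcsResult {
    std::string value;  // the common substring itself
    int textPos;        // start index in text, -1 if there is none
    int patternPos;     // start index in pattern, -1 if there is none
};

LcsResult longestCommonSubstringWithPosition(const char *text, const char *pattern)
{
    LcsResult result = {"", -1, -1};
    if (!text || !pattern) return result;
    const int tlen = static_cast<int>(std::strlen(text));
    const int plen = static_cast<int>(std::strlen(pattern));

    struct Suffix {
        const char *head;  // start of the suffix
        bool fromText;     // tag: true for text, false for pattern
        int len;           // length of the suffix
    };
    std::vector<Suffix> sa;
    for (int i = 0; i < tlen; ++i) { Suffix s = {text + i, true, tlen - i}; sa.push_back(s); }
    for (int j = 0; j < plen; ++j) { Suffix s = {pattern + j, false, plen - j}; sa.push_back(s); }

    std::sort(sa.begin(), sa.end(), [](const Suffix &a, const Suffix &b) {
        return std::strcmp(a.head, b.head) < 0;
    });

    int maxLen = 0;
    for (std::size_t i = 0; i + 1 < sa.size(); ++i) {
        const Suffix &a = sa[i];
        const Suffix &b = sa[i + 1];
        if (a.fromText == b.fromText) continue;             // same origin, skip
        if (a.len <= maxLen || b.len <= maxLen) continue;   // pruning
        int rec = 0;
        while (rec < a.len && rec < b.len && a.head[rec] == b.head[rec]) ++rec;
        if (rec > maxLen) {
            maxLen = rec;
            const Suffix &t = a.fromText ? a : b;
            const Suffix &p = a.fromText ? b : a;
            result.value.assign(t.head, rec);
            result.textPos = tlen - t.len;      // O(1): position from the len field
            result.patternPos = plen - p.len;
        }
    }
    return result;
}
```

For text = "abcbcedf" and pattern = "ebcbcdf", the returned value is "bcbc", with start index 1 in both strings (indices counted from 0).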

Running result:

For text = "abcbcedf" and pattern = "ebcbcdf", the code prints the suffix array before and after sorting and returns 4, the length of "bcbc".

As the code shows, "suffix array + standard library sort" finds the longest common substring with simple code, and its space complexity is O(m + n).

　　Suffix tree

The suffix tree and generalized suffix tree algorithms are not covered here; you can look them up yourself.

Why do the strings need to be joined with a special character when a suffix array is used to find the longest common substring of multiple strings?

Without a separator, you cannot tell which part belongs to the first string and which part belongs to the second string...
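A small sketch may make the role of the separator concrete. It assumes a separator character (here '\x01', an arbitrary choice) that occurs in neither input string; the function name lcsWithSeparator is hypothetical. Without the separator, joining "ab" and "abab" gives "ababab", where the suffix starting at position 0 and the suffix starting at position 2 share the prefix "abab", which runs past the end of the first string. With the separator, a common prefix has to stop at the boundary.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Longest common substring length via one concatenated string (hypothetical helper).
// Assumption: sep does not occur in text or pattern.
std::size_t lcsWithSeparator(const std::string &text, const std::string &pattern,
                             char sep = '\x01')
{
    const std::string x = text + sep + pattern;
    const std::size_t m = text.size();  // index of the separator in x

    // Sort the start positions of all suffixes of x.
    std::vector<std::size_t> sa(x.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&x](std::size_t a, std::size_t b) {
        return x.compare(a, std::string::npos, x, b, std::string::npos) < 0;
    });

    std::size_t best = 0;
    for (std::size_t k = 0; k + 1 < sa.size(); ++k) {
        const std::size_t a = sa[k];
        const std::size_t b = sa[k + 1];
        if (a == m || b == m) continue;    // the suffix that starts at the separator
        if ((a < m) == (b < m)) continue;  // both suffixes start in the same string
        // The separator occurs only once in x, so two different suffixes can
        // never match across it and the common prefix stays inside one string.
        std::size_t len = 0;
        while (a + len < x.size() && b + len < x.size() && x[a + len] == x[b + len]) ++len;
        best = std::max(best, len);
    }
    return best;
}
```

With the separator in place, lcsWithSeparator("ab", "abab") gives 2 rather than the misleading 4.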

What is suffix array string matching?

Suffix Array
In string processing, the suffix tree and the suffix array are both powerful tools. The suffix tree is widely known, while the suffix array is rarely seen in China. In fact, the suffix array is a very elegant alternative to the suffix tree. It is easier to implement than the suffix tree, it can do many of the things a suffix tree can do with time complexity that is no worse, and it takes up much less space. It is fair to say that in informatics competitions, suffix arrays are more practical than suffix trees. Therefore, in this article I introduce the basic concepts and construction of the suffix array, as well as the construction of the longest common prefix array that accompanies it. Finally, I discuss applications of the suffix array through some examples.
Basic Concepts
First, some necessary definitions:
Character set: a character set Σ is a set with a total order defined on it. That is, any two different elements α and β of Σ can be compared: either α < β or β < α (that is, α > β). The elements of Σ are called characters.
String: a string S is an array of n characters arranged in sequence. n is the length of S, written len(S). The i-th character of S is written S[i].
Substring: the substring S[i..j] of S, with i ≤ j, is the part of S from position i to position j, that is, the string formed by S[i], S[i+1], ..., S[j] in order.
Suffix: a suffix is a special substring that runs from some position i to the end of the string. The suffix of S starting at position i is written Suffix(S, i), that is, Suffix(S, i) = S[i..len(S)].
String comparison: strings are usually compared in "dictionary order". To compare two strings u and v, let i start at 1 and compare u[i] with v[i]. If they are equal, increase i by 1; otherwise, if u[i] < v[i] then u < v, and if u[i] > v[i] then u > v (that is, v < u), and the comparison ends. If i > len(u) or i > len(v) is reached without a result, then u < v if len(u) < len(v), u = v if len(u) = len(v), and u > v if len(u) > len(v).
By this definition of string comparison, two suffixes u and v of S with different start positions can never compare equal, because the necessary condition for u = v, namely len(u) = len(v), cannot hold.
From now on we fix a character set Σ and a string S with len(S) = n and S[n] = '$'. That is, S ends with a special character '$' that is smaller than every character in Σ. All characters of S other than S[n] belong to Σ. For this fixed S, the suffix starting at position i is written simply Suffix(i), omitting the parameter S.
Suffix array: the suffix array SA is a one-dimensional array that stores a permutation SA[1], SA[2], ..., SA[n] of 1..n such that Suffix(SA[i]) < Suffix(SA[i+1]) for 1 ≤ i < n. In other words, the n suffixes of S are sorted in ascending order, and their start positions are placed into SA in that order.
Rank array: the rank array is Rank = SA^-1; that is, if SA[i] = j, then Rank[j] = i. It is easy to see that Rank[i] stores the rank of Suffix(i) among all suffixes in ascending order.
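The relationship between SA and Rank can be illustrated with a small sketch. It makes two assumptions that differ from the text: positions are counted from 0 instead of 1, and no '$' is appended, because std::string comparison already treats a proper prefix as smaller, which has the same effect. The function names are made up for this example, and the suffixes are simply sorted as ordinary strings.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// SA[i] = start position of the i-th smallest suffix of s (0-based, naive sort).
std::vector<int> buildSuffixArrayNaive(const std::string &s)
{
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // Dictionary-order comparison of Suffix(a) and Suffix(b).
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Rank is the inverse permutation of SA: if SA[i] = j, then Rank[j] = i.
std::vector<int> buildRank(const std::vector<int> &sa)
{
    std::vector<int> rank(sa.size());
    for (std::size_t i = 0; i < sa.size(); ++i)
        rank[sa[i]] = static_cast<int>(i);
    return rank;
}
```

For s = "banana" the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", so SA = {5, 3, 1, 0, 4, 2} and Rank = {3, 2, 5, 1, 4, 0}.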
Construction method
How do we construct a suffix array? The most direct and simple method is to treat the suffixes of S as ordinary strings and sort them in ascending order with a general string sorting method.
It is easy to see that this approach is clumsy: it does not exploit the relationships among the suffixes, so it is not efficient. Even with Multi-key Quick Sort, which is well suited to string sorting, the worst-case time complexity is still O(n^2), which does not meet our needs.
The following describes the doubling algorithm.
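As a rough illustration of the doubling idea, here is a common textbook formulation, which is not necessarily the one the quoted article goes on to give. In each round, the suffixes are ordered by their first 2k characters using the ranks already known for their first k characters, and k doubles until all ranks are distinct. This sketch uses std::sort in every round, so it runs in O(n * log^2 n) rather than the O(n * log n) that a radix sort version achieves. Positions are 0-based, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Suffix array by prefix doubling (sketch; 0-based positions).
std::vector<int> buildSuffixArrayDoubling(const std::string &s)
{
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    std::iota(sa.begin(), sa.end(), 0);
    // Round 0: rank each suffix by its first character.
    for (int i = 0; i < n; ++i) rank[i] = static_cast<unsigned char>(s[i]);

    for (int k = 1; ; k *= 2) {
        // Order by the pair (rank of first k chars, rank of the next k chars).
        // A suffix shorter than k gets -1 for the second half, so it sorts first.
        auto less = [&](int a, int b) -> bool {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            const int ra = (a + k < n) ? rank[a + k] : -1;
            const int rb = (b + k < n) ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), less);

        // Assign new ranks; suffixes with equal pairs share a rank.
        if (n > 0) tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (less(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;

        // All ranks distinct: the order is final.
        if (n == 0 || rank[sa[n - 1]] == n - 1) break;
    }
    return sa;
}
```

For "banana" this produces the same array as plain sorting of the suffixes, {5, 3, 1, 0, 4, 2}, after two rounds.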
