Use the suffix array to obtain the longest common substring

Source: Internet
Author: User
Tags: bcbc

  Summary: This article discusses the time complexity of several algorithms for the longest common substring problem. It then proposes a suffix-array-based algorithm with time complexity O(N^2 * log N) and space complexity O(N). Although its complexity is worse than that of the dynamic programming and suffix tree algorithms, it has the advantage of simple, easy-to-understand code and is suitable for a quick implementation.

 

First of all, LCS usually refers to the longest common subsequence (see p223 of Introduction to Algorithms, 3rd edition), not the longest common substring.

The longest common substring is the longest string that occurs as a contiguous substring of both the text string and the pattern string. For example, for text = "abcbcedf" and pattern = "ebcbcdf", the longest common substring is "bcbc" and its length is 4.

There are many solutions to the longest common substring problem, including brute force search, dynamic programming, suffix arrays, and suffix trees. This article focuses on the suffix array method. The other methods can be looked up online.

  Brute Force Search

  

    #include <string.h>

    int enum_longestCommonSubstring(const char *text, const char *pattern)
    {
        if (!text || !pattern) return 0;          // null pointer
        int tlen = (int)strlen(text), plen = (int)strlen(pattern);
        if (0 == tlen || 0 == plen) return 0;     // empty string
        int maxLEN = 0, i = 0, j = 0, ofs = 0;
        // stop early once the remaining suffix is too short to beat maxLEN
        for (i = 0; i < tlen && (tlen - i >= maxLEN); ++i)
            for (j = 0; j < plen && (plen - j >= maxLEN); ++j)
                if (*(text + i) == *(pattern + j))
                {
                    ofs = 1;
                    // extend the match that starts at text[i] and pattern[j]
                    while ((i + ofs) < tlen && (j + ofs) < plen
                           && *(text + i + ofs) == *(pattern + j + ofs))
                    {   ++ofs;   }
                    if (ofs > maxLEN) maxLEN = ofs;  // update
                }
        return maxLEN;
    }

Let the length of the text string be m and the length of the pattern string be n. The time complexity of the brute force search is O(m * n * min(m, n)) and the space complexity is O(1). If the KMP algorithm is used for substring matching, the efficiency can be improved.

  Dynamic Programming

  The time complexity of solving the longest common substring problem with dynamic programming is O(m * n). An optimized dynamic programming algorithm can reduce the space complexity to O(min(m, n)).

See http://www.cnblogs.com/ider/p/longest-common-substring-problem-optimization.html
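
As a rough illustration of the dynamic programming approach, here is a minimal C++ sketch that keeps only two rows of the table, which is one way to reach O(min(m, n)) extra space. The function name lcsDpLength is made up for this example, and the code is not taken from the linked article.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common substring of a and b.
// Hypothetical helper: O(m * n) time, O(min(m, n)) extra space.
std::size_t lcsDpLength(const std::string &a, const std::string &b)
{
    // Put the shorter string on the column axis so the row buffers stay small.
    const std::string &rows = (a.size() >= b.size()) ? a : b;
    const std::string &cols = (a.size() >= b.size()) ? b : a;

    // cur[j] = length of the longest common suffix of rows[0..i) and cols[0..j).
    std::vector<std::size_t> prev(cols.size() + 1, 0), cur(cols.size() + 1, 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i <= rows.size(); ++i) {
        for (std::size_t j = 1; j <= cols.size(); ++j) {
            // A match extends the diagonal run; a mismatch resets it to 0.
            cur[j] = (rows[i - 1] == cols[j - 1]) ? prev[j - 1] + 1 : 0;
            best = std::max(best, cur[j]);
        }
        std::swap(prev, cur);   // the row just filled becomes the previous row
    }
    return best;
}
```

Calling lcsDpLength("abcbcedf", "ebcbcdf") with the example strings from above should give 4.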

  

  Suffix Array

 The longest common substring can be found with a sorted suffix array as follows:

1. Concatenate the text string and the pattern string to obtain a new string X;

2. Store all suffixes of X in an array SA. (With a text string of length m and a pattern string of length n, step 2 has time complexity O(m + n).)

3. Sort SA;

4. Compute the longest common prefix length of each pair of adjacent suffixes in SA (time complexity O((m + n) * min(m, n))).

Note: To avoid returning the longest repeated substring of a single string, the two suffixes compared in step 4 must come from different strings: one from the text string and the other from the pattern string. Therefore a tag recording the origin of each suffix should be added in steps 1 and 2.
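
The four steps can be sketched compactly in C++ as follows. This is only an illustration under stated assumptions: the function name lcsBySortedSuffixes is hypothetical, the separator byte '\x01' is assumed to occur in neither input, and the origin tag is derived from the suffix offset rather than stored as a separate field.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of steps 1-4. Assumes '\x01' appears in neither text nor pattern.
std::size_t lcsBySortedSuffixes(const std::string &text, const std::string &pattern)
{
    const std::size_t m = text.size();
    // Step 1: concatenate; the separator keeps a match from running across the join.
    const std::string x = text + '\x01' + pattern;

    // Step 2: each suffix of X is identified by its start offset.
    std::vector<std::size_t> sa;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (i != m) sa.push_back(i);        // skip the suffix starting at the separator

    // Step 3: sort the offsets by the suffixes they denote.
    std::sort(sa.begin(), sa.end(), [&x](std::size_t p, std::size_t q) {
        return x.compare(p, std::string::npos, x, q, std::string::npos) < 0;
    });

    // Step 4: longest common prefix of adjacent suffixes of different origin.
    std::size_t best = 0;
    for (std::size_t k = 0; k + 1 < sa.size(); ++k) {
        const std::size_t p = sa[k], q = sa[k + 1];
        if ((p < m) == (q < m)) continue;   // origin tag: both from the same string
        std::size_t len = 0;
        while (p + len < x.size() && q + len < x.size() && x[p + len] == x[q + len])
            ++len;
        best = std::max(best, len);
    }
    return best;
}
```

Here an offset below m means the suffix starts in the text string, and an offset above m means it starts in the pattern string, so no extra storage is needed for the tag.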

The paper "Suffix Array: A Powerful Tool for Processing Strings" by Luo Sui describes how to sort a suffix array with radix sort, which gives a sorting time complexity of O((m + n) * log(m + n)). The overall time complexity of the suffix array + radix sort algorithm is therefore O((m + n) * min(m, n)), since step 4 dominates. However, that method is complicated and difficult to master. Here I propose an algorithm that sorts the suffix array with the standard library sort instead. Its sorting time complexity is O(min(m, n) * (m + n) * log(m + n)), so the overall time complexity is O(min(m, n) * (m + n) * log(m + n)), with step 3 dominating, and its space complexity is O(m + n). In terms of time complexity, suffix array + standard library sort is worse than suffix array + radix sort. Its advantage is that sorting is done with the standard library sort and strcmp, so the code is simple and the algorithm is easier to understand. The code is as follows:

  

    #include <stdio.h>
    #include <string.h>
    #include <algorithm>
    using namespace std;

    enum ATTRIB { TEXT, PATTERN };   // which string a suffix comes from

    struct absInfo
    {
        const char *head;   // start of the suffix
        ATTRIB attr;        // tag
        int len;            // suffix length
        absInfo() : head(NULL), attr(TEXT), len(0) {}
        absInfo(const char *phead, ATTRIB attrib, int length) : head(phead), attr(attrib), len(length) {}
        bool operator<(const absInfo &b) const
        {
            return strcmp(head, b.head) < 0;
        }
        static void display(const absInfo &a)
        {
            printf("size:%d type:%-7s    ", a.len, (a.attr == TEXT ? "TEXT" : "PATTERN"));
            printf("%s\n", a.head);
        }
    };

    int suffixArrayQsort_longestCommonSubstring(const char *text, const char *pattern)
    {
        if (!text || !pattern) return 0;          // null pointer
        int tlen = (int)strlen(text), plen = (int)strlen(pattern), i, j;
        if (0 == tlen || 0 == plen) return 0;     // empty string

        // step 2: build the suffix array
        absInfo *sa = new absInfo[tlen + plen];
        for (i = 0; i < tlen; ++i)
        {
            sa[i].head = text + i;
            sa[i].attr = TEXT;
            sa[i].len = tlen - i;
        }
        for (j = 0; j < plen; ++j)
        {
            sa[j + tlen].head = pattern + j;
            sa[j + tlen].attr = PATTERN;
            sa[j + tlen].len = plen - j;
        }

        // step 3: use sort() to sort the sa
        puts("before sort, the sa is:"); for_each(sa, sa + tlen + plen, absInfo::display);
        sort(sa, sa + tlen + plen);
        puts("after sort, the sa is:"); for_each(sa, sa + tlen + plen, absInfo::display);

        // step 4: compare adjacent suffixes that come from different strings
        int maxLEN = 0, rec = 0;
        for (i = 0; i < tlen + plen - 1; i++)
        {
            if (sa[i].attr == sa[i + 1].attr) continue;
            // pruning: a suffix no longer than maxLEN cannot improve the result
            if (sa[i].len <= maxLEN || sa[i + 1].len <= maxLEN) continue;
            rec = 0;
            while (rec < sa[i].len && rec < sa[i + 1].len && *(sa[i].head + rec) == *(sa[i + 1].head + rec))
                ++rec;
            if (rec > maxLEN) maxLEN = rec;       // update
        }
        // release memory resource and return
        delete[] sa; sa = NULL;
        return maxLEN;
    }

Note: 1. The len field in the absInfo structure is not strictly required. It is kept mainly so that step 4 can prune the search with the check sa[i].len <= maxLEN || sa[i+1].len <= maxLEN.

2. With a slight modification, the code can also return the common substring itself (for example, "bcbc"). The len field of absInfo and the value of maxLEN can then be used to compute the position of the common substring in the text string and in the pattern string in O(1) time.
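
One possible form of that modification is sketched below. The names LcsMatch, TaggedSuffix and locateLongestCommonSubstring are hypothetical and chosen for this example only; the idea is simply that a suffix of length len in a string of length tlen starts at offset tlen - len.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical result type: start offsets in both strings plus the match length.
struct LcsMatch {
    std::size_t textPos;
    std::size_t patternPos;
    std::size_t length;
};

// One suffix tagged with its origin (plays the role of absInfo).
struct TaggedSuffix {
    const char *head;
    bool fromText;
    std::size_t len;
};

LcsMatch locateLongestCommonSubstring(const char *text, const char *pattern)
{
    LcsMatch best = {0, 0, 0};
    if (!text || !pattern) return best;
    const std::size_t tlen = std::strlen(text), plen = std::strlen(pattern);

    // Collect the tagged suffixes of both strings.
    std::vector<TaggedSuffix> sa;
    for (std::size_t i = 0; i < tlen; ++i) sa.push_back({text + i, true, tlen - i});
    for (std::size_t j = 0; j < plen; ++j) sa.push_back({pattern + j, false, plen - j});

    std::sort(sa.begin(), sa.end(), [](const TaggedSuffix &a, const TaggedSuffix &b) {
        return std::strcmp(a.head, b.head) < 0;
    });

    for (std::size_t k = 0; k + 1 < sa.size(); ++k) {
        const TaggedSuffix &a = sa[k], &b = sa[k + 1];
        if (a.fromText == b.fromText) continue;      // need one suffix from each string
        std::size_t rec = 0;
        while (rec < a.len && rec < b.len && a.head[rec] == b.head[rec]) ++rec;
        if (rec > best.length) {
            const TaggedSuffix &t = a.fromText ? a : b;
            const TaggedSuffix &p = a.fromText ? b : a;
            // O(1) position recovery: start offset = string length - suffix length.
            best = {tlen - t.len, plen - p.len, rec};
        }
    }
    return best;
}
```

The substring itself can then be read as the length characters starting at text + textPos. When no common character exists, length is 0 and the two positions carry no meaning.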

Running result:

When text = "abcbcedf" and pattern = "ebcbcdf", the code prints the suffix array before and after sorting and returns 4, the length of "bcbc".

As the code shows, "suffix array + standard library sort" finds the longest common substring with very little code, and its space complexity is O(m + n).

  Suffix Tree

The suffix tree and generalized suffix tree algorithms are not covered here; they can be looked up separately.

  
