Summary: This article discusses the time complexity of several algorithms for the longest common substring problem. It then proposes a suffix-array-based algorithm with time complexity O(N^2 * log N) and space complexity O(N). Although its time complexity is worse than that of the dynamic programming and suffix tree algorithms, it has the advantage of simple, easy-to-understand code and is suitable for a quick implementation.
First of all, note that LCS usually refers to the longest common subsequence (see p. 223 of Introduction to Algorithms, 3rd edition), not the longest common substring.
The longest common substring is the longest string that occurs as a contiguous substring of both the text string and the pattern string. For example, for text = "abcbcedf" and pattern = "ebcbcdf", the longest common substring is "bcbc" and its length is 4.
There are many ways to find the longest common substring, including brute force search, dynamic programming, suffix arrays, and suffix trees. This article focuses on the suffix array method; the other methods are easy to find with a web search.
Brute Force Search
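```cpp
#include <string.h>

// Brute force: try every pair of start positions (i, j) and extend the match.
int enum_longestCommonSubstring(const char *text, const char *pattern)
{
    if (!text || !pattern) return 0;                 // null pointer
    int tlen = strlen(text), plen = strlen(pattern);
    if (0 == tlen || 0 == plen) return 0;            // empty string
    int maxLEN = 0, i = 0, j = 0, ofs = 0;
    // Stop early once the remaining characters cannot beat maxLEN.
    for (i = 0; i < tlen && (tlen - i >= maxLEN); ++i)
        for (j = 0; j < plen && (plen - j >= maxLEN); ++j)
            if (*(text + i) == *(pattern + j))
            {
                ofs = 1;
                // Compare relative to the current start positions i and j.
                while ((i + ofs) < tlen && (j + ofs) < plen &&
                       *(text + i + ofs) == *(pattern + j + ofs))
                { ++ofs; }
                if (ofs > maxLEN) maxLEN = ofs;      // update
            }
    return maxLEN;
}
```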
Let m be the length of the text string and n the length of the pattern string. The brute force search has time complexity O(m * n * min(m, n)) and space complexity O(1). Using the KMP algorithm for the substring matching can improve the efficiency.
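One possible reading of the KMP remark is sketched below: for each suffix of the pattern, the KMP prefix function is computed over "suffix + separator + text", and the largest value seen inside the text part is the longest prefix of that suffix that occurs in the text. This takes O(n * (m + n)) time. The function name `kmpLongestCommonSubstring` and the separator byte `'\x01'` are assumptions made for this sketch; the separator must not occur in either input.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: longest common substring length via the KMP prefix function.
// Assumption: the byte '\x01' occurs in neither text nor pattern.
int kmpLongestCommonSubstring(const std::string &text, const std::string &pattern)
{
    const char sep = '\x01';
    int best = 0;
    for (std::size_t s = 0; s < pattern.size(); ++s) {
        const std::size_t sufLen = pattern.size() - s;
        // joined = pattern suffix, then the separator, then the whole text
        const std::string joined = pattern.substr(s) + sep + text;
        std::vector<int> pi(joined.size(), 0);   // prefix function values
        for (std::size_t i = 1; i < joined.size(); ++i) {
            int k = pi[i - 1];
            while (k > 0 && joined[i] != joined[k]) k = pi[k - 1];
            if (joined[i] == joined[k]) ++k;
            pi[i] = k;
            // Indices greater than sufLen lie inside the text part.
            if (i > sufLen) best = std::max(best, k);
        }
    }
    return best;
}
```

Calling `kmpLongestCommonSubstring("abcbcedf", "ebcbcdf")` returns 4 for the example used in this article. The extra space is O(m + n) for the joined string and the prefix table.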
Dynamic Programming
Dynamic programming solves the longest common substring problem in O(m * n) time. With a rolling array, the optimized dynamic programming algorithm needs only O(min(m, n)) space.
See http://www.cnblogs.com/ider/p/longest-common-substring-problem-optimization.html
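As a rough illustration of the dynamic programming approach, the sketch below keeps a single rolling row whose cell j holds the length of the longest common suffix of the two prefixes being compared. It is not taken from the linked article; the function name `dpLongestCommonSubstring` is an assumption made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: O(m * n) time, O(min(m, n)) extra space.
int dpLongestCommonSubstring(const std::string &a, const std::string &b)
{
    // Index the rolling row by the shorter string so it has min(m, n) + 1 cells.
    const std::string &shorter = (a.size() <= b.size()) ? a : b;
    const std::string &longer  = (a.size() <= b.size()) ? b : a;

    // row[j] = length of the longest common suffix of
    //          longer[0..i) and shorter[0..j) for the current i
    std::vector<int> row(shorter.size() + 1, 0);
    int best = 0;
    for (std::size_t i = 1; i <= longer.size(); ++i) {
        // Walk j downwards so row[j - 1] still holds the value for i - 1.
        for (std::size_t j = shorter.size(); j >= 1; --j) {
            if (longer[i - 1] == shorter[j - 1]) {
                row[j] = row[j - 1] + 1;
                best = std::max(best, row[j]);
            } else {
                row[j] = 0;
            }
        }
    }
    return best;
}
```

For text = "abcbcedf" and pattern = "ebcbcdf", `dpLongestCommonSubstring` returns 4. Iterating j from high to low is what allows one row to stand in for the full m * n table.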
Suffix Array
A sorted suffix array can be used to find the longest common substring as follows:
1. Concatenate the text string and the pattern string to obtain a new string X.
2. Store all suffixes of X in an array SA. (With a text string of length m and a pattern string of length n, this step takes O(m + n) time.)
3. Sort SA.
4. Compute the longest common prefix length of each pair of adjacent suffixes in SA (time complexity O((m + n) * min(m, n))).
Note: To avoid reporting the longest repeated substring of a single string, the two suffixes compared in step 4 must come from different strings: one from the text string and the other from the pattern string. Steps 1 and 2 therefore need to record, for every suffix, which string it belongs to.
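The four steps can be sketched as follows. This is a minimal illustration, not the listing given later in the article: the names `taggedSuffixLcs` and `Suffix` are hypothetical, and the sketch assumes that the separator byte `'\x01'` occurs in neither input, which keeps a match from running across the boundary between the two strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only. Assumption: '\x01' occurs in neither text nor pattern.
int taggedSuffixLcs(const std::string &text, const std::string &pattern)
{
    const std::size_t m = text.size(), n = pattern.size();
    if (m == 0 || n == 0) return 0;

    // Step 1: concatenate, with a separator between the two strings.
    const std::string x = text + '\x01' + pattern;

    // Step 2: one record per suffix, with a bit recording its origin.
    struct Suffix { std::size_t start; bool fromText; };
    std::vector<Suffix> sa;
    sa.reserve(m + n);
    for (std::size_t i = 0; i < m; ++i) sa.push_back({i, true});
    for (std::size_t j = 0; j < n; ++j) sa.push_back({m + 1 + j, false});

    // Step 3: sort the suffixes lexicographically.
    std::sort(sa.begin(), sa.end(), [&x](const Suffix &a, const Suffix &b) {
        return x.compare(a.start, std::string::npos,
                         x, b.start, std::string::npos) < 0;
    });

    // Step 4: longest common prefix of adjacent suffixes of different origin.
    std::size_t best = 0;
    for (std::size_t k = 0; k + 1 < sa.size(); ++k) {
        if (sa[k].fromText == sa[k + 1].fromText) continue;
        const std::size_t a = sa[k].start, b = sa[k + 1].start;
        std::size_t len = 0;
        while (a + len < x.size() && b + len < x.size() &&
               x[a + len] == x[b + len]) ++len;
        best = std::max(best, len);
    }
    return static_cast<int>(best);
}
```

The suffix that starts at the separator itself is left out of SA, so every record belongs to exactly one of the two strings. Only adjacent pairs need checking because, in sorted order, the common prefix of two suffixes is never longer than that of any adjacent pair lying between them.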
Luo Suiqian's paper "Suffix Arrays: A Powerful Tool for String Processing" describes how to sort a suffix array with radix sort, with a sorting time complexity of O((m + n) * log(m + n)). The overall time complexity of "suffix array + radix sort" is therefore O((m + n) * min(m, n)), dominated by step 4. However, that method is complicated and hard to master.
Here I propose a "suffix array + C++ standard library sort()" algorithm instead. Each strcmp comparison may scan up to the length of the shorter suffix, so the sorting time complexity is O(max(m, n) * (m + n) * log(m + n)) in the worst case, and this is also the overall time complexity of the algorithm (step 3 dominates). The space complexity is O(m + n). The time complexity is higher than that of "suffix array + radix sort", but the advantage is that sorting uses only the standard library sort() and strcmp, so the code is short and the algorithm is easier to understand. The code is as follows:
```cpp
#include <stdio.h>
#include <string.h>
#include <algorithm>
using namespace std;

int suffixArrayQsort_longestCommonSubstring(const char *text, const char *pattern)
{
    if (!text || !pattern) return 0;                 // null pointer
    int tlen = strlen(text), plen = strlen(pattern), i, j;
    if (0 == tlen || 0 == plen) return 0;            // empty string

    enum ATTRIB { TEXT, PATTERN };
    struct absInfo
    {
        const char *head;   // start of the suffix
        ATTRIB attr;        // tag: which string the suffix comes from
        int len;            // length of the suffix
        absInfo() : head(NULL), attr(TEXT), len(0) {}
        absInfo(const char *phead, ATTRIB attrib, int length)
            : head(phead), attr(attrib), len(length) {}
        bool operator<(const absInfo &b) const
        {
            return strcmp(head, b.head) < 0;
        }
        static void display(const absInfo &a)
        {
            printf("size:%d type:%-7s ", a.len, (a.attr == TEXT ? "TEXT" : "PATTERN"));
            printf("%s\n", a.head);
        }
    } *sa;

    // step 2: build the suffix array
    sa = new absInfo[tlen + plen];
    for (i = 0; i < tlen; ++i)
    {
        sa[i].head = text + i;
        sa[i].attr = TEXT;
        sa[i].len = tlen - i;
    }
    for (j = 0; j < plen; ++j)
    {
        sa[j + tlen].head = pattern + j;
        sa[j + tlen].attr = PATTERN;
        sa[j + tlen].len = plen - j;
    }

    // step 3: use sort() to sort the sa
    puts("before sort, the sa is:"); for_each(sa, sa + tlen + plen, absInfo::display);
    sort(sa, sa + tlen + plen);
    puts("after sort, the sa is:"); for_each(sa, sa + tlen + plen, absInfo::display);

    // step 4: compare adjacent suffixes that come from different strings
    int maxLEN = 0, rec = 0;
    for (i = 0; i < tlen + plen - 1; i++)
    {
        if (sa[i].attr == sa[i + 1].attr) continue;
        // pruning: a suffix no longer than maxLEN cannot improve the result
        if (sa[i].len <= maxLEN || sa[i + 1].len <= maxLEN) continue;
        rec = 0;
        while (rec < sa[i].len && rec < sa[i + 1].len &&
               *(sa[i].head + rec) == *(sa[i + 1].head + rec))
            ++rec;
        if (rec > maxLEN) maxLEN = rec;              // update
    }
    // release memory and return
    delete[] sa; sa = NULL;
    return maxLEN;
}
```
Note: 1. The len field in the absInfo structure is not strictly required. It is kept mainly so that the pruning check at the start of the step 4 loop can skip suffixes that are too short to improve maxLEN.
2. With a small modification, the code can also return the common substring itself (for example, "bcbc"). The len field of absInfo and the value of maxLEN are enough to compute the position of the common substring in the text string and in the pattern string in O(1) time.
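A minimal sketch of the modification described in note 2 is shown below. The start offset of a suffix is its string's total length minus the suffix length, so both positions follow in O(1) once the best pair is known. The names `LcsMatch`, `Suffix`, and `locateLongestCommonSubstring` are hypothetical and are not part of the listing above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical result type: match length and start offset in each string.
struct LcsMatch {
    int length;
    int textPos;     // -1 when there is no common substring
    int patternPos;  // -1 when there is no common substring
};

// Sketch only: same idea as the listing above, but it also records positions.
LcsMatch locateLongestCommonSubstring(const char *text, const char *pattern)
{
    LcsMatch result = {0, -1, -1};
    if (!text || !pattern) return result;
    const int tlen = static_cast<int>(std::strlen(text));
    const int plen = static_cast<int>(std::strlen(pattern));

    struct Suffix { const char *head; bool fromText; int len; };
    std::vector<Suffix> sa;
    sa.reserve(static_cast<std::size_t>(tlen) + plen);
    for (int i = 0; i < tlen; ++i) sa.push_back({text + i, true, tlen - i});
    for (int j = 0; j < plen; ++j) sa.push_back({pattern + j, false, plen - j});

    std::sort(sa.begin(), sa.end(), [](const Suffix &a, const Suffix &b) {
        return std::strcmp(a.head, b.head) < 0;
    });

    for (std::size_t i = 0; i + 1 < sa.size(); ++i) {
        const Suffix &a = sa[i];
        const Suffix &b = sa[i + 1];
        if (a.fromText == b.fromText) continue;   // same origin: skip
        int rec = 0;
        while (rec < a.len && rec < b.len && a.head[rec] == b.head[rec]) ++rec;
        if (rec > result.length) {
            const Suffix &t = a.fromText ? a : b; // the text suffix
            const Suffix &p = a.fromText ? b : a; // the pattern suffix
            result.length = rec;
            result.textPos = tlen - t.len;        // offset = total length - suffix length
            result.patternPos = plen - p.len;
        }
    }
    return result;
}
```

For text = "abcbcedf" and pattern = "ebcbcdf", the result has length 4 with both positions equal to 1, and the substring itself can be read as the first `length` characters starting at `text + textPos`.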
Running result:
With text = "abcbcedf" and pattern = "ebcbcdf", the program prints the suffix array before and after sorting, and the function returns 4, the length of "bcbc".
As the code shows, "suffix array + standard library sort" finds the longest common substring with very little code, and its space complexity is O(m + n).
Suffix Tree
For the suffix tree and generalized suffix tree algorithms, please search for further material on your own.