Longest common substring
Longest Common Substring (LCS) is a classic problem. Its basic statement is: given two strings, find the longest string that occurs in both of them as a substring (that is, contiguously). For example, the two strings S and T below have the longest common substring "howmuchiloveyoumydearmother", of length 27.
S="yeshowmuchiloveyoumydearmotherreallyicannotbelieveit"
T="yeaphowmuchiloveyoumydearmother"
Methods for finding the LCS of N strings, each of length at most L, fall roughly into the following categories:
- Enumeration is simple but extremely inefficient. An improved variant matches each suffix of one string against all the other strings with the KMP algorithm; its time complexity is O(NL^2).
- Dynamic programming, with time complexity O(L^2) for two strings.
- Suffix array and height array, combined with binary search on the answer, with time complexity O(NL log L).
- Generalized suffix tree, with time complexity O(NL).
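The plain enumeration idea from the first item can be sketched as follows for two strings. This is only an illustrative baseline (the function name is made up here), useful mainly for cross-checking faster solutions on small inputs.

```cpp
#include <cstddef>
#include <string>

// Brute-force longest common substring of two strings: try every pair of
// start positions and extend the match as far as it goes.
// Time: O(|s| * |t| * min(|s|, |t|)).
std::string longestCommonSubstringNaive(const std::string& s, const std::string& t) {
    std::size_t bestLen = 0, bestStart = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        for (std::size_t j = 0; j < t.size(); ++j) {
            std::size_t k = 0;
            while (i + k < s.size() && j + k < t.size() && s[i + k] == t[j + k]) ++k;
            if (k > bestLen) {  // keep the first longest match found
                bestLen = k;
                bestStart = i;
            }
        }
    }
    return s.substr(bestStart, bestLen);
}
```

Called on the two example strings S and T above, it returns the 27-character string "howmuchiloveyoumydearmother".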
At the time of writing, CodeVS problem 3160 (longest common substring) provides some good test cases for this problem.
Dynamic programming
The problem can be seen as a simplified form of the longest common subsequence problem, and the usual solution is dynamic programming (DP). The reference code is as follows.
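#include <algorithm>
#include <string>
#include <vector>
using namespace std;

int LCS(string S, string T) {
    if (S.empty() || T.empty()) return 0;
    if (S.length() > T.length()) swap(S, T);  // keep the rolling rows as short as possible
    int ls = S.length(), MAX = 0;
    int now = 0, former = 1;
    // memo[now][i]: length of the longest common suffix of S[0..i] and the
    // part of T that ends at the current character t
    vector<vector<int>> memo(2, vector<int>(ls));
    for (auto t : T) {
        memo[now][0] = (t == S[0]);
        MAX = max(MAX, memo[now][0]);  // a match at S[0] also counts
        for (int i = 1; i < ls; ++i) {
            if (S[i] != t) memo[now][i] = 0;
            else MAX = max(MAX, memo[now][i] = memo[former][i - 1] + 1);
        }
        swap(now, former);
    }
    return MAX;
}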
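The reference function returns only the length. As a hedged sketch of how the same recurrence can also recover the substring itself, the variant below remembers where the best match ends; the function and variable names are illustrative and not part of the original code.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Rolling-array DP. After processing s[i-1], prev/cur[j] hold the length of
// the longest common suffix of s[0..i-1] and t[0..j-1].
std::string longestCommonSubstringDP(const std::string& s, const std::string& t) {
    std::vector<std::size_t> prev(t.size() + 1, 0), cur(t.size() + 1, 0);
    std::size_t bestLen = 0;
    std::size_t bestEnd = 0;  // one past the last matched index in s
    for (std::size_t i = 1; i <= s.size(); ++i) {
        for (std::size_t j = 1; j <= t.size(); ++j) {
            cur[j] = (s[i - 1] == t[j - 1]) ? prev[j - 1] + 1 : 0;
            if (cur[j] > bestLen) {
                bestLen = cur[j];
                bestEnd = i;
            }
        }
        std::swap(prev, cur);  // cur[0] is never written, so it stays 0
    }
    return s.substr(bestEnd - bestLen, bestLen);
}
```

When several substrings share the maximum length, this sketch returns the one that ends earliest in s.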
Suffix array algorithm
In the suffix array, SA[i] is the starting position of the suffix of rank i. The height array (LCP array) entry height[i] is the length of the longest common prefix (LCP) of the suffixes SA[i] and SA[i-1], that is, height[i] = LCP(SA[i], SA[i-1]). For details, see the Suffix Array article. To find the longest common substring of two strings S and T, join them into one new string (with a separator character that occurs in neither of them, so that no match can run across the boundary), build its suffix array, and compute the LCP array. The longest common substring is always realized by two suffixes that are adjacent in the suffix array, because the farther apart two suffixes are in sorted order, the shorter their common prefix. So it is enough to look at every adjacent pair sa[i-1], sa[i] in which one suffix starts in S and the other in T; the largest corresponding height value is the answer. The code for the LCS of two strings is given below. It uses the suffix array class SA provided in the Suffix Array article. SA has two member variables, sa and height, which hold the suffix array and the height array of the string.
int LCS(const string &S, const string &T) {
    // '#' is assumed to occur in neither S nor T
    string R = S + '#' + T;
    int len = R.size(), ls = S.size(), MAX = 0;
    // Build the suffix array; the second argument selects the construction
    // algorithm (prefix doubling, DC3, SA-IS, etc.)
    SA mySA(R, &SA::prefixDoubling);
    for (int i = 1; i < len; ++i) {
        // height[i] compares the suffixes sa[i-1] and sa[i]; count it only
        // when one of them starts in S and the other in T
        if ((mySA.sa[i - 1] < ls) != (mySA.sa[i] < ls))
            MAX = max(MAX, mySA.height[i]);
    }
    return MAX;
}
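Because the SA class comes from another article and is not shown here, the following self-contained sketch may help to experiment with the idea. It is not that class: it builds the suffix array by prefix doubling with std::sort (O(n log^2 n), chosen for brevity) and the height array with Kasai's algorithm. All function names are illustrative, and '#' is assumed to occur in neither input.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Suffix array by prefix doubling: sort suffixes by their first k, 2k, 4k... characters.
std::vector<int> buildSuffixArray(const std::string& r) {
    const int n = static_cast<int>(r.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; ++i) rank[i] = static_cast<unsigned char>(r[i]);
    for (int k = 1;; k <<= 1) {
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < n ? rank[a + k] : -1;
            int rb = b + k < n ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: done
    }
    return sa;
}

// Kasai's algorithm: height[i] = LCP of the suffixes sa[i-1] and sa[i], height[0] = 0.
std::vector<int> buildHeight(const std::string& r, const std::vector<int>& sa) {
    const int n = static_cast<int>(r.size());
    std::vector<int> pos(n), height(n, 0);
    for (int i = 0; i < n; ++i) pos[sa[i]] = i;
    for (int i = 0, h = 0; i < n; ++i) {
        if (pos[i] == 0) { h = 0; continue; }
        int j = sa[pos[i] - 1];
        while (i + h < n && j + h < n && r[i + h] == r[j + h]) ++h;
        height[pos[i]] = h;
        if (h > 0) --h;
    }
    return height;
}

// Length of the longest common substring of s and t.
int lcsBySuffixArray(const std::string& s, const std::string& t) {
    const std::string r = s + '#' + t;  // assumption: '#' occurs in neither string
    const int ls = static_cast<int>(s.size());
    const std::vector<int> sa = buildSuffixArray(r);
    const std::vector<int> height = buildHeight(r, sa);
    int best = 0;
    for (std::size_t i = 1; i < sa.size(); ++i) {
        // one suffix must start in s, the other at or after the separator
        if ((sa[i - 1] < ls) != (sa[i] < ls)) best = std::max(best, height[i]);
    }
    return best;
}
```

The separator matters: for s = "xa" and t = "bxab", plain concatenation gives "xabxab", where the suffixes at positions 0 and 3 share the prefix "xab" of length 3, although the real answer is 2.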
For the LCS of more than two strings, byvoid's article "Suffix array solution to the longest common substring problem" offers a way to think about it: the longest common substring of N strings is the maximum, over groups of suffixes that together come from all N strings, of the longest common prefix of the group. Let the N strings be S1, S2, S3, ..., SN. Join them into a single string S with N-1 separator characters, build the suffix array and height array of S, and binary search on the candidate length A.
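As a rough sketch of the N-string approach in this section, the function below joins the strings with distinct separator codes, builds a suffix array and height array, and binary searches on the candidate length, checking each candidate by scanning runs of the height array. The names (lcsOfMany, owner, feasible) and the separator encoding are assumptions made for illustration, not taken from byvoid's article, and the O(len log^2 len) suffix array construction is chosen for brevity rather than speed.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Length of the longest substring shared by every string in strs.
int lcsOfMany(const std::vector<std::string>& strs) {
    const int n = static_cast<int>(strs.size());
    if (n == 0) return 0;
    if (n == 1) return static_cast<int>(strs[0].size());

    // Concatenate; separator k gets the unique code 256 + k, so a common
    // prefix of two different suffixes can never contain a separator.
    std::vector<int> text, owner;  // owner[p]: string that position p belongs to, -1 for separators
    int shortest = static_cast<int>(strs[0].size());
    for (int k = 0; k < n; ++k) {
        for (unsigned char c : strs[k]) { text.push_back(c); owner.push_back(k); }
        if (k + 1 < n) { text.push_back(256 + k); owner.push_back(-1); }
        shortest = std::min(shortest, static_cast<int>(strs[k].size()));
    }
    if (shortest == 0) return 0;
    const int len = static_cast<int>(text.size());

    // Suffix array by prefix doubling with std::sort.
    std::vector<int> sa(len), rank(text), tmp(len);
    std::iota(sa.begin(), sa.end(), 0);
    for (int k = 1;; k <<= 1) {
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < len ? rank[a + k] : -1;
            int rb = b + k < len ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        tmp[sa[0]] = 0;
        for (int i = 1; i < len; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[len - 1]] == len - 1) break;  // rank[p] is now the position of suffix p in sa
    }

    // Kasai: height[i] = LCP of the suffixes sa[i-1] and sa[i].
    std::vector<int> height(len, 0);
    for (int i = 0, h = 0; i < len; ++i) {
        if (rank[i] == 0) { h = 0; continue; }
        int j = sa[rank[i] - 1];
        while (i + h < len && j + h < len && text[i + h] == text[j + h]) ++h;
        height[rank[i]] = h;
        if (h > 0) --h;
    }

    // A is feasible if some maximal run with height >= A contains suffixes of all n strings.
    auto feasible = [&](int a) {
        std::vector<int> seen(n, 0);  // seen[k] == runId means string k occurs in the current run
        int runId = 0, distinct = 0;
        for (int i = 0; i < len; ++i) {
            if (i == 0 || height[i] < a) { ++runId; distinct = 0; }  // a new run starts at i
            int k = owner[sa[i]];
            if (k >= 0 && seen[k] != runId) { seen[k] = runId; ++distinct; }
            if (distinct == n) return true;
        }
        return false;
    };

    int lo = 0, hi = shortest;  // length 0 is always feasible
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;
        if (feasible(mid)) lo = mid; else hi = mid - 1;
    }
    return lo;
}
```

For example, the strings "abcdefg", "xbcdey" and "zzbcdq" share "bcd" and nothing longer, so the function returns 3.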
So the problem comes down to how to verify whether a given length A is feasible. The method is to find a consecutive run Height[i..j] such that Height[k] >= A for every i <= k <= j, and such that the suffixes SA[k] for i-1 <= k <= j together come from all N original strings S1..SN. If such a run can be found, A is feasible; otherwise it is not. Concretely, scan i; when Height[i] >= A is found, extend j from i until Height[j+1] < A, then check whether the SA entries in [i-1..j] cover S1..SN. If they do, A is feasible and we return immediately. Otherwise, continue scanning from i = j+1. Each character of S is visited O(1) times, and the length of S is NL+N-1, so the verification takes O(NL) time.
Suffix automaton algorithm
First build the suffix automaton (SAM) of string S, then run T through the automaton according to the following rules.
- Use a variable lcs to record the length of the current match. Its initial value is 0, and the answer is the largest value it ever takes.
- Let the current state be p and the character to be matched be c. If p has a transition go[c], follow it and increment lcs.
- If there is no such transition, move p to its par. If there is still none, repeat; if p falls off the root without finding a transition, reset p to the root and set lcs to 0.
- If a transition on c is found after moving along par pointers, set lcs to the val of the state where it was found, plus 1, and follow the transition.
Why move to par after a mismatch? A mismatch in state p means that none of the strings represented by p (those with lengths in [min, max]), extended by c, is a substring of S. A shorter suffix of them, extended by c, may still be a substring of S, and the par pointer leads to the state that represents those shorter suffixes.
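#include <algorithm>
#include <string>
using namespace std;

#define N 200005
struct Node {
    int f, next[26], l;  // f: par (suffix link); l: val (longest length in this state)
} G[N];  // global and zero-initialized, so only one SAM can be built per run

class SAM {
    int len, last, cnt;  // len: number of states; last: current state; cnt: current match length
public:
    int MAX;
    // S must consist of lowercase letters 'a'..'z'
    SAM(string S) : len(1), last(1), cnt(0), MAX(0) {
        for (int i = 0; i < S.size(); ++i) add(S[i] - 'a');
        last = 1;  // go back to the root (state 1) before matching
    }
    void add(int ch) {
        int i, x, p;
        G[++len].l = G[last].l + 1;
        for (i = last; i && !G[i].next[ch]; i = G[i].f) G[i].next[ch] = len;
        p = G[i].next[ch], last = len, x = i;
        if (!x) { G[1].next[ch] = len, G[len].f = 1; return; }
        if (G[p].l == G[x].l + 1) G[len].f = p;
        else {
            // split p: the clone keeps the shorter strings
            G[++len].l = G[x].l + 1, G[len].f = G[p].f, G[p].f = G[len - 1].f = len;
            for (i = 0; i < 26; i++) G[len].next[i] = G[p].next[i];
            for (i = x; i && G[i].next[ch] == p; i = G[i].f) G[i].next[ch] = len;
        }
    }
    void explore(int ch) {
        if (G[last].next[ch]) ++cnt;
        else {
            for (; last && !G[last].next[ch]; last = G[last].f);
            if (!last) { last = 1, cnt = 0; return; }
            cnt = G[last].l + 1;
        }
        last = G[last].next[ch], MAX = max(MAX, cnt);
    }
};

int LCS(string S, string T) {
    SAM mySAM(S);
    for (int i = 0; i < T.size(); ++i) mySAM.explore(T[i] - 'a');
    return mySAM.MAX;
}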
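The listing above keeps its states in one global array and returns only the length. A possible alternative layout, sketched here with illustrative names (SuffixAutomaton, lcsBySam), stores the states in a std::vector so that several automata can coexist, and also returns the matched substring. It makes the same assumption that both strings contain only the lowercase letters 'a' to 'z'.

```cpp
#include <array>
#include <string>
#include <vector>

struct SuffixAutomaton {
    struct State {
        int len = 0;    // length of the longest string in this state (val)
        int link = -1;  // suffix link (par); -1 only for the root
        std::array<int, 26> next;
        State() { next.fill(-1); }
    };
    std::vector<State> st;
    int last = 0;

    explicit SuffixAutomaton(const std::string& s) {
        st.reserve(2 * s.size() + 1);
        st.emplace_back();  // state 0 is the root
        for (char ch : s) extend(ch - 'a');
    }

    void extend(int c) {
        const int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && st[p].next[c] == -1) { st[p].next[c] = cur; p = st[p].link; }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            const int q = st[p].next[c];
            if (st[q].len == st[p].len + 1) {
                st[cur].link = q;
            } else {
                const int clone = static_cast<int>(st.size());
                State copy = st[q];  // same transitions and link as q
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) { st[p].next[c] = clone; p = st[p].link; }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Longest common substring of s and t: build the automaton of s, run t through it.
std::string lcsBySam(const std::string& s, const std::string& t) {
    SuffixAutomaton sam(s);
    int state = 0, cur = 0, best = 0, bestEnd = 0;
    for (int i = 0; i < static_cast<int>(t.size()); ++i) {
        const int c = t[i] - 'a';
        while (state != 0 && sam.st[state].next[c] == -1) {
            state = sam.st[state].link;  // drop to a shorter suffix
            cur = sam.st[state].len;
        }
        if (sam.st[state].next[c] != -1) {
            state = sam.st[state].next[c];
            ++cur;
        } else {
            cur = 0;  // even the root has no transition on c
        }
        if (cur > best) { best = cur; bestEnd = i + 1; }
    }
    return t.substr(bestEnd - best, best);
}
```

Building the automaton is linear in the length of s, and the matching pass is linear in the length of t, since every move along a suffix link shortens the current match.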
Longest common substring
The original poster again? It seems this question has already been answered.
The following program does what the original poster asked for:
// Author: hacker
// Time: 9.12.2006
#include <stdio.h>
#include <string.h>
int main(void)
{
    const char *x = "aabcdabce";
    const char *y = "12abcabcdace";
    int m = strlen(x);
    int n = strlen(y);
    int i, j, k, l;
    int maxlength = 0;
    int start = 0;
    int count = 0; // number of matching characters in the current comparison
    for (i = 1; i <= n; i++) // loop over the match length
        for (j = 0; j < n - i + 1; j++) // loop over the start position in y
            for (k = 0; k < m - i + 1; k++) // loop over the start position in x
            {
                count = 0;
                for (l = 0; l < i; l++) // compare character by character; this could be optimized
                    if (y[j + l] == x[k + l])
                        count++;
                if (count == i && i > maxlength)
                {
                    maxlength = i; // record the maximum length
                    start = j; // record where the longest match starts in y
                }
            }
    if (maxlength == 0)
        printf("No Answer");
    else
        for (i = 0; i < maxlength; i++)
            printf("%c", y[start + i]);
    return 0;
}
The following program computes what is properly called the longest common subsequence, in which the matched characters need not be contiguous.
// Author: hacker
// Time: 9.12.2006
#include <stdio.h>
#include <string.h>
int b[50][50]; // b[i][j]: how c[i][j] was obtained (1 = diagonal, 2 = up, 3 = left)
int c[50][50]; // c[i][j]: longest common subsequence length of x[0..i-1] and y[0..j-1]
void lcs(char *x, int m, char *y, int n)
{
    int i;
    int j;
    for (i = 1; i <= m; i++) c[i][0] = 0;
    for (i = 1; i <= n; i++) c[0][i] = 0;
    c[0][0] = 0;
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++)
        {
            if (x[i - 1] == y[j - 1])
            {
                c[i][j] = c[i - 1][j - 1] + 1;
                b[i][j] = 1;
            }
            else if (c[i - 1][j] > c[i][j - 1])
            {
                c[i][j] = c[i - 1][j];
                b[i][j] = 2;
            }
            else
            {
                c[i][j] = c[i][j - 1];
                b[i][j] = 3;
            }
        }
}
void sho ......
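The listing is cut off at this point, and the missing function is not reconstructed here. Separately from it, the following C++ sketch shows one common way to recover an actual longest common subsequence from the table c. It is an illustration with made-up names, not the original author's code; instead of storing the direction table b, it re-derives each step from c with the same tie-breaking as the listing.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns one longest common subsequence of x and y.
std::string longestCommonSubsequence(const std::string& x, const std::string& y) {
    const std::size_t m = x.size(), n = y.size();
    // c[i][j]: LCS length of the first i characters of x and the first j of y
    std::vector<std::vector<int>> c(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j)
            c[i][j] = (x[i - 1] == y[j - 1]) ? c[i - 1][j - 1] + 1
                                             : std::max(c[i - 1][j], c[i][j - 1]);

    // Walk back from the bottom-right corner, collecting matched characters.
    std::string out;
    std::size_t i = m, j = n;
    while (i > 0 && j > 0) {
        if (x[i - 1] == y[j - 1]) {
            out.push_back(x[i - 1]);  // diagonal step
            --i;
            --j;
        } else if (c[i - 1][j] > c[i][j - 1]) {
            --i;  // step up
        } else {
            --j;  // step left
        }
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

Unlike the fixed 50 by 50 arrays in the listing, the table here is sized from the inputs.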
Longest common substring
var n, i, j, k, kk, tot, min, max, max1, l, q, ll: longint;
    a: array [1..10] of ansistring;
    p: array [1..500000] of longint;
    b: ansistring;
begin
  assign(input, 'pow.in'); reset(input);
  assign(output, 'pow.out'); rewrite(output);
  l := maxlongint;
  readln(n);
  for i := 1 to n do
  begin
    readln(a[i]);
    if length(a[i]) < l then
    begin
      l := length(a[i]);
      q := i;
    end;
  end;
  b := a[q];
  max := 0;
  for k := 1 to l do
  begin
    ll := length(b);
    p[1] := 0;
    j := 0;
    for i := 2 to ll do
    begin
      while (j > 0) and (b[j + 1] <> b[i]) do j := p[j];
      if b[j + 1] = b[i] then j := j + 1;
      p[i] := j;
    end;
    min := maxlongint;
    for kk := 1 to n do
      if kk <> q then
      begin
        max1 := 0;
        j := 0;
        for i := 1 to length(a[kk]) do
        begin
          if j > max1 then max1 := j;
          while (j > 0) and (b[j + 1] <> a[kk, i]) do j := p[j];
          if b[j + 1] = a[kk, i] then j := j + 1;
          if j = ll then
          begin
            max1 := j; break;
          end;
        end;
        if j > max1 then max1 := j;
        if max1 < min then min := max1;
      end;
    if min > max then max := min;
    if max = ll then break;
    delete(b, 1, 1);
  end;
  writeln(max);
  close(input); close(output);
end.
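The Pascal program appears to take the shortest input string and, for each of its suffixes, use KMP to find the longest prefix of that suffix that occurs in every other string. A C++ sketch of the same idea follows; the function name and the in-memory interface (a vector of strings instead of the pow.in and pow.out files) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest substring common to all strings, by KMP over the
// suffixes of the shortest string. Roughly O(L * total input length).
int lcsByKmp(const std::vector<std::string>& strs) {
    if (strs.empty()) return 0;
    std::size_t q = 0;  // index of the shortest string
    for (std::size_t i = 1; i < strs.size(); ++i)
        if (strs[i].size() < strs[q].size()) q = i;
    const std::string& base = strs[q];

    int best = 0;
    for (std::size_t start = 0; start < base.size(); ++start) {
        const std::string pat = base.substr(start);
        const int pl = static_cast<int>(pat.size());
        if (pl <= best) break;  // the remaining suffixes are too short to improve

        // Prefix function: fail[i] = longest proper border of pat[0..i].
        std::vector<int> fail(pl, 0);
        for (int i = 1, j = 0; i < pl; ++i) {
            while (j > 0 && pat[j] != pat[i]) j = fail[j - 1];
            if (pat[j] == pat[i]) ++j;
            fail[i] = j;
        }

        int common = pl;  // longest prefix of pat found in every other string so far
        for (std::size_t k = 0; k < strs.size() && common > 0; ++k) {
            if (k == q) continue;
            int longest = 0, j = 0;
            for (char ch : strs[k]) {
                while (j > 0 && pat[j] != ch) j = fail[j - 1];
                if (pat[j] == ch) ++j;
                longest = std::max(longest, j);
                if (j == pl) break;  // the whole suffix matched; it cannot get longer
            }
            common = std::min(common, longest);
        }
        best = std::max(best, common);
    }
    return best;
}
```

For the strings "abcdefg", "xbcdey" and "zzbcdq" it returns 3, the length of "bcd".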