Programmer's Programming Art, Chapter 11: Longest Common Subsequence (LCS)
0. Preface
The Programmer's Programming Art series is resuming (for the first ten chapters, see "Programmer's Programming Art, Chapters 1 ~ 10: Highlights and Summary"). Looking back at those ten chapters, some of the code is debatable: at the time the code only served to explain the algorithms, so many coding-style issues were left imperfect. From now on we will work on improving this.
I searched the internet and found countless articles explaining the LCS problem, but most of them feel unfriendly to the reader, are somewhat obscure, and have code that is not clear enough. This article tries to avoid that and aims to be accessible and detailed. In addition, chapter 3 of the Classical Algorithm Research series (III. Dynamic Programming) was written extremely poorly, so this article is also meant to make up for it. If you find any problems, please point them out.
Section 1. Problem Description
What is the longest common subsequence? If a sequence S is a subsequence of each of two or more known sequences, and is the longest of all sequences that satisfy this condition, then S is called the longest common subsequence of the known sequences.
For example, given the two sequences 1 3 4 5 5 and 2 4 5 5 7 6, their longest common subsequence is 4 5 5.
Section 2. Solutions to the LCS Problem
The most obvious algorithm for the longest common subsequence problem is exhaustive search: for each subsequence of X, check whether it is also a subsequence of Y, which decides whether it is a common subsequence of X and Y, and keep the longest common subsequence found during the checks. Once all subsequences have been checked, the longest common subsequence of X and Y is known. Each subsequence of X corresponds to a subset of the index set {1, 2, ..., m}, so X has 2^m different subsequences (Y likewise has 2^n). The exhaustive search method therefore needs exponential time (on the order of 2^m * 2^n if every subsequence of X is compared with every subsequence of Y).
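To make the cost of exhaustive search concrete, here is a minimal C++ sketch of it. The function names `is_subsequence` and `lcs_brute_force` are my own, not from the original text, and the bitmask enumeration assumes X is short (well under 64 characters, and in practice far fewer).

```cpp
#include <cstddef>
#include <string>

// Returns true if s is a subsequence of t, i.e. the characters of s
// appear in t in the same order (not necessarily contiguously).
bool is_subsequence(const std::string& s, const std::string& t) {
    std::size_t k = 0;
    for (std::size_t i = 0; i < t.size() && k < s.size(); ++i) {
        if (t[i] == s[k]) ++k;
    }
    return k == s.size();
}

// Exhaustive search: enumerate all 2^m subsequences of x with a bitmask
// and keep the longest one that is also a subsequence of y.
// Assumption: x.size() is small, otherwise the loop never finishes in practice.
std::string lcs_brute_force(const std::string& x, const std::string& y) {
    std::string best;
    const std::size_t m = x.size();
    for (unsigned long long mask = 0; mask < (1ULL << m); ++mask) {
        std::string cand;
        for (std::size_t i = 0; i < m; ++i) {
            if (mask & (1ULL << i)) cand.push_back(x[i]);
        }
        if (cand.size() > best.size() && is_subsequence(cand, y)) best = cand;
    }
    return best;
}
```

Calling `lcs_brute_force("13455", "245576")` walks through all 32 subsequences of the first string from Section 1. Each extra character of X doubles the work, which is what motivates the dynamic programming approach.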
- Dynamic programming algorithm
In fact, the longest common subsequence problem also has the optimal substructure property.
Notation:
$X_i = \langle x_1, \ldots, x_i \rangle$ is the prefix consisting of the first i characters of sequence X (1 ≤ i ≤ m).
$Y_j = \langle y_1, \ldots, y_j \rangle$ is the prefix consisting of the first j characters of sequence Y (1 ≤ j ≤ n).
Assume $Z = \langle z_1, \ldots, z_k \rangle \in LCS(X, Y)$.
If $x_m = y_n$ (the last characters are the same), it is easy to prove by contradiction that this character must be the last character of every longest common subsequence Z of X and Y (let its length be k). That is, $z_k = x_m = y_n$, and clearly $Z_{k-1} \in LCS(X_{m-1}, Y_{n-1})$: the prefix $Z_{k-1}$ of Z is a longest common subsequence of $X_{m-1}$ and $Y_{n-1}$. The problem then reduces to finding an LCS of $X_{m-1}$ and $Y_{n-1}$ (the length of LCS(X, Y) equals the length of $LCS(X_{m-1}, Y_{n-1})$ plus 1).
If $x_m \ne y_n$, it is just as easy to prove by contradiction that either $Z \in LCS(X_{m-1}, Y)$ or $Z \in LCS(X, Y_{n-1})$. At least one of $z_k \ne x_m$ and $z_k \ne y_n$ must hold: if $z_k \ne x_m$, then $Z \in LCS(X_{m-1}, Y)$; similarly, if $z_k \ne y_n$, then $Z \in LCS(X, Y_{n-1})$. The problem then reduces to finding an LCS of $X_{m-1}$ and Y and an LCS of X and $Y_{n-1}$. The length of LCS(X, Y) is max{length of $LCS(X_{m-1}, Y)$, length of $LCS(X, Y_{n-1})$}.
In the case $x_m \ne y_n$, the two subproblems of finding the length of $LCS(X_{m-1}, Y)$ and the length of $LCS(X, Y_{n-1})$ are not independent of each other: both need the length of $LCS(X_{m-1}, Y_{n-1})$. Moreover, an LCS of two sequences contains an LCS of prefixes of the two sequences, so the problem has the optimal substructure property and dynamic programming applies.
In other words, solving the LCS problem comes down to three things: 1. $LCS(X_{m-1}, Y_{n-1}) + 1$; 2. $LCS(X_{m-1}, Y)$ and $LCS(X, Y_{n-1})$; 3. max{$LCS(X_{m-1}, Y)$, $LCS(X, Y_{n-1})$}.
At this point, the dynamic programming solution to the LCS problem has been described in full. For the sake of completeness, however, I will go through it again in more detail below.
Section 3. Dynamic Programming Algorithm for LCS
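The three cases above translate almost word for word into a recursive function. The following C++ sketch is only an illustration of the recurrence (the name `lcs_len_recursive` is hypothetical): it recomputes the shared subproblems over and over, so it takes exponential time and is not meant for real inputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Length of an LCS of the prefixes x[0..i) and y[0..j), computed directly
// from the case analysis above. Exponential time: for illustration only.
int lcs_len_recursive(const std::string& x, const std::string& y,
                      std::size_t i, std::size_t j) {
    if (i == 0 || j == 0) return 0;                     // an empty prefix has LCS length 0
    if (x[i - 1] == y[j - 1])                           // last characters equal
        return lcs_len_recursive(x, y, i - 1, j - 1) + 1;
    return std::max(lcs_len_recursive(x, y, i - 1, j),  // last characters differ:
                    lcs_len_recursive(x, y, i, j - 1)); // try dropping one from either side
}
```

Both branches of the `std::max` call eventually ask for the same smaller prefixes, which is exactly the overlap between subproblems noted above.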
3.1 Structure of the longest common subsequence
The longest common subsequence has the following structural property:
Let $Z = \langle z_1, z_2, \ldots, z_k \rangle$ be a longest common subsequence of the sequences $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$. Then:
- If $x_m = y_n$, then $z_k = x_m = y_n$ and $Z_{k-1}$ is a longest common subsequence of $X_{m-1}$ and $Y_{n-1}$;
- If $x_m \ne y_n$ and $z_k \ne x_m$, then Z is a longest common subsequence of $X_{m-1}$ and Y;
- If $x_m \ne y_n$ and $z_k \ne y_n$, then Z is a longest common subsequence of X and $Y_{n-1}$.
Here $X_{m-1} = \langle x_1, x_2, \ldots, x_{m-1} \rangle$, $Y_{n-1} = \langle y_1, y_2, \ldots, y_{n-1} \rangle$, and $Z_{k-1} = \langle z_1, z_2, \ldots, z_{k-1} \rangle$.
3.2 Recursive structure of subproblems
By the optimal substructure property of the longest common subsequence problem, a longest common subsequence of $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$ can be found recursively as follows. When $x_m = y_n$, find a longest common subsequence of $X_{m-1}$ and $Y_{n-1}$ and append $x_m$ (= $y_n$) to it to obtain a longest common subsequence of X and Y. When $x_m \ne y_n$, two subproblems must be solved: find a longest common subsequence of $X_{m-1}$ and Y, and a longest common subsequence of X and $Y_{n-1}$. The longer of these two common subsequences is a longest common subsequence of X and Y.
From this recursive structure it is easy to see that the longest common subsequence problem has overlapping subproblems. For example, computing a longest common subsequence of X and Y may require computing those of X and $Y_{n-1}$ and of $X_{m-1}$ and Y. Both of these subproblems contain a common subproblem, namely computing a longest common subsequence of $X_{m-1}$ and $Y_{n-1}$.
As with the optimal evaluation order for matrix-chain products, we set up a recurrence for the optimal values of the subproblems. Let c[i, j] record the length of a longest common subsequence of the sequences $X_i$ and $Y_j$, where $X_i = \langle x_1, x_2, \ldots, x_i \rangle$ and $Y_j = \langle y_1, y_2, \ldots, y_j \rangle$. When i = 0 or j = 0, the empty sequence is the longest common subsequence of $X_i$ and $Y_j$, so c[i, j] = 0. In the other cases, the structural theorem gives the following recurrence:

$$
c[i, j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j \\
\max(c[i, j-1],\ c[i-1, j]) & \text{if } i, j > 0 \text{ and } x_i \ne y_j
\end{cases}
$$
3.3 Computing the optimal value
Using the recurrence at the end of the previous section, it is easy to write a recursive algorithm that computes c[i, j] directly, but its running time grows exponentially with the input length. The subproblem space, however, contains only Θ(mn) distinct subproblems, so computing the optimal values bottom-up with dynamic programming makes the algorithm far more efficient.
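One way to see that only Θ(mn) distinct subproblems exist is to keep the recursion but cache each answer, a technique usually called memoization (it is not the bottom-up method the text goes on to describe, just a stepping stone toward it). A minimal C++ sketch, with the hypothetical names `lcs_len_memo` and `lcs_len`:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Top-down LCS length with memoization. memo[i][j] caches the answer for
// the prefixes x[0..i) and y[0..j); -1 means "not computed yet".
int lcs_len_memo(const std::string& x, const std::string& y,
                 std::size_t i, std::size_t j,
                 std::vector<std::vector<int>>& memo) {
    if (i == 0 || j == 0) return 0;
    int& cached = memo[i][j];
    if (cached != -1) return cached;          // each (i, j) is solved at most once
    if (x[i - 1] == y[j - 1])
        cached = lcs_len_memo(x, y, i - 1, j - 1, memo) + 1;
    else
        cached = std::max(lcs_len_memo(x, y, i - 1, j, memo),
                          lcs_len_memo(x, y, i, j - 1, memo));
    return cached;
}

// Convenience wrapper that allocates the (m+1) x (n+1) cache.
int lcs_len(const std::string& x, const std::string& y) {
    std::vector<std::vector<int>> memo(x.size() + 1,
                                       std::vector<int>(y.size() + 1, -1));
    return lcs_len_memo(x, y, x.size(), y.size(), memo);
}
```

Since the cache has (m+1)(n+1) entries and each is filled in constant time apart from the recursive calls, the total work is O(mn), the same bound as the bottom-up table.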
The dynamic programming algorithm lcs_length(X, Y), which computes the length of a longest common subsequence, takes the sequences $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$ as input and outputs two arrays c[0..m, 0..n] and b[1..m, 1..n]. c[i, j] stores the length of a longest common subsequence of $X_i$ and $Y_j$, and b[i, j] records which subproblem's solution produced the value of c[i, j]; b is used later to construct the longest common subsequence. At the end, the length of a longest common subsequence of X and Y is recorded in c[m, n].
    Procedure lcs_length(X, Y);
    begin
      m := length[X];
      n := length[Y];
      for i := 1 to m do c[i, 0] := 0;
      for j := 0 to n do c[0, j] := 0;   {start at 0 so that c[0, 0] is initialized}
      for i := 1 to m do
        for j := 1 to n do
          if x[i] = y[j] then
            begin
              c[i, j] := c[i-1, j-1] + 1;
              b[i, j] := "↖";
            end
          else if c[i-1, j] ≥ c[i, j-1] then
            begin
              c[i, j] := c[i-1, j];
              b[i, j] := "↑";
            end
          else
            begin
              c[i, j] := c[i, j-1];
              b[i, j] := "←"
            end;
      return (c, b);
    end;
Each array entry takes O(1) time to compute, so algorithm lcs_length runs in O(mn) time.
3.4 Constructing the longest common subsequence
The array b computed by lcs_length can be used to quickly construct a longest common subsequence of $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$. Start at b[m, n] and follow the arrows through array b. When b[i, j] is "↖" (meaning that $x_i = y_j$ is an element of the LCS), a longest common subsequence of $X_i$ and $Y_j$ is obtained by appending $x_i$ to a longest common subsequence of $X_{i-1}$ and $Y_{j-1}$. When b[i, j] is "↑", the longest common subsequence of $X_i$ and $Y_j$ is the same as that of $X_{i-1}$ and $Y_j$. When b[i, j] is "←", the longest common subsequence of $X_i$ and $Y_j$ is the same as that of $X_i$ and $Y_{j-1}$. This method finds the elements of the LCS in reverse order.
The following algorithm LCS(b, X, i, j) prints a longest common subsequence of $X_i$ and $Y_j$ based on the contents of b. Calling LCS(b, X, length[X], length[Y]) prints a longest common subsequence of X and Y.
    Procedure LCS(b, X, i, j);
    begin
      if i = 0 or j = 0 then return;
      if b[i, j] = "↖" then
        begin
          LCS(b, X, i-1, j-1);
          print(x[i]);   {print x[i]}
        end
      else if b[i, j] = "↑" then LCS(b, X, i-1, j)
      else LCS(b, X, i, j-1);
    end;
In algorithm LCS, every recursive call decreases i or j by 1, so the algorithm runs in O(m + n) time.
For example, given the two sequences X = <A, B, C, B, D, A, B> and Y = <B, D, C, A, B, A>, the results computed by the algorithms lcs_length and LCS are shown in the figure below.
[Figure: tables c and b computed by lcs_length on X and Y]
Let me explain this figure (see Introduction to Algorithms). It shows the tables c and b computed by lcs_length on the sequences X = <A, B, C, B, D, A, B> and Y = <B, D, C, A, B, A>. The square in row i and column j contains the value of c[i, j] and the arrow given by b[i, j]. The entry 4 in c[7, 6], the bottom-right corner of the table, is the length of an LCS <B, C, B, A> of X and Y. For i, j > 0, entry c[i, j] depends only on whether $x_i = y_j$ and on the values of entries c[i-1, j], c[i, j-1], and c[i-1, j-1], all of which are computed before c[i, j]. To reconstruct the elements of an LCS, follow the b[i, j] arrows from the bottom-right corner; this path is shaded in the figure. Each "↖" on the path corresponds to an entry (highlighted) for which $x_i = y_j$ is a member of an LCS.
Therefore, based on the results shown in the figure, the program eventually outputs "B C B A".
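For readers who want to reproduce the table for this example, here is a C++ sketch of lcs_length and the arrow-following step. The struct `LcsTables`, the function `lcs_from_arrows`, and the ASCII stand-ins for the arrows ('\\' for up-left, '^' for up, '<' for left) are my own choices, not part of the original pseudocode; indices are shifted because C++ strings start at 0.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Result of the bottom-up pass: c holds LCS lengths of prefixes,
// b holds the arrows ('\\' = up-left, '^' = up, '<' = left).
struct LcsTables {
    std::vector<std::vector<int>> c;
    std::vector<std::string> b;
};

LcsTables lcs_length(const std::string& x, const std::string& y) {
    const std::size_t m = x.size(), n = y.size();
    LcsTables t;
    t.c.assign(m + 1, std::vector<int>(n + 1, 0));   // row 0 and column 0 stay 0
    t.b.assign(m + 1, std::string(n + 1, ' '));
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            if (x[i - 1] == y[j - 1]) {
                t.c[i][j] = t.c[i - 1][j - 1] + 1;
                t.b[i][j] = '\\';
            } else if (t.c[i - 1][j] >= t.c[i][j - 1]) {
                t.c[i][j] = t.c[i - 1][j];
                t.b[i][j] = '^';
            } else {
                t.c[i][j] = t.c[i][j - 1];
                t.b[i][j] = '<';
            }
        }
    }
    return t;
}

// Follow the arrows from b[i][j] back to the border, collecting matches.
std::string lcs_from_arrows(const LcsTables& t, const std::string& x,
                            std::size_t i, std::size_t j) {
    std::string out;
    while (i > 0 && j > 0) {
        if (t.b[i][j] == '\\') { out.push_back(x[i - 1]); --i; --j; }
        else if (t.b[i][j] == '^') { --i; }
        else { --j; }
    }
    return std::string(out.rbegin(), out.rend());    // matches were collected in reverse
}
```

With x = "ABCBDAB" and y = "BDCABA", the entry `c[7][6]` is 4 and following the arrows from `b[7][6]` yields BCBA. A different tie-breaking rule in the `>=` comparison would trace a different, equally long subsequence.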
3.5 Algorithm improvements
For a specific problem, an algorithm designed according to a general design strategy can often be improved further in its time and space requirements. Such improvements usually exploit special features of the specific problem.
For example, in the algorithms lcs_length and LCS, array b can be dispensed with altogether. In fact, the value of array element c[i, j] is determined by only one of the three values c[i-1, j-1], c[i-1, j], and c[i, j-1], and array element b[i, j] merely records which one it was. So in algorithm LCS we can decide on the spot, without the help of array b, which of c[i-1, j-1], c[i-1, j], and c[i, j-1] determined the value of c[i, j], at a cost of O(1) time. Since b is not needed by algorithm LCS, algorithm lcs_length does not have to store it. This saves Θ(mn) space, while lcs_length and LCS still take O(mn) and O(m + n) time respectively. However, array c still needs Θ(mn) space, so this is only an improvement in the constant factor of the space complexity.
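A short C++ sketch of this improvement follows: only the length table c is built, and each "arrow" is re-derived from c while walking back from the bottom-right corner. The names `lcs_table` and `lcs_without_b` are hypothetical, and the tie-breaking (`>=` prefers moving up) is chosen to match the pseudocode in 3.3.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Build only the length table c; no arrow table b is stored.
std::vector<std::vector<int>> lcs_table(const std::string& x, const std::string& y) {
    std::vector<std::vector<int>> c(x.size() + 1, std::vector<int>(y.size() + 1, 0));
    for (std::size_t i = 1; i <= x.size(); ++i) {
        for (std::size_t j = 1; j <= y.size(); ++j) {
            c[i][j] = (x[i - 1] == y[j - 1])
                          ? c[i - 1][j - 1] + 1
                          : std::max(c[i - 1][j], c[i][j - 1]);
        }
    }
    return c;
}

// Walk back from c[m][n], deciding each step in O(1) from c alone.
std::string lcs_without_b(const std::string& x, const std::string& y) {
    const std::vector<std::vector<int>> c = lcs_table(x, y);
    std::size_t i = x.size(), j = y.size();
    std::string out;
    while (i > 0 && j > 0) {
        if (x[i - 1] == y[j - 1]) {            // corresponds to the up-left arrow
            out.push_back(x[i - 1]);
            --i; --j;
        } else if (c[i - 1][j] >= c[i][j - 1]) {
            --i;                               // corresponds to the up arrow
        } else {
            --j;                               // corresponds to the left arrow
        }
    }
    std::reverse(out.begin(), out.end());      // characters were found back to front
    return out;
}
```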
In addition, if only the length of the longest common subsequence is needed, the space requirement of the algorithm can be reduced considerably. In fact, computing c[i, j] only uses rows i and i-1 of array c, so two rows of array space are enough to compute the length of the longest common subsequence. Further analysis reduces the space requirement to O(min(m, n)).
Section 4. Implementing LCS in Code
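A minimal sketch of the two-row idea in C++ is shown below. The function name `lcs_length_two_rows` is hypothetical; swapping the arguments so that the shorter string indexes the columns is one simple way to get rows of length min(m, n) + 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// LCS length using two rows of min(m, n) + 1 integers each.
int lcs_length_two_rows(std::string x, std::string y) {
    if (x.size() < y.size()) std::swap(x, y);          // make y the shorter string
    std::vector<int> prev(y.size() + 1, 0);            // row i-1 of the table
    std::vector<int> cur(y.size() + 1, 0);             // row i of the table
    for (std::size_t i = 1; i <= x.size(); ++i) {
        for (std::size_t j = 1; j <= y.size(); ++j) {
            cur[j] = (x[i - 1] == y[j - 1])
                         ? prev[j - 1] + 1
                         : std::max(prev[j], cur[j - 1]);
        }
        std::swap(prev, cur);                          // row i becomes the "previous" row
    }
    return prev[y.size()];                             // last completed row
}
```

The price of this saving is that the subsequence itself can no longer be recovered by walking back through the table, since the earlier rows are gone.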
The following describes the dynamic programming computation of the longest common subsequence, taking two sequences X and Y as an example.
Let the two-dimensional array f[i][j] denote the length of the longest common subsequence of the first i elements of X and the first j elements of Y. Then:
- f[1][1] = same(1, 1)
- f[i][j] = max{f[i-1][j-1] + same(i, j), f[i-1][j], f[i][j-1]}
Here same(a, b) is 1 when the a-th element of X and the b-th element of Y are identical, and 0 otherwise.
The largest value in f is then the length of the longest common subsequence of X and Y, and backtracking through the array recovers the subsequence itself.
Both the space and the time complexity of this algorithm are O(n^2). With optimization the space complexity can be reduced to O(n), and for favorable inputs (for example, sequences with few matching pairs of positions) the time complexity can be brought down to about O(n log n).
The following is the Java code of this algorithm:
    import java.util.Random;

    public class LCS {
        public static void main(String[] args) {

            // set the string lengths
            int substringLength1 = 20;
            int substringLength2 = 20; // choose any sizes you like

            // randomly generate the two strings
            String x = getRandomStrings(substringLength1);
            String y = getRandomStrings(substringLength2);

            long startTime = System.nanoTime();
            // opt[i][j] records the LCS length of the suffixes x[i..] and y[j..]
            int[][] opt = new int[substringLength1 + 1][substringLength2 + 1];

            // dynamic programming: solve all subproblems, from the ends of the strings backwards
            for (int i = substringLength1 - 1; i >= 0; i--) {
                for (int j = substringLength2 - 1; j >= 0; j--) {
                    if (x.charAt(i) == y.charAt(j))
                        opt[i][j] = opt[i + 1][j + 1] + 1; // see the formula above
                    else
                        opt[i][j] = Math.max(opt[i + 1][j], opt[i][j + 1]); // see the formula above
                }
            }

            /*
             * To understand the loop above, recall the formula given earlier.
             * If c[i, j] is the LCS length of the strings Xi and Yj,
             * then c[i, j] can be computed recursively:
             *
             *            / 0                            if i < 0 or j < 0
             * c[i, j] = {  c[i-1, j-1] + 1              if i, j >= 0 and xi = yj
             *            \ max(c[i, j-1], c[i-1, j])    if i, j >= 0 and xi != yj
             *
             * The loop applies the same recurrence to suffixes instead of prefixes.
             */

            System.out.println("substring1:" + x);
            System.out.println("substring2:" + y);
            System.out.print("LCS:");

            // walk forward through opt, printing each matched character
            int i = 0, j = 0;
            while (i < substringLength1 && j < substringLength2) {
                if (x.charAt(i) == y.charAt(j)) {
                    System.out.print(x.charAt(i));
                    i++;
                    j++;
                } else if (opt[i + 1][j] >= opt[i][j + 1])
                    i++;
                else
                    j++;
            }
            long endTime = System.nanoTime();
            System.out.println(" Total time is " + (endTime - startTime) + " ns");
        }

        // return a random lowercase string of the given length
        public static String getRandomStrings(int length) {
            StringBuffer buffer = new StringBuffer("abcdefghijklmnopqrstuvwxyz");
            StringBuffer sb = new StringBuffer();
            Random r = new Random();
            int range = buffer.length();
            for (int i = 0; i < length; i++) {
                sb.append(buffer.charAt(r.nextInt(range)));
            }
            return sb.toString();
        }
    }
Section 5. A Popular Interview Topic
Many interviews probe exactly this kind of detail, and the LCS problem described in this article often appears in interviews at major companies. Consider the following question:
56. Longest common substring
Question: If all the characters of string 1 appear in string 2 in the same order as in string 1, then string 1 is called a substring of string 2. Note that the characters of the substring (string 1) are not required to appear consecutively in string 2. Write a function that takes two strings, computes their longest common substring, and prints it.
For example, given the two strings BDCABA and ABCBDAB, both BCBA and BDAB are longest common substrings of them, so output the length 4 and print any one of them.
Analysis: finding the longest common subsequence (LCS) is a classic dynamic programming problem, so companies that value algorithms, such as MicroStrategy, have used it as an interview question.
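The question only asks for any one answer, but since the example mentions two of them, it can be instructive to list every distinct one. The C++ sketch below is an extension I am adding for illustration, not part of the interview question: `all_lcs` and `collect` are hypothetical names, and the recursion follows every branch of the table that keeps the optimal length, so it can take exponential time when there are many answers.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Table = std::vector<std::vector<int>>;

// Collect every distinct LCS of the prefixes x[0..i) and y[0..j).
std::set<std::string> collect(const std::string& x, const std::string& y,
                              const Table& c, std::size_t i, std::size_t j) {
    if (i == 0 || j == 0) return std::set<std::string>{std::string()};
    std::set<std::string> out;
    if (x[i - 1] == y[j - 1]) {
        // Every LCS here ends with this character (case 1 of the structure theorem).
        for (const std::string& s : collect(x, y, c, i - 1, j - 1))
            out.insert(s + x[i - 1]);
        return out;
    }
    if (c[i - 1][j] >= c[i][j - 1]) {          // moving up keeps the optimal length
        std::set<std::string> up = collect(x, y, c, i - 1, j);
        out.insert(up.begin(), up.end());
    }
    if (c[i][j - 1] >= c[i - 1][j]) {          // moving left keeps the optimal length
        std::set<std::string> left = collect(x, y, c, i, j - 1);
        out.insert(left.begin(), left.end());
    }
    return out;
}

std::set<std::string> all_lcs(const std::string& x, const std::string& y) {
    Table c(x.size() + 1, std::vector<int>(y.size() + 1, 0));
    for (std::size_t i = 1; i <= x.size(); ++i)
        for (std::size_t j = 1; j <= y.size(); ++j)
            c[i][j] = (x[i - 1] == y[j - 1]) ? c[i - 1][j - 1] + 1
                                             : std::max(c[i - 1][j], c[i][j - 1]);
    return collect(x, y, c, x.size(), y.size());
}
```

For the strings BDCABA and ABCBDAB from the example, the returned set contains both BCBA and BDAB.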
OK. The problem has already been analyzed thoroughly above and code has been given, so here I simply add a C/C++ version of the LCS solution, shown below (if you find any problems or errors, please point them out):
    #include <iostream>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #define LEN 501
    using namespace std;

    int b[LEN][LEN];   // arrow table: 1 = up-left, 2 = up, 3 = left

    void LCS(int i, int j, char x[], int b[][LEN]);

    // fills b and returns the LCS length of x and y
    int LCSLength(char x[], char y[], int b[][LEN]) {
        int lenx = strlen(x), leny = strlen(y);
        // static: roughly 1 MB, too large for some stacks; row 0 and column 0 stay zero
        static int c[LEN][LEN];
        for (int i = 1; i <= lenx; i++) {
            for (int j = 1; j <= leny; j++) {
                if (x[i-1] == y[j-1]) {
                    c[i][j] = c[i-1][j-1] + 1;
                    b[i][j] = 1;
                }
                else {
                    if (c[i-1][j] >= c[i][j-1]) {
                        c[i][j] = c[i-1][j];
                        b[i][j] = 2;
                    }
                    else {
                        c[i][j] = c[i][j-1];
                        b[i][j] = 3;
                    }
                }
            }
        }
        return c[lenx][leny];
    }

    int main() {
        char x[LEN], y[LEN];
        // %500s keeps each input within the LEN-sized buffers
        while (scanf("%500s%500s", x, y) == 2) {
            printf("%d\n", LCSLength(x, y, b));
            LCS(strlen(x), strlen(y), x, b);
            printf("\n");
        }
        return 0;
    }

    // prints one LCS by following the arrows in b back from (i, j)
    void LCS(int i, int j, char x[], int b[][LEN]) {
        if (i == 0 || j == 0) return;
        if (b[i][j] == 1) {
            LCS(i-1, j-1, x, b);
            printf("%c", x[i-1]);
        }
        else if (b[i][j] == 2) LCS(i-1, j, x, b);
        else LCS(i, j-1, x, b);
    }
Section 6. An Improved Algorithm
Next we look at a new method for solving the longest common subsequence problem. Unlike dynamic programming, this algorithm converts the problem of finding the common subsequence into the problem of computing a matrix L(p, m). While computing the matrix elements, two facts are used:
(1) when i < k, L(k, i) = null;
(2) when L(k, i) = k, L(k, i+1) = L(k, i+2) = … = L(k, m) = k.
The elements are computed row by row, and the loop exits when row p+1 is found to be entirely null. Once the matrix L(k, m) has been obtained, B[L(1, m-p+1)] B[L(2, m-p+2)] … B[L(p, m)] is the LCS of A and B, where p is the length of the LCS.
6.1 Main definitions and theorems
- Definition 1 (subsequence): Given a string A = A[1]A[2]…A[m] (A[i] is the i-th letter of A, A[i] ∈ character set Σ, 1 <= i <= m, where m = |A| is the length of A), a string B is a subsequence of A if $B = A[i_1]A[i_2] \ldots A[i_k]$ with $i_1 < i_2 < \ldots < i_k$ and k <= m.
- Definition 2 (common subsequence): Given strings A, B, and C, C is called a common subsequence of A and B if C is both a subsequence of A and a subsequence of B.
- Definition 3 (longest common subsequence, LCS): Given strings A, B, and C, C is called a longest common subsequence of A and B if C is a common subsequence of A and B and every common subsequence D of A and B satisfies |D| <= |C|. Given strings A and B with |A| = m and |B| = n, assume without loss of generality that m <= n. The LCS problem is to find an LCS of A and B.
- Definition 4: Given the strings A = A[1]A[2]…A[m] and B = B[1]B[2]…B[n], A(1:i) denotes the prefix A[1]A[2]…A[i] of A, and similarly B(1:j) denotes the prefix B[1]B[2]…B[j] of B. $L_i(k)$ denotes the smallest j among all prefixes B(1:j) that have an LCS of length k with A(1:i). As a formula, $L_i(k) = \min \{ j : |LCS(A(1:i), B(1:j))| = k \}$ [3].
Theorem 1: For all $i \in [1, m]$, $L_i(1) < L_i(2) < L_i(3) < \ldots < L_i(m)$.
Theorem 2: For all $i \in [1, m-1]$ and all $k \in [1, m]$, $L_{i+1}(k) \le L_i(k)$.
Theorem 3: For all $i \in [1, m-1]$ and all $k \in [1, m-1]$, $L_i(k) < L_{i+1}(k+1)$.
None of the three theorems above considers the case where $L_i(k)$ is undefined.
Theorem 4 [3]: If $L_{i+1}(k)$ exists, its value must be $L_{i+1}(k) = \min(j, L_i(k))$, where j is the smallest integer satisfying A[i+1] = B[j] and $j > L_i(k-1)$.
The matrix element L(k, i) = $L_i(k)$, where 1 <= i <= m and 1 <= k <= m; null indicates that L(k, i) does not exist. When i < k, L(k, i) clearly does not exist.
Let $p = \max \{ k : L(k, m) \ne null \}$. According to the paper, it can be proved that the diagonal of the L matrix ending at L(p, m), namely L(1, m-p+1), L(2, m-p+2), …, L(p-1, m-1), L(p, m), corresponds to the subsequence B[L(1, m-p+1)] B[L(2, m-p+2)] … B[L(p, m)], which is the LCS of A and B, and p is the length of the LCS. In this way, solving the LCS problem is transformed into computing the m × m matrix L.
6.2 Algorithm idea
According to the theorems, the first step is to compute the elements of the first row, L(1, 1), L(1, 2), …, L(1, m); the second step is to compute the second row; and so on, until row p+1 is found to be entirely null. Two facts are used along the way: when i < k, L(k, i) = null, and when L(k, i) = k, L(k, i+1) = L(k, i+2) = … = L(k, m) = k. In this way each row takes O(n) time and the whole computation takes O(pn) time. The entire matrix does not need to be stored while computing L; only the current row and the previous row are needed, so the space complexity is O(m + n).
Here is an example. Given the strings A = acdabbc and B = cddbacaba (m = |A| = 7, n = |B| = 9), the recurrence given by the theorems yields the L matrix of A and B shown in Figure 2, where $ represents null.
[Figure 2: the L matrix of A and B]
Then the LCS of A and B is B[1]B[2]B[4]B[6] = cdbc, and the length of the LCS is 4.
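Since the figure with the L matrix is not available here, the diagonal entries of this example can be recomputed with the C++ sketch below. Note the assumptions: it fills L straight from Definition 4 by scanning the ordinary dynamic programming table c, not with the row-by-row method of this section, and it uses 0 in place of null. The function name `l_matrix` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// L[k][i] = smallest j such that A(1:i) and B(1:j) have an LCS of length k,
// or 0 if no such j exists (0 plays the role of "null").
// Computed from Definition 4 via the usual table c, for checking purposes only.
std::vector<std::vector<int>> l_matrix(const std::string& a, const std::string& b) {
    const std::size_t m = a.size(), n = b.size();
    std::vector<std::vector<int>> c(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j)
            c[i][j] = (a[i - 1] == b[j - 1]) ? c[i - 1][j - 1] + 1
                                             : std::max(c[i - 1][j], c[i][j - 1]);

    std::vector<std::vector<int>> L(m + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const int k = c[i][j];                 // k never exceeds min(i, j) <= m
            // c[i][j] grows by at most 1 as j grows, so the first j that
            // reaches k is the minimum j required by Definition 4.
            if (k >= 1 && L[k][i] == 0) L[k][i] = static_cast<int>(j);
        }
    }
    return L;
}
```

For A = acdabbc and B = cddbacaba this gives L(1, 4) = 1, L(2, 5) = 2, L(3, 6) = 4, and L(4, 7) = 6, matching B[1]B[2]B[4]B[6] = cdbc above. The sketch does not validate the diagonal read-out in general: for A = abc and B = bab, the same definition gives L(1, 2) = 1 and L(2, 3) = 3, so the diagonal selects B[1]B[3] = bb, which is not a subsequence of A. It is therefore worth checking the extracted string against both inputs.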
6.3 Algorithm pseudocode
Algorithm L(A, B, L)
Input: strings A and B of lengths m and n respectively
Output: the longest common subsequence LCS of A and B
    L(A, B, L) {                          // strings A and B, and the matrix L to compute
        for (k = 1; k <= m; k++) {        // m is the length of A
            for (i = 1; i <= m; i++) {
                if (i < k) L[k][i] = N;   // when i < k, L(k, i) = null; N stands for infinity
                if (L[k][i] == k) {       // when L(k, i) = k, L(k, i+1) = L(k, i+2) = ... = L(k, m) = k
                    for (l = i + 1; l <= m; l++)
                        L[k][l] = k;
                    break;
                }
                for (j = 1; j <= n; j++) {    // implementation of Theorem 4
                    if (A[i+1] == B[j] && j > L[k-1][i]) {
                        L[k][i+1] = (j < L[k][i] ? j : L[k][i]);
                        break;
                    }
                }
                if (L[k][i+1] == 0)
                    L[k][i] = N;
            }
            if (L[k][m] == N) {
                p = k - 1;
                break;
            }
        }
        p = k - 1;
    }
6.4 Conclusion
This section described a new method for solving the longest common subsequence problem that differs from dynamic programming. It aims to increase the speed of sequence matching without affecting accuracy: the matrix is computed using the theorem $L_{i+1}(k) = \min(j, L_i(k))$, and the most time-consuming part of the computation, solving for L(p, m), is optimized with additional conditions. Tests on an Intel(R) Core(TM) 2 Quad dual-core processor with 1 GB of memory running Windows XP are reported to show that, compared with other classic alignment algorithms, the algorithm not only produces accurate results but is also considerably faster (this section is based on a paper by Ms. Liu Jiamei).
If you find any problems, please point them out. Thank you.