Common subsequences and common substrings

Source: Internet
Author: User

1. Longest common subsequence problem

There are many write-ups on the longest common subsequence (LCS) problem on the Internet. Most of them are similar to one another and hard to follow. Here is a link to one that I personally find clear and easy to understand.

http://blog.csdn.net/v_july_v/article/details/6695482

Here I briefly explain my own understanding. Finding a common subsequence is a very common problem. The worst method is brute force: first enumerate every subsequence of the shorter string, then check the candidates against the other string from the longest to the shortest, stopping at the first one that is also a subsequence of it. Even with this pruning, the algorithm is very inefficient, because a string of length n has 2^n subsequences.
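The brute-force idea can be sketched as follows. This is only an illustration under stated assumptions, not the method of the linked article: the function names is_subsequence and lcs_brute_force are made up here, the shorter string is assumed to have at most 20 characters so that a bitmask can enumerate its subsequences, and for simplicity the candidates are scanned in bitmask order while keeping the longest match, instead of strictly from longest to shortest.

```c
#include <string.h>

/* Returns 1 if sub[0..sub_len) is a subsequence of text, otherwise 0. */
int is_subsequence(const char *sub, int sub_len, const char *text)
{
    int k = 0;
    for (const char *p = text; *p != '\0' && k < sub_len; p++) {
        if (*p == sub[k]) {
            k++;
        }
    }
    return k == sub_len;
}

/*
 * Brute force: build every non-empty subsequence of `shorter` from a bitmask
 * and keep the longest one that is also a subsequence of `longer`.
 * `best` must have room for strlen(shorter) + 1 characters.
 * Returns the length found, or -1 if `shorter` has more than 20 characters
 * (an arbitrary limit chosen for this sketch).
 */
int lcs_brute_force(const char *shorter, const char *longer, char *best)
{
    int n = (int)strlen(shorter);
    int best_len = 0;
    char candidate[21];

    best[0] = '\0';
    if (n > 20) {
        return -1;
    }
    for (unsigned int mask = 1; mask < (1u << n); mask++) {
        int len = 0;
        for (int i = 0; i < n; i++) {
            if (mask & (1u << i)) {
                candidate[len++] = shorter[i];
            }
        }
        /* Only test candidates that would improve on the best so far. */
        if (len > best_len && is_subsequence(candidate, len, longer)) {
            memcpy(best, candidate, (size_t)len);
            best[len] = '\0';
            best_len = len;
        }
    }
    return best_len;
}
```

The outer loop runs 2^n - 1 times for a shorter string of length n, which is why this approach is only usable for very short inputs.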

Of course, the most popular algorithm is dynamic programming. The core of dynamic programming is to find the state transition equation, which reduces a complex problem to smaller subproblems.

  • Dynamic programming algorithm

In fact, the longest common subsequence problem has optimal substructure.

Notation:

X_i = <x_1, ..., x_i> is the prefix made of the first i characters of the sequence X (1 ≤ i ≤ m).

Y_j = <y_1, ..., y_j> is the prefix made of the first j characters of the sequence Y (1 ≤ j ≤ n).

Assume Z = <z_1, ..., z_k> ∈ LCS(X, Y), that is, Z is one of the longest common subsequences of X and Y.

  • If x_m = y_n (the last characters are the same), it is not difficult to prove by contradiction that this character must be the last character of every longest common subsequence Z of X and Y (with length k), that is, z_k = x_m = y_n. It follows that Z_{k-1} ∈ LCS(X_{m-1}, Y_{n-1}): the prefix Z_{k-1} is a longest common subsequence of X_{m-1} and Y_{n-1}. The problem then reduces to finding an LCS of X_{m-1} and Y_{n-1}, and the length of LCS(X, Y) equals the length of LCS(X_{m-1}, Y_{n-1}) plus 1.

  • If x_m ≠ y_n, it is not difficult to prove by contradiction that either Z ∈ LCS(X_{m-1}, Y) or Z ∈ LCS(X, Y_{n-1}). At least one of z_k ≠ x_m and z_k ≠ y_n must hold: if z_k ≠ x_m, then Z ∈ LCS(X_{m-1}, Y); similarly, if z_k ≠ y_n, then Z ∈ LCS(X, Y_{n-1}). The problem then reduces to finding an LCS of X_{m-1} and Y and an LCS of X and Y_{n-1}, and the length of LCS(X, Y) is max{length of LCS(X_{m-1}, Y), length of LCS(X, Y_{n-1})}.

When x_m ≠ y_n, finding the length of LCS(X_{m-1}, Y) and finding the length of LCS(X, Y_{n-1}) are not independent problems: both may require the length of LCS(X_{m-1}, Y_{n-1}). In addition, an LCS of two sequences contains an LCS of prefixes of the two sequences. The problem therefore has overlapping subproblems as well as optimal substructure, so dynamic programming applies.

That is to say, solving this LCS problem needs three things: 1. LCS(X_{m-1}, Y_{n-1}) + 1; 2. LCS(X_{m-1}, Y) and LCS(X, Y_{n-1}); 3. max{LCS(X_{m-1}, Y), LCS(X, Y_{n-1})}.

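So the state transition equation for this problem is as follows, where LCS(X_i, Y_j) stands for the length of the longest common subsequence of the two prefixes:

If i = 0 or j = 0:  LCS(X_i, Y_j) = 0
If x_m = y_n:       LCS(X_m, Y_n) = LCS(X_{m-1}, Y_{n-1}) + 1
If x_m ≠ y_n:       LCS(X_m, Y_n) = max{LCS(X_{m-1}, Y_n), LCS(X_m, Y_{n-1})}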
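The transition equation can be turned almost literally into a recursive function with memoization. The sketch below is only an illustration: the names lcs_prefix and lcs_length_memo and the limit MAX_LEN are assumptions made for this example, and inputs longer than MAX_LEN are rejected.

```c
#include <string.h>

#define MAX_LEN 100

/* memo[i][j] caches the LCS length of x[0..i) and y[0..j); -1 = not computed. */
static int memo[MAX_LEN + 1][MAX_LEN + 1];

/* LCS length of the prefixes X_i and Y_j, following the transition equation. */
static int lcs_prefix(const char *x, const char *y, int i, int j)
{
    if (i == 0 || j == 0) {
        return 0;                       /* an empty prefix has an empty LCS */
    }
    if (memo[i][j] >= 0) {
        return memo[i][j];              /* overlapping subproblem, reuse it */
    }
    int result;
    if (x[i - 1] == y[j - 1]) {
        result = lcs_prefix(x, y, i - 1, j - 1) + 1;
    } else {
        int a = lcs_prefix(x, y, i - 1, j);
        int b = lcs_prefix(x, y, i, j - 1);
        result = a > b ? a : b;
    }
    memo[i][j] = result;
    return result;
}

/* Returns the LCS length of x and y, or -1 if either is longer than MAX_LEN. */
int lcs_length_memo(const char *x, const char *y)
{
    size_t m = strlen(x);
    size_t n = strlen(y);
    if (m > MAX_LEN || n > MAX_LEN) {
        return -1;
    }
    for (size_t i = 0; i <= m; i++) {
        for (size_t j = 0; j <= n; j++) {
            memo[i][j] = -1;
        }
    }
    return lcs_prefix(x, y, (int)m, (int)n);
}
```

For the two strings used in this article, lcs_length_memo("abcbdab", "bdcaba") returns 4. The cache is what keeps the shared subproblem LCS(X_{m-1}, Y_{n-1}) from being recomputed on both branches.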

The code is as follows:

#include <stdio.h>
#include <string.h>

/*
 * c[i][j] stores the LCS length of the first i characters of str1
 * and the first j characters of str2:
 *   if str1[i-1] == str2[j-1]:  c[i][j] = c[i-1][j-1] + 1
 *   if str1[i-1] != str2[j-1]:  c[i][j] = max{c[i-1][j], c[i][j-1]}
 */
int LCS(char *str1, char *str2, int len1, int len2, int c[100][100])
{
    if (str1 == NULL || str2 == NULL) {
        return -1; // input string error
    }
    if (len1 >= 100 || len2 >= 100) {
        return -1; // the DP table only has room for indices 0..99
    }

    // initialize the two-dimensional DP array
    for (int i = 0; i <= len1; i++) {
        for (int j = 0; j <= len2; j++) {
            c[i][j] = 0;
        }
    }

    // DP operation
    for (int i = 1; i <= len1; i++) {
        for (int j = 1; j <= len2; j++) {
            if (str1[i - 1] == str2[j - 1]) {
                c[i][j] = c[i - 1][j - 1] + 1;
            } else {
                c[i][j] = c[i - 1][j] > c[i][j - 1] ? c[i - 1][j] : c[i][j - 1];
            }
        }
    }

    // print the content stored in the DP array
    for (int i = 0; i <= len1; i++) {
        for (int j = 0; j <= len2; j++) {
            printf("%d ", c[i][j]);
        }
        printf("\n");
    }

    // print out one longest common subsequence by walking back through the table
    char str[100] = {0};
    int index = c[len1][len2] - 1;
    for (int i = len1, j = len2; i > 0 && j > 0;) {
        if (str1[i - 1] == str2[j - 1]) {
            str[index--] = str1[i - 1];
            i--;
            j--;
        } else {
            if (c[i][j - 1] > c[i - 1][j]) {
                j--;
            } else {
                i--;
            }
        }
    }
    printf("Common subsequence: %s\n", str);

    return c[len1][len2];
}

int main(int argc, char **argv)
{
    char str1[] = {"abcbdab"};
    char str2[] = {"bdcaba"};
    int c[100][100];
    int len1 = strlen(str1);
    int len2 = strlen(str2);

    int num = LCS(str1, str2, len1, len2, c);
    printf("Length of the longest common subsequence: %d\n", num);
    return 0;
}
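If only the length is needed and the subsequence itself does not have to be printed, the full table is not required: each row of the table depends only on the row above it. The following is a minimal sketch of that variant, not part of the original code. The name lcs_length_two_rows is made up for this example, and the rows are allocated on the heap so there is no fixed limit of 100 characters.

```c
#include <stdlib.h>
#include <string.h>

/*
 * LCS length using two rows of the DP table instead of the whole table.
 * Returns -1 if memory allocation fails.
 */
int lcs_length_two_rows(const char *x, const char *y)
{
    size_t m = strlen(x);
    size_t n = strlen(y);
    int *prev = calloc(n + 1, sizeof *prev);   /* row i-1, starts as row 0 */
    int *curr = calloc(n + 1, sizeof *curr);   /* row i */
    int result = -1;

    if (prev != NULL && curr != NULL) {
        for (size_t i = 1; i <= m; i++) {
            curr[0] = 0;
            for (size_t j = 1; j <= n; j++) {
                if (x[i - 1] == y[j - 1]) {
                    curr[j] = prev[j - 1] + 1;
                } else {
                    curr[j] = prev[j] > curr[j - 1] ? prev[j] : curr[j - 1];
                }
            }
            /* The row just computed becomes the "previous" row. */
            int *tmp = prev;
            prev = curr;
            curr = tmp;
        }
        result = prev[n];
    }
    free(prev);
    free(curr);
    return result;
}
```

The trade-off is that the walk back through the table, which recovers the subsequence itself, is no longer possible, because the earlier rows have been overwritten.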


2. Longest common substring

First, distinguish a common substring from a common subsequence. A common subsequence only has to keep its characters in the same order in both strings; the characters do not have to be consecutive. A common substring must be contiguous. For example:

 

ABCBDAB
BDCABA

A longest common subsequence is BCBA.

A longest common substring is AB (BD has the same length).
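The difference can be checked with two small helper functions. They are written only for illustration (the names is_subsequence_of and is_substring_of are assumptions of this sketch); the substring test relies on the standard library function strstr.

```c
#include <string.h>

/* Returns 1 if the characters of cand appear in text in the same order,
   not necessarily next to each other. */
int is_subsequence_of(const char *cand, const char *text)
{
    while (*cand != '\0' && *text != '\0') {
        if (*cand == *text) {
            cand++;          /* matched one character, move to the next */
        }
        text++;
    }
    return *cand == '\0';
}

/* Returns 1 if cand appears in text as one contiguous block. */
int is_substring_of(const char *cand, const char *text)
{
    return strstr(text, cand) != NULL;
}
```

With the example strings, BCBA passes is_subsequence_of for both ABCBDAB and BDCABA but fails is_substring_of, while AB passes both checks for both strings.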

Finding the longest common substring is a little easier than finding the longest common subsequence. As with the subsequence problem, you can use brute force: enumerate all substrings of the shorter string, then use a string matching algorithm such as KMP to search for them in the other string from the longest to the shortest, which also acts as pruning. However, brute-force matching is always relatively inefficient; the better way is dynamic programming.
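A minimal sketch of this brute-force search is shown below. Note the substitution: the standard library function strstr is used in place of a hand-written KMP matcher, so the matching step here is not KMP itself. The function name longest_common_substring_brute and its parameters are assumptions made for this example.

```c
#include <string.h>

/*
 * Tries every substring of `shorter`, longest first, and stops at the first
 * one that occurs in `longer`. The substring is copied into `out`.
 * Returns its length, 0 if the strings share no character, or -1 if `out`
 * (of size out_size) cannot hold a substring of strlen(shorter) characters.
 */
int longest_common_substring_brute(const char *shorter, const char *longer,
                                   char *out, size_t out_size)
{
    size_t n = strlen(shorter);

    if (out_size < n + 1) {
        return -1;
    }
    for (size_t len = n; len >= 1; len--) {
        for (size_t start = 0; start + len <= n; start++) {
            memcpy(out, shorter + start, len);
            out[len] = '\0';
            if (strstr(longer, out) != NULL) {
                return (int)len;   /* longest first, so the first hit wins */
            }
        }
    }
    out[0] = '\0';
    return 0;
}
```

Because lengths are tried from longest to shortest, the search can stop at the first match. In the worst case it still examines on the order of n^2 substrings of the shorter string, each with its own search in the longer one.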

Building on the dynamic programming method for the longest common subsequence above, we can see that the two problems are very similar. Only the state transition equation is slightly different.

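In fact, the longest common substring problem also has optimal substructure, provided the subproblem is defined on common substrings that end at a given pair of positions.

Notation:

X_i = <x_1, ..., x_i> is the prefix made of the first i characters of the sequence X (1 ≤ i ≤ m).

Y_j = <y_1, ..., y_j> is the prefix made of the first j characters of the sequence Y (1 ≤ j ≤ n).

In this section, LCS(X_i, Y_j) denotes the length of the longest common substring that ends exactly at x_i and y_j, that is, the longest common suffix of X_i and Y_j. The longest common substring of X and Y is then the maximum of LCS(X_i, Y_j) over all i and j.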

  • If x_m = y_n (the last characters are the same), the longest common substring ending at x_m and y_n consists of this character preceded by the longest common substring ending at x_{m-1} and y_{n-1}. The problem reduces to the prefixes X_{m-1} and Y_{n-1}, and LCS(X_m, Y_n) = LCS(X_{m-1}, Y_{n-1}) + 1.

  • The important difference is:
    • If x_m ≠ y_n, no common substring can end at both x_m and y_n, because the run of matching characters is broken. LCS(X_m, Y_n) goes back to 0, and the search for the longest common substring starts over from the next match.

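Therefore, the state transition equation for the longest common substring is:

If x_m = y_n:  LCS(X_m, Y_n) = LCS(X_{m-1}, Y_{n-1}) + 1
If x_m ≠ y_n:  LCS(X_m, Y_n) = 0

The length of the longest common substring is the largest value that appears anywhere in the table, not the value in the last cell.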

The code is as follows:

#include <stdio.h>
#include <string.h>

/*
 * c[i][j] stores the length of the longest common substring that ends at
 * str1[i-1] and str2[j-1]:
 *   if str1[i-1] == str2[j-1]:  c[i][j] = c[i-1][j-1] + 1
 *   if str1[i-1] != str2[j-1]:  c[i][j] = 0
 */
int LCS(char *str1, char *str2, int len1, int len2, int c[100][100])
{
    if (str1 == NULL || str2 == NULL) {
        return -1; // input string error
    }
    if (len1 >= 100 || len2 >= 100) {
        return -1; // the DP table only has room for indices 0..99
    }

    // initialize the two-dimensional DP array
    for (int i = 0; i <= len1; i++) {
        for (int j = 0; j <= len2; j++) {
            c[i][j] = 0;
        }
    }

    // DP operation; max starts at 0 so that "no common character" yields 0
    int max = 0;
    int col = 0, row = 0;
    for (int i = 1; i <= len1; i++) {
        for (int j = 1; j <= len2; j++) {
            if (str1[i - 1] == str2[j - 1]) {
                c[i][j] = c[i - 1][j - 1] + 1;
                if (c[i][j] > max) {
                    row = i;
                    col = j;
                    max = c[i][j];
                }
            } else {
                c[i][j] = 0;
            }
        }
    }

    // print the content stored in the DP array
    for (int i = 0; i <= len1; i++) {
        for (int j = 0; j <= len2; j++) {
            printf("%d ", c[i][j]);
        }
        printf("\n");
    }

    // print out the longest common substring, which ends at str1[row-1]
    printf("Longest common substring: ");
    for (int i = row - max; i < row; i++) {
        printf("%c", str1[i]);
    }
    printf("\n");

    return max;
}

int main(int argc, char **argv)
{
    char str1[] = {"abcbdab"};
    char str2[] = {"bdcaba"};
    int c[100][100];
    int len1 = strlen(str1);
    int len2 = strlen(str2);

    printf("String 1: %s\n", str1);
    printf("String 2: %s\n", str2);

    int num = LCS(str1, str2, len1, len2, c);
    printf("Length of the longest common substring: %d\n", num);
    return 0;
}
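Since each cell depends only on the cell diagonally above it, the same recurrence can be evaluated with two rows instead of a full table. The sketch below is an illustration under stated assumptions, not part of the original code: the name longest_common_substring_rows and the output parameter end_in_x are made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Longest common substring with two rows of the DP table.
 * Returns its length and stores in *end_in_x the index in x just past the
 * last character of the first longest match found, so the substring is
 * x[*end_in_x - length .. *end_in_x). Returns -1 if allocation fails.
 */
int longest_common_substring_rows(const char *x, const char *y, size_t *end_in_x)
{
    size_t m = strlen(x);
    size_t n = strlen(y);
    int *prev = calloc(n + 1, sizeof *prev);   /* row i-1 */
    int *curr = calloc(n + 1, sizeof *curr);   /* row i */
    int best = -1;

    if (prev != NULL && curr != NULL) {
        best = 0;
        *end_in_x = 0;
        for (size_t i = 1; i <= m; i++) {
            curr[0] = 0;
            for (size_t j = 1; j <= n; j++) {
                if (x[i - 1] == y[j - 1]) {
                    curr[j] = prev[j - 1] + 1;   /* extend the matching run */
                    if (curr[j] > best) {
                        best = curr[j];
                        *end_in_x = i;
                    }
                } else {
                    curr[j] = 0;                 /* the run is broken */
                }
            }
            int *tmp = prev;
            prev = curr;
            curr = tmp;
        }
    }
    free(prev);
    free(curr);
    return best;
}
```

For "abcbdab" and "bdcaba" this returns 2 with *end_in_x set to 2, which corresponds to the substring "ab" at the start of the first string. Unlike the subsequence case, the substring can still be recovered without the full table, because only its end position and length are needed.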

