Algorithm series note 6 (Dynamic Programming: Longest Common Subsequence/Substring, LCS)

Source: Internet
Author: User


A subsequence only requires that the elements keep their relative order, while a substring must also be contiguous. For example, take the two strings ABCBDAB and BDCABA. Their longest common subsequences include BCBA, BDAB, and BCAB (length 4), while their longest common substrings are AB and BD (length 2, contiguous). The solution is therefore not necessarily unique, so strictly speaking "the" longest common substring or subsequence means one of the longest.
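The difference between the two definitions can be checked mechanically. The following is a minimal sketch; the helper names is_subsequence and is_substring are made up for illustration and do not appear in the original code.

```cpp
#include <cstddef>
#include <string>

// True if s can be obtained from t by deleting zero or more characters
// (order is preserved, gaps are allowed).
bool is_subsequence(const std::string &s, const std::string &t) {
    std::size_t k = 0; // number of characters of s matched so far
    for (std::size_t j = 0; j < t.size() && k < s.size(); ++j)
        if (t[j] == s[k]) ++k;
    return k == s.size();
}

// True if s occurs in t as one contiguous block.
bool is_substring(const std::string &s, const std::string &t) {
    return t.find(s) != std::string::npos;
}
```

With these helpers, BCBA is a subsequence of both ABCBDAB and BDCABA but a substring of neither, while AB and BD are substrings of both.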

Longest Common subsequence

Method 1: Exhaustive search

Enumerate every subsequence of string x; there are 2^m of them in total. For each one, check whether it also appears as a subsequence of string y, which takes O(n). The total time is O(n * 2^m), which is exponential.
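As a rough illustration of the exhaustive method, the sketch below enumerates the subsequences of x with a bitmask. The function names are hypothetical, and the code assumes x is shorter than 31 characters so that the mask fits in an unsigned long.

```cpp
#include <cstddef>
#include <string>

// True if s is a subsequence of t; one pass over t, so O(|t|).
bool is_subsequence_of(const std::string &s, const std::string &t) {
    std::size_t k = 0;
    for (std::size_t j = 0; j < t.size() && k < s.size(); ++j)
        if (t[j] == s[k]) ++k;
    return k == s.size();
}

// Tries all 2^m subsequences of x (one per bitmask) and returns the length
// of the longest one that is also a subsequence of y.
// Assumption: x.size() < 31, otherwise the shift below overflows.
int lcs_brute_force(const std::string &x, const std::string &y) {
    const std::size_t m = x.size();
    int best = 0;
    for (unsigned long mask = 0; mask < (1UL << m); ++mask) {
        std::string candidate;
        for (std::size_t i = 0; i < m; ++i)
            if (mask & (1UL << i)) candidate += x[i];
        const int len = static_cast<int>(candidate.size());
        if (len > best && is_subsequence_of(candidate, y))
            best = len;
    }
    return best;
}
```

The loop body runs 2^m times regardless of the input, which is why this method is only usable for very short strings.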

Method 2: Dynamic Programming (DP)

Place the two strings x[1..m] and y[1..n] along the two axes of a table. This gives a two-dimensional array c[i, j] that records the length of the longest common subsequence of x[1..i] and y[1..j], with c[i, 0] = c[0, j] = 0.

When x[i] = y[j], c[i, j] = c[i-1, j-1] + 1; otherwise, c[i, j] = max{c[i-1, j], c[i, j-1]}.

The table is filled bottom-up, so each of the mn distinct subproblems of LCS is solved exactly once and the time complexity is O(mn). Without the table, the same subproblems are computed repeatedly and the time complexity is still exponential.

The code is as follows:

    #include <iostream>
    using namespace std;

    // Longest common subsequence (not necessarily contiguous).
    // b is a marker array that records where each c[i][j] came from.
    void lcs_sequences(const char *str1, const char *str2, int len1, int len2)
    {
        int **c = new int *[len1 + 1];
        int **b = new int *[len1 + 1];
        int i, j;
        for (i = 0; i <= len1; i++) {
            c[i] = new int[len2 + 1];
            b[i] = new int[len2 + 1];
        }
        for (i = 0; i <= len1; i++)
            for (j = 0; j <= len2; j++)
                c[i][j] = 0;

        for (i = 1; i <= len1; i++) {
            for (j = 1; j <= len2; j++) {
                if (str1[i - 1] == str2[j - 1]) {
                    c[i][j] = c[i - 1][j - 1] + 1;
                    b[i][j] = 0; // from the upper left
                } else if (c[i - 1][j] > c[i][j - 1]) {
                    c[i][j] = c[i - 1][j];
                    b[i][j] = 1; // from above
                } else {
                    c[i][j] = c[i][j - 1];
                    b[i][j] = 2; // from the left
                }
            }
        }
        cout << "Longest common subsequence length: " << c[len1][len2] << endl;

        // Recover one solution by walking back from c[len1][len2].
        i = len1;
        j = len2;
        char *x = new char[c[len1][len2]];
        int k = 0;
        /* Backtracking with the marker array:
        while (i > 0 && j > 0) {
            if (b[i][j] == 0) { x[k++] = str1[i - 1]; i--; j--; }
            else if (b[i][j] == 1) i--;
            else j--;
        }
        */
        // Backtracking with str1, str2 and c only; no marker array is needed.
        while (i > 0 && j > 0) {
            if (str1[i - 1] == str2[j - 1]) {
                x[k++] = str1[i - 1];
                i--;
                j--;
            } else if (c[i][j] == c[i][j - 1]) {
                j--;
            } else {
                i--;
            }
        }
        cout << "the lcs is: ";
        for (i = c[len1][len2] - 1; i >= 0; i--) // x holds the result reversed
            cout << x[i];
        cout << endl;

        for (i = 0; i <= len1; i++) { // len1 + 1 rows were allocated
            delete[] c[i];
            delete[] b[i];
        }
        delete[] c;
        delete[] b;
        delete[] x;
    }

The code above fills a marker array to record where each entry came from. The marker array is not actually required: the direction can be determined directly from c[i, j], str1, and str2. Dropping it saves O(mn) space, but this only improves the constant factor; the space complexity remains O(mn).

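If only the length of the longest common subsequence is required, the space complexity can be reduced to O(min{m, n}). A two-dimensional array is still used here, but the number of rows is fixed at 2. The rows have one entry per character of the second string, so the bound O(min{m, n}) is reached when the shorter string is passed as the second one.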

The code is as follows:

    #include <iostream>
    using namespace std;

    // Longest common subsequence with reduced space. A two-row array is used,
    // so only the length of the longest common subsequence can be obtained.
    void swap_rows(int **c, int len2)
    {
        for (int i = 0; i < len2; i++) {
            int temp = c[0][i];
            c[0][i] = c[1][i];
            c[1][i] = temp;
        }
    }

    void lcs_sequences_opt(const char *str1, const char *str2, int len1, int len2)
    {
        if (len2 <= 0) { // avoid reading c[0][-1] below
            cout << "the longest length of the common subsequence is: 0" << endl;
            return;
        }
        int *c[2];
        int i, j;
        for (i = 0; i < 2; i++)
            c[i] = new int[len2];
        for (j = 0; j < len2; j++)
            c[0][j] = 0;

        for (i = 0; i < len1; i++) {
            for (j = 0; j < len2; j++) {
                if (str1[i] == str2[j]) {
                    if (j == 0)
                        c[1][j] = 1;
                    else
                        c[1][j] = c[0][j - 1] + 1;
                } else {
                    if (j == 0)
                        c[1][j] = c[0][j];
                    else
                        c[1][j] = c[0][j] > c[1][j - 1] ? c[0][j] : c[1][j - 1];
                }
            }
            swap_rows(c, len2); // copying c[1] into c[0] would work as well
        }
        cout << "the longest length of the common subsequence is: "
             << c[0][len2 - 1] << endl;
        for (i = 0; i < 2; i++)
            delete[] c[i];
    }
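To actually reach O(min{m, n}) space, one option is to index the row by the shorter string and keep a single row plus one saved value. The sketch below is a variation on the two-row idea, not the original author's code; the name lcs_length_one_row and the use of std::string and std::vector are choices made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// LCS length using one row of size min(m, n) + 1.
int lcs_length_one_row(std::string a, std::string b) {
    if (a.size() < b.size()) std::swap(a, b); // make b the shorter string
    std::vector<int> row(b.size() + 1, 0);    // row[j] plays the role of c[i][j]
    for (std::size_t i = 1; i <= a.size(); ++i) {
        int diag = 0; // value of c[i-1][j-1]; c[i-1][0] is 0
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int above = row[j]; // c[i-1][j], about to be overwritten
            if (a[i - 1] == b[j - 1])
                row[j] = diag + 1;
            else
                row[j] = std::max(above, row[j - 1]);
            diag = above; // becomes c[i-1][j-1] for the next column
        }
    }
    return row[b.size()];
}
```

The variable diag is needed because row[j - 1] has already been overwritten with the current row's value by the time column j is processed.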


Longest Common substring

The solution is to use a matrix that records, for every pair of positions in the two strings, whether the two characters match: 1 if they match, 0 otherwise. The longest run of consecutive 1s along a diagonal then corresponds to the longest common substring, and its location gives the position of the match.
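The matrix idea can be sketched directly, before any optimization: build the 0/1 matrix, then scan each diagonal for its longest run of 1s. The function name and the use of std::string and std::vector are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Longest common substring via an explicit 0/1 match matrix.
std::string longest_common_substring_naive(const std::string &x, const std::string &y) {
    const std::size_t m = x.size(), n = y.size();
    // match[i][j] is 1 when x[i] == y[j], and 0 otherwise.
    std::vector<std::vector<int>> match(m, std::vector<int>(n, 0));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            match[i][j] = (x[i] == y[j]) ? 1 : 0;

    std::size_t best_len = 0; // length of the longest run of 1s found
    std::size_t best_end = 0; // index in x one past the end of that run
    // Walk one diagonal starting at (i, j), tracking the current run of 1s.
    auto scan = [&](std::size_t i, std::size_t j) {
        std::size_t run = 0;
        for (; i < m && j < n; ++i, ++j) {
            run = match[i][j] ? run + 1 : 0;
            if (run > best_len) {
                best_len = run;
                best_end = i + 1;
            }
        }
    };
    // Every diagonal starts in row 0 or in column 0.
    for (std::size_t j = 0; j < n; ++j) scan(0, j);
    for (std::size_t i = 1; i < m; ++i) scan(i, 0);
    return x.substr(best_end - best_len, best_len);
}
```

This version makes two passes, one to fill the matrix and one to scan the diagonals.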

Optimization: when two characters match, do not simply assign 1 to the corresponding element; instead, assign the value of the element in the upper left corner plus 1. Two tracking variables record the largest value in the matrix and its position. While the matrix is being generated, each new element is compared with the current maximum and the tracking variables are updated if it is larger. By the time the matrix is complete, the position and length of the longest common substring are already known.

That is:

When x[i] = y[j], c[i, j] = c[i-1, j-1] + 1; otherwise, c[i, j] = 0.

The code is as follows:

    #include <iostream>
    using namespace std;

    // Longest common substring (contiguous).
    void lcs_string(const char *str1, const char *str2, int len1, int len2)
    {
        int **c = new int *[len1];
        int i, j;
        int maxC = 0;     // length of the longest match found so far
        int position = 0; // index in str2 at which that match ends
        for (i = 0; i < len1; i++) {
            c[i] = new int[len2];
            for (j = 0; j < len2; j++) {
                if (str1[i] == str2[j]) {
                    if (i == 0 || j == 0)
                        c[i][j] = 1;
                    else
                        c[i][j] = c[i - 1][j - 1] + 1;
                } else {
                    c[i][j] = 0;
                }
                if (c[i][j] > maxC) {
                    maxC = c[i][j];
                    position = j;
                }
            }
        }
        cout << "Maximum length of common substrings: " << maxC << endl;
        cout << "the lcs is: ";
        for (i = position - maxC + 1; i <= position; i++)
            cout << str2[i];
        cout << endl;

        for (i = 0; i < len1; i++)
            delete[] c[i];
        delete[] c;
    }

Here both the time complexity and the space complexity are O(mn). The space complexity can be reduced to O(min{m, n}) by using a one-dimensional array, provided each row is traversed from back to front. That way c[j-1] still holds the previous row's result when c[j] is computed. Otherwise wrong results occur, for example for the strings AB and DBB, where a front-to-back traversal reports a match of length 2 instead of 1.

The code is as follows:

    #include <cstring>
    #include <iostream>
    using namespace std;

    // Longest common substring with reduced space: a one-dimensional array.
    void lcs_string_opt(const char *str1, const char *str2, int len1, int len2)
    {
        int *c = new int[len2];
        int i, j;
        int maxC = 0;     // length of the longest match found so far
        int position = 0; // index in str2 at which that match ends
        memset(c, 0, sizeof(int) * len2);
        for (i = 0; i < len1; i++) {
            // Traverse from back to front so that c[j - 1] still holds the
            // previous row's result; otherwise inputs such as AB and DBB
            // give wrong answers.
            for (j = len2 - 1; j >= 0; j--) {
                if (str1[i] == str2[j]) {
                    if (j == 0)
                        c[j] = 1;
                    else
                        c[j] = c[j - 1] + 1; // the only difference from the 2-D version
                } else {
                    c[j] = 0;
                }
                if (c[j] > maxC) {
                    maxC = c[j];
                    position = j;
                }
            }
        }
        cout << "Maximum length of common substrings: " << maxC << endl;
        cout << "the lcs_opt is: ";
        for (i = position - maxC + 1; i <= position; i++) // output the longest common substring
            cout << str2[i];
        cout << endl;
        delete[] c;
    }
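The effect of the traversal direction can be seen with a small experiment. The sketch below is a hypothetical helper, not part of the original code: it computes the longest common substring length with a single row and takes a flag that selects the direction, so the two orders can be compared on the same input.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One-row longest common substring length.
// back_to_front == true is the correct order; false updates the row from
// left to right, which reuses values already overwritten in the current row.
int common_substring_length(const std::string &s1, const std::string &s2,
                            bool back_to_front) {
    const int n = static_cast<int>(s2.size());
    std::vector<int> c(n, 0);
    int best = 0;
    for (char ch : s1) {
        for (int step = 0; step < n; ++step) {
            const int j = back_to_front ? n - 1 - step : step;
            if (ch == s2[j])
                c[j] = (j == 0) ? 1 : c[j - 1] + 1;
            else
                c[j] = 0;
            best = std::max(best, c[j]);
        }
    }
    return best;
}
```

For s1 = AB and s2 = DBB, the back-to-front order returns 1, while the front-to-back order returns 2: the second B of DBB extends a match that was recorded for the first B in the same row.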

Dynamic Programming

Dynamic programming has two major features. We use the longest common subsequence as an example.

1: optimal substructure

It means that an optimal solution to the problem contains optimal solutions to its subproblems.

For example, in LCS, consider the longest common subsequence of x[1..i] and y[1..j]. When x[i] = y[j], the problem reduces to the longest common subsequence of x[1..i-1] and y[1..j-1]. When x[i] is not equal to y[j], it reduces to the better of two subproblems: x[1..i] with y[1..j-1], and x[1..i-1] with y[1..j]. Both of these subproblems contain a common subproblem, namely computing the longest common subsequence of x[1..i-1] and y[1..j-1].

2: overlapping subproblems

As seen above, the subproblems share a common subproblem, so they overlap.

This means that the recursion involves only a small number of distinct subproblems, each of which is computed many times. The LCS problem contains only m * n distinct subproblems.
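The overlap can be made visible by counting how often a plain recursion is entered. The following is a minimal sketch; the function lcs_naive and its calls counter are hypothetical names introduced for this illustration.

```cpp
#include <algorithm>
#include <string>

// Plain recursion on the LCS recurrence, with no table.
// calls is incremented each time the function is entered.
int lcs_naive(const std::string &x, const std::string &y, int i, int j, long &calls) {
    ++calls;
    if (i == 0 || j == 0) return 0;
    if (x[i - 1] == y[j - 1])
        return lcs_naive(x, y, i - 1, j - 1, calls) + 1;
    return std::max(lcs_naive(x, y, i - 1, j, calls),
                    lcs_naive(x, y, i, j - 1, calls));
}
```

For two strings of length 8 with no characters in common, such as AAAAAAAA and BBBBBBBB, the function is entered more than 25,000 times, although there are only 8 * 8 = 64 distinct subproblems with i > 0 and j > 0. Storing each result in a table removes this repetition.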

References

1: http://blog.csdn.net/steven30832/article/details/8260189

2: http://blog.csdn.net/imzoer/article/details/8031478
