[Data Structure] LCSS: longest common subsequence and longest common substring

Source: Internet
Author: User

1. What is LCSS?

What is LCSS? Many readers may be puzzled by these letters, because the abbreviation is my own shorthand for two common problems: the longest common subsequence problem (longest-common-subsequence) and the longest common substring problem (longest-common-substring). The two problems are very similar, so readers who are unfamiliar with them can easily mix them up. Let's take a close look at the difference between the two.

1.1 Subsequence vs substring

  A subsequence is ordered but not necessarily contiguous, and it is defined on sequences.

For example, the sequence X = <B, C, D, B> is a subsequence of the sequence Y = <A, B, C, B, D, A, B>, with the corresponding index sequence <2, 3, 5, 7>.

  A substring is ordered and contiguous, and it is defined on strings.

For example, A = abcd is a substring of C = aaabcdddd, but B = acdddd is not a substring of C.
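The two definitions above can be checked mechanically. The following is a minimal Python sketch; the function names `is_subsequence` and `is_substring` are illustrative and do not come from the original article.

```python
def is_subsequence(x, y):
    """Return True if x can be obtained from y by deleting zero or more elements."""
    it = iter(y)
    # `ch in it` advances the iterator until ch is found, so the order of y is respected.
    return all(ch in it for ch in x)


def is_substring(a, c):
    """Return True if a occurs in c as one contiguous block."""
    return a in c


print(is_subsequence("BCDB", "ABCBDAB"))      # True
print(is_substring("abcd", "aaabcdddd"))      # True
print(is_substring("acdddd", "aaabcdddd"))    # False
print(is_subsequence("acdddd", "aaabcdddd"))  # True
```

The last two lines show the difference directly: acdddd is not contiguous in aaabcdddd, so it is a subsequence but not a substring.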

1.2 Longest common subsequence vs longest common substring

The longest common subsequence and the longest common substring are two common problems. They are similar, and both can be solved with dynamic programming, but they are different in nature.

The longest common subsequence problem: given two sequences, find the longest subsequence that is common to both.

The longest common substring problem: given two strings, find the longest substring that is common to both (for string matching algorithms, see the blog post "[Algorithm] String matching algorithm: the KMP algorithm").

2. Solving LCSS with dynamic programming

As mentioned earlier, dynamic programming can solve both the longest common subsequence problem and the longest common substring problem; we do not work through each of them here. Taking the longest common subsequence (LCS) as the example, we show how to apply the idea of dynamic programming to LCSS.

Given two sequences, find the length of the longest subsequence that occurs in both. A sequence of length $m$ has $2^{m}$ subsequences, so the brute-force algorithm takes exponential time, which is obviously not a good solution.
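To make the exponential cost concrete, here is a hedged brute-force sketch in Python. It is only meant for very short inputs, and the names `all_subsequences` and `lcs_length_bruteforce` are hypothetical helpers introduced for illustration.

```python
from itertools import combinations


def all_subsequences(s):
    """Return the set of distinct subsequences of s (including the empty one)."""
    result = set()
    # Every subset of index positions gives one subsequence: 2**len(s) choices in total.
    for k in range(len(s) + 1):
        for idx in combinations(range(len(s)), k):
            result.add("".join(s[i] for i in idx))
    return result


def lcs_length_bruteforce(x, y):
    """Length of the longest string that is a subsequence of both x and y."""
    common = all_subsequences(x) & all_subsequences(y)
    return max(len(s) for s in common)  # the empty string is always common


print(lcs_length_bruteforce("AGGTAB", "GXTXAYB"))  # 4
```

Each extra character doubles the number of index subsets to enumerate, which is why this approach stops being usable after a few dozen characters.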

Let's see how to use the idea of dynamic programming to solve the longest common subsequence problem.

First, consider whether the longest common subsequence problem satisfies the two basic properties of a dynamic programming problem:

  1. Optimal substructure:

  Let the input sequences be X[0..m-1] and Y[0..n-1], of lengths m and n respectively, and let L(X[0..m-1], Y[0..n-1]) be the length of the LCS of these two sequences. If either sequence is empty, the length is 0. Otherwise L(X[0..m-1], Y[0..n-1]) is defined recursively as follows:

1) If the last elements of the two sequences match (i.e. X[m-1] == Y[n-1])

Then: L(X[0..m-1], Y[0..n-1]) = 1 + L(X[0..m-2], Y[0..n-2])

2) If the last elements of the two sequences do not match (i.e. X[m-1] != Y[n-1])

Then: L(X[0..m-1], Y[0..n-1]) = MAX(L(X[0..m-2], Y[0..n-1]), L(X[0..m-1], Y[0..n-2]))

The following concrete examples make this easier to understand:

  1) Consider the input sequences <AGGTAB> and <GXTXAYB>. Their last characters match, so the length of the LCS can be written as:

L(<AGGTAB>, <GXTXAYB>) = 1 + L(<AGGTA>, <GXTXAY>)

2) Consider the input sequences <ABCDGH> and <AEDFHR>. Their last characters do not match, so the length of the LCS can be written as:

L(<ABCDGH>, <AEDFHR>) = MAX(L(<ABCDG>, <AEDFHR>), L(<ABCDGH>, <AEDFH>))

Therefore, the LCS problem has the optimal substructure property.
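The recursive definition can be transcribed almost literally. The sketch below is a direct, unoptimized Python rendering: on a match both sequences lose their last element, otherwise the better of the two shortened problems is taken. The function name `lcs_length_recursive` is illustrative.

```python
def lcs_length_recursive(x, y):
    """LCS length computed directly from the recursive definition (exponential time)."""
    if len(x) == 0 or len(y) == 0:
        return 0  # an empty sequence has no common subsequence with anything
    if x[-1] == y[-1]:
        # Last elements match: count it and drop it from both sequences.
        return 1 + lcs_length_recursive(x[:-1], y[:-1])
    # Last elements differ: drop the last element of one sequence or the other.
    return max(lcs_length_recursive(x[:-1], y),
               lcs_length_recursive(x, y[:-1]))


print(lcs_length_recursive("AGGTAB", "GXTXAYB"))  # 4
print(lcs_length_recursive("ABCDGH", "AEDFHR"))   # 3
```

For the two examples the common subsequences found are GTAB and ADH respectively.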

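  2. Overlapping subproblems:

From the analysis above, many subproblems of LCS share the same sub-subproblems, so a plain recursive solution recomputes the same values many times. Storing each result in a table and computing it only once avoids this. Filling the table takes O(m*n) time; the backtracking step that recovers an LCS from the filled table takes a further O(m+n).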
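One way to see the overlap is to cache the recursion and count how many distinct subproblems are actually evaluated. This is a sketch using Python's `functools.lru_cache`; the wrapper name `lcs_length_memo` and the returned counters are assumptions made for illustration.

```python
from functools import lru_cache


def lcs_length_memo(x, y):
    """Return (LCS length, subproblems computed, cached results reused)."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # solve(i, j) is the LCS length of x[0..i-1] and y[0..j-1].
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return 1 + solve(i - 1, j - 1)
        return max(solve(i - 1, j), solve(i, j - 1))

    result = solve(len(x), len(y))
    info = solve.cache_info()
    # misses = distinct (i, j) pairs computed; hits = repeated requests served from the cache.
    return result, info.misses, info.hits


length, computed, reused = lcs_length_memo("ABCDGH", "AEDFHR")
print(length)                         # 3
print(computed <= (6 + 1) * (6 + 1))  # True: at most (m+1)*(n+1) distinct subproblems
print(reused)                         # number of times a stored result was reused
```

Because there are at most (m+1)*(n+1) distinct (i, j) pairs, caching bounds the work by the size of the table.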

Once the table is filled, an LCS itself can be recovered by backtracking from the last cell; the Python code further below includes this step.

The C implementation is as follows:

/* Dynamic programming implementation of the LCS problem */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int max(int a, int b);

/* Returns length of LCS for X[0..m-1], Y[0..n-1] */
int lcs(char *X, char *Y, int m, int n)
{
    int L[m + 1][n + 1];
    int i, j;

    /* Build L[m+1][n+1] in bottom-up fashion. Note that L[i][j]
       contains the length of the LCS of X[0..i-1] and Y[0..j-1] */
    for (i = 0; i <= m; i++)
    {
        for (j = 0; j <= n; j++)
        {
            if (i == 0 || j == 0)
                L[i][j] = 0;
            else if (X[i - 1] == Y[j - 1])
                L[i][j] = L[i - 1][j - 1] + 1;
            else
                L[i][j] = max(L[i - 1][j], L[i][j - 1]);
        }
    }

    /* L[m][n] contains the length of the LCS for X[0..m-1] and Y[0..n-1] */
    return L[m][n];
}

/* Utility function to get the max of 2 integers */
int max(int a, int b)
{
    return (a > b) ? a : b;
}

/* Test the above function */
int main()
{
    char X[] = "AGGTAB";
    char Y[] = "GXTXAYB";

    int m = strlen(X);
    int n = strlen(Y);

    printf("Length of LCS is %d\n", lcs(X, Y, m, n));

    getchar();
    return 0;
}

The Python implementation is as follows:

def lcs(a, b):
    lena = len(a)
    lenb = len(b)
    # c[i][j] is the LCS length of a[0..i-1] and b[0..j-1]
    c = [[0 for i in range(lenb + 1)] for j in range(lena + 1)]
    # flag[i][j] records which neighbouring cell c[i][j] was derived from
    flag = [[0 for i in range(lenb + 1)] for j in range(lena + 1)]
    for i in range(lena):
        for j in range(lenb):
            if a[i] == b[j]:
                c[i + 1][j + 1] = c[i][j] + 1
                flag[i + 1][j + 1] = 'ok'
            elif c[i + 1][j] > c[i][j + 1]:
                c[i + 1][j + 1] = c[i + 1][j]
                flag[i + 1][j + 1] = 'left'
            else:
                c[i + 1][j + 1] = c[i][j + 1]
                flag[i + 1][j + 1] = 'up'
    return c, flag


def printLcs(flag, a, i, j):
    # Backtrack from cell (i, j) and print the LCS in order
    if i == 0 or j == 0:
        return
    if flag[i][j] == 'ok':
        printLcs(flag, a, i - 1, j - 1)
        print(a[i - 1], end='')
    elif flag[i][j] == 'left':
        printLcs(flag, a, i, j - 1)
    else:
        printLcs(flag, a, i - 1, j)


a = 'ABCBDAB'
b = 'BDCABA'
c, flag = lcs(a, b)
for i in c:
    print(i)
print('')
for j in flag:
    print(j)
print('')
printLcs(flag, a, len(a), len(b))
print('')

awk also makes this kind of code easy to write. The script below tests candidates with index(), which looks for a contiguous match, so it finds the longest common substrings of its two input lines:

echo "123456abcd567
234dddabc45678" | awk -v FS="" 'NR==1{str=$0} NR==2{N=NF; for(n=0;n++<N;){s=""; for(t=n;t<=N;t++){s=s""$t; if(index(str,s)){a[n]=t-n; b[n]=s; if(m<=a[n]) m=a[n]} else {t=N}}}} END{for(n=0;n++<N;) if(a[n]==m) print b[n]}'
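For comparison with the LCS code, the longest common substring problem from section 1.2 can also be solved with dynamic programming. The following Python sketch is not from the original article; it assumes the usual formulation in which each table cell holds the length of the longest common suffix of the two prefixes, and the name `longest_common_substring` is illustrative.

```python
def longest_common_substring(a, b):
    """Return one longest string that occurs contiguously in both a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the longest common suffix length for the previous row of the table.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-2], b[j-2] by one character.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
            # On a mismatch cur[j] stays 0: a substring must be contiguous.
        prev = cur
    return a[best_end - best_len:best_end]


print(longest_common_substring("abcd", "aaabcdddd"))  # abcd
print(longest_common_substring("ABCBDAB", "BDCABA"))  # AB
```

The only difference from the LCS recurrence is the mismatch case: the value resets to 0 instead of carrying over the maximum of the neighbouring cells. For ABCBDAB and BDCABA this gives a longest common substring of length 2, while their longest common subsequence has length 4.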
3. Reference Content

1. "Introduction to Algorithms", dynamic programming chapter: the longest common subsequence.
