The Smith-Waterman algorithm is a dynamic programming algorithm proposed by Smith and Waterman in 1981 for finding and comparing regions of local similarity between two sequences; many later algorithms build on it. It is a pairwise local alignment algorithm: the two sequences are lined up using match, deletion, and insertion operations so that the aligned segments have the same length, keeping identical letters in corresponding positions as far as possible. Rather than aligning the two sequences end to end, it finds the optimal alignment between a sub-fragment of one sequence and a sub-fragment of the other. This can reveal matching segments that would otherwise be swamped by the unrelated residues around them.
The algorithm can be summarized as follows:
1) Assign a score to each pair of bases or residues: a positive value for identical or similar pairs, a negative value for mismatches and gaps;
2) Initialize the edge elements of the matrix to 0;
3) Fill in the matrix by accumulating scores, replacing any score below 0 with 0;
4) Using dynamic programming, start backtracking from the cell with the largest score in the matrix;
5) Continue until a cell with a score of 0 is reached; the cells on this backtracking path give the optimal local alignment.
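Steps 2 and 3 can be written as a recurrence. The following is a sketch that assumes a linear gap penalty; the symbols s(a_i, b_j) for the score of pairing two letters and d for the penalty per gap position are notation introduced here for illustration, not taken from the text above.

```latex
\[
H_{i,0} = H_{0,j} = 0, \qquad
H_{i,j} = \max
\begin{cases}
0 \\
H_{i-1,j-1} + s(a_i, b_j) \\
H_{i-1,j} - d \\
H_{i,j-1} - d
\end{cases}
\]
```

The 0 in the maximum is what makes the alignment local: a path whose score would become negative is cut off and a new one can start at that cell.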
From the above, the Smith-Waterman algorithm consists of two main steps: computing the score matrix and finding the best pair of similar fragments. Once the score matrix has been obtained, the pair of fragments with maximal local similarity is found by backtracking: first find the largest element in the score matrix, then follow the path by which that element was computed backwards, step by step, stopping when a 0 is reached.
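The two steps can be sketched compactly with loops. This is only a minimal illustration under stated assumptions: the class name LocalAlignSketch, the method names fill and align, and the integer scores (match 3, mismatch -1, gap -4, with a linear gap penalty) are all chosen for this example.

```java
public class LocalAlignSketch {
    // Assumed scores for illustration only (linear gap penalty).
    static final int MATCH = 3, MISMATCH = -1, GAP = -4;

    // Step 1: fill the score matrix; row 0 and column 0 stay 0.
    static int[][] fill(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int diag = h[i - 1][j - 1] + sub;
                int gap = Math.max(h[i - 1][j], h[i][j - 1]) + GAP;
                h[i][j] = Math.max(0, Math.max(diag, gap)); // negative scores become 0
            }
        }
        return h;
    }

    // Step 2: start at the largest cell and walk back until a 0 is reached.
    static String[] align(String a, String b) {
        int[][] h = fill(a, b);
        int i = 0, j = 0;
        for (int r = 0; r < h.length; r++)
            for (int c = 0; c < h[r].length; c++)
                if (h[r][c] > h[i][j]) { i = r; j = c; }
        StringBuilder x = new StringBuilder(), y = new StringBuilder();
        while (h[i][j] > 0) { // a positive cell always has i >= 1 and j >= 1
            int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
            if (h[i][j] == h[i - 1][j - 1] + sub) {     // came from the upper left
                x.append(a.charAt(--i));
                y.append(b.charAt(--j));
            } else if (h[i][j] == h[i - 1][j] + GAP) {  // came from above: gap in b
                x.append(a.charAt(--i));
                y.append('-');
            } else {                                    // came from the left: gap in a
                x.append('-');
                y.append(b.charAt(--j));
            }
        }
        // The path was collected end to start, so reverse both rows.
        return new String[] { x.reverse().toString(), y.reverse().toString() };
    }

    public static void main(String[] args) {
        String[] r = align("GCCAUUG", "GCCUCG");
        System.out.println(r[0]); // prints GCCAUUG
        System.out.println(r[1]); // prints GCC-UCG
    }
}
```

Walking back with a loop rather than recursion avoids deep call stacks on long sequences; when several predecessors tie, this sketch prefers the diagonal move, which is one of several equally valid choices.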
Here is the example from Smith and Waterman's original paper.
1) Assume the two sequences to be matched are s1 = AAUGCCAUUGACGG and s2 = ACAGCCUCGCUUAG.
2) First, compute the score matrix H. Find the cell with the highest score in the matrix, H(10,8) with a score of 3.3, and begin backtracking from it.
3) The idea of backtracking is simple: check the cells above, to the left, and to the upper left of the current cell, and see whether its score equals the score above - 4/3, the score to the left - 4/3, the upper-left score + 1, or the upper-left score - 1/3. In short, work out which cell the current one "came from".
4) Backtracking terminates when it reaches a cell with a score of 0, which means that no matching substring extends any further back, so the local alignment starts there.
5) After the entire backtracking process is complete, the following pair of substrings is found:
AAUGCCAUUG
ACAGCC-UCG
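To make the scoring of a gapped alignment concrete, the sketch below adds up the column scores of the pair above. It assumes the integer scores used in the Java listing (match 3, mismatch -1, gap -4) rather than the fractional scores of the paper; the class name AlignmentScoreSketch and the method score are hypothetical names introduced for this example.

```java
public class AlignmentScoreSketch {
    // Assumed scores, taken from the integer values in the Java listing.
    static final int MATCH = 3, MISMATCH = -1, GAP = -4;

    // Sums the per-column scores of two equal-length rows of a gapped alignment.
    static int score(String top, String bottom) {
        if (top.length() != bottom.length()) {
            throw new IllegalArgumentException("rows must have equal length");
        }
        int total = 0;
        for (int k = 0; k < top.length(); k++) {
            char x = top.charAt(k), y = bottom.charAt(k);
            if (x == '-' || y == '-') {
                total += GAP;                          // letter aligned with a gap
            } else {
                total += (x == y) ? MATCH : MISMATCH;  // letter aligned with a letter
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // 7 matches, 2 mismatches, 1 gap: 7*3 - 2*1 - 4 = 11
        System.out.println(score("AAUGCCAUUG", "ACAGCC-UCG")); // prints 11
    }
}
```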
Here is the source code, written in Java:
import java.util.ArrayList;
import java.util.Stack;

public class SWSQ {
    private int[][] H;        // score matrix
    private int[][] isEmpty;  // 1 = cell not yet computed, 0 = computed
    private static final int SPACE = -4;     // score for aligning a letter with a gap
    private static final int MATCH = 3;      // score for two identical letters
    private static final int MISMATCH = -1;  // score for two different letters
    private int maxIndexM, maxIndexN;        // position of the highest-scoring cell
    private Stack<Character> stk1, stk2;
    public String subSq1, subSq2;            // the two substrings with the highest similarity

    public SWSQ() {
        stk1 = new Stack<Character>();
        stk2 = new Stack<Character>();
    }

    // Largest of a, b and c, but never less than 0
    private int max(int a, int b, int c) {
        int maxN;
        if (a >= b) maxN = a;
        else maxN = b;
        if (maxN < c) maxN = c;
        if (maxN < 0) maxN = 0;
        return maxN;
    }

    // Compute the score matrix recursively; each cell is filled once.
    // The recursion depth grows with the sequence lengths, so this suits short inputs.
    private void calculateMatrix(String s1, String s2, int m, int n) {
        if (m == 0)
            H[m][n] = 0;
        else if (n == 0)
            H[m][n] = 0;
        else {
            if (isEmpty[m - 1][n - 1] == 1) calculateMatrix(s1, s2, m - 1, n - 1);
            if (isEmpty[m][n - 1] == 1) calculateMatrix(s1, s2, m, n - 1);
            if (isEmpty[m - 1][n] == 1) calculateMatrix(s1, s2, m - 1, n);
            if (s1.charAt(m - 1) == s2.charAt(n - 1))
                H[m][n] = max(H[m - 1][n - 1] + MATCH, H[m][n - 1] + SPACE, H[m - 1][n] + SPACE);
            else
                H[m][n] = max(H[m - 1][n - 1] + MISMATCH, H[m][n - 1] + SPACE, H[m - 1][n] + SPACE);
        }
        isEmpty[m][n] = 0;
    }

    // Find the indices of the highest-scoring cell in the score matrix H
    private void findMaxIndex(int[][] H, int m, int n) {
        int curM = 0, curN = 0, i, j;
        int max = H[0][0];
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++)
                if (H[i][j] > max) {
                    max = H[i][j];
                    curM = i;
                    curN = j;
                }
        maxIndexM = curM;
        maxIndexN = curN;
    }

    // Backtrack from (m, n) to recover the most similar pair of substrings
    private void traceBack(String s1, String s2, int m, int n) {
        if (H[m][n] == 0) return;
        if (H[m][n] == H[m - 1][n] + SPACE) {         // came from above: gap in s2
            stk1.push(s1.charAt(m - 1));
            stk2.push('-');
            traceBack(s1, s2, m - 1, n);
        } else if (H[m][n] == H[m][n - 1] + SPACE) {  // came from the left: gap in s1
            stk1.push('-');
            stk2.push(s2.charAt(n - 1));
            traceBack(s1, s2, m, n - 1);
        } else {                                      // came from the upper left
            stk1.push(s1.charAt(m - 1));
            stk2.push(s2.charAt(n - 1));
            traceBack(s1, s2, m - 1, n - 1);
        }
    }

    public String alToString(ArrayList<Character> a) {
        StringBuilder sb = new StringBuilder();
        for (Character c : a) {
            sb.append(c.toString());
        }
        return sb.toString();
    }

    public void find(String s1, String s2) {
        int i, j;
        H = new int[s1.length() + 1][s2.length() + 1];
        isEmpty = new int[s1.length() + 1][s2.length() + 1];
        for (i = 0; i <= s1.length(); i++)
            for (j = 0; j <= s2.length(); j++)
                isEmpty[i][j] = 1;
        calculateMatrix(s1, s2, s1.length(), s2.length());
        findMaxIndex(H, H.length, H[0].length);
        traceBack(s1, s2, maxIndexM, maxIndexN);
        // Popping the stacks reverses the backtracking order into reading order
        ArrayList<Character> arr1 = new ArrayList<>();
        ArrayList<Character> arr2 = new ArrayList<>();
        while (!stk1.empty()) arr1.add(stk1.pop());
        subSq1 = alToString(arr1);
        while (!stk2.empty()) arr2.add(stk2.pop());
        subSq2 = alToString(arr2);
    }

    public static void main(String[] args) {
        SWSQ x = new SWSQ();
        String s1 = "AAUGCCAUUGACGG";
        String s2 = "ACAGCCUCGCUUAG";
        x.find(s1, s2);
        System.out.println("----------------------------");
        System.out.println(s1);
        System.out.println(s2);
        System.out.println("----------------------------");
        System.out.println(x.subSq1);
        System.out.println(x.subSq2);
    }
}
Smith-Waterman algorithm and its Java implementation