Edit Distance algorithm

Source: Internet
Author: User

Definition

Given two strings S1 and S2, the edit distance between them is defined as the minimum number of edit operations needed to convert S1 into S2 (which equals the minimum number of edit operations needed to convert S2 into S1).

There are three edit operations: inserting a character, deleting a character, and substituting a character.

For example, the edit distance from cat to cbt is 1 (replace a with b), from cat to ca is 1 (delete t), and from ct to cat is 1 (insert a); the edit distance from xcat to caty is 2 (delete x, insert y).
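
The three operations can be tried out on the examples above with a minimal Java sketch. The class name EditOperations and its three helper methods are hypothetical names chosen for illustration; each one applies a single edit operation using the standard StringBuilder API.

```java
// Hypothetical helper class: each method applies exactly one edit operation.
class EditOperations {
    // Insert character c so that it ends up at position index.
    static String insert(String s, int index, char c) {
        return new StringBuilder(s).insert(index, c).toString();
    }

    // Delete the character at position index.
    static String delete(String s, int index) {
        return new StringBuilder(s).deleteCharAt(index).toString();
    }

    // Replace the character at position index with c.
    static String substitute(String s, int index, char c) {
        StringBuilder sb = new StringBuilder(s);
        sb.setCharAt(index, c);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(substitute("cat", 1, 'b'));          // one operation: cat to cbt
        System.out.println(delete("cat", 2));                   // one operation: cat to ca
        System.out.println(insert("ct", 1, 'a'));               // one operation: ct to cat
        System.out.println(insert(delete("xcat", 0), 3, 'y'));  // two operations: xcat to caty
    }
}
```

Counting the calls needed to get from one string to the other gives an upper bound on the edit distance; the edit distance itself is the smallest such count.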

Solution method

Now that we know the definition of edit distance, how do we find the minimum edit distance? Dynamic programming is used here.

As an example, to find the minimum edit distance between jary and jerry, we first create the following matrix:

            j  a  r  y
         0  1  2  3  4
      j  1
      e  2
      r  3
      r  4
      y  5

What does this matrix mean? The first row holds the string jary and the first column holds the string jerry. Each cell containing a number is the minimum edit distance between a prefix of one string and a prefix of the other. The 0 in the second row, second column means that when both prefixes are empty, the edit distance is 0 (the prefixes are equal). The 1 in the second row, third column means that when the prefix of jerry is empty and the prefix of jary is j, the minimum edit distance between the two is 1 (insert j into the empty prefix of jerry). The remaining cells follow the same reasoning, so the numbers in the second row and second column of the matrix are easy to fill in.
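
The second row and second column described above can be filled in directly, because the distance between an empty prefix and a prefix of length n is always n. Below is a minimal sketch; the class name InitialMatrix and the method name init are hypothetical, and the layout (src across the top, des down the side) is an assumption that mirrors the matrix in this article.

```java
// Hypothetical helper: builds the matrix and fills in only its border cells.
class InitialMatrix {
    static int[][] init(String src, String des) {
        // One extra row and column for the empty prefix.
        int[][] m = new int[des.length() + 1][src.length() + 1];
        // Top row: empty prefix of des versus each prefix of src.
        for (int j = 0; j <= src.length(); j++) {
            m[0][j] = j;
        }
        // Left column: each prefix of des versus the empty prefix of src.
        for (int i = 0; i <= des.length(); i++) {
            m[i][0] = i;
        }
        return m; // inner cells are still 0 here and have not been solved yet
    }
}
```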

The minimum edit distance between the two full strings is the cell in the lower right corner of the matrix: it is the minimum edit distance between the two prefixes when the prefix of jary is jary and the prefix of jerry is jerry, that is, between the two complete strings.

I will first describe how to compute it and then explain the principle. In the matrix below, the blank cells in the middle are labeled x1 to x20, where the number after x is the order in which the cells are solved.

              j    a    r    y
         0    1    2    3    4
    j    1   x1   x6  x11  x16
    e    2   x2   x7  x12  x17
    r    3   x3   x8  x13  x18
    r    4   x4   x9  x14  x19
    y    5   x5  x10  x15  x20

If the cells are solved in this order, then whenever a cell is solved, the values of the cells to its left, above it, and to its upper left are already known. Call these three values left, top, and lefttop. The value V of the cell being solved is then:

    cost = 0 if the character for the cell's column equals the character for the cell's row, otherwise 1

    V = min(left + 1, top + 1, lefttop + cost)
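
The rule above for a single cell can be written as one small function. This is a sketch only; the class name CellRule, the method name cellValue, and the parameter names are hypothetical and chosen to match the wording of the rule.

```java
// Hypothetical helper: computes one cell from its three already-solved neighbors.
class CellRule {
    static int cellValue(int left, int top, int leftTop, char columnChar, char rowChar) {
        // cost is 0 when the column character and the row character are equal, otherwise 1.
        int cost = (columnChar == rowChar) ? 0 : 1;
        return Math.min(Math.min(left + 1, top + 1), leftTop + cost);
    }
}
```

For the first unsolved cell of the jary/jerry matrix (row j, column j) the neighbors are left = 1, top = 1, lefttop = 0 and the characters are equal, so cellValue(1, 1, 0, 'j', 'j') gives 0. For the cell in row j, column a, the neighbors are left = 0, top = 2, lefttop = 1 and the characters differ, so the result is 1.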

The matrix after applying this method:

            j  a  r  y
         0  1  2  3  4
      j  1  0  1  2  3
      e  2  1  1  2  3
      r  3  2  2  1  2
      r  4  3  3  2  2
      y  5  4  4  3  2

The value in the lower right corner is 2, so the edit distance between jary and jerry is 2 (replace a with e, insert an r).
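
To reproduce the solved matrix for jary and jerry, the sketch below fills the cells column by column, following the x1 to x20 numbering used earlier. The class name MatrixDemo and the method name fill are hypothetical. The column-by-column order is just one valid choice: any order in which the left, upper, and upper-left neighbors are solved first gives the same matrix.

```java
import java.util.Arrays;

// Hypothetical demo: solves and prints the whole matrix for two strings.
class MatrixDemo {
    // top is written across the first row, side down the first column.
    static int[][] fill(String top, String side) {
        int rows = side.length() + 1;
        int cols = top.length() + 1;
        int[][] m = new int[rows][cols];
        for (int j = 0; j < cols; j++) m[0][j] = j; // border row
        for (int i = 0; i < rows; i++) m[i][0] = i; // border column
        // Solve the inner cells column by column (x1..x5, then x6..x10, ...).
        for (int j = 1; j < cols; j++) {
            for (int i = 1; i < rows; i++) {
                int cost = top.charAt(j - 1) == side.charAt(i - 1) ? 0 : 1;
                int left = m[i][j - 1];
                int up = m[i - 1][j];
                int leftTop = m[i - 1][j - 1];
                m[i][j] = Math.min(Math.min(left + 1, up + 1), leftTop + cost);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // Prints one matrix row per line.
        for (int[] row : fill("jary", "jerry")) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The last element of the last printed row is the edit distance, which is 2 for jary and jerry.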

Principle of solution

We can use the matrix to find the minimum edit distance between two strings, but why does this work? It is actually simple. Suppose we want the edit distance from the string S[1...i] to T[1...j]:

    1. If S[1...i-1] can be converted to T[1...j] in k1 operations, then S[1...i] can be converted to T[1...j] in k1+1 operations: one delete operation removes S[i], turning S[1...i] into S[1...i-1], and k1 further operations convert that to T[1...j].
    2. If S[1...i] can be converted to T[1...j-1] in k2 operations, then S[1...i] can be converted to T[1...j] in k2+1 operations: first convert S[1...i] to T[1...j-1] with k2 operations, then insert T[j] at the end to obtain T[1...j].
    3. If S[1...i-1] can be converted to T[1...j-1] in k3 operations, then when S[i] == T[j], converting S[1...i] to T[1...j] needs only k3 operations; when S[i] != T[j], one more substitution is needed to replace S[i] with T[j], for k3+1 operations in total.

With S written across the top of the matrix and T down the side, the k1, k2, and k3 in the three cases above correspond to the values in the left, upper, and upper-left cells of a cell in the matrix. The edit distance from S[1...i] to T[1...j] is the smallest of the three resulting counts.

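The above conclusions can be expressed as the following recurrence, where d(i, j) is the edit distance from S[1...i] to T[1...j]:

    d(i, 0) = i
    d(0, j) = j
    d(i, j) = min(d(i-1, j) + 1, d(i, j-1) + 1, d(i-1, j-1) + cost)
    cost = 0 if S[i] == T[j], otherwise 1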
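
The three cases can also be coded directly as a recursion over prefix lengths. The sketch below is a top-down variant with memoization, not the matrix-filling method used in this article; the class name RecursiveEditDistance and the method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical top-down version: solve(i, j) is the distance from s[1...i] to t[1...j].
class RecursiveEditDistance {
    static int distance(String s, String t) {
        int[][] memo = new int[s.length() + 1][t.length() + 1];
        for (int[] row : memo) {
            Arrays.fill(row, -1); // -1 marks a subproblem that is not solved yet
        }
        return solve(s, t, s.length(), t.length(), memo);
    }

    private static int solve(String s, String t, int i, int j, int[][] memo) {
        if (i == 0) return j; // empty s prefix: insert all j characters of t
        if (j == 0) return i; // empty t prefix: delete all i characters of s
        if (memo[i][j] >= 0) return memo[i][j];
        int k1 = solve(s, t, i - 1, j, memo);     // case 1: then delete s[i]
        int k2 = solve(s, t, i, j - 1, memo);     // case 2: then insert t[j]
        int k3 = solve(s, t, i - 1, j - 1, memo); // case 3: then match or substitute
        int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
        memo[i][j] = Math.min(Math.min(k1 + 1, k2 + 1), k3 + cost);
        return memo[i][j];
    }
}
```

The memo table plays the role of the matrix: each entry is computed once, so the work is proportional to the product of the two string lengths, just as when the matrix is filled cell by cell.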

Implementation code

After understanding the principle, writing the code is straightforward: it simply simulates the process of filling in the matrix (Java implementation):

    package common;

    import org.junit.Assert;

    public class LevenshteinDistance {

        /** Returns the edit distance between src and des. */
        public static int getDistance(String src, String des) {
            // m[i][j] is the edit distance between the first j characters of src
            // and the first i characters of des.
            int[][] m = new int[des.length() + 1][src.length() + 1];

            // First row: each prefix of src against the empty string.
            for (int j = 0; j <= src.length(); j++) {
                m[0][j] = j;
            }
            // First column: each prefix of des against the empty string.
            for (int i = 0; i <= des.length(); i++) {
                m[i][0] = i;
            }

            // Each inner cell is derived from its upper, upper-left, and left neighbors.
            for (int i = 1; i <= des.length(); i++) {
                for (int j = 1; j <= src.length(); j++) {
                    int rcost = des.charAt(i - 1) == src.charAt(j - 1) ? 0 : 1;
                    m[i][j] = Math.min(Math.min(m[i - 1][j] + 1, m[i - 1][j - 1] + rcost),
                                       m[i][j - 1] + 1);
                }
            }

            // The lower right corner holds the distance between the full strings.
            return m[des.length()][src.length()];
        }

        public static void main(String[] args) {
            Assert.assertEquals(3, getDistance("cat", "dog"));   // replace
            Assert.assertEquals(1, getDistance("cat", "cbt"));   // replace
            Assert.assertEquals(1, getDistance("cat", "ca"));    // delete
            Assert.assertEquals(1, getDistance("catx", "cat"));  // delete
            Assert.assertEquals(1, getDistance("ct", "cat"));    // insert
            Assert.assertEquals(2, getDistance("xcat", "caty")); // delete and insert
            Assert.assertEquals(3, getDistance("fast", "cats"));
            Assert.assertEquals(3, getDistance("cats", "fast"));
            Assert.assertEquals(3, getDistance("kitten", "sitting"));
            Assert.assertEquals(3, getDistance("sitting", "kitten"));
            Assert.assertEquals(2, getDistance("jary", "jerry"));
            Assert.assertEquals(2, getDistance("jerry", "jary"));
        }
    }

Summary

Learning the edit distance algorithm comes down to two points: first, knowing how to compute the edit distance between two strings with the matrix, and second, understanding why this computation works. Once both are mastered, writing the program is simple.
