Definition
Given two strings S1 and S2, their edit distance is the minimum number of edit operations needed to convert S1 into S2 (which equals the minimum number needed to convert S2 into S1).
There are three edit operations: inserting a character, deleting a character, and substituting a character.
For example, the edit distance from cat to cbt is 1 (replace a with b), from cat to ca is 1 (delete t), and from ct to cat is 1 (insert a); the edit distance from xcat to caty is 2 (delete x, insert y).
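To make the three operations concrete, here is a minimal Java sketch that applies each of them to a string by index. The class name EditOps and its method names are illustrative assumptions, not something the article defines.

```java
// Hypothetical helper class: names are illustrative, not from the article.
final class EditOps {
    // Insert character c so that it ends up at position index.
    static String insert(String s, int index, char c) {
        return new StringBuilder(s).insert(index, c).toString();
    }

    // Delete the character at position index.
    static String delete(String s, int index) {
        return new StringBuilder(s).deleteCharAt(index).toString();
    }

    // Substitute the character at position index with c.
    static String substitute(String s, int index, char c) {
        StringBuilder sb = new StringBuilder(s);
        sb.setCharAt(index, c);
        return sb.toString();
    }

    public static void main(String[] args) {
        // xcat -> cat -> caty: two operations, matching the distance of 2 above.
        String step1 = delete("xcat", 0);
        String step2 = insert(step1, step1.length(), 'y');
        System.out.println(step1 + " -> " + step2);     // cat -> caty
        System.out.println(substitute("cat", 1, 'b'));  // cbt
    }
}
```

Each call counts as one edit operation, so the number of calls in a conversion is an upper bound on the edit distance between the start and end strings.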
Solution method
Now that we know the definition of edit distance, how do we find the minimum edit distance? The idea of dynamic programming is used here.
As an example, to find the minimum edit distance between jary and jerry, we first create the following matrix:
|   |   | j | a | r | y |
|   | 0 | 1 | 2 | 3 | 4 |
| j | 1 |   |   |   |   |
| e | 2 |   |   |   |   |
| r | 3 |   |   |   |   |
| r | 4 |   |   |   |   |
| y | 5 |   |   |   |   |
What does this matrix mean? The first row is the string jary and the first column is the string jerry. Each cell holding a number represents the minimum edit distance between a substring (a prefix) of each of the two strings. The 0 in the second row, second column means that both substrings are the empty string, so the edit distance is 0 (the substrings are equal). The 1 in the second row, third column means that when the substring of jerry is empty and the substring of jary is j, the minimum edit distance between the two substrings is 1 (insert j into the empty substring of jerry). Continuing in the same way, the numbers in the second row and the second column of the matrix are easy to fill in.
The minimum edit distance we are looking for is the cell in the lower right corner of the matrix. It represents the minimum edit distance when the substring of jary is all of jary and the substring of jerry is all of jerry, which is the minimum edit distance of the two whole strings.
I will first show how to compute it and then explain the principle. In the matrix below, the blank cells in the middle are labeled x1 to x20, where the number after x is the order in which the cells are solved.
|   |   | j  | a   | r   | y   |
|   | 0 | 1  | 2   | 3   | 4   |
| j | 1 | x1 | x6  | x11 | x16 |
| e | 2 | x2 | x7  | x12 | x17 |
| r | 3 | x3 | x8  | x13 | x18 |
| r | 4 | x4 | x9  | x14 | x19 |
| y | 5 | x5 | x10 | x15 | x20 |
If the cells are solved in this order, then whenever a cell is being solved, the values of the cells to its left, above it, and to its upper left are already known. Call these three values left, top, and lefttop. The value V of the cell being solved is then:
cost = 0 if the character heading the cell's column equals the character heading the cell's row, otherwise 1
V = min(left + 1, top + 1, lefttop + cost)
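The rule for a single cell can be sketched as a small Java function. The class name CellRule and the method name cellValue are assumptions made for illustration; the two sample calls compute the cells labeled x1 and x6 in the matrix above.

```java
// Hypothetical helper: computes one cell from its three known neighbours.
final class CellRule {
    static int cellValue(int left, int top, int leftTop, char colChar, char rowChar) {
        // cost is 0 when the column and row characters match, otherwise 1.
        int cost = (colChar == rowChar) ? 0 : 1;
        return Math.min(Math.min(left + 1, top + 1), leftTop + cost);
    }

    public static void main(String[] args) {
        // x1: row j, column j; left = 1, top = 1, lefttop = 0.
        System.out.println(cellValue(1, 1, 0, 'j', 'j')); // 0
        // x6: row j, column a; left = x1 = 0, top = 2, lefttop = 1.
        System.out.println(cellValue(0, 2, 1, 'a', 'j')); // 1
    }
}
```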
The matrix after every cell has been solved with this method:
|   |   | j | a | r | y |
|   | 0 | 1 | 2 | 3 | 4 |
| j | 1 | 0 | 1 | 2 | 3 |
| e | 2 | 1 | 1 | 2 | 3 |
| r | 3 | 2 | 2 | 1 | 2 |
| r | 4 | 3 | 3 | 2 | 2 |
| y | 5 | 4 | 4 | 3 | 2 |
The value in the lower right corner is 2, so the edit distance between jary and jerry is 2 (replace a with e, insert an r).
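As a check on the worked example, the following Java sketch fills the matrix column by column, top to bottom (the solving order used in the walkthrough), and prints every row. The class name MatrixFill and the parameter names colString and rowString are assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch: fills the whole matrix and prints it row by row.
final class MatrixFill {
    // colString runs along the top of the matrix, rowString down the side.
    static int[][] fill(String colString, String rowString) {
        int rows = rowString.length();
        int cols = colString.length();
        int[][] m = new int[rows + 1][cols + 1];
        for (int j = 0; j <= cols; j++) m[0][j] = j; // first number row
        for (int i = 0; i <= rows; i++) m[i][0] = i; // first number column
        // Column by column, top to bottom within each column.
        for (int j = 1; j <= cols; j++) {
            for (int i = 1; i <= rows; i++) {
                int cost = colString.charAt(j - 1) == rowString.charAt(i - 1) ? 0 : 1;
                int left = m[i][j - 1];
                int top = m[i - 1][j];
                int leftTop = m[i - 1][j - 1];
                m[i][j] = Math.min(Math.min(left + 1, top + 1), leftTop + cost);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        for (int[] row : fill("jary", "jerry")) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Filling row by row would give the same result, since either order guarantees that the left, top, and upper-left neighbours are computed before the cell that needs them.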
Principle of solution
We can use the matrix to find the minimum edit distance of two strings, but why does this work? It is actually quite simple. Suppose we want the edit distance from the string s[1...i] to the string t[1...j]:
- If we know that s[1...i-1] can be converted to t[1...j] in k1 operations, then s[1...i] can be converted to t[1...j] in k1+1 operations: one delete operation removes s[i], turning s[1...i] into s[1...i-1], and then k1 operations convert that into t[1...j].
- If we know that s[1...i] can be converted to t[1...j-1] in k2 operations, then s[1...i] can be converted to t[1...j] in k2+1 operations: first use k2 operations to turn s[1...i] into t[1...j-1], then perform one insert operation that appends t[j] at the end.
- If we know that s[1...i-1] can be converted to t[1...j-1] in k3 operations, then: if s[i] == t[j], converting s[1...i] to t[1...j] takes only k3 operations; if s[i] != t[j], one additional replace operation is needed to replace s[i] with t[j], for k3+1 operations in total.
The k1, k2, and k3 in the three cases above correspond to the values in the left, top, and upper-left cells of a cell in the matrix.
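The edit distance is the smallest result among the three cases. The above conclusions can be expressed as follows, where d(i, j) denotes the edit distance from s[1...i] to t[1...j]:
d(i, 0) = i
d(0, j) = j
d(i, j) = min(d(i-1, j) + 1, d(i, j-1) + 1, d(i-1, j-1) + cost), where cost = 0 if s[i] == t[j], otherwise 1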
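The three cases can also be written directly as a recursive function with memoization, which mirrors the reasoning more literally than filling a matrix. This is a minimal sketch under assumed names (EditDistanceRecursive, distance, solve); the article's own implementation uses the matrix instead.

```java
import java.util.Arrays;

// Hypothetical top-down version of the three cases, with memoization.
final class EditDistanceRecursive {
    static int distance(String s, String t) {
        int[][] memo = new int[s.length() + 1][t.length() + 1];
        for (int[] row : memo) {
            Arrays.fill(row, -1); // -1 marks "not computed yet"
        }
        return solve(s, t, s.length(), t.length(), memo);
    }

    // Edit distance from the first i characters of s to the first j of t.
    private static int solve(String s, String t, int i, int j, int[][] memo) {
        if (i == 0) return j; // insert the j characters of t
        if (j == 0) return i; // delete the i characters of s
        if (memo[i][j] >= 0) return memo[i][j];
        int k1 = solve(s, t, i - 1, j, memo);     // case 1: then delete s[i]
        int k2 = solve(s, t, i, j - 1, memo);     // case 2: then insert t[j]
        int k3 = solve(s, t, i - 1, j - 1, memo); // case 3: match or replace
        int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
        memo[i][j] = Math.min(Math.min(k1 + 1, k2 + 1), k3 + cost);
        return memo[i][j];
    }

    public static void main(String[] args) {
        System.out.println(distance("jary", "jerry")); // 2
    }
}
```

The memo table holds the same values as the matrix, so both approaches do the same amount of work; the recursive form just computes cells on demand.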
Implementation code
Once the principle is understood, writing the code is straightforward: it simply simulates the process of filling in the matrix (Java implementation):
    package common;

    import org.junit.Assert;

    public class LevenshteinDistance {

        public static int getDistance(String src, String des) {
            // m[i][j] is the edit distance between the first i characters
            // of des and the first j characters of src.
            int[][] m = new int[des.length() + 1][src.length() + 1];

            // First row: the prefix of des is empty.
            for (int j = 0; j < src.length() + 1; j++) {
                m[0][j] = j;
            }
            // First column: the prefix of src is empty.
            for (int i = 0; i < des.length() + 1; i++) {
                m[i][0] = i;
            }

            for (int i = 1; i < des.length() + 1; i++) {
                for (int j = 1; j < src.length() + 1; j++) {
                    // Replacement cost: 0 if the characters match, otherwise 1.
                    int rcost = des.charAt(i - 1) == src.charAt(j - 1) ? 0 : 1;
                    m[i][j] = Math.min(
                            Math.min(m[i - 1][j] + 1, m[i - 1][j - 1] + rcost),
                            m[i][j - 1] + 1);
                }
            }
            return m[des.length()][src.length()];
        }

        public static void main(String[] args) {
            Assert.assertEquals(3, getDistance("cat", "dog"));   // replace
            Assert.assertEquals(1, getDistance("cat", "cbt"));   // replace
            Assert.assertEquals(1, getDistance("cat", "ca"));    // delete
            Assert.assertEquals(1, getDistance("catx", "cat"));  // delete
            Assert.assertEquals(1, getDistance("ct", "cat"));    // insert
            Assert.assertEquals(2, getDistance("xcat", "caty")); // delete and insert
            Assert.assertEquals(3, getDistance("fast", "cats"));
            Assert.assertEquals(3, getDistance("cats", "fast"));
            Assert.assertEquals(3, getDistance("kitten", "sitting"));
            Assert.assertEquals(3, getDistance("sitting", "kitten"));
            Assert.assertEquals(2, getDistance("jary", "jerry"));
            Assert.assertEquals(2, getDistance("jerry", "jary"));
        }
    }
Summary
Learning the edit distance algorithm comes down to two points: first, knowing how to compute the edit distance of two strings with the matrix; second, understanding why this computation is correct. Once both are mastered, writing the program is very simple.