The edit distance is the minimum number of single-character changes (insertions, deletions, or replacements) required to transform one string into another. For example, to turn son into sun, s stays the same, o is replaced by u, and n stays the same, so the edit distance is 1.
To see how the edit distance is computed, we draw a two-dimensional table, taking beauty and batyu as an example:
The cell marked 1 in the figure is where the first character [b] of each word meets. Its value is determined by the value above it (1), the value to its left (1), and the value in its upper-left corner (0). First, compare the characters belonging to the cell's row and column (for cell 3 these are a and b): if they are equal, add 0 to the upper-left value, otherwise add 1. At cell 1, [b]=[b], so the upper-left value plus 0 gives 0+0=0; at cell 2 the two characters differ, so the upper-left value plus 1 gives 1+1=2. Next, add 1 to the value of the left cell and 1 to the value of the cell above. Finally, take the minimum of the three results as the value of the cell. At cell 1 the three results (upper-left, left, above) are (0,2,2), so the value of cell 1 is 0; at cell 3 they are (2,3,1), so the value of cell 3 is 1.
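The cell rule described above can be sketched in Python as follows. This is a minimal illustration, not the article's own code: the function name `levenshtein_table` and the choice to put batyu on the rows and beauty on the columns are assumptions made for the example.

```python
def levenshtein_table(s, t):
    """Build the full (len(s)+1) x (len(t)+1) edit-distance table for s and t."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]

    # First column: turning s[:i] into the empty string takes i deletions.
    for i in range(n + 1):
        d[i][0] = i
    # First row: building t[:j] from the empty string takes j insertions.
    for j in range(m + 1):
        d[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # 0 if the row and column characters match, otherwise 1.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + cost,  # upper-left value plus cost
                d[i][j - 1] + 1,         # left value plus 1
                d[i - 1][j] + 1,         # value above plus 1
            )
    return d


table = levenshtein_table("batyu", "beauty")
for row in table:
    print(row)
print("edit distance:", table[-1][-1])
```

The bottom-right entry of the returned table is the edit distance between the two words.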
Algorithm proof
This algorithm computes the minimum number of operations (the so-called edit distance) required to convert s[1...i] to t[1...j] (for example, to convert beauty to batyu) and stores it in d[i,j], where d is the two-dimensional table shown.
- The first row and the first column are certainly correct and easy to understand. For example, converting beauty to an empty string takes as many operations as beauty has characters (each operation discards one character); likewise, building a word from an empty string takes one insertion per character.
- For every other cell, there are three possible things we can do with the last characters:
- If we can convert s[1...i] to t[1...j-1] in k operations, we only need to append t[j] at the end to convert s[1...i] to t[1...j], for k+1 operations.
- If we can convert s[1...i-1] to t[1...j] in k operations, we only need to delete s[i] from the end to complete the conversion, for k+1 operations.
- If we can convert s[1...i-1] to t[1...j-1] in k operations, we only need to replace s[i] with t[j] when they differ (s[i] != t[j]), for k+cost operations, where cost is 0 if s[i] == t[j] and 1 otherwise.
- Converting s[1...n] to t[1...m] means converting all of s to all of t, so d[n,m] (the bottom-right corner of the table) is the result we need.
The reasoning above only shows that each of these routes produces a valid conversion; by itself it does not prove that the result is minimal (that is, that we get the fewest conversion steps). For that reason the algorithm sets d[i,j] to the smallest of the operation counts given by the three cases above. Since every conversion must end with one of these three moves, this ensures that the result is the smallest number of operations.
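Written as a formula, the rule for d[i,j] described above reads as follows. The notation cost(i,j) is introduced here only as shorthand for the 0-or-1 replacement cost.

```latex
\begin{aligned}
d[i,0] &= i, \qquad d[0,j] = j, \\
d[i,j] &= \min
\begin{cases}
d[i,j-1] + 1 & \text{(append } t[j]\text{)} \\
d[i-1,j] + 1 & \text{(delete } s[i]\text{)} \\
d[i-1,j-1] + \mathrm{cost}(i,j) & \text{(keep or replace } s[i]\text{)}
\end{cases} \\
\mathrm{cost}(i,j) &=
\begin{cases}
0 & \text{if } s[i] = t[j] \\
1 & \text{otherwise}
\end{cases}
\end{aligned}
```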
Possible improvements
- The space complexity is currently O(mn); it can be reduced to O(m), because the algorithm only needs to keep the previous row and the current row in memory.
- If the conversion steps need to be reproduced, we can record the position and the operation chosen at each step and replay them afterwards.
- If we only need to know whether the number of conversion steps is below a given constant k, it is enough to compute a diagonal band of width 2k+1 in the table. The time complexity then drops to O(kl), where l is the length of the shorter of the two strings.
- We can give the three operations (insert, delete, replace) different weights to refine the comparison. The current algorithm gives each a weight of 1; we could instead use, say, 1 for insert, 0 for delete, and 2 for replace.
- If we initialize all cells in the first row to 0, the algorithm can be used for fuzzy string search. It then gives the position (index) of the last character of the substring that best matches the search string. If we also need the starting position of that substring, we have to record the operations taken at each step and work backwards from the end position.
- The algorithm parallelizes poorly, so it gains little from parallel hardware when processing very long strings. However, the cost values (whether the two characters at a given position are equal) can be computed in parallel before running the algorithm for the overall calculation.
- The time complexity can be reduced to O(m(1+d)), where d is the resulting edit distance, by examining diagonals instead of whole rows and using lazy evaluation. This greatly speeds up the comparison when the two strings are very similar.
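The two-row and weighting ideas from the list can be combined in one short sketch. The function name `weighted_distance` and its parameter names are hypothetical; with the default weights of 1 it computes the ordinary edit distance.

```python
def weighted_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Edit distance from s to t, keeping only two rows of the table in memory."""
    # Row 0: building t[:j] from the empty string takes j insertions.
    prev = [j * ins_cost for j in range(len(t) + 1)]
    for i, sc in enumerate(s, start=1):
        # Column 0: reducing s[:i] to the empty string takes i deletions.
        curr = [i * del_cost]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else sub_cost
            curr.append(min(
                prev[j] + del_cost,      # delete s[i]
                curr[j - 1] + ins_cost,  # insert t[j]
                prev[j - 1] + cost,      # keep or replace s[i]
            ))
        prev = curr  # the current row becomes the previous row
    return prev[-1]


print(weighted_distance("beauty", "batyu"))
print(weighted_distance("son", "sun", sub_cost=3))
```

Only `prev` and `curr` are alive at any time, so the memory used grows with the length of `t` rather than with the product of both lengths.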
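To reproduce the conversion steps, one possible approach (an assumption of this sketch, not necessarily what the article has in mind) is to keep the full table and walk back from the bottom-right corner, checking which of the three cases produced each cell. The function name `edit_script` and the tuple layout of the operations are hypothetical.

```python
def edit_script(s, t):
    """Return one minimal list of operations that turns s into t."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back from d[n][m] to d[0][0], recording the operation behind each step.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if s[i - 1] == t[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("replace", i - 1, s[i - 1], t[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1, s[i - 1]))
            i -= 1
        else:
            ops.append(("insert", i, t[j - 1]))
            j -= 1
    ops.reverse()
    return ops


for op in edit_script("beauty", "batyu"):
    print(op)
```

The indices in each tuple refer to positions in the original string `s`. Several minimal scripts may exist; this walk simply prefers the diagonal step when it is available.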
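The threshold idea can be sketched like this. The function name `within_distance` is hypothetical, and for readability the sketch allocates full-length rows, so only the inner loop is restricted to the band of 2k+1 cells; a tuned version would store just the band.

```python
def within_distance(s, t, k):
    """Return the edit distance of s and t if it is at most k, otherwise None."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return None  # the length difference alone already needs more than k steps

    big = k + 1  # stands for "more than k"; cells outside the band keep this value
    prev = [j if j <= k else big for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [big] * (m + 1)
        if i <= k:
            curr[0] = i
        # Only cells with |i - j| <= k can hold a value of k or less.
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost, big)
        prev = curr
    return prev[m] if prev[m] <= k else None


print(within_distance("beauty", "batyu", 3))
print(within_distance("beauty", "batyu", 2))
```

Capping every value at k+1 is safe because a cell that already exceeds k can never lead to a final result of k or less.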
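The fuzzy-search variant might look like the sketch below, assuming the search string runs along the rows and the text along the columns, so that the first row (empty search string) is all zeros. The name `fuzzy_find` is hypothetical, and the sketch reports only the end position, as the list item describes.

```python
def fuzzy_find(pattern, text):
    """Return (best_distance, end) where text[:end] ends with the best match."""
    # Row 0 is all zeros: a match may start at any position in the text for free.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        curr = [i]  # matching pattern[:i] against nothing costs i deletions
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    best = min(prev)
    end = prev.index(best)  # first end position that reaches the best distance
    return best, end


print(fuzzy_find("sun", "the son rises"))
```

Here `end` counts how many characters of the text have been consumed, so the last character of the match is `text[end - 1]`.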
Levenshtein distance (edit distance) algorithm explained in detail