[Many before me have written about the Levenshtein distance algorithm, with in-depth analysis and explanation of its principles. I want to explain the algorithm in easier-to-understand language. If anything here is wrong, please point it out and forgive me.]
Levenshtein distance is the minimum number of single-character edits (insertions, deletions, and substitutions) required to change string A into string B. It is widely used; here we use it to calculate the similarity between two strings.
I will not explain the principle of the algorithm (for that, please refer to http://en.wikipedia.org/wiki/Levenshtein_distance); here I only walk through the implementation process.
[Example] Take the source string "Jary" and the target string "Jerry", and find the edit distance from the source string to the target string.
The process, step by step, is as follows:
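Step 1: Initialize the matrix. The first row holds 0 through the length of the source string, and the first column holds 0 through the length of the target string.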
Step 2: Start from the first character ("J") of the source string and compare it with each character of the target string, from top to bottom.
For each cell, look at the three neighboring cells: left, top, and top-left. Add 1 to the left value and 1 to the top value. Add 0 to the top-left value if the two characters are equal, or 1 if they differ. The cell takes the minimum of these three results.
First, compare the first character "J" of the source string with the "J" of the target string. The left and top values are both 1 and the top-left value is 0; because the two characters are equal, 0 is added to the top-left value, so the cell is 0. Then compare "J"→"e", "J"→"r", "J"→"r", and "J"→"y" to finish scanning the target string.
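The cell rule in step 2 can be sketched as a small C# function. This is only an illustration, not part of the original code; the names CellRuleSketch and NextCell are hypothetical.

```csharp
using System;

public static class CellRuleSketch
{
    // Value of one matrix cell, given its three already-filled neighbors.
    // Moving in from the left or from the top always costs one edit.
    // Moving in along the diagonal costs 0 when the two characters match
    // and 1 (a substitution) when they differ.
    public static int NextCell(int left, int top, int topLeft, char a, char b)
    {
        int cost = (a == b) ? 0 : 1;
        return Math.Min(Math.Min(left + 1, top + 1), topLeft + cost);
    }
}
```

For the first comparison ("J" against "J") the neighbors are left = 1, top = 1, top-left = 0, so NextCell(1, 1, 0, 'J', 'J') gives 0. For the next one ("J" against "e") they are left = 2, top = 0, top-left = 1, so the cell is 1.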
Step 3: Traverse the rest of the source string in the same way, comparing each character with the target string.
Step 4: After the last column has been scanned, the last cell (bottom right) holds the shortest edit distance, which is 2 for this example.
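To make steps 1 to 4 concrete, here is a minimal sketch that fills the whole matrix and renders it as text. It assumes the layout described above (source characters as columns, target characters as rows); the names MatrixSketch, Build, and Render are hypothetical.

```csharp
using System;
using System.Text;

public static class MatrixSketch
{
    // Builds the (target.Length + 1) x (source.Length + 1) matrix.
    public static int[,] Build(string source, string target)
    {
        int cols = source.Length, rows = target.Length;
        var m = new int[rows + 1, cols + 1];
        for (int c = 0; c <= cols; c++) m[0, c] = c;   // step 1: first row
        for (int r = 0; r <= rows; r++) m[r, 0] = r;   // step 1: first column

        // Steps 2 and 3: one source character (column) at a time, top to bottom.
        for (int c = 1; c <= cols; c++)
        {
            for (int r = 1; r <= rows; r++)
            {
                int cost = source[c - 1] == target[r - 1] ? 0 : 1;
                m[r, c] = Math.Min(Math.Min(m[r, c - 1] + 1, m[r - 1, c] + 1),
                                   m[r - 1, c - 1] + cost);
            }
        }
        return m; // step 4: the bottom-right cell is the edit distance
    }

    // Renders the matrix one row per line, values separated by spaces.
    public static string Render(int[,] m)
    {
        var sb = new StringBuilder();
        for (int r = 0; r < m.GetLength(0); r++)
        {
            for (int c = 0; c < m.GetLength(1); c++)
            {
                if (c > 0) sb.Append(' ');
                sb.Append(m[r, c]);
            }
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

Calling MatrixSketch.Render(MatrixSketch.Build("Jary", "Jerry")) gives the rows below (columns: empty, J, a, r, y; rows: empty, J, e, r, r, y):

0 1 2 3 4
1 0 1 2 3
2 1 1 2 3
3 2 2 1 2
4 3 3 2 2
5 4 4 3 2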
Once the edit distance is obtained, the similarity between the two strings is similarity = (max(x, y) - levenshtein) / max(x, y), where x and y are the lengths of the source and target strings. (The calculation formula has been revised.)
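As a worked instance of the formula, the sketch below takes the two lengths and the distance as plain numbers. The names SimilaritySketch and FromDistance are hypothetical, and returning 1 for two empty strings is an assumption, since the formula itself is undefined when max(x, y) is 0.

```csharp
using System;

public static class SimilaritySketch
{
    // similarity = (max(x, y) - distance) / max(x, y)
    // x and y are the lengths of the source and target strings.
    public static double FromDistance(int x, int y, int distance)
    {
        int longest = Math.Max(x, y);
        if (longest == 0) return 1.0; // assumption: two empty strings count as identical
        return (double)(longest - distance) / longest;
    }
}
```

For "Jary" (length 4) and "Jerry" (length 5) with an edit distance of 2, this is (5 - 2) / 5 = 0.6.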
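LCS algorithm: used to find the longest common subsequence between two strings. Note that the procedure described here resets a cell to 0 on a mismatch, so what it actually measures is the longest run of consecutive characters shared by the two strings (the longest common substring).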
[Example] Given the two strings "Zhang zizhi" and "Zhang zizhi", the solution proceeds as follows:
Step 1: Initialize the matrix. Then compare each character of "Zhang zezhi" with "Zhang". If the two characters are the same, the cell is 1 + the value of the previous step (the upper-left diagonal cell); if they differ, the cell is 0.
Step 2: Compare the source string with the target string character by character. The second pass here compares "Zhang zezhi" with "ze", using the same rule: a match gives 1 + the upper-left diagonal value, a mismatch gives 0.
Step 3: Complete the matrix. Scan it for the maximum value; that maximum is the length of the longest common subsequence.
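The three steps above can be sketched as follows. This is a minimal illustration of the rule as described (a match extends the diagonal, a mismatch resets to 0), with rows indexed by the source string and columns by the target string; the names CommonRunSketch, Build, and Largest are hypothetical.

```csharp
using System;

public static class CommonRunSketch
{
    // Fills the table: a matching pair of characters gets 1 + the upper-left
    // diagonal value, a mismatch gets 0. Row 0 and column 0 stay at 0.
    public static int[,] Build(string source, string target)
    {
        var table = new int[source.Length + 1, target.Length + 1];
        for (int i = 1; i <= source.Length; i++)
        {
            for (int j = 1; j <= target.Length; j++)
            {
                table[i, j] = source[i - 1] == target[j - 1]
                    ? table[i - 1, j - 1] + 1
                    : 0;
            }
        }
        return table;
    }

    // Scans the whole table and returns the largest value in it.
    public static int Largest(int[,] table)
    {
        int best = 0;
        foreach (int value in table) best = Math.Max(best, value);
        return best;
    }
}
```

Using the earlier pair "Jary" and "Jerry", the largest value in the table is 2, reached at the final "y" because "ry" appears in both strings.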
With the edit distance and the longest common subsequence calculated, use the formula provided by hichina Yike to compute the similarity: S(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B))
[Note: in practical applications the similarity formula based on edit distance alone is insufficient. Thanks to wancang yi for pointing out that in practice it should be combined with the LCS algorithm; see http://www.cnblogs.com/grenet/archive/2010/06/04/1751147.html]
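As a worked instance of S(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B)), the sketch below takes the two values as plain numbers. The names CombinedSketch and Score are hypothetical, and returning 1 when both values are 0 (which only happens for two empty strings) is an assumption, since the formula is undefined there.

```csharp
using System;

public static class CombinedSketch
{
    // S(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B))
    public static double Score(int ld, int lcs)
    {
        if (ld + lcs == 0) return 1.0; // assumption: both strings empty, treated as identical
        return (double)lcs / (ld + lcs);
    }
}
```

For "Jary" and "Jerry", the edit distance is 2 and the LCS matrix procedure described earlier yields 2, so the score is 2 / (2 + 2) = 0.5.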
Code Implementation (using C#)
1. Calculate the edit distance:
// Requires: using System;
public static int LevenshteinDistance(string source, string target)
{
    int cell = source.Length;   // number of columns (source characters)
    int row = target.Length;    // number of rows (target characters)
    if (cell == 0) { return row; }
    if (row == 0) { return cell; }

    int[,] matrix = new int[row + 1, cell + 1];
    // Step 1: the first row and first column hold 0..length.
    for (var i = 0; i <= cell; i++) { matrix[0, i] = i; }
    for (var j = 1; j <= row; j++) { matrix[j, 0] = j; }

    for (var k = 0; k < row; k++)
    {
        for (var l = 0; l < cell; l++)
        {
            // Substitution cost: 0 if the characters match, 1 otherwise.
            var tmp = source[l].Equals(target[k]) ? 0 : 1;
            matrix[k + 1, l + 1] = Math.Min(
                Math.Min(matrix[k, l] + tmp, matrix[k + 1, l] + 1),
                matrix[k, l + 1] + 1);
        }
    }
    // The bottom-right cell is the edit distance.
    return matrix[row, cell];
}
2. LCS algorithm code:
// Requires: using System.Linq;
public static int LongestCommonSubsequence(string source, string target)
{
    if (source.Length == 0 || target.Length == 0)
        return 0;
    int[,] subsequence = new int[source.Length + 1, target.Length + 1];
    for (int i = 0; i < source.Length; i++)
    {
        for (int j = 0; j < target.Length; j++)
        {
            if (source[i].Equals(target[j]))
                subsequence[i + 1, j + 1] = subsequence[i, j] + 1;  // match: extend the diagonal
            else
                subsequence[i + 1, j + 1] = 0;                       // mismatch: reset to 0
        }
    }
    // The largest value in the matrix is the result.
    int maxSubsequenceLength = subsequence.Cast<int>().Max();
    return maxSubsequenceLength;
}
3. Calculate the similarity between two strings (edit distance only):
// Requires: using System;
public static float StringSimilarity(string source, string target)
{
    var ld = LevenshteinDistance(source, target);
    var maxLength = Math.Max(source.Length, target.Length);
    if (maxLength == 0) return 1f;  // both strings empty: avoid 0/0 and treat them as identical
    return (float)(maxLength - ld) / maxLength;
}
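4. Calculate the similarity between two strings (edit distance combined with LCS; this is an alternative to the previous method and has the same name and signature, so keep only one of the two in a given class):
public static float StringSimilarity(string source, string target)
{
    var ld = LevenshteinDistance(source, target);
    var lcs = LongestCommonSubsequence(source, target);
    if (ld + lcs == 0) return 1f;  // both strings empty: avoid 0/0 and treat them as identical
    return ((float)lcs) / (ld + lcs);
}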
[Modification] When I first wrote the LD algorithm I simplified some of the code, which introduced calculation errors and made the results incorrect. This has been corrected. Thanks to the chaotic world for pointing it out.