Levenshtein distance (LD): an algorithm for calculating the similarity of two strings
There are many ways to calculate the similarity between two strings. This article summarizes a similarity calculation based on the edit distance algorithm.
Levenshtein distance (LD): LD measures how similar two strings are. The distance is the minimum number of single-character insertions, deletions, and substitutions needed to convert the source string into the target string.
Example:
If str1 = "test" and str2 = "test", then LD(str1, str2) = 0. No conversion is needed.
If str1 = "test" and str2 = "tent", then LD(str1, str2) = 1. Changing the "s" in str1 to "n" substitutes one character, so the distance is 1.
The larger the distance, the more the two strings differ.
Levenshtein distance was introduced by the Russian scientist Vladimir Levenshtein in 1965 and is named after him. If the name is hard to spell, it can simply be called edit distance.
Levenshtein distance can be used for:
Spell checking
Speech recognition
DNA analysis
Plagiarism detection
LD stores the distance values in an (n + 1) * (m + 1) matrix, where n and m are the lengths of str1 and str2. The algorithm works roughly as follows:
If the length of str1 or str2 is 0, return the length of the other string.
Initialize an (n + 1) * (m + 1) matrix d, and fill its first row and first column with values increasing from 0.
Scan the two strings (n * m comparisons, counting characters from 1). If str1[i] = str2[j], set temp to 0; otherwise set temp to 1. Then assign d[i][j] the minimum of d[i-1][j] + 1, d[i][j-1] + 1, and d[i-1][j-1] + temp.
After scanning, return the last value of the matrix, d[n][m].
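The steps above can be written as a recurrence, which may make the matrix fill easier to follow. This is only a restatement of the description, using the same names d, temp, n, and m; the matrix for "test" and "tent" was worked out by hand for illustration (rows follow str1 = "test", columns follow str2 = "tent").

```latex
\begin{aligned}
d_{i,0} &= i \quad (0 \le i \le n), \qquad d_{0,j} = j \quad (0 \le j \le m),\\
\mathrm{temp}_{i,j} &=
  \begin{cases}
    0 & \text{if } \mathrm{str1}[i] = \mathrm{str2}[j],\\
    1 & \text{otherwise,}
  \end{cases}\\
d_{i,j} &= \min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + \mathrm{temp}_{i,j}\bigr)
  \quad (1 \le i \le n,\ 1 \le j \le m).
\end{aligned}

% Worked matrix for str1 = "test", str2 = "tent" (n = m = 4):
d =
\begin{pmatrix}
0 & 1 & 2 & 3 & 4\\
1 & 0 & 1 & 2 & 3\\
2 & 1 & 0 & 1 & 2\\
3 & 2 & 1 & 1 & 2\\
4 & 3 & 2 & 2 & 1
\end{pmatrix},
\qquad d_{4,4} = 1.
```

The bottom-right entry is 1, which matches the "test" / "tent" example given earlier.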
The algorithm returns a distance. How can a similarity be derived from it? The distance can never exceed the length of the longer of the two strings, so dividing by that length gives a value between 0 and 1 that does not depend strongly on how long the strings are. The similarity formula used here is therefore: similarity = 1 - distance / maximum string length.
// Requires "using System;" at the top of the file.
private Int32 levenshtein(String a, String b)
{
    // If one string is null or empty, the distance is the length of the other.
    if (string.IsNullOrEmpty(a))
    {
        if (!string.IsNullOrEmpty(b))
        {
            return b.Length;
        }
        return 0;
    }
    if (string.IsNullOrEmpty(b))
    {
        // a is known to be non-empty at this point.
        return a.Length;
    }
    Int32 cost;
    Int32[,] d = new int[a.Length + 1, b.Length + 1];
    Int32 min1;
    Int32 min2;
    Int32 min3;
    // First column: distance from each prefix of a to the empty string.
    for (Int32 i = 0; i <= d.GetUpperBound(0); i += 1)
    {
        d[i, 0] = i;
    }
    // First row: distance from the empty string to each prefix of b.
    for (Int32 i = 0; i <= d.GetUpperBound(1); i += 1)
    {
        d[0, i] = i;
    }
    for (Int32 i = 1; i <= d.GetUpperBound(0); i += 1)
    {
        for (Int32 j = 1; j <= d.GetUpperBound(1); j += 1)
        {
            // cost is 0 when the characters match, 1 otherwise.
            cost = Convert.ToInt32(!(a[i - 1] == b[j - 1]));
            min1 = d[i - 1, j] + 1;        // deletion
            min2 = d[i, j - 1] + 1;        // insertion
            min3 = d[i - 1, j - 1] + cost; // substitution (or match)
            d[i, j] = Math.Min(Math.Min(min1, min2), min3);
        }
    }
    // The bottom-right cell holds the distance between the full strings.
    return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
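The similarity formula given earlier (1 - distance / maximum string length) could be applied along the following lines. This is a minimal sketch, not the author's code. The class name StringSimilarity and the method names Distance and Similarity are hypothetical. Treating two empty strings as fully similar (1.0) is an assumption made here to avoid dividing by zero. The distance routine is repeated in compact form only so that the block runs on its own.

```csharp
using System;

// Hypothetical helper class; the names are illustrative and not from the original article.
public static class StringSimilarity
{
    // Matrix-based edit distance, following the same steps as the function above.
    public static int Distance(string a, string b)
    {
        a = a ?? string.Empty;
        b = b ?? string.Empty;
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // similarity = 1 - distance / max(length of a, length of b).
    // Assumption: two empty (or null) strings are treated as identical and return 1.0.
    public static double Similarity(string a, string b)
    {
        int maxLength = Math.Max((a ?? string.Empty).Length, (b ?? string.Empty).Length);
        if (maxLength == 0)
        {
            return 1.0;
        }
        return 1.0 - (double)Distance(a, b) / maxLength;
    }
}
```

For the earlier example, StringSimilarity.Similarity("test", "tent") gives 1 - 1/4 = 0.75, and two identical strings give 1.0.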