There are many ways to calculate the similarity between two strings. This article uses the edit distance algorithm to calculate it.
Levenshtein distance (LD): LD measures how similar two strings are. The distance is the minimum number of single-character insertions, deletions, and substitutions needed to convert one string into the other.
Example:
If str1 = "test" and str2 = "test", then LD(str1, str2) = 0. No conversion is needed.
If str1 = "test" and str2 = "tent", then LD(str1, str2) = 1. Changing the "s" in str1 to "n" converts one character, so the distance is 1.
The larger the distance, the more different they are.
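To make the definition concrete, here is a minimal recursive sketch. It is written in Python for brevity (the article's own listing is C#), and the name `edit_distance` is illustrative rather than taken from the article. At each position it tries the three operations and keeps the cheapest one.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        # dist(i, j) is the distance between a[:i] and b[:j].
        if i == 0:
            return j  # insert the first j characters of b
        if j == 0:
            return i  # delete the first i characters of a
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            dist(i - 1, j) + 1,         # delete a[i-1]
            dist(i, j - 1) + 1,         # insert b[j-1]
            dist(i - 1, j - 1) + cost,  # keep or substitute
        )

    return dist(len(a), len(b))


print(edit_distance("test", "test"))  # 0
print(edit_distance("test", "tent"))  # 1
```

The recursion is only meant to mirror the definition. For long strings it can exceed Python's recursion limit, so a table-based version is the better fit there.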
Levenshtein distance was introduced by the Russian scientist Vladimir Levenshtein in 1965 and is named after him. If the name is hard to spell, it can also be called edit distance.
Levenshtein distance can be used for:
Spell checking
Speech recognition
DNA analysis
Plagiarism detection
LD stores the distance values in an (n + 1) * (m + 1) matrix, where n and m are the lengths of str1 and str2. The algorithm proceeds roughly as follows:
If the length of str1 or str2 is 0, return the length of the other string.
Initialize an (n + 1) * (m + 1) matrix d, and fill its first row and first column with values increasing from 0.
Scan the two strings (n * m steps). If the i-th character of str1 equals the j-th character of str2, set temp to 0; otherwise set temp to 1. Then assign d[i, j] the minimum of d[i-1, j] + 1, d[i, j-1] + 1, and d[i-1, j-1] + temp.
After scanning, return the last value of the matrix, d[n, m].
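The steps above can be sketched as follows. This is a minimal Python version (the article's own listing is C#). The function name `levenshtein` is an assumption for illustration, and the variable names follow the description above.

```python
def levenshtein(str1: str, str2: str) -> int:
    n, m = len(str1), len(str2)
    # Step 1: if either string is empty, the distance is the other's length.
    if n == 0:
        return m
    if m == 0:
        return n

    # Step 2: (n + 1) x (m + 1) matrix; first row and column count up from 0.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j

    # Step 3: fill each cell from its upper, left, and upper-left neighbors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            temp = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + temp,  # match or substitution
            )

    # Step 4: the bottom-right cell holds the distance.
    return d[n][m]


print(levenshtein("test", "tent"))  # 1
print(levenshtein("ABC", "BC"))     # 1
```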
What the algorithm returns is a distance. How do we turn that distance into a similarity? The maximum possible distance between two strings is the length of the longer one, so dividing by that length scales the distance into the range 0 to 1. The resulting measure is not very sensitive to the strings themselves. I define the similarity as 1 - distance / (length of the longer string).
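As a quick check of the formula, here is a small Python sketch (the article's own listing is C#). The names `levenshtein` and `similarity` are illustrative. The distance function keeps only two rows of the matrix, which is a space-saving variant and not the full-matrix method described above. Returning 1.0 for two empty strings is my assumption, since the formula would otherwise divide by zero.

```python
def levenshtein(str1: str, str2: str) -> int:
    # Two-row variant: previous holds row i-1 of the matrix, current row i.
    previous = list(range(len(str2) + 1))
    for i, ch1 in enumerate(str1, start=1):
        current = [i]
        for j, ch2 in enumerate(str2, start=1):
            temp = 0 if ch1 == ch2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + temp))  # match or substitution
        previous = current
    return previous[-1]


def similarity(str1: str, str2: str) -> float:
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein(str1, str2) / longest


print(similarity("test", "tent"))          # 0.75
print(f"{similarity('ABC', 'BC'):.3f}")    # 0.667
```

A distance of 1 gives 0.75 for the four-character pair but only about 0.667 for "ABC" and "BC", because the same single edit weighs more against a shorter string.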
Source code:
using System;

/**
 * Edit-distance similarity between two strings
 */
public class Similarity
{
    // Returns the smallest of three integers.
    private int min(int one, int two, int three)
    {
        int min = one;
        if (two < min)
        {
            min = two;
        }
        if (three < min)
        {
            min = three;
        }
        return min;
    }

    // Returns the Levenshtein distance between str1 and str2.
    public int LD(string str1, string str2)
    {
        int[,] d; // matrix
        int n = str1.Length;
        int m = str2.Length;
        int i; // index into str1
        int j; // index into str2
        char ch1; // current character of str1
        char ch2; // current character of str2
        int temp; // cost at a matrix position: 0 if the characters match, otherwise 1
        if (n == 0)
        {
            return m;
        }
        if (m == 0)
        {
            return n;
        }
        d = new int[n + 1, m + 1];
        for (i = 0; i <= n; i++)
        { // Initialize the first column
            d[i, 0] = i;
        }
        for (j = 0; j <= m; j++)
        { // Initialize the first row
            d[0, j] = j;
        }
        for (i = 1; i <= n; i++)
        { // Traverse str1
            ch1 = str1[i - 1];
            // Match against str2
            for (j = 1; j <= m; j++)
            {
                ch2 = str2[j - 1];
                if (ch1 == ch2)
                {
                    temp = 0;
                }
                else
                {
                    temp = 1;
                }
                // upper cell + 1, left cell + 1, upper-left cell + temp
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + temp);
            }
        }
        return d[n, m];
    }

    // Returns 1 - distance / (length of the longer string).
    public double SIM(string str1, string str2)
    {
        int maxLength = Math.Max(str1.Length, str2.Length);
        if (maxLength == 0)
        {
            return 1; // assumption: two empty strings count as identical
        }
        int ld = LD(str1, str2);
        return 1 - (double)ld / maxLength;
    }

    public static void Main(string[] args)
    {
        Similarity s = new Similarity();
        string str1 = "ABC";
        string str2 = "BC";
        Console.WriteLine("LD = " + s.LD(str1, str2));
        Console.WriteLine("Sim = " + s.SIM(str1, str2));
        Console.ReadLine();
    }
}
This article is from the CSDN blog. When reproducing it, please cite the source: http://blog.csdn.net/lkf0217/archive/2009/08/20/4466952.aspx