Levenshtein distance (LD): an algorithm for calculating the similarity of two strings


There are many ways to calculate the similarity between two strings. This article describes one algorithm for doing so.

Levenshtein distance (LD): LD measures how similar two strings are. The distance is the minimum number of single-character insertions, deletions, and substitutions needed to convert one string into the other.

Example:

If str1 = "test" and str2 = "test", then LD(str1, str2) = 0. No conversion is needed.
If str1 = "test" and str2 = "tent", then LD(str1, str2) = 1. Changing the "s" in str1 to "n" substitutes one character, so the distance is 1.
The larger the distance, the more the two strings differ.
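
The definition can be checked directly on the two examples above. The following is a minimal sketch in Python (chosen here only for brevity; it is not the article's own code) that follows the definition recursively. The function name `ld_recursive` and its inner helper `go` are hypothetical names used for illustration.

```python
from functools import lru_cache


def ld_recursive(str1: str, str2: str) -> int:
    """Levenshtein distance computed straight from the definition (illustrative helper)."""

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> int:
        # i and j are the lengths of the prefixes of str1 and str2 being compared.
        if i == 0:
            return j  # insert the remaining j characters of str2
        if j == 0:
            return i  # delete the remaining i characters of str1
        cost = 0 if str1[i - 1] == str2[j - 1] else 1
        return min(
            go(i - 1, j) + 1,         # delete a character from str1
            go(i, j - 1) + 1,         # insert a character into str1
            go(i - 1, j - 1) + cost,  # keep the character or substitute it
        )

    return go(len(str1), len(str2))


print(ld_recursive("test", "test"))  # 0
print(ld_recursive("test", "tent"))  # 1
```

Because it recurses once per character, this form is only suitable for short strings; it is meant to mirror the definition, not to be fast.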

Levenshtein distance was introduced by the Russian scientist Vladimir Levenshtein in 1965 and is named after him. If the name is hard to spell, it can also be called edit distance.

Levenshtein distance can be used for:

Spell checking
Speech recognition
DNA analysis
Plagiarism detection

LD stores the distance values in an (n + 1) * (m + 1) matrix, where n and m are the lengths of str1 and str2. The approximate algorithm process is:

If the length of str1 or str2 is 0, return the length of the other string.
Initialize an (n + 1) * (m + 1) matrix d, and fill the first row and the first column with values increasing from 0.
Scan the two strings (n * m comparisons). If the i-th character of str1 equals the j-th character of str2, set temp to 0; otherwise set temp to 1. Then assign d[i][j] the minimum of d[i-1][j] + 1, d[i][j-1] + 1, and d[i-1][j-1] + temp.
After scanning, return the last value of the matrix, d[n][m].
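
To make the steps above concrete, here is a minimal Python sketch that builds the whole matrix and returns it so the intermediate values can be inspected. The function name `ld_matrix` is a hypothetical name for illustration. The early return for empty strings is not written separately here, because the initialized first row and column already give the same result.

```python
def ld_matrix(str1: str, str2: str) -> list:
    """Build the (n + 1) x (m + 1) distance matrix described in the steps above."""
    n, m = len(str1), len(str2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # first column counts up from 0
    for j in range(m + 1):
        d[0][j] = j  # first row counts up from 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # temp is 0 when the characters match, 1 otherwise
            temp = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # cell above
                d[i][j - 1] + 1,         # cell to the left
                d[i - 1][j - 1] + temp,  # cell diagonally above-left
            )
    return d


matrix = ld_matrix("test", "tent")
for row in matrix:
    print(row)
print("LD =", matrix[-1][-1])  # LD = 1
```

The distance is the bottom-right cell; every other cell holds the distance between the corresponding prefixes of the two strings.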
This returns the distance. How can we derive a similarity from it? The maximum possible distance is the length of the longer of the two strings, so the raw distance on its own is not a very sensitive measure. I therefore define the similarity as 1 - distance / (length of the longer string).
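
As a quick check of the formula, the following Python sketch computes the similarity for a few pairs. The names `levenshtein` and `similarity` are hypothetical. Two things here are assumptions and not part of the article: the distance is computed with a two-row variant of the matrix to save memory, and two empty strings are treated as fully similar to avoid dividing by zero.

```python
def levenshtein(str1: str, str2: str) -> int:
    """Edit distance keeping only two rows of the matrix (a space-saving variant)."""
    previous = list(range(len(str2) + 1))  # row 0: 0, 1, 2, ...
    for i, ch1 in enumerate(str1, start=1):
        current = [i]  # first column of row i
        for j, ch2 in enumerate(str2, start=1):
            temp = 0 if ch1 == ch2 else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + temp))
        previous = current
    return previous[-1]


def similarity(str1: str, str2: str) -> float:
    """1 - distance / (length of the longer string)."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        # Assumption: two empty strings count as identical.
        return 1.0
    return 1 - levenshtein(str1, str2) / longest


print(similarity("test", "test"))  # 1.0
print(similarity("test", "tent"))  # 0.75
```

A result of 1.0 means the strings are identical, and 0.0 means every position of the longer string has to be edited.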

Source code:

using System;

/**
 * Similarity of two strings based on edit distance
 */
public class Similarity
{

    // Returns the smallest of three integers.
    private int min(int one, int two, int three)
    {
        int min = one;
        if (two < min)
        {
            min = two;
        }
        if (three < min)
        {
            min = three;
        }
        return min;
    }

    // Returns the Levenshtein distance between str1 and str2.
    public int LD(string str1, string str2)
    {
        int[,] d; // matrix
        int n = str1.Length;
        int m = str2.Length;
        int i; // index into str1
        int j; // index into str2
        char ch1; // current character of str1
        char ch2; // current character of str2
        int temp; // cost added at a matrix position: 0 if the characters match, otherwise 1
        if (n == 0)
        {
            return m;
        }
        if (m == 0)
        {
            return n;
        }

        d = new int[n + 1, m + 1];

        for (i = 0; i <= n; i++)
        { // Initialize the first column
            d[i, 0] = i;
        }
        for (j = 0; j <= m; j++)
        { // Initialize the first row
            d[0, j] = j;
        }
        for (i = 1; i <= n; i++)
        { // Traverse str1
            ch1 = str1[i - 1];
            // Match against str2
            for (j = 1; j <= m; j++)
            {
                ch2 = str2[j - 1];
                if (ch1 == ch2)
                {
                    temp = 0;
                }
                else
                {
                    temp = 1;
                }
                // cell above + 1, cell to the left + 1, cell above-left + temp: take the minimum
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + temp);
            }
        }
        return d[n, m];
    }

    // Returns the similarity: 1 - distance / length of the longer string.
    public double SIM(string str1, string str2)
    {
        int ld = LD(str1, str2);
        return 1 - (double)ld / Math.Max(str1.Length, str2.Length);
    }

    public static void Main(string[] args)
    {
        Similarity s = new Similarity();
        string str1 = "ABC";
        string str2 = "BC";

        Console.WriteLine("LD = " + s.LD(str1, str2));
        Console.WriteLine("Sim = " + s.SIM(str1, str2));
        Console.ReadLine();
    }
}


This article is from the CSDN blog. If you reproduce it, please credit the source: http://blog.csdn.net/lkf0217/archive/2009/08/20/4466952.aspx

