String similarity algorithms: the Levenshtein distance algorithm

Source: Internet
Author: User

The Levenshtein distance, also called the edit distance, is the minimum number of edit operations required to convert one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.

Step-by-step illustration of the algorithm:

A. Start with two strings; here we use the simple pair abc and abe.

B. Picture the strings laid out as the table below, with abc across the top and abe down the side.

"A" is only a marker for the cell being discussed, used to make the explanation easier; it is not part of the table's contents.

        a   b   c
    0   1   2   3
a   1   A
b   2
e   3

C. Calculate the value at A.

Its value depends on three neighbours: the 1 to its left, the 1 above it, and the 0 in its upper-left corner.

According to the definition of Levenshtein distance:

The value above plus 1 gives 1+1=2.

The value on the left plus 1 gives 1+1=2.

The value in the upper-left corner is increased by 0 if the two corresponding characters are the same, or by 1 if they differ. At A both characters are a, so the upper-left value plus 0 gives 0+0=0.

Then take the minimum of the three values just calculated (2, 2, 0), so the value at A is 0.
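The rule for a single cell can be sketched as a small C# helper. This is only an illustration of the three-way minimum described above; the class name `CellRule` and the method `Cell` are hypothetical names chosen for this example, not part of the implementation given later.

```csharp
using System;

public static class CellRule
{
    // Hypothetical helper: computes one table cell from its three neighbours.
    public static int Cell(int above, int left, int upperLeft, char rowChar, char colChar)
    {
        // The upper-left neighbour costs 0 if the characters match, otherwise 1.
        int cost = rowChar == colChar ? 0 : 1;
        int best = Math.Min(above + 1, left + 1);
        return Math.Min(best, upperLeft + cost);
    }

    public static void Main()
    {
        // Cell A from the text: above = 1, left = 1, upper-left = 0, characters a and a.
        Console.WriteLine(Cell(1, 1, 0, 'a', 'a'));   // prints 0
    }
}
```

The candidates for cell A are 1+1, 1+1 and 0+0, so the helper returns 0, matching the hand calculation.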

D. The table now becomes:

        a   b   c
    0   1   2   3
a   1   0
b   2   B
e   3

Three values are calculated at B as well. From the left: 2+1=3. From above: 0+1=1. For the upper-left corner, the corresponding characters at B are a and b, which are not equal, so 1 is added to the upper-left value, giving 1+1=2. The minimum of (3, 1, 2) is 1, so the value at B is 1.

E. The table is updated again:

        a   b   c
    0   1   2   3
a   1   0
b   2   1
e   3   C

At C: the value from above is 1+1=2, the value from the left is 3+1=4, and for the upper-left corner a and e are not the same, so 1 is added, giving 2+1=3.

The minimum of (2, 4, 3) is 2, so the value at C is 2.

F. Continuing in the same way for the remaining cells:

         a      b      c
    0    1      2      3
a   1    A: 0   D: 1   G: 2
b   2    B: 1   E: 0   H: 1
e   3    C: 2   F: 1   I: 1

I: abc and abe need 1 edit operation (c is replaced by e). This is the value we set out to calculate.

At the same time, the table gives some additional information:

A: a and a need 0 operations (the strings are identical).

B: ab and a need 1 operation.

C: abe and a need 2 operations.

D: a and ab need 1 operation.

E: ab and ab need 0 operations (the strings are identical).

F: abe and ab need 1 operation.

G: a and abc need 2 operations.

H: ab and abc need 1 operation.

I: abe and abc need 1 operation.
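The whole table can be reproduced with a short program. The sketch below assumes the same layout as the walkthrough (abe down the side, abc across the top); `TableDemo` and `BuildTable` are hypothetical names used only for this illustration.

```csharp
using System;

public static class TableDemo
{
    // Builds the full table; rows follow `rows`, columns follow `cols`.
    public static int[,] BuildTable(string rows, string cols)
    {
        int[,] t = new int[rows.Length + 1, cols.Length + 1];
        // First column and first row: distance from a prefix to the empty string.
        for (int i = 0; i <= rows.Length; i++) t[i, 0] = i;
        for (int j = 0; j <= cols.Length; j++) t[0, j] = j;
        for (int i = 1; i <= rows.Length; i++)
        {
            for (int j = 1; j <= cols.Length; j++)
            {
                int cost = rows[i - 1] == cols[j - 1] ? 0 : 1;
                int best = Math.Min(t[i - 1, j] + 1, t[i, j - 1] + 1);
                t[i, j] = Math.Min(best, t[i - 1, j - 1] + cost);
            }
        }
        return t;
    }

    public static void Main()
    {
        int[,] t = BuildTable("abe", "abc");
        for (int i = 1; i <= 3; i++)
        {
            // Row i compares the first i characters of "abe" with each prefix of "abc".
            Console.WriteLine($"{t[i, 1]} {t[i, 2]} {t[i, 3]}");
        }
        // prints:
        // 0 1 2
        // 1 0 1
        // 2 1 1
    }
}
```

Each printed row corresponds to one row of cells in the table (A D G, then B E H, then C F I), so every cell is the edit distance between a prefix of abe and a prefix of abc.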

G. Calculating the similarity

First take maxLen, the larger of the two string lengths, then compute 1 - (number of operations / maxLen) to obtain the similarity.

For example, abc and abe need 1 operation and the maximum length is 3, so the similarity is 1 - 1/3 ≈ 0.667.
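As a minimal sketch, the formula can be written as a helper that takes an already computed edit distance. `SimilarityDemo` and `Similarity` are hypothetical names, and returning 1 for two empty strings is an assumption of this example that the text above does not cover.

```csharp
using System;

public static class SimilarityDemo
{
    // Hypothetical helper: turns an already computed edit distance into a similarity.
    public static float Similarity(int distance, int len1, int len2)
    {
        int maxLen = Math.Max(len1, len2);
        // Assumption: two empty strings count as identical (this also avoids 0/0).
        if (maxLen == 0) return 1f;
        return 1f - (float)distance / maxLen;
    }

    public static void Main()
    {
        // abc vs abe: distance 1, both lengths 3, so the result is about 0.667.
        Console.WriteLine(Similarity(1, 3, 3));
    }
}
```

The cast to float matters here: without it the division would be integer division and the result would be truncated.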

That is the whole derivation of the algorithm, although I still do not fully understand why it can be used to measure similarity. I also found a real pitfall: sometimes the result of the algorithm differs from what we would intuitively expect.

For example, for the strings abcd and def the algorithm gives a similarity of 0, yet the two obviously share the character d, so there should be at least some similarity, however low, rather than 0. The strings abcd and aert also share only one character, a, but the algorithm gives them a similarity of 0.25.

Going back to the definition at the start, which allows the three operations of substitution, deletion, and insertion, converting abcd to aert takes at least 3 steps, whereas converting abcd to def takes at least 4 steps. The maximum length of the two strings is 4 and the conversion needs at least 4 steps, so a similarity of 0 can be justified. Even so, on the surface there is clearly a shared character d, and a similarity of 0 strikes me as odd, so I think the algorithm should be used with caution.
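The two distances quoted above can be checked with a short, self-contained sketch. `PitfallDemo` and `Distance` are hypothetical names for this illustration; the method returns the raw edit distance rather than the similarity.

```csharp
using System;

public static class PitfallDemo
{
    // Plain edit distance using the same table-filling rule as the walkthrough.
    public static int Distance(string s, string t)
    {
        int[,] d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                int best = Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1);
                d[i, j] = Math.Min(best, d[i - 1, j - 1] + cost);
            }
        }
        return d[s.Length, t.Length];
    }

    public static void Main()
    {
        Console.WriteLine(Distance("abcd", "def"));    // prints 4
        Console.WriteLine(Distance("abcd", "aert"));   // prints 3
    }
}
```

With a maximum length of 4, these distances give similarities of 1 - 4/4 = 0 and 1 - 3/4 = 0.25. The shared d does not help in the first pair because keeping it would mean deleting a, b, c and inserting e, f (5 operations), which costs more than 3 substitutions plus 1 deletion. In the second pair the shared a sits at the same position in both strings, so it can be kept for free.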

Here is my C# implementation:

using UnityEngine;
using System.Collections;
using System;

public class EditorDistance
{
    /// <summary>
    /// Compares two strings and returns their similarity ratio.
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns></returns>
    public static float Levenshtein(string str1, string str2)
    {
        char[] char1 = str1.ToCharArray();
        char[] char2 = str2.ToCharArray();
        // Lengths of the two strings.
        int len1 = char1.Length;
        int len2 = char2.Length;
        // Two empty strings are identical; this also avoids dividing 0 by 0 below.
        if (len1 == 0 && len2 == 0)
        {
            return 1f;
        }
        // Build a two-dimensional array, one larger than each string length.
        int[,] dif = new int[len1 + 1, len2 + 1];
        // Assign the initial values of the first column and first row.
        for (int a = 0; a <= len1; a++)
        {
            dif[a, 0] = a;
        }
        for (int a = 0; a <= len2; a++)
        {
            dif[0, a] = a;
        }
        // temp is the cost added to the upper-left value:
        // 0 if the two characters are the same, otherwise 1.
        int temp;
        for (int i = 1; i <= len1; i++)
        {
            for (int j = 1; j <= len2; j++)
            {
                if (char1[i - 1] == char2[j - 1])
                {
                    temp = 0;
                }
                else
                {
                    temp = 1;
                }
                // Take the smallest of the three values.
                dif[i, j] = Min(dif[i - 1, j - 1] + temp, dif[i, j - 1] + 1, dif[i - 1, j] + 1);
            }
        }
        // Calculate the similarity.
        float similarity = 1 - (float)dif[len1, len2] / Math.Max(len1, len2);
        return similarity;
    }

    /// <summary>
    /// Finds the minimum value.
    /// </summary>
    /// <param name="nums"></param>
    /// <returns></returns>
    private static int Min(params int[] nums)
    {
        int min = int.MaxValue;
        foreach (int item in nums)
        {
            if (min > item)
            {
                min = item;
            }
        }
        return min;
    }
}
