Reposted.
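The algorithm is not difficult to understand, and it is practical.
The core formula is the following:
dif[i][j] = min(dif[i-1][j] + 1, dif[i][j-1] + 1, dif[i-1][j-1] + temp)    (1)
where temp is 0 if the i-th character of the first string equals the j-th character of the second string, and 1 otherwise.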
1. Introduction (from Baidu Encyclopedia):
Levenshtein distance, also known as edit distance, is the minimum number of edit operations required to convert one string into another.
The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
The edit distance algorithm was first proposed by the Russian scientist Vladimir Levenshtein, hence the name Levenshtein distance.
2. Uses
Fuzzy queries (approximate string matching).
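As a rough illustration of a fuzzy query, the sketch below picks the candidate with the smallest edit distance to a query string. The class name FuzzyLookup, the method names, and the sample words are all hypothetical and chosen only for this example; the distance method uses the same table-filling procedure that section 3 walks through.

```java
import java.util.Arrays;
import java.util.List;

class FuzzyLookup {
    // Levenshtein distance via the standard dynamic-programming table.
    static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[s.length()][t.length()];
    }

    // Returns the candidate closest to the query (the first one wins ties),
    // or null if the candidate list is empty.
    static String closest(String query, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int dist = distance(query, candidate);
            if (dist < bestDistance) {
                bestDistance = dist;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("xyz", "abc", "abcdef");
        System.out.println(closest("abe", words)); // prints abc
    }
}
```

A real fuzzy query would usually also apply a distance threshold so that very different strings are not returned as matches; that threshold is left out here to keep the sketch short.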
3. Implementation process
A. Start with two strings; here we use the simple strings abc and abe. Think of them as laid out in the table below.
"Place A" is only a marker to make the explanation easier; it is not part of the table's contents.
B. Assign the initial values: the first row and the first column are filled with 0, 1, 2, 3.
|     | abc | a       | b | c |
| abe | 0   | 1       | 2 | 3 |
| a   | 1   | place A |   |   |
| b   | 2   |         |   |   |
| e   | 3   |         |   |   |
C. Calculate the value at place A.
Its value depends on three neighbors: the 1 on the left, the 1 above, and the 0 in the upper-left corner.
According to the definition of Levenshtein distance:
The value above and the value on the left each require an additional 1, which gives 1+1=2 for both.
At place A the two characters are both a, so 0 is added to the upper-left value. This gives 0+0=0.
That leaves three candidates: 2 from the left, 2 from above, and 0 from the upper-left corner. Place A takes the smallest of them, 0.
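The rule for a single cell can be sketched in Java as follows. This is only a minimal illustration: the class name CellStep and the method cell are hypothetical names introduced here, not part of the original code.

```java
class CellStep {
    // Computes one table cell from its three neighbours and the two
    // characters that meet at that cell.
    static int cell(int left, int above, int upperLeft, char rowChar, char colChar) {
        int cost = (rowChar == colChar) ? 0 : 1; // 0 when the characters match
        int fromLeft = left + 1;                 // always costs one extra step
        int fromAbove = above + 1;               // always costs one extra step
        int fromUpperLeft = upperLeft + cost;    // free when the characters match
        return Math.min(fromUpperLeft, Math.min(fromLeft, fromAbove));
    }

    public static void main(String[] args) {
        // Place A: left = 1, above = 1, upper left = 0, characters a and a.
        System.out.println(cell(1, 1, 0, 'a', 'a')); // prints 0
    }
}
```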
D. The table then becomes the following:
|     | abc | a       | b | c |
| abe | 0   | 1       | 2 | 3 |
| a   | 1   | 0       |   |   |
| b   | 2   | place B |   |   |
| e   | 3   |         |   |   |
Place B also gets three candidate values: 2+1=3 from the left and 0+1=1 from above. The characters that meet at place B are b and a, which are not equal, so 1 is added to the upper-left value, giving 1+1=2. Place B takes the smallest of (3, 1, 2), which is 1.
E. The table is updated again:
|     | abc | a       | b | c |
| abe | 0   | 1       | 2 | 3 |
| a   | 1   | 0       |   |   |
| b   | 2   | 1       |   |   |
| e   | 3   | place C |   |   |
Place C: the value from above is 1+1=2 and the value from the left is 3+1=4. For the upper-left corner, a and e are not the same, so 1 is added, giving 2+1=3.
Place C takes the smallest of (2, 4, 3), which is 2.
F. Continue in the same way for the remaining cells:
|   |   | a    | b    | c    |
|   | 0 | 1    | 2    | 3    |
| a | 1 | A: 0 | D: 1 | G: 2 |
| b | 2 | B: 1 | E: 0 | H: 1 |
| e | 3 | C: 2 | F: 1 | I: 1 |
Place I: converting abc into abe requires 1 edit operation. This is the value we set out to calculate.
Along the way, some additional information is obtained:
A: a and a require 0 operations (the strings are identical).
B: ab and a require 1 operation.
C: abe and a require 2 operations.
D: a and ab require 1 operation.
E: ab and ab require 0 operations (the strings are identical).
F: abe and ab require 1 operation.
G: a and abc require 2 operations.
H: ab and abc require 1 operation.
I: abe and abc require 1 operation.
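To see these prefix-to-prefix distances come out of the table, here is a small sketch that fills the whole table for abe and abc and prints each cell next to the pair of prefixes it describes. PrefixTable and table are hypothetical names used only for this illustration.

```java
class PrefixTable {
    // Fills the whole table. Cell [i][j] is the distance between the first i
    // characters of rowStr and the first j characters of colStr.
    static int[][] table(String rowStr, String colStr) {
        int[][] d = new int[rowStr.length() + 1][colStr.length() + 1];
        for (int i = 0; i <= rowStr.length(); i++) d[i][0] = i;
        for (int j = 0; j <= colStr.length(); j++) d[0][j] = j;
        for (int i = 1; i <= rowStr.length(); i++) {
            for (int j = 1; j <= colStr.length(); j++) {
                int cost = rowStr.charAt(i - 1) == colStr.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d;
    }

    public static void main(String[] args) {
        String rowStr = "abe";
        String colStr = "abc";
        int[][] d = table(rowStr, colStr);
        // Print every cell together with the two prefixes it compares.
        for (int i = 1; i <= rowStr.length(); i++) {
            for (int j = 1; j <= colStr.length(); j++) {
                System.out.println(rowStr.substring(0, i) + " vs "
                        + colStr.substring(0, j) + ": " + d[i][j]);
            }
        }
    }
}
```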
G. Calculating similarity
Take the length of the longer of the two strings, maxLen, and compute 1 - (number of operations / maxLen) to get the similarity.
For example, abc and abe need 1 operation and have a length of 3, so the similarity is 1 - 1/3 ≈ 0.667.
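The similarity formula can be sketched as a small helper. The names Similarity, distance, and similarity are hypothetical, and returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero; the text above does not cover that case.

```java
class Similarity {
    // Levenshtein distance via the table-filling procedure described above.
    static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[s.length()][t.length()];
    }

    // similarity = 1 - (number of operations / maxLen)
    static double similarity(String s, String t) {
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(s, t) / maxLen;
    }

    public static void main(String[] args) {
        // abc vs abe: 1 operation, maxLen 3, so roughly 0.667
        System.out.println(similarity("abc", "abe"));
    }
}
```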
4. Code implementation
The code can be run directly; just copy and paste it.
The Java code is as follows:
package code;

/**
 * @className: MyLevenshtein.java
 * @classDescription: Levenshtein distance algorithm implementation.
 *   Can be used for: DNA analysis, spell checking, speech recognition, plagiarism detection
 * @author: donghai.wan
 * @createTime: 2012-1-12
 */
public class MyLevenshtein {

    public static void main(String[] args) {
        // the two strings to compare
        String str1 = "Today Thursday";
        String str2 = "Today is Friday";
        levenshtein(str1, str2);
    }

    /**
     * DNA analysis, spell checking, speech recognition, plagiarism detection
     *
     * @createTime 2012-1-12
     */
    public static void levenshtein(String str1, String str2) {
        // lengths of the two strings
        int len1 = str1.length();
        int len2 = str2.length();
        // create the table described above, one cell larger than each string length
        int[][] dif = new int[len1 + 1][len2 + 1];
        // assign the initial values (step B)
        for (int a = 0; a <= len1; a++) {
            dif[a][0] = a;
        }
        for (int a = 0; a <= len2; a++) {
            dif[0][a] = a;
        }
        // check whether the two characters are the same; temp is the cost
        // added to the upper-left value
        int temp;
        for (int i = 1; i <= len1; i++) {
            for (int j = 1; j <= len2; j++) {
                if (str1.charAt(i - 1) == str2.charAt(j - 1)) {
                    temp = 0;
                } else {
                    temp = 1;
                }
                // take the smallest of the three values
                dif[i][j] = min(dif[i - 1][j - 1] + temp, dif[i][j - 1] + 1,
                        dif[i - 1][j] + 1);
            }
        }
        System.out.println("String \"" + str1 + "\" vs \"" + str2 + "\" comparison");
        // the lower-right cell holds the result; the other cells hold the
        // distances between the corresponding prefixes
        System.out.println("diff step: " + dif[len1][len2]);
        // calculate the similarity
        float similarity = 1 - (float) dif[len1][len2]
                / Math.max(str1.length(), str2.length());
        System.out.println("Similarity: " + similarity);
    }

    // get the minimum value
    private static int min(int... is) {
        int min = Integer.MAX_VALUE;
        for (int i : is) {
            if (min > i) {
                min = i;
            }
        }
        return min;
    }
}
5. A guess at the principle
Why does this procedure let us work out the similarity?
First, in consecutive equal characters, you can consider the
In the original tables, red marks the order in which the values are taken.
1. "This Days Week One" and "Days Week One" (each word stands for one character of the original strings)
|      |   | Days | Week | One |
|      | 0 | 1    | 2    | 3   |
| This | 1 | 1    | 2    | 3   |
| Days | 2 | 1    | 2    | 3   |
| Week | 3 | 2    | 1    | 2   |
| One  | 4 | 3    | 2    | 1   |
The transformation is simply to remove "This", which takes one step.
2. "Listen Said Horse On On To Put False The" and "You Listen Said To Put False The" (roughly, "I hear it's going to be a holiday")
|        |   | You | Listen | Said | To | Put | False | The |
|        | 0 | 1   | 2      | 3    | 4  | 5   | 6     | 7   |
| Listen | 1 | 1   | 1      | 2    | 3  | 4   | 5     | 6   |
| Said   | 2 | 2   | 2      | 1    | 2  | 3   | 4     | 5   |
| Horse  | 3 | 3   | 3      | 2    | 2  | 3   | 4     | 5   |
| On     | 4 | 4   | 4      | 3    | 3  | 3   | 4     | 5   |
| On     | 5 | 5   | 5      | 4    | 4  | 4   | 4     | 5   |
| To     | 6 | 6   | 6      | 5    | 4  | 5   | 5     | 5   |
| Put    | 7 | 7   | 7      | 6    | 5  | 4   | 5     | 6   |
| False  | 8 | 8   | 8      | 7    | 6  | 5   | 4     | 5   |
| The    | 9 | 9   | 9      | 8    | 7  | 6   | 5     | 4   |
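The two strings are "You Listen Said To Put False The" and "Listen Said Horse On On To Put False The".
Remove "You" and add the three characters "Horse On On" (meaning "immediately"): four steps in total.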
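The two examples can be checked with a short sketch that runs the same recurrence over arrays of words instead of characters, with each word standing in for one character of the original strings. The class name TokenDistance and its distance method are hypothetical, introduced only for this check.

```java
class TokenDistance {
    // Same recurrence as for characters, applied to arrays of tokens.
    static int distance(String[] rows, String[] cols) {
        int[][] d = new int[rows.length + 1][cols.length + 1];
        for (int i = 0; i <= rows.length; i++) d[i][0] = i;
        for (int j = 0; j <= cols.length; j++) d[0][j] = j;
        for (int i = 1; i <= rows.length; i++) {
            for (int j = 1; j <= cols.length; j++) {
                int cost = rows[i - 1].equals(cols[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[rows.length][cols.length];
    }

    public static void main(String[] args) {
        String[] rows1 = {"This", "Days", "Week", "One"};
        String[] cols1 = {"Days", "Week", "One"};
        System.out.println(distance(rows1, cols1)); // prints 1

        String[] rows2 = {"Listen", "Said", "Horse", "On", "On", "To", "Put", "False", "The"};
        String[] cols2 = {"You", "Listen", "Said", "To", "Put", "False", "The"};
        System.out.println(distance(rows2, cols2)); // prints 4
    }
}
```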
Levenshtein distance algorithm for calculating the similarity of strings