Levenshtein distance: an algorithm for calculating the similarity of strings

Source: Internet
Author: User

Reposted.

The algorithm is not hard to understand, and it is practical.

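The core formula is the following, where d_{i,j} is the distance between the first i characters of string s and the first j characters of string t:

$$d_{i,0} = i, \qquad d_{0,j} = j, \qquad d_{i,j} = \min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + c_{i,j}\bigr), \qquad c_{i,j} = \begin{cases} 0 & \text{if } s_i = t_j \\ 1 & \text{otherwise} \end{cases} \tag{1}$$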

1. Introduction (from Baidu Encyclopedia)

Levenshtein distance, also known as edit distance, is the minimum number of edit operations needed to convert one string into another.

The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
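
Read literally, the definition above already describes a recursion: the last characters are either matched or replaced, or one of them is deleted or inserted. The sketch below is only an illustration of that reading, not the implementation used later in this article; the class name NaiveLevenshtein and its method names are made up for the example, and the running time is exponential, so it is only suitable for very short strings.

```java
// Hypothetical helper class for illustration; not part of the original article.
class NaiveLevenshtein {

    // Distance between the first i characters of s and the first j characters of t.
    static int distance(String s, int i, String t, int j) {
        if (i == 0) {
            return j; // s is exhausted: insert the j remaining characters of t
        }
        if (j == 0) {
            return i; // t is exhausted: delete the i remaining characters of s
        }
        // Replacing costs nothing when the two characters are already equal.
        int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
        int replace = distance(s, i - 1, t, j - 1) + cost;
        int delete = distance(s, i - 1, t, j) + 1;
        int insert = distance(s, i, t, j - 1) + 1;
        return Math.min(replace, Math.min(delete, insert));
    }

    static int distance(String s, String t) {
        return distance(s, s.length(), t, t.length());
    }

    public static void main(String[] args) {
        System.out.println(distance("abc", "abe")); // prints 1
    }
}
```

The table-based process described in section 3 computes the same values but stores each one once instead of recomputing it.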

The edit distance algorithm was first proposed by the Russian scientist Levenshtein, hence the name Levenshtein distance.

2. Uses

Fuzzy queries (approximate string matching)

3. Implementation process

A. Start with two strings; here we use the simple examples abc and abe.

B. Think of the strings as the table below. The first row and the first column are filled with 0, 1, 2, 3.

(A) is only a label to make the explanation easier; it is not part of the table's contents.

           a    b    c
      0    1    2    3
 a    1   (A)
 b    2
 e    3
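
The starting table can be sketched in a few lines of Java. This is a minimal illustration that assumes the rows stand for the characters of abe and the columns for the characters of abc, as in the table above; the class name InitialTable and the method init are hypothetical names chosen for this example.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the original article.
class InitialTable {

    // Builds a (rows+1) x (cols+1) table with only the first row and column filled in.
    static int[][] init(String rows, String cols) {
        int[][] dif = new int[rows.length() + 1][cols.length() + 1];
        for (int i = 0; i <= rows.length(); i++) {
            dif[i][0] = i; // first column: 0, 1, 2, ...
        }
        for (int j = 0; j <= cols.length(); j++) {
            dif[0][j] = j; // first row: 0, 1, 2, ...
        }
        return dif;
    }

    public static void main(String[] args) {
        for (int[] row : init("abe", "abc")) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The inner cells are left at 0 here only because Java zero-initializes arrays; they are placeholders until the following steps fill them in.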
C. Calculate the value of cell A.

Its value depends on three neighbors: the 1 to its left, the 1 above it, and the 0 in the upper-left corner.

According to the meaning of Levenshtein distance:

Both the value above and the value to the left need an extra 1, which gives 1+1=2 in each case.

At cell A the two characters are both a, so the upper-left value gets plus 0. This gives 0+0=0.

That leaves three candidates: 2 from the left, 2 from above, and 0 from the upper-left corner, so cell A takes the smallest of them, 0.

D. So the table becomes the following:

           a    b    c
      0    1    2    3
 a    1    0
 b    2   (B)
 e    3

Cell B also gets three candidates: 3 from the left (2+1) and 1 from above (0+1). The characters at cell B are a and b, which are not equal, so the upper-left value gets plus 1, giving 1+1=2. Cell B takes the smallest of (3, 1, 2), which is 1.

E. The table is updated:

           a    b    c
      0    1    2    3
 a    1    0
 b    2    1
 e    3   (C)

Cell C: the candidate from above is 2 (1+1) and the candidate from the left is 4 (3+1). For the upper-left corner, a and e are not the same, so add 1: 2+1=3.

Cell C takes the smallest of (2, 4, 3), which is 2.
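
The three cell calculations in steps C to E all follow the same rule, which can be written as a single function. This is a small sketch; the class name CellRule and the method cell are hypothetical names for this example, and the arguments are the raw neighbor values before the extra 1 is added.

```java
// Hypothetical helper class for illustration; not part of the original article.
class CellRule {

    // Value of one cell, given its three neighbors and the two characters it compares.
    static int cell(int left, int top, int upperLeft, char rowChar, char colChar) {
        int cost = (rowChar == colChar) ? 0 : 1;  // equal characters cost nothing
        int fromLeft = left + 1;                  // left and top always add 1
        int fromTop = top + 1;
        int fromUpperLeft = upperLeft + cost;     // the diagonal adds 0 or 1
        return Math.min(fromUpperLeft, Math.min(fromLeft, fromTop));
    }

    public static void main(String[] args) {
        System.out.println(cell(1, 1, 0, 'a', 'a')); // cell A: prints 0
        System.out.println(cell(2, 0, 1, 'b', 'a')); // cell B: prints 1
        System.out.println(cell(3, 1, 2, 'e', 'a')); // cell C: prints 2
    }
}
```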

F. Continue in the same way for the remaining cells:

           a      b      c
      0    1      2      3
 a    1    A=0    D=1    G=2
 b    2    B=1    E=0    H=1
 e    3    C=2    F=1    I=1

Cell I: converting abc into abe needs 1 edit operation. This is the value we set out to calculate.

At the same time, the other cells give some additional information:

A: a and a need 0 operations; the strings are identical.

B: ab and a need 1 operation.

C: abe and a need 2 operations.

D: a and ab need 1 operation.

E: ab and ab need 0 operations; the strings are identical.

F: abe and ab need 1 operation.

G: a and abc need 2 operations.

H: ab and abc need 1 operation.

I: abe and abc need 1 operation.
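
The list of cells A to I can be reproduced by filling the whole table and printing each cell next to the pair of prefixes it compares. The sketch below assumes rows for abe and columns for abc, as in the tables above; PrefixTable and fill are hypothetical names for this example.

```java
// Hypothetical helper class for illustration; not part of the original article.
class PrefixTable {

    // Fills the complete table; dif[i][j] compares the first i row characters
    // with the first j column characters.
    static int[][] fill(String rows, String cols) {
        int[][] dif = new int[rows.length() + 1][cols.length() + 1];
        for (int i = 0; i <= rows.length(); i++) {
            dif[i][0] = i;
        }
        for (int j = 0; j <= cols.length(); j++) {
            dif[0][j] = j;
        }
        for (int i = 1; i <= rows.length(); i++) {
            for (int j = 1; j <= cols.length(); j++) {
                int cost = (rows.charAt(i - 1) == cols.charAt(j - 1)) ? 0 : 1;
                dif[i][j] = Math.min(dif[i - 1][j - 1] + cost,
                        Math.min(dif[i][j - 1] + 1, dif[i - 1][j] + 1));
            }
        }
        return dif;
    }

    public static void main(String[] args) {
        String rows = "abe";
        String cols = "abc";
        int[][] dif = fill(rows, cols);
        for (int i = 1; i <= rows.length(); i++) {
            for (int j = 1; j <= cols.length(); j++) {
                System.out.println(rows.substring(0, i) + " and "
                        + cols.substring(0, j) + ": " + dif[i][j] + " operation(s)");
            }
        }
    }
}
```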

G. Calculation of similarity

First take maxLen, the larger of the two string lengths, then compute 1 - (number of operations / maxLen) to obtain the similarity.

For example, abc and abe need 1 operation and the length is 3, so the similarity is 1 - 1/3 ≈ 0.667.
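
As a quick check of this formula, here is a small sketch that returns the similarity as a number. The class name Similarity and its method names are hypothetical. Returning 1.0 for two empty strings is an assumption made for this example, since the formula would otherwise divide by zero.

```java
// Hypothetical helper class for illustration; not part of the original article.
class Similarity {

    // Standard table-based Levenshtein distance.
    static int distance(String s, String t) {
        int[][] dif = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) {
            dif[i][0] = i;
        }
        for (int j = 0; j <= t.length(); j++) {
            dif[0][j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                dif[i][j] = Math.min(dif[i - 1][j - 1] + cost,
                        Math.min(dif[i][j - 1] + 1, dif[i - 1][j] + 1));
            }
        }
        return dif[s.length()][t.length()];
    }

    // similarity = 1 - (number of operations / maxLen)
    static double similarity(String s, String t) {
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(s, t) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("abc", "abe"));
    }
}
```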

4. Code implementation

The code runs as-is; just copy and paste it.

The Java code is as follows:
package code;

/**
 * @className: MyLevenshtein.java
 * @classDescription: Levenshtein distance algorithm implementation.
 *     Can be used for: DNA analysis, spell checking, speech recognition, plagiarism detection
 * @author: donghai.wan
 * @createTime: 2012-1-12
 */
public class MyLevenshtein {

    public static void main(String[] args) {
        // The two strings to compare
        String str1 = "Today Thursday";
        String str2 = "Today is Friday";
        levenshtein(str1, str2);
    }

    /**
     * DNA analysis, spell checking, speech recognition, plagiarism detection
     *
     * @createTime 2012-1-12
     */
    public static void levenshtein(String str1, String str2) {
        // Lengths of the two strings
        int len1 = str1.length();
        int len2 = str2.length();
        // Create the table described above, one larger than each string length
        int[][] dif = new int[len1 + 1][len2 + 1];
        // Assign the initial values (step B)
        for (int a = 0; a <= len1; a++) {
            dif[a][0] = a;
        }
        for (int a = 0; a <= len2; a++) {
            dif[0][a] = a;
        }
        // For each cell, check whether the two characters are the same,
        // then combine the upper-left, left, and top values
        int temp;
        for (int i = 1; i <= len1; i++) {
            for (int j = 1; j <= len2; j++) {
                if (str1.charAt(i - 1) == str2.charAt(j - 1)) {
                    temp = 0;
                } else {
                    temp = 1;
                }
                // Take the smallest of the three values
                dif[i][j] = min(dif[i - 1][j - 1] + temp, dif[i][j - 1] + 1, dif[i - 1][j] + 1);
            }
        }
        System.out.println("String \"" + str1 + "\" vs \"" + str2 + "\" comparison");
        // The value in the lower-right corner is the result for the full strings;
        // the other positions hold the results for their prefixes
        System.out.println("diff step: " + dif[len1][len2]);
        // Calculate the similarity
        float similarity = 1 - (float) dif[len1][len2] / Math.max(str1.length(), str2.length());
        System.out.println("Similarity: " + similarity);
    }

    // Get the minimum value
    private static int min(int... is) {
        int min = Integer.MAX_VALUE;
        for (int i : is) {
            if (min > i) {
                min = i;
            }
        }
        return min;
    }
}

5. Why it works (a guess)

Why can the similarity be worked out this way?

First, consider consecutive equal characters.

In the original tables, red marks the order in which the values are taken.

1. "This Days Week One" and "Days Week One" (each word stands for one character)

               Days   Week   One
          0    1      2      3
 This     1    1      2      3
 Days     2    1      2      3
 Week     3    2      1      2
 One      4    3      2      1

The conversion only requires deleting the first character, "This", so it is done in one step.

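2. "You Listen Said To Put False The" and "Listen Said Horse On On To Put False The" (roughly, "I hear it's going to be a holiday"; each word stands for one character)

               You   Listen   Said   To   Put   False   The
          0    1     2        3      4    5     6       7
 Listen   1    1     1        2      3    4     5       6
 Said     2    2     2        1      2    3     4       5
 Horse    3    3     3        2      2    3     4       5
 On       4    4     4        3      3    3     4       5
 On       5    5     5        4      4    4     4       5
 To       6    6     6        5      4    5     5       5
 Put      7    7     7        6      5    4     5       6
 False    8    8     8        7      6    5     4       5
 The      9    9     9        8      7    6     5       4

To convert the first string into the second:

Remove "You" and add the three characters "Horse On On" ("immediately"), a total of four steps.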
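
The steps named in these two examples can be read back out of a finished table by walking from the lower-right corner to the upper-left corner and noting which neighbor each value came from. The sketch below is one possible way to do that, not something the original article implements. EditScript and script are hypothetical names, and the single letters in main are stand-ins for the characters of example 2 (y for "You", h, o, o for "Horse On On", and so on). When several minimal edit sequences exist, this walk reports only one of them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class for illustration; not part of the original article.
class EditScript {

    // Returns one minimal list of edits that turns 'from' into 'to'.
    static List<String> script(String from, String to) {
        int n = from.length();
        int m = to.length();
        int[][] dif = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            dif[i][0] = i;
        }
        for (int j = 0; j <= m; j++) {
            dif[0][j] = j;
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (from.charAt(i - 1) == to.charAt(j - 1)) ? 0 : 1;
                dif[i][j] = Math.min(dif[i - 1][j - 1] + cost,
                        Math.min(dif[i][j - 1] + 1, dif[i - 1][j] + 1));
            }
        }

        // Walk back from the lower-right corner, collecting edits in reverse order.
        List<String> ops = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0) {
                int cost = (from.charAt(i - 1) == to.charAt(j - 1)) ? 0 : 1;
                if (dif[i][j] == dif[i - 1][j - 1] + cost) {
                    if (cost == 1) {
                        ops.add("replace " + from.charAt(i - 1) + " with " + to.charAt(j - 1));
                    }
                    i--;
                    j--;
                    continue;
                }
            }
            if (i > 0 && dif[i][j] == dif[i - 1][j] + 1) {
                ops.add("delete " + from.charAt(i - 1));
                i--;
            } else {
                ops.add("insert " + to.charAt(j - 1));
                j--;
            }
        }
        Collections.reverse(ops);
        return ops;
    }

    public static void main(String[] args) {
        System.out.println(script("abc", "abe"));
        // Single-letter stand-ins for the characters of example 2.
        System.out.println(script("ylstpfe", "lshootpfe"));
    }
}
```

The number of edits returned always equals the value in the lower-right corner of the table, because every step of the walk that is not a match accounts for exactly one unit of distance.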
