Levenshtein distance algorithm for calculating the similarity of strings

Last Update:2017-08-30 Source: Internet

Author: User


Reposted

The algorithm is not difficult to understand, and it is very practical.

The core formula is the following:

(1)
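As a sketch, the recurrence that the walkthrough in section 3 follows can be written in LaTeX as below. This is the standard formulation, assumed here rather than copied from the article; the symbols d, s, t, i, and j are notation introduced for illustration.

```latex
d(i, 0) = i, \qquad d(0, j) = j, \qquad
d(i, j) = \min
\begin{cases}
  d(i-1, j) + 1 \\
  d(i, j-1) + 1 \\
  d(i-1, j-1) + [\, s_i \neq t_j \,]
\end{cases}
```

Here d(i, j) is the distance between the first i characters of s and the first j characters of t, and the bracket is 1 when the two characters differ and 0 when they are equal.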

1. Introduction (from Baidu Encyclopedia)

Levenshtein distance, also known as edit distance, is the minimum number of edit operations required to convert one string into another.

The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
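The three permitted operations can be illustrated on a concrete string. The sketch below is only an illustration; the class `EditOps` and its method names are hypothetical and not from the original article. Each call performs exactly one edit, so each result is at distance 1 from "abc".

```java
// Illustrative only: EditOps and its methods are hypothetical names.
class EditOps {
    // Replace the character at index with c
    static String replace(String s, int index, char c) {
        StringBuilder sb = new StringBuilder(s);
        sb.setCharAt(index, c);
        return sb.toString();
    }

    // Insert c so that it ends up at position index
    static String insert(String s, int index, char c) {
        return new StringBuilder(s).insert(index, c).toString();
    }

    // Delete the character at index
    static String delete(String s, int index) {
        return new StringBuilder(s).deleteCharAt(index).toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("abc", 2, 'e')); // abe
        System.out.println(insert("abc", 3, 'd'));  // abcd
        System.out.println(delete("abc", 1));       // ac
    }
}
```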

The edit distance was first proposed by the Russian scientist Levenshtein, which is why it is called the Levenshtein distance.

2. Uses

Fuzzy queries (finding strings that approximately match a search term)
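A minimal sketch of how a fuzzy query might use the distance: pick the candidate closest to the search term. The class `FuzzyLookup`, the method `closest`, and the word list are assumptions made for illustration; the `distance` method uses the same table-filling approach described in section 3.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: FuzzyLookup, closest, and the sample words are hypothetical.
class FuzzyLookup {
    // Levenshtein distance computed with the full table
    static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i][j - 1] + 1, d[i - 1][j] + 1));
            }
        }
        return d[s.length()][t.length()];
    }

    // Returns the candidate with the smallest distance to the query (the first one wins ties)
    static String closest(String query, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int dist = distance(query, candidate);
            if (dist < bestDistance) {
                bestDistance = dist;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("sitting", "kitten", "mitten");
        System.out.println(closest("kittin", words)); // kitten
    }
}
```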

3. Implementation process

A. Start with two strings; here we use the simple pair abc and abe.

B. Think of the strings as the table below.

"Cell A" is a marker added to make the explanation easier; it is not part of the table's contents.

	abc	a	b	c
abe	0	1	2	3
a	1	cell A
b	2
e	3

C. Calculate the value of cell A.

Its value depends on three neighbours: 1 to the left, 1 above, and 0 in the upper-left corner.

By the definition of Levenshtein distance:

The value above and the value to the left each have 1 added, which gives 1 + 1 = 2 in both cases.

At cell A the two characters are both a, so 0 is added to the upper-left value, which gives 0 + 0 = 0.

The three candidates are therefore 2 (from the left), 2 (from above), and 0 (from the upper left), so cell A takes the smallest of them, 0.

D. The table now becomes:

	abc	a	b	c
abe	0	1	2	3
a	1	0
b	2	cell B
e	3

Cell B also has three candidates. From the left: 2 + 1 = 3. From above: 0 + 1 = 1. The characters at cell B are b and a, which are not equal, so 1 is added to the upper-left value: 1 + 1 = 2. From (3, 1, 2), cell B takes the smallest value, 1.

E. The table is updated again:

	abc	a	b	c
abe	0	1	2	3
a	1	0
b	2	1
e	3	cell C

For cell C: from above, 1 + 1 = 2; from the left, 3 + 1 = 4; the characters e and a are not equal, so 1 is added to the upper-left value, 2 + 1 = 3.

From (2, 4, 3), cell C takes the smallest value, 2.
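The per-cell rule used for cells A, B, and C can be sketched as a single function. The class `CellStep` and the method `cell` are hypothetical names introduced for illustration; the arguments are the three neighbouring values and the two characters that meet at the cell.

```java
// Illustrative only: CellStep and cell are hypothetical names.
class CellStep {
    // Computes one table cell from its left, upper, and upper-left neighbours
    static int cell(int left, int above, int upperLeft, char rowChar, char colChar) {
        int fromLeft = left + 1;    // moving right always costs 1
        int fromAbove = above + 1;  // moving down always costs 1
        // Moving diagonally costs 0 when the characters match, otherwise 1
        int fromUpperLeft = upperLeft + (rowChar == colChar ? 0 : 1);
        return Math.min(fromUpperLeft, Math.min(fromLeft, fromAbove));
    }

    public static void main(String[] args) {
        System.out.println(cell(1, 1, 0, 'a', 'a')); // cell A: 0
        System.out.println(cell(2, 0, 1, 'b', 'a')); // cell B: 1
        System.out.println(cell(3, 1, 2, 'e', 'a')); // cell C: 2
    }
}
```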

F. Continue in the same way for the remaining cells:

		a	b	c
	0	1	2	3
a	1	A: 0	D: 1	G: 2
b	2	B: 1	E: 0	H: 1
e	3	C: 2	F: 1	I: 1

Cell I: abc and abe need 1 edit operation. This is the value we set out to calculate.

The table also yields some additional information:

A: a and a need 0 operations; the strings are identical.

B: ab and a need 1 operation.

C: abe and a need 2 operations.

D: a and ab need 1 operation.

E: ab and ab need 0 operations; the strings are identical.

F: abe and ab need 1 operation.

G: a and abc need 2 operations.

H: ab and abc need 1 operation.

I: abe and abc need 1 operation.
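The reading of each cell as the distance between two prefixes can be checked with a short sketch. The class `PrefixTable` and the method `fill` are hypothetical names; rows follow abe and columns follow abc, as in the table above.

```java
// Illustrative only: PrefixTable and fill are hypothetical names.
class PrefixTable {
    // Fills the whole table; rows follow the first string, columns follow the second
    static int[][] fill(String rows, String cols) {
        int[][] d = new int[rows.length() + 1][cols.length() + 1];
        for (int i = 0; i <= rows.length(); i++) d[i][0] = i;
        for (int j = 0; j <= cols.length(); j++) d[0][j] = j;
        for (int i = 1; i <= rows.length(); i++) {
            for (int j = 1; j <= cols.length(); j++) {
                int cost = rows.charAt(i - 1) == cols.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i][j - 1] + 1, d[i - 1][j] + 1));
            }
        }
        return d;
    }

    public static void main(String[] args) {
        String rows = "abe";
        String cols = "abc";
        int[][] d = fill(rows, cols);
        // Cell (i, j) is the distance between the first i characters of rows
        // and the first j characters of cols
        for (int j = 1; j <= cols.length(); j++) {
            for (int i = 1; i <= rows.length(); i++) {
                System.out.println(rows.substring(0, i) + " and "
                        + cols.substring(0, j) + ": " + d[i][j]);
            }
        }
    }
}
```

The loop visits the cells column by column, so the lines are printed in the order A to I used above.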

G. Calculating the similarity

First take maxLen, the larger of the two string lengths; the similarity is then 1 - (number of operations / maxLen).

For example, abc and abe need 1 operation and the maximum length is 3, so the similarity is 1 - 1/3 ≈ 0.667.
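The similarity formula can be sketched as a function that returns a value instead of printing it. The class `Similarity` and its method names are hypothetical; treating two empty strings as identical (similarity 1.0) is an assumption made here to avoid dividing by zero, since the article does not cover that case.

```java
// Illustrative only: Similarity and its methods are hypothetical names.
class Similarity {
    // Levenshtein distance computed with the full table
    static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i][j - 1] + 1, d[i - 1][j] + 1));
            }
        }
        return d[s.length()][t.length()];
    }

    // similarity = 1 - distance / maxLen
    static double similarity(String s, String t) {
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(s, t) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("abc", "abe")); // about 0.667
        System.out.println(similarity("abc", "abc")); // 1.0
    }
}
```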

4. Code implementation

The code can be copied and run as-is.

The Java code is as follows:

package code;

/**
 * @className: MyLevenshtein.java
 * @classDescription: Levenshtein distance algorithm implementation
 * Can be used for: DNA analysis, spell checking, speech recognition, plagiarism detection
 * @author: donghai.wan
 * @createTime: 2012-1-12
 */
public class MyLevenshtein {

    public static void main(String[] args) {
        // The two strings to compare
        String str1 = "Today Thursday";
        String str2 = "Today is Friday";
        levenshtein(str1, str2);
    }

    /**
     * DNA analysis, spell checking, speech recognition, plagiarism detection
     *
     * @createTime 2012-1-12
     */
    public static void levenshtein(String str1, String str2) {
        // Lengths of the two strings
        int len1 = str1.length();
        int len2 = str2.length();
        // Create the table described above, one larger than each string length
        int[][] dif = new int[len1 + 1][len2 + 1];
        // Assign the initial values (step B)
        for (int a = 0; a <= len1; a++) {
            dif[a][0] = a;
        }
        for (int a = 0; a <= len2; a++) {
            dif[0][a] = a;
        }
        // Check whether the two characters are the same, then fill in the cell
        int temp;
        for (int i = 1; i <= len1; i++) {
            for (int j = 1; j <= len2; j++) {
                if (str1.charAt(i - 1) == str2.charAt(j - 1)) {
                    temp = 0;
                } else {
                    temp = 1;
                }
                // Take the smallest of the three values
                dif[i][j] = min(dif[i - 1][j - 1] + temp, dif[i][j - 1] + 1, dif[i - 1][j] + 1);
            }
        }
        System.out.println("String \"" + str1 + "\" vs \"" + str2 + "\" comparison");
        // The value in the lower-right corner is the distance between the full strings;
        // the other cells hold the distances between their prefixes
        System.out.println("diff step:" + dif[len1][len2]);
        // Calculate the similarity
        float similarity = 1 - (float) dif[len1][len2] / Math.max(str1.length(), str2.length());
        System.out.println("Similarity:" + similarity);
    }

    // Get the minimum value
    private static int min(int... is) {
        int min = Integer.MAX_VALUE;
        for (int i : is) {
            if (min > i) {
                min = i;
            }
        }
        return min;
    }
}

5. A guess at the principle

Why does this calculation give us the similarity?

First, in consecutive equal characters, you can consider the

Red marks the order in which the values are taken.

1. "This Days Week One" versus "Days Week One" (each word stands for one character)

		Days	Week	One
	0	1	2	3
This	1	1	2	3
Days	2	1	2	3
Week	3	2	1	2
One	4	3	2	1

The edit is to remove the leading character ("This"), which takes one step.

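2. "I hear it's going to be a holiday": "You Listen Said To Put False The" versus "Listen Said Horse On On To Put False The" (each word stands for one character)

		You	Listen	Said	To	Put	False	The
	0	1	2	3	4	5	6	7
Listen	1	1	1	2	3	4	5	6
Said	2	2	2	1	2	3	4	5
Horse	3	3	3	2	2	3	4	5
On	4	4	4	3	3	3	4	5
On	5	5	5	4	4	4	4	5
To	6	6	6	5	4	5	5	5
Put	7	7	7	6	5	4	5	6
False	8	8	8	7	6	5	4	5
The	9	9	9	8	7	6	5	4

The edits between the two strings are: remove "You" and add the three characters "Horse On On" ("immediately"), four steps in total.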
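The list of edits can also be read back out of a filled table by walking from the lower-right corner to the upper-left corner. The sketch below is one possible way to do this and is not part of the original article's code; the class `EditScript` and the method `edits` are hypothetical names, and each array element stands for one character, as in the two tables above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: EditScript and edits are hypothetical names.
class EditScript {
    // Lists the edits that turn `from` into `to`, one symbol per array element
    static List<String> edits(String[] from, String[] to) {
        int n = from.length;
        int m = to.length;
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = from[i - 1].equals(to[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        // Walk back from the lower-right corner, each time moving to a
        // neighbour that accounts for the current cell's value
        List<String> ops = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && from[i - 1].equals(to[j - 1])
                    && d[i][j] == d[i - 1][j - 1]) {
                i--; // equal symbols: no edit needed
                j--;
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.add("replace " + from[i - 1] + " with " + to[j - 1]);
                i--;
                j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.add("remove " + from[i - 1]);
                i--;
            } else {
                ops.add("insert " + to[j - 1]);
                j--;
            }
        }
        Collections.reverse(ops); // report the edits from left to right
        return ops;
    }

    public static void main(String[] args) {
        String[] from = {"You", "Listen", "Said", "To", "Put", "False", "The"};
        String[] to = {"Listen", "Said", "Horse", "On", "On", "To", "Put", "False", "The"};
        // prints [remove You, insert Horse, insert On, insert On]
        System.out.println(edits(from, to));
    }
}
```

The number of edits returned always equals the value in the lower-right corner, because every step of the walk either keeps the value (equal symbols) or lowers it by exactly 1.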


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
