String similarity algorithm (edit distance, also known as Levenshtein distance)


While working on CAPTCHA recognition I needed to compare the similarity of character codes, which led me to the edit distance algorithm. This post records its principle and a C# implementation.

According to Baidu Encyclopedia:

The edit distance, also known as the Levenshtein distance, between two strings is the minimum number of editing operations required to turn one string into the other. The larger the distance, the more the two strings differ. The permitted operations are replacing one character with another, inserting a character, and deleting a character.

For example, turning the word "kitten" into "sitting" takes three operations:

sitten (k→s)

sittin (e→i)

sitting (→g)

The Russian scientist Vladimir Levenshtein introduced the concept in 1965, which is why it is also called the Levenshtein distance.

For example:

    • If str1 = "ivan" and str2 = "ivan", the distance is 0 because no conversion is needed. Similarity = 1 - 0 / Math.Max(str1.Length, str2.Length) = 1
    • If str1 = "ivan1" and str2 = "ivan2", the distance is 1: the "1" in str1 is replaced by "2", a single character change. Similarity = 1 - 1 / Math.Max(str1.Length, str2.Length) = 0.8
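
The similarity formula used in the two examples can be written as a small helper. This is only a sketch: the class and method names are hypothetical, the distance is passed in rather than computed, and treating two empty strings as fully similar is an assumption made here to avoid dividing by zero.

```csharp
using System;

public static class SimilarityExample
{
    // Hypothetical helper: converts an already computed edit distance
    // into a similarity between 0 and 1.
    public static decimal Similarity(int distance, string str1, string str2)
    {
        int maxLength = Math.Max(str1.Length, str2.Length);

        // Assumption: two empty strings count as identical,
        // which also avoids a division by zero.
        if (maxLength == 0)
        {
            return 1m;
        }

        return 1m - (decimal)distance / maxLength;
    }

    public static void Main()
    {
        // Distances taken from the two examples above.
        Console.WriteLine(Similarity(0, "ivan", "ivan"));
        Console.WriteLine(Similarity(1, "ivan1", "ivan2"));
    }
}
```

Using decimal rather than int keeps the division from truncating to zero.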

Applications

DNA analysis

Spell check

Speech recognition

Plagiarism detection

Thanks to the commenter Big Stone for a good link on applying this method, added here:

Small-scale approximate string search, similar to a search engine that shows a list of near matches for the keywords entered. Linked article: "[Algorithm] Approximate string search"

Algorithm process

    1. If the length of str1 or str2 is 0, return the length of the other string: if (str1.Length == 0) return str2.Length; if (str2.Length == 0) return str1.Length;
    2. Initialize an (n+1) * (m+1) matrix d, where n and m are the lengths of the two strings, and let the values in the first row and the first column count up from 0.
    3. Scan the two strings (n * m steps). If str1[i] == str2[j], set temp to 0; otherwise set temp to 1. Then assign d[i,j] the minimum of d[i-1,j] + 1, d[i,j-1] + 1 and d[i-1,j-1] + temp.
    4. After the scan, return the last value in the matrix, d[n,m]. This is the distance.

Similarity formula: 1 - distance / (length of the longer of the two strings).
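
Steps 2 to 4 and the similarity formula can be summarized as a recurrence. The notation is an assumption made here for readability: a and b stand for str1 and str2, characters are indexed from 1, and t corresponds to temp.

```latex
\begin{aligned}
d_{i,0} &= i, \qquad d_{0,j} = j, \\
t_{i,j} &= \begin{cases} 0 & \text{if } a_i = b_j \\ 1 & \text{otherwise} \end{cases} \\
d_{i,j} &= \min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + t_{i,j}\bigr), \qquad 1 \le i \le n,\; 1 \le j \le m \\
\text{similarity} &= 1 - \frac{d_{n,m}}{\max(n, m)}
\end{aligned}
```

The three terms inside the minimum correspond to a deletion, an insertion and a replacement (or a free match when t is 0).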


For the sake of visualization, I write the two strings along the rows and columns; this is not needed in the actual calculation. Using the strings "ivan1" and "ivan2" as an example, here is how the values in the matrix develop:

1. The values in the first row and the first column count up from 0.

            i   v   a   n   1
        0   1   2   3   4   5
    i   1
    v   2
    a   3
    n   4
    2   5

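2. Generate the values of the i column. Each cell is the minimum of Matrix[i-1, j] + 1, Matrix[i, j-1] + 1 and Matrix[i-1, j-1] + t, where t is 0 if the two characters match and 1 otherwise.

For the first cell (row i, column i) the diagonal neighbour gives 0 + t = 0, the neighbour above gives 1 + 1 = 2 and the neighbour to the left gives 1 + 1 = 2, so the minimum of the three is 0. The remaining cells in the column follow by analogy: 1, 2, 3, 4.

            i   v   a   n   1
        0   1   2   3   4   5
    i   1   0
    v   2   1
    a   3   2
    n   4   3
    2   5   4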

3. Generate the values of the v column in the same way.

            i   v   a   n   1
        0   1   2   3   4   5
    i   1   0   1
    v   2   1   0
    a   3   2   1
    n   4   3   2
    2   5   4   3

And so on, until the whole matrix is filled in:

            i   v   a   n   1
        0   1   2   3   4   5
    i   1   0   1   2   3   4
    v   2   1   0   1   2   3
    a   3   2   1   0   1   2
    n   4   3   2   1   0   1
    2   5   4   3   2   1   1

The value in the bottom-right cell is the result: the distance is 1.

Similarity: 1 - 1 / Math.Max("ivan1".Length, "ivan2".Length) = 0.8
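
The full matrix is mainly useful for visualization. As a cross-check of the final value, here is a minimal sketch of a variant that keeps only two rows at a time. It is not the implementation described in this post; the class and method names are hypothetical.

```csharp
using System;

public static class TwoRowLevenshtein
{
    // Variant of the matrix algorithm that stores only the previous
    // and the current row instead of the whole (n+1) * (m+1) matrix.
    public static int Distance(string str1, string str2)
    {
        int n = str1.Length;
        int m = str2.Length;
        if (n == 0) return m;
        if (m == 0) return n;

        int[] previous = new int[m + 1];
        int[] current = new int[m + 1];

        // Row 0 of the matrix: 0, 1, 2, ..., m.
        for (int j = 0; j <= m; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= n; i++)
        {
            // Column 0 of row i.
            current[0] = i;
            for (int j = 1; j <= m; j++)
            {
                int temp = str1[i - 1] == str2[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + temp);
            }

            // The row just computed becomes the previous row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        // After the final swap the last computed row is in "previous".
        return previous[m];
    }

    public static void Main()
    {
        int distance = Distance("ivan1", "ivan2");
        decimal similarity = 1m - (decimal)distance / Math.Max("ivan1".Length, "ivan2".Length);
        Console.WriteLine("distance = {0}, similarity = {1}", distance, similarity);
        Console.WriteLine("kitten -> sitting: {0}", Distance("kitten", "sitting"));
    }
}
```

For "ivan1" and "ivan2" this should agree with the bottom-right cell of the matrix, and for "kitten" and "sitting" with the three operations listed earlier.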

C# implementation

    using System;

    public class LevenshteinDistance
    {
        /// <summary>
        /// Returns the smallest of three integers.
        /// </summary>
        private int LowerOfThree(int first, int second, int third)
        {
            int min = Math.Min(first, second);
            return Math.Min(min, third);
        }

        /// <summary>
        /// Computes the edit distance between two strings.
        /// </summary>
        private int Levenshtein_Distance(string str1, string str2)
        {
            int n = str1.Length;
            int m = str2.Length;

            // If one string is empty, the distance is the length of the other.
            if (n == 0)
            {
                return m;
            }
            if (m == 0)
            {
                return n;
            }

            int[,] matrix = new int[n + 1, m + 1];

            // Initialize first column
            for (int i = 0; i <= n; i++)
            {
                matrix[i, 0] = i;
            }

            // Initialize first row
            for (int j = 0; j <= m; j++)
            {
                matrix[0, j] = j;
            }

            for (int i = 1; i <= n; i++)
            {
                char ch1 = str1[i - 1];
                for (int j = 1; j <= m; j++)
                {
                    char ch2 = str2[j - 1];
                    // temp is 0 when the characters match, 1 otherwise.
                    int temp = ch1.Equals(ch2) ? 0 : 1;
                    matrix[i, j] = LowerOfThree(
                        matrix[i - 1, j] + 1,
                        matrix[i, j - 1] + 1,
                        matrix[i - 1, j - 1] + temp);
                }
            }

            // Print the matrix for inspection.
            for (int i = 0; i <= n; i++)
            {
                for (int j = 0; j <= m; j++)
                {
                    Console.Write("{0} ", matrix[i, j]);
                }
                Console.WriteLine("");
            }

            return matrix[n, m];
        }

        /// <summary>
        /// Calculate string similarity
        /// </summary>
        public decimal LevenshteinDistancePercent(string str1, string str2)
        {
            int maxLength = Math.Max(str1.Length, str2.Length);
            // Two empty strings are identical; this also avoids dividing by zero.
            if (maxLength == 0)
            {
                return 1;
            }
            int val = Levenshtein_Distance(str1, str2);
            return 1 - (decimal)val / maxLength;
        }
    }

Usage

    class Program
    {
        static void Main(string[] args)
        {
            string str1 = "ivan1";
            string str2 = "ivan2";
            Console.WriteLine("String 1 {0}", str1);
            Console.WriteLine("String 2 {0}", str2);
            // Multiply by 100 to show the similarity as a percentage.
            Console.WriteLine("Similarity of {0}%", new LevenshteinDistance().LevenshteinDistancePercent(str1, str2) * 100);
            Console.ReadLine();
        }
    }

Result

The program prints the distance matrix followed by the similarity, which is 80% for this pair of strings.

Original article: http://www.cnblogs.com/ivanyb/archive/2011/11/25/2263356.html
