While working on verification code (CAPTCHA) recognition, I needed to compare the similarity of character codes, which led me to the edit distance algorithm. This post records its principle and a C# implementation.
According to the Baidu Encyclopedia introduction:
Edit distance, also known as Levenshtein distance, is the minimum number of editing operations required to turn one of two strings into the other. The larger the distance, the more the two strings differ. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character.
For example, turning the word "kitten" into "sitting" takes three edits:
sitten (k→s)
sittin (e→i)
sitting (→g)
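The three edits in this example can be replayed step by step. The following is only an illustrative sketch; the class name `KittenToSitting` and the method `ApplyEdits` are hypothetical names that do not come from the original post.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class KittenToSitting
{
    // Replays the three edits and returns the string after each step.
    public static string[] ApplyEdits()
    {
        var sb = new StringBuilder("kitten");
        var steps = new string[3];

        sb[0] = 's';              // substitution: k -> s
        steps[0] = sb.ToString();

        sb[4] = 'i';              // substitution: e -> i
        steps[1] = sb.ToString();

        sb.Append('g');           // insertion: g at the end
        steps[2] = sb.ToString();

        return steps;
    }
}
```

Each step is one permitted operation (two replacements and one insertion), so three edits are enough to get from "kitten" to "sitting".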
The Russian scientist Vladimir Levenshtein introduced the concept in 1965, which is why it is also called Levenshtein distance.
For example:
- If str1 = "ivan" and str2 = "ivan", the distance is 0 because no edit is needed. Similarity = 1 - 0 / Math.Max(str1.Length, str2.Length) = 1
- If str1 = "ivan1" and str2 = "ivan2", the distance is 1: the "1" in str1 is replaced with "2", a single-character edit. Similarity = 1 - 1 / Math.Max(str1.Length, str2.Length) = 0.8
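The similarity arithmetic in these two examples can be written as a small helper. This is a minimal sketch that assumes the distance has already been computed; `SimilarityExample` and `Similarity` are hypothetical names, and treating two empty strings as fully similar is my own assumption, not something the original post states.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SimilarityExample
{
    // similarity = 1 - distance / max(length of str1, length of str2)
    public static decimal Similarity(int distance, string str1, string str2)
    {
        int maxLength = Math.Max(str1.Length, str2.Length);

        // Assumption: two empty strings count as identical.
        // This also avoids dividing by zero.
        if (maxLength == 0)
        {
            return 1m;
        }

        // The cast to decimal prevents integer division.
        return 1 - (decimal)distance / maxLength;
    }
}
```

With the values above, `Similarity(0, "ivan", "ivan")` gives 1 and `Similarity(1, "ivan1", "ivan2")` gives 0.8.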
Applications
DNA analysis
Spell check
Speech recognition
Plagiarism detection
Thanks to Big Stone, who shared a good link in the comments about another application of this method. It is added here:
Small-scale approximate string search, where the requirement is similar to a search engine showing a list of close matches for the entered keywords. Article link: "[Algorithm] Approximate string search"
Algorithm process
- If the length of str1 or str2 is 0, return the length of the other string. if (str1.Length == 0) return str2.Length; if (str2.Length == 0) return str1.Length;
- Initialize an (n+1) × (m+1) matrix d, where n and m are the lengths of str1 and str2, and fill the first row and the first column with values increasing from 0.
- Scan the two strings (n × m steps). If the i-th character of str1 equals the j-th character of str2 (str1[i-1] == str2[j-1] with zero-based indexing), set temp to 0; otherwise set temp to 1. Then assign d[i,j] the minimum of d[i-1,j] + 1, d[i,j-1] + 1, and d[i-1,j-1] + temp.
- After the scan, return the last value of the matrix, d[n,m], which is the distance between the two strings.
Similarity formula: 1 - distance / (length of the longer of the two strings).
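The steps above can also be stated as a recurrence. This is only a restatement of the process in mathematical notation; the symbols $a$ and $b$ (standing for str1 and str2) and $t_{i,j}$ (standing for temp) are notation introduced here, not taken from the original post.

```latex
\begin{aligned}
d_{i,0} &= i, \qquad d_{0,j} = j
  && (0 \le i \le n,\ 0 \le j \le m) \\
t_{i,j} &=
  \begin{cases}
    0 & \text{if } a_i = b_j \\
    1 & \text{otherwise}
  \end{cases} \\
d_{i,j} &= \min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + t_{i,j}\bigr)
  && (1 \le i \le n,\ 1 \le j \le m) \\
\text{distance} &= d_{n,m}, \qquad
\text{similarity} = 1 - \frac{d_{n,m}}{\max(n, m)}
\end{aligned}
```

Here $a_i$ is the i-th character of str1 and $b_j$ is the j-th character of str2, counting from 1. The similarity is undefined when both strings are empty, because $\max(n, m) = 0$.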
For the sake of visualization, I write the two strings along the rows and columns of the matrix; this is not needed in the actual calculation. Using the strings "ivan1" and "ivan2" as an example, let's look at the values in the matrix at each stage:
1. The first row and the first column start at 0 and increase:

|   |   | i | v | a | n | 1 |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| i | 1 |   |   |   |   |   |
| v | 2 |   |   |   |   |   |
| a | 3 |   |   |   |   |   |
| n | 4 |   |   |   |   |   |
| 2 | 5 |   |   |   |   |   |
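2. Generate the values of the i column. Each cell takes the minimum of Matrix[i-1, j] + 1, Matrix[i, j-1] + 1, and Matrix[i-1, j-1] + t. The annotated cells show the three candidates for the first cell of the column:

|   |       | i | v | a | n | 1 |
|---|---|---|---|---|---|---|
|   | 0+t=0 | 1+1=2 | 2 | 3 | 4 | 5 |
| i | 1+1=2 | minimum of the three = 0 |   |   |   |   |
| v | 2 | by analogy: 1 |   |   |   |   |
| a | 3 | 2 |   |   |   |   |
| n | 4 | 3 |   |   |   |   |
| 2 | 5 | 4 |   |   |   |   |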
3. Generate the values of the v column:

|   |   | i | v | a | n | 1 |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 |   |   |   |
| i | 1 | 0 | 1 |   |   |   |
| v | 2 | 1 | 0 |   |   |   |
| a | 3 | 2 | 1 |   |   |   |
| n | 4 | 3 | 2 |   |   |   |
| 2 | 5 | 4 | 3 |   |   |   |
Continue in the same way until the whole matrix is filled:

|   |   | i | v | a | n | 1 |
|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| i | 1 | 0 | 1 | 2 | 3 | 4 |
| v | 2 | 1 | 0 | 1 | 2 | 3 |
| a | 3 | 2 | 1 | 0 | 1 | 2 |
| n | 4 | 3 | 2 | 1 | 0 | 1 |
| 2 | 5 | 4 | 3 | 2 | 1 | 1 |
The value in the bottom-right cell is the distance: 1.
Similarity: 1 - 1 / Math.Max("ivan1".Length, "ivan2".Length) = 0.8
C# implementation of the algorithm
    using System;

    public class LevenshteinDistance
    {
        /// <summary>
        /// Returns the smallest of three integers.
        /// </summary>
        private int LowerOfThree(int first, int second, int third)
        {
            int min = Math.Min(first, second);
            return Math.Min(min, third);
        }

        /// <summary>
        /// Computes the edit distance between two strings.
        /// </summary>
        private int Levenshtein_Distance(string str1, string str2)
        {
            int n = str1.Length;
            int m = str2.Length;

            // If one string is empty, the distance is the length of the other.
            if (n == 0) return m;
            if (m == 0) return n;

            int[,] matrix = new int[n + 1, m + 1];

            // Initialize the first column.
            for (int i = 0; i <= n; i++) matrix[i, 0] = i;

            // Initialize the first row.
            for (int j = 0; j <= m; j++) matrix[0, j] = j;

            for (int i = 1; i <= n; i++)
            {
                char ch1 = str1[i - 1];
                for (int j = 1; j <= m; j++)
                {
                    char ch2 = str2[j - 1];

                    // temp is 0 when the characters match, otherwise 1.
                    int temp = ch1.Equals(ch2) ? 0 : 1;

                    matrix[i, j] = LowerOfThree(
                        matrix[i - 1, j] + 1,
                        matrix[i, j - 1] + 1,
                        matrix[i - 1, j - 1] + temp);
                }
            }

            // Print the whole matrix for inspection.
            for (int i = 0; i <= n; i++)
            {
                for (int j = 0; j <= m; j++)
                {
                    Console.Write("{0} ", matrix[i, j]);
                }
                Console.WriteLine("");
            }

            return matrix[n, m];
        }

        /// <summary>
        /// Calculates the similarity of two strings (1 means identical).
        /// </summary>
        public decimal LevenshteinDistancePercent(string str1, string str2)
        {
            int maxLength = Math.Max(str1.Length, str2.Length);

            // Two empty strings are treated as identical; this also avoids dividing by zero.
            if (maxLength == 0) return 1;

            int val = Levenshtein_Distance(str1, str2);
            return 1 - (decimal)val / maxLength;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            string str1 = "ivan1";
            string str2 = "ivan2";
            Console.WriteLine("String 1: {0}", str1);
            Console.WriteLine("String 2: {0}", str2);
            Console.WriteLine("Similarity: {0}%",
                new LevenshteinDistance().LevenshteinDistancePercent(str1, str2) * 100);
            Console.ReadLine();
        }
    }
Source: http://www.cnblogs.com/ivanyb/archive/2011/11/25/2263356.html
String similarity algorithm (edit distance algorithm, Levenshtein Distance) (reposted)