From: http://www.lingdonge.com/seo/672.html
The string similarity algorithm discussed here is the Levenshtein distance algorithm (also called edit distance).
It measures how similar two strings are, and it is particularly useful in search engines and CAPTCHA recognition.
For example, when collecting (scraping) articles, you can use keywords to judge similarity and pick the articles with the highest matching degree to collect, and so on.
The explanation below draws on other blog posts.
Edit distance
Levenshtein distance (also called edit distance) is the minimum number of edit operations required to convert one string into another. The larger the distance, the more the two strings differ. The permitted edit operations are replacing one character with another, inserting one character, and deleting one character.
For example, to convert "kitten" to "sitting":
sitten (k → s)
sittin (e → i)
sitting (→ g)
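The three edits can be replayed in code to confirm that they really turn one word into the other. This is only a sketch; the class EditReplay and its method are hypothetical names introduced for illustration and do not come from the original article.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the original article): replays the three
// edits listed above and returns every intermediate string.
public static class EditReplay
{
    public static string[] KittenToSitting()
    {
        var sb = new StringBuilder("kitten");
        var steps = new string[3];

        sb[0] = 's';                 // replace k with s
        steps[0] = sb.ToString();    // "sitten"

        sb[4] = 'i';                 // replace e with i
        steps[1] = sb.ToString();    // "sittin"

        sb.Append('g');              // insert g at the end
        steps[2] = sb.ToString();    // "sitting"

        return steps;                // three operations in total
    }
}
```

Two replacements and one insertion are used, which matches the distance of 3 between the two words.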
The Russian scientist Vladimir Levenshtein proposed this concept in 1965, which is why it is also called Levenshtein distance.
For example:
- If str1 = "ivan" and str2 = "ivan", the calculated distance is 0 because nothing has to be converted. Similarity = 1 - 0 / Math.Max(str1.Length, str2.Length) = 1
- If str1 = "ivan1" and str2 = "ivan2", the calculated distance is 1. The "1" in str1 is converted to "2", which changes one character, so the distance is 1. Similarity = 1 - 1 / Math.Max(str1.Length, str2.Length) = 0.8
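The arithmetic in these two examples can be checked with a small helper. This is a minimal sketch that assumes the distance has already been computed; the name SimilarityFormula.FromDistance is hypothetical, and treating two empty strings as fully similar is an assumption made here to avoid dividing by zero.

```csharp
using System;

// Hypothetical helper: turns an already-computed edit distance into the
// similarity score used in the examples above.
public static class SimilarityFormula
{
    public static decimal FromDistance(int distance, int length1, int length2)
    {
        int maxLength = Math.Max(length1, length2);

        // Assumption: two empty strings count as identical. This also
        // avoids a division by zero.
        if (maxLength == 0)
        {
            return 1m;
        }

        return 1m - (decimal)distance / maxLength;
    }
}
```

With these definitions, FromDistance(0, 4, 4) gives 1 and FromDistance(1, 5, 5) gives 0.8, matching the two examples.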
Applications
DNA analysis
Spell check
Speech Recognition
Plagiarism detection
Small-scale approximate string search, such as showing a list of similar results as keywords are typed into a search engine
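As an illustration of the last application, the sketch below ranks a handful of candidate strings by their similarity to a query. It is not taken from the original article: the class FuzzySearch and its methods are hypothetical, and the distance is computed with a compact two-row variant of the procedure described in this article rather than with a full matrix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of a small-scale approximate search: rank candidate
// strings by similarity to a query. All names are illustrative only.
public static class FuzzySearch
{
    // Edit distance kept in two rows instead of a full matrix.
    public static int Distance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev); // reuse the two buffers
        }
        return prev[b.Length];
    }

    // 1 - distance / length of the longer string; two empty strings give 1.
    public static double Similarity(string a, string b)
    {
        int max = Math.Max(a.Length, b.Length);
        return max == 0 ? 1.0 : 1.0 - (double)Distance(a, b) / max;
    }

    // Returns the candidates ordered from most to least similar.
    public static List<string> Rank(string query, IEnumerable<string> candidates)
    {
        return candidates.OrderByDescending(c => Similarity(query, c)).ToList();
    }
}
```

For instance, FuzzySearch.Rank("ivan1", new[] { "john", "ivan2", "ivan1" }) puts "ivan1" first, then "ivan2", then "john". Comparing the query against every candidate is only practical for small lists, which is why the application above is described as small-scale.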
Algorithm process
- If the length of str1 or str2 is 0, return the length of the other string: if (str1.Length == 0) return str2.Length; if (str2.Length == 0) return str1.Length;
- Initialize an (n + 1) × (m + 1) matrix d, where n and m are the lengths of str1 and str2, and fill its first row and first column with values increasing from 0.
- Scan the two strings (n × m steps, with characters numbered from 1). If str1[i] == str2[j], set temp to 0; otherwise set temp to 1. Then assign d[i, j] the minimum of these three values: d[i-1, j] + 1, d[i, j-1] + 1, and d[i-1, j-1] + temp.
- After the scan, return the last value of the matrix, d[n, m]; this is the distance between the two strings.
Similarity formula: 1 - (distance / length of the longer of the two strings).
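The steps and the similarity formula above can also be written compactly as a recurrence. The notation is an assumption made for this summary: n and m are the lengths of str1 and str2, characters are numbered from 1, and the subscripted temp is the 0-or-1 value from the scanning step.

```latex
\begin{aligned}
d_{i,0} &= i, \qquad d_{0,j} = j, \\
d_{i,j} &= \min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + \mathrm{temp}_{i,j}\bigr),
\qquad
\mathrm{temp}_{i,j} =
\begin{cases}
0 & \text{if } \mathrm{str1}[i] = \mathrm{str2}[j], \\
1 & \text{otherwise},
\end{cases} \\
\text{similarity} &= 1 - \frac{d_{n,m}}{\max(n, m)}.
\end{aligned}
```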
To make this easier to follow, I write the two strings along the rows and columns of the matrix, which is not needed in the actual computation. Let's take the strings "ivan1" and "ivan2" as an example and look at the values in the matrix:
1. The values of the first row and the first column increase from 0.
2. Each remaining cell takes the minimum of matrix[i-1, j] + 1, matrix[i, j-1] + 1, and matrix[i-1, j-1] + t.
3. The values of the column for "v" are generated in the same way.
And so on, until the whole matrix has been filled in.
The distance between the two strings is 1.
Similarity: 1 - 1 / Math.Max("ivan1".Length, "ivan2".Length) = 0.8
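The completed matrix for this example can be reproduced with a short sketch. It is not part of the original class library: MatrixDemo and its methods are hypothetical names, and the rendering assumes every distance is a single digit, which holds for these short strings.

```csharp
using System;
using System.Text;

// Hypothetical helper: fills the matrix as described above and renders it
// with str1 down the side and str2 across the top.
public static class MatrixDemo
{
    public static int[,] Build(string str1, string str2)
    {
        int n = str1.Length, m = str2.Length;
        var d = new int[n + 1, m + 1];
        for (int i = 0; i <= n; i++) d[i, 0] = i;   // first column
        for (int j = 0; j <= m; j++) d[0, j] = j;   // first row

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int t = str1[i - 1] == str2[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + t);
            }
        }
        return d;
    }

    // Render("ivan1", "ivan2") produces:
    //      i v a n 2
    //    0 1 2 3 4 5
    // i  1 0 1 2 3 4
    // v  2 1 0 1 2 3
    // a  3 2 1 0 1 2
    // n  4 3 2 1 0 1
    // 1  5 4 3 2 1 1
    public static string Render(string str1, string str2)
    {
        int[,] d = Build(str1, str2);
        var sb = new StringBuilder();

        sb.Append("    ");                            // space above the labels
        foreach (char c in str2) sb.Append(' ').Append(c);
        sb.Append('\n');

        for (int i = 0; i <= str1.Length; i++)
        {
            sb.Append(i == 0 ? ' ' : str1[i - 1]).Append(' ');
            for (int j = 0; j <= str2.Length; j++) sb.Append(' ').Append(d[i, j]);
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

The bottom-right cell holds the distance, 1, which gives the similarity of 0.8 computed above.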
Below is the core class:
using System;

public class LevenshteinDistance
{
    private static LevenshteinDistance _instance = null;

    public static LevenshteinDistance Instance
    {
        get
        {
            // Store the instance so that later calls reuse it.
            if (_instance == null)
            {
                _instance = new LevenshteinDistance();
            }
            return _instance;
        }
    }

    /// <summary>
    /// Returns the smallest of three numbers.
    /// </summary>
    public int LowerOfThree(int first, int second, int third)
    {
        int min = Math.Min(first, second);
        return Math.Min(min, third);
    }

    /// <summary>
    /// Computes the Levenshtein distance between two strings.
    /// </summary>
    public int Levenshtein_Distance(string str1, string str2)
    {
        int[,] matrix;
        int n = str1.Length;
        int m = str2.Length;
        int temp = 0;
        char ch1;
        char ch2;
        int i = 0;
        int j = 0;

        if (n == 0)
        {
            return m;
        }
        if (m == 0)
        {
            return n;
        }

        matrix = new int[n + 1, m + 1];
        for (i = 0; i <= n; i++)
        {
            // Initialize the first column.
            matrix[i, 0] = i;
        }
        for (j = 0; j <= m; j++)
        {
            // Initialize the first row.
            matrix[0, j] = j;
        }

        for (i = 1; i <= n; i++)
        {
            ch1 = str1[i - 1];
            for (j = 1; j <= m; j++)
            {
                ch2 = str2[j - 1];
                if (ch1.Equals(ch2))
                {
                    temp = 0;
                }
                else
                {
                    temp = 1;
                }
                matrix[i, j] = LowerOfThree(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1, matrix[i - 1, j - 1] + temp);
            }
        }

        // Print the matrix for inspection.
        for (i = 0; i <= n; i++)
        {
            for (j = 0; j <= m; j++)
            {
                Console.Write("{0} ", matrix[i, j]);
            }
            Console.WriteLine("");
        }

        return matrix[n, m];
    }

    /// <summary>
    /// Calculates the similarity of two strings.
    /// </summary>
    public decimal LevenshteinDistancePercent(string str1, string str2)
    {
        int maxLength = Math.Max(str1.Length, str2.Length);
        // Two empty strings are identical; this also avoids dividing by zero.
        if (maxLength == 0)
        {
            return 1;
        }
        int val = Levenshtein_Distance(str1, str2);
        return 1 - (decimal)val / maxLength;
    }
}
The algorithm is actually quite simple.
It is called as follows:
static void Main(string[] args)
{
    string str1 = "ivan1";
    string str2 = "ivan2";
    Console.WriteLine("string 1 {0}", str1);
    Console.WriteLine("string 2 {0}", str2);
    Console.WriteLine("similarity {0}%", new LevenshteinDistance().LevenshteinDistancePercent(str1, str2) * 100);
    Console.ReadLine();
}
Chengdu SEO Xiaowu's view: thoroughly studying some of the algorithms commonly used by Baidu and other search engines is of great benefit for understanding how search engines operate and for devising sound SEO keyword strategies.