C# String Similarity Algorithm (SEO Integration Series): the Levenshtein Distance Method

Source: Internet
Author: User

From: http://www.lingdonge.com/seo/672.html

The string similarity algorithm described here is based on the Levenshtein distance algorithm.

It measures how similar two strings are, and it is particularly useful in search engines and verification code recognition.

For example, when collecting (scraping) articles, you can run a similarity check against your keywords and collect only the articles with the highest degree of match.

The explanation below draws on other blog posts.

Edit distance: the Levenshtein distance (also called edit distance) is the minimum number of edit operations needed to convert one string into another. The larger the distance, the more different the two strings are. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.

For example, converting kitten to sitting takes three edits, so the distance is 3:

sitten (k → s)

sittin (e → i)

sitting (→ g)

The Russian scientist Vladimir Levenshtein proposed this concept in 1965, which is why it is called the Levenshtein distance.

For example:

  • If str1 = "Ivan" and str2 = "Ivan", the computed distance is 0, because no conversion is needed. Similarity = 1 - 0 / Math.Max(str1.Length, str2.Length) = 1.
  • If str1 = "ivan1" and str2 = "ivan2", the computed distance is 1: changing the "1" in str1 to "2" converts one character. Similarity = 1 - 1 / Math.Max(str1.Length, str2.Length) = 0.8.
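The arithmetic in the two bullets above can be sketched in C#. This is only an illustration: the names `SimilarityDemo` and `Similarity` are hypothetical, the distance is passed in rather than computed, and returning 1 for two empty strings is an assumption the article does not discuss.

```csharp
using System;

public static class SimilarityDemo
{
    // distance: an edit distance that has already been computed.
    // length1, length2: the lengths of the two strings being compared.
    public static decimal Similarity(int distance, int length1, int length2)
    {
        int maxLength = Math.Max(length1, length2);
        if (maxLength == 0)
        {
            // Assumption: two empty strings count as identical.
            // This also avoids dividing by zero.
            return 1m;
        }
        return 1m - (decimal)distance / maxLength;
    }

    public static void Main()
    {
        // "Ivan" vs "Ivan": distance 0, as in the first bullet.
        Console.WriteLine(Similarity(0, "Ivan".Length, "Ivan".Length));
        // "ivan1" vs "ivan2": distance 1, as in the second bullet.
        Console.WriteLine(Similarity(1, "ivan1".Length, "ivan2".Length));
    }
}
```

Using `decimal` rather than `double` keeps values such as 0.8 exact, which makes the result easier to compare against a threshold.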

Applications

  • DNA analysis
  • Spell checking
  • Speech recognition
  • Plagiarism detection
  • Small-scale approximate string search, such as showing a list of similar results while keywords are typed into a search engine
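As a rough sketch of the approximate-search use case, the snippet below picks the candidate closest to a query string. The names `ApproximateSearch`, `Distance`, and `ClosestMatch`, and the sample keyword list, are hypothetical and chosen for illustration; the distance function is a plain dynamic-programming Levenshtein distance, not a tuned search-engine implementation.

```csharp
using System;
using System.Collections.Generic;

public static class ApproximateSearch
{
    // Plain dynamic-programming Levenshtein distance.
    public static int Distance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // first column
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // first row

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Returns the candidate with the smallest distance to the query.
    // The first candidate wins on ties; returns null if there are no candidates.
    public static string ClosestMatch(string query, IEnumerable<string> candidates)
    {
        string best = null;
        int bestDistance = int.MaxValue;
        foreach (string candidate in candidates)
        {
            int distance = Distance(query, candidate);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    public static void Main()
    {
        string[] keywords = { "levenshtein", "hamming", "jaccard" };
        // A misspelled query still finds the nearest keyword.
        Console.WriteLine(ClosestMatch("levenshtien", keywords));
    }
}
```

A linear scan like this costs one full distance computation per candidate, which is why the approach suits small candidate lists.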

Algorithm process

  1. If the length of str1 or str2 is 0, return the length of the other string: if (str1.Length == 0) return str2.Length; if (str2.Length == 0) return str1.Length;
  2. Initialize an (n + 1) × (m + 1) matrix d, where n and m are the lengths of str1 and str2, and let the values in the first row and first column count up from 0.
  3. Scan the two strings (n × m steps). If the i-th character of str1 equals the j-th character of str2 (str1[i - 1] == str2[j - 1] with zero-based indexing), set temp to 0; otherwise set temp to 1. Then assign d[i, j] the minimum of the three values d[i - 1, j] + 1, d[i, j - 1] + 1, and d[i - 1, j - 1] + temp.
  4. After the scan, the last value of the matrix, d[n, m], is the distance between the two strings.

Similarity formula: similarity = 1 - distance / (length of the longer of the two strings).
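The steps above can also be written as a recurrence. The notation is introduced here for illustration only: a and b stand for str1 and str2, n and m for their lengths, and D for the matrix, with characters indexed from 1.

```latex
\begin{aligned}
D_{i,0} &= i, \qquad 0 \le i \le n, \\
D_{0,j} &= j, \qquad 0 \le j \le m, \\
D_{i,j} &= \min\bigl( D_{i-1,j} + 1,\; D_{i,j-1} + 1,\; D_{i-1,j-1} + t_{i,j} \bigr),
\qquad
t_{i,j} =
\begin{cases}
0 & \text{if } a_i = b_j, \\
1 & \text{otherwise,}
\end{cases} \\
\text{distance} &= D_{n,m},
\qquad
\text{similarity} = 1 - \frac{D_{n,m}}{\max(n, m)}.
\end{aligned}
```

The three terms inside the minimum correspond to deleting a character, inserting a character, and replacing a character (or keeping it when the two characters match).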

To make this easier to follow, I write the two strings along the rows and columns of the matrix; this is not needed in the actual computation. Let's take the strings "ivan1" and "ivan2" as an example and look at the intermediate values in the matrix:

1. The values of the first row and the first column count up from 0.

2. Every other cell is the minimum of matrix[i - 1, j] + 1, matrix[i, j - 1] + 1, and matrix[i - 1, j - 1] + temp.

3. The values in column v are generated in the same way.

And so on, until the whole matrix has been filled in.

The distance between them is 1.

Similarity: 1 - 1 / Math.Max("ivan1".Length, "ivan2".Length) = 0.8
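For reference, here is the completed matrix for "ivan1" and "ivan2", worked out by hand from the rules above. The layout is my own reconstruction: rows follow the characters of "ivan1" and columns follow the characters of "ivan2".

```latex
\begin{array}{c|cccccc}
  &   & i & v & a & n & 2 \\ \hline
  & 0 & 1 & 2 & 3 & 4 & 5 \\
i & 1 & 0 & 1 & 2 & 3 & 4 \\
v & 2 & 1 & 0 & 1 & 2 & 3 \\
a & 3 & 2 & 1 & 0 & 1 & 2 \\
n & 4 & 3 & 2 & 1 & 0 & 1 \\
1 & 5 & 4 & 3 & 2 & 1 & 1
\end{array}
```

The diagonal stays at 0 while the characters match, and the bottom-right cell becomes 1 because "1" and "2" differ. That cell is the distance.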

Below is the core class library:

    using System;

    public class LevenshteinDistance
    {
        private static LevenshteinDistance _instance = null;

        public static LevenshteinDistance Instance
        {
            get
            {
                if (_instance == null)
                {
                    // Store the new instance so that later calls reuse it.
                    _instance = new LevenshteinDistance();
                }
                return _instance;
            }
        }

        /// <summary>
        /// Returns the smallest of three numbers.
        /// </summary>
        public int LowerOfThree(int first, int second, int third)
        {
            int min = Math.Min(first, second);
            return Math.Min(min, third);
        }

        /// <summary>
        /// Computes the Levenshtein distance between two strings.
        /// </summary>
        public int Levenshtein_Distance(string str1, string str2)
        {
            int[,] matrix;
            int n = str1.Length;
            int m = str2.Length;
            int temp = 0;
            char ch1;
            char ch2;
            int i = 0;
            int j = 0;

            if (n == 0) { return m; }
            if (m == 0) { return n; }

            matrix = new int[n + 1, m + 1];

            for (i = 0; i <= n; i++)
            {
                // Initialize the first column.
                matrix[i, 0] = i;
            }
            for (j = 0; j <= m; j++)
            {
                // Initialize the first row.
                matrix[0, j] = j;
            }

            for (i = 1; i <= n; i++)
            {
                ch1 = str1[i - 1];
                for (j = 1; j <= m; j++)
                {
                    ch2 = str2[j - 1];
                    if (ch1.Equals(ch2)) { temp = 0; } else { temp = 1; }
                    matrix[i, j] = LowerOfThree(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1, matrix[i - 1, j - 1] + temp);
                }
            }

            // Print the matrix so the intermediate values can be inspected.
            for (i = 0; i <= n; i++)
            {
                for (j = 0; j <= m; j++)
                {
                    Console.Write("{0} ", matrix[i, j]);
                }
                Console.WriteLine("");
            }

            return matrix[n, m];
        }

        /// <summary>
        /// Calculates the similarity of two strings.
        /// </summary>
        public decimal LevenshteinDistancePercent(string str1, string str2)
        {
            int maxLength = Math.Max(str1.Length, str2.Length);
            // Two empty strings are identical; returning early avoids dividing by zero.
            if (maxLength == 0) { return 1; }
            int val = Levenshtein_Distance(str1, str2);
            return 1 - (decimal)val / maxLength;
        }
    }

The algorithm is actually quite simple.

It is called as follows:

    static void Main(string[] args)
    {
        string str1 = "ivan1";
        string str2 = "ivan2";
        Console.WriteLine("string 1 {0}", str1);
        Console.WriteLine("string 2 {0}", str2);
        // Multiply by 100 to show the similarity as a percentage.
        Console.WriteLine("similarity {0}%", new LevenshteinDistance().LevenshteinDistancePercent(str1, str2) * 100);
        Console.ReadLine();
    }

Chengdu SEO Xiaowu's view: thoroughly studying some of the algorithms commonly used by Baidu and other search engines is of great benefit for understanding how search engines operate and for devising sound SEO keyword strategies.

This article is an original by Chengdu SEO Xiaowu and took effort to write. If you repost it, please keep the link: C# String Similarity Algorithm in the SEO Integration Series. Thank you.

 
