C# String Similarity Algorithm from the SEO Integration Series: the Levenshtein Distance Method

Last Update:2018-12-04 Source: Internet

Author: User


From: http://www.lingdonge.com/seo/672.html

The string similarity algorithm described here is based on the Levenshtein distance, also known as the edit distance.

It measures how similar two strings are. It is particularly useful in search engines and in CAPTCHA recognition.

For example, when scraping articles, you can compare keywords for similarity and collect only the articles with the highest matching score.

The explanation below draws on other blog posts.

Edit distance

The Levenshtein distance (also called the edit distance) is the minimum number of edit operations needed to convert one string into another. The larger the distance, the more the two strings differ. The permitted edit operations are replacing one character with another, inserting one character, and deleting one character.

For example, converting kitten to sitting takes three edits, so the distance is 3:

sitten (k → s)

sittin (e → i)

sitting (insert g at the end)
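
The three edits above can be replayed in code. The following is a minimal sketch, not part of the original article: the class `KittenToSitting` and its helpers `Substitute` and `Insert` are hypothetical names introduced only for illustration, and the sketch only applies the listed edits; it does not prove that three is the minimum.

```csharp
using System;

public static class KittenToSitting
{
    // Replaces the character at the given index with c.
    public static string Substitute(string s, int index, char c)
    {
        return s.Substring(0, index) + c + s.Substring(index + 1);
    }

    // Inserts c before the given index; index == s.Length appends it.
    public static string Insert(string s, int index, char c)
    {
        return s.Insert(index, c.ToString());
    }

    // Applies the three edits from the example, one per step.
    public static string[] Steps()
    {
        string step1 = Substitute("kitten", 0, 's');      // k -> s
        string step2 = Substitute(step1, 4, 'i');         // e -> i
        string step3 = Insert(step2, step2.Length, 'g');  // append g
        return new[] { step1, step2, step3 };
    }
}
```

Calling `KittenToSitting.Steps()` yields the intermediate strings sitten, sittin and sitting, one edit apart from each other.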

The Russian scientist Vladimir Levenshtein proposed this concept in 1965, which is why it is called the Levenshtein distance.

For example:

If str1 = "Ivan" and str2 = "Ivan", the distance is 0 because nothing needs to be converted. Similarity = 1 - 0 / Math.Max(str1.Length, str2.Length) = 1
If str1 = "ivan1" and str2 = "ivan2", the distance is 1: the "1" in str1 is replaced with "2", which is one character conversion. Similarity = 1 - 1 / Math.Max(str1.Length, str2.Length) = 0.8
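
The arithmetic in these two examples can be checked with a small helper. This is a sketch under assumptions: `SimilarityDemo.Similarity` is a hypothetical name, it takes an already computed distance rather than computing one, and treating two empty strings as fully similar is an assumption made here to avoid dividing by zero.

```csharp
using System;

public static class SimilarityDemo
{
    // similarity = 1 - distance / max(length1, length2)
    public static decimal Similarity(int distance, int length1, int length2)
    {
        int longest = Math.Max(length1, length2);

        // Assumption: two empty strings count as identical (similarity 1).
        if (longest == 0) return 1m;

        // Cast to decimal first so the division is not integer division.
        return 1m - (decimal)distance / longest;
    }
}
```

With the values from the examples, `Similarity(0, 4, 4)` gives 1 and `Similarity(1, 5, 5)` gives 0.8.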

Applications

DNA analysis

Spell check

Speech recognition

Plagiarism detection

Small-scale approximate string search, such as showing a list of similar results for keywords typed into a search engine

Algorithm process

1. If the length of str1 or str2 is 0, return the length of the other string: if (str1.Length == 0) return str2.Length; if (str2.Length == 0) return str1.Length;
2. Initialize an (n + 1) * (m + 1) matrix d, where n and m are the lengths of str1 and str2, and fill its first row and first column with values increasing from 0.
3. Scan the two strings (n * m comparisons), with i running from 1 to n and j from 1 to m. If str1[i - 1] == str2[j - 1], set temp to 0; otherwise set temp to 1. Then assign d[i, j] the minimum of three values: d[i - 1, j] + 1, d[i, j - 1] + 1 and d[i - 1, j - 1] + temp.
4. After the scan, the last value of the matrix, d[n, m], is the distance between the two strings.
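
Each cell in the process above depends only on the current row and the row before it, so the same recurrence can be evaluated with two rows instead of a full matrix. The sketch below is a space-saving variant, not the article's own class; `TwoRowLevenshtein` and `Distance` are hypothetical names.

```csharp
using System;

public static class TwoRowLevenshtein
{
    public static int Distance(string str1, string str2)
    {
        int n = str1.Length;
        int m = str2.Length;
        if (n == 0) return m;
        if (m == 0) return n;

        // previous holds row i - 1 of the matrix, current holds row i.
        int[] previous = new int[m + 1];
        int[] current = new int[m + 1];
        for (int j = 0; j <= m; j++) previous[j] = j;  // first row: 0..m

        for (int i = 1; i <= n; i++)
        {
            current[0] = i;  // first column: 0..n
            for (int j = 1; j <= m; j++)
            {
                int temp = str1[i - 1] == str2[j - 1] ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + temp;
                current[j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }

            // The row just computed becomes the previous row for the next pass.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        // After the final swap, previous holds the last row of the matrix.
        return previous[m];
    }
}
```

For instance, `TwoRowLevenshtein.Distance("kitten", "sitting")` returns 3 and `TwoRowLevenshtein.Distance("ivan1", "ivan2")` returns 1. The trade-off is that the full matrix is no longer available for inspection.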

Similarity formula: similarity = 1 - distance / (length of the longer of the two strings).

To make this easier to follow, I write the two strings along the rows and columns of the matrix; this is not needed in the actual computation. Let's take the strings "ivan1" and "ivan2" as an example and look at the values in the matrix:

1. The values of the first row and the first column increase from 0.

2. Every other cell is the minimum of matrix[i - 1, j] + 1, matrix[i, j - 1] + 1 and matrix[i - 1, j - 1] + temp.

3. The values in the column for the character v are generated.

And so on, until the whole matrix has been filled in.

The distance between the two strings is the last value of the matrix, which is 1.

Similarity: 1 - 1 / Math.Max("ivan1".Length, "ivan2".Length) = 0.8
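
To see the matrix for "ivan1" and "ivan2" in full, it can be built and rendered as text. This is a minimal sketch: `MatrixDemo`, `BuildMatrix` and `Format` are hypothetical names, and the sketch assumes str1 is written along the rows and str2 along the columns.

```csharp
using System;
using System.Text;

public static class MatrixDemo
{
    // Builds the full (n + 1) x (m + 1) distance matrix.
    public static int[,] BuildMatrix(string str1, string str2)
    {
        int n = str1.Length;
        int m = str2.Length;
        int[,] d = new int[n + 1, m + 1];

        for (int i = 0; i <= n; i++) d[i, 0] = i;  // first column
        for (int j = 0; j <= m; j++) d[0, j] = j;  // first row

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int temp = str1[i - 1] == str2[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + temp);
            }
        }
        return d;
    }

    // Renders the matrix one row per line, values separated by spaces.
    public static string Format(int[,] d)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < d.GetLength(0); i++)
        {
            for (int j = 0; j < d.GetLength(1); j++)
            {
                if (j > 0) sb.Append(' ');
                sb.Append(d[i, j]);
            }
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

`MatrixDemo.Format(MatrixDemo.BuildMatrix("ivan1", "ivan2"))` produces:

```
0 1 2 3 4 5
1 0 1 2 3 4
2 1 0 1 2 3
3 2 1 0 1 2
4 3 2 1 0 1
5 4 3 2 1 1
```

The diagonal stays at 0 while the characters match, and only the final cell picks up the cost of replacing "1" with "2".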

Below is the core class library:

    using System;

    public class LevenshteinDistance
    {
        private static LevenshteinDistance _instance = null;

        public static LevenshteinDistance Instance
        {
            get
            {
                // Create the shared instance on first use and cache it.
                if (_instance == null)
                {
                    _instance = new LevenshteinDistance();
                }
                return _instance;
            }
        }

        /// <summary>
        /// Returns the smallest of three numbers.
        /// </summary>
        public int LowerOfThree(int first, int second, int third)
        {
            int min = Math.Min(first, second);
            return Math.Min(min, third);
        }

        /// <summary>
        /// Computes the Levenshtein distance between two strings.
        /// </summary>
        public int Levenshtein_Distance(string str1, string str2)
        {
            int[,] matrix;
            int n = str1.Length;
            int m = str2.Length;
            int temp = 0;
            char ch1;
            char ch2;
            int i = 0;
            int j = 0;

            if (n == 0)
            {
                return m;
            }
            if (m == 0)
            {
                return n;
            }

            matrix = new int[n + 1, m + 1];

            for (i = 0; i <= n; i++)
            {
                // Initialize the first column.
                matrix[i, 0] = i;
            }
            for (j = 0; j <= m; j++)
            {
                // Initialize the first row.
                matrix[0, j] = j;
            }

            for (i = 1; i <= n; i++)
            {
                ch1 = str1[i - 1];
                for (j = 1; j <= m; j++)
                {
                    ch2 = str2[j - 1];
                    if (ch1.Equals(ch2))
                    {
                        temp = 0;
                    }
                    else
                    {
                        temp = 1;
                    }
                    matrix[i, j] = LowerOfThree(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1, matrix[i - 1, j - 1] + temp);
                }
            }

            // Debug output: print the finished matrix row by row.
            for (i = 0; i <= n; i++)
            {
                for (j = 0; j <= m; j++)
                {
                    Console.Write(" {0} ", matrix[i, j]);
                }
                Console.WriteLine("");
            }

            return matrix[n, m];
        }

        /// <summary>
        /// Calculates string similarity.
        /// </summary>
        public decimal LevenshteinDistancePercent(string str1, string str2)
        {
            int maxLength = Math.Max(str1.Length, str2.Length);
            // Two empty strings are identical; this also avoids dividing by zero.
            if (maxLength == 0)
            {
                return 1;
            }
            int val = Levenshtein_Distance(str1, str2);
            return 1 - (decimal)val / maxLength;
        }
    }

The algorithm is actually quite simple.

It is called as follows:

    class Program
    {
        static void Main(string[] args)
        {
            string str1 = "ivan1";
            string str2 = "ivan2";
            Console.WriteLine("string 1 {0}", str1);
            Console.WriteLine("string 2 {0}", str2);
            Console.WriteLine("similarity {0}%", new LevenshteinDistance().LevenshteinDistancePercent(str1, str2) * 100);
            Console.ReadLine();
        }
    }

In the view of Chengdu SEO Xiaowu, thoroughly studying some of the algorithms commonly used by Baidu and other search engines is of great benefit for understanding how search engines operate and for working out sensible SEO keyword strategies.

A few words from Chengdu SEO Xiaowu: this article took real effort to write. If you repost this original article, please keep the link: C# String Similarity Algorithm from the SEO Integration Series. Thank you.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
