A detailed explanation of edit distance for text similarity calculation in Python

Source: Internet
Author: User
Edit Distance

Edit distance, also known as Levenshtein distance, is the minimum number of single-character edits required to transform one string into another. The permitted edits are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.

For example, converting the word kitten into sitting takes three edits, so the edit distance between 'kitten' and 'sitting' is 3:

sitten (k→s)

sittin (e→i)

sitting (→g)

The Levenshtein package for Python makes it easy to calculate the edit distance.

To install the package: pip install python-Levenshtein

Let's try it out:

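    # -*- coding: utf-8 -*-
    # Python 2: a plain '...' literal is a byte string (str)
    import Levenshtein
    # Two Chinese strings that differ in a single character (a space vs. '•')
    texta = '艾伦 图灵传'
    textb = '艾伦•图灵传'
    print Levenshtein.distance(texta, textb)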

The program above prints 3, even though only one character differs. Why does this happen?

The reason is that Python 2 treats these two literals as str, that is, as byte strings. With the file's UTF-8 encoding, a Chinese character is represented by three bytes, so the distance is computed over bytes rather than over characters. (In Python 3, str is already Unicode, so this problem does not arise.)

The workaround is to convert the strings to unicode, which returns the correct result, 1.

    # -*- coding: utf-8 -*-
    import Levenshtein
    # The u prefix makes each literal a unicode string, so characters are compared, not bytes
    texta = u'艾伦 图灵传'
    textb = u'艾伦•图灵传'
    print Levenshtein.distance(texta, textb)
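The byte-versus-character effect can also be reproduced in Python 3 by comparing the UTF-8 encoded bytes with the decoded text. The sketch below is only an illustration: edit_distance is a hypothetical pure-Python helper (not part of the Levenshtein package), and the two sample strings are assumed to differ in exactly one character, a space versus '•'.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (str or bytes)."""
    prev = list(range(len(b) + 1))            # distances from the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i]                            # deleting i items turns a[:i] into ""
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


text_a = "艾伦 图灵传"
text_b = "艾伦•图灵传"

# Compared as text, the strings differ by one character.
char_distance = edit_distance(text_a, text_b)

# Compared as UTF-8 bytes, the one-byte space faces the three-byte '•'.
byte_distance = edit_distance(text_a.encode("utf-8"), text_b.encode("utf-8"))

print(char_distance, byte_distance)
```

The character-level comparison counts one substitution, while the byte-level comparison needs one substitution plus two insertions.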

Next, let's look at what several of the package's functions do:

Levenshtein.distance(str1, str2)

Calculates the edit distance (also known as Levenshtein distance): the minimum number of operations needed to convert one string into the other, where the operations are insertion, deletion, and substitution. The algorithm is implemented with dynamic programming.
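The dynamic-programming idea can be sketched in pure Python as follows. This is a minimal illustration written for Python 3, not the package's own implementation; levenshtein_distance is a hypothetical helper name. Cell dp[i][j] holds the edit distance between the first i characters of s and the first j characters of t.

```python
def levenshtein_distance(s, t):
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string (or back) takes one edit per character.
    for i in range(rows):
        dp[i][0] = i
    for j in range(cols):
        dp[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # delete s[i-1]
                dp[i][j - 1] + 1,                      # insert t[j-1]
                dp[i - 1][j - 1] + substitution_cost,  # substitute or keep
            )
    return dp[-1][-1]


print(levenshtein_distance("kitten", "sitting"))
```

The table has (len(s) + 1) × (len(t) + 1) cells, so both time and memory grow with the product of the two lengths.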

Levenshtein.hamming(str1, str2)

Calculates the Hamming distance, which is the number of positions at which two equal-length strings have different characters. str1 and str2 must be the same length.
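As a rough sketch of that definition in Python 3 (hamming_distance is a hypothetical helper, and raising ValueError for unequal lengths is a choice made here for illustration):

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for a, b in zip(s, t) if a != b)


print(hamming_distance("karolin", "kathrin"))
```

Unlike edit distance, no insertions or deletions are considered, so the comparison is a single pass over the paired characters.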

Levenshtein.ratio(str1, str2)

Calculates the Levenshtein ratio. The formula is r = (sum - ldist) / sum, where sum is the total length of str1 and str2, and ldist is a modified edit distance. Note that this is not the plain edit distance: deletion and insertion still cost 1, but substitution costs 2.
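The formula can be sketched directly in Python 3. The helper names weighted_distance and similarity_ratio are hypothetical, and returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero.

```python
def weighted_distance(s, t):
    """Edit distance where insertion and deletion cost 1 and substitution costs 2."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        curr = [i]
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 2
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (cost 2) or match
        prev = curr
    return prev[-1]


def similarity_ratio(s, t):
    """r = (sum - ldist) / sum, with sum = len(s) + len(t)."""
    total = len(s) + len(t)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - weighted_distance(s, t)) / total


print(similarity_ratio("kitten", "sitting"))
```

For 'kitten' and 'sitting', sum is 13 and the weighted distance is 5, so r = 8/13.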

Levenshtein.jaro(s1, s2)

Calculates the Jaro distance. The Jaro distance is reportedly used to decide whether two names on health records refer to the same person, and has also been used in census work. Let's look at its definition first.

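The Jaro distance of two given strings s1 and s2 is:

$$d_j = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases}$$

where m is the number of matching characters between s1 and s2, and t is the number of transpositions.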

Two characters, one from s1 and one from s2, are considered matching only if they are the same and their positions differ by no more than

$$\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$$

These matching characters determine the number of transpositions t: count the matched characters that appear in a different order in the two strings, then halve that count. For example, every character of MARTHA matches a character of MARHTA, but T and H must be swapped to turn MARTHA into MARHTA, so T and H are the two matched characters in a different order and t = 2/2 = 1.

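The Jaro distance of MARTHA and MARHTA is therefore:

$$d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) \approx 0.944$$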
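The definition can be turned into a short Python 3 sketch. This follows the commonly stated form of the Jaro measure (matching window, then half the out-of-order matches) and is not taken from the package's source; jaro is a hypothetical helper name.

```python
def jaro(s1, s2):
    """Jaro measure: 1.0 for identical strings, 0.0 when nothing matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Walk both strings' matched characters in order and count disagreements.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order / 2  # each transposition involves two characters

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 3))
```

For MARTHA and MARHTA the sketch finds m = 6 and t = 1, which gives 17/18.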


Levenshtein.jaro_winkler(s1, s2)

Calculates the Jaro-Winkler distance. Jaro-Winkler gives a higher score to strings that share the same beginning. It defines a prefix scaling factor p; given two strings whose common prefix has length ℓ, the Jaro-Winkler distance is:

$$d_w = d_j + \ell \, p \, (1 - d_j)$$

d_j is the Jaro distance of the two strings

ℓ is the length of the common prefix, capped at a maximum of 4

p is a constant that scales the adjustment; it must not exceed 0.25, otherwise d_w could exceed 1. Winkler set this constant to 0.1

Thus, the Jaro-Winkler distance of MARTHA and MARHTA mentioned above is:

d_w = 0.944 + (3 * 0.1 * (1 − 0.944)) = 0.961
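The prefix adjustment is small enough to check by hand in Python 3. In the sketch below, winkler_adjust and common_prefix_length are hypothetical helpers, and the Jaro value is passed in as an argument rather than recomputed, so that only the Winkler step is shown.

```python
def common_prefix_length(s1, s2, max_prefix=4):
    """Length of the shared prefix, capped at max_prefix."""
    length = 0
    for a, b in zip(s1, s2):
        if a != b or length == max_prefix:
            break
        length += 1
    return length


def winkler_adjust(jaro_value, s1, s2, p=0.1, max_prefix=4):
    """Apply d_w = d_j + l * p * (1 - d_j) to an already computed Jaro value."""
    if p < 0 or p * max_prefix > 1:
        # With a prefix cap of 4, this is the p <= 0.25 rule that keeps d_w <= 1.
        raise ValueError("p is too large for the chosen prefix cap")
    prefix = common_prefix_length(s1, s2, max_prefix)
    return jaro_value + prefix * p * (1 - jaro_value)


jaro_value = 17 / 18  # Jaro value of MARTHA and MARHTA, about 0.944
d_w = winkler_adjust(jaro_value, "MARTHA", "MARHTA")
print(round(d_w, 3))
```

The shared prefix here is MAR, so ℓ = 3 and the adjustment adds 3 × 0.1 × (1 − 17/18) to the Jaro value.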

In my view, the approach could be improved in a couple of ways:

Remove stop words (mainly to reduce the effect of punctuation marks)

For Chinese text, would comparing word by word work better than comparing character by character?
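Both ideas can be tried with a small Python 3 sketch. Everything here is an assumption for illustration: strip_punctuation and edit_distance are hypothetical helpers, the sample sentences are made up, and words are obtained by splitting on whitespace, which works for English but not for Chinese, where a word segmenter would be needed.

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category is a punctuation class (P*)."""
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


def edit_distance(a, b):
    """Levenshtein distance over any two sequences (characters or word lists)."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


sentence_a = "the quick brown fox"
sentence_b = "the quick, brown dog!"

raw_distance = edit_distance(sentence_a, sentence_b)

cleaned_b = strip_punctuation(sentence_b)
char_distance = edit_distance(sentence_a, cleaned_b)

# Passing lists of words makes each whole word count as one unit.
word_distance = edit_distance(sentence_a.split(), cleaned_b.split())

print(raw_distance, char_distance, word_distance)
```

Removing punctuation lowers the character-level distance, and comparing word lists treats the change from fox to dog as a single edit.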

Summary

That is all for this article. I hope it helps you in learning or using Python. If you have questions, feel free to leave a message.
