Computing Text Similarity in Python with Edit Distance

Source: Internet
Author: User

Edit Distance

The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to transform one string into another. The permitted operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.

For example, turning the word kitten into sitting takes three edits, so the edit distance between 'kitten' and 'sitting' is 3:

sitten (k → s)

sittin (e → i)

sitting (insert g at the end)

The Levenshtein package for Python makes it easy to compute the edit distance.

Install the package with: pip install python-Levenshtein

Let's use it as follows:

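# -*- coding: utf-8 -*-
# Python 2: plain string literals are byte strings (UTF-8 encoded here).
import Levenshtein
# Illustrative Chinese strings that differ in one character (space vs. bullet).
texta = '艾伦 图灵传'
textb = '艾伦•图灵传'
print(Levenshtein.distance(texta, textb))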

The program above prints 3 even though only one character differs. Why?

The reason is that Python 2 treats these two literals as plain str objects, that is, byte strings, and under the default UTF-8 encoding a Chinese character (like many other non-ASCII characters) occupies three bytes. The distance is therefore counted in bytes rather than in characters.

The workaround is to make both strings unicode objects, which gives the correct result, 1. (In Python 3, str is already Unicode, so no prefix is needed.)

# -*- coding: utf-8 -*-
import Levenshtein
# The u prefix makes these unicode strings, compared character by character.
texta = u'艾伦 图灵传'
textb = u'艾伦•图灵传'
print(Levenshtein.distance(texta, textb))
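
The byte-versus-character effect can also be seen without the Levenshtein package. The following is a minimal Python 3 sketch using only built-ins; the two strings are illustrative assumptions chosen so that they differ in exactly one character.

```python
# Illustrative strings: identical except for the separator (space vs. bullet).
text_a = "艾伦 图灵传"
text_b = "艾伦•图灵传"

# As Unicode strings, both have the same number of characters.
char_len_a = len(text_a)  # 6
char_len_b = len(text_b)  # 6

# Encoded as UTF-8, the space is 1 byte but the bullet (U+2022) is 3 bytes.
bytes_a = text_a.encode("utf-8")
bytes_b = text_b.encode("utf-8")

# Count positions where the characters differ (valid because lengths match).
differing_chars = sum(a != b for a, b in zip(text_a, text_b))

print(char_len_a, char_len_b)        # 6 6
print(len(bytes_a), len(bytes_b))    # 16 18
print(differing_chars)               # 1
```

Comparing the encoded bytes means one substitution plus two insertions, which is where a distance of 3 comes from.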

Next, let's look at what several of the package's functions do:

Levenshtein.distance(str1, str2)

Computes the edit distance (also called the Levenshtein distance): the minimum number of insertions, deletions, and replacements needed to turn one string into the other. It is implemented with dynamic programming.
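
To make the dynamic programming idea concrete, here is a minimal pure-Python sketch. It is not the package's own implementation (which is written in C); the function name levenshtein_distance is made up for this illustration.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.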

Levenshtein.hamming(str1, str2)

Computes the Hamming distance: the number of positions at which two equal-length strings have different characters. str1 and str2 must have the same length.
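
The definition translates almost directly into code. This is a small sketch with a hypothetical helper name, hamming_distance, not the package's implementation:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("karolin", "kathrin"))  # 3
```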

Levenshtein.ratio(str1, str2)

Computes the Levenshtein ratio, r = (sum - ldist) / sum, where sum is the total length of str1 and str2 and ldist is a modified edit distance. Note that this is not the plain edit distance: deletions and insertions still cost 1, but a replacement costs 2.
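
The formula can be sketched in pure Python as follows. This follows the description above (replacement cost 2) and is an illustration only; the helper names are hypothetical, and the result is not guaranteed to match every version of the package digit for digit.

```python
def weighted_edit_distance(a, b):
    """Edit distance where insert/delete cost 1 and a replacement costs 2."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 2
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # replace (cost 2) or keep (cost 0)
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a, b):
    """r = (sum - ldist) / sum, with sum = len(a) + len(b)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - weighted_edit_distance(a, b)) / total


# kitten -> sitting: two replacements (2 + 2) and one insertion (1) give ldist = 5.
print(round(similarity_ratio("kitten", "sitting"), 4))  # 0.6154
```

Because a replacement costs as much as a deletion plus an insertion, the ratio always stays between 0 and 1.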

Levenshtein.jaro(s1, s2)

Computes the Jaro distance. The Jaro distance is said to have been used to decide whether two names on health records refer to the same person, and for census work. Despite the name, a larger value means the strings are more similar. Let's look at its definition first.

The Jaro distance dj of two given strings s1 and s2 is:

dj = 0 if m = 0, otherwise dj = (1/3) * (m/|s1| + m/|s2| + (m - t)/m)

where m is the number of matching characters between s1 and s2, and t is the number of transpositions.

Two characters, one from s1 and one from s2, are considered matching if they are equal and their positions are no more than

floor(max(|s1|, |s2|) / 2) - 1

apart. The matching characters determine the number of transpositions t: t is half the number of matching characters that appear in a different order in the two strings. For example, all the characters of MARTHA and MARHTA match, but T and H appear in swapped order, so two matching characters are out of sequence and t = 2/2 = 1.

The Jaro distance of these two strings is therefore:

dj = (1/3) * (6/6 + 6/6 + (6 - 1)/6) = 0.944
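
The Jaro definition (matching window, match count m, transposition count t) can be sketched in pure Python as below. This is a minimal illustration following the standard definition, not the package's own code, and the function name jaro is simply reused for readability.

```python
def jaro(s1, s2):
    """Jaro similarity of two strings, following the standard definition."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t = half the number of matched characters that are out of order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 3))  # 0.944
```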


Levenshtein.jaro_winkler(s1, s2)

Computes the Jaro-Winkler distance. Jaro-Winkler gives a higher score to strings that share the same beginning. It defines a prefix scaling factor p; given two strings whose common prefix has length ℓ, the Jaro-Winkler distance is:

dw = dj + ℓ * p * (1 - dj)

where:

dj is the Jaro distance of the two strings.

ℓ is the length of the common prefix, up to a maximum of 4.

p is a constant that scales the prefix bonus. It must not exceed 0.25, otherwise dw could become larger than 1. Winkler's standard value for this constant is 0.1.

Thus the Jaro-Winkler distance of MARTHA and MARHTA from the example above is:

dw = 0.944 + (3 * 0.1 * (1 - 0.944)) = 0.961
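
The prefix adjustment is small enough to sketch on its own. The function below takes an already computed Jaro value as input so that the example stays short; the names common_prefix_length and jaro_winkler_from_jaro are hypothetical and not part of the package.

```python
def common_prefix_length(s1, s2, limit=4):
    """Length of the shared prefix of s1 and s2, capped at `limit`."""
    n = 0
    for a, b in zip(s1, s2):
        if a != b or n == limit:
            break
        n += 1
    return n


def jaro_winkler_from_jaro(dj, s1, s2, p=0.1):
    """dw = dj + l * p * (1 - dj), where l is the capped common-prefix length."""
    if not 0 <= p <= 0.25:
        raise ValueError("p must be between 0 and 0.25 so that dw stays <= 1")
    prefix = common_prefix_length(s1, s2)
    return dj + prefix * p * (1 - dj)


# Jaro value for MARTHA / MARHTA is 17/18 (about 0.944); the common prefix is "MAR".
dw = jaro_winkler_from_jaro(17 / 18, "MARTHA", "MARHTA")
print(round(dw, 3))  # 0.961
```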

In my view, the approach could be improved in a few ways:

Remove stop words before comparing (mainly to reduce the effect of punctuation).

For Chinese text, would comparing word by word (after word segmentation) work better than comparing character by character?
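
As a rough illustration of the first suggestion, punctuation and whitespace could be stripped before computing any distance. This is only a sketch using the standard unicodedata module; the helper name is hypothetical, and real stop-word removal would also need a word list, which is not covered here.

```python
import unicodedata


def strip_punctuation_and_spaces(text):
    """Drop whitespace and every character in a Unicode punctuation category."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


# Illustrative strings that differ only in the separator character.
cleaned_a = strip_punctuation_and_spaces("艾伦 图灵传")
cleaned_b = strip_punctuation_and_spaces("艾伦•图灵传")
print(cleaned_a == cleaned_b)  # True
```

After this normalization the two strings are identical, so any of the distances above would report no difference between them.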

Summary

That is all for this article. I hope it helps you learn or use Python. If you have questions, feel free to leave a message.

Other references:

https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

http://www.coli.uni-saarland.de/courses/LT1/2011/slides/Python-Levenshtein.html#Levenshtein-inverse
