Edit Distance
The edit distance, also known as the Levenshtein distance, is the minimum number of single-character edits required to transform one string into another. The allowed edit operations are replacing one character with another, inserting a character, and deleting a character. Generally speaking, the smaller the edit distance, the more similar the two strings are.
For example, turning the word kitten into sitting takes three edits, so the edit distance between 'kitten' and 'sitting' is 3:
sitten (k→s)
sittin (e→i)
sitting (→g)
The Levenshtein package in Python makes it easy to compute the edit distance.
Install the package: pip install python-Levenshtein
Let's try it on the following:
# -*- coding: utf-8 -*-
import Levenshtein

# Python 2 byte strings (type str); the two texts differ by one Chinese character
texta = '艾伦 图灵'
textb = '艾伦 通灵'
print Levenshtein.distance(texta, textb)
The program above prints 3, even though only one character was changed. Why is this happening?
The reason is that Python 2 treats these two literals as byte strings (type str), and in a byte string each Chinese character is represented by three bytes under the default UTF-8 encoding. The distance is therefore counted in bytes, not characters.
The workaround is to convert the strings to Unicode, which returns the correct result, 1. (In Python 3, str is already Unicode, so the problem does not arise.)
# -*- coding: utf-8 -*-
import Levenshtein

# The u prefix makes these Unicode strings, so the distance is counted in characters
texta = u'艾伦 图灵'
textb = u'艾伦 通灵'
print Levenshtein.distance(texta, textb)
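To see where the 3 comes from without installing anything, the byte view and the character view of the same text can be compared directly. This is only a small sketch: the two strings are assumed examples that differ by one Chinese character, and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Assumed example strings: they differ by exactly one Chinese character.
text_a = u'艾伦 图灵'
text_b = u'艾伦 通灵'

# Encode to UTF-8 to get the byte representation a Python 2 str would hold.
bytes_a = text_a.encode('utf-8')
bytes_b = text_b.encode('utf-8')

# Character view: count the positions where the characters differ.
char_diffs = sum(1 for a, b in zip(text_a, text_b) if a != b)

# Byte view: count the positions where the bytes differ.
byte_diffs = sum(1 for a, b in zip(bytearray(bytes_a), bytearray(bytes_b)) if a != b)

print(len(text_a), len(bytes_a))  # number of characters vs. number of bytes
print(char_diffs, byte_diffs)     # differing characters vs. differing bytes
```

Each of the four Chinese characters takes three bytes and the space takes one, so the single changed character shows up as three changed bytes.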
Next, let's look at what several of the package's functions do:
Levenshtein.distance(str1, str2)
Computes the edit distance (also called the Levenshtein distance), that is, the number of operations needed to convert one string into the other, where the operations are insertion, deletion, and replacement. Algorithm: dynamic programming.
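The dynamic programming idea can be sketched in pure Python as follows. This is an illustrative implementation, not the package's actual code; the function name edit_distance is made up here.

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    # previous[j] is the distance between the processed prefix of s1
    # and the first j characters of s2. Initially the prefix is empty,
    # so reaching s2[:j] takes j insertions.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            substitution = previous[j - 1] + (0 if c1 == c2 else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3, as in the example at the top
```

Keeping only two rows is a common space optimization; a full table gives the same result and is easier to inspect when debugging.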
Levenshtein.hamming(str1, str2)
Computes the Hamming distance. str1 and str2 must be of the same length. The Hamming distance is the number of positions at which two equal-length strings have different characters.
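The definition translates almost directly into code. The sketch below is a stand-in for illustration, not the package's implementation; raising ValueError for unequal lengths is a choice made in this sketch.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # Design choice of this sketch: reject unequal lengths explicitly.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for c1, c2 in zip(s1, s2) if c1 != c2)


print(hamming("karolin", "kathrin"))  # 3: positions 2, 3 and 4 differ
```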
Levenshtein.ratio(str1, str2)
Computes the Levenshtein ratio, using the formula r = (sum − ldist) / sum
where sum is the total length of str1 and str2, and ldist is an edit-distance-like value. Note that ldist is not the plain edit distance: a deletion or an insertion still costs 1, but a replacement costs 2.
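The formula can be checked by hand with a small sketch. It follows the description above (replacement costs 2) and is not the package's own code; the names weighted_distance and ratio, and returning 1.0 for two empty strings, are assumptions of this sketch.

```python
def weighted_distance(s1, s2):
    """Edit-distance-like value: insert/delete cost 1, replacement costs 2."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            substitution = previous[j - 1] + (0 if c1 == c2 else 2)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def ratio(s1, s2):
    """r = (sum - ldist) / sum, where sum = len(s1) + len(s2)."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumption of this sketch: two empty strings are identical
    return (total - weighted_distance(s1, s2)) / float(total)


# kitten/sitting: sum = 13, ldist = 5 (two replacements + one insertion),
# so r = (13 - 5) / 13
print(ratio("kitten", "sitting"))
```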
Levenshtein.jaro(s1, s2)
Computes the Jaro distance. The Jaro distance is reportedly used to decide whether two names in health records refer to the same person, and in census work. Let's look at its definition first.
The Jaro distance dj of two given strings s1 and s2 is:
dj = 0 if m = 0, otherwise dj = (1/3) * (m/|s1| + m/|s2| + (m − t)/m)
where m is the number of matching characters between s1 and s2, and t is the number of transpositions. The result lies between 0 and 1, and 1 means the strings are identical.
Two characters, one from s1 and one from s2, are considered matching only if they are the same and their positions differ by no more than
floor(max(|s1|, |s2|) / 2) − 1
These matching characters determine the number of transpositions t: in short, t is half the number of matching characters that appear in a different order. For example, all the characters of Martha and Marhta match, but T and H appear in a different order, so t = 2/2 = 1.
The Jaro distance of these two strings is:
dj = (1/3) * (6/6 + 6/6 + (6 − 1)/6) = 0.944
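The definition above can be sketched in pure Python. This follows the standard textbook definition of the Jaro measure (match window of floor(max(|s1|, |s2|) / 2) − 1, t as half the out-of-order matches) and is an illustration, not the package's implementation.

```python
def jaro(s1, s2):
    """Jaro measure of two strings (1.0 means identical, 0.0 means no matches)."""
    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matched characters of each string, in their original order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    # t is half the number of matched characters that are out of order.
    t = sum(1 for a, b in zip(seq1, seq2) if a != b) // 2
    return (m / float(len(s1)) + m / float(len(s2)) + (m - t) / float(m)) / 3.0


# Martha/Marhta: m = 6, t = 1, so dj = (1 + 1 + 5/6) / 3
print(round(jaro("MARTHA", "MARHTA"), 3))
```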
Levenshtein.jaro_winkler(s1, s2)
Computes the Jaro-Winkler distance. Jaro-Winkler gives a higher score to strings that share the same beginning. It defines a prefix scaling constant p; given two strings whose common prefix has length l, the Jaro-Winkler distance is:
dw = dj + l * p * (1 − dj)
dj is the Jaro distance of the two strings.
l is the length of the common prefix, capped at 4.
p is the constant that adjusts the score. It must not exceed 0.25, otherwise dw could be greater than 1. Winkler set this constant to 0.1.
Thus, the Jaro-Winkler distance of Martha and Marhta mentioned above is:
dw = 0.944 + (3 * 0.1 * (1 − 0.944)) = 0.961
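The prefix adjustment is simple enough to sketch on its own. To keep the example short it takes an already computed Jaro value dj as input; the function names, the cap parameter, and the ValueError for p above 0.25 are choices of this sketch rather than the package's API.

```python
def common_prefix_length(s1, s2, cap=4):
    """Length of the shared prefix of s1 and s2, counted up to `cap` characters."""
    length = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or length == cap:
            break
        length += 1
    return length


def jaro_winkler_from_jaro(dj, s1, s2, p=0.1):
    """dw = dj + l * p * (1 - dj), with l the common prefix length (max 4)."""
    if p > 0.25:
        # With l up to 4, a larger p could push dw above 1.
        raise ValueError("p must not exceed 0.25")
    l = common_prefix_length(s1, s2)
    return dj + l * p * (1.0 - dj)


# Martha/Marhta: dj = 17/18 (about 0.944) and the common prefix "MAR" gives l = 3
print(round(jaro_winkler_from_jaro(17 / 18.0, "MARTHA", "MARHTA"), 3))
```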
Personally, I think the algorithm could be improved in a few ways:
Remove stop words (mainly to reduce the effect of punctuation)
For Chinese text, would comparing word by word give better results than comparing character by character?
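As a rough idea of the first point, punctuation could be stripped before the strings are compared. The sketch below uses the standard unicodedata module and treats every character in a Unicode punctuation category as removable; the function name strip_punctuation is made up, and whether this actually improves the results is untested here.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category is a punctuation class (P*)."""
    return u''.join(ch for ch in text
                    if not unicodedata.category(ch).startswith('P'))


print(strip_punctuation(u'Hello, world!'))
print(strip_punctuation(u'你好，世界。'))
```

The cleaned strings can then be passed to any of the distance functions above in place of the raw text.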
Summary
That is all for this article. I hope it helps you learn or use Python; if you have questions, feel free to leave a message.
Other references:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
http://www.coli.uni-saarland.de/courses/LT1/2011/slides/Python-Levenshtein.html#Levenshtein-inverse