Edit Distance algorithm (Levenshtein)

Edit distance definition:

The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to transform one string into another.

The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.

For example, to turn eeba into abac:

    1. eba (delete the first e)
    2. aba (replace the remaining e with a)
    3. abac (insert c at the end)

So the edit distance between eeba and abac is 3.
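The three steps above can be replayed with plain string slicing. This is only an illustrative sketch; the variable names are not from the original post.

```python
# Replay the three edits from the example with plain string operations.
word = "eeba"

# 1. Delete the first "e".
step1 = word[1:]
# 2. Replace the remaining "e" (now at index 0) with "a".
step2 = "a" + step1[1:]
# 3. Insert "c" at the end.
step3 = step2 + "c"

print(step1, step2, step3)
```

Three operations are used here; showing that no shorter sequence exists is what the dynamic programming algorithm is for.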

Russian scientist Vladimir Levenshtein introduced the concept in 1965.


The algorithm is a simple linear dynamic program (the longest increasing subsequence problem is another example of linear dynamic programming).

Suppose we want to turn s1 into s2.

Define the state matrix edit[len1][len2], where len1 and len2 are the lengths of the strings s1 and s2 plus 1 (the +1 covers the base case of the dynamic program, in which one of the strings is empty).

Then edit[i][j] is defined as the edit distance between the string made of the first i characters of s1 and the string made of the first j characters of s2.
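A minimal sketch of that state matrix, assuming a nested Python list is used for storage (the helper name make_table is hypothetical). Only the cells for an empty prefix are filled in, since their values follow directly from the definition.

```python
def make_table(s1, s2):
    """Build the (len(s1)+1) x (len(s2)+1) table with only the empty-prefix cells set."""
    rows, cols = len(s1) + 1, len(s2) + 1
    edit = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        edit[i][0] = i  # first i characters of s1 -> empty string: i deletions
    for j in range(cols):
        edit[0][j] = j  # empty string -> first j characters of s2: j insertions
    return edit


table = make_table("eeba", "abac")
print(len(table), len(table[0]))
```

Row 0 and column 0 stand for the empty prefix, which is why each dimension is one larger than the string length.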

The idea is to let i and j increase from 0. Each time j is incremented, the edit distance between the first i characters of s1 and the first j-1 characters of s2 has already been computed, so it is enough to consider the newly added j-th character. There are three cases:

Insert operation: insert a character ch after the first i characters of s1, so that ch equals the newly added s2[j]. The edit distance when inserting ch is therefore edit[i][j-1]+1.

Delete operation: delete s1[i], in the hope that s1[i-1] can match s2[j] (if the characters before s1[i-1] match the characters before s2[j] well, this gives a better result). The case in which the characters up to s1[i-1] are matched against those up to s2[j] is already covered by edit[i-1][j]. The edit distance when deleting the character is therefore edit[i-1][j]+1.

Replace operation: either s1[i] already matches s2[j], or s1[i] is replaced with s2[j]. The edit distance for the replace operation is edit[i-1][j-1]+f(i,j), where f(i,j) is 0 when s1[i]==s2[j] and 1 otherwise.
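The three cases can be combined into one step that computes a single cell from its already-computed neighbours. The sketch below is illustrative (the helper name cell_value is hypothetical). Note that the text counts characters from 1 while Python indexes from 0, so the i-th character of s1 is s1[i - 1].

```python
def cell_value(edit, s1, s2, i, j):
    """Return edit[i][j] from its three already-computed neighbours (i, j >= 1)."""
    insertion = edit[i][j - 1] + 1
    deletion = edit[i - 1][j] + 1
    # f is 0 when the i-th character of s1 equals the j-th character of s2.
    f = 0 if s1[i - 1] == s2[j - 1] else 1
    substitution = edit[i - 1][j - 1] + f
    return min(insertion, deletion, substitution)


# Tiny table for s1 = "e", s2 = "a": the border cells are known in advance.
edit = [[0, 1],
        [1, None]]
edit[1][1] = cell_value(edit, "e", "a", 1, 1)
print(edit[1][1])
```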

So the dynamic programming formula is as follows:

    • If i == 0 and j == 0, edit(i, j) = 0
    • If i == 0 and j > 0, edit(i, j) = j
    • If i > 0 and j == 0, edit(i, j) = i
    • If i ≥ 1 and j ≥ 1, edit(i, j) = min{edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j)}, where f(i, j) = 1 when the i-th character of the first string differs from the j-th character of the second string, and f(i, j) = 0 otherwise.
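The recurrence can be transcribed almost literally into a memoised recursive function. This is a sketch for checking the formula rather than a tuned implementation; the function name edit_distance is an assumption, and Python's recursion limit makes it unsuitable for long strings.

```python
from functools import lru_cache


def edit_distance(s1, s2):
    """Edit distance computed directly from the recurrence, with memoisation."""

    @lru_cache(maxsize=None)
    def edit(i, j):
        if i == 0:
            return j  # covers i == 0 and j == 0 as well
        if j == 0:
            return i
        f = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(edit(i - 1, j) + 1,
                   edit(i, j - 1) + 1,
                   edit(i - 1, j - 1) + f)

    return edit(len(s1), len(s2))


print(edit_distance("eeba", "abac"))  # 3, matching the worked example
```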

Python implementations:

Extension package:

PyPI (the Python Package Index) hosts an extension package called python-Levenshtein, which computes not only the edit distance but also the Hamming distance and the Jaro-Winkler distance. The link is as follows:


Download python-Levenshtein-0.10.2.tar.gz, unpack it, cd into the unpacked folder, and run:

    python setup.py build

    python setup.py install


Note: if you do not have setuptools installed, install it first. The link is as follows:


Download setuptools 0.6c11, unpack it, cd into the corresponding directory, and run:

    python setup.py build

    python setup.py install


To check the installation, start Python and run from Levenshtein import *; if no error is raised, the installation succeeded.

For detailed usage, see the following blog post (a link to the complete documentation appears at the end of that post):


Note: after importing with from Levenshtein import *, you do not need the Levenshtein prefix when calling its functions.

For example, call distance(str1, str2) directly to compute the edit distance.
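As a hedged sketch of calling the package: this assumes python-Levenshtein is installed and exposes distance, hamming and jaro_winkler at module level. The import is guarded so the snippet does nothing if the package is absent. It uses a plain import Levenshtein, so the functions keep the module prefix, unlike the star import described above.

```python
try:
    import Levenshtein  # provided by the python-Levenshtein package
except ImportError:
    Levenshtein = None  # package not installed; the calls below are skipped

if Levenshtein is not None:
    print(Levenshtein.distance("eeba", "abac"))      # edit distance
    print(Levenshtein.hamming("eeba", "abac"))       # requires equal-length strings
    print(Levenshtein.jaro_winkler("eeba", "abac"))  # a similarity score, not a distance
```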

Simple implementation code:

If you want a more lightweight implementation, you can use the following code:

(taken from Bensuta's blog, http://biansutao.iteye.com/blog/326008)

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-


    class Arithmetic:
        """Edit distance algorithm (Levenshtein distance), a string similarity algorithm."""

        def levenshtein(self, first, second):
            # Make `first` the shorter string; the distance is symmetric.
            if len(first) > len(second):
                first, second = second, first
            if len(first) == 0:
                return len(second)
            if len(second) == 0:
                return len(first)
            first_length = len(first) + 1
            second_length = len(second) + 1
            # Every row starts as [0, 1, 2, ...]; row 0 is the base case for
            # an empty prefix of `first`.
            distance_matrix = [list(range(second_length)) for _ in range(first_length)]
            for i in range(1, first_length):
                # Base case for an empty prefix of `second`: i deletions.
                distance_matrix[i][0] = i
                for j in range(1, second_length):
                    deletion = distance_matrix[i - 1][j] + 1
                    insertion = distance_matrix[i][j - 1] + 1
                    substitution = distance_matrix[i - 1][j - 1]
                    if first[i - 1] != second[j - 1]:
                        substitution += 1
                    distance_matrix[i][j] = min(insertion, deletion, substitution)
            # print(distance_matrix)  # uncomment to inspect the full table
            return distance_matrix[first_length - 1][second_length - 1]


    if __name__ == "__main__":
        arith = Arithmetic()
        print(arith.levenshtein('Gumbosdafsadfdsafsafsadfasfadsfasdfasdfs', 'Gambol00000000000dfasfasfdafsaf'))
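Since each row of the table depends only on the row before it, a lighter variant can keep just two rows in memory. The sketch below is an assumption-labelled alternative, not the blog author's code; the function name levenshtein_two_rows is hypothetical. Swapping the strings so that the rows run along the shorter one keeps the memory use proportional to the shorter length.

```python
def levenshtein_two_rows(first, second):
    """Edit distance keeping only two rows of the table in memory."""
    if len(first) > len(second):
        first, second = second, first  # rows run along the shorter string
    # Row for the empty prefix of `second`: i deletions for i characters of `first`.
    previous = list(range(len(first) + 1))
    for j, ch2 in enumerate(second, start=1):
        current = [j]  # empty prefix of `first` -> j characters of `second`
        for i, ch1 in enumerate(first, start=1):
            substitution = previous[i - 1] + (ch1 != ch2)
            current.append(min(previous[i] + 1,     # skip ch2
                               current[i - 1] + 1,  # skip ch1
                               substitution))
        previous = current
    return previous[-1]


print(levenshtein_two_rows("eeba", "abac"))  # 3, matching the worked example
```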

