Edit Distance algorithm (Levenshtein)

Source: Internet
Author: User

Edit distance definition:

The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to transform one string into another.

The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.

For example, to turn eeba into abac:

    1. eba (delete the first e)
    2. aba (replace the remaining e with a)
    3. abac (insert c at the end)

So the edit distance between eeba and abac is 3.
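The three steps of this example can be replayed with plain string slicing. This is only an illustrative sketch: it confirms that three edits are enough to reach the target, not that three is the minimum (the dynamic programming algorithm is what establishes the minimum). The variable names s and steps are chosen here for illustration.

```python
# Apply the three edits from the example one at a time.
s = "eeba"
steps = []

s = s[1:]          # delete the first "e"
steps.append(s)
s = "a" + s[1:]    # replace the remaining "e" with "a"
steps.append(s)
s = s + "c"        # insert "c" at the end
steps.append(s)

print(steps)  # ['eba', 'aba', 'abac']
```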

Russian scientist Vladimir Levenshtein introduced the concept in 1965.

Algorithm:

The algorithm is a simple linear dynamic program (the longest increasing subsequence problem is another example of linear dynamic programming).

Suppose we want to transform s1 into s2.

Define the state matrix edit[len1][len2], where len1 and len2 are the lengths of the strings s1 and s2 plus 1 (the +1 covers the base case of the recurrence, in which one of the strings is empty).

Then edit[i][j] is defined as the edit distance between the string formed by the first i characters of s1 and the string formed by the first j characters of s2.
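As a minimal sketch of this layout, the table can be allocated with one extra row and one extra column, and the entries where one prefix is empty can be filled in immediately, since turning i characters into nothing takes i deletions and building j characters from nothing takes j insertions. The function name empty_table is hypothetical, and None marks the cells the recurrence still has to compute.

```python
def empty_table(s1, s2):
    """Allocate the (len(s1)+1) x (len(s2)+1) table and fill only the base cases."""
    rows, cols = len(s1) + 1, len(s2) + 1
    edit = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        edit[i][0] = i   # first i characters of s1 versus the empty string
    for j in range(cols):
        edit[0][j] = j   # the empty string versus the first j characters of s2
    return edit

# Print the partially filled table for the example strings, one row per line.
for row in empty_table("eeba", "abac"):
    print(row)
```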

The idea is to let i and j each increase from 0. Each time j is incremented, the edit distance between the first i characters of s1 and the first j-1 characters of s2 has already been computed, so only the newly added j-th character needs to be considered. Here s1[i] denotes the i-th character of s1, counting from 1. There are three ways to handle the new character:

Insert operation: insert a character ch after the first i characters of s1 so that ch equals the newly added s2[j]. The edit distance when inserting ch is therefore edit[i][j-1]+1.

Delete operation: delete s1[i], in the hope that s1[i-1] can match s2[j] (if the characters leading up to s1[i-1] align well with the characters leading up to s2[j], this gives a better result). The cases where earlier characters of s1 match s2[j] are already accounted for in edit[i-1][j]. The edit distance when deleting s1[i] is therefore edit[i-1][j]+1.

Replace operation: either s1[i] already matches s2[j], or s1[i] is replaced with s2[j]. The edit distance for the replace operation is edit[i-1][j-1]+f(i,j), where f(i,j) is 0 when s1[i]==s2[j] and 1 otherwise.

So the dynamic programming formula is as follows:

    • If i == 0 and j == 0, edit(i, j) = 0
    • If i == 0 and j > 0, edit(i, j) = j
    • If i > 0 and j == 0, edit(i, j) = i
    • If i ≥ 1 and j ≥ 1, edit(i, j) = min{edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j)}, where f(i, j) = 1 when the i-th character of the first string differs from the j-th character of the second string, and f(i, j) = 0 otherwise.
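The four cases above can be transcribed almost literally into a memoized recursive function. The following is a sketch for illustration rather than a tuned implementation: the names edit_distance and edit are chosen here, functools.lru_cache is used as the memo table, and because it recurses it is only suitable for short strings (long inputs would hit Python's recursion limit).

```python
from functools import lru_cache


def edit_distance(s1, s2):
    """Evaluate the recurrence top-down; edit(i, j) compares s1[:i] with s2[:j]."""

    @lru_cache(maxsize=None)
    def edit(i, j):
        if i == 0:
            return j                      # also covers i == 0 and j == 0
        if j == 0:
            return i
        # f(i, j) is 0 when the i-th and j-th characters agree, else 1.
        # Python strings are 0-indexed, hence the -1.
        f = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(edit(i - 1, j) + 1,       # delete
                   edit(i, j - 1) + 1,       # insert
                   edit(i - 1, j - 1) + f)   # replace, or match at no cost

    return edit(len(s1), len(s2))


print(edit_distance("eeba", "abac"))  # 3
```

The memo table plays the same role as the edit matrix: each (i, j) pair is computed once, so the work is proportional to len(s1) * len(s2).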

Python implementations:

Extension package:

PyPI (the Python Package Index) hosts an extension package called python-Levenshtein, which computes not only the edit distance but also the Hamming distance and the Jaro-Winkler distance. The link is:

https://pypi.python.org/pypi/python-Levenshtein

Download python-Levenshtein-0.10.2.tar.gz, extract it, cd into the extracted folder, and run:

    python setup.py build
    python setup.py install

That is all.

Note: if setuptools is not installed, install it first. The link is:

https://pypi.python.org/pypi/setuptools/

Download setuptools 0.6c11, extract it, cd into the corresponding directory, and run:

    python setup.py build
    python setup.py install

To check that the installation succeeded, start Python and run from Levenshtein import *. If no error is raised, the installation worked.

For details on usage, see the following blog post (a link to the full documentation appears at the bottom of that post):

http://www.cnblogs.com/kaituorensheng/archive/2013/05/18/3085653.html

Note: after importing with from Levenshtein import *, the functions are called without the Levenshtein prefix.

For example, call distance(str1, str2) directly to compute the edit distance.
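For comparison, here is a small sketch using the plain import Levenshtein form, where the same function is reached through the module prefix as Levenshtein.distance. It assumes the python-Levenshtein package may or may not be installed, so the import is guarded; the wrapper name package_distance is hypothetical and exists only for this illustration.

```python
# The third-party python-Levenshtein package may be absent, so guard the import.
try:
    import Levenshtein
except ImportError:
    Levenshtein = None


def package_distance(str1, str2):
    """Edit distance via the package; raises if the package is not installed."""
    if Levenshtein is None:
        raise RuntimeError("python-Levenshtein is not installed")
    return Levenshtein.distance(str1, str2)


if Levenshtein is not None:
    print(package_distance("eeba", "abac"))
```

Using the module prefix avoids pulling every package function into the current namespace, which is the main drawback of the import * form.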

Simple implementation code:

If you want a more lightweight implementation, use the following code:

(taken from Bensuta's blog, http://biansutao.iteye.com/blog/326008)

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-


    class Arithmetic():
        def __init__(self):
            pass

        def levenshtein(self, first, second):
            """Edit distance algorithm (Levenshtein distance), a string similarity algorithm."""
            # Make sure first is the shorter string.
            if len(first) > len(second):
                first, second = second, first
            if len(first) == 0:
                return len(second)
            if len(second) == 0:
                return len(first)
            first_length = len(first) + 1
            second_length = len(second) + 1
            # Every row starts as [0, 1, ..., len(second)]; list() keeps it mutable in Python 3.
            distance_matrix = [list(range(second_length)) for x in range(first_length)]
            # Column 0 must hold i: turning i characters into the empty string takes i deletions.
            for i in range(first_length):
                distance_matrix[i][0] = i
            # print(distance_matrix)
            for i in range(1, first_length):
                for j in range(1, second_length):
                    deletion = distance_matrix[i-1][j] + 1
                    insertion = distance_matrix[i][j-1] + 1
                    substitution = distance_matrix[i-1][j-1]
                    if first[i-1] != second[j-1]:
                        substitution += 1
                    distance_matrix[i][j] = min(insertion, deletion, substitution)
            print(distance_matrix)
            return distance_matrix[first_length-1][second_length-1]


    if __name__ == "__main__":
        arith = Arithmetic()
        # The second string is truncated in the source and is closed at that point.
        print(arith.levenshtein('Gumbosdafsadfdsafsafsadfasfadsfasdfasdfs', 'Gambol00000000000dfasfasfdafsaf'))
