Edit distance definition:
The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to transform one string into another.
The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
For example, to turn eeba into abac:
- eba (delete the first e)
- aba (replace the remaining e with a)
- abac (insert c at the end)
So the edit distance between eeba and abac is 3.
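The three steps of the example can be replayed with ordinary string operations. This is only an illustrative sketch; the list name `steps` is made up here and is not part of any library.

```python
# Replay the example: each entry is the string after one more edit.
steps = ["eeba"]
steps.append(steps[-1][1:])        # delete the first 'e'           -> "eba"
steps.append("a" + steps[-1][1:])  # replace the remaining 'e' by 'a' -> "aba"
steps.append(steps[-1] + "c")      # insert 'c' at the end          -> "abac"

print(" -> ".join(steps))
print("number of edits:", len(steps) - 1)
```

Replaying the steps only shows that 3 edits are enough; that no shorter sequence exists is what the dynamic programming algorithm establishes.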
The Russian scientist Vladimir Levenshtein introduced the concept in 1965.
Algorithm:
The algorithm is a simple linear dynamic program (the longest increasing subsequence problem is another example of linear dynamic programming).
Suppose we want to turn s1 into s2.
Define the state matrix edit[len1][len2], where len1 and len2 are the lengths of the strings s1 and s2, each plus 1 (the +1 covers the base case of the recurrence, in which one string is empty).
Then edit[i][j] is defined as the edit distance between the string made of the first i characters of s1 and the string made of the first j characters of s2.
The idea is to let i and j increase from 0. Each time j is incremented, the edit distances between the first i characters of s1 and the first j-1 characters of s2 have already been computed, so only the newly added j-th character has to be considered. Three operations are possible:
Insert operation: insert a character ch after the first i characters of s1, such that ch equals the newly added s2[j] (here s1[i] and s2[j] denote the i-th and j-th characters, counting from 1). The edit distance with this insertion is edit[i][j-1] + 1.
Delete operation: delete s1[i], in the hope that the first i-1 characters of s1 already match the first j characters of s2 (if the characters before s1[i] line up well with s2[j] and the characters before it, the deletion gives a better result). edit[i-1][j] already accounts for how the first i-1 characters of s1 match the first j characters of s2. The edit distance with this deletion is edit[i-1][j] + 1.
Replace operation: either s1[i] already matches s2[j], or s1[i] is replaced with s2[j]. The edit distance for the replace operation is edit[i-1][j-1] + f(i, j), where f(i, j) is 0 when s1[i] == s2[j] and 1 otherwise.
So the dynamic programming formula is as follows:
- If i == 0 and j == 0, edit(i, j) = 0
- If i == 0 and j > 0, edit(i, j) = j
- If i > 0 and j == 0, edit(i, j) = i
- If i ≥ 1 and j ≥ 1, edit(i, j) = min{edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j)}, where f(i, j) = 1 when the i-th character of the first string differs from the j-th character of the second string, and f(i, j) = 0 otherwise.
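The four cases can be written down almost literally as a memoized recursion. The following is a minimal sketch, not a tuned implementation; the names `edit_distance` and `edit` are chosen here for illustration, and Python's 0-based indexing means the i-th character of s1 is `s1[i - 1]`.

```python
from functools import lru_cache


def edit_distance(s1, s2):
    """Edit distance between s1 and s2, following the recurrence case by case."""

    @lru_cache(maxsize=None)
    def edit(i, j):
        # edit(i, j): distance between the first i characters of s1
        # and the first j characters of s2.
        if i == 0:
            return j  # also covers i == 0 and j == 0
        if j == 0:
            return i
        # f(i, j): 0 if the i-th and j-th characters are equal, 1 otherwise.
        f = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(
            edit(i - 1, j) + 1,      # delete s1's i-th character
            edit(i, j - 1) + 1,      # insert s2's j-th character
            edit(i - 1, j - 1) + f,  # match or replace
        )

    return edit(len(s1), len(s2))


print(edit_distance("eeba", "abac"))  # 3
```

Because this version recurses once per character, very long strings can exceed Python's recursion limit; filling the table iteratively avoids that.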
Python implementations:
Extension package:
There is an extension package for Python on PyPI (the Python Package Index) called python-Levenshtein. Besides the edit distance, it also computes the Hamming distance and the Jaro-Winkler distance. The link is:
https://pypi.python.org/pypi/python-Levenshtein
Download python-Levenshtein-0.10.2.tar.gz, unpack it, cd into the unpacked folder, and run:
python setup.py build
python setup.py install
That is all.
Note: if setuptools is not installed, install it first. The link is:
https://pypi.python.org/pypi/setuptools/
Download setuptools 0.6c11, unpack it, cd into the corresponding directory, and run:
python setup.py build
python setup.py install
That is all.
To check that the installation succeeded, start Python and run from Levenshtein import *; if no error is raised, the installation worked.
For details on usage, see the following blog post (it links to the full documentation at the bottom):
http://www.cnblogs.com/kaituorensheng/archive/2013/05/18/3085653.html
Note: after importing with from Levenshtein import *, the functions are called without the Levenshtein prefix.
For example, call distance(str1, str2) directly to compute the edit distance.
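As a small usage sketch, the module can also be imported by name, which keeps the `Levenshtein.` prefix and avoids the wildcard import. The wrapper `package_distance` is a hypothetical helper written for this example; it assumes python-Levenshtein is installed and returns None when it is not.

```python
try:
    import Levenshtein  # provided by the python-Levenshtein package
except ImportError:
    Levenshtein = None  # package not installed


def package_distance(str1, str2):
    """Edit distance computed by the package, or None if it is unavailable."""
    if Levenshtein is None:
        return None
    return Levenshtein.distance(str1, str2)


print(package_distance("eeba", "abac"))
```

With the package installed, this should print the same value as the worked example (3).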
Simple implementation:
If you want a more lightweight implementation, use the following code
(taken from biansutao's blog, http://biansutao.iteye.com/blog/326008):
#!/usr/bin/env python
# -*- coding: utf-8 -*-


class Arithmetic:
    """Edit distance (Levenshtein distance), a string similarity algorithm."""

    def levenshtein(self, first, second):
        # The distance is symmetric, so make `first` the shorter string.
        if len(first) > len(second):
            first, second = second, first
        if len(first) == 0:
            return len(second)
        first_length = len(first) + 1
        second_length = len(second) + 1
        # Row 0 holds 0..len(second): turning "" into a prefix of `second`.
        distance_matrix = [list(range(second_length)) for _ in range(first_length)]
        # Column 0 must hold 0..len(first): turning a prefix of `first` into "".
        # (The listing as posted left this column at 0, which gives wrong results.)
        for i in range(first_length):
            distance_matrix[i][0] = i
        for i in range(1, first_length):
            for j in range(1, second_length):
                deletion = distance_matrix[i - 1][j] + 1
                insertion = distance_matrix[i][j - 1] + 1
                substitution = distance_matrix[i - 1][j - 1]
                if first[i - 1] != second[j - 1]:
                    substitution += 1
                distance_matrix[i][j] = min(insertion, deletion, substitution)
        print(distance_matrix)  # debug output: the full table
        return distance_matrix[first_length - 1][second_length - 1]


if __name__ == "__main__":
    arith = Arithmetic()
    # The second string is cut off in the source post and is used here as it appears.
    print(arith.levenshtein('Gumbosdafsadfdsafsafsadfasfadsfasdfasdfs',
                            'Gambol00000000000dfasfasfdafsaf'))
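The listing above keeps the whole table in memory. When only the final distance is needed, each row depends on the previous row alone, so two rows are enough. The following is a sketch of that variant under the same recurrence; the function name `levenshtein_two_rows` is made up for this example and does not come from the quoted blog.

```python
def levenshtein_two_rows(first, second):
    """Edit distance using two rows of the table instead of the full matrix."""
    # Keep the shorter string as the row index so the rows stay small.
    if len(first) > len(second):
        first, second = second, first
    # previous[i]: distance between first[:i] and the empty prefix of second.
    previous = list(range(len(first) + 1))
    for j, ch2 in enumerate(second, start=1):
        current = [j]  # distance between "" and second[:j]
        for i, ch1 in enumerate(first, start=1):
            cost = 0 if ch1 == ch2 else 1
            current.append(min(
                previous[i] + 1,         # skip ch2 (insert or delete)
                current[i - 1] + 1,      # skip ch1 (insert or delete)
                previous[i - 1] + cost,  # match or replace
            ))
        previous = current
    return previous[-1]


print(levenshtein_two_rows("eeba", "abac"))  # 3
```

The trade-off is that the sequence of edits can no longer be read back from the table, since earlier rows are discarded.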