The Levenshtein algorithm calculates the Levenshtein distance between two strings. Levenshtein distance, also known as edit distance, is the minimum number of single-character edits required to transform one string into the other. The permitted edit operations are replacing one character with another, inserting one character, and deleting one character.
Overview
Levenshtein distance describes how different two strings are. I use this algorithm in a web crawler program to compare two versions of a web page. If the page content has changed enough, I update it in my database.
Description
The original algorithm creates a matrix of size strlen1 * strlen2. If both strings are 1,000 characters long, the matrix has 1M elements. If the strings are 10,000 characters long, the matrix has 100M elements. If the elements are integers (int32, 4 bytes each), the matrix takes 4 * 100M = 400 MB.
The current version of the algorithm uses only 2 * strlen elements, so the same example needs 2 * 10,000 * 4 = 80 KB. This not only reduces memory usage but also makes the algorithm faster, because allocating less memory takes less time. When the two strings are about 1K in length, the new version is about twice as fast as the old one.
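The arithmetic above can be checked with a small sketch. The function names are made up for illustration, and the 4-byte cell size is the int32 assumption taken from the text.

```python
# Rough memory estimate, assuming 4-byte (int32) cells as in the text.
BYTES_PER_CELL = 4


def full_matrix_bytes(len1: int, len2: int) -> int:
    """Bytes for a len1 * len2 matrix (the approximation used in the text)."""
    return len1 * len2 * BYTES_PER_CELL


def two_vector_bytes(length: int) -> int:
    """Bytes for two vectors of `length` cells each."""
    return 2 * length * BYTES_PER_CELL


print(full_matrix_bytes(10_000, 10_000))  # 400000000 bytes, about 400 MB
print(two_vector_bytes(10_000))           # 80000 bytes, 80 KB
```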
Example
For the example strings, the original version would create a matrix [6 + 1, 5 + 1], while my new version creates two vectors [6 + 1] (the yellow elements). In both versions the order of the strings does not matter: it could equally be a matrix [5 + 1, 6 + 1] and two vectors [5 + 1].
New algorithm steps
| Step | Description |
|------|-------------|
| 1 | Set n to the length of string s ("GUMBO"). Set m to the length of string t ("GAMBOL"). If n = 0, return m and exit. If m = 0, return n and exit. Construct two vectors, V0[m + 1] and V1[m + 1], each indexed 0..m. |
| 2 | Initialize V0 to 0..m. |
| 3 | Examine each character of s (i from 1 to n). |
| 4 | Examine each character of t (j from 1 to m). |
| 5 | If s[i] equals t[j], the edit cost is 0. If s[i] does not equal t[j], the edit cost is 1. |
| 6 | Set cell V1[j] to the minimum of: a. the cell immediately above + 1: V1[j-1] + 1; b. the cell immediately to the left + 1: V0[j] + 1; c. the cell diagonally above and to the left + cost: V0[j-1] + cost. |
| 7 | After the iteration steps (3, 4, 5, 6) are complete, V1[m] holds the edit distance. |
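The steps above can be sketched in Python as follows. This is an illustrative sketch, not the article's original code. Two details are assumptions filled in here because the step list does not spell them out: the top cell of the current column is set to i before the inner loop, and the two vectors are swapped after each pass over t, so the final result is read from the vector that was written last.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance between s and t using two vectors of len(t) + 1 cells."""
    n, m = len(s), len(t)
    # Step 1: trivial cases.
    if n == 0:
        return m
    if m == 0:
        return n
    # Step 2: v0 is the previous column, initialised to 0..m.
    v0 = list(range(m + 1))
    v1 = [0] * (m + 1)
    # Step 3: examine each character of s.
    for i in range(1, n + 1):
        # Assumption: the top cell of the current column is i.
        v1[0] = i
        # Step 4: examine each character of t.
        for j in range(1, m + 1):
            # Step 5: cost is 0 on a match, 1 otherwise (strings are 0-indexed).
            cost = 0 if s[i - 1] == t[j - 1] else 1
            # Step 6: minimum of the cell above, the cell to the left,
            # and the diagonal cell plus cost.
            v1[j] = min(v1[j - 1] + 1, v0[j] + 1, v0[j - 1] + cost)
        # Assumption: swap so the column just computed becomes "previous".
        v0, v1 = v1, v0
    # Step 7: after the final swap, the last computed column is in v0.
    return v0[m]


print(levenshtein("GUMBO", "GAMBOL"))  # 2
```

Swapping the two references avoids copying the vector on every iteration, which keeps the memory footprint at two vectors regardless of the length of s.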
This section walks through the calculation of the Levenshtein distance between the strings "GUMBO" and "GAMBOL".
Steps 1 and 2
|   | V0 | V1 |   |   |   |   |
|---|----|----|---|---|---|---|
|   |    | G  | U | M | B | O |
|   | 0  | 1  | 2 | 3 | 4 | 5 |
| G | 1  |    |   |   |   |   |
| A | 2  |    |   |   |   |   |
| M | 3  |    |   |   |   |   |
| B | 4  |    |   |   |   |   |
| O | 5  |    |   |   |   |   |
| L | 6  |    |   |   |   |   |
Steps 3 to 6, when i = 1
|   | V0 | V1 |   |   |   |   |
|---|----|----|---|---|---|---|
|   |    | G  | U | M | B | O |
|   | 0  | 1  | 2 | 3 | 4 | 5 |
| G | 1  | 0  |   |   |   |   |
| A | 2  | 1  |   |   |   |   |
| M | 3  | 2  |   |   |   |   |
| B | 4  | 3  |   |   |   |   |
| O | 5  | 4  |   |   |   |   |
| L | 6  | 5  |   |   |   |   |
Steps 3 to 6, when i = 2
|   |   | V0 | V1 |   |   |   |
|---|---|----|----|---|---|---|
|   |   | G  | U  | M | B | O |
|   | 0 | 1  | 2  | 3 | 4 | 5 |
| G | 1 | 0  | 1  |   |   |   |
| A | 2 | 1  | 1  |   |   |   |
| M | 3 | 2  | 2  |   |   |   |
| B | 4 | 3  | 3  |   |   |   |
| O | 5 | 4  | 4  |   |   |   |
| L | 6 | 5  | 5  |   |   |   |
Steps 3 to 6, when i = 3
|   |   |   | V0 | V1 |   |   |
|---|---|---|----|----|---|---|
|   |   | G | U  | M  | B | O |
|   | 0 | 1 | 2  | 3  | 4 | 5 |
| G | 1 | 0 | 1  | 2  |   |   |
| A | 2 | 1 | 1  | 2  |   |   |
| M | 3 | 2 | 2  | 1  |   |   |
| B | 4 | 3 | 3  | 2  |   |   |
| O | 5 | 4 | 4  | 3  |   |   |
| L | 6 | 5 | 5  | 4  |   |   |
Steps 3 to 6, when i = 4
|   |   |   |   | V0 | V1 |   |
|---|---|---|---|----|----|---|
|   |   | G | U | M  | B  | O |
|   | 0 | 1 | 2 | 3  | 4  | 5 |
| G | 1 | 0 | 1 | 2  | 3  |   |
| A | 2 | 1 | 1 | 2  | 3  |   |
| M | 3 | 2 | 2 | 1  | 2  |   |
| B | 4 | 3 | 3 | 2  | 1  |   |
| O | 5 | 4 | 4 | 3  | 2  |   |
| L | 6 | 5 | 5 | 4  | 3  |   |
Steps 3 to 6, when i = 5
|   |   |   |   |   | V0 | V1 |
|---|---|---|---|---|----|----|
|   |   | G | U | M | B  | O  |
|   | 0 | 1 | 2 | 3 | 4  | 5  |
| G | 1 | 0 | 1 | 2 | 3  | 4  |
| A | 2 | 1 | 1 | 2 | 3  | 4  |
| M | 3 | 2 | 2 | 1 | 2  | 3  |
| B | 4 | 3 | 3 | 2 | 1  | 2  |
| O | 5 | 4 | 4 | 3 | 2  | 1  |
| L | 6 | 5 | 5 | 4 | 3  | 2  |
Step 7
The edit distance is the value in the bottom right corner of the matrix, V1[m] = 2. This matches the intuitive way of turning "GUMBO" into "GAMBOL": replace "U" with "A" and append "L" at the end (one substitution and one insertion, two edits in total).
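The columns shown in the walkthrough can be reproduced with a short tracing sketch. The generator name is hypothetical. For readability it builds a fresh list per column instead of reusing two vectors, so it illustrates the values only, not the memory behaviour.

```python
def levenshtein_columns(s: str, t: str):
    """Yield (i, column) for i = 0..len(s); each column has len(t) + 1 cells."""
    m = len(t)
    prev = list(range(m + 1))  # column 0 is 0..m
    yield 0, prev
    for i, ch in enumerate(s, start=1):
        cur = [i] + [0] * m  # top cell of column i is i
        for j in range(1, m + 1):
            cost = 0 if ch == t[j - 1] else 1
            cur[j] = min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost)
        yield i, cur
        prev = cur


for i, column in levenshtein_columns("GUMBO", "GAMBOL"):
    print(i, column)
# 0 [0, 1, 2, 3, 4, 5, 6]
# 1 [1, 0, 1, 2, 3, 4, 5]
# 2 [2, 1, 1, 2, 3, 4, 5]
# 3 [3, 2, 2, 1, 2, 3, 4]
# 4 [4, 3, 3, 2, 1, 2, 3]
# 5 [5, 4, 4, 3, 2, 1, 2]
```

Each printed list is one column of the tables read top to bottom, and the last cell of the last column is the distance, 2.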
Improvement
If you are sure that your strings will never be longer than 2^16 (65536) characters, you can use ushort instead of int. If the strings are shorter than 2^8 characters, you can use byte. I think the algorithm might be even faster in unmanaged code, but I have not tried it.
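The article's suggestion refers to typed integers such as ushort and byte. As a rough Python analogue, the standard `array` module offers compact typed storage. The sketch below picks the element type from the longer string's length; the function name and the selection thresholds are illustrative assumptions, and no speed claim is made for Python.

```python
from array import array


def levenshtein_compact(s: str, t: str) -> int:
    """Two-vector edit distance with cells sized to the input length."""
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    # No cell ever exceeds max(n, m), so that bound decides the cell type.
    longest = max(n, m)
    if longest < 2 ** 8:
        typecode = "B"  # unsigned char, 1 byte
    elif longest < 2 ** 16:
        typecode = "H"  # unsigned short, at least 2 bytes
    else:
        typecode = "L"  # unsigned long, at least 4 bytes
    v0 = array(typecode, range(m + 1))
    v1 = array(typecode, [0] * (m + 1))
    for i in range(1, n + 1):
        v1[0] = i
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            v1[j] = min(v1[j - 1] + 1, v0[j] + 1, v0[j - 1] + cost)
        v0, v1 = v1, v0
    return v0[m]


print(levenshtein_compact("GUMBO", "GAMBOL"))  # 2
```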
References
- Levenshtein distance, in three flavors
Download the code from the original article: http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm