A Fast and Efficient Levenshtein Algorithm Implementation


The Levenshtein algorithm calculates the Levenshtein distance between two strings. Levenshtein distance, also known as edit distance, is the minimum number of edits required to transform one string into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.

Overview

Levenshtein distance describes the difference between two strings. I use this algorithm in a web crawler program to compare two versions of a web page. If the page content has changed enough, I update it in my database.

Description

The original algorithm creates a matrix of size strlen1 * strlen2. If both strings are 1,000 characters long, the matrix has 1 M elements; if the strings are 10,000 characters long, it has 100 M elements. If the elements are integers (int32, 4 bytes each), the matrix takes 4 * 100 M = 400 MB.
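For comparison, the full-matrix approach described above can be sketched roughly as follows. This is an illustrative Python sketch, not the original article's code; the function name levenshtein_full_matrix is an assumption made for this example.

```python
def levenshtein_full_matrix(s: str, t: str) -> int:
    """Edit distance using a full (n + 1) x (m + 1) matrix (illustrative sketch)."""
    n, m = len(s), len(t)
    # d[i][j] is the distance between the first i chars of s and the first j chars of t.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of the prefix of s
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of the prefix of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[n][m]


print(levenshtein_full_matrix("GUMBO", "GAMBOL"))  # prints 2
```

Every cell is kept until the end, even though each cell only depends on the current and previous line of the matrix. That observation is what makes the smaller version possible.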

The current version of the algorithm uses only 2 * strlen elements, so the 10,000-character example needs just 2 * 10,000 * 4 = 80 KB. The result is not only lower memory usage but also higher speed, because memory allocation takes much less time. When both strings are about 1 K characters long, the new algorithm is about twice as fast as the old one.
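The memory figures above can be checked with a few lines of arithmetic. The sketch below follows the article's simplification (it ignores the extra +1 row and column) and assumes 4-byte int32 elements; the helper names are made up for this example.

```python
BYTES_PER_INT32 = 4  # assumption: each element is a 4-byte integer


def full_matrix_bytes(len1: int, len2: int) -> int:
    """Approximate storage for the full matrix, ignoring the +1 row and column."""
    return len1 * len2 * BYTES_PER_INT32


def two_vector_bytes(length: int) -> int:
    """Approximate storage for the two vectors, ignoring the +1 element."""
    return 2 * length * BYTES_PER_INT32


for length in (1_000, 10_000):
    print(length, full_matrix_bytes(length, length), two_vector_bytes(length))
# prints:
# 1000 4000000 8000
# 10000 400000000 80000
```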

Example

The original version creates a matrix[6 + 1, 5 + 1], while my new algorithm creates two vectors[6 + 1] (marked v0 and v1 in the tables below). In both versions, the order of the two strings does not matter: it could equally be a matrix[5 + 1, 6 + 1] and two vectors[5 + 1].

New algorithm steps

Step  Description
1     Set n to the length of string s ("GUMBO").
      Set m to the length of string t ("GAMBOL").
      If n equals 0, return m and exit.
      If m equals 0, return n and exit.
      Construct two vectors, v0[m + 1] and v1[m + 1], each holding elements 0..m.
2     Initialize v0 to 0..m.
3     Examine each character of s (i from 1 to n).
4     Examine each character of t (j from 1 to m).
5     If s[i] equals t[j], the cost is 0.
      If s[i] does not equal t[j], the cost is 1.
6     Set cell v1[j] to the minimum of:
      a. The cell immediately above, plus 1: v1[j-1] + 1
      b. The cell immediately to the left, plus 1: v0[j] + 1
      c. The cell diagonally above and to the left, plus the cost: v0[j-1] + cost
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is in cell v1[m].
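The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the original article's implementation. It adds two bookkeeping steps that the table leaves implicit and that are assumed here: setting v1[0] to i at the start of each outer iteration, and swapping the two vectors at the end of it.

```python
def levenshtein(s: str, t: str) -> int:
    """Two-vector Levenshtein distance (illustrative sketch of steps 1 to 7)."""
    n, m = len(s), len(t)          # step 1
    if n == 0:
        return m
    if m == 0:
        return n
    v0 = list(range(m + 1))        # step 2: v0 = 0..m
    v1 = [0] * (m + 1)
    for i in range(1, n + 1):      # step 3
        v1[0] = i                  # assumed bookkeeping: distance from s[:i] to ""
        for j in range(1, m + 1):  # step 4
            # Step 5. Python strings are 0-indexed, hence the -1.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            # Step 6: minimum of the three neighbouring cells.
            v1[j] = min(v1[j - 1] + 1, v0[j] + 1, v0[j - 1] + cost)
        # Assumed bookkeeping: the finished vector becomes the "previous" one.
        v0, v1 = v1, v0
    # Step 7: after the final swap, the last computed vector is named v0.
    return v0[m]


print(levenshtein("GUMBO", "GAMBOL"))  # prints 2
```

Swapping the references avoids copying v1 into v0 on every iteration, so no allocation happens inside the loops.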

The following tables show how the Levenshtein distance between the strings "GUMBO" and "GAMBOL" is calculated.

Steps 1 and 2

    v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1
A    2
M    3
B    4
O    5
L    6

Steps 3 to 6, when i = 1

    v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0
A    2   1
M    3   2
B    4   3
O    5   4
L    6   5

Steps 3 to 6, when i = 2

        v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0   1
A    2   1   1
M    3   2   2
B    4   3   3
O    5   4   4
L    6   5   5

Steps 3 to 6, when i = 3

            v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0   1   2
A    2   1   1   2
M    3   2   2   1
B    4   3   3   2
O    5   4   4   3
L    6   5   5   4

Steps 3 to 6, when i = 4

                v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0   1   2   3
A    2   1   1   2   3
M    3   2   2   1   2
B    4   3   3   2   1
O    5   4   4   3   2
L    6   5   5   4   3

Steps 3 to 6, when i = 5

                    v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0   1   2   3   4
A    2   1   1   2   3   4
M    3   2   2   1   2   3
B    4   3   3   2   1   2
O    5   4   4   3   2   1
L    6   5   5   4   3   2

Step 7

The edit distance is the value in the bottom right corner of the matrix, v1[m] = 2. This matches the intuitive way of transforming "GUMBO" into "GAMBOL": replace "U" with "A" and append "L" at the end (one substitution plus one insertion, two edits in total).
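The worked example can be reproduced with a short trace. The sketch below is an assumption-labelled illustration (the generator name levenshtein_columns is made up here) that yields the v1 vector after each character of s has been processed. Each yielded list corresponds to one column of the tables above, read from top to bottom.

```python
def levenshtein_columns(s: str, t: str):
    """Yield the v1 vector after each character of s (illustrative trace)."""
    m = len(t)
    v0 = list(range(m + 1))
    for i, s_char in enumerate(s, start=1):
        # A fresh list per column keeps the trace simple; a real
        # implementation would reuse two vectors instead.
        v1 = [i] + [0] * m
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            v1[j] = min(v1[j - 1] + 1, v0[j] + 1, v0[j - 1] + cost)
        yield v1
        v0 = v1


for i, column in enumerate(levenshtein_columns("GUMBO", "GAMBOL"), start=1):
    print(f"i = {i}: {column}")
# prints:
# i = 1: [1, 0, 1, 2, 3, 4, 5]
# i = 2: [2, 1, 1, 2, 3, 4, 5]
# i = 3: [3, 2, 2, 1, 2, 3, 4]
# i = 4: [4, 3, 3, 2, 1, 2, 3]
# i = 5: [5, 4, 4, 3, 2, 1, 2]
```

The last element of the final column is the distance, 2.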

Improvement

If you are sure that your strings will always be shorter than 2^16 (65,536) characters, you can use ushort instead of int. If the strings are shorter than 2^8 (256) characters, you can use byte. I suspect the algorithm could be even faster in unmanaged code, but I have not tried it.
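The article's ushort and byte suggestion refers to .NET types. As a rough analogue, Python's standard array module offers typed storage, where typecode "H" is an unsigned short and "B" is an unsigned char. The sketch below is only an illustration of the narrower-element idea, and the function name levenshtein_compact is an assumption. In Python this mainly reduces memory and is not expected to make the loop faster.

```python
from array import array


def levenshtein_compact(s: str, t: str, typecode: str = "H") -> int:
    """Two-vector distance stored in narrow unsigned elements (sketch).

    Assumes both strings are short enough for the chosen typecode;
    array raises OverflowError if a value does not fit.
    """
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    v0 = array(typecode, range(m + 1))
    v1 = array(typecode, [0] * (m + 1))
    for i in range(1, n + 1):
        v1[0] = i
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            v1[j] = min(v1[j - 1] + 1, v0[j] + 1, v0[j - 1] + cost)
        v0, v1 = v1, v0  # reuse the two buffers
    return v0[m]


print(levenshtein_compact("GUMBO", "GAMBOL"))       # prints 2
print(levenshtein_compact("GUMBO", "GAMBOL", "B"))  # prints 2
```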

References
    • Levenshtein distance, in three flavors

To download the code, see the original article: http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm

 
