String edit distance (Levenshtein distance) algorithm

Source: Internet
Author: User
Tags creative commons attribution

Basic introduction

Levenshtein distance is a string metric that measures how different two strings are. It is the minimum number of single-character edits (modifications, insertions, or deletions) needed to change one string into the other. The Soviet mathematician Vladimir Levenshtein introduced the concept in 1965.

Simple example

Changing the string "kitten" into the string "sitting" takes only 3 single-character edits:

      • sitten (k → s)
      • sittin (e → i)
      • sitting (insert g at the end)

Therefore, the Levenshtein distance between "kitten" and "sitting" is 3.
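
The three edits above can be replayed mechanically. The following is a minimal sketch; the function name `apply_example_edits` is a hypothetical helper chosen for illustration and does not come from the original article.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replays the three single-character edits
// from the example, checking the intermediate string after each one.
std::string apply_example_edits() {
    std::string s = "kitten";
    s[0] = 's';        // modify k -> s
    assert(s == "sitten");
    s[4] = 'i';        // modify e -> i
    assert(s == "sittin");
    s.push_back('g');  // insert g at the end
    return s;
}
```

Calling `apply_example_edits()` should return the string "sitting" after exactly three edits.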

Implementation idea

How do we implement this algorithm in a program? Many explanations start from a matrix, but the matrix is really the final visualization: it makes it easier to understand why the algorithm works, yet it is hard to work out how to build the algorithm starting from the matrix.

Let us look for the substructure of the problem "change string $a$ into string $b$". Note that "change string $b$ into string $a$" is the same problem: deleting a character from $a$ to match $b$ is equivalent to inserting a character into $b$ to match $a$, so the two operations can be converted into each other.

Let $a[1\dots i]$ and $b[1\dots j]$ be the prefixes made up of the first $i$ characters of $a$ and the first $j$ characters of $b$. This gives the subproblem "change $a[1\dots i]$ into $b[1\dots j]$":$$\left[\begin{matrix}a:&a[1]&a[2]&\cdots&a[i-2]&a[i-1]&a[i]\\b:&b[1]&b[2]&\cdots&b[j-2]&b[j-1]&b[j]\end{matrix}\right]$$

  ① Insert operation:

      • If changing $a[1\dots i]$ into $b[1\dots j-1]$ takes $op_1$ operations, then inserting a character $a[i']=b[j]$ between $a[i]$ and $a[i+1]$ to match $b[j]$ changes $a[1\dots i]$ into $b[1\dots j]$ in $op_1+1$ operations. $$\left[\begin{matrix}\cdots&\color{red}{a[i-2]}&\color{red}{a[i-1]}&\mathbf{\color{red}{a[i]}}&\mathbf{\color{blue}{a[i']}}\\\cdots&\color{red}{b[j-2]}&\mathbf{\color{red}{b[j-1]}}&\mathbf{\color{blue}{b[j]}}&\phi\end{matrix}\right]$$

  ② Delete operation:

      • If changing $a[1\dots i-1]$ into $b[1\dots j]$ takes $op_2$ operations, then deleting the character $a[i]$ makes the two substrings match in $op_2+1$ operations: $$\left[\begin{matrix}\cdots&\color{red}{a[i-2]}&\mathbf{\color{red}{a[i-1]}}&\mathbf{\color{blue}{\phi}}\\\cdots&\color{red}{b[j-2]}&\color{red}{b[j-1]}&\mathbf{\color{red}{b[j]}}\end{matrix}\right]$$

  ③ Modify operation:

      • If changing $a[1\dots i-1]$ into $b[1\dots j-1]$ takes $op_3$ operations, then replacing the character $a[i]$ with $a[i']=b[j]$ completes the change in $op_3+1$ operations: $$\left[\begin{matrix}\cdots&\color{red}{a[i-2]}&\mathbf{\color{red}{a[i-1]}}&\mathbf{\color{blue}{a[i']}}\\\cdots&\color{red}{b[j-2]}&\mathbf{\color{red}{b[j-1]}}&\mathbf{\color{blue}{b[j]}}\end{matrix}\right]$$
      • However, if $a[i]=b[j]$ already, no modification is needed and the cost remains $op_3$.

In summary, changing the string $a[1\dots i]$ into the string $b[1\dots j]$ takes $\min\{op_1+1,\ op_2+1,\ op_3+1_{(a_i\neq b_j)}\}$ operations, where $1_{(a_i\neq b_j)}$ equals $1$ when $a_i\neq b_j$ and $0$ otherwise.
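
As a small worked instance of this formula, take $a=$ "kitten", $b=$ "sitting" and $i=j=1$, so $a_1=k$ and $b_1=s$. Changing "k" into the empty string takes one deletion ($op_1=1$), changing the empty string into "s" takes one insertion ($op_2=1$), and changing the empty string into the empty string takes nothing ($op_3=0$). Since $a_1\neq b_1$, the indicator equals $1$:

$$\min\{op_1+1,\ op_2+1,\ op_3+1_{(a_1\neq b_1)}\}=\min\{1+1,\ 1+1,\ 0+1\}=1$$

So changing "k" into "s" costs one operation, namely a single modification.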

Mathematical definition

Mathematically, we define the Levenshtein distance between two strings $a$ and $b$ as $lev_{a,\ b}(|a|,\ |b|)$, where $|a|$ and $|b|$ are the lengths of $a$ and $b$, and $$lev_{a,\ b}(i,\ j)=\begin{cases}i,&j=0\\j,&i=0\\\min\begin{cases}lev_{a,\ b}(i,\ j-1)+1\\lev_{a,\ b}(i-1,\ j)+1\\lev_{a,\ b}(i-1,\ j-1)+1_{(a_i\neq b_j)}\end{cases},&\text{otherwise}\end{cases}$$
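
The definition can be transcribed almost literally as a memoized recursion. This is only a sketch to show how the three cases map to code; the names `lev_rec` and `levenshtein_topdown` are hypothetical, and the use of `std::string` and `std::vector` is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: lev(i, j) for the prefixes a[1..i] and b[1..j].
// memo[i][j] caches results; -1 means "not computed yet".
int lev_rec(const std::string& a, const std::string& b, int i, int j,
            std::vector<std::vector<int>>& memo) {
    assert(i >= 0 && j >= 0);
    if (j == 0) return i;  // delete all i characters
    if (i == 0) return j;  // insert all j characters
    int& cached = memo[i][j];
    if (cached != -1) return cached;
    int ins = lev_rec(a, b, i, j - 1, memo) + 1;
    int del = lev_rec(a, b, i - 1, j, memo) + 1;
    // a_i and b_j are 1-based in the formula, 0-based in the string.
    int mod = lev_rec(a, b, i - 1, j - 1, memo) + (a[i - 1] != b[j - 1] ? 1 : 0);
    cached = std::min(ins, std::min(del, mod));
    return cached;
}

// Hypothetical wrapper: evaluates lev(|a|, |b|).
int levenshtein_topdown(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> memo(a.size() + 1,
                                       std::vector<int>(b.size() + 1, -1));
    return lev_rec(a, b, static_cast<int>(a.size()),
                   static_cast<int>(b.size()), memo);
}
```

For the earlier example, `levenshtein_topdown("kitten", "sitting")` should evaluate to 3. The recursion depth grows with $|a|+|b|$, so this form is mainly useful for checking the recurrence on short strings.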

See the Wikipedia article on Levenshtein distance for more information.

C++ code

With the state transition equation in hand, we can write the DP directly. For strings of lengths $m$ and $n$, the time complexity is $O(mn)$ and the space complexity is $O(mn)$.

#include <stdio.h>
#include <string.h>
#include <algorithm>
using std::min;

int lena, lenb;
char a[1010], b[1010];

// Read the two strings; the width limit keeps input inside the buffers.
void read() {
    scanf("%1009s%1009s", a, b);
    lena = (int)strlen(a);
    lenb = (int)strlen(b);
}

// dp[i][j] = distance between the first i chars of a and the first j chars of b.
int dp[1010][1010];

void work() {
    // Boundary cases: one of the two prefixes is empty.
    for (int i = 1; i <= lena; i++) dp[i][0] = i;
    for (int j = 1; j <= lenb; j++) dp[0][j] = j;
    for (int i = 1; i <= lena; i++)
        for (int j = 1; j <= lenb; j++)
            if (a[i-1] == b[j-1])
                dp[i][j] = dp[i-1][j-1];  // equal characters: no edit needed
            else
                dp[i][j] = min(dp[i-1][j-1], min(dp[i][j-1], dp[i-1][j])) + 1;
    printf("%d\n", dp[lena][lenb]);
}

int main() {
    read();
    work();
    return 0;
}
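
To look at the matrix mentioned earlier as a visualization, the same DP can be wrapped in a function that returns the whole table. This is a sketch under assumptions: `Table`, `lev_table` and `print_table` are hypothetical names, and the `std::string`/`std::vector` interface is not from the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

using Table = std::vector<std::vector<int>>;

// Hypothetical helper: fills the full (|a|+1) x (|b|+1) DP table.
Table lev_table(const std::string& a, const std::string& b) {
    const size_t m = a.size(), n = b.size();
    Table dp(m + 1, std::vector<int>(n + 1, 0));
    for (size_t i = 0; i <= m; ++i) dp[i][0] = static_cast<int>(i);
    for (size_t j = 0; j <= n; ++j) dp[0][j] = static_cast<int>(j);
    for (size_t i = 1; i <= m; ++i) {
        for (size_t j = 1; j <= n; ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            dp[i][j] = std::min({dp[i][j - 1] + 1,          // insert
                                 dp[i - 1][j] + 1,          // delete
                                 dp[i - 1][j - 1] + cost}); // modify or keep
        }
    }
    // Sanity check: the distance never exceeds the longer length.
    assert(dp[m][n] <= static_cast<int>(std::max(m, n)));
    return dp;
}

// Hypothetical helper: prints the table row by row.
void print_table(const Table& dp) {
    for (const auto& row : dp) {
        for (int v : row) std::cout << v << ' ';
        std::cout << '\n';
    }
}
```

Calling `print_table(lev_table("kitten", "sitting"))` prints a table of 7 rows and 8 columns; the bottom-right entry is the distance between the two full strings.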

Several small optimizations

1. If $a[i]=b[j]$ (indices starting from $1$), you can take $lev(i,\ j)=lev(i-1,\ j-1)$ directly, because the characters are equal and no edit is needed. This also follows from the inequalities between the terms of the recurrence above: $lev(i-1,\ j-1)\le lev(i,\ j-1)+1$ and $lev(i-1,\ j-1)\le lev(i-1,\ j)+1$.

2. If you use a rolling array, the space complexity drops to $O(2\cdot\max\{m,\ n\})$. Alternatively, you can keep a single row and save $lev(i-1,\ j-1)$ in a temporary variable, which reduces the space complexity to $O(\max\{m,\ n\})$, as follows:

int dp[1010];

void work() {
    for (int j = 1; j <= lenb; j++) dp[j] = j;
    int t1, t2;
    for (int i = 1; i <= lena; i++) {
        t1 = dp[0]++;  // t1 = lev(i-1, 0); dp[0] becomes lev(i, 0)
        for (int j = 1; j <= lenb; j++) {
            t2 = dp[j];  // save lev(i-1, j) before overwriting it
            if (a[i-1] == b[j-1])
                dp[j] = t1;
            else
                dp[j] = min(t1, min(dp[j-1], dp[j])) + 1;
            t1 = t2;  // becomes lev(i-1, j-1) for the next column
        }
    }
    printf("%d\n", dp[lenb]);
}
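
Because changing $a$ into $b$ and changing $b$ into $a$ are the same problem, the single row can be sized by the shorter string. The sketch below is a variation on the code above, not the original author's version; the name `levenshtein_one_row` and the `std::string` interface are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: single-row DP whose row has min(|a|, |b|) + 1 entries.
int levenshtein_one_row(std::string a, std::string b) {
    // The distance is symmetric, so make b the shorter string.
    if (a.size() < b.size()) std::swap(a, b);
    assert(b.size() <= a.size());

    std::vector<int> dp(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) dp[j] = static_cast<int>(j);

    for (size_t i = 1; i <= a.size(); ++i) {
        int diag = dp[0];             // lev(i-1, 0)
        dp[0] = static_cast<int>(i);  // lev(i, 0)
        for (size_t j = 1; j <= b.size(); ++j) {
            int up = dp[j];           // lev(i-1, j), about to be overwritten
            if (a[i - 1] == b[j - 1])
                dp[j] = diag;
            else
                dp[j] = std::min(diag, std::min(dp[j - 1], up)) + 1;
            diag = up;                // lev(i-1, j-1) for the next column
        }
    }
    return dp[b.size()];
}
```

With this variant, `levenshtein_one_row("kitten", "sitting")` and `levenshtein_one_row("sitting", "kitten")` should both give 3, and the row holds only seven integers in either case.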

That concludes this basic introduction to the Levenshtein distance algorithm. If you found it useful, please recommend it, and if you have any suggestions, feel free to leave them in the comments below.

This article is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You are welcome to quote, reprint, or adapt it, but you must retain the attribution to Blackstorm and the link to this article, http://www.cnblogs.com/BlackStorm/p/5400809.html, and you may not use it for commercial purposes without permission. Please contact me with any questions or to discuss authorization.
