String edit distance (Levenshtein distance) algorithm

Source: Internet
Author: User

Basic introduction

Levenshtein distance is a string metric that measures how different two strings are. It can be defined as the minimum number of single-character edits (modifications, insertions, or deletions) required to change one string into the other. Russian scientist Vladimir Levenshtein introduced this concept in 1965.

Simple example

Changing the string "kitten" into the string "sitting" requires only 3 single-character edits, as follows:

• sitten (k→s)
• sittin (e→i)
• sitting (insert g at the end)

Therefore, the Levenshtein distance between "kitten" and "sitting" is 3.
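
The three edits above can be replayed in code. This is only a small sketch: the function name `kitten_to_sitting_steps` is made up for illustration, and it hard-codes this one example rather than computing anything.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: replays the three edits from the example and
// returns every intermediate string, starting with "kitten".
std::vector<std::string> kitten_to_sitting_steps() {
    std::vector<std::string> steps;
    std::string s = "kitten";
    steps.push_back(s);
    s[0] = 's';        // modify k -> s, giving "sitten"
    steps.push_back(s);
    s[4] = 'i';        // modify e -> i, giving "sittin"
    steps.push_back(s);
    s.push_back('g');  // insert g at the end, giving "sitting"
    steps.push_back(s);
    return steps;
}
```

Three edits are enough to reach "sitting", so the distance is at most 3; the dynamic programming described later confirms that no shorter sequence exists.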

Implementation idea

How do we implement this algorithm in a program? Many people explain it with a matrix, but the matrix is really the final visualization: it makes it easy to understand why the algorithm works, yet it is hard to figure out how to do it starting from the matrix alone.

We look for the subproblem structure of the problem "change string $a$ into string $b$". Note that "change string $b$ into string $a$" is the same problem: deleting a character from $a$ to match $b$ is equivalent to inserting a character into $b$ to match $a$, so the two operations can be converted into each other.

Suppose the character sequences $a[1\dots i]$ and $b[1\dots j]$ are the prefixes made of the first $i$ characters of $a$ and the first $j$ characters of $b$. This gives the subproblem "change string $a[1\dots i]$ into string $b[1\dots j]$":

$$\left[\begin{matrix}a: & a[1] & a[2] & \cdots & a[i-2] & a[i-1] & a[i]\\ b: & b[1] & b[2] & \cdots & b[j-2] & b[j-1] & b[j]\end{matrix}\right]$$

① Insert operation:

• If changing $a[1\dots i]$ into $b[1\dots j-1]$ takes $op_1$ operations, then inserting a character $a[i']=b[j]$ between $a[i]$ and $a[i+1]$ to match $b[j]$ means that changing $a[1\dots i]$ into $b[1\dots j]$ takes $op_1+1$ operations:

$$\left[\begin{matrix}\cdots & \color{red}{a[i-1]} & \color{red}{a[i]} & \color{blue}{a[i']}\\ \cdots & \color{red}{b[j-2]} & \color{red}{b[j-1]} & \color{blue}{b[j]}\end{matrix}\right]$$

② Delete operation:

• If changing $a[1\dots i-1]$ into $b[1\dots j]$ takes $op_2$ operations, then deleting the character $a[i]$ makes the two prefixes match with $op_2+1$ operations:

$$\left[\begin{matrix}\cdots & \color{red}{a[i-2]} & \color{red}{a[i-1]} & \color{blue}{\phi}\\ \cdots & \color{red}{b[j-1]} & \color{red}{b[j]} & \end{matrix}\right]$$

③ Modify operation:

• If changing $a[1\dots i-1]$ into $b[1\dots j-1]$ takes $op_3$ operations, then replacing the character $a[i]$ with $a[i']=b[j]$ completes the change with $op_3+1$ operations:

$$\left[\begin{matrix}\cdots & \color{red}{a[i-2]} & \color{red}{a[i-1]} & \color{blue}{a[i']}\\ \cdots & \color{red}{b[j-2]} & \color{red}{b[j-1]} & \color{blue}{b[j]}\end{matrix}\right]$$

• However, if $a[i]==b[j]$ already holds, no modification is needed and the count stays at $op_3$.

In summary, the number of operations required to change string $a[1\dots i]$ into string $b[1\dots j]$ is $\min\{op_1+1,\ op_2+1,\ op_3+1_{(a_i\neq b_j)}\}$, where $1_{(a_i\neq b_j)}$ takes the value $1$ when $a_i\neq b_j$ and $0$ otherwise.
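
The three cases translate almost literally into a memoized recursion. The following is a minimal sketch, not necessarily how the original author would write it; the names `lev_rec` and `lev_memo` and the `memo` table are assumptions made for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: number of edits needed to change a[1..i] into b[1..j].
// memo[i][j] == -1 means "not computed yet".
static int lev_rec(const std::string& a, const std::string& b,
                   int i, int j, std::vector<std::vector<int>>& memo) {
    if (i == 0) return j;  // empty prefix of a: insert j characters
    if (j == 0) return i;  // empty prefix of b: delete i characters
    int& cached = memo[i][j];
    if (cached != -1) return cached;
    int op1 = lev_rec(a, b, i, j - 1, memo);      // followed by inserting b[j]
    int op2 = lev_rec(a, b, i - 1, j, memo);      // followed by deleting a[i]
    int op3 = lev_rec(a, b, i - 1, j - 1, memo);  // followed by modifying a[i] if needed
    int cost = (a[i - 1] != b[j - 1]) ? 1 : 0;    // C++ strings are 0-indexed
    cached = std::min({op1 + 1, op2 + 1, op3 + cost});
    return cached;
}

// Hypothetical entry point: distance between the full strings.
int lev_memo(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> memo(a.size() + 1,
                                       std::vector<int>(b.size() + 1, -1));
    return lev_rec(a, b, static_cast<int>(a.size()),
                   static_cast<int>(b.size()), memo);
}
```

Calling `lev_memo("kitten", "sitting")` evaluates each pair $(i,\ j)$ at most once, so the work is proportional to the product of the two lengths. The recursion depth grows with the sum of the lengths, which is one reason an iterative table is usually preferred for long strings.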

Mathematical definition

Mathematically, we define the Levenshtein distance between two strings $a$ and $b$ as $lev_{a,\ b}(|a|,\ |b|)$, where $|a|$ and $|b|$ are the lengths of strings $a$ and $b$, and

$$lev_{a,\ b}(i,\ j)=\left\{\begin{matrix}i, & j=0\\ j, & i=0\\ \min\left\{\begin{matrix}lev_{a,\ b}(i,\ j-1)+1\\ lev_{a,\ b}(i-1,\ j)+1\\ lev_{a,\ b}(i-1,\ j-1)+1_{(a_i\neq b_j)}\end{matrix}\right., & \text{otherwise}\end{matrix}\right.$$
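
As a small worked example of this definition, take $a=\text{ab}$ and $b=\text{ac}$ (strings chosen here purely for illustration; they do not appear in the original text). The boundary values are $lev_{a,\ b}(i,\ 0)=i$ and $lev_{a,\ b}(0,\ j)=j$, and the remaining entries follow from the recurrence, with the three terms listed in the same order as in the definition:

```latex
\begin{align*}
lev_{a,\ b}(1,\ 1) &= \min\{lev_{a,\ b}(1,\ 0)+1,\ lev_{a,\ b}(0,\ 1)+1,\ lev_{a,\ b}(0,\ 0)+0\} = \min\{2,\ 2,\ 0\} = 0 && (a_1 = b_1)\\
lev_{a,\ b}(1,\ 2) &= \min\{lev_{a,\ b}(1,\ 1)+1,\ lev_{a,\ b}(0,\ 2)+1,\ lev_{a,\ b}(0,\ 1)+1\} = \min\{1,\ 3,\ 2\} = 1 && (a_1 \neq b_2)\\
lev_{a,\ b}(2,\ 1) &= \min\{lev_{a,\ b}(2,\ 0)+1,\ lev_{a,\ b}(1,\ 1)+1,\ lev_{a,\ b}(1,\ 0)+1\} = \min\{3,\ 1,\ 2\} = 1 && (a_2 \neq b_1)\\
lev_{a,\ b}(2,\ 2) &= \min\{lev_{a,\ b}(2,\ 1)+1,\ lev_{a,\ b}(1,\ 2)+1,\ lev_{a,\ b}(1,\ 1)+1\} = \min\{2,\ 2,\ 1\} = 1 && (a_2 \neq b_2)
\end{align*}
```

So the distance between "ab" and "ac" is $1$, which matches the single modification b→c.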

C++ code

With the state transition equation in hand, we can write the DP directly. The time complexity is $O(mn)$ and the space complexity is $O(mn)$, where $m$ and $n$ are the lengths of the two strings.

    #include <stdio.h>
    #include <string.h>
    #include <algorithm>
    using std::min;

    int lena, lenb;
    char a[1010], b[1010];

    // Read the two strings; the field width keeps scanf inside the buffers.
    void read() {
        scanf("%1009s%1009s", a, b);
        lena = (int)strlen(a);
        lenb = (int)strlen(b);
    }

    int dp[1010][1010];  // dp[i][j] = distance between a[1..i] and b[1..j]

    void work() {
        // Boundary: changing to or from the empty string.
        for (int i = 1; i <= lena; i++) dp[i][0] = i;
        for (int j = 1; j <= lenb; j++) dp[0][j] = j;
        for (int i = 1; i <= lena; i++)
            for (int j = 1; j <= lenb; j++)
                if (a[i-1] == b[j-1])   // same character: no edit needed
                    dp[i][j] = dp[i-1][j-1];
                else                    // modify, insert, or delete
                    dp[i][j] = min(dp[i-1][j-1], min(dp[i][j-1], dp[i-1][j])) + 1;
        printf("%d\n", dp[lena][lenb]);
    }

    int main() {
        read();
        work();
        return 0;
    }
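
For experimenting with the matrix view mentioned earlier, the same DP can be wrapped in a function that returns the whole table. This is a sketch under my own assumptions: `lev_table` and `print_table` are illustrative names, and `std::string` / `std::vector` are used in place of the fixed-size global arrays of the listing above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: builds the full (|a|+1) x (|b|+1) table, where
// table[i][j] is the distance between the first i characters of a and
// the first j characters of b.
std::vector<std::vector<int>> lev_table(const std::string& a, const std::string& b) {
    const std::size_t m = a.size(), n = b.size();
    std::vector<std::vector<int>> dp(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 1; i <= m; ++i) dp[i][0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= n; ++j) dp[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            int cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
            dp[i][j] = std::min({dp[i][j - 1] + 1,          // insert
                                 dp[i - 1][j] + 1,          // delete
                                 dp[i - 1][j - 1] + cost}); // modify or keep
        }
    }
    return dp;
}

// Hypothetical helper: prints one table row per line.
void print_table(const std::vector<std::vector<int>>& table) {
    for (const auto& row : table) {
        for (int v : row) std::printf("%3d", v);
        std::printf("\n");
    }
}
```

The bottom-right entry of the table is the distance between the full strings; for example, `lev_table("kitten", "sitting")[6][7]` is 3, and `print_table` shows how that value is built up from the first row and column.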

Several small optimizations

1. If $a[i]==b[j]$ (indices starting from $1$), you can take $lev(i,\ j)=lev(i-1,\ j-1)$ directly, because the characters are the same and no edit is needed. This also follows from inequalities on the transition equation: neighboring entries differ by at most $1$, so $lev(i-1,\ j-1)$ is never larger than $lev(i,\ j-1)+1$ or $lev(i-1,\ j)+1$.

2. With a rolling array, the space complexity can be reduced to $O(2\cdot\max\{m,\ n\})$. Alternatively, you can keep a single row and save $lev(i-1,\ j-1)$ in a temporary variable, which reduces the space complexity to $O(\max\{m,\ n\})$, as follows:

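    // Uses a, b, lena, lenb and min from the full listing above.
    int dp[1010];  // dp[j] holds lev(i-1, j) until it is overwritten with lev(i, j)

    void work() {
        for (int j = 1; j <= lenb; j++) dp[j] = j;
        int t1, t2;
        for (int i = 1; i <= lena; i++) {
            t1 = dp[0]++;           // t1 = lev(i-1, 0); dp[0] becomes lev(i, 0)
            for (int j = 1; j <= lenb; j++) {
                t2 = dp[j];         // remember lev(i-1, j) before overwriting it
                if (a[i-1] == b[j-1])
                    dp[j] = t1;     // t1 is lev(i-1, j-1)
                else
                    dp[j] = min(t1, min(dp[j-1], dp[j])) + 1;
                t1 = t2;            // becomes lev(i-1, j-1) for the next column
            }
        }
        printf("%d\n", dp[lenb]);
    }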
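
For comparison, the rolling-array variant mentioned in optimization 2 can be sketched with two rows that are swapped after each pass. The function name `lev_two_rows` and the use of `std::vector` are my own assumptions, not part of the original listings.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: Levenshtein distance using two rows of length |b|+1.
int lev_two_rows(const std::string& a, const std::string& b) {
    const std::size_t m = a.size(), n = b.size();
    std::vector<int> prev(n + 1), cur(n + 1);
    for (std::size_t j = 0; j <= n; ++j) prev[j] = static_cast<int>(j);  // lev(0, j) = j
    for (std::size_t i = 1; i <= m; ++i) {
        cur[0] = static_cast<int>(i);  // lev(i, 0) = i
        for (std::size_t j = 1; j <= n; ++j) {
            int cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
            cur[j] = std::min({cur[j - 1] + 1,        // insert
                               prev[j] + 1,           // delete
                               prev[j - 1] + cost});  // modify or keep
        }
        std::swap(prev, cur);  // the row just filled becomes the previous row
    }
    return prev[n];  // after the final swap, prev holds row |a|
}
```

This version needs no temporary variables because the previous row is never overwritten, at the cost of twice the memory of the single-row code. Passing the shorter string as `b` keeps both rows as small as possible.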

The above is a basic introduction to the Levenshtein distance algorithm. If you found it useful, please recommend it, and if you have any suggestions, feel free to leave them in the comments below.
