Minimum Edit Distance (Levenshtein Distance)

Source: Internet
Author: User

Definition of minimum edit distance: the edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to turn one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.

For example, turning the word kitten into sitting takes three edits:

sitten (k → s)

sittin (e → i)

sitting (insert g at the end)

The Russian scientist Vladimir Levenshtein introduced this concept in 1965.

The words 'computer' and 'commuter' are very similar, and a change of just one letter, p → m, would change the first word into the second. The word 'sport' can be changed into 'sort' by the deletion of the 'p', or equivalently, 'sort' can be changed into 'sport' by the insertion of 'p'.

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:

1. Change a letter,

2. Insert a letter, or

3. Delete a letter
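The three point mutations can be sketched in Java as follows. This is only an illustration: the class name EditOperations and its method names are hypothetical and do not appear in the original article. The sketch replays the kitten → sitting and sport → sort examples from above.

```java
// Illustrative sketch; EditOperations and its methods are hypothetical names.
final class EditOperations {
    // Change a letter: replace the character at index pos with ch.
    static String substitute(String s, int pos, char ch) {
        StringBuilder sb = new StringBuilder(s);
        sb.setCharAt(pos, ch);
        return sb.toString();
    }

    // Insert a letter: ch ends up at index pos of the result.
    static String insert(String s, int pos, char ch) {
        return new StringBuilder(s).insert(pos, ch).toString();
    }

    // Delete a letter: remove the character at index pos.
    static String delete(String s, int pos) {
        return new StringBuilder(s).deleteCharAt(pos).toString();
    }

    public static void main(String[] args) {
        String step1 = substitute("kitten", 0, 's'); // k -> s
        String step2 = substitute(step1, 4, 'i');    // e -> i
        String step3 = insert(step2, 6, 'g');        // append g
        System.out.println(step1 + " -> " + step2 + " -> " + step3);
        System.out.println(delete("sport", 1));      // remove the p
    }
}
```

Each call counts as one edit, so the chain above uses three edits in total, which matches the edit distance between kitten and sitting.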

How to solve this problem?

If you do not work with algorithms often, this problem may leave you with no idea where to start: there are many ways to edit one string into another, with many possible combinations of insert, delete, and substitute operations, so how do you find the minimum edit distance?

The following is a classic algorithmic idea: divide and conquer, that is, break a complex problem down into simpler subproblems (and assume that the solutions of the subproblems are known). One of the most common ways to model this idea is a mathematical recurrence, which derives unknown terms from previously known terms. In computing this is called recursion or a recurrence relation; when the subproblems overlap and their results are stored in a table, as they are here, the technique is known as dynamic programming. The Fibonacci sequence is an example.

So how do we obtain the recurrence for the minimum edit distance in this problem? It is best to start from the simplest, most special cases. Given two strings, the cases are:

1. Both are empty strings: D('', '') = 0, where '' denotes the empty string.

2. One of them is the empty string: D(s, '') = D('', s) = |s|, i.e. the length of s (repeated deletions or insertions).

3. Both are non-empty strings: D(s1+ch1, s2+ch2)

In this case, the value of D(s1+ch1, s2+ch2) is determined by one of three possibilities.

First, suppose D(s1, s2) is known and we substitute the last character of the two strings; then D(s1+ch1, s2+ch2) = D(s1, s2) + (if ch1 = ch2 then 0 else 1).

Second, suppose D(s1, s2+ch2) is known and we delete ch1 from the end of the first string; then D(s1+ch1, s2+ch2) = D(s1, s2+ch2) + 1.

Third, suppose D(s1+ch1, s2) is known and we insert ch2 at the end of the first string; then D(s1+ch1, s2+ch2) = D(s1+ch1, s2) + 1.

Whichever case applies, D(s1+ch1, s2+ch2) must be the smallest of the three, so

D(s1+ch1, s2+ch2) = min[ D(s1, s2) + (if ch1 = ch2 then 0 else 1), D(s1+ch1, s2) + 1, D(s1, s2+ch2) + 1 ]
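The recursive formula can be transcribed almost literally into code. The following is a minimal sketch, assuming unit cost for every operation; the class name RecursiveEditDistance and the helper d are hypothetical. It recomputes the same subproblems many times, so it is only practical for short strings.

```java
// Illustrative sketch; RecursiveEditDistance is a hypothetical name.
final class RecursiveEditDistance {
    static int distance(String s, String t) {
        return d(s, s.length(), t, t.length());
    }

    // Edit distance between the first i characters of s and the first j characters of t.
    private static int d(String s, int i, String t, int j) {
        if (i == 0) return j; // one string is empty: j insertions
        if (j == 0) return i; // one string is empty: i deletions
        int substitute = d(s, i - 1, t, j - 1)
                + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
        int delete = d(s, i - 1, t, j) + 1; // drop the last character of s
        int insert = d(s, i, t, j - 1) + 1; // append the last character of t
        return Math.min(substitute, Math.min(delete, insert));
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

Because every call branches into three further calls, the running time grows exponentially with the string lengths, which is the motivation for storing the subproblem results in a table.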

Next, we formalize the definition: d[i,j] is the minimum edit distance between the first i characters of string s and the first j characters of string t (characters are indexed from 1 here). Then

d[0,0] = 0

d[0,j] = j (insert j letters into the former, or delete j letters from the latter)

d[i,0] = i (delete i letters from the former, or insert i letters into the latter)

d[i,j] = min{ d[i-1,j-1] + (s[i]==t[j] ? 0 : 1), d[i-1,j] + 1, d[i,j-1] + 1 }

With the recurrence in hand, computing d[i,j] is easy. Define a two-dimensional array distance[][] to store the minimum edit distances, as in the Java code below:

package algorithms;

public class EditDistanceComputer {
    private int sweight = 1; // cost of a substitution
    private int iweight = 1; // cost of an insertion
    private int dweight = 1; // cost of a deletion

    public static void main(String[] args) {
        String s = "intention";
        String t = "execution";
        EditDistanceComputer editdc = new EditDistanceComputer();
        System.out.println(editdc.getMinEditDistance(s, t));
    }

    public void setWeight(int sweight, int iweight, int dweight) {
        this.sweight = sweight;
        this.iweight = iweight;
        this.dweight = dweight;
    }

    public int getMinEditDistance(String s, String t) {
        int m = s.length();
        int n = t.length();
        // Allocate an (m+1) x (n+1) matrix.
        int[][] distance = new int[m + 1][n + 1];
        // Boundary values: i deletions or j insertions (scaled by the weights).
        for (int i = 0; i <= m; i++) {
            distance[i][0] = i * dweight;
        }
        for (int j = 0; j <= n; j++) {
            distance[0][j] = j * iweight;
        }
        // Fill the rest of the matrix with the recurrence.
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                distance[i][j] = getMin(
                        distance[i - 1][j] + dweight,
                        distance[i][j - 1] + iweight,
                        distance[i - 1][j - 1]
                                + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : sweight));
            }
        }
        printMatrix(distance, m + 1, n + 1);
        return distance[m][n];
    }

    // Prints the matrix with the last row on top.
    public void printMatrix(int[][] matrix, int rownum, int colnum) {
        for (int i = rownum - 1; i >= 0; i--) {
            for (int j = 0; j < colnum; j++) {
                System.out.print(matrix[i][j] + " ");
            }
            System.out.println();
        }
    }

    private int getMin(int a, int b, int c) {
        return (a < b) ? (a < c ? a : c) : (b < c ? b : c);
    }
}

The time complexity of the algorithm is O(m*n), and the space complexity is O(m*n).

Besides computing the minimum edit distance, how do we find the distance[i][j] operations that actually convert s into t? Looking at the matrix above, each value distance[i][j] is reached along a path; if we record this path, we can backtrack along it and recover the corresponding operations. Next we define a backtracking matrix, backtrace[][], that records each operation.


package algorithms;

// L: left (insertion), D: down (deletion), S: slant (substitution or match)
enum TraceOperator { L, D, S }

public class EditAlignment {
    private int sweight = 1; // cost of a substitution
    private int iweight = 1; // cost of an insertion
    private int dweight = 1; // cost of a deletion
    private int m = 0;
    private int n = 0;
    int[][] distance = null;
    TraceOperator[][] backtrace = null;
    StringBuffer sb = null;

    public static void main(String[] args) {
        String s = "intention";
        String t = "execution";
        EditAlignment editdc = new EditAlignment();
        System.out.println(editdc.getMinEditDistance(s, t));
        editdc.alignment(s, t);
    }

    public void setWeight(int sweight, int iweight, int dweight) {
        this.sweight = sweight;
        this.iweight = iweight;
        this.dweight = dweight;
    }

    // Replays the recorded operations from the end of s backwards, turning s into t.
    // getMinEditDistance(s, t) must be called first.
    public void alignment(final String s, final String t) {
        sb = new StringBuffer(s);
        System.out.println("Source string before alignment: " + sb);
        if (backtrace == null || distance == null) {
            System.exit(-1);
        }
        int i = m;
        int j = n;
        // backtrace[0][0] is never set, so the loop stops at the origin.
        while (backtrace[i][j] != null) {
            switch (backtrace[i][j]) {
                case S:
                    if (s.charAt(i - 1) != t.charAt(j - 1)) {
                        sb.replace(i - 1, i, "" + t.charAt(j - 1));
                        printStep(t);
                    }
                    i--;
                    j--;
                    break;
                case L:
                    sb.insert(i, t.charAt(j - 1));
                    j--;
                    printStep(t);
                    break;
                case D:
                    sb.deleteCharAt(i - 1);
                    i--;
                    printStep(t);
                    break;
                default:
                    System.exit(-1);
            }
        }
        System.out.println("Source string after alignment: " + sb);
    }

    private void printStep(String t) {
        System.out.println("source string: " + sb);
        System.out.println("target string: " + t);
        System.out.println("---------------------------------------");
    }

    public int getMinEditDistance(final String s, final String t) {
        // Rows (vertical axis) correspond to s, columns (horizontal axis) to t.
        m = s.length();
        n = t.length();
        distance = new int[m + 1][n + 1];
        backtrace = new TraceOperator[m + 1][n + 1];
        initMatrix(m + 1, n + 1);
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                // All operations are applied to s, the source string.
                int a = distance[i - 1][j] + dweight; // deletion
                int b = distance[i][j - 1] + iweight; // insertion
                int c = distance[i - 1][j - 1]
                        + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : sweight); // substitution
                int min = getMin(a, b, c);
                distance[i][j] = min;
                // Ties are broken in the order deletion, insertion, substitution.
                if (a == min) {
                    backtrace[i][j] = TraceOperator.D;
                } else if (b == min) {
                    backtrace[i][j] = TraceOperator.L;
                } else {
                    backtrace[i][j] = TraceOperator.S;
                }
            }
        }
        printMatrix(distance, m + 1, n + 1);
        System.out.println();
        printMatrix(backtrace, m + 1, n + 1);
        return distance[m][n];
    }

    public void printMatrix(int[][] matrix, int rownum, int colnum) {
        for (int i = rownum - 1; i >= 0; i--) {
            for (int j = 0; j < colnum; j++) {
                System.out.print(matrix[i][j] + " ");
            }
            System.out.println();
        }
    }

    public void printMatrix(TraceOperator[][] matrix, int rownum, int colnum) {
        for (int i = rownum - 1; i >= 0; i--) {
            for (int j = 0; j < colnum; j++) {
                System.out.print(matrix[i][j] + " ");
            }
            System.out.println();
        }
    }

    // Boundary values: the first column is all deletions, the first row all insertions.
    private void initMatrix(int x, int y) {
        for (int i = 0; i < x; i++) {
            distance[i][0] = i * dweight;
        }
        for (int j = 0; j < y; j++) {
            distance[0][j] = j * iweight;
        }
        for (int i = 1; i < x; i++) {
            backtrace[i][0] = TraceOperator.D;
        }
        for (int j = 1; j < y; j++) {
            backtrace[0][j] = TraceOperator.L;
        }
    }

    private int getMin(int a, int b, int c) {
        return (a < b) ? (a < c ? a : c) : (b < c ? b : c);
    }
}

A first improvement to the algorithm:

The original algorithm creates a matrix of size |s|*|t|. If the strings are each about 1,000 characters long, the matrix has 1M elements; if the strings are 10,000 characters long, the matrix has 100M elements. If each element is a 32-bit integer (Int32), the matrix takes 4*100M = 400 MB.

The improved version of the algorithm uses only 2*|t| elements: 2*10,000*4 bytes = 80 KB.
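The arithmetic behind these figures can be checked with a short sketch. The class name MemoryEstimate and its methods are hypothetical, and the estimate counts only the int elements themselves (including the extra row and column for the empty prefix), ignoring the JVM's per-array overhead.

```java
// Illustrative sketch; MemoryEstimate is a hypothetical name.
final class MemoryEstimate {
    // Bytes for a full (m+1) x (n+1) matrix of 32-bit integers.
    static long fullMatrixBytes(long m, long n) {
        return (m + 1) * (n + 1) * Integer.BYTES;
    }

    // Bytes for two rows of n+1 32-bit integers each.
    static long twoRowBytes(long n) {
        return 2 * (n + 1) * Integer.BYTES;
    }

    public static void main(String[] args) {
        long m = 10_000;
        long n = 10_000;
        System.out.println("full matrix: " + fullMatrixBytes(m, n) + " bytes");
        System.out.println("two rows:    " + twoRowBytes(n) + " bytes");
    }
}
```

For two 10,000-character strings this gives roughly 400 MB for the full matrix against roughly 80 KB for two rows.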

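From the recurrence d[i,j] = min{ d[i-1,j-1] + (s[i]==t[j] ? 0 : 1), d[i-1,j] + 1, d[i,j-1] + 1 }, computing d[i,j] needs only the diagonal entry [i-1,j-1], the entry to the left [i,j-1], and the entry directly above [i-1,j]. It is therefore enough to keep the current row and the previous row, that is, two vectors cur_row[] and pre_row[]. The following is the improved code: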

package algorithms;

public class EditDistanceComputer1 {
    private int sweight = 1; // cost of a substitution
    private int iweight = 1; // cost of an insertion
    private int dweight = 1; // cost of a deletion

    public static void main(String[] args) {
        String s = "GUMBO";
        String t = "GAMBOL";
        EditDistanceComputer1 editdc = new EditDistanceComputer1();
        System.out.println(editdc.getMinEditDistance(s, t));
    }

    public void setWeight(int sweight, int iweight, int dweight) {
        this.sweight = sweight;
        this.iweight = iweight;
        this.dweight = dweight;
    }

    public int getMinEditDistance(String s, String t) {
        int m = s.length();
        int n = t.length();
        int[] cur_row = new int[n + 1];
        int[] pre_row = new int[n + 1];
        int[] temp = null;
        // Row 0: j insertions turn the empty string into the first j characters of t.
        for (int j = 0; j <= n; j++) {
            pre_row[j] = j * iweight;
        }
        for (int i = 1; i <= m; i++) {
            // Column 0: i deletions turn the first i characters of s into the empty string.
            cur_row[0] = i * dweight;
            for (int j = 1; j <= n; j++) {
                cur_row[j] = getMin(
                        pre_row[j] + dweight,
                        cur_row[j - 1] + iweight,
                        pre_row[j - 1]
                                + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : sweight));
            }
            printVector(cur_row, n + 1);
            printVector(pre_row, n + 1);
            System.out.println();
            // Swap the current and previous rows for the next iteration;
            // the old pre_row storage is reused as the new cur_row.
            temp = cur_row;
            cur_row = pre_row;
            pre_row = temp;
        }
        // After the final swap, pre_row holds the last computed row.
        return pre_row[n];
    }

    public void printVector(int[] vector, int colnum) {
        for (int j = 0; j < colnum; j++) {
            System.out.print(vector[j] + " ");
        }
        System.out.println();
    }

    private int getMin(int a, int b, int c) {
        return (a < b) ? (a < c ? a : c) : (b < c ? b : c);
    }
}

The improved algorithm has time complexity O(m*n) and space complexity O(2*n), i.e. O(n).

[Figure: explanation of the calculation process above]

Finally, this algorithm has time complexity O(m*n) and space complexity O(2*n). There are other algorithms that are more efficient in some applications, but this article stops here for now. Reportedly, the most efficient algorithm at present is one company's trade secret. The minimum edit distance has a very wide range of applications: from small things such as code auto-completion and code hints in the IDEs we use every day and keyword suggestions in search engines, to larger ones such as remote screen updates, compressed transmission of strings, and distance measurement in machine recognition, all rely on this principle.
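As a hedged illustration of the keyword-suggestion use case, the sketch below picks the dictionary word closest to a given input. The class name ClosestWord, the method suggest, and the tiny dictionary are all hypothetical, and a real search engine or IDE would use far more elaborate techniques; the distance function is a unit-cost version of the two-row algorithm described above.

```java
// Illustrative sketch; ClosestWord, suggest and the sample dictionary are hypothetical.
final class ClosestWord {
    // Unit-cost edit distance using two rows.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] cur = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[t.length()];
    }

    // Returns the dictionary word with the smallest edit distance to input
    // (the first such word on ties), or null if the dictionary is empty.
    static String suggest(String input, String[] dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : dictionary) {
            int d = distance(input, word);
            if (d < bestDistance) {
                bestDistance = d;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] dictionary = {"kitten", "sitting", "sort"};
        System.out.println(suggest("sittin", dictionary)); // prints sitting
    }
}
```

This linear scan computes one distance per dictionary word, so it is only suitable for small dictionaries.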



Copyright notice: This is an original article by the blog author and may not be reproduced without the author's permission.
