In machine translation, a similarity ratio between sentences is sometimes needed, and computing it requires the edit distance. Most material available online uses the character as the smallest unit of the edit distance calculation. For sentences, however, the word is often a more reasonable smallest unit. With dynamic programming, the edit distance is easy to compute.
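To make the difference between the two units concrete, here is a minimal sketch. The helper name sequenceEditDistance is hypothetical and not part of the original code; it assumes only that the sequence type offers size() and operator[], so the same function can count character edits on a std::string and word edits on a std::vector<std::string>.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): edit distance over any
// indexable sequence. With std::string the unit is a character; with
// std::vector<std::string> the unit is a word.
template <typename Seq>
int sequenceEditDistance(const Seq& a, const Seq& b) {
    const std::size_t m = a.size();
    const std::size_t n = b.size();
    std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1, 0));

    // Boundary: turning a prefix into an empty sequence costs its length.
    for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            // Substitution costs 1 only when the two units differ.
            const int substitution =
                d[i - 1][j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            d[i][j] = std::min({d[i - 1][j] + 1,   // delete
                                d[i][j - 1] + 1,   // insert
                                substitution});
        }
    }
    return d[m][n];
}
```

Called on the strings "I love cats" and "I love dogs" this returns 3, because three characters differ. Called on the word lists {"I", "love", "cats"} and {"I", "love", "dogs"} it returns 1, because only one word differs.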
The boundary of the recursion needs attention. When the sentence to be translated has been deleted down to length 0, the possible matches are 0-0, 0-1, ..., 0-n (n is the number of words in the candidate sentence), and the number of edits in each case is known: 0, 1, ..., n. Similarly, when the candidate sentence has been deleted down to length 0, the possible matches are 0-0, 1-0, 2-0, ..., m-0 (m is the number of words in the sentence to be translated), and the number of edits is 0, 1, ..., m.
So the algorithm begins by initializing the array with these known edit counts.
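The boundary values, and the recurrence they feed, can be written compactly. The notation is introduced here for illustration: d(i, j) is the edit distance between the first i words of the sentence to be translated and the first j words of the candidate, and p_i and c_j are the i-th and j-th words of the two sentences.

```latex
\begin{aligned}
d(0, j) &= j, \qquad 0 \le j \le n \\
d(i, 0) &= i, \qquad 0 \le i \le m \\
d(i, j) &= \min
\begin{cases}
d(i-1, j) + 1 & \text{(delete)} \\
d(i, j-1) + 1 & \text{(insert)} \\
d(i-1, j-1) + \delta_{ij} & \text{(replace or match)}
\end{cases}
\qquad
\delta_{ij} =
\begin{cases}
0 & \text{if } p_i = c_j \\
1 & \text{otherwise}
\end{cases}
\end{aligned}
```

The answer for the two full sentences is d(m, n).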
In the code below I wrote both a recursive and a non-recursive version.
EditDistanceReverse is the recursive version and EditDistance is the non-recursive one. For Chinese sentences, it is best to run a custom word segmentation algorithm first, for the reason given at the beginning: the word, not the character, is the more reasonable unit.
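For text that already separates words with spaces, the segmentation step can be as simple as a whitespace split. The sketch below uses a hypothetical helper named splitWords, which is not part of the original code. It is not suitable for Chinese, where words are not delimited by spaces and a real segmentation algorithm is needed.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): split a sentence into
// words on whitespace. Runs of spaces and leading/trailing spaces are skipped.
std::vector<std::string> splitWords(const std::string& sentence) {
    std::istringstream stream(sentence);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}
```

The resulting vector can be handed to the array-based functions in this post through words.data() and static_cast<int>(words.size()).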
// EditDistance.cpp: Defines the entry point of the console application.
#include <iostream>
#include <string>
using namespace std;

// Maximum number of words per sentence. The array bound in the original
// listing is not recoverable; 100 is an assumed value.
const int MAX_WORDS = 100;
int dist[MAX_WORDS + 1][MAX_WORDS + 1];

// Non-recursive version: fills the table row by row.
int EditDistance(const string pattern[], int pattern_size, const string candidate[], int candidate_size)
{
    int i, j;
    // Because the 0-0 edit distance is 0, 0-1 is 1, and so on
    for (i = 0; i <= candidate_size; i++) dist[0][i] = i;
    for (i = 0; i <= pattern_size; i++) dist[i][0] = i;
    for (i = 1; i <= pattern_size; i++) {
        for (j = 1; j <= candidate_size; j++) {
            int r1 = dist[i - 1][j] + 1; // Delete
            int r2 = dist[i][j - 1] + 1; // Insert
            int delta = (pattern[i - 1] != candidate[j - 1] ? 1 : 0);
            int r3 = dist[i - 1][j - 1] + delta; // Replace, or match when delta is 0
            int best = r1;
            best = best > r2 ? r2 : best;
            best = best > r3 ? r3 : best;
            dist[i][j] = best;
        }
    }
    return dist[pattern_size][candidate_size];
}

// Recursive core with memoization. A cell holding -1 has not been computed yet.
int EditDistanceCore(const string pattern[], int pattern_size, const string candidate[], int candidate_size)
{
    // Boundary cells were filled in by EditDistanceReverse.
    if (pattern_size == 0 || candidate_size == 0)
        return dist[pattern_size][candidate_size];

    if (dist[pattern_size - 1][candidate_size] < 0)
        dist[pattern_size - 1][candidate_size] = EditDistanceCore(pattern, pattern_size - 1, candidate, candidate_size);
    int r1 = dist[pattern_size - 1][candidate_size] + 1; // Delete

    if (dist[pattern_size][candidate_size - 1] < 0)
        dist[pattern_size][candidate_size - 1] = EditDistanceCore(pattern, pattern_size, candidate, candidate_size - 1);
    int r2 = dist[pattern_size][candidate_size - 1] + 1; // Insert

    int delta = (pattern[pattern_size - 1] != candidate[candidate_size - 1] ? 1 : 0);
    if (dist[pattern_size - 1][candidate_size - 1] < 0)
        dist[pattern_size - 1][candidate_size - 1] = EditDistanceCore(pattern, pattern_size - 1, candidate, candidate_size - 1);
    int r3 = dist[pattern_size - 1][candidate_size - 1] + delta; // Replace, or match when delta is 0

    int best = r1;
    best = best > r2 ? r2 : best;
    best = best > r3 ? r3 : best;
    dist[pattern_size][candidate_size] = best;
    return best;
}

// Recursive version: initializes the table, then calls the recursive core.
int EditDistanceReverse(const string pattern[], int pattern_size, const string candidate[], int candidate_size)
{
    int i, j;
    // Mark interior cells as not computed. 0 is a valid distance,
    // so it cannot serve as the "not computed" marker.
    for (i = 1; i <= pattern_size; i++)
        for (j = 1; j <= candidate_size; j++)
            dist[i][j] = -1;
    // Because the 0-0 edit distance is 0, 0-1 is 1, and so on
    for (i = 0; i <= candidate_size; i++) dist[0][i] = i;
    for (i = 0; i <= pattern_size; i++) dist[i][0] = i;
    return EditDistanceCore(pattern, pattern_size, candidate, candidate_size);
}

int main()
{
    string pattern[] = {"I", "Love", "Baby", "Me2"};
    string candidate[] = {"I", "Love", "Me"};
    cout << EditDistanceReverse(pattern, sizeof(pattern) / sizeof(string), candidate, sizeof(candidate) / sizeof(string));
    // cout << EditDistance(pattern, sizeof(pattern) / sizeof(string), candidate, sizeof(candidate) / sizeof(string));
    return 0;
}