Minimum edit distance
Solution 1:
Determine the similarity of two different strings.
A set of operations is defined for turning two different strings into the same string. The operations are as follows:
1. Modify a character (for example, replace "a" with "b")
2. Add a character (for example, change "abdd" to "aebdd")
3. Delete a character (for example, change "travelling" to "traveling")
Definition: the minimum number of such operations required is the distance between the two strings, and the similarity is the reciprocal of "distance + 1".
Recursive thinking is used to reduce the problem to a smaller one of the same kind.
If the first characters of the two strings are the same, for example A = xabcdae and B = xfdfa, we only need to calculate the distance between A[2,...,7] = abcdae and B[2,...,5] = fdfa.
If the first characters of the two strings are different, one of the following operations can be performed:
1. Delete the first character of string A, then calculate the distance between A[2,...,lenA] and B[1,...,lenB].
2. Delete the first character of string B, then calculate the distance between A[1,...,lenA] and B[2,...,lenB].
3. Modify the first character of string A to the first character of string B, then calculate the distance between A[2,...,lenA] and B[2,...,lenB].
4. Modify the first character of string B to the first character of string A, then calculate the distance between A[2,...,lenA] and B[2,...,lenB].
5. Add the first character of string B before the first character of string A, then calculate the distance between A[1,...,lenA] and B[2,...,lenB].
6. Add the first character of string A before the first character of string B, then calculate the distance between A[2,...,lenA] and B[1,...,lenB].
We do not care what the two strings look like after they become equal, so the six operations above can be merged into three:
1. After one operation, turn A[2,...,lenA] and B[1,...,lenB] into the same string.
2. After one operation, turn A[1,...,lenA] and B[2,...,lenB] into the same string.
3. After one operation, turn A[2,...,lenA] and B[2,...,lenB] into the same string.
Pseudocode
int calculatestringdistance(string stra, int pabegin, int paend,
                            string strb, int pbbegin, int pbend)
{
    // Recursive termination condition: one of the strings is exhausted,
    // so the distance is the number of characters left in the other one.
    if (pabegin > paend) {
        if (pbbegin > pbend)
            return 0;
        else
            return pbend - pbbegin + 1;
    }
    if (pbbegin > pbend) {
        if (pabegin > paend)
            return 0;
        else
            return paend - pabegin + 1;
    }

    // Algorithm core
    if (stra[pabegin] == strb[pbbegin]) {
        // Same first character: no operation needed, move on in both strings.
        return calculatestringdistance(stra, pabegin + 1, paend, strb, pbbegin + 1, pbend);
    } else {
        int t1 = calculatestringdistance(stra, pabegin, paend, strb, pbbegin + 1, pbend);
        int t2 = calculatestringdistance(stra, pabegin + 1, paend, strb, pbbegin, pbend);
        int t3 = calculatestringdistance(stra, pabegin + 1, paend, strb, pbbegin + 1, pbend);
        return minvalue(t1, t2, t3) + 1;  // minvalue: smallest of the three values
    }
}
Simplified Version
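#include <string.h>

#define MAX 100
char s1[MAX];
char s2[MAX];

// Smallest of three values.
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

// Evaluate the string distance.
int distance(char *s1, char *s2)
{
    int len1 = (int)strlen(s1);
    int len2 = (int)strlen(s2);
    if (len1 == 0 || len2 == 0)
        return len1 > len2 ? len1 : len2;  // the rest must all be added or deleted
    if (s1[0] == s2[0])
        return distance(s1 + 1, s2 + 1);
    else
        return min3(distance(s1, s2 + 1),
                    distance(s1 + 1, s2),
                    distance(s1 + 1, s2 + 1)) + 1;
}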
What improvements do the above algorithms need?
In the recursive algorithm, the same subproblems are computed repeatedly.
To avoid this type of repeated computing, we can consider saving the computed solution of the subproblem.
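As a minimal sketch of saving subproblem solutions, the recursion above can be given a memo table. The names distanceFrom and memoizedDistance, and the use of std::string and std::vector, are assumptions made for illustration and do not come from the original code. Every operation costs 1, as in Solution 1.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from the original text): distance between the
// suffixes a[i..] and b[j..], with every operation costing 1.
// memo[i][j] == -1 means "not computed yet".
static int distanceFrom(const std::string& a, const std::string& b,
                        std::size_t i, std::size_t j,
                        std::vector<std::vector<int>>& memo) {
    assert(i <= a.size() && j <= b.size());
    // Termination: one string is exhausted, the rest of the other is the cost.
    if (i == a.size()) return static_cast<int>(b.size() - j);
    if (j == b.size()) return static_cast<int>(a.size() - i);
    // Reuse a saved solution instead of recursing again.
    if (memo[i][j] != -1) return memo[i][j];

    int result;
    if (a[i] == b[j]) {
        result = distanceFrom(a, b, i + 1, j + 1, memo);
    } else {
        int t1 = distanceFrom(a, b, i, j + 1, memo);
        int t2 = distanceFrom(a, b, i + 1, j, memo);
        int t3 = distanceFrom(a, b, i + 1, j + 1, memo);
        result = std::min({t1, t2, t3}) + 1;
    }
    memo[i][j] = result;
    return result;
}

// Hypothetical entry point: allocates the memo table and starts at (0, 0).
int memoizedDistance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> memo(a.size(), std::vector<int>(b.size(), -1));
    return distanceFrom(a, b, 0, 0, memo);
}
```

With this table each pair (i, j) is solved at most once, so the work is bounded by the product of the two string lengths. For example, memoizedDistance("abdd", "aebdd") returns 1.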
Solving with dynamic programming
Part 1: http://blog.csdn.net/huaweidong2011/article/details/7727482
This article describes edit distance (for the definition of edit distance, see the body). It covers five aspects:
- Defining minimum edit distance
- Computing minimum edit distance
- Backtrace for computing alignments
- Weighted minimum edit distance
- Minimum edit distance in computational biology
1. Definition of minimum edit distance
Edit distance is used to measure the similarity between two strings. The minimum edit distance between two strings is the minimum number of editing operations (insert, delete, and replace) needed to convert one string into the other. In an alignment of the two strings, D (deletion) indicates the delete operation, S (substitution) indicates the replace operation, and I (insertion) indicates the insert operation. (For the sake of simplicity, edit distance is abbreviated as ED.) For the strings intention and execution, if the cost of each operation is 1, then ED = 5. If the cost of the S operation is 2 (Levenshtein distance), then ED = 8.

2. Computing minimum edit distance
How do we find the minimum edit distance of two strings? There are many ways (or "paths") to convert one string into another. We know the start state (the first string), the end state (the other string), and the basic operations (insert, delete, and replace); what we are looking for is the shortest path.
Take two strings: X of length n and Y of length m. We define D(i, j) as the edit distance between the first i characters of X, X[1...i], and the first j characters of Y, Y[1...j], where 0 <= i <= n and 0 <= j <= m. The distance between X and Y is therefore D(n, m).
To calculate the final D(n, m), we first calculate D(i, j) for small i and j (the base cases are D(i, 0) = i and D(0, j) = j), and then calculate the larger D(i, j) from the preceding results until D(n, m) is obtained. The algorithm uses "Levenshtein distance", that is, the replacement cost is 2. D(i, j) is the smallest of the following candidate values:
1. D(i-1, j) + 1;
2. D(i, j-1) + 1;
3. D(i-1, j-1) + 2 (when the newly added characters of X and Y differ and a replacement is needed) or + 0 (when the newly added characters of both strings are the same).
The ED table for the strings intention and execution is filled in step by step in this way. The 8 circled in red in the upper-right corner of the table is the minimum ED between the two strings.

3. Backtrace for computing alignments
We obtained the edit distance in the previous section, but the distance alone is not always enough. Sometimes we also need to match the characters of the two strings one by one (some letters will correspond to a "blank"), which can be done by backtracing through the ED computation. From the previous section we know that D(i, j) has three possible sources: D(i-1, j), D(i, j-1), or D(i-1, j-1). Adding arrows to the table records how the whole table was computed (the shaded cells mark only one path; the path to the final result is not unique, because the value of a cell may be obtainable from the left, from below, or from the lower left). Starting from the top-right corner of the table, you can extract a path (not unique) by following the arrows. The arrows along the path show which operation (insert, delete, or replace) was used at each step. The shaded part of the table in the upper-right corner contains only one path, and we can easily see that the last four letters are the same. This is not always the case, though: the six shaded cells in the middle also contain only one path, but they correspond to the letters e and c respectively.
The idea behind implementing the path search is very simple: define a pointer for each cell, whose value is LEFT, DOWN, or DIAG (I don't know why it is called a pointer).
More generally, every non-decreasing path from (0, 0) to (m, n) corresponds to an alignment of the two strings, and an optimal alignment is composed of optimal sub-alignments.
A brief look at the performance of the algorithm:
Time: O(nm)
Space: O(nm)
Backtrace: O(n + m)

4. Weighted minimum edit distance
ED can also be weighted, because some letters are more likely than others to be mistyped. In a confusion matrix of spelling errors, a larger value indicates a higher probability of one letter being written as another. For example, a is often mistakenly written as e, i, o, or u. Keyboard layout also affects which mistakes are made. The weighted minimum edit distance algorithm defines separate weights for the del, ins, and sub operations. In "Levenshtein distance", the cost of del and ins is 1, and the cost of sub is 2.

5. Minimum edit distance in computational biology
This section describes the application of minimum edit distance in computational biology. For example, when comparing two genome sequences, we want to align the two sequences and study the functions of different gene fragments. In natural language processing we talk about minimum distance and weights; in computational biology we talk about maximum similarity and scores. An important algorithm in computational biology is Needleman-Wunsch.
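The table filling and the backtrace described in sections 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the course's implementation: the names Alignment and alignStrings are hypothetical, '*' is used for a blank, and instead of storing a LEFT/DOWN/DIAG pointer in every cell the sketch re-derives the pointer from the finished table during the backtrace. The substitution cost defaults to 2, as in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result type: total cost plus the two aligned rows,
// where '*' stands for a blank.
struct Alignment {
    int cost;
    std::string top;     // x with blanks inserted
    std::string bottom;  // y with blanks inserted
};

Alignment alignStrings(const std::string& x, const std::string& y, int subCost = 2) {
    const std::size_t n = x.size(), m = y.size();
    // d[i][j]: edit distance between the first i chars of x and first j of y.
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= m; ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            int diag = d[i - 1][j - 1] + (x[i - 1] == y[j - 1] ? 0 : subCost);
            d[i][j] = std::min({d[i - 1][j] + 1, d[i][j - 1] + 1, diag});
        }
    }

    // Backtrace from (n, m) to (0, 0); at most n + m steps.
    Alignment out{d[n][m], "", ""};
    std::size_t i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            d[i][j] == d[i - 1][j - 1] + (x[i - 1] == y[j - 1] ? 0 : subCost)) {
            out.top += x[i - 1];       // match or substitution
            out.bottom += y[j - 1];
            --i; --j;
        } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
            out.top += x[i - 1];       // deletion from x
            out.bottom += '*';
            --i;
        } else {
            out.top += '*';            // insertion of a char of y
            out.bottom += y[j - 1];
            --j;
        }
    }
    // The rows were built back to front.
    std::reverse(out.top.begin(), out.top.end());
    std::reverse(out.bottom.begin(), out.bottom.end());
    assert(out.top.size() == out.bottom.size());
    return out;
}
```

Calling alignStrings("intention", "execution") gives a cost of 8. Because several paths can be optimal, the exact rows returned depend on the tie-breaking order chosen in the backtrace (diagonal first here).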
Part 2 code: http://blog.csdn.net/abcjennifer/article/details/7735272
In natural language processing (NLP), a basic problem is to find the minimal edit distance of two strings, also known as Levenshtein distance. The edit distance computation here was inspired by the article in reference 2. This article uses dynamic programming to obtain the minimal edit distance between two strings. The dynamic programming equation is described below.
1. What is minimal edit distance?
Simply put, it is the minimum number of steps needed to convert a string s1 into another string s2 using only insert, delete, and substitute operations. Anyone familiar with algorithms will recognize this as a dynamic programming problem.
In fact, a replacement operation can be equivalent to a delete + insert operation, so we define the weight as follows:
I (insert): 1
D (delete): 1
S (substitute): 2
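To see the effect of these weights, here is a minimal sketch in which the three costs are parameters. The names EditCosts and weightedDistance are hypothetical and chosen only for this illustration. With the weights above (1, 1, 2) the strings intention and execution are at distance 8, while with all costs equal to 1 the distance is 5.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical cost bundle; the field names are illustrative.
struct EditCosts {
    int insert;
    int remove;
    int substitute;
};

// Edit distance from s1 to s2 under the given operation costs.
int weightedDistance(const std::string& s1, const std::string& s2,
                     const EditCosts& cost) {
    assert(cost.insert >= 0 && cost.remove >= 0 && cost.substitute >= 0);
    const std::size_t n = s1.size(), m = s2.size();
    // d[i][j]: cost of turning the first i chars of s1 into the first j of s2.
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) d[i][0] = d[i - 1][0] + cost.remove;
    for (std::size_t j = 1; j <= m; ++j) d[0][j] = d[0][j - 1] + cost.insert;
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            int del = d[i - 1][j] + cost.remove;
            int ins = d[i][j - 1] + cost.insert;
            int sub = d[i - 1][j - 1] +
                      (s1[i - 1] == s2[j - 1] ? 0 : cost.substitute);
            d[i][j] = std::min({del, ins, sub});
        }
    }
    return d[n][m];
}
```

For example, weightedDistance("intention", "execution", EditCosts{1, 1, 2}) returns 8 and weightedDistance("intention", "execution", EditCosts{1, 1, 1}) returns 5.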
2. Example:
intention -> execution
Minimal edit distance:
delete i; n -> e; t -> x; insert c; n -> u; total cost = 1 + 2 + 2 + 1 + 2 = 8
3. Calculate minimal edit distance dynamically
For details, see the comments in the code. Here, d[i, j] is the minimal edit distance between the first i characters of s1 and the first j characters of s2.
The table is updated dynamically using the three operations:
d(i, j) = min{ d(i-1, j) + 1, d(i, j-1) + 1, d(i-1, j-1) + (s1[i] == s2[j] ? 0 : 2) }; the three terms correspond to D, I, and S respectively.
/**
 * Mineditdis.cpp
 *
 * @created on: Jul 10, 2012
 * @author: Sophia
 * @description: Calculate the minimal edit distance between 2 strings
 *
 * Method: DP (dynamic programming)
 * d[i, j]: the minimal edit distance between the first i characters of s1
 *          and the first j characters of s2
 * DP formulation: d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + flag),
 * where flag = 2 if s1[i] != s2[j], else flag = 0.
 */
#include <iostream>
#include <stdio.h>
#include <string.h>
using namespace std;

#define N 100
#define INF 100000000
#define min(a, b) ((a) < (b) ? (a) : (b))

int dis[N][N];
char s1[N], s2[N];
int n, m;  // lengths of the two strings

int main()
{
    int i, j;
    // Read pairs of strings (at most N - 1 characters each) until input ends.
    while (scanf("%99s%99s", s1, s2) == 2) {
        n = strlen(s1);
        m = strlen(s2);
        for (i = 0; i <= n; i++)
            for (j = 0; j <= m; j++)
                dis[i][j] = INF;
        dis[0][0] = 0;
        for (i = 0; i <= n; i++)
            for (j = 0; j <= m; j++) {
                if (i > 0)
                    dis[i][j] = min(dis[i][j], dis[i - 1][j] + 1);  // delete
                if (j > 0)
                    dis[i][j] = min(dis[i][j], dis[i][j - 1] + 1);  // insert
                // substitute (cost 2) or match (cost 0)
                if (i > 0 && j > 0) {
                    if (s1[i - 1] != s2[j - 1])
                        dis[i][j] = min(dis[i][j], dis[i - 1][j - 1] + 2);
                    else
                        dis[i][j] = min(dis[i][j], dis[i - 1][j - 1]);
                }
            }
        printf("Min edit distance is: %d\n", dis[n][m]);
    }
    return 0;
}
Running result:
intention
execution
Min edit distance is: 8
abc
acbfbcd
Min edit distance is: 4
zrqsophia
aihposqrz
Min edit distance is: 16
Reference:
1. https://www.coursera.org/course/nlp
2. http://blog.csdn.net/huaweidong2011/article/details/7727482