Edit Distance

Source: Internet
Author: User

Original address: http://blog.csdn.net/ihavenoidea/article/details/526764


Reference: Dynamic Programming Algorithm (DPA) for Edit-Distance

Edit Distance

The difference between two strings s1 and s2 can be measured by their minimum edit distance.

The edit distance is the minimum number of the following operations needed to turn s1 and s2 into the same string:

1. Change a character ch1 into another character ch2

2. Delete a character

3. Insert a character

For example, take s1 = "12433" and s2 = "1233".

Inserting a 4 in the middle of s2 gives "12433", which is identical to s1.

That is, d(s1, s2) = 1 (one insert operation).

Properties of the edit distance

The edit distance of two strings s1+ch1 and s2+ch2 (each a prefix followed by its last character) has these properties:

1. d(s1, "") = d("", s1) = |s1|, and for single characters d("ch1", "ch2") = (ch1 == ch2 ? 0 : 1)

2. d(s1+ch1, s2+ch2) = min(d(s1, s2) + (ch1 == ch2 ? 0 : 1),

                          d(s1+ch1, s2) + 1,

                          d(s1, s2+ch2) + 1)

The first property is obvious.

The second property follows from the three operations that define the distance.

The only operations that can involve the last characters ch1 and ch2 are:

1. Change ch1 into ch2 (free if they are already equal): d = d(s1, s2) + (ch1 == ch2 ? 0 : 1)

2. Delete ch1 from s1+ch1: d = 1 + d(s1, s2+ch2)

3. Insert ch2 after s1+ch1: d = 1 + d(s1+ch1, s2)

Operations 2 and 3 are equivalent to the following operations on the other string:

_2. Append ch1 to s2+ch2: d = 1 + d(s1, s2+ch2)

_3. Delete ch2 from s2+ch2: d = 1 + d(s1+ch1, s2)

Taking the minimum of these three costs gives property 2.
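Properties 1 and 2 translate almost word for word into a recursive function. The following is a minimal Python sketch for illustration only; the function name `edit_distance_naive` is an assumption and does not come from the original article. It is only practical for very short strings.

```python
def edit_distance_naive(s1: str, s2: str) -> int:
    """Edit distance by direct recursion on properties 1 and 2."""
    # Property 1: if one string is empty, the distance is the other's length.
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    # Property 2: compare the last characters ch1 = s1[-1] and ch2 = s2[-1].
    change_cost = 0 if s1[-1] == s2[-1] else 1
    return min(
        edit_distance_naive(s1[:-1], s2[:-1]) + change_cost,  # change ch1 into ch2
        edit_distance_naive(s1[:-1], s2) + 1,                 # delete ch1
        edit_distance_naive(s1, s2[:-1]) + 1,                 # insert ch2
    )


print(edit_distance_naive("12433", "1233"))  # prints 1
```

Each call spawns three more calls, so the running time grows very quickly with the string length.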

Complexity analysis

Property 2 shows that the computation has a tree structure: a call on a pair of lengths (n, n) spawns three further calls, on (n-1, n-1), (n-1, n) and (n, n-1). Here each level is labeled with the lengths of the strings currently being compared, and both strings are assumed to have length n.

The complexity of the direct recursion is therefore exponential (at least on the order of 3^n calls, since every call branches three ways), which is unbearably slow for longer strings.

Analysis: in this structure, pairs such as (n-1, n-1), (n-1, n-2), ... occur many times. In other words, the problem has overlapping subproblems. Together with the optimal substructure expressed by property 2, it meets the basic requirements of dynamic programming. Dynamic programming can therefore reduce the complexity to polynomial.
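One way to see the effect of overlapping subproblems is to cache the result for each pair of prefix lengths, so that every pair is computed only once. The sketch below is an illustration, not the article's code: the name `edit_distance_memo` is an assumption, and it uses `functools.lru_cache` from the Python standard library as the cache.

```python
from functools import lru_cache


def edit_distance_memo(s1: str, s2: str) -> tuple[int, int]:
    """Return (edit distance, number of distinct subproblems solved)."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes s1[:i] and s2[:j].
        if i == 0:
            return j
        if j == 0:
            return i
        change_cost = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(
            d(i - 1, j - 1) + change_cost,  # change (or match) the last characters
            d(i - 1, j) + 1,                # delete the last character of s1[:i]
            d(i, j - 1) + 1,                # insert the last character of s2[:j]
        )

    distance = d(len(s1), len(s2))
    # The cache holds one entry per distinct (i, j) pair that was evaluated.
    return distance, d.cache_info().currsize


distance, subproblems = edit_distance_memo("12433", "1233")
print(distance, subproblems)
```

The number of cached entries can never exceed (|s1| + 1) * (|s2| + 1), which is the polynomial bound that dynamic programming exploits. Because this version is still recursive, very long strings may hit Python's recursion limit.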

Dynamic programming solution

First, to avoid recomputing subproblems, add two auxiliary arrays.

One. Save the subproblem results.

m[0..|s1|, 0..|s2|], where m[i, j] is the edit distance between the first i characters of s1 and the first j characters of s2

Two. Save the edit distance between individual characters.

e[1..|s1|, 1..|s2|], where e[i, j] = (s1[i] == s2[j] ? 0 : 1)

Three. The new recurrence

From property 1:

m[0, 0] = 0;

m[i, 0] = i;

m[0, j] = j;

From property 2:

m[i, j] = min(m[i-1, j-1] + e[i, j],

              m[i, j-1] + 1,

              m[i-1, j] + 1);

Complexity of the dynamic programming solution

The new formula shows that the calculation is a double loop:

for i = 1 .. |s1|

    for j = 1 .. |s2|

        m[i][j] = ...

So the complexity is O(|s1| * |s2|); if both lengths are about n, the complexity is O(n^2).
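The table-filling procedure can be sketched in Python as follows. This is an illustrative sketch under a few assumptions: the function name `edit_distance` is made up here, Python strings are indexed from 0 (so character i of the text is `s1[i - 1]`), and the character cost e[i, j] is computed on the fly instead of being stored in a second array.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Edit distance using a full (|s1|+1) x (|s2|+1) table m."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]

    # Property 1: distance to the empty string is the prefix length.
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j

    # Property 2: fill the rest of the table row by row.
    for i in range(1, rows):
        for j in range(1, cols):
            e = 0 if s1[i - 1] == s2[j - 1] else 1  # e[i, j] from the text
            m[i][j] = min(
                m[i - 1][j - 1] + e,  # change (or match)
                m[i][j - 1] + 1,      # insert
                m[i - 1][j] + 1,      # delete
            )
    return m[rows - 1][cols - 1]


print(edit_distance("12433", "1233"))  # prints 1
```

The two nested loops visit each table cell once and do constant work per cell, which matches the O(|s1| * |s2|) bound.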

Reference: Dynamic programming Algorithm (DPA) for edit-distance

The words 'computer' and 'commuter' are very similar, and a change of just one letter, p->m, will change the first word into the second. The word 'sport' can be changed into 'sort' by the deletion of the 'p', or equivalently, 'sort' can be changed into 'sport' by the insertion of 'p'.

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of: change a letter, insert a letter, or delete a letter.

The following recurrence relations define the edit distance, d(s1, s2), of two strings s1 and s2:

    d('', '') = 0
    d(s, '')  = d('', s) = |s|   -- i.e. length of s
    d(s1+ch1, s2+ch2) = min(d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
                            d(s1+ch1, s2) + 1,
                            d(s1, s2+ch2) + 1)

The first two rules above are obviously true, so it is only necessary to consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1, s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1, s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1, s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1, s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.

The recurrence relations imply an obvious ternary-recursive routine. This isn't a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.

Examination of the relations reveals that d(s1, s2) depends only on d(s1', s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.

A two-dimensional matrix, m[0..|s1|, 0..|s2|], is used to hold the edit distance values:

    m[i,j] = d(s1[1..i], s2[1..j])

    m[0,0] = 0
    m[i,0] = i,  i=1..|s1|
    m[0,j] = j,  j=1..|s2|

    m[i,j] = min(m[i-1,j-1] + if s1[i]=s2[j] then 0 else 1 fi,
                 m[i-1, j] + 1,
                 m[i, j-1] + 1),  i=1..|s1|, j=1..|s2|

m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a 'similar' length, about 'n' say, this complexity is O(n^2), much better than exponential!
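Because each row depends only on the previous one, the computation can keep just two rows in memory instead of the whole matrix. The Python sketch below illustrates that idea; the function name `edit_distance_two_rows` is an assumption and the reference itself gives no code for it.

```python
def edit_distance_two_rows(s1: str, s2: str) -> int:
    """Edit distance keeping only the previous and current rows of m."""
    # Row m[0,]: the distance from the empty prefix of s1 to each prefix of s2.
    previous = list(range(len(s2) + 1))

    for i, ch1 in enumerate(s1, start=1):
        current = [i]  # m[i, 0] = i
        for j, ch2 in enumerate(s2, start=1):
            change_cost = 0 if ch1 == ch2 else 1
            current.append(min(
                previous[j - 1] + change_cost,  # m[i-1, j-1] + change cost
                previous[j] + 1,                # m[i-1, j] + 1
                current[j - 1] + 1,             # m[i, j-1] + 1
            ))
        previous = current

    return previous[-1]


print(edit_distance_two_rows("sport", "sort"))  # prints 1
```

The running time is still O(|s1|*|s2|), but the extra memory drops to O(|s2|). The trade-off is that the full matrix is no longer available, so the actual sequence of edits cannot be traced back afterwards.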

