Algorithm Series 4: String Similarity

Source: Internet
Author: User

 

We define the similarity between two strings in terms of the cost of converting one string into the other (the conversion method may not be unique): the higher the conversion cost, the lower the similarity between the two strings. Take the two strings "Snowy" and "Sunny" as an example. Here are two ways to convert "Snowy" into "Sunny":

 

Transformation 1:

S - N O W Y

S U N N - Y

Cost = 3 (insert U, replace O, delete W)

 

Transformation 2:

- S N O W - Y

S U N - - N Y

Cost = 5 (insert S, replace S, delete O, delete W, insert N)
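The cost of a transformation can be read off column by column from its alignment. The following is a minimal sketch under that reading, not code from the original article: the function name alignmentCost and the use of '-' as the gap character are assumptions made for illustration. A column costs 1 whenever its two characters differ (a gap in the source row is an insert, a gap in the target row is a delete, anything else is a replace).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Counts the edits implied by a gapped alignment.
// Both rows must have the same length; '-' marks a gap.
int alignmentCost(const std::string& sourceRow, const std::string& targetRow)
{
    if (sourceRow.size() != targetRow.size())
        throw std::invalid_argument("alignment rows must have equal length");

    int cost = 0;
    for (std::size_t k = 0; k < sourceRow.size(); ++k)
    {
        // Gap in the source row = insert, gap in the target row = delete,
        // two different letters = replace. Each costs one edit.
        if (sourceRow[k] != targetRow[k])
            ++cost;
    }
    return cost;
}
```

Called as alignmentCost("S-NOWY", "SUNN-Y") it yields 3, and as alignmentCost("-SNOW-Y", "SUN--NY") it yields 5, matching the two transformations of "Snowy" into "Sunny".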

Problem analysis

We can interpret this similarity as the minimum number of edits needed to change one string (the source) into another (the target) using the edit operations insert, delete, and replace; that is, the edit distance between the two strings. Can we give an algorithm that computes the edit distance between any two strings?

The example in the problem statement shows that there is more than one way to convert one string into another by inserting, deleting, and replacing characters, and that different ways need different numbers of edits. The number of edits used by the way that needs the fewest is the edit distance we are looking for. Clearly, this is an optimization problem.

For an optimization problem, the greedy method is a natural first candidate. However, this problem is clearly a multi-stage decision problem: to transform the source string into the target string with the fewest edits, an edit has to be chosen at every stage of the processing. The stages are not isolated. Each one is affected both by the decisions already made and by the decisions still available later, so the greedy method cannot reach the overall optimum by simply stacking up the locally optimal decision of each stage.
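To make the point concrete, here is one hypothetical greedy strategy (an illustration only, not taken from the original article): walk both strings in lockstep, replace on every mismatch, and insert or delete whatever is left over at the end. The name greedyCost is made up for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical greedy strategy: decide each position in isolation.
// A mismatch is always "fixed" by a replacement, and the leftover
// tail of the longer string is inserted or deleted.
int greedyCost(const std::string& source, const std::string& target)
{
    const std::size_t common = std::min(source.size(), target.size());

    int cost = 0;
    for (std::size_t k = 0; k < common; ++k)
    {
        if (source[k] != target[k])
            ++cost; // locally cheapest choice: replace and move on
    }

    // Leftover characters of the longer string.
    cost += static_cast<int>(std::max(source.size(), target.size()) - common);
    return cost;
}
```

For "abc" and "bc" this strategy reports 3 (two replacements and one deletion), although deleting the leading 'a' converts the source in a single edit. The early replacements look cheap locally but rule out the better overall alignment.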

Dynamic Programming

For multi-stage decision problems, dynamic programming (DP) should be considered first. Dynamic programming is a commonly used method for solving multi-stage decision optimization problems [1], and it is also the most abstract of all the approaches. There are two keys to solving a problem with dynamic programming: defining the optimal substructure of the subproblems [Note 1], and determining how the optimal solutions of the subproblems are combined. Defining the optimal substructure means decomposing the problem into subproblems, which can be done either recursively or iteratively. The basic principle is to divide the problem into m subproblems and determine how the optimal solution of each subproblem depends on n (n < m) other subproblems. The way the optimal solutions are combined is the recurrence between an optimal decision sequence and its subsequences, including both the recurrence for the subproblem optima and the boundary values. If we can find an optimal substructure for a problem (including the relationships between subproblems) and a way to combine the subproblem optima, and if the optimal solution of every subproblem has no aftereffect [Note 2], then the problem can be solved by dynamic programming.

Take this problem as an example. Assume the source string has n characters and the target string has m characters. If the problem is defined as finding the minimum number of edits needed to convert characters 1..n of source into characters 1..m of target (the minimum edit distance), then a subproblem can be defined as finding the minimum number of edits needed to convert characters 1..i of source into characters 1..j of target. This is the optimal substructure of the problem. Let d[i, j] denote the minimum edit distance between source[1..i] and target[1..j]. The recurrence for d[i, j] is then computed as follows:

 

If source[i] is equal to target[j], then:

 

d[i, j] = d[i-1, j-1] + 0 (recurrence 1)

 

If source[i] is not equal to target[j], compute the edit distance for each of the three strategies (insert, delete, and replace) and take the smallest:

 

d[i, j] = min(d[i, j-1] + 1, d[i-1, j] + 1, d[i-1, j-1] + 1) (recurrence 2)

 

d[i, j-1] + 1 is the minimum edit distance when a character is inserted after source[i] to match target[j].

d[i-1, j] + 1 is the minimum edit distance when source[i] is deleted.

d[i-1, j-1] + 1 is the minimum edit distance when source[i] is replaced with target[j].

 

The boundary values of d[i, j] are the edit distances when target is the empty string (m = 0) or source is the empty string (n = 0):

 

When m = 0, for all i: d[i, 0] = i

When n = 0, for all j: d[0, j] = j
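Collected in one place, the recurrences and boundary values can be restated as a single piecewise definition. This is only a restatement in LaTeX notation. It assumes that a match takes the diagonal value d[i-1, j-1], as the code in this article does.

```latex
d[i,j] =
\begin{cases}
i & \text{if } j = 0 \\
j & \text{if } i = 0 \\
d[i-1,\,j-1] & \text{if } \mathit{source}[i] = \mathit{target}[j] \\
\min\bigl(d[i,\,j-1],\; d[i-1,\,j],\; d[i-1,\,j-1]\bigr) + 1 & \text{otherwise}
\end{cases}
```

The required edit distance is d[n, m].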

 

With the optimal substructure, the recurrence for the optimal solution, and the boundary values analyzed above, it is easy to write an algorithm that computes the minimum edit distance by dynamic programming. The following code computes the minimum edit distance between two strings:

 

#include <algorithm>
#include <string>

const int max_string_len = 64; // assumed value; the original does not show this definition

/* Note: the lengths of the source and target strings must be less than
   max_string_len, the dimension of the d matrix */
int editdistance(const std::string& source, const std::string& target)
{
    std::string::size_type i, j;
    int d[max_string_len][max_string_len] = { 0 };

    // Boundary values: converting to or from the empty string
    for (i = 0; i <= source.length(); i++)
        d[i][0] = static_cast<int>(i);
    for (j = 0; j <= target.length(); j++)
        d[0][j] = static_cast<int>(j);

    for (i = 1; i <= source.length(); i++)
    {
        for (j = 1; j <= target.length(); j++)
        {
            if (source[i - 1] == target[j - 1])
            {
                d[i][j] = d[i - 1][j - 1]; // no edit needed
            }
            else
            {
                int edins = d[i][j - 1] + 1;     // insert a character into source
                int eddel = d[i - 1][j] + 1;     // delete a character from source
                int edrep = d[i - 1][j - 1] + 1; // replace the source character

                d[i][j] = std::min(std::min(edins, eddel), edrep);
            }
        }
    }

    return d[source.length()][target.length()];
}
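The d matrix holds more than the final distance: walking back from d[n][m] to d[0][0] recovers one optimal sequence of edits, such as Transformation 1 in the example. The sketch below is an illustrative extension rather than part of the original article. The function name editScript and the operation letters ('M' keep, 'R' replace, 'I' insert, 'D' delete) are assumptions, and it uses std::vector instead of a fixed-size matrix so that the string lengths are not limited.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns one optimal edit script for turning source into target.
// 'M' = keep, 'R' = replace, 'I' = insert, 'D' = delete.
std::string editScript(const std::string& source, const std::string& target)
{
    const std::size_t n = source.size();
    const std::size_t m = target.size();

    // Fill the table with the same recurrences and boundary values.
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= n; ++i)
    {
        for (std::size_t j = 1; j <= m; ++j)
        {
            if (source[i - 1] == target[j - 1])
                d[i][j] = d[i - 1][j - 1];
            else
                d[i][j] = std::min({d[i][j - 1], d[i - 1][j], d[i - 1][j - 1]}) + 1;
        }
    }

    // Walk back from d[n][m], choosing a predecessor that explains each value.
    std::string ops;
    std::size_t i = n, j = m;
    while (i > 0 || j > 0)
    {
        if (i > 0 && j > 0 && source[i - 1] == target[j - 1])
        {
            ops += 'M'; --i; --j;
        }
        else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1)
        {
            ops += 'R'; --i; --j;
        }
        else if (i > 0 && d[i][j] == d[i - 1][j] + 1)
        {
            ops += 'D'; --i;
        }
        else
        {
            ops += 'I'; --j;
        }
    }
    std::reverse(ops.begin(), ops.end());
    return ops;
}
```

When several predecessors tie, this sketch prefers replace, then delete, then insert, so it returns only one of possibly many optimal scripts. The number of letters other than 'M' always equals the edit distance.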

Exhaustive method (enumeration method)

Besides the greedy method and dynamic programming, exhaustive search is another common way to find an optimal solution, and it is easier to understand than dynamic programming. Its principle is to search the entire solution space of the problem and pick the optimal solution by comparing all possible solutions. Strictly speaking, the goal of exhaustive search is to find every valid solution of a problem; the optimal solution can be seen as a by-product of the search. The implementation depends on the problem: if the solution space is a linear structure, loops are enough; if it is a tree structure, recursion can be used. The solution space of this problem is clearly not linear, so we enumerate all solutions recursively. A recursive algorithm has to settle two things: how to decompose the problem into subproblems, and when the recursion terminates.

For this problem, the recursive decomposition works as follows. Let position i be the current character position in both the source and target strings. For each position i, we compute the edit distance of the substrings that start at position i by comparing source[i] with target[i]. If they are equal, no edit is needed at this position, and the edit distance of the substrings starting at i equals the edit distance of the substrings starting at i + 1. If they are not equal, we try inserting, deleting, and replacing a character at position i of the source string, compute the edit distance of the resulting new substrings for each, take the smallest, and add 1 to obtain the edit distance of the substrings starting at i. Each operation shifts the start of the new substrings differently. After an insertion at position i, the position in the source string stays where it is and the position in the target string moves to i + 1. After a deletion at position i, the position in the source string moves to i + 1 and the position in the target string stays where it is. After a replacement at position i, the positions in both strings move to i + 1.

With the decomposition in place, we still need the termination condition of the recursion. It is very simple here: by the definition of the subproblem above, the recursion can stop as soon as either the source or the target string is empty. At that point the edit distance is the length of whichever remaining substring is not empty. This is easy to see: if the remainder of source is not empty, those characters must be deleted to make it equal to target; if the remainder of target is not empty, those characters must be inserted into source. Either way, the edit distance is the length of the remaining substring.

From the decomposition into recursive subproblems and the termination condition, it is easy to write a recursive implementation. Recursive algorithms are usually inefficient, but they match the way people think about solving problems, and the code is concise and easy to understand. The implementation here takes only nine lines of code:

#include <algorithm>
#include <cstdlib>
#include <string>

int editdistance(const std::string& source, const std::string& target)
{
    // Termination: one string is empty, so the distance is the length of the other
    if (source.empty() || target.empty())
        return std::abs(static_cast<int>(source.length()) - static_cast<int>(target.length()));

    if (source[0] == target[0])
        return editdistance(source.substr(1), target.substr(1));

    int edins = editdistance(source, target.substr(1)) + 1;           // insert a character into source
    int eddel = editdistance(source.substr(1), target) + 1;           // delete a character from source
    int edrep = editdistance(source.substr(1), target.substr(1)) + 1; // replace the source character

    return std::min(std::min(edins, eddel), edrep);
}
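The recursion revisits the same pair of remaining substrings many times. One possible remedy, sketched here as an assumption rather than as the article's own method, is to identify each subproblem by a pair of start positions (i, j) and cache its result. The names solveFrom and editDistanceMemo are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Edit distance between source[i..] and target[j..], cached in memo.
// A value of -1 in memo means "not computed yet".
int solveFrom(const std::string& source, const std::string& target,
              std::size_t i, std::size_t j,
              std::vector<std::vector<int>>& memo)
{
    if (i == source.size())
        return static_cast<int>(target.size() - j); // insert the rest of target
    if (j == target.size())
        return static_cast<int>(source.size() - i); // delete the rest of source

    int& cached = memo[i][j];
    if (cached >= 0)
        return cached;

    if (source[i] == target[j])
    {
        cached = solveFrom(source, target, i + 1, j + 1, memo);
        return cached;
    }

    const int ins = solveFrom(source, target, i, j + 1, memo) + 1;
    const int del = solveFrom(source, target, i + 1, j, memo) + 1;
    const int rep = solveFrom(source, target, i + 1, j + 1, memo) + 1;
    cached = std::min({ins, del, rep});
    return cached;
}

int editDistanceMemo(const std::string& source, const std::string& target)
{
    std::vector<std::vector<int>> memo(
        source.size(), std::vector<int>(target.size(), -1));
    return solveFrom(source, target, 0, 0, memo);
}
```

Each (i, j) pair is now solved at most once, so the work is bounded by the size of the table, and passing indices avoids the string copies made by substr.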

Summary

In terms of time complexity, the dynamic programming method runs in O(mn) time, which is O(n^2) when both strings have length n. The exhaustive algorithm runs in O(n) time in the best case, when every call takes the (source[0] == target[0]) branch, and in O(3^n) time in the worst case. An exponential-time algorithm like this is practically unusable and has only theoretical value; it is not suitable for real engineering work.
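The gap between the best and worst case can be observed by counting calls. The sketch below repeats the recursion of the exhaustive method and adds a call counter. The counter parameter and the name countedEditDistance are additions made for illustration.

```cpp
#include <algorithm>
#include <string>

// Same recursion as the exhaustive method; calls is incremented once per
// invocation so the amount of work can be inspected afterwards.
int countedEditDistance(const std::string& source, const std::string& target,
                        long long& calls)
{
    ++calls;

    // One string is empty: the distance is the length of the other.
    if (source.empty() || target.empty())
        return static_cast<int>(source.length() + target.length());

    if (source[0] == target[0])
        return countedEditDistance(source.substr(1), target.substr(1), calls);

    const int ins = countedEditDistance(source, target.substr(1), calls) + 1;
    const int del = countedEditDistance(source.substr(1), target, calls) + 1;
    const int rep = countedEditDistance(source.substr(1), target.substr(1), calls) + 1;
    return std::min({ins, del, rep});
}
```

With identical inputs such as "aaaa" and "aaaa", every call takes the matching branch and the counter ends at 5. With inputs that share no characters, every call branches three ways: "aa" and "bb" already need 19 calls, and the count grows rapidly as the strings get longer.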

In terms of space complexity, the dynamic programming method uses O(mn) space, but this can be reduced considerably; how far depends on the algorithm. For the algorithm in this article, (recurrence 2) shows that computing d[i, j] requires only the results at rows i and i-1 and columns j and j-1. The results at other positions can be released once they have been used, so there is no need to hold the whole m x n matrix from start to finish. This kind of optimization is just a small trick in data organization, and this article does not go into the details. Interested readers can modify the editdistance() function themselves.
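One way to carry out this optimization, given here as a hedged sketch rather than as the article's own code, is to keep only the previous row and the current row of the matrix. The name editDistanceTwoRows and the use of std::vector are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Edit distance using two rows of length m + 1 instead of an (n + 1) x (m + 1) matrix.
int editDistanceTwoRows(const std::string& source, const std::string& target)
{
    const std::size_t m = target.size();
    std::vector<int> prev(m + 1), curr(m + 1);

    // Row i = 0: converting the empty source prefix needs j insertions.
    for (std::size_t j = 0; j <= m; ++j)
        prev[j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= source.size(); ++i)
    {
        curr[0] = static_cast<int>(i); // column j = 0: i deletions
        for (std::size_t j = 1; j <= m; ++j)
        {
            if (source[i - 1] == target[j - 1])
                curr[j] = prev[j - 1];
            else
                curr[j] = std::min({curr[j - 1], prev[j], prev[j - 1]}) + 1;
        }
        std::swap(prev, curr); // the current row becomes the previous row
    }

    return prev[m];
}
```

The result is the same as with the full matrix, while the memory use drops to two rows of m + 1 integers. The trade-off is that the full table is no longer available, so the sequence of edits cannot be recovered afterwards.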

 

Note:

[Note 1] Optimal substructure: For a multi-stage decision problem, if every subsequence of the optimal decision sequence is itself optimal at its stage, and the decision sequence has no aftereffect, then this way of making decisions can be regarded as an optimal substructure.

 

[Note 2] No aftereffect: The optimal solution of a dynamic programming problem is usually a decision sequence made up of a series of optimal decisions, and the optimal substructures are subsequences of that sequence. Each time an optimal decision is made for a subsequence, a new optimal decision (sub)sequence is produced. If a decision is affected only by the current optimal decision subsequence, and not by the new optimal decision subsequence that the decision itself may produce, the optimal decision is said to have no aftereffect.

 

References:

[1] Competition in algorithm art and Informatics. Liu Rujia and Huang Liang. Tsinghua University Press, 2003.

http://en.wikipedia.org/wiki/Edit_distance

http://en.wikipedia.org/wiki/Levenshtein_distance
