The following describes the edit distance:
1. The distance between string A and string B is the minimum number of steps needed to change string A into string B, where only insert, delete, and replace operations are allowed.
2. First, let's look at the properties of the Levenshtein distance. Let d(x, y) denote the Levenshtein distance from string x to string y. Obviously:
1. d(x, y) = 0 if and only if x = y (the Levenshtein distance is 0 <=> the strings are equal)
2. d(x, y) = d(y, x) (the minimum number of steps from x to y equals the minimum number of steps from y to x)
3. d(x, y) + d(y, z) >= d(x, z) (the number of steps needed to change x into z does not exceed the number of steps needed to change x into y and then y into z)
Reposted from matrix67:
Besides classic problems such as string matching, finding palindromic substrings, and finding repeated substrings, we also run into other unusual string problems in practice. For example, sometimes we need to know how similar two given strings are. In 1965 the Russian scientist Vladimir Levenshtein gave a precise definition of this kind of string similarity, called the Levenshtein distance, which we usually call the "edit distance".
For example, going from fame to gate takes two steps (two substitutions), and going from game to acm takes three steps (delete g and e, then add c). Levenshtein gave a general method for computing the edit distance, which is the classic dynamic programming problem everyone is very familiar with.
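As a quick reference, here is a minimal sketch of that dynamic program in C (the function name edit_distance and the array bound are my own choices, mirroring the contest solution further below): dp[i][j] holds the distance between the first i characters of a and the first j characters of b.

#include <stdio.h>
#include <string.h>

/* Minimal sketch: dp[i][j] = edit distance between the first i chars of a
   and the first j chars of b. The 2010 bound mirrors the solution code below. */
int edit_distance(const char *a, const char *b)
{
    static int dp[2010][2010];
    int la = strlen(a), lb = strlen(b);
    for (int i = 0; i <= la; i++) dp[i][0] = i;      /* delete the whole prefix of a */
    for (int j = 0; j <= lb; j++) dp[0][j] = j;      /* insert the whole prefix of b */
    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int best = dp[i - 1][j] + 1;                              /* delete a[i-1] */
            if (dp[i][j - 1] + 1 < best) best = dp[i][j - 1] + 1;     /* insert b[j-1] */
            int sub = dp[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1); /* match / replace */
            if (sub < best) best = sub;
            dp[i][j] = best;
        }
    return dp[la][lb];
}

int main(void)
{
    /* Reproduces the examples above: prints 2 and 3. */
    printf("%d\n", edit_distance("fame", "gate"));
    printf("%d\n", edit_distance("game", "acm"));
    return 0;
}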
This concept is very important in natural language processing. For example, we can build a semi-automatic proofreading system on top of it: find every word in an article that is not in the dictionary, and then, for each such word, list the dictionary words whose Levenshtein distance to it is at most some number n, so that the user can pick the correct one. n is usually taken to be 2 or 3, or, better, a quarter of the word length, and so on. The idea is fine, but the efficiency of the algorithm becomes a new challenge: checking whether a word is in the dictionary is easy, just build a trie; but how do we quickly find the most similar words in the dictionary? This looks hard to solve, because a Levenshtein edit can happen at any position in a word, and it seems impossible to do without scanning the whole dictionary. Yet much of today's software has a spell-check feature that proposes corrections very quickly. How do they do it? The BK-tree, proposed by Burkhard and Keller in 1973, solves this problem effectively. This data structure is powerful: it solves a seemingly impossible problem, and its principle is very simple.
The last property is called the triangle inequality: just as in a triangle, the sum of two sides is never less than the third side. Define a binary "distance function" on the elements of a set; if the distance function satisfies the three properties above, the set is called a "metric space". Our three-dimensional space is a typical metric space, with the straight-line distance between points as its distance function. There are many other metrics, such as the Manhattan distance, the shortest-path distance in graph theory, and the Levenshtein distance discussed here. Just as the disjoint-set (union-find) structure applies to any equivalence relation, the BK-tree can be used in any metric space.
The building process is similar to that of a trie. First, pick a word as the root (say, game). When inserting a word, compute the Levenshtein distance between the word and the root: if this distance value appears at this node for the first time, create a new child node; otherwise, recurse down the corresponding edge. For example, we insert the word fame, whose distance to game is 1, so we create a new child and connect it with an edge labelled 1. Next we insert gain, whose distance to game is 2, so it goes under the edge labelled 2. Next we insert gate, whose distance to game is 1, so we recurse along the edge labelled 1 into the subtree that holds fame; the distance between gate and fame is 2, so gate is placed under the fame node, with the edge labelled 2.
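A minimal sketch of this construction in C (the struct name bknode, the helper names, and the fixed size bounds are my own assumptions; edit_distance is the function sketched earlier): each node stores a word and an array of children indexed by edit distance.

#include <stdlib.h>
#include <string.h>

#define MAXWORD 32     /* assumed maximum word length */
#define MAXDIST 32     /* assumed upper bound on the distance between two words */

struct bknode {
    char word[MAXWORD];
    struct bknode *child[MAXDIST];   /* child[d] holds words at distance d from this word */
};

int edit_distance(const char *a, const char *b);   /* the DP sketched above */

struct bknode *bk_new(const char *w)
{
    struct bknode *t = calloc(1, sizeof *t);
    strcpy(t->word, w);
    return t;
}

/* Insert w under root: compute d = d(root, w); if no child hangs on edge d,
   create one there, otherwise recurse into that subtree. */
void bk_insert(struct bknode *root, const char *w)
{
    int d = edit_distance(root->word, w);
    if (d == 0)                          /* the word is already in the tree */
        return;
    if (root->child[d] == NULL)
        root->child[d] = bk_new(w);
    else
        bk_insert(root->child[d], w);
}

With this sketch, starting from a root built with bk_new("game"), the calls bk_insert(root, "fame"), bk_insert(root, "gain"), and bk_insert(root, "gate") reproduce the tree described above.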
The query operation is remarkably convenient. Suppose we want to return all words whose distance to the misspelled word is at most n, and let d be the distance between the misspelled word and the word at the root. Then we only need to recurse into the subtrees hanging on edges labelled in the range d - n to d + n. Since n is usually very small, many subtrees can be excluded at every node we compare against.
For example, suppose we enter gaie and the program finds that it is not in the dictionary. Now we want all words in the dictionary whose distance to gaie is at most 1. We first compare gaie with the root of the tree, game, and get the distance d = 1. Because the Levenshtein distance satisfies the triangle inequality, every word whose distance to game is greater than 2 can be ruled out. For example, the subtree rooted at aim has distance 3 to game, and the distance between game and gaie is 1, so the distance from aim and its subtree to gaie is at least 2. So the program only needs to continue along the edges labelled from 1 - 1 to 1 + 1. We compute the distance between gaie and fame and find that it is 2, so we continue along the edges labelled 1 through 3. After finishing that subtree we come back to game's second child, gain. The distance between gaie and gain is 1, so we output gain and continue the recursion along the edges labelled 0 through 2 (the subtree hanging on the edge labelled 4 is also excluded; in the original figure there is no edge labelled 0)......
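The pruning rule translates directly into a short recursive search. Continuing the sketch above (same assumed structures and names), this reports every stored word within distance n of the query:

#include <stdio.h>

/* Report every word in the subtree rooted at t whose distance to q is at most n.
   By the triangle inequality, only the children on edges labelled in
   [d - n, d + n] can contain such words, so everything else is pruned. */
void bk_query(const struct bknode *t, const char *q, int n)
{
    if (t == NULL)
        return;
    int d = edit_distance(t->word, q);
    if (d <= n)
        printf("%s\n", t->word);         /* e.g. gain for the query gaie with n = 1 */
    int lo = d - n < 0 ? 0 : d - n;
    int hi = d + n > MAXDIST - 1 ? MAXDIST - 1 : d + n;
    for (int e = lo; e <= hi; e++)
        bk_query(t->child[e], q, n);
}

A call such as bk_query(root, "gaie", 1) on the example tree follows the steps walked through above.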
Practice shows that a single query traverses no more than 5% to 8% of all nodes, and two queries generally traverse no more than 17% to 25%, which is far better than brute-force enumeration. Adding some caching where appropriate and reducing the constant factor in the Levenshtein-distance computation can make the algorithm even more efficient.
Note that, for ease of handling, the string subscripts in the code below start from 1.
In the sample data the number of words happens to equal the number of queries; in this problem, however, I mixed the two up when reading the input, so although the sample answer came out right, the submission was judged wrong.
# Include "stdio. H"
# Include "string. H"
Char STR [2010] [2010], s [2010];
Int DP [2010] [2010];
# Define min (A, B) A> B? B:
Int main ()
{
Int Len, len1, K, H, r = 1, n, m, sum, P;
Int I, J;
Scanf ("% d", & K );
While (k --)
{
Scanf ("% d", & N, & M );
For (I = 0; I <n; I ++)
{
Scanf ("% s", STR [I] + 1 );
}
Printf ("case # % d: \ n", r ++ );
While (M --)
{
Scanf ("% S % d", S + 1, & H );
Len = strlen (S + 1 );
Sum = 0;
For (I = 0; I <n; I ++)
{
Len1 = strlen (STR [I] + 1 );
For (j = 0; j <= Len; j ++)
DP [0] [J] = J;
For (j = 0; j <= len1; j ++)
DP [J] [0] = J;
For (j = 1; j <= len1; j ++)
{
For (P = 1; P <= Len; P ++)
{
DP [J] [p] = min (DP [J] [P-1] + 1, DP [J-1] [p] + 1 );
DP [J] [p] = min (DP [J] [p], DP [J-1] [P-1] + (STR [I] [J] = s [p]? 0: 1 ));
}
}
If (DP [len1] [Len] <= H)
Sum ++;
}
Printf ("% d \ n", sum );
}
}
Return 0;
}