How do spell checking and word suggestion over a massive dictionary actually work?

Objective

In everyday applications we run into quite a few situations like these:

    • When writing a document, the tool automatically suggests a similar, correctly spelled word when we misspell one;
    • When using the Sogou input method, we can still get the Chinese characters we want even after typing the wrong pinyin;
    • When typing into a search engine, the drop-down box automatically lists queries similar to our input;
    • And so on.

How is this feature implemented? What algorithms does it use? This article introduces an algorithm that can accomplish the task.

Problem description

In fact, all of these problems can be reduced to the same one: for a given input string T, find, within a pre-prepared set of pattern strings Q, the subset of pattern strings that are similar to T.

How do we prepare this set of pattern strings? It can be obtained through mechanisms such as data mining.

The next question is how to quickly find the strings similar to the input string within this set. Usually we use the minimum edit distance to measure the similarity of two strings.

For example, for the input string T we might limit the number of errors to at most 2; that is, within the pre-prepared pattern set, find all strings whose edit distance from T is less than or equal to 2.
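To make the similarity measure concrete, here is a minimal sketch of the standard dynamic-programming (Levenshtein) edit distance; the function name is my own and this is just one straightforward way to compute it:

```python
def edit_distance(p, t):
    """Standard Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning p into t."""
    rows, cols = len(p) + 1, len(t) + 1
    # d[i][j] = edit distance between p[:i] and t[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                          # delete all of p[:i]
    for j in range(cols):
        d[0][j] = j                          # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # delete p[i-1]
                          d[i][j - 1] + 1,           # insert t[j-1]
                          d[i - 1][j - 1] + cost)    # substitute, or match
    return d[-1][-1]

assert edit_distance("explore", "explain") == 3
```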

What algorithms can accomplish this task quickly?

Brute Force algorithm

Iterate over every pattern string P in the set Q, compute its minimum edit distance from the input string T, and output P if that distance is no greater than the specified error tolerance x.

    • Time complexity: O(|Q| * n * m), where n and m are the lengths of T and P. When |Q| is very large, this is very slow.
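The brute-force search is then a single loop; this sketch reuses the edit_distance function from the previous sketch, and the set Q and tolerance x are placeholders:

```python
def brute_force_search(Q, T, x):
    """Compare T against every pattern string in Q; keep those within distance x."""
    return [p for p in Q if edit_distance(p, T) <= x]

# e.g. brute_force_search(["explore", "explain", "express"], "explroe", 2)
```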

Can this algorithm be optimized? Yes.

For example, very few people get the first character wrong, so we can compute the similarity only for those pattern strings in Q whose first character matches that of the input string. This removes a considerable amount of computation and is a feasible approach.
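As a sketch, this first-character filter is just a pre-check in front of the brute-force loop (again reusing the edit_distance function assumed above):

```python
def first_char_filter_search(Q, T, x):
    """Only compute the edit distance for pattern strings whose first character
    matches T's first character; cheaper, but cannot fix a wrong first character."""
    return [p for p in Q
            if p and T and p[0] == T[0] and edit_distance(p, T) <= x]
```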

But it also has a problem: if someone really does type the first character wrong, every string the algorithm finds is wrong, and no correction is achieved.

So the first-character filtering optimization has clear limitations.

Step-by-step optimization

Let's think about the problem again. Since Q is a set of pattern strings, many of them are bound to share a common prefix. Can we use this prefix for optimization?

Optimization 1: exploit the common prefix of two words

For example, the strings explore and explain share a common prefix, which means that the first few columns of their edit matrices against any given string are identical and do not need to be recomputed (the red columns in the original figure).

The first 4 columns of the edit matrices of explore and explain are guaranteed to be identical, no matter which string we compute the edit distance against. So if we have already computed the distance between explore and some string, those 4 columns can be reused, and the edit distance between that string and explain can be computed starting directly from the 5th column.

From this we get a new algorithm for computing multi-pattern edit distances: build the pattern string set into a dictionary tree (trie) and traverse the tree depth-first, extending the edit matrix by one column at each step of the traversal. If the node we reach is a terminal node and the edit distance between T and P (the string formed by the path) is within the specified tolerance, we have found a matching string.
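Here is a minimal sketch of that trie-based traversal. The trie classes and names are my own; the key point is that each trie edge adds exactly one new column of the edit matrix, so a shared prefix is computed only once and every branch reuses its parent's column:

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.is_word = False    # True if the path to this node spells a pattern string

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def trie_search(root, T, x):
    """Return all pattern strings in the trie within edit distance x of T."""
    results = []
    first_column = list(range(len(T) + 1))   # distances of the empty prefix to every prefix of T

    def dfs(node, prefix, prev_column):
        # prev_column[j] = ed(prefix, T[:j]); its last entry is ed(prefix, T).
        if node.is_word and prev_column[-1] <= x:
            results.append(prefix)
        for ch, child in node.children.items():
            column = [prev_column[0] + 1]
            for j in range(1, len(T) + 1):
                cost = 0 if T[j - 1] == ch else 1
                column.append(min(column[j - 1] + 1,           # insert T[j-1]
                                  prev_column[j] + 1,          # delete ch
                                  prev_column[j - 1] + cost))  # substitute, or match
            dfs(child, prefix + ch, column)

    dfs(root, "", first_column)
    return results
```

For example, after inserting "explore", "explain" and "express", trie_search(root, "explroe", 2) returns ["explore"], having computed the columns for the shared prefixes "exp" and "expl" only once.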

Optimization 2: Pruning

Although using shared prefixes avoids recomputing the edit-matrix columns that pattern strings with the same prefix have in common, every node still has to be traversed. Is there a way, after descending to a certain depth, to use some condition to skip the remaining nodes of the subtree? In search algorithms this optimization is called pruning. Next we discuss how to design the pruning condition.

Re-examining the definition of edit distance, it can actually be viewed as splitting the strings P and T each into two segments and then summing the edit distances of the corresponding segments.

The strings P and T are each split into two segments, a red one and a green one. For a suitable split, the edit distance of the red parts plus the edit distance of the green parts equals the edit distance of P and T.

Some concrete examples:

    • Example 1
ed("explore", "express") = ed("explo", "exp") + ed("re", "ress") = 2 + 2 = 4
    • Example 2
ed("explore", "express") = ed("exp", "exp") + ed("lore", "ress") = 0 + 4 = 4
    • Example 3
      However, not every split achieves the edit distance; a poor split only gives an upper bound:
ed("ex", "exp") + ed("plore", "ress") = 1 + 5 = 6 > ed("explore", "express") = 4

Therefore, the minimum edit distance problem is also equivalent to finding an optimal split: for the character at position i in the string P, find the position j in T that minimizes

ed(P.prefix(i), T.prefix(j)) + ed(P.suffix(i+1), T.suffix(j+1))

and that minimum equals ed(P, T), no matter which position i we fix.
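A quick way to convince yourself of this split property is to check it by brute force, reusing the edit_distance sketch from above (in 0-based Python slices, P.prefix(i) is p[:i] and P.suffix(i+1) is p[i:]):

```python
def min_over_splits(p, t, i):
    """For a fixed split position i in p, minimize over all split positions j in t."""
    return min(edit_distance(p[:i], t[:j]) + edit_distance(p[i:], t[j:])
               for j in range(len(t) + 1))

P, T = "explore", "express"
for i in range(len(P) + 1):
    # The minimum over j equals ed(P, T) no matter which i we fix.
    assert min_over_splits(P, T, i) == edit_distance(P, T)
```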

Back to our problem: suppose we require the minimum edit distance between P and T to be at most x.

We let P[i] match only T[i-x], T[i-x+1], ..., T[i], T[i+1], ..., T[i+x], and take the smallest first-half edit distance over these positions, ed1 = min_j ed(P[1..i], T[1..j]). If ed1 is greater than x, then ed(P, T) must also be greater than x: at the optimal split, ed(P, T) = ed(P[1..i], T[1..j*]) + ed(P[i+1..], T[j*+1..]) is at least the first-half term, which is at least ed1 > x if j* falls inside this window, and greater than x anyway if it falls outside it.

Why does P[i] not need to match T[i-x-1] or any earlier position? Because ed(P.prefix(i), T.prefix(i-x-1)) > x: at least x+1 characters would have to be inserted into T.prefix(i-x-1) just to make the two lengths equal. For the same reason, P[i] does not need to match T[i+x+1] or any later position. So, by the split principle, the optimal match must fall within T[i-x] ~ T[i+x]; if the minimum edit distance over this window is already greater than x, we do not need to match P[i+1] and the characters after it.

For example, when the traversal reaches the blue node l in the figure, the string formed by the path is expl, which together with T = exist satisfies the pruning condition, so the nodes below it do not need to be traversed: no string in that subtree can have an edit distance to T of less than 2.

This gives us the pruning optimization: when the depth-first traversal reaches a node of the trie, let P be the string formed by the characters on its path and i its length. Compute the minimum of ed(P, T.prefix(i-x)), ed(P, T.prefix(i-x+1)), ..., ed(P, T.prefix(i+x)); if this minimum is greater than x, stop traversing the nodes below this subtree.
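Adding the pruning condition to the trie traversal gives something like the sketch below (same hypothetical TrieNode/insert as in the earlier sketch). Note that the window minimum is simply taken over the entries of the freshly computed column between rows i-x and i+x:

```python
def trie_search_pruned(root, T, x):
    """Trie search with pruning: abandon a subtree as soon as every entry of the
    current edit-matrix column inside the band [i-x, i+x] exceeds x."""
    results = []
    n = len(T)
    first_column = list(range(n + 1))

    def dfs(node, prefix, prev_column):
        if node.is_word and prev_column[-1] <= x:
            results.append(prefix)
        for ch, child in node.children.items():
            i = len(prefix) + 1                       # pattern prefix length after this edge
            column = [prev_column[0] + 1]
            for j in range(1, n + 1):
                cost = 0 if T[j - 1] == ch else 1
                column.append(min(column[j - 1] + 1,
                                  prev_column[j] + 1,
                                  prev_column[j - 1] + cost))
            # Pruning: if even the best prefix match ed(P[1..i], T[1..j]) over
            # j in [i-x, i+x] exceeds x, nothing below this node can match.
            lo, hi = max(0, i - x), min(n, i + x)
            # (lo > hi only happens once the prefix is already too long to match.)
            if lo > hi or min(column[lo:hi + 1]) > x:
                continue
            dfs(child, prefix + ch, column)

    dfs(root, "", first_column)
    return results
```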

In fact, this final version of the optimized algorithm comes from the paper "Error-Tolerant Finite-State Recognition with Applications to Morphological Analysis and Spelling Correction", Kemal Oflazer, 1996.

Implementation and results

The implementation needs some care, because both the pruning check and the final acceptance check can reuse the same edit matrix. A rather ugly version of the code is here: github.com/haolujun/algorithm/tree/master/muti-edit-distance

This algorithm is very efficient when the error tolerance is small. I randomly generated 100,000 pattern strings of length 5~10, then randomly generated 100 input strings T (also of length 5~10), over a character set of size 10, with x as the edit distance limit, and computed the multi-pattern edit distances. The total processing times, in ms, are as follows:

Algorithm                 x = 1   x = 2   x = 3   x = 4   x = 5   x = 6
Brute-force algorithm     21990   21990   21990   21990   21990   21990
Optimized algorithm          97     922    4248   11361   20097   28000

When the tolerance is small, the optimized algorithm beats the brute-force algorithm outright, and in real applications x generally takes very small values, which suits the optimized algorithm well.

As x grows, the efficiency of the optimized algorithm drops and it eventually becomes slower than the brute-force algorithm, because of the optimized algorithm's higher per-step cost (recursion plus more complex checks).

So in the final application we select the algorithm according to the value of x.
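In code this final selection could be as simple as a small dispatcher; the crossover threshold below is only a guess suggested by the measurements above and would need tuning for the real data:

```python
def multi_pattern_search(trie_root, patterns, T, x, crossover=5):
    """Use the pruned trie search for small tolerances, brute force for large ones."""
    if x <= crossover:
        return trie_search_pruned(trie_root, T, x)
    return brute_force_search(patterns, T, x)
```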
