Analysis of a Baidu written examination question: spelling correction

Source: Internet
Author: User

This post gives an answer to a Baidu written examination question circulating on the Internet. Of course, it is only one person's take.




Users often make errors when typing English words, and we need to correct them. Suppose a dictionary of correct English words already exists. Design a spelling-correction program.

(1) Describe your solution to the problem;

(2) Give the main processing steps, the algorithms used, and their complexity;

(3) Describe possible improvements (in effectiveness, performance, and so on; this is an open-ended question).


An answer circulating online:

(1) ideas:

Organize the dictionary as a trie (prefix tree) and match it against the input as the user types.


(2) Process:

For each input letter:

Descend one level in the dictionary tree.

A) If the descent succeeds, continue to the end of the word and return the result;

B) If there is no match, run error correction, offer spelling suggestions, and continue with A).



1. Search for words in the dictionary

The dictionary uses a 27-way tree (26 letters plus an end-of-word marker). Each node corresponds to a letter, and lookup matches the input letter by letter. The running time is O(K) for a word of length K.
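A minimal sketch of such a dictionary tree, assuming lowercase ASCII words (the end-of-word marker is represented here as a boolean flag rather than a 27th child):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.is_word = False # end-of-word marker (the "27th" branch)

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        # One step down the tree per letter: O(K) for a word of length K.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

Lookup fails as soon as a letter has no matching child, which is exactly the point where the error-correction step below takes over.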


2. Error Correction Algorithm

Setting: when the latest input letter fails to match, an error is reported immediately, which simplifies error handling, and possible corrections are suggested dynamically:

(A) A letter is missing before the current one: search two levels down the tree from the previous node for paths that reach the current letter, and offer those as suggestions;

(B) The current letter itself is mistyped: offer the letters on adjacent keyboard keys as suggestions. (This is only a simple description; more cases could be covered.) Choose between (A) and (B) by analyzing the dictionary's characteristics and the words the user has already entered.


Complexity analysis: the algorithm's efficiency is determined mainly by the dictionary implementation and the error-correction policy.

(A) Dictionary implementations are mature, leave little room for improvement, and will not become a bottleneck;

(B) The error-correction policy should be simple and effective; in the cases above it has linear complexity.


(3) Improvement

The choice of correction policy matters most, and it can be improved with statistical learning.




Next, let's look at how Google's Chinese correction feature behaves, using a few search examples.





Keyword entered      Correction prompt
chlorine sodium      sodium chloride
chlorine sodium      (none)

As the examples above show, Google corrects errors that share the intended pronunciation but not those with a different pronunciation. That is, it corrects only homophone misspellings and offers no prompt for non-homophone ones.


Considering that Chinese input methods broadly divide into pinyin-based and shape-based methods, the error-correction task still has a long way to go. Even English is not straightforward: with handwritten input, the spaces between English words are not easy to recognize, so word segmentation may be a good approach there as well.




First, consider how to locate an error and how to correct it. For example, the statistics-based Chinese typo detection method proposed by Dr. Shi in 1992 automatically locates typos in an article. After training on a large corpus of sentences, the method obtains a word-frequency table and a character connection-strength table. Based on these two tables, a scoring function assigns a score to each suspicious single character; if the score falls below a threshold, the character is flagged as a typo. The detection rate reported in the experiments exceeded 70%.
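A toy sketch of the two-table idea (not the paper's actual scoring function): bigram counts stand in for the connection-strength table, and a character whose links to both of its neighbours are weak is flagged. The corpus and threshold are made-up values for illustration.

```python
from collections import Counter

def train(corpus):
    """Build a bigram-count table (a stand-in for connection strength)."""
    bigrams = Counter()
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            bigrams[a + b] += 1
    return bigrams

def flag_typos(sentence, bigrams, threshold=1):
    """Flag positions whose left and right connections are both weak."""
    flagged = []
    for i, ch in enumerate(sentence):
        left = bigrams[sentence[i - 1] + ch] if i > 0 else threshold
        right = (bigrams[ch + sentence[i + 1]]
                 if i < len(sentence) - 1 else threshold)
        if max(left, right) < threshold:  # weakly connected on both sides
            flagged.append(i)
    return flagged
```

With `train(["abab"])` as the model, `flag_typos("abxb", ...)` flags position 2, the character with no observed connection to either neighbour.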

In 1994, Dr. Zhang Zhaohuang proposed a method for automatically detecting and correcting typos in Chinese documents. The method replaces characters in the original text with "comprehensive near-miss sets" built from characters similar in glyph, pronunciation, meaning, or input code, generating candidate strings; a language model then scores the candidates, and the highest-scoring string is used to detect the document's typos automatically. However, the prepared near-miss sets may cause misjudgments if their coverage is insufficient or excessive; moreover, replacement against these sets produces a large number of candidate strings, which inevitably burdens the software.





Common English word-correction methods mainly involve error dictionaries, word-form distance, minimum edit distance, similar keys, skeleton keys, n-grams, rule-based techniques, dictionary-based techniques, and neural networks.

(1) Error dictionary method. Collect commonly misspelled English words from large-scale real text, record their correct spellings, and build an unambiguous error dictionary. To check a word, look it up in the error dictionary; on a hit, the word is misspelled and the correct spelling recorded for it is recommended as the correction. This method integrates detection and correction and is efficient. However, spelling errors are largely random, so it is hard to keep the error dictionary both unambiguous and comprehensive; as a result, accuracy is low and the correction effect is poor.

(2) Word-form distance method. This is an English proofreading method based on maximum string similarity and minimum string distance. The core idea is to construct a likelihood function over words. If a word is in the dictionary, it is spelled correctly; otherwise, the likelihood function is used to find the dictionary word most similar to the misspelled word as the correction candidate. The method saves storage space and captures the statistical regularities of common spelling errors; it is a form of fuzzy matching.

(3) Minimum edit distance method. Compute the minimum edit distance between the misspelled string and dictionary words to determine correction candidates. The minimum edit distance is the smallest number of edit operations (insertion, deletion, transposition, and substitution) needed to turn one string into another. A reverse minimum-edit-distance method has also been proposed: it first applies every possible single-error edit to the misspelled string to generate a candidate set, then looks each candidate up in the dictionary to see which are valid words, and offers those valid words as correction suggestions for the misspelled string.
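A standard dynamic-programming sketch of the minimum edit distance (Levenshtein distance; note that this classic version counts a transposition as two substitutions rather than one edit):

```python
def edit_distance(a, b):
    """Minimum insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # distances from a[:0] to each b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete a[i-1]
                        dp[j - 1] + 1,                  # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return dp[n]

def best_candidates(misspelled, dictionary):
    """Rank dictionary words by edit distance to the misspelled string."""
    return sorted(dictionary, key=lambda w: edit_distance(misspelled, w))
```

For example, `edit_distance("kitten", "sitting")` is 3, and `best_candidates` simply sorts the dictionary by that distance; a real system would prune with a distance bound instead of scoring every word.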

(4) Similar key method. Similar-key techniques map each string to a key, so that strings with similar spellings share the same or similar keys. After computing the key of a misspelled string, the method retrieves all words whose keys are similar and offers them as correction suggestions for the misspelled string.
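One well-known similar-key scheme is Soundex, which keys words by pronunciation; a minimal sketch:

```python
# Soundex digit classes: consonants with similar sounds share a digit.
SOUNDEX = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
           **dict.fromkeys("dt", "3"), "l": "4",
           **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    """Four-character Soundex key: first letter + up to three digits."""
    word = word.lower()
    code = word[0].upper()
    prev = SOUNDEX.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX.get(ch, "")
        if digit and digit != prev:   # skip vowels and repeated digits
            code += digit
        if ch not in "hw":            # h/w do not break a run of duplicates
            prev = digit
    return (code + "000")[:4]
```

"Robert" and "Rupert" both key to R163, so either spelling retrieves the other as a suggestion.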

(5) Skeleton key method. Build a skeleton-key dictionary in advance. When an English word is found to be in error, extract the wrong word's skeleton key and look it up in the skeleton-key dictionary; correct words sharing the same skeleton key are recommended as corrections.
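A sketch of the classic skeleton key of Pollock and Zamora: the first letter, then the remaining unique consonants in order of occurrence, then the unique vowels in order of occurrence (assuming lowercase ASCII input):

```python
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")
VOWELS = set("aeiou")

def skeleton_key(word):
    """First letter + remaining unique consonants + unique vowels."""
    word = word.lower()
    key = word[0]
    seen = {word[0]}
    for group in (CONSONANTS, VOWELS):
        for ch in word[1:]:
            if ch in group and ch not in seen:
                key += ch
                seen.add(ch)
    return key
```

The key is robust to doubled letters and many vowel errors: "chemistry" and the misspelling "chemistery" both key to "chmstryei", so the misspelling finds the correct word in the skeleton-key dictionary.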

(6) n-gram method. Based on n-gram statistics over large-scale English text, obtain a transition-probability matrix between words. When an English word is not in the dictionary, query the matrix and recommend for correction those candidate words whose transition probability exceeds a given threshold.
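A toy sketch of this idea: estimate bigram transition probabilities from a corpus, then rank correction candidates for an out-of-vocabulary word by how likely each is to follow the preceding word. The corpus and threshold are illustrative values.

```python
from collections import Counter

def bigram_model(corpus):
    """Return P(b | a) estimated from whitespace-tokenized sentences."""
    pairs, unigrams = Counter(), Counter()
    for sent in corpus:
        words = sent.split()
        unigrams.update(words)
        pairs.update(zip(words, words[1:]))
    return lambda a, b: pairs[(a, b)] / unigrams[a] if unigrams[a] else 0.0

def rank_by_context(prev_word, candidates, prob, threshold=0.1):
    """Candidates whose transition probability exceeds the threshold."""
    scored = [(prob(prev_word, c), c) for c in candidates]
    return [c for p, c in sorted(scored, reverse=True) if p > threshold]
```

Trained on the three sentences below, the model prefers "cat" over "dog" after "the" and discards "car" entirely.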

(7) Rule-based techniques. Represent common spelling-error patterns as rules that transform misspellings into valid words. For a misspelled string, apply all applicable rules and look the results up in the dictionary; each surviving result is scored using the probability estimate attached to its rule in advance, and all candidate results are ranked by this score.


Context-based text correction methods fall into three types: ① those using text features, such as glyph features, part-of-speech features, or contextual features; ② those using probabilistic and statistical features to analyze contextual cohesion; ③ those using rules or linguistic knowledge, such as grammar rules and word-collocation rules.

(1) Proofreading with the co-occurrence and collocation features of the text's context can be described as a disambiguation process. Call the word to be proofread the target word, with confusion set C = {W1, ..., Wn}, where each member is easily confused with the target word in text. Suppose C = {from, form}: whenever from or form appears, it is treated as ambiguous between the two, and the task is to decide from context which word was intended. A context-sensitive proofreading problem consists of the sentence and the word to be corrected within it. Both the Bayesian method and the Winnow-based method represent such problems as a table of active features, each capturing a particular linguistic pattern in the context of the target word. Two feature types are commonly used: context words and collocations. A context-word feature tests whether a particular word occurs within ±K words of the target; a collocation feature tests the pattern of up to F adjacent words and/or part-of-speech tags around the target. Suppose the target word's confusion set is {weather, whether}.

With K = 10 and F = 2, available features for the target word include:

① "cloudy" occurs within the 10 words before or after the target word;

② the target word is immediately followed by "to + verb".

Feature ① is a context-word feature indicating that the target word should be weather. Feature ② is a collocation feature: if the target word is followed by a "to + verb" structure, the target should be whether (for example, "I don't know whether to laugh or cry"). The main problems this method must solve are obtaining the confusion sets and representing the features of the target word's context, that is, converting the raw text of the sentence into active features.
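A sketch of the two feature types for the {weather, whether} confusion set, with K = 10 context words and a "to + verb" collocation; the small verb set is a toy stand-in for real part-of-speech tagging.

```python
def context_word_feature(tokens, i, cue, k=10):
    """Does `cue` occur within +/- k words of the target at position i?"""
    window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
    return cue in window

def collocation_feature(tokens, i,
                        verbs=frozenset({"laugh", "cry", "go", "stay"})):
    """Is the target word immediately followed by "to + verb"?"""
    return (i + 2 < len(tokens) and tokens[i + 1] == "to"
            and tokens[i + 2] in verbs)
```

In "I don't know whether to laugh or cry", the collocation feature fires for the target word, supporting whether; in "it is cloudy so the weather is bad", the context-word feature with cue "cloudy" fires, supporting weather.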

There are many proofreading methods based on word co-occurrence and collocation features; among the better ones are the Bayesian method and the Winnow method. Various n-gram models, such as long-distance n-gram and trigger-pair n-gram models, can all exploit word co-occurrence features, or feature combinations, in the target word's context. Using maximum likelihood estimation, mutual information, correlation, and similar measures, they detect errors in the text and determine correction candidates from the transition probabilities between adjacent words, thereby correcting the target word.



