A spelling checker implemented in 21 lines of Python

Source: Internet
Author: User

Introduction

When you search with Google or Baidu, the search engine usually offers a very good spelling correction for what you type. For example, if you enter speling, Google immediately suggests spelling.
Below is a simple but fully functional spelling checker implemented in 21 lines of Python.

Code

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The correct function is the entry point of the program: pass it a misspelled word and it returns the corrected spelling. For example:

>>> correct("cpoy")
'copy'
>>> correct("engilsh")
'english'
>>> correct("sruprise")
'surprise'

Besides the code, as with any machine-learning approach, we need plenty of sample data. Here the file big.txt serves as our sample data.

The principle behind it

The code above is based on Bayesian inference. In fact, the spelling correction in Google and Baidu is also built on Bayesian methods, though certainly in a more sophisticated form.
Let's start with a brief introduction to the rationale behind it; readers who already know it can skip this section.
Given a word, we try to pick the most likely correct spelling (which may be the input word itself). Sometimes the right answer is unclear (for example, should lates be corrected to late or to latest?), so we use probabilities to decide which one to suggest. From all possible correct spellings c related to the original word w, we look for the most probable one:

argmax_c P(c|w)

By Bayes' theorem, this is equivalent to

argmax_c P(w|c) P(c) / P(w)

The terms in the formulas above mean the following:

    1. P(c|w) is the probability that the user meant to type the word c, given that w was actually typed.

    2. P(w|c) is the probability that the user types w when intending to type c; for now we treat it as given.

    3. P(c) is the probability that the word c occurs in the sample data.

    4. P(w) is the probability that the word w occurs in the sample data.

P(w) is the same for every candidate c, so it does not affect which c attains the maximum, and the formula simplifies to

argmax_c P(w|c) P(c)
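To make the simplified formula concrete, here is a minimal sketch for the lates example mentioned earlier. The probabilities are made up purely for illustration (they are not measured from big.txt), and best_correction is a hypothetical helper name, not part of the article's code.

```python
# Made-up probabilities for the typed word w = "lates" (illustrative only).
P_C = {"late": 0.0004, "latest": 0.0001, "lates": 0.0000001}   # P(c)
P_W_GIVEN_C = {"late": 0.01, "latest": 0.02, "lates": 0.9}     # P(w|c)

def best_correction(p_c, p_w_given_c):
    """Return the candidate c that maximizes P(w|c) * P(c)."""
    return max(p_c, key=lambda c: p_w_given_c[c] * p_c[c])

print(best_correction(P_C, P_W_GIVEN_C))
```

With these assumed numbers the product is largest for late, so late is suggested. Dividing every product by the same P(w) would not change which candidate wins, which is why P(w) can be dropped.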

All of our code is based on this formula. Let's now walk through the implementation.

Code Analysis

Extract the words from big.txt with the words() function

def words(text): return re.findall('[a-z]+', text.lower())

re.findall('[a-z]+', ...) uses Python's regular expression module to extract every substring that matches '[a-z]+', that is, every run of letters. (Regular expressions are not covered in detail here; interested readers can consult an introduction to them.) text.lower() converts the text to lowercase, so that "The" and "the" count as the same word.
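A quick way to see what words() does is to run it on a short string. The sample sentence below is an assumption chosen for illustration; note how digits and punctuation act as separators, so an apostrophe splits a word in two.

```python
import re

def words(text):
    # Lowercase first, then keep every maximal run of the letters a-z.
    return re.findall('[a-z]+', text.lower())

sample = "The quick brown fox; the lazy dog's 2 bones."
print(words(sample))
# ['the', 'quick', 'brown', 'fox', 'the', 'lazy', 'dog', 's', 'bones']
```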

Use the train() function to count the occurrences of each word and train a suitable model

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

Here NWORDS[w] is the number of times the word w appears in the sample. What about a word that never appears in our sample? Such words get a default count of 1, which is achieved with the collections module and a lambda expression: collections.defaultdict() creates a dictionary with default values, and lambda: 1 sets the default value of every entry to 1. (For more on lambda expressions, see an introduction to lambda.)
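The effect of the default count can be checked on a tiny word list. The three-word input below is an assumption for illustration; the point is that every count starts at 1, so a word seen twice ends up with 3 and an unseen word reads as 1.

```python
import collections

def train(features):
    # Every word starts at 1, so unseen words never have a count of zero.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

model = train(["the", "cat", "the"])
print(model["the"], model["cat"])   # 3 2
print("dog" in model)               # False: membership tests do not insert
print(model["dog"])                 # 1: indexing an unseen word yields the default
```

One detail worth noting: indexing with model[w] inserts the unseen word into the dictionary, whereas the `in` test and model.get(w) do not.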

We have now handled the P(c) term of argmax_c P(w|c) P(c). Next comes P(w|c), the probability that the user types w when intending to type c. We estimate it through the "edit distance": the number of edits needed to turn one word into another. A single edit can be a deletion, a transposition (swapping two adjacent letters), an insertion, or a replacement of one letter. The following function returns the set of all words that are exactly one edit away from a given word:

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
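To get a feel for how many candidates one edit produces, the sketch below keeps the four edit lists separate and counts them. edits1_by_kind is a hypothetical helper written for this illustration; it uses the same list comprehensions as edits1 but does not merge them into a set.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1_by_kind(word):
    """Hypothetical helper: the four edit lists of edits1, kept separate."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes':    [a + b[1:] for a, b in splits if b],
        'transposes': [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1],
        'replaces':   [a + c + b[1:] for a, b in splits for c in alphabet if b],
        'inserts':    [a + c + b for a, b in splits for c in alphabet],
    }

kinds = edits1_by_kind('cpoy')
counts = {name: len(items) for name, items in kinds.items()}
print(counts)   # {'deletes': 4, 'transposes': 3, 'replaces': 104, 'inserts': 130}
print('copy' in kinds['transposes'])   # True
```

For a word of length n this gives n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, or 54n+25 strings before duplicates are removed by set().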

The literature on spelling correction reports that 80-95% of spelling errors are within edit distance 1 of the intended word. If one edit is not enough, we simply apply edits1 a second time:

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
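The following sketch shows a word that only becomes reachable with the second round of edits. It is a variation for illustration, not the article's exact code: known_edits2 here takes the vocabulary as a parameter instead of reading the global NWORDS, and VOCAB is a made-up two-word dictionary.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word, vocabulary):
    # Same idea as in the article, but the dictionary is passed in explicitly.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in vocabulary)

VOCAB = {"something", "anything"}          # made-up dictionary
print(edits1("smthing") & VOCAB)           # set(): nothing known at distance 1
print(known_edits2("smthing", VOCAB))      # {'something'}: two insertions away
```

Filtering with `if e2 in vocabulary` inside the generator matters: the unfiltered set of two-edit strings is very large, while the known subset is usually tiny.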

The edit distance may also be 0, meaning the word is already spelled correctly. The known function keeps only the words that appear in the sample data:

def known(words): return set(w for w in words if w in NWORDS)

We assume that a correction at edit distance 1 is far more probable than one at edit distance 2, and that a correction at edit distance 0 is far more probable than one at edit distance 1. The correct function below therefore takes as candidates the known words with the smallest edit distance, since their P(w|c) is the largest, and then returns the candidate with the largest P(c) as the spelling suggestion:

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
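To try the whole pipeline without big.txt, the sketch below runs the same functions on a tiny inline corpus. SAMPLE_TEXT is an assumption made up for this illustration; with real sample data the counts, and possibly the suggestions, would differ.

```python
import re, collections

# Tiny stand-in corpus (illustrative assumption; the article trains on big.txt).
SAMPLE_TEXT = "copy copy copy cope surprise english the spelling of english words"

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(SAMPLE_TEXT))
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    # An empty set is falsy, so `or` falls through to the next edit distance.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

print(correct("cpoy"))     # copy     (one transposition)
print(correct("speling"))  # spelling (one insertion)
print(correct("copx"))     # copy     (copy and cope are both one edit away; copy is more frequent)
print(correct("zzzzqq"))   # zzzzqq   (nothing known within two edits, so the input is returned)
```

The copx case shows both halves of the formula at work: the edit distance narrows the candidates to copy and cope, and the word counts, standing in for P(c), break the tie.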