A spell checker implemented in 21 lines of Python code

Last Update:2016-06-10 Source: Internet

Author: User


Introduction

When you search with Google or Baidu, the search engine usually offers a very good spelling check: for example, if you enter speling, Google immediately suggests spelling.
Here is a simple but full-featured spell checker implemented in 21 lines of Python code.

Code

    import re, collections

    def words(text): return re.findall('[a-z]+', text.lower())

    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    # big.txt is the sample corpus; open() is used in place of the Python 2-only file()
    nwords = train(words(open('big.txt').read()))

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(word):
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
        inserts    = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + transposes + replaces + inserts)

    def known_edits2(word):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

    def known(words): return set(w for w in words if w in nwords)

    def correct(word):
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
        return max(candidates, key=nwords.get)

The correct function is the entry point of the program: pass in a misspelled word and it returns the corrected spelling. For example:

    >>> correct("cpoy")
    'copy'
    >>> correct("engilsh")
    'english'
    >>> correct("sruprise")
    'surprise'

Besides the code, as with any machine learning task, we need plenty of sample data. Here big.txt serves as our sample data.

The principle behind it

The code above is based on Bayes' theorem. The spell checkers of Google and Baidu are also built on Bayesian methods, though they are certainly more complicated than this one.
Let's start with a brief introduction to the underlying theory; readers who already know it can skip this part.
Given a word, we try to pick the most likely correct spelling suggestion (which may be the entered word itself). Sometimes the answer is unclear (for example, should lates be corrected to late or to latest?), so we use probabilities to decide which one to suggest. Among all possible correct spellings c related to the original word w, we look for the most probable one:

argmax_c P(c|w)

By Bayes' theorem, this is equivalent to

argmax_c P(w|c) P(c) / P(w)

The terms in these formulas mean the following:

P(c|w) is the probability that the user meant to enter the word c, given that they entered the word w.
P(w|c) is the probability that the user enters w when they meant to enter the word c; we can treat it as given.
P(c) is the probability of the word c occurring in the sample data.
P(w) is the probability of the word w occurring in the sample data.

P(w) is the same for every candidate word c, so it does not affect which c wins, and the formula can be simplified to
argmax_c P(w|c) P(c)
All of our code is based on this formula. The following sections analyze the implementation piece by piece.
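
As a rough illustration of the formula, the sketch below scores the candidate corrections for "lates" mentioned earlier. The word counts, the error probabilities, and the names counts, error_model and best_candidate are all invented for this example; they are not taken from big.txt or from the 21-line program.

```python
# Toy illustration of argmax_c P(w|c) P(c). All numbers are made up.
counts = {"late": 300, "latest": 100, "lattes": 5}   # hypothetical corpus counts
total = float(sum(counts.values()))

# Hypothetical error model: P(w|c) for the typed word w = "lates".
error_model = {"late": 0.01, "latest": 0.02, "lattes": 0.02}

def best_candidate(candidates):
    """Return the candidate c that maximizes P(w|c) * P(c)."""
    def score(c):
        p_c = counts[c] / total        # P(c): relative frequency in the sample
        return error_model[c] * p_c    # P(w) is the same for every c, so it is dropped
    return max(candidates, key=score)

print(best_candidate(["late", "latest", "lattes"]))   # late
```

Even though "latest" has the higher assumed P(w|c) here, "late" wins because it is so much more frequent in the toy counts.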

Code Analysis

Extracting words from big.txt using the words() function

    def words(text): return re.findall('[a-z]+', text.lower())

re.findall('[a-z]+', ...) uses Python's regular expression module to extract every piece of text that matches '[a-z]+', that is, every run of letters. (Regular expressions are not covered in detail here; interested readers can look them up separately.) text.lower() converts the text to lowercase, so "The" and "the" are treated as the same word.
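
A quick way to see what words() does is to run it on a short string. The sample sentence below is made up for illustration; note that punctuation and apostrophes split words, so "isn't" becomes two tokens.

```python
import re

def words(text):
    """Lowercase the text and return every run of the letters a-z."""
    return re.findall('[a-z]+', text.lower())

sample = "The quick brown fox. THE END, isn't it?"
print(words(sample))
# ['the', 'quick', 'brown', 'fox', 'the', 'end', 'isn', 't', 'it']
```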

Using the train() function to count the occurrences of each word and train a suitable model

    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    nwords = train(words(open('big.txt').read()))

Here nwords[w] reflects how many times the word w appears in the sample (the count plus the starting value of 1). What about a word that never appears in our sample? Its count defaults to 1, which is done with the collections module and a lambda expression: collections.defaultdict() creates a dictionary with default values, and lambda: 1 makes the value for every missing key default to 1.
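
The behavior of the default value is easy to check on a tiny word list. The list below is made up for illustration. One detail worth noticing: membership tests and .get() do not create entries, while indexing a missing key does.

```python
import collections

def train(features):
    """Count each word, starting every count at the default value 1."""
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

model = train(["the", "the", "cat"])
print(model["the"])       # 3  (two occurrences plus the default 1)
print(model["cat"])       # 2  (one occurrence plus the default 1)
print("dog" in model)     # False: a membership test does not insert the key
print(model.get("dog"))   # None: .get() bypasses the default factory
print(model["dog"])       # 1  (indexing a missing key inserts the default)
```

This is why the spell checker can safely use `w in nwords` and `nwords.get` without filling the dictionary with unknown words.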

That takes care of P(c) in the formula argmax_c P(w|c) P(c). Next comes P(w|c), the probability of typing the word w when the word c was intended. It is measured by "edit distance": the number of edits needed to turn one word into another. A single edit can be a deletion, a transposition (swapping two adjacent letters), an insertion, or a replacement of one letter. The following function returns the set of all words w that can be obtained from c with one edit:

    def edits1(word):
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
        inserts    = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + transposes + replaces + inserts)
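
To make the four kinds of edit concrete, the sketch below runs edits1 on the misspellings used earlier in the article and checks that the intended word is one edit away. The bound 54n + 25 simply adds up the four lists for a word of length n (n deletions, n - 1 transpositions, 26n replacements, 26(n + 1) insertions); the set is usually a little smaller because of duplicates.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """Return the set of strings that are one edit away from word."""
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

print('copy' in edits1('cpoy'))          # True: transposition of 'p' and 'o'
print('spelling' in edits1('speling'))   # True: insertion of 'l'
print('late' in edits1('lates'))         # True: deletion of 's'
print(len(edits1('cpoy')) <= 54 * 4 + 25)  # True: at most 54n + 25 candidates
```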

Related papers report that 80-95% of spelling mistakes are within an edit distance of 1 from the intended word. If one edit is not enough, we simply apply it a second time:

    def known_edits2(word):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)
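
The sketch below tries known_edits2 on a word that is two edits away from its correction. To keep it self-contained, nwords is a small hand-written dictionary standing in for the model trained on big.txt; its contents are an assumption made for this example.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Hypothetical stand-in for the trained model; the counts are made up.
nwords = {'spelling': 10, 'spell': 7, 'copy': 5}

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    """Known words reachable by applying edits1 twice."""
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

# 'spelng' needs two insertions to reach 'spelling',
# or a deletion plus a replacement to reach 'spell'.
print(sorted(known_edits2('spelng')))   # ['spell', 'spelling']
```

Filtering by `e2 in nwords` inside the generator keeps the result small: only known words are kept, rather than every string that is two edits away.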

The edit distance can also be 0, meaning the word itself is already spelled correctly:

    def known(words): return set(w for w in words if w in nwords)

We assume that an edit distance of 1 is far more likely than 2, and that 0 is far more likely than 1. The correct function below therefore takes the known words at the smallest edit distance as candidates, since their P(w|c) is the largest, and then picks the candidate with the largest P(c) as the spelling suggestion:

    def correct(word):
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
        return max(candidates, key=nwords.get)
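
To try the whole pipeline without downloading big.txt, the sketch below trains on a short made-up sentence instead. The sample_text string is an assumption for illustration only; with a real corpus the suggestions would of course depend on that corpus.

```python
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Hypothetical tiny corpus used in place of big.txt.
sample_text = ("copy the english text and copy it again what a surprise "
               "a late surprise arrived late with the latest copy")
nwords = train(words(sample_text))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

def known(words): return set(w for w in words if w in nwords)

def correct(word):
    # Prefer distance 0, then 1, then 2; fall back to the word itself.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=nwords.get)

print(correct("cpoy"))      # copy
print(correct("engilsh"))   # english
print(correct("lates"))     # late ('late' is more frequent than 'latest' here)
print(correct("qqqqqqqq"))  # qqqqqqqq (nothing known within two edits)
```

The chain of `or` expressions relies on an empty set being false, so a larger edit distance is only tried when the smaller one yields no known word.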

That is the whole of this article; I hope it helps you in learning Python programming.



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
