21 lines of Python code to implement the spelling checker

Last Update:2018-07-18 Source: Internet

Author: User

This 21-line Python program is a simple but complete spelling checker. Interested readers can use it as a reference.

Introduction

When you search with Google or Baidu, the search engine offers excellent spelling correction. For example, if you enter speling, Google immediately suggests spelling.
The following is a simple but complete spelling checker implemented in 21 lines of Python code.

Code

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The correct function is the entry point of the program. Given a misspelled word, it returns the corrected word. For example:

>>> correct("cpoy")
'copy'
>>> correct("engilsh")
'english'
>>> correct("sruprise")
'surprise'

Besides this code, the program needs training data, as in any machine learning task: big.txt serves as our sample data.

Principles

The code above is based on Bayes' theorem. In fact, Google and Baidu also implement spelling correction with Bayesian methods, although their implementations must be much more complex than this one.
First, let's briefly introduce the principles behind it. If you already know them, you can skip this section.
Given a word, we try to select the most likely correct spelling (which may be the entered word itself). Sometimes the choice is unclear (for example, should lates be corrected to late or latest?), so we use probability to decide which one to suggest. Among all possible corrections c, we look for the one that is most probable given the original word w:

argmax_c P(c|w)

By Bayes' theorem, this is equivalent to:

argmax_c P(w|c) P(c) / P(w)

The terms in the formula above have the following meanings:

P(c|w) is the probability that the user meant to type the word c, given that the word w was typed.
P(w|c) is the probability that the user types w when the intended word is c. For now it can be treated as given.
P(c) is the probability that the word c appears in the sample data.
P(w) is the probability that the word w appears in the sample data.

P(w) is the same for every candidate word c, so it does not affect which c wins and can be dropped. The formula becomes:
argmax_c P(w|c) P(c)
All of our code is based on this formula. The implementation is analyzed below.
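
To make the formula concrete, here is a minimal sketch that scores the two candidates for lates mentioned earlier. The probabilities are invented purely for illustration (they are not measured from big.txt), and the names p_c, p_w_given_c and best_correction are hypothetical helpers, not part of the 21-line program.

```python
# Illustrative, made-up numbers; they are NOT measured from big.txt.
p_c = {"late": 0.0004, "latest": 0.0001}        # P(c): how common each candidate is
p_w_given_c = {"late": 0.01, "latest": 0.02}    # P("lates" | c): how likely the typo is

def best_correction(candidates):
    # argmax over c of P(w|c) * P(c).
    # P(w) is left out because it is identical for every candidate.
    return max(candidates, key=lambda c: p_w_given_c[c] * p_c[c])

print(best_correction(["late", "latest"]))
```

With these assumed numbers, late wins: its typo probability is lower, but it is common enough in the sample that the product P(w|c) P(c) is still larger.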

Code Analysis

Use the words() function to extract the words from big.txt:

def words(text): return re.findall('[a-z]+', text.lower())

re.findall('[a-z]+', ...) uses Python's regular expression module to extract every match of '[a-z]+', that is, every run of lowercase letters. (Regular expressions are not covered in detail here; interested readers can consult an introduction to regular expressions.) text.lower() converts the text to lowercase, so "the" and "The" are treated as the same word.
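
As a quick illustration of the behavior described above, the following sketch runs words() on a short sample string (the sample sentence is made up for this example). Note that anything outside a-z, including apostrophes, splits a word.

```python
import re

def words(text):
    # Lowercase first, then keep every maximal run of the letters a-z.
    return re.findall('[a-z]+', text.lower())

print(words("The cat's hat, the END."))
# ['the', 'cat', 's', 'hat', 'the', 'end']
```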

Use the train () function to calculate the number of occurrences of each word and then train a suitable model.

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

In this way, NWORDS[w] reflects how many times the word w appears in the sample. What if a word never appears in our sample? The solution is to give every word a default count of 1, which is done with the collections module and a lambda expression: collections.defaultdict() creates a dictionary with default values, and lambda: 1 makes that default 1. As a result, a word seen n times is stored with the count n + 1.
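
The default-count behavior can be checked on a tiny word list. This is a small sketch with a made-up three-word input rather than big.txt; it also shows that membership tests and .get() do not create entries, while an indexed lookup does.

```python
import collections

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

model = train(["the", "cat", "the"])
print(model["the"])       # 3: two occurrences plus the default of 1
print(model["cat"])       # 2: one occurrence plus the default of 1
print("dog" in model)     # False: a membership test does not create an entry
print(model.get("dog"))   # None: .get() bypasses the default factory
print(model["dog"])       # 1: an indexed lookup inserts "dog" with the default
print("dog" in model)     # True: the previous line added the key
```

This is why the program tests candidates with `in NWORDS` and ranks them with NWORDS.get: neither operation adds unknown words to the model.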

We have now handled P(c) in the formula argmax_c P(w|c) P(c). Next comes P(w|c), the probability that the user meant to type the word c but typed w instead. We measure it with the edit distance, that is, the number of edits needed to turn one word into another. A single edit can be a deletion, a transposition (swapping two adjacent letters), an insertion, or a replacement of one letter. The following function returns the set of all words w that are one edit away from c:

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
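
To see how many candidates one edit produces, the sketch below returns the four lists separately instead of merging them into a set. The helper name edit_lists is hypothetical; the list comprehensions are the same as in edits1. For a word of length n there are n deletions, n - 1 transpositions, 26n replacements and 26(n + 1) insertions, 54n + 25 strings in total before duplicates are removed.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_lists(word):
    # Same comprehensions as edits1, but returned separately for inspection.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return deletes, transposes, replaces, inserts

word = 'speling'
deletes, transposes, replaces, inserts = edit_lists(word)
print(len(deletes), len(transposes), len(replaces), len(inserts))
# 7 6 182 208
print('spelling' in inserts)
# True
```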

According to the literature, 80 to 95% of spelling errors are only one edit away from the intended word. If one edit is not enough, we apply a second one:

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
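
The following sketch shows the two-edit search on a toy vocabulary. The three-word NWORDS dictionary here is an assumption made for the example and stands in for the model trained on big.txt. Filtering with `if e2 in NWORDS` inside the generator keeps the result small even though a very large number of intermediate strings is generated.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Toy vocabulary standing in for the model trained on big.txt (assumption).
NWORDS = {'something': 2, 'soothing': 2, 'nothing': 2}

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Apply edits1 twice and keep only the results found in the vocabulary.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

print(sorted(known_edits2('somthng')))
# ['something', 'soothing']
```

Here something is reached by inserting e and i, and soothing by replacing m with o and inserting i; nothing needs three edits and is therefore not returned.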

If the edit distance is 0, the word is already in the sample and the spelling is considered correct. The known function keeps only the words that appear in NWORDS:

def known(words):
    return set(w for w in words if w in NWORDS)

We assume that a known word at edit distance 1 is far more probable than one at edit distance 2, and that a known word at edit distance 0 is far more probable than one at edit distance 1. The correct function below therefore first selects the candidates with the smallest edit distance, whose P(w|c) is largest, and among those candidates returns the word with the largest P(c) as the spelling suggestion:

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
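
To try the whole pipeline without downloading big.txt, the sketch below trains on a short in-memory string. SAMPLE_TEXT is a made-up stand-in for the real sample data, so the results only illustrate the mechanism; a real corpus would give different counts and candidates.

```python
import re, collections

# Made-up stand-in for big.txt (assumption for this example only).
SAMPLE_TEXT = "late late late latest spelling copy surprise"

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(SAMPLE_TEXT))
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    # Prefer distance 0, then distance 1, then distance 2, else give up.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    # Among candidates at the same distance, pick the most frequent word.
    return max(candidates, key=NWORDS.get)

print(correct("cpoy"))     # copy
print(correct("lates"))    # late
print(correct("zzzzz"))    # zzzzz
```

In this toy sample both late and latest are one edit away from lates, and late is chosen because it occurs more often. A word with no known neighbor within two edits, such as zzzzz, is returned unchanged.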

That is all for this article. I hope it helps you learn Python programming.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
