Implementing a spelling checker in 21 lines of Python code

Source: Internet
Author: User
This 21-line Python program is a simple but complete spelling checker; readers who are interested can use it as a reference.

Introduction

When you search with Google or Baidu, the search engine usually offers excellent spelling correction. For example, if you type speling, Google immediately suggests spelling.
The following is a simple but complete spelling checker implemented in 21 lines of Python code.

Code

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The correct function is the entry point of the program. Given a misspelled word, it returns the most likely correction. For example:

>>> correct("cpoy")
'copy'
>>> correct("engilsh")
'english'
>>> correct("sruprise")
'surprise'

Besides this code, the program needs training data: the file big.txt serves as our sample data.

Principles

The code above is based on Bayes' theorem. In fact, Google and Baidu also implement spelling correction with Bayesian methods, although their systems must be far more complicated than this one.
First, a brief introduction to the underlying principles. If you already know them, you can skip this section.
Given a word, we try to choose the most likely correct spelling (which may be the input word itself). Sometimes the answer is ambiguous (for example, should lates be corrected to late or latest?), so we use probabilities to decide which suggestion to make. Out of all possible corrections c of the original word w, we look for the most likely one:

argmax_c P(c|w)

By Bayes' theorem, this is equivalent to:

argmax_c P(w|c) P(c) / P(w)

The terms in these formulas have the following meanings:

  • P(c|w) is the probability that the user meant to type the word c, given that they actually typed w.
  • P(w|c) is the probability that the user types w when they meant to type c. This is the error model, which we treat as given.
  • P(c) is the probability that the word c appears in the sample data.
  • P(w) is the probability that w appears in the sample data.

Because P(w) is the same for every candidate word c, it does not affect which c wins, so the formula simplifies to:
argmax_c P(w|c) P(c)
All of our code is based on this formula. The implementation is analyzed below.
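To make the formula concrete, here is a minimal sketch that evaluates argmax_c P(w|c) P(c) for the input lates. The probabilities are invented purely for illustration (they are not measured from big.txt), and the names p_c, p_w_given_c, p_w and score are hypothetical.

```python
# Toy illustration of argmax_c P(w|c) P(c) for the typed word w = "lates".
# All numbers below are made up for this example.
p_c = {"late": 0.0004, "latest": 0.0001, "lattes": 0.000001}   # P(c)
p_w_given_c = {"late": 0.01, "latest": 0.02, "lattes": 0.02}   # P(w|c)

def score(c):
    # Unnormalized posterior: P(w|c) * P(c)
    return p_w_given_c[c] * p_c[c]

best = max(p_c, key=score)

# Dividing every score by the same constant P(w) cannot change the winner.
p_w = sum(score(c) for c in p_c)
best_normalized = max(p_c, key=lambda c: score(c) / p_w)

print(best, best_normalized)
```

With these made-up numbers, late wins even though its error-model probability is lower, because it is assumed to be a much more common word.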

Code Analysis

Use the words() function to extract the words from big.txt:

def words(text): return re.findall('[a-z]+', text.lower())

re.findall('[a-z]+', ...) uses Python's regular expression module to extract every substring that matches '[a-z]+', that is, every run of one or more lowercase letters. (Regular expressions are not covered in detail here; if you are interested, see an introduction to regular expressions.) text.lower() converts the text to lowercase, so that "the" and "The" are treated as the same word.
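As a quick illustration of what this tokenizer does and does not keep, the sketch below runs words() on a made-up sentence (the sample string is an assumption, not part of big.txt). Note that digits and punctuation are dropped and that apostrophes split a word in two.

```python
import re

def words(text):
    # Lowercase first, then keep every maximal run of the letters a-z.
    return re.findall('[a-z]+', text.lower())

sample = "The cat saw the Cat. Don't panic: it's 2 a.m.!"
tokens = words(sample)
print(tokens)
```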

Use the train() function to count how often each word occurs, which gives us a simple model:

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

In this way, NWORDS[w] reflects the number of times the word w appears in the sample. What if a word does not appear in our sample at all? The solution is to give every word a default count of 1, which is implemented with the collections module and a lambda expression: collections.defaultdict() creates a dictionary with default values, and lambda: 1 makes the default for any missing key equal to 1. As a result, a word that occurs n times in the sample is stored with the count n + 1.
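The behaviour of the default dictionary can be checked on a tiny word list. This is a minimal sketch using a made-up three-word input; it also shows that membership tests and .get() leave the dictionary untouched, while indexing an unseen word inserts it with the default value.

```python
import collections

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

model = train(["the", "cat", "the"])
print(model["the"], model["cat"])        # stored counts are occurrences + 1

# Membership tests and .get() do not create entries...
print("dog" in model, model.get("dog"))
# ...but indexing an unseen word inserts it with the default value 1.
print(model["dog"], "dog" in model)
```

This distinction matters for the spelling checker: known() tests membership with in, and correct() ranks with NWORDS.get, so looking up candidates never adds them to the model.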

We have now handled P(c) in the formula argmax_c P(w|c) P(c). Next comes P(w|c), the probability that the user types w when they meant to type c. We estimate it with the edit distance, that is, the number of edits needed to turn one word into another. A single edit is one deletion, one transposition (swapping two adjacent letters), one insertion, or one replacement. The following function returns the set of all strings w that are one edit away from a given word c:

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
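To get a feel for how many candidates one edit produces, the sketch below counts each edit type separately. For a word of length n there are n deletions, n - 1 transpositions, 26n replacements and 26(n + 1) insertions, or 54n + 25 strings before duplicates are removed. The helper name edit_lists is hypothetical; it only repeats the body of edits1 without the final set().

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_lists(word):
    # Same four edit types as edits1, returned separately so they can be counted.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return deletes, transposes, replaces, inserts

word = 'cpoy'
n = len(word)
deletes, transposes, replaces, inserts = edit_lists(word)
total = len(deletes) + len(transposes) + len(replaces) + len(inserts)
candidates = set(deletes + transposes + replaces + inserts)

print(len(deletes), len(transposes), len(replaces), len(inserts))
print(total == 54 * n + 25)
print('copy' in candidates)   # reached by transposing 'p' and 'o'
```

The set is somewhat smaller than the raw total because different edits can yield the same string.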

According to the literature, 80 to 95% of spelling errors are within an edit distance of 1 from the intended word. If one edit is not enough, we apply a second one:

def known_edits2(word):  return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
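Here is a small sketch of a word that needs two edits. It uses a made-up three-word vocabulary in place of the model trained on big.txt (the dictionary contents are an assumption for the example); somthng is two insertions away from something, so one round of edits finds nothing and the second round does.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Hypothetical toy vocabulary standing in for the model trained on big.txt.
NWORDS = {'something': 5, 'nothing': 3, 'some': 2}

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return set(w for w in words if w in NWORDS)

def known_edits2(word):
    # Filter while generating, so the huge two-edit set is never stored in full.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

one_edit = known(edits1('somthng'))
two_edits = known_edits2('somthng')
print(one_edit)
print(two_edits)
```

Filtering against NWORDS inside the generator is the reason the function is called known_edits2 rather than edits2: only the handful of real words is kept.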

An edit distance of 0 means the word itself is already spelled correctly. The known() function keeps only those words that appear in NWORDS:

def known(words):  return set(w for w in words if w in NWORDS)

We assume that a known word at edit distance 1 is far more probable than one at edit distance 2, and that a known word at edit distance 0 is far more probable than one at edit distance 1. The correct function therefore first takes the candidates with the smallest edit distance, for which P(w|c) is largest, and then returns the candidate with the largest P(c) as the spelling suggestion:

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
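To try the whole pipeline without the big.txt file, the sketch below trains on a short made-up string instead. TOY_TEXT is a hypothetical stand-in for the real sample data, so the results only illustrate the selection logic, not the quality of a model trained on a large corpus.

```python
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Hypothetical stand-in for big.txt so the example runs without the data file.
TOY_TEXT = "late late late latest copy copy spelling surprise"
NWORDS = train(words(TOY_TEXT))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

for w in ['late', 'lates', 'cpoy', 'speling', 'qqqqq']:
    print(w, '->', correct(w))
```

In this toy model, lates has two candidates at edit distance 1, late and latest, and late is chosen because it occurs more often in TOY_TEXT. A word with no known candidate within two edits, such as qqqqq, is returned unchanged.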

That is all for this article. I hope it helps you learn Python programming.
