Implementing a word spelling check in Python

Source: Internet
Author: User
This article describes how to implement a word spelling check in Python. It covers some background on spelling checks and provides two implementation methods.

A few days ago, while going through some old code, I found many spelling mistakes in the comments I had written earlier. The mistakes are not outrageous, so a tool should be able to correct most of them automatically. Writing a spelling check script in Python is easy, and it is easier still with ready-made tools such as aspell and ispell.

Key points

1. Enter a misspelled word, call aspell -a to obtain a list of candidate corrections, and then use edit distance to narrow them down to the most likely words. For example, running aspell -a and entering 'hella' gives the following candidates:
Hell, Helli, hello, heal, Heall, he'll, hells, Heller, Ella, Hall, Hill, Hull, hall, heel, hill, hula, hull, Helga, Helsa, bella, Della, Mella, Sella, fella, Halli, Hally, Hilly, Holli, Holly, hallo, hilly, holly, hullo, Hell's, hell's

2. What is edit distance, often called Levenshtein distance? It measures how many single-character operations (insert, delete, swap, or replace) separate two words. Applying these operations to a given word, possibly several times, generates its possible spellings. For example, applying them to 'hella' produces results such as:
'helkla', 'hjla', 'hyler', 'hellma', 'khella', 'iella', 'helhla', 'hellag', 'ha', 'vhella', 'hhella', 'hell', 'hegler', 'hvlla', 'hellaa', 'ghella', 'hellar', 'hesler', 'lhela', 'helpa', 'hello', ...
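As a rough illustration of the edit-distance idea in point 2, the sketch below computes the classic Levenshtein distance between two words. It counts insertions, deletions, and replacements only, so swapping two adjacent characters costs two edits here. The function name levenshtein is chosen for this example and is not part of the script in this article.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and replacements needed to turn a into b."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace ca with cb
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for candidate in ("hello", "hell", "hula"):
        print(candidate, levenshtein("hella", candidate))
```

A spelling checker could use such a function to rank candidates: the smaller the distance to the typed word, the more plausible the correction.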

3. Combining the two result sets above with a little knowledge of how people misspell words improves the accuracy of the spelling check. For example, when a word is misspelled unintentionally, it is rarely completely wrong, and its first letter is usually correct. You could therefore remove the candidates whose first letter does not match, such as 'sella', 'mella', 'khella', and 'iella'. VPSee does not delete these words here; instead they are moved to the end of the queue (with lower priority), so that when no word starting with h matches, the words starting with other letters are still available.

4. The program uses the external tool aspell. How can a Python program capture the input and output of an external program in order to process them? The subprocess module, introduced in Python 2.4, handles this with subprocess.Popen.
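The capture technique from point 4 can be sketched as follows. To keep the example runnable without aspell, it uses the Python interpreter itself as a stand-in child process; the helper name run_filter and the one-line child program are assumptions made for this illustration.

```python
import subprocess
import sys


def run_filter(args, text):
    """Start the program given by args, write text to its standard input,
    and return everything it wrote to standard output as a string."""
    child = subprocess.Popen(args,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             universal_newlines=True)  # exchange str, not bytes
    # communicate() sends the input, closes stdin, and waits for the child
    stdout, _ = child.communicate(text)
    return stdout


if __name__ == "__main__":
    # Stand-in child process: a one-line Python program that upper-cases
    # its input. Replace it with ["aspell", "-a"] if aspell is installed.
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_filter(upper, "hella\n"))
```

Passing the command as a list avoids going through a shell; the script in this article passes a string with shell=True instead, which also works.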

5. Peter Norvig, an expert at Google, wrote an article called How to Write a Spelling Corrector that is well worth reading. It shows real mastery: 21 lines of Python solve the spelling problem without external tools, and only a dictionary file needs to be read in advance. The edits1 function in the program below is copied from his article.
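The dictionary-based approach mentioned in point 5 can be sketched roughly as below. This is a simplified illustration in the spirit of that article, not Norvig's actual code: the tiny WORD_COUNTS table stands in for word frequencies read from a real dictionary file, its numbers are made up, and the name correct_word is chosen for this example.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Stand-in for frequencies counted from a large text file (made-up numbers).
WORD_COUNTS = Counter({"hello": 50, "hell": 20, "help": 30, "fella": 2})


def edits1(word):
    """Return every string that is exactly one edit operation away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}


def correct_word(word):
    """Return word if it is known, else the most frequent known word
    one edit away, else word unchanged."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])


if __name__ == "__main__":
    print(correct_word("hella"))
```

The design choice here is to prefer frequent words: among all known words one edit away, the one seen most often in the reference text is returned.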

Code


#!/usr/bin/python3
# A simple spell checker
import signal
import subprocess
import sys

alphabet = 'abcdefghijklmnopqrstuvwxyz'


def found(word, args, cwd=None, shell=True):
    child = subprocess.Popen(args,
                             shell=shell,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             cwd=cwd,
                             universal_newlines=True)
    stdout, stderr = child.communicate(word + "\n")
    for line in stdout.splitlines():
        # a line with suggestions looks like "& hella 35 0: Hell, Helli, ..."
        if line.startswith("&") and ": " in line:
            # remove left part until :
            left, candidates = line.split(": ", 1)
            candidates = candidates.split(", ")
            # making an error on the first letter of a word is less
            # probable, so move candidates with a different first letter
            # to the tail of the queue to give them lower priority
            head = [c for c in candidates if c[0] == word[0]]
            tail = [c for c in candidates if c[0] != word[0]]
            return head + tail
    return None


# copied from http://norvig.com/spell-correct.html
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                      # delete
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] +  # swap
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] +  # replace
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])   # insert


def correct(word):
    candidates1 = found(word, 'aspell -a')
    if not candidates1:
        print("no suggestion")
        return
    candidates2 = edits1(word)
    # keep the aspell suggestions that are also one edit away from the input
    candidates = [c for c in candidates1 if c in candidates2]
    if not candidates:
        print("suggestion: %s" % candidates1[0])
    else:
        # max() returns the alphabetically last match
        print("suggestion: %s" % max(candidates))


def signal_handler(signum, frame):
    sys.exit(0)


if __name__ == '__main__':
    signal.signal(signal.SIGINT, signal_handler)
    while True:
        try:
            word = input().strip()
        except EOFError:
            break
        if word:
            correct(word)

Simpler method

Of course, the easiest approach is to call an existing module directly from the program. A library called PyEnchant supports spelling checks. After installing PyEnchant and Enchant, you can import it directly in a Python program:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>
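Building on the session above, a small helper could apply the same check() and suggest() calls to a list of words, for example the words taken from a code comment. The helper name check_words is an assumption for this sketch; it accepts any object that offers those two methods, such as an enchant.Dict instance.

```python
def check_words(checker, words):
    """Map each misspelled word to its first suggestion (or None).

    checker is any object with check(word) and suggest(word) methods,
    such as an enchant.Dict instance.
    """
    corrections = {}
    for word in words:
        if not checker.check(word):
            suggestions = checker.suggest(word)
            corrections[word] = suggestions[0] if suggestions else None
    return corrections


if __name__ == "__main__":
    try:
        import enchant  # requires PyEnchant and the Enchant library
    except ImportError:
        print("PyEnchant is not installed")
    else:
        print(check_words(enchant.Dict("en_US"), ["Hello", "Helo"]))
```

Taking the first suggestion is only a simple default; which suggestion comes first depends on the installed dictionary.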
