Spell-checking words with Python

Source: Internet
Author: User
Over the past few days, while going through old code, I found many misspelled words in comments I had written earlier. The mistakes are not outlandish, so a tool should be able to correct most of them automatically. Writing a spelling checker script in Python is easy, and easier still if you take advantage of ready-made tools such as aspell/ispell.

Points

1. Take a misspelled word, run aspell -a to get a list of correctly spelled candidates, then use edit distance to filter that list down to the more likely words. For example, running aspell -a and entering 'hella' gives the following suggestions:
Hell, Helli, Hello, heal, Heall, he'll, Hells, Heller, Ella, Hall, Hill, Hull, Hall, Heel, Hill, Hula, Hull, Helga, Helsa, Bella, Della, Mella, Sella, Fella, Halli, Hally, Hilly, Holli, Holly, Hallo, Hilly, Holly, Hullo, Hell's, Hell's

2. What is edit distance (also called Levenshtein distance)? It counts the single-character insertions, deletions, and replacements needed to turn one word into another; treating a swap of two adjacent characters as one edit as well gives the Damerau-Levenshtein variant. Here the idea is used in reverse: given a word, generate every string that is one such edit away from it. For example, applying each single-character insertion, deletion, swap, and replacement to 'hella' produces:
'helkla', 'hjlla', 'hylla', 'hellma', 'khella', 'iella', 'helhla', 'hellag', 'hela', 'vhella', 'hhella', 'hell', 'heglla', 'hvlla', 'hellaa', 'ghella', 'hellar', 'heslla', 'lhella', 'helpa', 'hello', ...

3. Combine the two sets above, and use a little background knowledge to improve accuracy. Generally speaking, misspellings are unintentional slips, a word is rarely wrong throughout, and the first letter of a word is usually spelled correctly. So words whose first letter does not match could be removed from the sets above, for example 'Sella', 'Mella', 'khella', 'iella', and so on. Here VPSee does not delete these words but moves them to the end of the queue (lowering their priority), so that only when no word beginning with h matches does the program fall back to words beginning with other letters.

4. The program relies on the external tool aspell. How can a Python program capture the input and output of an external program in order to process them? Python 2.4 introduced the subprocess module, and subprocess.Popen can handle this.

5. Peter Norvig of Google wrote an essay, How to Write a Spelling Corrector, that is well worth reading. An expert is an expert: 21 lines of Python solve the spelling problem without any external tool, needing only a text file of words read in advance. The edits1 function in the program below is copied from his essay.
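Points 2 and 3 can be sketched in a few lines. The following is only an illustration, not the script from this article: it computes the plain Levenshtein distance with dynamic programming (so a swap counts as two edits, unlike edits1), then orders suggestions so that words sharing the first letter come first and closer words come earlier. The names levenshtein and rank and the sample list are assumptions made for this example.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, and
    replacements needed to turn string a into string b."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def rank(word, suggestions):
    """Order suggestions: same first letter first, then by edit distance.

    sorted() is stable, so ties keep the order they arrived in."""
    return sorted(suggestions,
                  key=lambda s: (s[0] != word[0], levenshtein(word, s)))


# illustrative sample (an assumption, not real aspell output)
sample = ["Hell", "hello", "heal", "Sella", "hall", "fella"]
ranked = rank("hella", sample)
print(ranked)
```

Sorting on a (first letter differs, distance) pair lowers the priority of words with a different first letter without discarding them, which is the behaviour point 3 describes.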

Code


#!/usr/bin/env python3
# A simple spell checker

import signal
import subprocess
import sys

alphabet = 'abcdefghijklmnopqrstuvwxyz'


def found(word, args, cwd=None, shell=True):
    # run the external checker and send it the word on stdin
    child = subprocess.Popen(args,
                             shell=shell,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             cwd=cwd,
                             universal_newlines=True)
    stdout, stderr = child.communicate(word + "\n")
    for line in stdout.splitlines():
        # skip the version banner and "*" (word is correct) lines;
        # suggestions arrive as "& word count offset: cand1, cand2, ..."
        if not line.startswith("&") or ": " not in line:
            continue
        left, candidates = line.split(": ", 1)
        candidates = candidates.split(", ")
        # making an error in the first letter of a word is less
        # probable, so candidates with a different first letter go
        # to the tail of the queue (lower priority)
        head = [c for c in candidates if c[0] == word[0]]
        tail = [c for c in candidates if c[0] != word[0]]
        return head + tail
    return None


# copied from http://norvig.com/spell-correct.html
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # swap
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # replacement
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion


def correct(word):
    candidates1 = found(word, 'aspell -a')
    if not candidates1:
        print("no suggestion")
        return

    # keep the aspell suggestions that are a single edit away
    candidates2 = edits1(word)
    candidates = [c for c in candidates1 if c in candidates2]
    if not candidates:
        print("suggestion: %s" % candidates1[0])
    else:
        print("suggestion: %s" % max(candidates))


def signal_handler(signum, frame):
    sys.exit(0)


if __name__ == '__main__':
    # exit quietly on Ctrl-C
    signal.signal(signal.SIGINT, signal_handler)
    while True:
        correct(input())
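The two halves of the found function, talking to an external program and parsing its reply, can also be tried separately. The sketch below is an assumption-laden illustration rather than part of the original script: pipe_through and parse_suggestions are hypothetical helper names, subprocess.run with capture_output and text needs Python 3.7 or later, and a small Python one-liner stands in for aspell so that the example runs on a machine without aspell installed.

```python
import subprocess
import sys


def pipe_through(args, text):
    """Send text to the program's stdin and return its stdout as a string."""
    result = subprocess.run(args,
                            input=text,
                            capture_output=True,
                            text=True,
                            check=True)
    return result.stdout


def parse_suggestions(output):
    """Return the suggestions from 'aspell -a' style output, or [] if none.

    A misspelled word is reported on a line starting with "&", with the
    comma-separated suggestions after the first ": "."""
    for line in output.splitlines():
        if line.startswith("&") and ": " in line:
            return line.split(": ", 1)[1].split(", ")
    return []


# Stand-in child process (an assumption): upper-cases whatever it reads.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(pipe_through(child, "hella\n"))

# A hand-written reply in the pipe-mode format, not captured from aspell.
reply = "@(#) banner line\n& hella 3 0: Hell, Hello, he'll\n"
print(parse_suggestions(reply))
```

With aspell installed, passing ["aspell", "-a"] as args should return a banner line followed by the suggestion line, which parse_suggestions can then split.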

A simpler approach

Of course, the simplest approach is to call a ready-made module directly from the program. A library called PyEnchant supports spell checking; once PyEnchant and Enchant are installed, you can import it directly in a Python program:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>
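In a script, the check and suggest calls are usually wrapped in a small helper. The sketch below is a minimal illustration under stated assumptions: suggest_or_keep is a hypothetical name, and TinyDict is a toy stand-in that only mimics the check()/suggest() methods of enchant.Dict so the example runs without PyEnchant installed. Its suggestion logic is deliberately naive and has nothing to do with how Enchant works.

```python
def suggest_or_keep(checker, word):
    """Return word unchanged if the checker accepts it; otherwise return
    the checker's first suggestion, or None when it has none."""
    if checker.check(word):
        return word
    suggestions = checker.suggest(word)
    return suggestions[0] if suggestions else None


class TinyDict:
    """Hypothetical stand-in exposing check() and suggest() like enchant.Dict."""

    def __init__(self, words):
        self.words = set(words)

    def check(self, word):
        return word in self.words

    def suggest(self, word):
        # naive: known words with the same first letter and a similar length
        return sorted(w for w in self.words
                      if w[0] == word[0] and abs(len(w) - len(word)) <= 1)


d = TinyDict(["Hello", "Help", "World"])
print(suggest_or_keep(d, "Hello"))
print(suggest_or_keep(d, "Helo"))
print(suggest_or_keep(d, "Xyz"))
```

With PyEnchant installed, passing enchant.Dict("en_US") in place of TinyDict should work the same way, since the helper only relies on those two methods.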