Implementing a word spell checker in Python

Source: Internet
Author: User

While going through some old code over the past few days, I found many misspelled words in the comments I had written. The mistakes are not outrageous, so a tool should be able to correct the vast majority of them automatically. Writing a spell checker script in Python is easy, and it is easier still if you make good use of ready-made tools such as aspell/ispell.

Points

1, Take a misspelled word, call aspell -a to get a list of candidate corrections, and then use edit distance to filter the list down to the more likely words. For example, running aspell -a and entering 'hella' gives the following suggestions:
hell, Helli, hello, heal, Heall, he'll, hells, Heller, Ella, Hall, hill, hull, hall, heel, Hill, hula, Hull, Helga, Helsa, Bella, Della, Mella, Sella, fella, Halli, Hally, Hilly, Holli, Holly, hallo, hilly
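To make the first point concrete, here is a minimal sketch of pulling the suggestions out of one result line. It assumes the ispell-compatible pipe format that aspell -a prints for a misspelled word ("& word count offset: suggestion, suggestion, ..."); the function name parse_aspell_line and the shortened sample line are illustrative, not part of the original script.

```python
def parse_aspell_line(line):
    """Return the suggestions from one `aspell -a` result line.

    Assumed format: "& <word> <count> <offset>: s1, s2, ..."
    Any other line (for example "*" for a correctly spelled word, or
    "# <word> <offset>" for a word with no suggestions) gives [].
    """
    line = line.strip()
    if not line.startswith("&") or ": " not in line:
        return []
    _, suggestions = line.split(": ", 1)
    return [s.strip() for s in suggestions.split(",")]


# Shortened, illustrative result line (real output lists more candidates).
sample = "& hella 3 0: hell, Helli, hello"
print(parse_aspell_line(sample))
print(parse_aspell_line("*"))
```

Returning an empty list for every non-"&" line keeps the caller simple: "no suggestions" and "word is correct" both mean there is nothing to rank.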

2, What is edit distance (also called Levenshtein distance)? It is the number of single-character operations (insert, delete, transpose, replace) needed to turn one word into another. Applying each of these operations once to a given word generates every spelling that is one edit away from it, and the correct spelling is likely to be among them. For example, applying insert, delete, transpose, and replace operations to 'hella' produces:
'helkla', 'hjlla', 'hylla', 'hellma', 'khella', 'iella', 'helhla', 'hellag', 'hela', 'vhella', 'hhella', 'hell', 'heglla', 'hvlla', 'hellaa', 'ghella', 'hellar', 'heslla', 'lhella', 'helpa', 'hello', ...
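As a rough illustration of the distance itself, the sketch below computes plain Levenshtein distance with the usual dynamic-programming table. Note the assumption: plain Levenshtein distance counts only inserts, deletes, and replacements, so swapping two adjacent letters costs 2 here rather than 1. The function name levenshtein is chosen for this example.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes, and replacements
    needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a handled so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


print(levenshtein("hella", "hello"))  # one replacement apart
print(levenshtein("hella", "hell"))   # one deletion apart
print(levenshtein("hella", "holly"))
```

A candidate list could then be filtered with something like `[c for c in candidates if levenshtein(word, c) <= 1]`.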

3, Combining the two sets of results above with a little background knowledge improves the accuracy of the spell check. Generally speaking, misspellings are unintentional slips, a completely wrong word is very unlikely, and the first letter of a word is usually spelled correctly. So candidates whose first letter differs from the input, such as 'Sella', 'Mella', 'khella', and 'iella', could be removed from the set. Here Vpsee does not delete these words but moves them to the end of the queue (lower priority), so that words starting with other letters are considered only when no word starting with 'h' matches.
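The reordering described in point 3 can be sketched as a stable partition: candidates that share the first letter of the input keep their relative order at the front, and the rest follow. The function name and the sample list are made up for illustration, and the comparison is case-sensitive, which is an assumption.

```python
def prioritize_same_first_letter(word, candidates):
    """Move candidates whose first letter differs from word's to the tail.

    Both groups keep their original relative order, so the spell
    checker's own ranking is preserved inside each group.
    """
    same = [c for c in candidates if c[:1] == word[:1]]
    other = [c for c in candidates if c[:1] != word[:1]]
    return same + other


candidates = ["hell", "Ella", "hello", "Bella", "hall"]
print(prioritize_same_first_letter("hella", candidates))
```

Building two new lists avoids removing items from a list while looping over it, which can silently skip elements.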

4, The program relies on the external tool aspell. How can a Python program capture the input and output of an external program and process them? Python 2.4 introduced the subprocess module, and subprocess.Popen handles this.
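A minimal sketch of that pattern follows: start a child process with pipes on stdin and stdout, send it some text, and read back what it prints. So that the example runs without aspell installed, the child here is a stand-in (a Python one-liner that upper-cases its input); the helper name run_filter and the stand-in command are assumptions for the demo.

```python
import subprocess
import sys


def run_filter(args, text):
    """Start an external program, send text to its stdin, return its stdout."""
    child = subprocess.Popen(args,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             universal_newlines=True)  # exchange str, not bytes
    # communicate() writes the text, closes stdin, and waits for the child.
    stdout, _ = child.communicate(text)
    return stdout


# Stand-in for an external tool: a child Python process that upper-cases stdin.
UPPERCASE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]

print(run_filter(UPPERCASE, "hella\n"))
```

Passing the command as a list avoids going through a shell; the same helper could be pointed at ["aspell", "-a"] on a machine where aspell is installed.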

5, Peter Norvig of Google wrote "How to Write a Spelling Corrector", which is well worth reading. In true expert fashion, he solves the spelling problem in 21 lines of Python without any external tools; the program only needs to read a dictionary file in advance. The edits1 function in the program here is copied from his article.
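To show the dictionary-only idea without external tools, here is a rough sketch in the spirit of that approach; it is not Norvig's code. The word counts are toy values invented for this example and stand in for a real dictionary file, and the names WORD_COUNTS and correct are illustrative.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Toy word counts standing in for a real dictionary file (made-up numbers).
WORD_COUNTS = {"hello": 50, "hell": 20, "help": 30, "hall": 10, "bella": 1}


def edits1(word):
    """All strings one insert, delete, transpose, or replace away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return word if known, else the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    known = [w for w in edits1(word) if w in WORD_COUNTS]
    if not known:
        return word  # nothing better to offer
    return max(known, key=WORD_COUNTS.get)


print(correct("hella"))
```

With these toy counts, 'hella' has three known neighbours ('hello', 'hell', 'bella'), and the most frequent one wins.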

Code


#!/usr/bin/env python3
# A simple spell checker

import signal
import subprocess
import sys

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def found(word, args, cwd=None, shell=True):
    # start aspell and talk to it through pipes
    child = subprocess.Popen(args,
        shell=shell,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        cwd=cwd,
        universal_newlines=True)
    child.stdout.readline()  # skip the version banner
    (stdout, stderr) = child.communicate(word)
    if ": " in stdout:
        # remove trailing newlines
        stdout = stdout.rstrip("\n")
        # remove the left part up to ": "
        left, candidates = stdout.split(": ", 1)
        candidates = candidates.split(", ")
        # making an error on the first letter of a word is less
        # probable, so we move those candidates to the tail of
        # the queue, giving them lower priority
        same = [c for c in candidates if c[0] == word[0]]
        other = [c for c in candidates if c[0] != word[0]]
        return same + other
    else:
        return None

# copied from http://norvig.com/spell-correct.html
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                      # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] +  # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] +  # replacement
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])   # insertion

def correct(word):
    candidates1 = found(word, 'aspell -a')
    if not candidates1:
        print("no suggestion")
        return

    candidates2 = edits1(word)
    # keep the aspell candidates that are one edit away from the input
    candidates = [c for c in candidates1 if c in candidates2]
    if not candidates:
        print("suggestion: %s" % candidates1[0])
    else:
        # max() picks the alphabetically last of the matching candidates
        print("suggestion: %s" % max(candidates))

def signal_handler(signum, frame):
    # exit quietly on Ctrl-C
    sys.exit(0)

if __name__ == '__main__':
    signal.signal(signal.SIGINT, signal_handler)
    while True:
        correct(input())

A simpler way

Of course, calling a suitable module directly from the program is the simplest approach. A library called PyEnchant supports spell checking; once PyEnchant and Enchant are installed, you can import it directly in a Python program:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'helot', 'help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
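The session above can be wrapped in a small helper. This is only a sketch: best_suggestion is a name made up here, and it relies on nothing beyond the check and suggest methods shown in the session. The demo at the bottom runs only if PyEnchant and an en_US dictionary are available.

```python
def best_suggestion(checker, word):
    """Return word if the checker accepts it, else its first suggestion.

    `checker` is any object with check(word) and suggest(word) methods,
    such as the enchant.Dict instance used in the session above.
    Returns None when the word is wrong and there are no suggestions.
    """
    if checker.check(word):
        return word
    suggestions = checker.suggest(word)
    return suggestions[0] if suggestions else None


if __name__ == "__main__":
    try:
        import enchant  # needs PyEnchant plus the Enchant library
        checker = enchant.Dict("en_US")
    except Exception as exc:  # PyEnchant or the en_US dictionary is missing
        print("enchant unavailable:", exc)
    else:
        print(best_suggestion(checker, "Helo"))
```

Taking the first suggestion trusts the library's own ranking; the first-letter heuristic from the points above could be applied to the suggestion list before picking.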
