Over the past few days, while going through some old code, I found that the comments I had written earlier contained many misspelled words. The mistakes are not outrageous, so a tool should be able to correct the vast majority of them automatically. Writing a spelling checker script in Python is easy, and it is even easier if you make good use of ready-made tools such as aspell/ispell.
Points
1, Given a misspelled word, call aspell -a to get a list of candidate corrections, then use edit distance to filter that list further and pick the more accurate words. For example, run aspell -a, enter 'hella', and you get the following results:
Hell, Helli, Hello, heal, Heall, he'll, Hells, Heller, Ella, Hall, Hill, Hull, Hall, Heel, Hill, Hula, Hull, Helga, Helsa, Bella, Della, Mella, Sella, Fella, Halli, Hally, Hilly, Holli, Holly, Hallo, hilly
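2, What is edit distance (also known as Levenshtein distance)? It counts the insertions, deletions, transpositions, and replacements needed to turn one word into another (strictly speaking, the variant that also counts transpositions is the Damerau-Levenshtein distance). Here the idea is used in the other direction: given a word, generate every string that is one such edit away from it and look for correct spellings among them. For example, applying a single insertion, deletion, transposition, or replacement to 'hella' produces: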
'helkla', 'hjlla', 'hylla', 'hellma', 'khella', 'iella', 'helhla', 'hellag', 'hela', 'vhella', 'hhella', 'hell', 'heglla', 'hvlla', 'hellaa', 'ghella', 'hellar', 'heslla', 'lhella', 'helpa', 'hello', ...
3, Combine the two sets of results above and apply a few rules of thumb to improve the accuracy of the spelling check. Generally speaking, misspellings are unintentional slips; completely wrong words are very unlikely, and the first letter of a word is rarely mistyped. So candidates whose first letter differs from the input, such as 'Sella', 'Mella', 'khella', and 'iella', could be removed from the set. Here VPSee does not delete these words but moves them to the end of the queue (lower priority), so that words starting with other letters are matched only when no word starting with h fits.
4, The program relies on the external tool aspell. How do you capture the input and output of an external program from Python so that the script can process them? Python 2.4 introduced the subprocess module, and subprocess.Popen handles exactly this.
5, Peter Norvig of Google wrote How to Write a Spelling Corrector, which is well worth reading. An expert is an expert: 21 lines of Python solve the spelling problem without any external tool, needing only a dictionary file read in advance. The edits1 function in this program is copied from his article.
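As a rough sketch of points 1 and 4, the snippet below shows one way to feed text to an external program with subprocess.Popen and to pull the suggestions out of an ispell-style reply line. The helper names run_filter and parse_suggestions are made up for this illustration, the sample reply line is shortened by hand, and the child process is a small Python one-liner standing in for aspell so that the example runs even where aspell is not installed.

```python
import subprocess
import sys


def run_filter(args, text):
    """Send text to an external program's stdin and return its stdout."""
    child = subprocess.Popen(args,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             universal_newlines=True)
    stdout, _ = child.communicate(text)
    return stdout


def parse_suggestions(line):
    """Parse a reply such as '& hella 3 0: hell, hello, Ella'.

    Returns the list of suggestions, or [] for any other kind of line.
    """
    if not line.startswith("&") or ": " not in line:
        return []
    _, candidates = line.split(": ", 1)
    return candidates.strip().split(", ")


# Stand-in child process (an assumption for this demo): it upper-cases stdin.
# With aspell installed, args would be ["aspell", "-a"] instead.
echo_upper = [sys.executable, "-c",
              "import sys; sys.stdout.write(sys.stdin.read().upper())"]

print(run_filter(echo_upper, "hella\n"))
print(parse_suggestions("& hella 3 0: hell, hello, Ella"))
```

Passing the arguments as a list avoids going through a shell, and communicate() writes the input, closes stdin, and reads everything back in one step, which sidesteps the deadlocks that manual reads and writes on pipes can cause.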
Code
#!/usr/bin/env python3
# A simple spell checker

import signal
import subprocess
import sys

alphabet = 'abcdefghijklmnopqrstuvwxyz'


def found(word, args, cwd=None, shell=True):
    # Start the external checker and talk to it through pipes.
    child = subprocess.Popen(args,
                             shell=shell,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             cwd=cwd,
                             universal_newlines=True)
    stdout, stderr = child.communicate(word + "\n")
    for line in stdout.splitlines():
        # A miss with suggestions looks like "& hella 31 0: hell, Helli, ...";
        # the version banner and other replies do not start with "&".
        if line.startswith("&") and ": " in line:
            # remove the left part up to the colon
            left, candidates = line.split(": ", 1)
            candidates = candidates.split(", ")
            # Making an error on the first letter of a word is less
            # probable, so move candidates with a different first letter
            # to the tail of the queue to give them lower priority.
            # sorted() is stable, so the order is otherwise unchanged.
            return sorted(candidates, key=lambda item: item[0] != word[0])
    return None


# copied from http://norvig.com/spell-correct.html
def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                       # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +  # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +    # replacement
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])     # insertion


def correct(word):
    candidates1 = found(word, 'aspell -a')
    if not candidates1:
        print("no suggestion")
        return

    candidates2 = edits1(word)
    # keep the aspell candidates that are also one edit away from the input
    candidates = [item for item in candidates1 if item in candidates2]
    if not candidates:
        print("suggestion: %s" % candidates1[0])
    else:
        print("suggestion: %s" % max(candidates))


def signal_handler(signum, frame):
    # exit quietly on Ctrl-C
    sys.exit(0)


if __name__ == '__main__':
    signal.signal(signal.SIGINT, signal_handler)
    while True:
        line = input()
        correct(line)
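The script above chooses among the surviving candidates with max(), which simply returns the alphabetically last one. One possible refinement, which is not part of the original script, is to rank candidates by their actual edit distance to the input while still pushing words with a different first letter to the back. The sketch below uses the plain Levenshtein distance (insertions, deletions, replacements; no transpositions, unlike edits1), and the function names levenshtein and rank are chosen only for this example.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and replacements
    needed to turn a into b (classic dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def rank(word, candidates):
    """Order candidates by (different first letter, edit distance)."""
    return sorted(candidates,
                  key=lambda c: (c[0] != word[0], levenshtein(word, c)))


print(rank("hella", ["Ella", "hull", "hello", "hell", "Bella"]))
# ['hello', 'hell', 'hull', 'Bella', 'Ella']
```

Because sorted() is stable, candidates with the same key keep the order in which aspell returned them, so aspell's own ranking still breaks ties.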
A simpler way.
Of course, calling a suitable module directly from the program is the simplest approach. A library called PyEnchant supports spell checking; once PyEnchant and Enchant are installed, you can import it directly in a Python program:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>