Bayesian inference and Internet applications (III): spelling check

Last Update:2014-09-25 Source: Internet

Author: User

(The first part of this series introduced Bayes' theorem, and the second part described how to filter spam. This is the third part.)

When you misspell a word in a Google search, Google suggests the correct spelling.

For example, suppose you accidentally type seperate.

Google tells you that this word does not exist and that the correct spelling is separate.

This feature is called a spelling corrector. There are several ways to implement it. Google uses a statistical method based on Bayesian inference, which can process a large amount of text in a short time with high accuracy (more than 90%). Peter Norvig, Google's Director of Research, wrote a famous article explaining the principles of this method.

Next, let's look at how to use Bayesian inference to implement spelling correction. It is actually very simple; a short piece of code is enough.

I. Principles

The user enters a word, which is spelled either correctly or incorrectly. Denote the correct spelling by c (for correct) and the misspelling by w (for wrong).

Spelling correction means trying to infer c given w. In probabilistic terms, w is known, and we look for the most likely c among several candidates, that is, the c that maximizes the following expression.

P(c|w)

According to Bayes' theorem:

P(c|w) = P(w|c) * P(c) / P(w)

For all candidates c, the observed w is the same, so P(w) is the same for each of them. We therefore only need to maximize

P(w|c) * P(c)

P(c) is the probability that the correct word c occurs, which can be approximated by its frequency. If we have a large enough text library, the frequency of each word in the library serves as an estimate of its probability of occurrence. The more frequently a word appears, the larger P(c) is.
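As a rough illustration of using frequency in place of probability, the sketch below estimates P(c) as a relative frequency. The function name word_probabilities and the tiny sample text are made up for this example and are not part of the original program.

```python
from collections import Counter

def word_probabilities(tokens):
    """Estimate P(c) for each word as its relative frequency in tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Tiny made-up corpus, used only for illustration.
sample_tokens = "the cat sat on the mat the end".split()
probs = word_probabilities(sample_tokens)

print(probs["the"])   # 3 of the 8 tokens are "the" -> 0.375
print(probs["cat"])   # 1 of the 8 tokens -> 0.125
```

Because every candidate is compared against the same total, the raw counts rank the words in the same order as these probabilities.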

P(w|c) is the probability of producing the misspelling w when trying to spell c. Estimating it properly requires statistical data. To simplify the problem, however, we assume that the closer two words are in form, the more likely one is to be mistyped as the other, and the larger P(w|c) is. For example, getting one letter wrong is more likely than getting two letters wrong. If you want to spell the word hello, misspelling it as hallo (one letter different) is more likely than misspelling it as haallo (two letters different).

Therefore, we only need to find the words closest in form to the input word and then pick the one with the highest frequency. This gives the maximum of P(w|c) * P(c).
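The selection rule can be sketched with a few invented numbers. In the example below, the misspelling "thew", the three candidates, and all probabilities are assumptions chosen purely for illustration; every candidate is one edit away, so each gets the same made-up P(w|c).

```python
def best_candidate(candidates):
    """Return the c that maximizes P(w|c) * P(c).

    candidates maps each word c to a pair (p_w_given_c, p_c).
    """
    return max(candidates, key=lambda c: candidates[c][0] * candidates[c][1])

# Invented values for the misspelling w = "thew".
candidates = {
    "the":   (0.01, 0.05),
    "thaw":  (0.01, 0.0001),
    "threw": (0.01, 0.001),
}

print(best_candidate(candidates))   # the
```

When P(w|c) is equal across candidates, as here, the product is decided by P(c) alone, which is why the simplified method only has to compare word frequencies among equally close words.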

II. Algorithm

The simplest algorithm requires only four steps.

Step 1: create a large enough text library.

There are some free sources on the Internet, such as Project Gutenberg, Wiktionary, and the British National Corpus.

Step 2: extract every word in the text library and count how often each one occurs.

Step 3: generate all the similar spellings of the word entered by the user.

A "similar spelling" is one whose edit distance from the input word does not exceed 2. That is, the two words differ by only one or two letters, and one can be turned into the other with one or two of four operations: deleting a letter, transposing (swapping) two adjacent letters, replacing a letter, or inserting a letter.
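To make the notion of edit distance concrete, here is a minimal sketch that measures it with the four operations above. It uses the optimal string alignment variant of the Damerau-Levenshtein distance, which is an assumption made for this example: the article's program never computes a distance directly, and in rare cases this variant gives a larger value than chaining single edits would. The name osa_distance is hypothetical.

```python
def osa_distance(a, b):
    """Edit distance with delete, insert, replace, and adjacent transposition
    (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i letters of a
    for j in range(cols):
        d[0][j] = j          # insert all j letters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete
                          d[i][j - 1] + 1,          # insert
                          d[i - 1][j - 1] + cost)   # replace or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[rows - 1][cols - 1]

print(osa_distance("seperate", "separate"))   # 1
print(osa_distance("hello", "haallo"))        # 2
print(osa_distance("reciet", "receipt"))      # 2
```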

Step 4: compare the frequencies in the text library of all the similarly spelled words. The word with the highest frequency is taken as the correct spelling.

According to Peter Norvig's tests, the accuracy of this algorithm is about 60% to 70% (roughly 6 out of 10 spelling errors are corrected). Although not ideal, this is acceptable; after all, the algorithm is simple and extremely fast. (The last section of this article details its defects.)

III. Code

We use the Python language to implement the algorithm in the previous section.

Step 1: save the text files downloaded from the Internet as big.txt. This step does not require programming.

Step 2: load Python's regular expression module (re) and the collections module, which will be used later.

import re, collections

Step 3: define the words() function to extract every word in the text library.

def words(text): return re.findall('[a-z]+', text.lower())

lower() converts all text to lowercase so that the same word is not counted as two different words because of differences in case.
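A quick check of words() on a made-up sentence shows what the regular expression keeps. The sample string is an assumption for illustration only.

```python
import re

def words(text):
    # Lowercase first, then keep only runs of the letters a-z.
    return re.findall('[a-z]+', text.lower())

print(words("The Cat's hat, THE END."))
# ['the', 'cat', 's', 'hat', 'the', 'end']
```

Note that punctuation and digits act as separators, so "Cat's" is split into 'cat' and 's'. For a simple frequency count this is usually tolerable.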

Step 4: define a train() function to build a dictionary. Each word in the text library is a key of this dictionary, and the corresponding value is the number of times the word appears in the text library.

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

collections.defaultdict(lambda: 1) means that the default count of every word is 1. This handles words that do not appear in the text library: the fact that a word is missing from the library does not prove that it does not exist, so we give every word a default count of 1 and add 1 each time it appears.
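The behaviour of the default value can be seen on a three-word sample. The sample list and the lookups of 'dog' and 'emu' are invented for illustration.

```python
import collections

def train(features):
    model = collections.defaultdict(lambda: 1)   # unseen words start at 1
    for f in features:
        model[f] += 1
    return model

model = train(['the', 'cat', 'the'])

print(model['the'])       # 3  (default 1 plus two occurrences)
print(model['cat'])       # 2  (default 1 plus one occurrence)
print('dog' in model)     # False
print(model['dog'])       # 1  (indexing creates the entry with the default)
print('dog' in model)     # True
print(model.get('emu'))   # None  (get() does not apply the default)
```

Indexing an unseen word silently adds it to the dictionary, so membership tests with in and lookups with get() are the safer way to ask whether a word was really seen in the text library.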

Step 5: use the words() and train() functions to build the word frequency dictionary described in the previous step, and store it in the variable NWORDS.

NWORDS = train(words(open('big.txt').read()))

Step 6: define the edits1() function to generate all strings whose edit distance from the input word is 1.

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

The meanings of the variables in edits1() are as follows:

(1) splits: split word into two parts at every position in turn. For example, 'abc' is split into [('', 'abc'), ('a', 'bc'), ('ab', 'c'), ('abc', '')].

(2) deletes: all the strings formed by deleting one letter of word in turn. For example, the deletes for 'abc' are ['bc', 'ac', 'ab'].

(3) transposes: all the strings formed by swapping two adjacent letters of word in turn. For example, the transposes for 'abc' are ['bac', 'acb'].

(4) replaces: all the strings formed by replacing each letter of word in turn with each of the 26 letters. For example, the replaces for 'abc' are ['abc', 'bbc', 'cbc', ..., 'abx', 'aby', 'abz'], which contains 78 strings (26 × 3), including copies of word itself.

(5) inserts: all the strings formed by inserting one letter at each position of word, including the beginning and the end. For example, the inserts for 'abc' are ['aabc', 'babc', 'cabc', ..., 'abcx', 'abcy', 'abcz'], which contains 104 strings (26 × 4).

Finally, edits1() returns the set of all deletes, transposes, replaces, and inserts, that is, all strings whose edit distance from word is 1. For a word of n letters, the four lists contain 54n + 25 strings in total (n + (n - 1) + 26n + 26(n + 1)), a few of which are duplicates.
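The counts quoted above can be checked directly. The helper below, edit1_parts, is a hypothetical variant of edits1() that returns the four lists separately instead of merging them into a set; it exists only for this check.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit1_parts(word):
    """Return the four candidate lists separately (illustration only)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return deletes, transposes, replaces, inserts

parts = edit1_parts('abc')
sizes = [len(p) for p in parts]
total = sum(sizes)
distinct = set().union(*parts)

print(sizes)                  # [3, 2, 78, 104]
print(total, 54 * 3 + 25)     # 187 187
print(len(distinct) < total)  # True: the set drops duplicates
```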

Step 7: define the edits2() function to generate all strings whose edit distance from word is 2.

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

However, this generates on the order of (54n + 25) × (54n + 25) strings, which is too many. We therefore replace edits2() with known_edits2(), which keeps only the strings that appear in the text library.

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
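The filtering effect of known_edits2() can be tried out on a toy dictionary. In the sketch below the contents of NWORDS are invented counts standing in for a real text library; everything else follows the functions defined in this section.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Toy stand-in for the word frequency dictionary (made-up counts).
NWORDS = {'receipt': 5, 'recite': 3, 'recipe': 4, 'reception': 7}

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Apply two rounds of single edits, keeping only dictionary words.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

print(sorted(known_edits2('reciet')))
# ['receipt', 'recipe', 'recite']
```

'recite' is only one edit away from 'reciet' but still shows up, because replacing a letter with itself counts as an edit in replaces. 'reception' is more than two edits away and is filtered out.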

Step 8: define the correct() function to select, from all the candidate words, the word the user most likely meant to spell.

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

We adopt the following rules:

(1) If word is already in the text library, it is considered correctly spelled and is returned as is;

(2) If word is not in the text library, return the most frequent library word among those at edit distance 1;

(3) If no string at edit distance 1 is in the text library, return the most frequent library word among those at edit distance 2;

(4) If none of the above three rules yields a result, return word as is.
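The four rules rely on the fact that an empty set is falsy in Python, so a chain of or expressions falls through to the first non-empty tier. The sketch below isolates that mechanism; the function pick, the word counts, and the candidate sets passed to it are all hypothetical. In the real program the tiers are computed lazily, so the edit distance 2 search only runs when the earlier tiers are empty.

```python
# Made-up counts standing in for the word frequency dictionary.
counts = {'recite': 3, 'receipt': 5, 'recipe': 4}

def pick(exact, distance1, distance2, word):
    # The first non-empty tier wins; later tiers are ignored.
    candidates = exact or distance1 or distance2 or [word]
    return max(candidates, key=counts.get)

# Rule 2: a distance-1 word exists, so distance-2 words are never considered,
# even though 'receipt' has a higher count than 'recite'.
print(pick(set(), {'recite'}, {'receipt', 'recipe', 'recite'}, 'reciet'))  # recite

# Rule 3: nothing at distance 1, so the most frequent distance-2 word wins.
print(pick(set(), set(), {'receipt', 'recipe'}, 'reciet'))                 # receipt

# Rule 4: nothing found at all, so the input is returned unchanged.
print(pick(set(), set(), set(), 'reciet'))                                 # reciet
```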

At this point, all the code is complete. Put together, it is just over 20 lines in total.

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The usage is as follows:

>>> correct('speling')
'spelling'

>>> correct('korrecter')
'corrector'

IV. Defects

The algorithm above has some defects. Before putting it into a production environment, it would need improvements in the following areas.

(1) The text library must be accurate and cannot contain misspelled words.

If the user enters a misspelling that happens to appear in the text library, it is treated as a correct spelling.

(2) There is no solution for new words that are not in the text library.

If the user enters a new word that is not in the text library, it is treated as a misspelling and "corrected".

(3) The program returns a word at edit distance 1 whenever one exists, but in some cases the correct word is at edit distance 2.

For example, if the user enters reciet, it is corrected to recite (edit distance 1), but the word the user really meant is receipt (edit distance 2). In other words, the rule that a shorter edit distance means a more likely correction does not hold in every case.

(4) The edit distance of some common misspellings is greater than 2.

The program cannot correct such errors. Below are some examples; on each line, the first word is the correct spelling and the second is a common misspelling.

purple perpul
curtains courtens
minutes muinets
successful sucssuful
inefficient ineffiect
availability avaiblity
dissension desention
unnecessarily unessasarily
necessary nessasary
unnecessary unessessay
night nite
assessing accesing
necessitates nessisitates

(5) The word entered is correctly spelled, but the user actually meant a different word.

For example, if the user enters where, the word is correctly spelled and the program does not correct it. However, the user may really have meant were and typed an extra h by accident.

(6) The program returns the most frequent word, but the user actually meant a different word.

For example, if the user enters ther, the program returns the, because that word occurs so frequently. However, the user may really have meant their and left out the i. In other words, the most frequent word is not necessarily the word the user meant.

(7) Some words have more than one accepted spelling, and the program cannot tell which one applies.

For example, British and American spellings differ. When a British user enters 'humur', it should be corrected to 'humour'; when an American user enters 'humur', it should be corrected to 'humor'. Our program, however, corrects it to 'humor' in both cases.

(End)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
