Implementing a Word Spelling Checker in C# with the Bayesian Algorithm

Source: Internet
Author: User

I have recently been studying the Bayesian algorithm. It is applied in many areas, including spelling checking, text classification, spam filtering, and Chinese word segmentation. Based on my own needs, I decided to implement the first two; this article covers the spelling checker.

For background on Bayesian learning and spelling correction, see the original article, translated by Xu Yi.

Procedure:

1. Calculate the prior probability P(h) of each word from its frequency in the training corpus. The training corpus is a downloaded text file containing millions of words.

2. Calculate the conditional probability P(D | h), that is, the probability that the input D was typed when the intended word (the guess) was h. To simplify this step, the concept of edit distance is used: all possible edits at an edit distance of 1 from the input are generated as candidates. For more information on edit distance, search for the term online.

3. By Bayes' rule, P(h | D) = P(h) * P(D | h) / P(D). The probability P(D) of the input is the same for every candidate h, so it does not affect the ranking: P(h | D) ∝ P(h) * P(D | h). The candidate with the largest value is the most likely spelling.
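
The three steps above can be sketched in C# roughly as follows. This is a minimal illustration, not the article's own Train and Task classes, whose code is not shown. The names SpellSketch, CountWords, Edits1, and Correct are hypothetical, and returning a known word unchanged is an assumption. Because P(D | h) is taken to be the same for every candidate at edit distance 1, ranking by P(h) * P(D | h) reduces to ranking by corpus count.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch; names are illustrative, not from the article's code.
public static class SpellSketch
{
    private const string Alphabet = "abcdefghijklmnopqrstuvwxyz";

    // Step 1: count how often each lowercase word occurs in the corpus text.
    public static Dictionary<string, int> CountWords(string corpus)
    {
        var counts = new Dictionary<string, int>();
        foreach (Match m in Regex.Matches(corpus.ToLowerInvariant(), "[a-z]+"))
        {
            counts.TryGetValue(m.Value, out int seen); // seen is 0 for a new word
            counts[m.Value] = seen + 1;
        }
        return counts;
    }

    // Step 2: every string one delete, transpose, replace, or insert away.
    public static HashSet<string> Edits1(string word)
    {
        var edits = new HashSet<string>();
        for (int i = 0; i <= word.Length; i++)
        {
            string left = word.Substring(0, i);
            string right = word.Substring(i);
            if (right.Length > 0)
                edits.Add(left + right.Substring(1));                       // delete
            if (right.Length > 1)
                edits.Add(left + right[1] + right[0] + right.Substring(2)); // transpose
            foreach (char c in Alphabet)
            {
                if (right.Length > 0)
                    edits.Add(left + c + right.Substring(1));               // replace
                edits.Add(left + c + right);                                // insert
            }
        }
        return edits;
    }

    // Step 3: pick the known candidate with the highest count (highest prior).
    public static string Correct(string input, Dictionary<string, int> counts)
    {
        string word = input.ToLowerInvariant();
        if (counts.ContainsKey(word)) return word; // assumption: known words are kept
        string best = word;                        // fall back to the input itself
        int bestCount = 0;
        foreach (string candidate in Edits1(word))
        {
            if (counts.TryGetValue(candidate, out int count) && count > bestCount)
            {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }
}
```

For example, with counts built from the text "spelling is hard but spelling checkers help", SpellSketch.Correct("speling", counts) returns "spelling", since that is the only candidate at edit distance 1 that appears in the corpus.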

Note:

1. Words that do not appear in the corpus are smoothed: each is assigned a probability of 1/N, where N is the total number of word occurrences in the training sample.

2. The conditional probability is 1/M, where M is the number of candidate words. For example, every candidate for "speling" has a conditional probability of 1/290, where 290 is the number of candidates at an edit distance of 1. (The 26 letters could also be arranged in a matrix giving the distance between each pair of keys on the keyboard, which I believe would be more convincing. Research along these lines was done in Russia in 1973.)

3. The training corpus gets simple preprocessing: all text is converted to lowercase.

4. Type exit to quit the program.
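
Notes 1 to 3 amount to a simple scoring rule, shown here as a hedged C# sketch. The class and method names (BayesScore, Prior, Likelihood, Score) are hypothetical and not taken from the article's code. The 1/N floor and the uniform 1/M likelihood follow the notes above, and the corpus is assumed to be non-empty.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; names are illustrative, not from the article's code.
public static class BayesScore
{
    // Note 1: P(h) = count(h) / N; a word absent from the corpus gets 1 / N.
    public static double Prior(string h, IReadOnlyDictionary<string, int> counts, long totalWords)
    {
        // Note 3: the corpus is lowercased, so look the word up in lowercase too.
        counts.TryGetValue(h.ToLowerInvariant(), out int count);
        return Math.Max(count, 1) / (double)totalWords;
    }

    // Note 2: every candidate is treated as equally likely, P(D | h) = 1 / M.
    public static double Likelihood(int candidateCount)
    {
        return 1.0 / candidateCount;
    }

    // Unnormalized posterior: P(h | D) is proportional to P(h) * P(D | h).
    public static double Score(string h, IReadOnlyDictionary<string, int> counts,
                               long totalWords, int candidateCount)
    {
        return Prior(h, counts, totalWords) * Likelihood(candidateCount);
    }
}
```

With a toy corpus where "spelling" occurs 3 times out of N = 10 words and M = 290 candidates, the score for "spelling" is 0.3 / 290, while a word that never occurs scores 0.1 / 290. Since 1/M is constant across candidates, it changes the scale of the scores but not their order.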

Key code:

Entry:

// Requires: using System; using System.Collections;
static void Main(string[] args)
{
    // bool trainFlag = false; // indicates whether the corpus has been trained
    string currentPath = Environment.CurrentDirectory + "/big.txt";

    double sumWordNum = 0.0; // total number of words
    Hashtable htProbability = new Hashtable(); // probability of each word in the training corpus
    Hashtable htTmp = new Hashtable(); // temporary table: number of times each word appears in the training corpus

    Hashtable recommandWordHt = new Hashtable(); // correct words recommended after the spelling check

    Train train = new Train();
    Task task = new Task();

    Console.WriteLine("System is now training the Corpus ....");
    // ... (the rest of the listing is not included in the source)
}
