Spelling correction example:
The problem: the user has entered a word that is not in the dictionary. We need to guess: "What word did the user really want to type?"
P(the word we guess the user intended | the word actually entered)
The word actually entered by the user is denoted D (D for data, i.e., the observed data).
Candidate guesses: P(H1 | D), P(H2 | D), P(H3 | D), ...
Written uniformly as: P(H | D)
$P(H \mid D) = \frac{P(H) \times P(D \mid H)}{P(D)}$
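This is just Bayes' theorem; it follows in one step from the definition of conditional probability:
$P(H \mid D) = \frac{P(H, D)}{P(D)} = \frac{P(H) \times P(D \mid H)}{P(D)}$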
For example:
P(H1) is the probability (proportion) of the word "the" in the dictionary.
P(H2) is the probability (proportion) of the word "than" in the dictionary.
P(H) is also called the prior probability: the probability of the word itself.
P(D | H) is the probability of the word actually entered, given that guess H is what the user intended.
P(D) is the probability of the word the user actually entered. It is not needed for this problem and can be divided out later; it is in fact a constant.
For the different specific guesses H1, H2, H3, ..., P(D) is the same, so we can ignore this constant when comparing P(H1 | D) and P(H2 | D).
P(H | D) ∝ P(H) × P(D | H)
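Concretely, when comparing two guesses for the same input D, the ratio of their posteriors does not involve P(D) at all:
$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(H_1) \times P(D \mid H_1)}{P(H_2) \times P(D \mid H_2)}$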
For given observed data, how good a guess is depends on two things: the probability of the guess itself (the prior probability) and the probability that this guess would generate the data we actually observed (the likelihood).
That is, "the probability of a candidate word given the word actually typed" is proportional to the product of "the probability of the candidate word itself (its proportion in the dictionary)" and "the probability of the actually typed word given that candidate".
Bayesian computation: P(H) × P(D | H), where P(H) is the prior probability of a specific guess.
For example, if the user types tlp, did they mean top or tip? When maximum likelihood alone cannot make a decisive call, the prior probability steps in: "Since you cannot decide, let me tell you that, in general, top occurs far more often, so it is more likely that the user wanted to type top."
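A minimal Python sketch of this comparison, assuming a made-up word-frequency table as the prior P(H) and a crude edit-distance-based likelihood for P(D | H); both are illustrative assumptions, not a real spell checker's model:

```python
# Hypothetical word frequencies (counts in some corpus) serve as the prior P(H).
WORD_FREQ = {"top": 5000, "tip": 800, "the": 60000, "than": 7000}
TOTAL = sum(WORD_FREQ.values())

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def likelihood(typed, candidate):
    # Crude assumption: each additional edit makes the typo 100x less likely.
    return 0.01 ** edit_distance(typed, candidate)

def correct(typed):
    # Score every candidate by P(H) * P(D | H) and pick the highest.
    return max(WORD_FREQ, key=lambda w: (WORD_FREQ[w] / TOTAL) * likelihood(typed, w))

print(correct("tlp"))  # "top" beats "tip" here mainly because of its larger prior
```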
Model Comparison Theory
Maximum likelihood: the guess most consistent with the observed data, i.e., the one that maximizes P(D | H), has the advantage.
Occam's razor: models with a large prior P(H) have the advantage.
If you toss a coin (only once) and observe heads, then by maximum likelihood estimation we should guess that the coin's probability of landing heads is 1, because that is the guess that maximizes P(D | H).
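In symbols: let θ be the coin's probability of heads. With the single observation D = heads, the likelihood is P(D | θ) = θ, and maximizing it over θ gives:
$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta \in [0, 1]} P(D \mid \theta) = \arg\max_{\theta \in [0, 1]} \theta = 1$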
Occam's razor:
If N points on the plane roughly, but not exactly, form a straight line, we can fit them with a straight line (model 1), a second-order polynomial (model 2), a third-order polynomial (model 3), and so on. In particular, a polynomial of order N-1 is guaranteed to pass exactly through all N data points. Which of these models is the most reliable?
Occam's razor: the higher the order of the polynomial, the less common it is, so its prior P(H) is smaller.
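A small sketch of this overfitting intuition with numpy.polyfit; the synthetic data, noise level, and test point are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = np.linspace(0, 1, N)
y = 2 * x + 1 + rng.normal(scale=0.1, size=N)  # roughly, but not exactly, a line

line = np.polyfit(x, y, 1)        # model 1: straight line
wiggly = np.polyfit(x, y, N - 1)  # degree N-1: passes (nearly) exactly through all N points

# The high-degree fit matches the training points better, but typically
# extrapolates far worse outside the range of the data.
x_new = 1.2
print("true value:          ", 2 * x_new + 1)
print("line prediction:     ", np.polyval(line, x_new))
print("degree-7 prediction: ", np.polyval(wiggly, x_new))
```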
Spam filtering example:
Problem: given an email, determine whether it is spam.
Let D denote this email; note that D consists of N words. We use H+ for "spam" and H- for "normal email".
$P(H^{+} \mid D) = \frac{P(H^{+}) \times P(D \mid H^{+})}{P(D)}$
$P(H^{-} \mid D) = \frac{P(H^{-}) \times P(D \mid H^{-})}{P(D)}$
P(D) is the same for both P(H+ | D) and P(H- | D), so it can be dropped when comparing them.
Prior probabilities: the two priors P(H+) and P(H-) are easy to obtain; simply compute the proportions of spam and normal emails in an email library.
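A sketch of how these priors come from simple counting; the labels list is a hypothetical stand-in for a labeled email library:

```python
# Hypothetical labels for an email library: True = spam, False = normal.
labels = [True, True, False, True, False, False, False, True, True, False]

p_spam = sum(labels) / len(labels)  # P(H+)
p_ham = 1 - p_spam                  # P(H-)
print(p_spam, p_ham)                # 0.5 0.5 for this toy list
```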
D consists of N words d1, d2, ..., dN, so P(D | H+) = P(d1, d2, ..., dN | H+).
P(d1, d2, ..., dN | H+) is the probability that a spam email is exactly identical to the current email. This probability is tiny: what is the chance that a spam email happens to match the actual email word for word? Of course it is extremely small.
Expanding P(d1, d2, ..., dN | H+) with the chain rule: P(d1 | H+) × P(d2 | d1, H+) × P(d3 | d2, d1, H+) × ...
In words: the probability that word d1 appears given the email is spam, times the probability that d2 appears given that it is spam and d1 has already appeared, times the probability that d3 appears given that it is spam and d1, d2 have already appeared, and so on.
Naive Bayes:
P(d1 | H+) × P(d2 | d1, H+) × P(d3 | d2, d1, H+) × ...
Assume that di and di-1 are completely independent of each other given the class (the Naive Bayes assumption: features are conditionally independent).
This simplifies to P(d1 | H+) × P(d2 | H+) × P(d3 | H+) × ...
To compute P(d1 | H+) × P(d2 | H+) × P(d3 | H+) × ..., simply count how frequently each word di appears in spam emails.
That is: the probability of d1 appearing in spam × the probability of d2 appearing in spam × the probability of d3 appearing in spam × ...
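Putting the pieces together, here is a minimal Naive Bayes spam filter sketch in Python. The tiny training set is made up for illustration, and the add-one (Laplace) smoothing and log-probabilities are standard practical additions not covered in the text above:

```python
from collections import Counter
import math

# Hypothetical training emails: (list of words, is_spam).
train = [
    ("win money now".split(), True),
    ("cheap money offer".split(), True),
    ("meeting schedule tomorrow".split(), False),
    ("project meeting notes".split(), False),
]

spam_counts, ham_counts = Counter(), Counter()
n_spam = n_ham = 0
for words, is_spam in train:
    if is_spam:
        spam_counts.update(words)
        n_spam += 1
    else:
        ham_counts.update(words)
        n_ham += 1

vocab = set(spam_counts) | set(ham_counts)

def log_score(words, counts, n_class):
    # log P(H) + sum_i log P(d_i | H), with add-one smoothing for unseen words.
    total = sum(counts.values())
    score = math.log(n_class / (n_spam + n_ham))
    for w in words:
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

def classify(email):
    words = email.split()
    spam = log_score(words, spam_counts, n_spam)  # proportional to log P(H+ | D)
    ham = log_score(words, ham_counts, n_ham)     # proportional to log P(H- | D)
    return "spam" if spam > ham else "normal"

print(classify("cheap money offer"))       # classified as spam on this toy data
print(classify("meeting notes tomorrow"))  # classified as normal on this toy data
```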