The core algorithm of machine learning: the Bayesian method

Source: Internet
Author: User
Tags: machine learning, Bayes' theorem, Bayesian inference, conditional probability, prior probability

1. Bayes' theorem

Bayes' theorem has become one of the core tools of machine learning, with applications ranging from spell checking, language translation, shipwreck search and rescue, biomedicine, disease diagnosis, mail filtering, and text classification to criminal investigation and industrial production. It is also the foundation of many machine learning algorithms, so it is worth understanding properly.

Bayes' theorem is named after the British scholar Thomas Bayes. In 1763, Richard Price published Bayes' posthumous essay "An Essay towards solving a Problem in the Doctrine of Chances", which brought the theorem to the world's attention.

Bayes' theorem was proposed to solve the problem of "inverse probability". Forward problems are common. For example, an opaque bag contains M black balls and N white balls; if we draw one ball at random, what is the probability that it is black? Everyone can see that the answer is M/(M+N).

There are many such examples in everyday life, such as population flow statistics and financial statistics. What these problems have in common is that we already know the distribution of the whole population in advance and then compute probabilities from it; these are "forward probability" problems. But what if we do not have information about all the samples? (Such cases abound: in physics, for instance, we cannot observe the running state of every electron, so we can only run experiments, observe what happens most of the time, and build the model that best explains the observations.) If we still want to estimate the probabilities involved, this is where Bayes' theorem comes in.

Again consider a bag containing black and white balls. We draw some balls at random and then try to infer the actual proportions of balls in the bag from what we have drawn. At this point several models (guesses) may explain the observations; as the number of balls drawn increases, our models become more and more accurate and get closer and closer to the actual situation, and we then pick the model that fits the observations best. To sum up: comparing different models amounts to computing their posterior probabilities (given that the event has occurred, the probability that each candidate cause produced it); for a continuous space of guesses we compute a probability density function instead; and if we do not take the prior probability of each model into account (the probability based on past experience and analysis), we fall back on maximum likelihood estimation. This is the core of Bayesian thinking.
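To make the idea concrete, here is a minimal Python sketch of this updating process. The candidate proportions of black balls and the observed draws are invented for illustration; the point is only the update rule (multiply the prior by the likelihood of each draw, then renormalize).

    # Candidate "models": possible proportions of black balls in the bag
    # (these hypotheses and the draws below are made up for illustration).
    candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
    posterior = {p: 1.0 / len(candidates) for p in candidates}  # uniform prior

    draws = ["black", "black", "white", "black"]  # hypothetical observations

    for ball in draws:
        # multiply by the likelihood of this draw under each candidate model
        for p in candidates:
            posterior[p] *= p if ball == "black" else 1.0 - p
        # renormalize so the probabilities sum to 1
        total = sum(posterior.values())
        for p in candidates:
            posterior[p] /= total

    for p in candidates:
        print(f"P(proportion = {p} | draws) = {posterior[p]:.3f}")

After each draw the posterior concentrates on the proportions most consistent with what has been observed, which is exactly the "models become more and more accurate" behaviour described above.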

Let me give an example. In a school, 60% of the students are boys and 40% are girls. Boys always wear trousers, while half of the girls wear trousers and half wear skirts. With this information we can easily calculate, for a randomly chosen student, the probability of wearing trousers and the probability of wearing a skirt; this is the "forward probability" calculation mentioned earlier. Now suppose you are walking on campus and a student wearing trousers walks towards you (unfortunately you are highly near-sighted, so you can only see that he or she is wearing trousers but cannot tell the gender). How likely is it that this student is a boy?

Some cognitive science research (see "Decision and Judgment" and Chapter 12 of "Rationality for Mortals": even children can solve Bayesian problems) shows that we are not good at Bayesian problems stated formally in terms of probabilities, but we handle the equivalent problem very well when it is presented in terms of frequencies. So let us restate the problem: you walk around campus at random and meet N people wearing trousers (still assuming you cannot directly observe their gender); how many of these N people are girls and how many are boys?

You might say this is easy: work out how many people in the school wear trousers, then work out how many of them are girls, and we are done, right?

Assume the school has H students in total, of whom 60% are boys (who all wear trousers) and 40% are girls, of whom only 50% wear trousers.

We first calculate the number of people wearing trousers:

H*P(Boy)*P(Pants|Boy) + H*P(Girl)*P(Pants|Girl), where P(Boy) is the proportion of boys and P(Pants|Boy) is the proportion of boys who wear trousers (100% in this problem), and similarly for girls. Of these, H*P(Girl)*P(Pants|Girl) are girls wearing trousers. Dividing the number of trouser-wearing girls by the total number of trouser-wearers, we get:

Equation 1

P(Girl|Pants) = P(Girl)*P(Pants|Girl) / [P(Boy)*P(Pants|Boy) + P(Girl)*P(Pants|Girl)]
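Plugging in the numbers from the school example (60% boys who all wear trousers, 40% girls of whom half wear trousers), a few lines of Python reproduce the calculation; the variable names are just for illustration:

    # Numbers from the school example above
    p_boy, p_girl = 0.6, 0.4
    p_pants_given_boy, p_pants_given_girl = 1.0, 0.5

    # Equation 1: P(Girl | Pants)
    numerator = p_girl * p_pants_given_girl
    denominator = p_boy * p_pants_given_boy + p_girl * p_pants_given_girl
    print(numerator / denominator)  # 0.2 / 0.8 = 0.25

Note that the total number of students H cancels out, which is why it no longer appears in Equation 1.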

The "boys" and "girls" here can stand for any pair of complementary events, so the general formula is:

Equation 2 (B' is the complement of B, for example girls versus boys)

P(B|A) = P(B)*P(A|B) / [P(B')*P(A|B') + P(B)*P(A|B)]

In fact, the denominator is simply P(Pants), the probability that a randomly chosen person wears trousers, i.e. P(A); the numerator is the probability of being a girl and wearing trousers at the same time, i.e. P(Pants, Girl), or P(A, B). So Equation 2 (a special case of the law of total probability) can be written as Equation 3:

P(B|A)=P(A,B)/P(A)

which can also be written as Equation 4:

P(B|A)*P(A)=P(A,B)

Similarly, we can get P(A|B)*P(B)=P(A,B), so:

P(A|B)*P(B) = P(B|A)*P(A), i.e. Equation 5:

P(A|B) = P(A)*P(B|A)/P(B)
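As a quick check, Equation 5 gives the same answer for the school example: with A = Girl and B = Pants, P(Pants) = 0.6*1.0 + 0.4*0.5 = 0.8, so P(Girl|Pants) = 0.4*0.5/0.8 = 0.25, matching Equation 1.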

Equation 5 (together with the equivalent Equations 3 and 4) is Bayes' theorem. Equation 2 covers the situation in which there are only two classes, such as gender or a coin toss, where only two outcomes are possible. In real life, however, an event may be influenced by many possible causes, so the formula needs to be generalized; the result is the general form of Bayes' theorem.

As shown in Figure 1-1, suppose an event C has two possible influencing factors, A and B, and the size of each area corresponds to the probability of occurrence. To compute the probability P(A|C) that factor A is responsible given that C has occurred, we take the ratio of the area of A∩C to the area of C, that is, P(A∩C)/P(C). P(A∩C) can also be written as P(A, C). Because P(C|A) is the probability that C occurs given A, and P(A) is the probability of A (the area of A), P(A)*P(C|A) is the area of A∩C, i.e. the probability P(A, C) that A and C occur together. So P(A∩C) = P(A)*P(C|A), and similarly P(B∩C) = P(B)*P(C|B). Therefore P(A|C) = P(A)*P(C|A) / [P(A)*P(C|A) + P(B)*P(C|B)]. This is the case where the event has two possible causes; combining all possible causes with the law of total probability gives the general formula:

Equation 6

P(A_i|C) = P(A_i)*P(C|A_i) / [P(A_1)*P(C|A_1) + P(A_2)*P(C|A_2) + ... + P(A_n)*P(C|A_n)]
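As a small illustration of Equation 6, here is a generic Python helper that computes the posterior over any number of mutually exclusive causes; the three causes and their numbers in the example call are invented:

    def posterior(priors, likelihoods):
        # priors:      P(A_1), ..., P(A_n), summing to 1
        # likelihoods: P(C|A_1), ..., P(C|A_n)
        joint = [p * l for p, l in zip(priors, likelihoods)]
        evidence = sum(joint)  # the full probability P(C), i.e. the denominator
        return [j / evidence for j in joint]

    # Hypothetical example: three mutually exclusive causes of event C
    print(posterior([0.5, 0.3, 0.2], [0.1, 0.4, 0.8]))
    # -> approximately [0.152, 0.364, 0.485]

With only two causes this reduces to the P(A|C) formula derived above.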

2. Bayesian inference

2.1 What is Bayesian inference?

Bayesian inference is a statistical method for making decisions under uncertainty. Its distinguishing feature is that it combines prior information with sample information to reach a statistical conclusion. Roughly speaking: I want to know whether event A will occur. Without any prior knowledge, I can only judge that occurrence and non-occurrence are each 50% likely. Fortunately, however, I know that event B has occurred, and experience tells me that B makes A more likely, so I can now judge with more confidence that A will occur with high probability (say 80%) rather than the original 50%. If I learn about more events related to A, I can make an even more accurate judgment. That is Bayesian inference.

Look again at Equation 5: P(A|B) = P(A)*P(B|A)/P(B). Here P(A) is our prior probability, i.e. our judgment of the probability of A before event B occurs. P(A|B) is called the posterior probability, i.e. our re-evaluation of the probability of A after B has occurred.

P(B|A)/P(B) is called the likelihood; it acts as an adjustment factor that moves the estimated probability closer to the true probability.

To explain Bayesian inference more intuitively, here is an example from Wikipedia on drug testing:

Assume that a routine drug test has both a sensitivity and a specificity of 99%: the probability that a drug user tests positive (+) is 99%, and the probability that a non-user tests negative (-) is 99%. Judging only by these figures, the test looks very accurate, but Bayes' theorem reveals a potential problem. Suppose a company tests all of its employees for drugs, and it is known that 0.5% of employees use drugs. What is the probability that an employee who tests positive actually uses drugs?

Let "D" denote the event that an employee uses drugs, "N" the event that an employee does not use drugs, and "+" the event of a positive test. Then:

  1. P(D) is the probability that an employee uses drugs. Without considering anything else, this value is 0.005, because the company's prior statistics show that 0.5% of its employees use drugs; this is the prior probability of D.

  2. P(N) represents the probability that an employee will not take drugs. Obviously, the value is 0.995, which is 1-P(D).

  3. P(+|D) represents the positive detection rate of drug users. This is a conditional probability. Since the positive detection accuracy is 99%, the value is 0.99.

  4. P(+|N) represents the positive detection rate for non-drug users, i.e. the false-positive rate. The value is 0.01: since a non-user tests negative with probability 99%, the probability of being mistakenly tested positive is 1 - 0.99 = 0.01.

  5. P(+) represents the overall probability of testing positive, without considering any other factors. This value is 0.0149, or 1.49%, computed with the law of total probability: the positive rate contributed by drug users (0.5% x 99% = 0.495%) plus the positive rate contributed by non-users (99.5% x 1% = 0.995%). P(+) = 0.0149 is the prior probability of a positive test.

Written as a formula:

P(+) = P(+|D)*P(D) + P(+|N)*P(N) = 0.99*0.005 + 0.01*0.995 = 0.0149

Based on the above, we can calculate P(D|+), the probability that a person who tests positive actually uses drugs:

P(D|+) = P(+|D)*P(D) / P(+) = 0.99*0.005 / 0.0149 ≈ 0.332

Although the accuracy of the drug test is as high as 99%, Bayes' theorem tells us that if someone tests positive, the probability that they actually use drugs is only about 33%; it is more likely that they do not use drugs. When false positives are this frequent relative to actual drug use, the test result alone is not reliable.

Similarly, we can calculate P(D|-), the probability that a person uses drugs but is falsely detected as negative:

P(D|-) = P(-|D)*P(D) / [P(-|D)*P(D) + P(-|N)*P(N)]
       = (0.01*0.005) / (0.01*0.005 + 0.99*0.995)
       ≈ 0.0000507

It can be seen that the probability that a person uses drugs but is mistakenly tested negative is only about 0.005%; in other words, if a person tests negative, we can be essentially sure that they do not use drugs. But if a person tests positive, there is only a 33% probability that they actually use drugs. This is very similar to many medical screening situations: false positives deserve more attention than false negatives!
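To make the arithmetic easy to check, here is a short Python sketch of the whole calculation, using the numbers given above:

    # Prior and test characteristics from the example above
    p_d = 0.005            # P(D): prior probability of drug use
    p_n = 1 - p_d          # P(N)
    p_pos_given_d = 0.99   # P(+|D): sensitivity
    p_pos_given_n = 0.01   # P(+|N): false-positive rate

    # Law of total probability: P(+) and P(-)
    p_pos = p_pos_given_d * p_d + p_pos_given_n * p_n              # 0.0149
    p_neg = (1 - p_pos_given_d) * p_d + (1 - p_pos_given_n) * p_n

    # Bayes' theorem
    p_d_given_pos = p_pos_given_d * p_d / p_pos                    # about 0.332
    p_d_given_neg = (1 - p_pos_given_d) * p_d / p_neg              # about 0.00005

    print(f"P(D|+) = {p_d_given_pos:.3f}")
    print(f"P(D|-) = {p_d_given_neg:.7f}")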

2.2 Bayesian inference and spelling correction

Bayesian inference has many applications, such as language translation, Chinese word segmentation and image recognition. Many blogs use spelling correction as an example; here I will go through the spelling-correction process in some detail.

Peter Norvig, co-author of the classic book Artificial Intelligence: A Modern Approach, once wrote an article on how to write a spell checker/corrector.

During typing, users inevitably make spelling mistakes. What we need to do is suggest one or a few corrected words that the user most likely intended to type. The key question is: which word did the user intend to type?

Stated in mathematical language, what we want to compare is

P(the word we guess the user intended | the word the user actually typed).

Let t denote a word we guess the user intended, and S denote the word the user actually typed. Then

P(t|S)= P(S|t)*P(t)/P(S).

For a given input S, P(S) is the same for every candidate t, so this is equivalent to P(t|S) ∝ P(S|t)*P(t).

Here ∝ means "is proportional to" (it is not the infinity symbol). So maximizing P(t|S) amounts to maximizing P(S|t)*P(t).

P(S|t) is the probability that the user actually typed S when the intended word was t. It differs from word to word, and this is where maximum likelihood estimation comes in. For example, if the user typed thriw, then both throw and thraw are possible corrections; but since i is right next to o on the keyboard, it is much more likely that the intended word was throw rather than thraw, and maximum likelihood estimation picks the most likely word on that basis. Sometimes, however, the likelihood alone does not settle the question, and we also need the prior probability P(t).
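A minimal sketch of this idea, loosely in the spirit of Norvig's corrector but not his actual code: the word_counts dictionary stands in for the prior P(t) (a real corrector would build it from a large corpus), and the likelihood function is a deliberately crude stand-in for the error model P(S|t).

    # Hypothetical word-frequency counts standing in for the prior P(t)
    word_counts = {"throw": 1200, "thraw": 1, "threw": 900}
    total = sum(word_counts.values())

    def prior(t):
        # P(t): how common the candidate word is
        return word_counts.get(t, 0) / total

    def likelihood(s, t):
        # Crude stand-in for P(S|t): the more letters match in place,
        # the more plausible it is that typing t came out as s.
        if len(s) != len(t):
            return 1e-6
        matches = sum(1 for a, b in zip(s, t) if a == b)
        return (matches + 1) / (len(s) + 1)

    def correct(s, candidates):
        # Choose the candidate t that maximizes P(S|t) * P(t)
        return max(candidates, key=lambda t: likelihood(s, t) * prior(t))

    print(correct("thriw", ["throw", "thraw", "threw"]))  # -> throw

A real error model would, for example, weight adjacent-key substitutions such as i for o more heavily; the structure of the decision, argmax over P(S|t)*P(t), stays the same.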

P(t) is the prior probability that the candidate word occurs at all. The candidates t1, t2, t3... are in principle unlimited, and the prior tells us which of them are plausible. For single words this may still feel a bit abstract, so here is an example from sentence parsing:

The girl saw the boy with a telescope.

Using maximum likelihood estimation alone, two parses may be produced:

  1. The girl saw | the boy with a telescope

  2. The girl saw the boy | with a telescope

But common sense tells us which reading is more plausible: a girl seeing a boy who happens to be holding a telescope is a little odd, because the telescope naturally associates with the action of seeing. The most natural interpretation is that the girl used the telescope to see the boy. To reach that conclusion we used our prior knowledge, and that is exactly what P(t) represents.
