Starting Today: Pattern Recognition and Machine Learning (PRML), Section 1.2, Probability Theory (Part 1)

Source: Internet
Author: User

Original work; please credit the source when reposting.




This section covers the core of the probability theory used throughout the book, with an emphasis on understanding uncertainty. I read slowly. I want to get through the book and also write code, but I would like to leave some notes along the way, so I am writing them down. The early sections are very important, so I may devote a post to a single section or even half a section. Later topics may get one post per chapter; for example, Chapter 9 covers the EM algorithm, and one post should be enough for it.

(I'm just getting started, so please bear with me ~.~)

I will derive the important formulas in the book and highlight them in yellow. The same goes for later chapters. If you are reading PRML, I suggest deriving a few of the formulas yourself to deepen your understanding. Anything marked "Note" is a comment I added myself.


Let's start with an example. There are two boxes: a red box containing two apples (drawn in green) and six oranges (drawn in yellow), and a blue box containing three apples and one orange, as shown in Figure 1.9. We randomly pick a box, then randomly take one fruit out of it, note which fruit it is, put it back, and repeat this process many times.

Suppose the red box is picked 40% of the time and the blue box 60% of the time.

In this example, the color of the box is a random variable, which we call B. It takes two values: r (red) and b (blue). The fruit is also a random variable, called F, with values a (apple) and o (orange).

First, we interpret probability from the frequency point of view. The probabilities of selecting the red and the blue box are:

$$p(B = r) = 4/10, \qquad p(B = b) = 6/10$$

Note: A probability must lie in the range [0, 1], and the probabilities of a set of mutually exclusive events that covers all possible outcomes must sum to 1.


Now we can ask questions like these: (1) What is the overall probability of picking an apple? (2) If we picked an orange, what is the probability that it came from the red box?



============================================================

Before answering these questions, let's step back from the example and consider the general case shown in Figure 1.10.


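Suppose we run a large number N of trials on two random variables X and Y, and let n_ij be the number of trials in which X = x_i and Y = y_j. In the figure, the sum of column i is c_i (the total number of trials in which X = x_i), and the sum of row j is r_j (the total number of trials in which Y = y_j). The joint probability of X = x_i and Y = y_j is:

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$$

The marginal probability of X = x_i is:

$$p(X = x_i) = \frac{c_i}{N}$$

Since c_i is the sum of n_ij over j, this gives:

$$p(X = x_i) = \sum_j p(X = x_i, Y = y_j) \tag{1.7}$$

We can also obtain the conditional probability of Y = y_j given X = x_i:

$$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$$

From the definitions above we get the following relationship:

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i) \tag{1.9}$$

Formula (1.7) is called the sum rule and (1.9) is called the product rule. They are the most basic rules of probability theory:

$$p(X) = \sum_Y p(X, Y)$$

$$p(X, Y) = p(Y \mid X)\, p(X)$$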

Note: These two rules are among the most important tools in everything that follows.
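To make the counting argument concrete, here is a minimal sketch in Python that builds the joint, marginal, and conditional probabilities from a table of counts and checks the sum and product rules exactly. The count table is made up for illustration (it is not the data behind Figure 1.10), and in this sketch each row of the table corresponds to one value of X, whereas the figure uses columns for X.

```python
from fractions import Fraction

# Hypothetical counts: n[i][j] = number of trials with X = x_i and Y = y_j.
# Rows index X here (the figure in the book uses columns for X).
n = [
    [3, 1, 2],  # X = x_0
    [2, 4, 0],  # X = x_1
]

N = sum(sum(row) for row in n)   # total number of trials
c = [sum(row) for row in n]      # c_i: number of trials with X = x_i

# Joint p(X = x_i, Y = y_j) = n_ij / N
joint = [[Fraction(n_ij, N) for n_ij in row] for row in n]
# Marginal p(X = x_i) = c_i / N
marginal_x = [Fraction(c_i, N) for c_i in c]
# Conditional p(Y = y_j | X = x_i) = n_ij / c_i
cond_y_given_x = [[Fraction(n_ij, c_i) for n_ij in row] for row, c_i in zip(n, c)]

# Sum rule: p(X = x_i) equals the joint summed over all values of Y.
sum_rule_ok = all(marginal_x[i] == sum(joint[i]) for i in range(len(n)))

# Product rule: joint equals conditional times marginal, for every cell.
product_rule_ok = all(
    joint[i][j] == cond_y_given_x[i][j] * marginal_x[i]
    for i in range(len(n))
    for j in range(len(n[i]))
)

print(sum_rule_ok, product_rule_ok)
```

Using Fraction instead of floats keeps every comparison exact, so the two checks hold as identities rather than up to rounding error.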

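From these two rules, together with the symmetry p(X, Y) = p(Y, X), we obtain Bayes' theorem, which is central to machine learning:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} \tag{1.12}$$

Here p(X) can be expanded as a sum over all values of Y:

$$p(X) = \sum_Y p(X \mid Y)\, p(Y)$$

The denominator can be understood as a normalization constant: it ensures that the conditional probability on the left-hand side of (1.12) sums to 1 over all values of Y.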


============================================================



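Now let's return to the two-box example. (For now we distinguish explicitly between random variables, written in uppercase, and the values they take, written in lowercase; the notation may be simplified later.)

$$p(B = r) = 4/10, \qquad p(B = b) = 6/10$$

$$p(F = a \mid B = r) = 1/4, \qquad p(F = o \mid B = r) = 3/4$$

$$p(F = a \mid B = b) = 3/4, \qquad p(F = o \mid B = b) = 1/4$$

These probabilities follow directly from the setup of the problem. For example, if the box is red, the probability that the fruit is an apple is 2/8 = 1/4. The probabilities under the same condition sum to 1. Now we can answer the first question, the probability of drawing an apple:

$$p(F = a) = p(F = a \mid B = r)\, p(B = r) + p(F = a \mid B = b)\, p(B = b) = \frac{1}{4} \times \frac{4}{10} + \frac{3}{4} \times \frac{6}{10} = \frac{11}{20}$$

That is, we go through every possible box and add up the probability of picking that box times the probability of drawing an apple from it. Correspondingly, the probability of getting an orange is p(F = o) = 1 - 11/20 = 9/20.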
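The apple and orange probabilities can be reproduced in a few lines of Python. This is only a sketch: the dictionary layout and the helper names p_fruit_given_box and p_fruit are my own choices for illustration, while the numbers come straight from the example.

```python
from fractions import Fraction

# Prior over boxes and the contents of each box, as stated in the example.
p_box = {"r": Fraction(4, 10), "b": Fraction(6, 10)}
fruit_counts = {"r": {"a": 2, "o": 6}, "b": {"a": 3, "o": 1}}


def p_fruit_given_box(fruit, box):
    """p(F = fruit | B = box): fraction of that fruit inside the box."""
    counts = fruit_counts[box]
    return Fraction(counts[fruit], sum(counts.values()))


def p_fruit(fruit):
    """p(F = fruit), using the product rule per box and the sum rule over boxes."""
    return sum(p_fruit_given_box(fruit, box) * p_box[box] for box in p_box)


p_apple = p_fruit("a")
p_orange = p_fruit("o")
print(p_apple, p_orange)  # 11/20 9/20
```

Each term inside the sum is a joint probability p(F, B) obtained from the product rule; summing over the boxes is the sum rule.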

Now for the second question: if we got an orange, what is the probability that it came from the red box?

By Bayes' theorem:

$$p(B = r \mid F = o) = \frac{p(F = o \mid B = r)\, p(B = r)}{p(F = o)} = \frac{3}{4} \times \frac{4}{10} \times \frac{20}{9} = \frac{2}{3} \tag{1.23}$$

The answer follows easily from Bayes' theorem, and everything we need is available from the basic quantities above. This example involves several concepts. For instance, we have an estimate of which box is selected, namely p(B). This is called the prior probability, because it is known before we observe the fruit.

The second question asks for the probability that the box is red given that an orange was drawn. This estimate is called the posterior probability, because it is obtained after we observe the random variable F. Notice that once we have an observation, the conclusion can differ from what the prior alone suggests. The prior tells us that the blue box is picked 60% of the time, but formula (1.23) shows that once an orange is observed, the probability of the red box is 2/3, which is greater than that of the blue box. This happens because oranges make up a much larger share of the red box than of the blue box.
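Since this post started from the frequency view of probability, the posterior can also be checked by simulation: repeat the experiment many times, keep only the trials where an orange was drawn, and see how often the box was red. The sketch below assumes a fixed random seed and 200,000 trials, both arbitrary choices of mine; the function name is also just illustrative.

```python
import random
from fractions import Fraction


def simulate_posterior_red_given_orange(num_trials, seed=0):
    """Estimate p(B = r | F = o) as a relative frequency over simulated trials."""
    rng = random.Random(seed)
    boxes = {"r": ["a"] * 2 + ["o"] * 6, "b": ["a"] * 3 + ["o"] * 1}
    orange_draws = 0
    red_and_orange = 0
    for _ in range(num_trials):
        box = "r" if rng.random() < 0.4 else "b"  # red box 40% of the time
        fruit = rng.choice(boxes[box])            # uniform draw from the box
        if fruit == "o":
            orange_draws += 1
            if box == "r":
                red_and_orange += 1
    return red_and_orange / orange_draws


# Exact posterior from Bayes' theorem, for comparison.
exact = Fraction(3, 4) * Fraction(4, 10) / Fraction(9, 20)
estimate = simulate_posterior_red_given_orange(200_000)
print(exact, round(estimate, 3))
```

The simulated frequency should land close to the exact value 2/3, and it gets closer as the number of trials grows.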

If the joint distribution of two random variables equals the product of their marginal distributions, that is, p(X, Y) = p(X) p(Y), the two variables are said to be independent. In that case p(Y | X) = p(Y).



1.2.1 Probability Density

Everything so far has been about discrete variables. For continuous variables we need to reconsider how probability is defined.

If the probability that a real-valued continuous variable x falls in the interval (x, x + δx) is given by p(x) δx as δx tends to 0, then p(x) is called the probability density over x. The probability that x lies in an interval (a, b) is then defined as:

$$p(x \in (a, b)) = \int_a^b p(x)\, dx$$

Note: The lowercase p is used for both quantities, which can be confusing. The p on the left-hand side denotes a probability, while the p(x) inside the integral denotes a probability density.

(For a continuous variable, the probability that x takes one specific value is not a useful notion, because there are infinitely many possible values. We always mean the probability of falling into some interval.)

The cumulative distribution function is defined as the probability that x lies in the interval (-∞, z):

$$P(z) = \int_{-\infty}^{z} p(x)\, dx$$

It satisfies P'(x) = p(x). Figure 1.12 plots the probability density p(x) (lowercase) and the cumulative distribution function P(x) (uppercase). The green area is the probability of x falling in the small interval (x, x + δx).
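The relationship between a density and its cumulative distribution can be checked numerically. The sketch below assumes an exponential density p(x) = exp(-x) for x >= 0, chosen only because its cumulative distribution has a simple closed form; it is not the density plotted in Figure 1.12. The integrate helper is a plain midpoint rule written for this example.

```python
import math


def density(x):
    """Assumed example density: p(x) = exp(-x) for x >= 0, and 0 otherwise."""
    return math.exp(-x) if x >= 0 else 0.0


def cdf(z):
    """Closed-form cumulative distribution P(z) for the density above."""
    return 1.0 - math.exp(-z) if z >= 0 else 0.0


def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over (a, b)."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width


# Probability of an interval: integral of the density equals P(b) - P(a).
a, b = 0.5, 2.0
prob_interval = integrate(density, a, b)
prob_from_cdf = cdf(b) - cdf(a)

# Derivative of the cumulative distribution recovers the density: P'(x) = p(x).
x, h = 1.0, 1e-5
cdf_slope = (cdf(x + h) - cdf(x - h)) / (2 * h)

print(prob_interval, prob_from_cdf)
print(cdf_slope, density(x))
```

Both printed pairs should agree to many decimal places; the small differences come from the numerical integration and the finite-difference step.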

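The sum rule and product rule discussed earlier also apply to continuous variables, with the sum replaced by an integral:

$$p(x) = \int p(x, y)\, dy$$

$$p(x, y) = p(y \mid x)\, p(x)$$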





1.2.2 Expectation and Variance

Expectation: The average value of a function f(x) under a probability distribution p(x) is called the expectation of f(x), defined as:

$$\mathbb{E}[f] = \sum_x p(x)\, f(x)$$

In the discrete case, the expectation is a sum over all possible values, each weighted by its probability. For continuous variables, we use the corresponding integral:

$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \tag{1.34}$$

Note: The two lowercase p symbols are different. In the discrete formula p(x) is a probability, while in (1.34) p(x) is a probability density.

Intuitively, given N points drawn from the distribution, the expectation can be approximated by the average over those points:

$$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$

The approximation becomes exact in the limit as N tends to infinity. This kind of average is used very often.
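Here is a small sketch of the sample-average approximation. It assumes x is uniform on [0, 1) and f(x) = x^2, for which the exact expectation is 1/3; the distribution, the function, and the seed are my own choices for illustration.

```python
import random


def square(x):
    """Example function f(x) = x^2."""
    return x * x


def sample_average(f, sampler, num_samples):
    """Approximate E[f] by averaging f over num_samples draws from sampler()."""
    return sum(f(sampler()) for _ in range(num_samples)) / num_samples


rng = random.Random(0)  # fixed seed so that runs are repeatable

# The estimate should move toward the exact value 1/3 as N grows.
for num_samples in (10, 1_000, 100_000):
    print(num_samples, sample_average(square, rng.random, num_samples))

estimate = sample_average(square, rng.random, 100_000)
```

With only 10 samples the estimate can be far off; with 100,000 it is typically within a few thousandths of 1/3.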

If f depends on several variables, we usually use a subscript to indicate which variable is being averaged over, for example:

$$\mathbb{E}_x[f(x, y)]$$

This denotes the expectation of f with respect to the distribution of x. Note that the result is a function of y. Similarly, we can define the conditional expectation:

$$\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x)$$

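Variance: The variance measures how strongly a function f(x) varies around its expectation. It is defined as:

$$\mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$$

If we consider the variable x itself, the variance of x is:

$$\mathrm{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$

Note: (the derivation is skipped in the book) This equation follows by expanding the definition of variance:

$$\mathrm{var}[x] = \mathbb{E}\big[(x - \mathbb{E}[x])^2\big] = \mathbb{E}\big[x^2 - 2x\,\mathbb{E}[x] + \mathbb{E}[x]^2\big] = \mathbb{E}[x^2] - 2\,\mathbb{E}[x]^2 + \mathbb{E}[x]^2 = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$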


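In addition, we define the covariance of two random variables:

$$\mathrm{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$$

It expresses the extent to which x and y vary together. If x and y are independent, the covariance is 0. We can see that the variance of a single variable is the special case of covariance with x = y.

If x and y are two vectors of random variables, both written as column vectors, the covariance is a matrix:

$$\mathrm{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^{\mathrm{T}} - \mathbb{E}[\mathbf{y}^{\mathrm{T}}])\big] = \mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^{\mathrm{T}}] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}^{\mathrm{T}}]$$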
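The scalar definitions of variance and covariance can be verified exactly on a small discrete example. The sketch below builds a hypothetical joint distribution as the product p(x) p(y), so x and y are independent by construction; the numbers and the helper names expectation and covariance are assumptions made for illustration only.

```python
from fractions import Fraction

# Hypothetical marginals; the joint is their product, so x and y are independent.
p_x = {0: Fraction(1, 4), 1: Fraction(3, 4)}
p_y = {-1: Fraction(1, 2), 2: Fraction(1, 2)}
joint = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}


def expectation(f, dist):
    """E[f] = sum over outcomes of p(outcome) * f(outcome)."""
    return sum(p * f(outcome) for outcome, p in dist.items())


def covariance(g, h, dist):
    """cov[g, h] = E[(g - E[g]) * (h - E[h])]."""
    mean_g = expectation(g, dist)
    mean_h = expectation(h, dist)
    return expectation(lambda o: (g(o) - mean_g) * (h(o) - mean_h), dist)


def get_x(outcome):
    return outcome[0]


def get_y(outcome):
    return outcome[1]


# Variance as the special case cov[x, x], and via E[x^2] - E[x]^2.
var_x = covariance(get_x, get_x, joint)
var_x_shortcut = expectation(lambda o: o[0] ** 2, joint) - expectation(get_x, joint) ** 2

# Covariance of independent variables should be exactly zero.
cov_xy = covariance(get_x, get_y, joint)

print(var_x, var_x_shortcut, cov_xy)  # 3/16 3/16 0
```

The first two numbers agree, which confirms the identity var[x] = E[x^2] - E[x]^2 on this example, and the zero covariance reflects the independence built into the joint distribution.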


That covers expectation and variance, two concepts that run through almost every area of machine learning. That's all for today's notes. Section 1.2 is really important, and this is only the first half. I will tidy up the second half and post it in a few days.


Writing this much took more than 2 hours (am I too slow?), mostly spent organizing the wording. Even though the formulas and figures are copied from the book, it still takes a lot of time. On the plus side, writing things down has deepened my understanding.

The second half of Section 1.2 covers Bayes' theorem and the Gaussian distribution, both of which are very important. Together, the material in Section 1.2 forms the foundation for learning probability and statistics, and I recommend understanding it well.
