This article is original work by leftnoteasy. It may be reproduced provided the source and this line are kept; for commercial use, please contact the author at wheeleast@gmail.com.

**I. Bayes' theorem**
**Bayes' theorem expresses, in mathematical form, a piece of common sense that everyone applies in daily life.**

The simplest theorems are often the best ones, such as the central limit theorem; such theorems frequently become the theoretical foundation of an entire field. Bayes' theorem is one of the most commonly used tools in machine learning algorithms.

I did not find much material on how Bayes' theorem was discovered, but I believe Thomas Bayes (1702-1761) arrived at this theorem, which has had a profound impact on later generations, through small problems in everyday life. I also suspect that when Bayes discovered it, he did not yet know how powerful it would be. Next, I will introduce Bayes' theorem with a small example:

**It is known that there are N apples and M pears. The probability that an apple is yellow is 20%, and the probability that a pear is yellow is 80%. If I see a yellow fruit in this pile, how likely is it to be a pear?**

Expressed in mathematical language: given P(apple) = N/(N + M), P(pear) = M/(N + M), P(yellow | apple) = 20%, and P(yellow | pear) = 80%, find P(pear | yellow).

To get this answer, we need to: 1) count the yellow fruits among all the fruits; 2) count the yellow pears.

For 1), we get P(yellow) * (N + M), where P(yellow) = P(apple) * P(yellow | apple) + P(pear) * P(yellow | pear).

For 2), we get P(yellow | pear) * M.

Dividing 2) by 1) gives: P(pear | yellow) = P(yellow | pear) * P(pear) / [P(apple) * P(yellow | apple) + P(pear) * P(yellow | pear)].

Simplifying: P(pear | yellow) = P(yellow, pear) / P(yellow). In plain words: given that a fruit is yellow, the probability that it is a pear, P(pear | yellow), is the probability of a yellow pear among all fruits, P(yellow, pear), divided by the probability that a fruit is yellow, P(yellow). The formula is that simple.

We can substitute A for pear and B for yellow. The formula then reads P(A | B) = P(A, B) / P(B), which can be rewritten as P(A, B) = P(A | B) * P(B). This is how Bayes' formula is derived.
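The derivation above can be checked numerically. A minimal sketch, assuming hypothetical counts of N = 30 apples and M = 10 pears (the counts are not given in the text, so these are illustration values):

```python
# Hypothetical counts chosen for illustration.
N, M = 30, 10                      # number of apples and pears
p_apple = N / (N + M)              # P(apple)
p_pear = M / (N + M)               # P(pear)
p_yellow_given_apple = 0.20        # P(yellow | apple)
p_yellow_given_pear = 0.80         # P(yellow | pear)

# Total probability of drawing a yellow fruit.
p_yellow = p_apple * p_yellow_given_apple + p_pear * p_yellow_given_pear

# Bayes' formula: P(pear | yellow) = P(yellow | pear) * P(pear) / P(yellow)
p_pear_given_yellow = p_yellow_given_pear * p_pear / p_yellow

print(round(p_pear_given_yellow, 3))   # → 0.571
```

Even though only a quarter of the fruits are pears, a yellow fruit is more likely a pear than not, because yellowness is much stronger evidence for pears.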

The outline of this article: first I describe a basic Bayesian learning framework that I have summarized; then I give a few simple examples to illustrate its modules; finally I walk through a more complex example, explained in terms of those modules.

**II. A Bayesian machine learning framework**

Every book on Bayesian learning has its own perspective and approach; some are vivid, others abrupt, and I have never seen an official statement of which modules Bayesian learning consists of. In my view, the following modules are necessary to understand Bayesian learning:

**1) Bayes' formula**

One major category of machine learning problems is classification: given data D, find the probability that it belongs to a given class (also called a hypothesis h, h ∈ {h0, h1, h2, ...}). That is, the result we want is:

P(h | D), which can be obtained as follows:

P(h, D) = P(h | D) * P(D) = P(D | h) * P(h), so P(h | D) = P(D | h) * P(h) / P(D). For all hypotheses under a given dataset, P(D) is the same, so it can be treated as a constant, giving P(h | D) ∝ P(D | h) * P(h). We often do not need the specific value of P(h | D), only the relative ordering of values such as P(h1 | D) and P(h2 | D). This is Bayes' formula as used in machine learning. We usually call P(h | D) the posterior probability of the model, i.e., the probability of the hypothesis given the data; P(h) the prior probability, i.e., the probability of the hypothesis in the hypothesis space; and P(D | h) the likelihood of the model.
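Because P(D) is identical for every hypothesis, hypotheses can be ranked by the unnormalized product P(D | h) * P(h) alone. A small sketch with made-up priors and likelihoods:

```python
# Made-up numbers for three hypotheses; only their products matter for ranking.
priors = {"h0": 0.7, "h1": 0.2, "h2": 0.1}          # P(h)
likelihoods = {"h0": 0.01, "h1": 0.30, "h2": 0.50}  # P(D | h)

# Unnormalized posterior P(D | h) * P(h).
unnorm = {h: likelihoods[h] * priors[h] for h in priors}
best = max(unnorm, key=unnorm.get)   # the MAP hypothesis

# Dividing by P(D) = sum over h recovers the true posterior,
# but the ranking is identical either way.
p_D = sum(unnorm.values())
posterior = {h: v / p_D for h, v in unnorm.items()}
print(best)
```

Note that h2 has the largest likelihood but h1 wins once the prior is taken into account, which is exactly the point of the formula.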

Likelihood is a relatively confusing quantity. It can be thought of as the probability of generating the observed data from a hypothesis, evaluated with the hypothesis known. In practical machine learning, many assumptions are often layered in, for example when translating English into French:

Which candidate French sentence is the most plausible? P(F = French sentence | E = English sentence) ∝ P(E | F) * P(F), where P(E | F) is the likelihood function. To make it clearer, write P(E | F), F ∈ {F1, F2, ...}: many different French sentences F could have produced the input English sentence E, and P(E | F) is the probability of generating E from one of those French sentences.

**The rest of this article covers some content that other articles do not mention, as well as some important questions in Bayesian learning that are easy to doubt, overlook, or leave unexplained.**

**2) Prior distribution estimation and likelihood function selection**

In the Bayesian method, the right-hand side of the formula has two parts: the prior probability and the likelihood function. The prior probability answers: in the hypothesis space, how probable is a given hypothesis? For example, suppose you see a furry animal on the street and ask: 1) what is the probability that it is a pug? 2) what is the probability that it is a Javan tiger?

Although the likelihood functions of both hypotheses are very close to 1 (unless the animal is sick), the prior probability of the Javan tiger is 0 because the Javan tiger is extinct; therefore P(Javan tiger | furry animal) is also 0.

**Prior probability distribution estimation**

When the observed variables are continuous, a prior distribution is often required to obtain the probability of a hypothesis in the hypothesis space, which cannot be read off a sparse dataset. For example, suppose there is a large, uniform metal disc, and we ask for the probability that, dropped from the air, it lands face up. The cost of this experiment is relatively high (the disc is large and heavy), so we can only run a limited number of trials. It may happen that the front comes up four times and the back once; if we computed the prior entirely from this dataset, a large deviation could result. However, since we know the disc is uniform, we can use that knowledge to assume P(x = front) = 0.5.

Sometimes we know the type of a distribution but not its parameters; we then need to estimate the parameters from the input data, and perhaps even correct the distribution to meet the needs of our algorithm. For example, suppose we know a variable X is uniformly distributed over some continuous interval. We observe it 1000 times and get 1, 1.12, 1.5, ..., 199.6, 200. Can we conclude the distribution is uniform on [1, 200]? If a new value is 0.995, should we say P(0.995) = 0? What about 200.15? We may need to adjust the probability distribution so that the density declines linearly for x < 1 and x > 200; the whole probability density function might then be a trapezoid, or a very small probability can be assigned to values outside the region. I will give some examples later.
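One way to realize the adjustment just described is to mix the uniform density with a small "escape" density over a wider interval, so values just outside [1, 200] get a small but nonzero probability. A sketch; the mixing weight `eps` and the tail width are assumptions chosen for illustration:

```python
def smoothed_density(x, lo=1.0, hi=200.0, eps=0.01, tail=1000.0):
    """Mixture: (1 - eps) * Uniform(lo, hi) + eps * Uniform(lo - tail, hi + tail).

    The narrow component models the observed range; the wide component
    leaves a little probability mass for values outside it.
    """
    density = 0.0
    if lo <= x <= hi:
        density += (1 - eps) / (hi - lo)
    if (lo - tail) <= x <= (hi + tail):
        density += eps / ((hi - lo) + 2 * tail)
    return density

# A value just below 1 no longer has zero density, but it is far
# less likely than a value inside the observed range.
print(smoothed_density(0.995) > 0)
print(smoothed_density(100.0) > smoothed_density(0.995))
```

A trapezoidal ramp, as mentioned in the text, would serve the same purpose; the mixture is just the shortest variant to write down.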

**Likelihood function selection**
For the same model, there may be different choices of likelihood function. One choice may be more precise but require searching a larger space; another may be rougher but faster. We need to select among likelihood functions when computing the posterior probability, and for some of them, smoothing techniques may be needed to minimize the impact of noise in the data, or of defects in the hypotheses, on the results.

**In my understanding, the Bayesian method estimates the posterior probability of a hypothesis given the data, transforming prior × likelihood into the posterior distribution. It is a process of transforming distributions.**

**3) Loss function**

Let x be the input data, y(x) the model's prediction, t the actual result for x, and L(t, y(x)) the loss function; E[L] denotes the expected loss when model y is used for prediction with L as the loss function. In general, the loss function is the most effective way to judge whether a model produces accurate results. The most common and simplest loss function is the squared loss, L(t, y(x)) = (y(x) − t)².
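As a small sketch, the squared loss and its empirical expectation over a test set can be written as:

```python
def squared_loss(t, y):
    """L(t, y(x)) = (y(x) - t)^2."""
    return (y - t) ** 2

def empirical_expected_loss(targets, predictions):
    """Average squared loss over a test set: an estimate of E[L]."""
    pairs = list(zip(targets, predictions))
    return sum(squared_loss(t, y) for t, y in pairs) / len(pairs)

# Toy numbers (made up): predictions off by 1 and by 2
# give a mean squared loss of (1 + 4) / 2 = 2.5.
print(empirical_expected_loss([3.0, 5.0], [4.0, 7.0]))   # → 2.5
```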

However, I have never understood why the square is used here rather than the absolute value. Is there a detailed explanation? :-P

**4) Model selection**

As mentioned above, both the likelihood function and the prior probability can be chosen in different ways. If we could construct a complete test set and an appropriate loss function, the final result would be definite and quantifiable, and we could easily compare the strengths and weaknesses of two models with different parameters and methods. In general, however, our test set is incomplete and our loss function is not so precise. So even when a model performs perfectly on the test set, we may still need to ask whether the training set is too similar to the test set, and whether the model is too complex, resulting in over-fitting (how over-fitting arises will be detailed later).

Model selection is essentially a balance between the complexity of the model and its accuracy. This article will give some examples of this.

**Example 1: Sequential probability estimation**

Note: this example is from PRML, section 2.1.1.

There are many ways to estimate a probability density; one of them is called sequential probability estimation.

This method is an incremental learning process: each time a sample is seen, the previously observed data serves as the prior; after the posterior is computed on the new data, that posterior becomes the prior for the next prediction.

The standard binomial distribution is: Bin(m | N, μ) = C(N, m) μ^m (1 − μ)^(N − m).

Because the binomial parameter μ would be estimated entirely from observed frequencies, and, as mentioned before, this can deviate badly when the number of trials is small, and moreover **we would obtain only a single point estimate of μ rather than a distribution over μ** — which is inconvenient for some machine learning methods — we introduce two parameters a and b, representing effective prior counts of x = 1 and x = 0 respectively, in order to soften the influence of the data on μ and to obtain a distribution over μ. These values shape the distribution of μ. The formula for the beta distribution is: Beta(μ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] μ^(a − 1) (1 − μ)^(b − 1).

The values of a and b affect the probability density function of μ as follows (picture from PRML):

While observing data, we can use the observations so far to update the current prior over μ at any time. We add two counts, m and l, to the beta distribution, representing the observed numbers of x = 1 and x = 0. (The earlier a and b are prior counts, not counts of current observations.)

Let a′ = a + m and b′ = b + l:

where a′ and b′ denote the new a and b after incorporating the observed counts. Substituting into the original formula gives:

We can use the posterior over μ after each observation as the prior over μ for the next observation, so that new data can be incorporated continuously; this is useful when results must be given in real time. However, the sequential method assumes the data are i.i.d. (independent and identically distributed): the data processed at each step must satisfy this assumption.
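The sequential update above can be sketched in a few lines: each observation increments one of the beta parameters, and the posterior after a batch becomes the prior for the next. The starting values a = b = 2 are an assumption (a weak prior centered on μ = 0.5, like the uniform-disc knowledge in the earlier example):

```python
# Weak prior Beta(2, 2): effective prior counts of one head and one tail
# beyond uniform, centered on mu = 0.5.
a, b = 2, 2

# Observe data one sample at a time; the posterior Beta(a, b) after each
# sample serves as the prior for the next.
for x in [1, 1, 1, 1, 0]:      # e.g. four fronts, one back
    if x == 1:
        a += 1                 # a accumulates observations of x = 1
    else:
        b += 1                 # b accumulates observations of x = 0

posterior_mean = a / (a + b)   # E[mu] under Beta(a, b)
print(posterior_mean)          # pulled toward 0.5 from the raw 4/5 frequency
```

The raw frequency 4/5 = 0.8 is shrunk toward 0.5 by the prior, exactly the correction the metal-disc example called for.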

**Example 2: Spelling correction**

The central idea of this example comes from "How to Write a Spelling Corrector"; refer to the original article if needed. This example mainly illustrates the influence of the prior distribution on the result.

The Bayesian formula for the spelling corrector is:

P(c | w) is the probability that the (wrongly typed) word w should be corrected to the word c; P(w | c) is the likelihood function, and here we simply treat the edit distance between the two words as a proxy for their likelihood; P(c) is the probability of word c in the overall document collection, i.e., the prior probability of c.

When checking a word's spelling, intuition says: if a word entered by the user does not appear in the dictionary, it should be corrected to the dictionary word closest to the input; if the word does appear in the dictionary but its frequency is very low, we may still suggest a nearby word that has a high frequency.

Estimating the prior probability P(c) is very important. There are generally two feasible approaches: use an authoritative word-frequency dictionary, or compute statistics over your own corpus (that is, the corpus being spell-checked). I suggest the latter, so that the word priors match the deployment environment. For example, if a vertical search site for games needs to correct users' queries, priors computed from a general-purpose corpus will not be suitable.
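The scheme above can be sketched in the spirit of the referenced spelling-corrector article: rank dictionary words c by prior frequency, discounted by edit distance from the input w. The tiny corpus and the particular discount factor are assumptions made for illustration:

```python
from collections import Counter

# A toy corpus standing in for the site's own documents (assumption).
corpus = "the game the game server the player".split()
prior = Counter(corpus)                     # word frequency as P(c)

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)]
         for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[m][n]

def correct(w):
    # Score = prior count discounted per edit (a crude stand-in likelihood:
    # the 0.1-per-edit factor is an arbitrary illustration choice).
    return max(prior, key=lambda c: prior[c] * 0.1 ** edit_distance(w, c))

print(correct("gmae"))   # → game
```

Here "the" has the highest prior, but "game" wins because its far smaller edit distance dominates, which is the prior/likelihood tradeoff the text describes.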

**Example 3: Occam's razor and model selection**
Consider the following figure (from MacKay's book):

Q: How many boxes are there behind a big tree?

There could of course be many answers: one box, two boxes, even n boxes are all possible (for example, a row of boxes behind the tree, lined up in a straight line so that we can only see the first one):

However, the most reasonable explanation is a single box. If there were two or more boxes behind the tree, why would the parts showing on either side of the trunk have exactly the same height and the same color? That would be quite a coincidence. If, given this picture, our model told us there were two boxes behind the tree, wouldn't its generalization ability be rather poor?

So in essence Occam's razor, or model selection, is also a mathematical expression of a common pattern in human reasoning: a process of simplifying the complex. **The beauty of mathematics: an ordinary and magical Bayesian method** argues that Occam's razor acts through the likelihood and has no influence on the prior distribution of the model. **I do not agree with this statement**: Occam's razor removes complex models, complex models are also uncommon and so have low prior probability, and the final result is that a model with a higher prior probability is selected.

**Example 4: Curve fitting**

(This example is from PRML)

Problem: given a series of points, **X** = {x1, x2, ..., xn}, **T** = {t1, t2, ..., tn}, find a model that fits these observations, so that for a new point x′ a prediction t′ can be given.

It is known that the given points are 10 samples generated from y = sin(2πx) plus normally distributed noise, as shown in the figure. For simplicity, we fit this curve with a polynomial:

To measure whether our fit is good, we introduce a loss function:

Minimizing the loss function, we plot the curves produced by polynomials of different orders:

As M increases, the curve becomes steeper; when M = 9, it fits the input sample points exactly but can no longer predict new sample points. Observing the polynomial coefficients:

We can see that as M (the order) increases, the coefficients also blow up enormously. To eliminate this effect and simplify the model, we add a penalty term to the loss function:

We multiply the squared L2 norm of w by a coefficient λ and add it to form the new loss function. This is **Occam's razor** at work, turning the original complex coefficients into simple ones (for a more quantitative analysis, see PRML section 1.1). If we want to consider how to choose the most appropriate order, we can also treat the order as part of the loss function; this is one form of model selection.
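A minimal sketch of the penalized fit, assuming 10 noisy samples of sin(2πx), M = 9, and a small λ (all of these are illustration choices): the regularized least-squares solution solves (ΦᵀΦ + λI) w = Φᵀt, and its coefficients stay far smaller than those of the unregularized fit.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * math.pi * x) + rng.normal(0, 0.2, size=x.shape)

M = 9                                        # polynomial order
lam = 1e-3                                   # penalty coefficient lambda
Phi = np.vander(x, M + 1, increasing=True)   # design matrix, columns x^0 .. x^M

# Regularized least squares: minimize ||Phi w - t||^2 + lam * ||w||^2.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

# Unregularized fit for comparison: interpolates the 10 points exactly,
# at the price of huge coefficients.
w_unreg, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.abs(w).max(), np.abs(w_unreg).max())
```

The penalized coefficients remain modest while the M = 9 interpolant's coefficients explode, which is the effect the coefficient table in the text illustrates.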

However, the problem is not yet fully solved. The model we have can only predict a single value: given a new x, it outputs a t, but it cannot describe the probability density of t. **The probability density function is very useful.** Suppose the task is changed: we are given n sets of points, each set representing a curve, plus a new point, and we ask which curve the new point most likely belongs to. If we only use the distance between the new point and these curves as the criterion, it is hard to get a convincing result. To obtain a distribution over t, we assume t follows a Gaussian with mean y(x) and variance 1/β:

In the earlier E(w) we added the L2 norm of w, which seems a bit abrupt: why add that particular term and not some other? With a Bayesian treatment we can obtain a more convincing justification. Take P(w) to be a Gaussian with mean 0 and precision α; under this distribution w has high density near 0. This is a prior over w, so that when the posterior is maximized, the smaller the magnitude of w, the larger the posterior.

This gives a new posterior probability:

Does this expression look familiar? Setting λ = α/β yields a result similar to the earlier loss function. With it we can not only compute the optimal fitting function but also obtain a probability distribution, which lays the foundation for many other machine learning tasks.

To repeat the point: much of machine learning resembles the curve fitting discussed here. Without any probability or statistics we can still obtain a solution, like our first curve-fitting solution, which also fits well; the only thing missing is a probability distribution, and with a probability distribution a great deal becomes possible, including classification and regression. In essence, the relationship between the beta distribution and the binomial distribution, or between the Dirichlet distribution and the multinomial distribution, mirrors the move from computing w directly in curve fitting to estimating w through a Gaussian prior: the beta and Dirichlet distributions provide a prior over μ, and with such a prior we can do Bayesian inference better.

**Postscript:**

I will end the article here. It took about four nights to write, and I would like to thank my girlfriend for her support. I also hope it helps me summarize some of my recent learning and test whether I can explain it clearly. Learning feels to me like climbing a mountain: sometimes I feel I am approaching the top, and the result leaves me both frustrated and excited; but on the whole the feeling of learning is a happy one, and I hope to share that happiness with everyone.

**References:**
**The beauty of mathematics: an ordinary and magical Bayesian method, pongba**

**Pattern Recognition and Machine Learning, Christopher Bishop**

**Wikipedia (assorted entries)**