Mathematical Foundations for Machine Learning


To learn machine learning, you must first master some mathematics; otherwise you will quickly get lost (the author was once in exactly this state). In particular, data distributions, maximum likelihood estimation (and the various methods for finding extrema), the bias-variance trade-off, feature selection, model selection, and mixture models are all especially important. Here I will walk through the relevant background (a lot of this probability knowledge tends to be forgotten right after the postgraduate entrance exam).

I. Overview of Probability Theory

1. Conditional Probability

Conditional probability is relatively simple: p(A | B) = p(A ∩ B) / p(B), the probability that A occurs given that B has occurred (assuming p(B) > 0).

2. Bayesian Probability

Bayes' theorem is something we already encounter in university probability courses. However, most university instructors simply present the formula and then have us plug numbers into it to answer exam questions. This is the utilitarian side of university education: teaching only for application, never probing the nature of the problem. Bayesian methods are widely used in machine learning, which requires a thorough understanding of Bayes' theorem.

Thomas Bayes was remarkable, and also rather lucky. Here is an introduction adapted from Wikipedia:

The so-called Bayesian method originated from an essay he wrote during his lifetime to solve the problem of "inverse probability"; the essay was published by a friend after his death. Before Bayes, people already knew how to compute "forward probability": for example, "Suppose there are N white balls and M black balls in a bag; if you reach in and draw one, how likely is it to be black?" A natural question is the reverse: "If we do not know the proportion of black and white balls in the bag beforehand, and we draw one or more balls with our eyes closed and observe their colors, what can we infer about the proportion of black and white balls in the bag?" This is the so-called inverse probability problem.

Here I will first give the general form of Bayes' formula:

    p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}

In fact, Bayes' essay at the time simply gave a direct solution to this particular problem; it is not even clear whether he realized how profound the idea was. Nevertheless, Bayesian methods later swept through probability theory and spread to a wide range of problem domains. Wherever probabilistic prediction is needed, you can find the shadow of Bayesian methods, and Bayesian inference is one of the core methods of machine learning. The deep reason is that the real world itself is uncertain and human powers of observation are limited (otherwise a large part of science would be unnecessary; if we could directly observe the motion of electrons, would we still need to argue over atomic models?). What we observe in daily life is only the surface of things: we can see the colors of the balls drawn from the bag, but we cannot look directly inside the bag. At that point we must make a guess (a more rigorous term is "hypothesis"; "guess" is used here for readability). The guess is of course uncertain (there may be many, even infinitely many, hypotheses consistent with the current observations), but it is by no means a blind guess.
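
To make the inverse-probability idea concrete, here is a minimal Python sketch (the candidate proportions, the uniform prior, and the observed draws are all invented for this illustration) that applies Bayes' formula to update our belief about the proportion of black balls after observing a few draws:

    # Hypotheses: possible proportions of black balls in the bag.
    hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
    prior = [1.0 / len(hypotheses)] * len(hypotheses)   # uniform prior

    # Observation: three draws with replacement came out black, black, white.
    draws = ["black", "black", "white"]

    def likelihood(p_black, draw):
        return p_black if draw == "black" else 1.0 - p_black

    # Bayes' formula: posterior is proportional to prior times likelihood, then normalize.
    posterior = prior[:]
    for d in draws:
        posterior = [pst * likelihood(h, d) for pst, h in zip(posterior, hypotheses)]
        z = sum(posterior)
        posterior = [pst / z for pst in posterior]

    for h, pst in zip(hypotheses, posterior):
        print(f"P(proportion of black = {h}) = {pst:.3f}")

After seeing two black balls and one white ball, the posterior shifts toward the hypotheses with a higher proportion of black balls, which is exactly the "inverse" inference described above.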

... To be continued

Beta distribution

For the Bernoulli distribution, the maximum likelihood estimate of the parameter is simply the fraction of successes among all trials, as we learned in the earlier probability material. However, this estimate overfits easily when the number of trials is very small: you cannot toss a coin twice, see the national-emblem side come up both times, and conclude that every future toss will also land emblem-side up...
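
As a quick illustration, here is a minimal NumPy sketch (the fair-coin value true_mu and the seed are assumptions for this example) showing how the maximum likelihood estimate m/N can collapse to 0 or 1 on a sample of just two tosses:

    import numpy as np

    rng = np.random.default_rng(seed=1)
    true_mu = 0.5                                # a fair coin
    flips = rng.binomial(1, true_mu, size=2)     # only two tosses
    mu_ml = flips.mean()                         # MLE: fraction of heads, m / N
    print("tosses:", flips, "  MLE of mu:", mu_ml)  # can easily be 0.0 or 1.0

With only two tosses, the printed estimate is quite likely to be exactly 0.0 or 1.0, which is the overfitting problem described above.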

So we adopt the Bayesian approach and first introduce a prior distribution over the parameter μ. This prior should have an intuitive interpretation and useful analytical properties. Note that the likelihood function is a product of factors of the form μ^x (1 − μ)^(1 − x); this likelihood was introduced in the basics section earlier in this book. If N trials are performed, the likelihood function can be written as:

    p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n} = \mu^{m} (1-\mu)^{l} \qquad (2.9)

where m is the number of trials with x = 1 and l = N − m.

Therefore, if we choose a prior that is proportional to powers of μ and (1 − μ), then the posterior distribution, which by Bayes' theorem is proportional to the product of the prior and the likelihood function, will have the same functional form as the prior. This property is called conjugacy. We therefore choose as prior the beta distribution:

    \mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \mu^{a-1} (1-\mu)^{b-1} \qquad (2.13)

The Gamma function coefficient ensures that the beta distribution is normalized, i.e.:

    \int_{0}^{1} \mathrm{Beta}(\mu \mid a, b)\, d\mu = 1 \qquad (2.14)

The proof of this is not entirely clear to me, so I omit it. The mean and variance of the beta distribution are:

    \mathbb{E}[\mu] = \frac{a}{a+b} \qquad (2.15)

    \mathrm{var}[\mu] = \frac{ab}{(a+b)^{2}(a+b+1)} \qquad (2.16)

The parameters a and b are often referred to as hyperparameters, because they control the distribution of the parameter μ. The figure in the original text (not reproduced here) shows beta distribution curves for several different hyperparameter settings.
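
As a sanity check, the following sketch uses scipy.stats.beta (the hyperparameters a = 2, b = 3 are chosen arbitrarily) to verify numerically that the density integrates to 1 and that its mean and variance match (2.15) and (2.16):

    import numpy as np
    from scipy.stats import beta
    from scipy.integrate import quad

    a, b = 2.0, 3.0                                    # arbitrary hyperparameters

    # Normalization: the density should integrate to 1 over [0, 1].
    total, _ = quad(lambda mu: beta.pdf(mu, a, b), 0.0, 1.0)
    print("integral of Beta(mu|a,b):", total)          # approximately 1.0

    # Mean and variance: scipy's values vs. the closed forms (2.15) and (2.16).
    print("mean:", beta.mean(a, b), "vs a/(a+b) =", a / (a + b))
    print("var :", beta.var(a, b), "vs ab/((a+b)^2 (a+b+1)) =",
          a * b / ((a + b) ** 2 * (a + b + 1)))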

The posterior distribution of μ can now be obtained by multiplying the beta prior (2.13) by the likelihood function (2.9) and normalizing. Keeping only the factors that depend on μ, the posterior has the form:

    p(\mu \mid m, l, a, b) \propto \mu^{m+a-1} (1-\mu)^{l+b-1} \qquad (2.17)
where l = N − m, i.e., the number of "tails" in the coin-tossing example. We see that (2.17) has the same functional dependence on μ as the prior, reflecting the conjugacy of the prior with respect to the likelihood function. Indeed, the posterior is simply another beta distribution; comparing with (2.13), we can write down its normalization coefficient (so that it integrates to 1) and obtain:

    p(\mu \mid m, l, a, b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\,\Gamma(l+b)}\, \mu^{m+a-1} (1-\mu)^{l+b-1} \qquad (2.18)

We can see that, in going from the prior (2.13) to the posterior (2.18) (the latter being the former multiplied by the likelihood function), observing m outcomes of x = 1 and l outcomes of x = 0 has simply increased a by m and b by l. This gives an intuitive interpretation of the hyperparameters a and b in the prior: they act as effective numbers of prior observations of x = 1 and x = 0, respectively. Note that a and b need not be integers. Furthermore, the posterior distribution can serve as the prior for the next batch of observed data: taking observations one at a time (or a few at a time), we update the current posterior by multiplying it by the likelihood of the newly observed data and renormalizing. At each step the posterior remains a beta distribution whose parameters a and b count the total (prior plus actually observed) numbers of x = 1 and x = 0 outcomes; a new observation of x = 1 increments a by one, and a new observation of x = 0 increments b by one. Figure 2.3 in the original text illustrates one step of this update.

This sequential style of learning arises naturally from the Bayesian viewpoint. It depends only on the assumption that the data are independent, and is independent of the particular choice of prior and likelihood function. Sequential learning uses only one or a few observations at a time and discards them before the next observations arrive. In practice, this approach can be used to learn from a steady data stream; since it does not require loading the entire dataset into memory at once, it is well suited to large-scale learning tasks. Maximum likelihood estimation can also be cast into a sequential learning framework.
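
Here is a minimal sketch of the sequential update described above (the prior Beta(2, 2) and the simulated data stream are assumptions for the example): each new observation of x = 1 increments a, each observation of x = 0 increments b, and only the two counts need to be kept in memory:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = 2.0, 2.0                         # prior Beta(a, b), chosen arbitrarily
    stream = rng.binomial(1, 0.7, size=20)  # simulated data stream, true mu = 0.7

    for x in stream:                        # process one observation at a time
        if x == 1:
            a += 1                          # observing x = 1 increments a
        else:
            b += 1                          # observing x = 0 increments b

    print("posterior: Beta(%.0f, %.0f)" % (a, b))
    print("posterior mean of mu:", a / (a + b))

Processing the stream one observation at a time yields exactly the same Beta(a + m, b + l) posterior as a single batch update, which is why this approach suits streaming and large-scale data.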

If our goal is to predict, as well as possible, the outcome of the next trial, we must evaluate the predictive distribution of x given the observed data D. From the sum and product rules of probability, we obtain:

    p(x=1 \mid \mathcal{D}) = \int_{0}^{1} p(x=1 \mid \mu)\, p(\mu \mid \mathcal{D})\, d\mu = \int_{0}^{1} \mu\, p(\mu \mid \mathcal{D})\, d\mu = \mathbb{E}[\mu \mid \mathcal{D}] \qquad (2.19)

Using the posterior (2.18) together with the expression (2.15) for the mean of the beta distribution, we obtain:

    p(x=1 \mid \mathcal{D}) = \frac{m+a}{m+a+l+b} \qquad (2.20)

This has a natural interpretation: as with the maximum likelihood estimate, it is the fraction of observations corresponding to x = 1, but now counting both the actual observations and the effective prior observations. In the limit of an infinitely large dataset, m, l → ∞, the result (2.20) reduces to the maximum likelihood estimate. It is a very general property that the Bayesian and maximum likelihood results agree as the dataset size tends to infinity. For a finite dataset, the posterior mean of μ always lies between the prior mean and the maximum likelihood estimate.
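
The following sketch (the prior a = b = 2, the true value of μ, and the sample sizes are all assumptions) compares the Bayesian predictive probability (2.20) with the maximum likelihood estimate, showing how the two converge as the dataset grows:

    import numpy as np

    rng = np.random.default_rng(3)
    a, b = 2.0, 2.0                             # prior hyperparameters (assumed)
    true_mu = 0.7

    for N in (5, 50, 5000):
        x = rng.binomial(1, true_mu, size=N)
        m, l = x.sum(), N - x.sum()
        mle = m / N                             # maximum likelihood estimate
        bayes = (m + a) / (m + a + l + b)       # predictive p(x=1|D), eq. (2.20)
        print(f"N={N:5d}  MLE={mle:.3f}  Bayes={bayes:.3f}")

With a small sample the two estimates differ noticeably (the Bayesian one is pulled toward the prior mean), while for the largest sample they are nearly identical.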

 

As shown in Figure 2.2 of the original text, the posterior distribution becomes more sharply peaked as the number of observations increases. This can also be seen from (2.16): the variance of the beta distribution tends to zero as a → ∞ or b → ∞. In fact, we may ask whether Bayesian learning has this property in general, namely that the more data we observe, the smaller the uncertainty of the posterior distribution.

To answer this question, we can take a frequentist view of Bayesian learning and show that, on average, such a property does hold. Consider a general Bayesian inference problem in which a parameter θ and an observed dataset D are described by the joint distribution p(θ, D). The following result holds:

    \mathbb{E}_{\theta}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}_{\theta}[\theta \mid \mathcal{D}]\big] \qquad (2.21)

where:

    \mathbb{E}_{\theta}[\theta] \equiv \int p(\theta)\, \theta\, d\theta

    \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}_{\theta}[\theta \mid \mathcal{D}]\big] \equiv \int \left\{ \int \theta\, p(\theta \mid \mathcal{D})\, d\theta \right\} p(\mathcal{D})\, d\mathcal{D}

The result (2.21) says that the posterior mean of θ, averaged over the distribution that generates the data, is equal to the prior mean of θ. Similarly, one can show that:

    \mathrm{var}_{\theta}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\mathrm{var}_{\theta}[\theta \mid \mathcal{D}]\big] + \mathrm{var}_{\mathcal{D}}\big[\mathbb{E}_{\theta}[\theta \mid \mathcal{D}]\big] \qquad (2.24)

The left-hand side of this equation is the prior variance of θ. On the right-hand side, the first term is the average posterior variance of θ, and the second term is the variance of the posterior mean. Since variances are non-negative, this shows that, on average, the posterior variance of θ is smaller than the prior variance, and the reduction is greater when the variance of the posterior mean is greater. Note that this result holds only on average: for a particular dataset, the posterior variance can be larger than the prior variance.
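
To see (2.21) and (2.24) in action, here is a Monte Carlo sketch for the beta-Bernoulli model (the prior Beta(2, 2), the number of trials per dataset, and the number of simulated datasets are assumptions): θ is drawn from the prior, a dataset is generated from it, and the prior moments are compared with the averaged posterior moments:

    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(42)
    a, b = 2.0, 2.0                  # prior Beta(a, b), assumed for the example
    N = 10                           # Bernoulli trials per dataset
    n_datasets = 200_000             # number of simulated datasets

    theta = rng.beta(a, b, size=n_datasets)      # draw theta from the prior
    m = rng.binomial(N, theta)                   # successes in each dataset
    a_post, b_post = a + m, b + (N - m)          # conjugate posterior Beta(a+m, b+l)

    post_mean = a_post / (a_post + b_post)       # E[theta | D] for each dataset
    post_var = beta.var(a_post, b_post)          # var[theta | D] for each dataset

    print("prior mean            :", a / (a + b))
    print("E_D[ E[theta|D] ]     :", post_mean.mean())                   # matches (2.21)
    print("prior variance        :", beta.var(a, b))
    print("E_D[var] + var_D[mean]:", post_var.mean() + post_mean.var())  # matches (2.24)

On average the posterior variance comes out smaller than the prior variance, even though for any individual dataset it can be larger.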
