"Mathematics in machine learning" probability distribution of two-yuan discrete random variables under Bayesian framework


Introduction

I feel that approaching machine learning from a mathematical perspective is the only solid way to get started. Michael I. Jordan defines machine learning as "a field that bridges computation and statistics, with ties to information theory, signal processing, algorithms, control theory and optimization theory." For students of machine learning, then, the combination of computer science and statistical theory is the right way forward. The books on the market that boast of skipping the mathematical background and only showing how to use the algorithms cater to those chasing a quick win; they do capture how impetuous people have become around a fashionable concept.
Of course, watching other people's impatience, your own heart grows impatient too.
I will still take the road step by step! Otherwise I am just a fisherman drifting with the tide, chasing whatever school of fish passes by; without a foundation of my own, once the boat capsizes, nothing is left.
Many teachers in schools really do bluff their students: they may not have a very solid mathematical foundation themselves, and so they cannot put students on the right path. At least as a student who sat through such classes, that is how it felt to me. The result is the impression that each course stands alone in its own area and is very isolated. From some foreign textbooks one can see that machine learning is in fact the offspring of many disciplines, closely connected to theory from many fields of engineering, so that at least we beginners can trace where ideas come from rather than feeling they sprang out of a crack in a stone.

The probability distributions introduced in the next few articles are building blocks for more complex models. One important application of these distributions is density estimation: building a model from a limited set of observed data and then inferring the probability distribution that the samples of the random variable follow. (Only now do I understand what the parameter estimation taught in the undergraduate probability and statistics course was actually for.)

Binary Variables

Let us first consider a binary random variable x ∈ {0, 1}.

Bernoulli Distribution

The Bernoulli distribution (also known as the two-point distribution or 0-1 distribution) is a discrete probability distribution, named after the Swiss scientist Jacob Bernoulli. If a Bernoulli trial succeeds, the Bernoulli random variable takes the value 1; if the trial fails, it takes the value 0. Writing μ for the probability of success, the distribution is

$$\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}, \qquad x \in \{0, 1\},$$

with mean $\mathbb{E}[x] = \mu$ and variance $\mathrm{var}[x] = \mu(1-\mu)$.
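As a concrete illustration, here is a minimal sketch in Python of evaluating this pmf and sampling from it; the helper names are my own, not from any particular library.

```python
import random

def bernoulli_pmf(x, mu):
    """Bern(x | mu) = mu^x * (1 - mu)^(1 - x), for x in {0, 1}."""
    return mu**x * (1 - mu)**(1 - x)

def bernoulli_samples(mu, n, seed=0):
    """Draw n samples: 1 with probability mu, otherwise 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < mu else 0 for _ in range(n)]

xs = bernoulli_samples(mu=0.3, n=100_000)
print(bernoulli_pmf(1, 0.3))   # 0.3
print(sum(xs) / len(xs))       # close to 0.3
```
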
Maximum Likelihood Estimation

Now suppose we are given a set of observations D = {x₁, ..., x_N}. We estimate the parameter μ (the probability that the random variable takes the value 1) by constructing the likelihood function

$$p(D \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}.$$

Maximizing the (log-)likelihood gives

$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{m}{N},$$

where m is the number of observations with x = 1.
To give an example: if we make three observations and all three have x = 1, then μ_ML = 1, which says that every future observation should also be x = 1. By common sense this is clearly unreasonable; in fact it is overfitting caused by a small dataset. What we explain next is how to understand this problem from the perspective of Bayesian theory.
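The overfitting is easy to reproduce; a small sketch with a hand-rolled helper:

```python
def mle_mu(data):
    """Maximum likelihood estimate of mu: the fraction of 1s in the data."""
    return sum(data) / len(data)

print(mle_mu([1, 1, 1]))  # 1.0 -- predicts x = 1 forever: overfitting
```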

Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in N independent yes/no trials, where each trial succeeds with probability μ. A single success/failure trial of this kind is exactly a Bernoulli trial; indeed, when N = 1 the binomial distribution reduces to the Bernoulli distribution.
The binomial distribution is defined as

$$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^{m} (1-\mu)^{N-m},$$

where m is the number of successes.
The expectation and variance of the binomial distribution are

$$\mathbb{E}[m] = N\mu, \qquad \mathrm{var}[m] = N\mu(1-\mu).$$
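A short sketch checking these formulas numerically, using only the Python standard library:

```python
from math import comb

def binom_pmf(m, N, mu):
    """Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.3
mean = sum(m * binom_pmf(m, N, mu) for m in range(N + 1))
var = sum((m - mean)**2 * binom_pmf(m, N, mu) for m in range(N + 1))
print(mean, var)  # about 3.0 and 2.1, i.e. N*mu and N*mu*(1 - mu)
```
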
Beta Distribution

To counter the overfitting that maximum likelihood estimation suffers on small datasets, we follow the Bayesian approach and introduce a prior distribution over the parameter μ. A natural choice is the beta distribution

$$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \mu^{a-1} (1-\mu)^{b-1},$$

whose mean and variance are $\mathbb{E}[\mu] = a/(a+b)$ and $\mathrm{var}[\mu] = ab/\big((a+b)^2(a+b+1)\big)$.
Here a and b are called hyperparameters, because they control the distribution of the parameter μ; they are not required to be integers.
The following image shows the effects of different parameters on the distribution:


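To see these effects numerically, here is a minimal sketch of the beta density written directly from the definition (scipy.stats.beta.pdf would serve equally well):

```python
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) density on (0, 1)."""
    coef = gamma(a + b) / (gamma(a) * gamma(b))
    return coef * mu**(a - 1) * (1 - mu)**(b - 1)

# Larger a pushes mass toward mu = 1, larger b toward mu = 0;
# a = b = 1 is the uniform distribution.
for a, b in [(1, 1), (2, 2), (8, 4), (0.5, 0.5)]:
    print((a, b), [round(beta_pdf(m, a, b), 3) for m in (0.25, 0.5, 0.75)])
```
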
Prior Probability

In Bayesian statistics, the prior probability distribution of an uncertain quantity p expresses our uncertainty about p before the observed data are taken into account. It is meant to describe our uncertainty about the quantity, not any randomness of the quantity itself. The uncertain quantity may be a parameter or a latent variable.
When applying Bayes' theorem, we obtain the posterior probability distribution by multiplying the prior by the likelihood function and normalizing; the result is the conditional distribution of the uncertain quantity given the data.
A prior is often a subjective judgment, and to make the posterior convenient to compute, a conjugate prior is sometimes chosen. If the posterior and the prior belong to the same family of distributions, they are said to be conjugate distributions, and the prior is the conjugate prior for that likelihood function.

Conjugate Prior

To make the prior and the posterior take the same form, we define: if a prior distribution and a likelihood function are such that the resulting posterior has the same form as the prior, then the prior is said to be conjugate to the likelihood function. So conjugacy is a relation between the prior distribution and the likelihood function.
The value of a conjugate prior is that it makes Bayesian inference convenient. In sequential Bayesian inference, for example, a posterior is computed after each observation arrives; because the prior is conjugate, the posterior has the same form as the original prior, so it can serve as the new prior for the next observation, and the iteration continues.

Posterior Distribution

The posterior distribution of the parameter μ is obtained by multiplying its beta prior by the binomial likelihood function.
The posterior distribution has the following form:

$$p(\mu \mid m, l, a, b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\,\Gamma(l+b)}\, \mu^{m+a-1} (1-\mu)^{l+b-1},$$

where m is the number of observations with x = 1 and l = N − m is the number with x = 0.
We can see that the posterior has the same form as the prior, which is exactly the conjugacy of the prior with respect to the likelihood function. The posterior is again a beta distribution, so we can treat it as a new prior: when a new batch of data arrives, we update again and obtain a new posterior.
This sequential approach consumes observations in small batches, and once they have been absorbed into the posterior, the old observations can be discarded.
It is therefore well suited to streaming data and to real-time learning scenarios where predictions must be made before all of the data has been seen, since it never requires the whole dataset to be loaded into memory at once.
The figure below depicts one step of sequential Bayesian inference: the prior is a beta distribution with parameters a = 2, b = 2; the likelihood function corresponds to a single observation x = 1 (so N = m = 1); and the resulting posterior is a beta distribution with parameters a = 3, b = 2.
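That update is one line of arithmetic; a minimal sketch (the helper name update_beta is mine):

```python
def update_beta(a, b, batch):
    """Conjugate update: each x = 1 increments a, each x = 0 increments b."""
    m = sum(batch)           # successes in this batch
    l = len(batch) - m       # failures in this batch
    return a + m, b + l

a, b = 2, 2                       # prior Beta(2, 2)
a, b = update_beta(a, b, [1])     # observe a single x = 1
print(a, b)                       # (3, 2) -- the posterior in the figure
a, b = update_beta(a, b, [0, 1])  # a later batch updates the new prior
print(a, b)                       # (4, 3)
```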


Predictive Distribution

What remains is to evaluate the predictive distribution of x given the observed dataset D:

$$p(x=1 \mid D) = \int_0^1 p(x=1 \mid \mu)\, p(\mu \mid D)\, d\mu = \mathbb{E}[\mu \mid D] = \frac{m+a}{m+a+l+b}.$$
From this we can see that as the data grows, with m and l tending to infinity, the predictive distribution converges to the maximum likelihood solution m/N. For a finite dataset, the posterior mean of μ always lies between the prior mean a/(a+b) and the maximum likelihood estimate m/N.
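Returning to the earlier three-ones example, a short sketch of how the prior tempers the maximum likelihood answer:

```python
def predictive_p1(m, l, a, b):
    """p(x = 1 | D): the posterior mean (m + a) / (m + a + l + b)."""
    return (m + a) / (m + a + l + b)

# Three observations, all x = 1, with a Beta(2, 2) prior:
print(predictive_p1(m=3, l=0, a=2, b=2))        # ~0.714, not the MLE's 1.0
# With much more data the prior washes out:
print(predictive_p1(m=3000, l=1500, a=2, b=2))  # ~0.667, close to m/N
```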

Summary

As we can see, the posterior distribution becomes an increasingly sharp peak as the observed data accumulate. This shows up in the variance of the beta distribution: as a and b tend to infinity, the variance tends to 0. At a broader level, the more data we observe, the more the uncertainty expressed by the posterior distribution steadily decreases.
For many prior distributions it can be shown that, as the amount of data increases, the posterior becomes ever sharper and finally collapses to a Dirac delta function; in this limit the Bayesian and frequentist approaches agree.

Resources

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Wikipedia: Beta-binomial distribution

When reposting, please credit the author Jason Ding and the source:
GitHub home page (http://jasonding1354.github.io/)
CSDN Blog (http://blog.csdn.net/jasonding1354)
Jane Book homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)

"Mathematics in machine learning" probability distribution of two-yuan discrete random variables under Bayesian framework
