This comes down to a problem in mathematical probability.
Binary variable distribution:
The Bernoulli distribution is the 0-1 distribution (for example a coin toss, where μ is the probability of landing heads up).
Then the probability distribution of a single coin toss is:

p(x | μ) = μ^x (1 − μ)^(1−x),  x ∈ {0, 1},  so p(x = 1 | μ) = μ
Suppose the training data is:

D = {x_1, x_2, ..., x_N}
So, based on maximum likelihood estimation (MLE), we want to find μ:

μ_ML = argmax_μ p(D | μ)
The derivation of the estimate goes as follows. The likelihood of the data is

p(D | μ) = ∏_{n=1}^{N} p(x_n | μ) = ∏_{n=1}^{N} μ^{x_n} (1 − μ)^{1 − x_n}

so the log-likelihood is

ln p(D | μ) = Σ_{n=1}^{N} [ x_n ln μ + (1 − x_n) ln(1 − μ) ]

Setting its derivative with respect to μ to zero, we can find:

μ_ML = (1/N) Σ_{n=1}^{N} x_n = m / N

where m is the number of heads observed.
The above derivation is the maximum likelihood estimate, and we can see that μ_ML is simply the number of heads divided by the total number of coin tosses. But the maximum likelihood estimate has its limitations: when the training sample is small, it can cause overfitting. For example, if we toss a coin 10 times and it lands heads 8 times, then according to MLE the value of μ should be 8/10 (this is the frequentist point of view). How do we solve this problem?
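As a quick illustration of the frequency estimate, here is a minimal sketch in Python; the data is a made-up example matching the 8-out-of-10 scenario, not from the original:

```python
import numpy as np

# Hypothetical coin-toss data: 1 = heads, 0 = tails (8 heads out of 10 tosses).
x = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# Maximum likelihood estimate: mu_ML = m / N (heads count / total tosses).
mu_ml = x.mean()
print(mu_ml)  # 0.8 -- the raw frequency, which overfits when N is small
```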
At this point, we need to turn to Bayesian theory. In the Bayesian view, μ is not a fixed value; μ itself follows a distribution, so we assume μ has a prior distribution p(μ).
But how do we choose this prior distribution p(μ)?
We know the likelihood function has the form

p(D | μ) ∝ μ^m (1 − μ)^l

where m and l are the numbers of heads and tails. So we want a prior distribution with a similar functional form. Why? Because the posterior probability ∝ prior probability × likelihood function: if the chosen prior has the same structure as the likelihood function, the resulting posterior will also have that structure, which makes the later calculations simple.
Conjugacy: if the posterior distribution p(θ|x) belongs to the same distribution family as the prior distribution p(θ), the two are called conjugate distributions (and the prior is said to be conjugate to the likelihood).
So we assume the prior distribution of μ also has the form μ^(a−1) (1 − μ)^(b−1). There happens to be a distribution in mathematics with exactly this form, the Beta distribution:

Beta(μ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] μ^(a−1) (1 − μ)^(b−1)
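To make the Beta density concrete, here is a minimal check with scipy; the hyperparameters a = b = 2 are an arbitrary choice for illustration:

```python
from scipy.stats import beta

# Beta(a, b) density is proportional to mu^(a-1) * (1 - mu)^(b-1).
a, b = 2.0, 2.0              # hypothetical hyperparameters
print(beta.pdf(0.5, a, b))   # 1.5, the peak value of the Beta(2, 2) density
print(beta.mean(a, b))       # 0.5, i.e. a / (a + b)
```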
Now let's say we toss the coin and get m heads and l tails, a total of m + l = N trials. Then the posterior distribution of μ is:

p(μ | m, l, a, b) ∝ p(D | μ) Beta(μ | a, b) ∝ μ^(m+a−1) (1 − μ)^(l+b−1)

which, once normalized, is Beta(μ | m + a, l + b).
This posterior is still in the same family as the prior distribution (a conjugate distribution).
Suppose we want to predict the next experimental result, i.e. given D we want the predictive distribution of the next toss:

p(x = 1 | D) = ∫ p(x = 1 | μ) p(μ | D) dμ = E[μ | D] = (m + a) / (m + a + l + b)
We can see that as the number of observations N = m + l grows infinitely large, this estimate approaches the maximum likelihood estimate m/N.
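As a minimal sketch of this Beta-Bernoulli update (the prior hyperparameters a = b = 2 are an arbitrary assumption, not from the original):

```python
# Beta-Bernoulli conjugate update: prior Beta(a, b), observed m heads, l tails.
a, b = 2.0, 2.0   # hypothetical prior hyperparameters
m, l = 8, 2       # the 8-heads-out-of-10 example from above

# Posterior is Beta(m + a, l + b); the predictive p(x=1 | D) is its mean.
p_heads = (m + a) / (m + a + l + b)
print(p_heads)    # 10/14 ≈ 0.714, pulled from 0.8 toward the prior mean 0.5
```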
Multivariate variable distribution:
Most of the time a variable has more than two possible values; it can have many, and the estimation process is in fact similar. Suppose x is a K-dimensional vector in which exactly one element x_k = 1 and the others equal 0.
For example, rolling a die with a total of 6 faces: if face 2 comes up, then x_2 = 1 and x = (0, 1, 0, 0, 0, 0).
Then, letting the probability that x_k = 1 be μ_k, the distribution of x is:

p(x | μ) = ∏_{k=1}^{K} μ_k^{x_k},  where μ_k ≥ 0 and Σ_k μ_k = 1
Consider N independent observations D = {x_1, x_2, ..., x_N}; the corresponding likelihood function is:

p(D | μ) = ∏_{n=1}^{N} ∏_{k=1}^{K} μ_k^{x_nk} = ∏_{k=1}^{K} μ_k^{m_k}
Here m_k is simply the number of times x_k = 1 appears across all the trials. Carrying out the maximum likelihood estimation (maximizing subject to Σ_k μ_k = 1, using a Lagrange multiplier), we conclude that:

μ_k^{ML} = m_k / N
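A small sketch of this multivariate MLE, using a hypothetical set of one-hot die rolls:

```python
import numpy as np

# Hypothetical one-hot observations: N = 5 rolls of a 6-faced die.
X = np.array([[0, 1, 0, 0, 0, 0],   # face 2
              [0, 1, 0, 0, 0, 0],   # face 2
              [0, 0, 0, 1, 0, 0],   # face 4
              [1, 0, 0, 0, 0, 0],   # face 1
              [0, 1, 0, 0, 0, 0]])  # face 2

m = X.sum(axis=0)        # m_k: number of trials with x_k = 1
mu_ml = m / X.shape[0]   # mu_k(ML) = m_k / N
print(mu_ml)             # [0.2 0.6 0.  0.2 0.  0. ] -- unseen faces get probability 0
```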
Similarly, to avoid the overfitting problem caused by small data volumes, we also assume a prior distribution for the μ_k:
Considering the form of the likelihood ∏_k μ_k^{m_k} for the multivariate variable, we choose its conjugate distribution, the Dirichlet distribution, as the prior:

Dir(μ | α) = [Γ(α_0) / (Γ(α_1) ··· Γ(α_K))] ∏_{k=1}^{K} μ_k^{α_k − 1},  where α_0 = Σ_k α_k
Then the posterior distribution ∝ likelihood function × prior distribution:

p(μ | D, α) ∝ p(D | μ) Dir(μ | α) ∝ ∏_{k=1}^{K} μ_k^{m_k + α_k − 1}

which, once normalized, is Dir(μ | α + m) with counts m = (m_1, ..., m_K).
This posterior is still in the same family as the prior distribution (a conjugate distribution).
Suppose we want to predict the next experimental result, i.e. given D we want the predictive distribution:

p(x_k = 1 | D) = ∫ p(x_k = 1 | μ) p(μ | D) dμ = E[μ_k | D]
And because, for a Dirichlet distribution:

E[μ_k] = α_k / α_0
the predictive distribution for class k is:

p(x_k = 1 | D) = (m_k + α_k) / (N + α_0)
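And the corresponding Dirichlet update as a sketch, assuming a symmetric uniform prior α_k = 1 (an arbitrary choice) and reusing the hypothetical counts from the die-roll sketch above:

```python
import numpy as np

alpha = np.ones(6)                 # hypothetical uniform prior Dir(1, ..., 1)
m = np.array([1, 3, 0, 1, 0, 0])   # counts m_k from the N = 5 rolls above

# Posterior is Dir(alpha + m); predictive p(x_k=1 | D) = (m_k + alpha_k) / (N + alpha_0).
p_next = (m + alpha) / (m.sum() + alpha.sum())
print(p_next)  # [2/11 4/11 1/11 2/11 1/11 1/11] -- no face gets probability zero
```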
The distribution problem in machine learning (binary and multivariate variable distributions, Beta, Dirichlet)