ML - Maximum Likelihood Estimation
MAP - Maximum A Posteriori Estimation
Bayesian Estimation
The relationships and differences between the three
One. Machine learning
The core idea is to learn rules from past experience in order to predict new cases. For supervised learning, the more useful samples we have, the more accurate the trained model.
The following outlines the process of machine learning and the knowledge involved:
In simple terms:
- The first step is to define our hypothesis space (model assumption): linear classification, linear regression, logistic regression, SVM, deep neural networks, etc.
- How do we measure the quality of the models we have learned? Define a loss function (objective function), such as squared loss.
- How do we optimize within the hypothesis space? Simply put, choose an algorithm (such as gradient descent, Newton's method, etc.) to optimize the objective function and finally obtain the optimal solution;
- Different models use different algorithms: logistic regression is usually solved by gradient descent, neural networks by backpropagation, and Bayesian models by MCMC.
- Machine learning = model + optimization (different algorithms)
- One more question: how is the complexity of the model measured? Complex models are prone to overfitting, and the way to address overfitting is to add regularization.
- After all of the above is settled, how can we judge whether the learned solution is really good? Use cross-validation.
Two. ML vs MAP vs Bayesian
- ML (maximum likelihood estimation): given a model with parameters, maximize p(D | parameters), the probability of the observed sample set given the parameters. The goal is to find the parameters that make this likelihood largest.
- Logistic regression is trained by ML;
- Disadvantage: our prior knowledge is not added to the model.
- MAP (maximum a posteriori estimation): maximize p(parameters | D).
- Bayesian: the prediction considers all possible parameters, i.e. the whole parameter space (a distribution over the parameters).
- ML and MAP both belong to the same category, called frequentist methods, and share the same final goal: find a single optimal solution and then use that optimal solution to make predictions.
Three. ML
We need to maximize P(D | \theta); this optimization can usually be done by setting the derivative to zero. However, ML estimation does not take prior knowledge into account and can easily lead to overfitting.
For example, a doctor may have seen 100 patients in a day, of whom 5 were diagnosed with cancer; under ML estimation, the probability of having cancer is 0.05.
This is obviously not realistic: we know from experience that this probability should be much lower. However, ML estimation has no way to incorporate that knowledge into the model.
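The cancer example can be checked numerically. For a Bernoulli model, setting the derivative of the log-likelihood to zero gives the raw sample frequency as the ML estimate; a minimal sketch (the counts 5 and 100 come from the example above):

```python
# ML estimate for a Bernoulli model: maximize P(D | theta).
# Setting d/dtheta [k*log(theta) + (n - k)*log(1 - theta)] = 0
# gives theta_ML = k / n, i.e. the raw sample frequency.

def bernoulli_mle(k, n):
    """ML estimate of the success probability from k successes in n trials."""
    return k / n

theta_ml = bernoulli_mle(5, 100)  # 5 cancer diagnoses out of 100 patients
print(theta_ml)  # 0.05
```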
Four. MAP
From the derivation above, we can see that the biggest difference between MAP and ML is the term p(parameters). MAP remedies exactly the disadvantage of ML, its lack of prior knowledge, by adding a prior over the parameters to the objective function.
In fact, p(parameters) acts as regularization. If p(parameters) is assumed to be a Gaussian distribution, it is equivalent to adding an L2 norm penalty; if p(parameters) is assumed to follow a Laplace distribution, it is equivalent to adding an L1 norm penalty.
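One way to see the prior acting as a regularizer is on the same Bernoulli example. With a Beta(alpha, beta) prior, maximizing the log likelihood plus the log prior has a closed form that pulls the estimate away from the raw frequency. A sketch under these assumptions (the Beta prior and its hyperparameters are illustrative, not from the notes):

```python
# MAP estimate for a Bernoulli model with a Beta(alpha, beta) prior.
# Maximizing log P(D | theta) + log p(theta) gives, in closed form,
#   theta_MAP = (k + alpha - 1) / (n + alpha + beta - 2),
# which pulls the estimate toward the prior mean; the prior term plays
# the same role as a regularizer on theta.

def bernoulli_map(k, n, alpha, beta):
    """MAP estimate of the success probability under a Beta(alpha, beta) prior."""
    return (k + alpha - 1) / (n + alpha + beta - 2)

# A prior encoding the belief that cancer is rare (illustrative numbers):
theta_map = bernoulli_map(5, 100, alpha=2, beta=200)
theta_ml = 5 / 100
print(theta_map, theta_ml)  # the MAP estimate is well below 0.05
```

Note that with a uniform prior (alpha = beta = 1) the MAP estimate reduces to the ML estimate, which matches the claim that the prior is the only difference between the two.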
Five. Bayesian
Once again: ML and MAP give only a single optimal solution, but the Bayesian approach gives a distribution over the parameters. Suppose the parameter space contains parameter 1, parameter 2, parameter 3, ..., parameter n. The Bayesian model learns the importance of these parameters (that is, their distribution), and when we predict a new sample, all the models predict together, each weighted by its learned posterior probability. The final decision combines all these estimates according to their weights.
The advantage of this model ensemble is that it reduces variance, similar to investing: spreading money across many different stocks is less risky than putting everything into a single stock.
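The "all models vote with their posterior weights" idea can be sketched on a discrete grid of parameter values; the grid, uniform prior, and data here are illustrative:

```python
import numpy as np

# Bayesian prediction as a weighted vote: instead of one optimal theta,
# keep a posterior distribution over a grid of thetas and average the
# predictions of every theta, weighted by its posterior probability.

thetas = np.linspace(0.01, 0.99, 99)        # candidate parameters
prior = np.ones_like(thetas) / len(thetas)  # uniform prior (assumption)

k, n = 5, 100  # observed data: 5 successes in 100 trials
likelihood = thetas**k * (1 - thetas)**(n - k)
posterior = prior * likelihood
posterior /= posterior.sum()                # normalize

# Predictive probability of a success on a new sample:
#   p(new | D) = sum_theta p(new | theta) * p(theta | D)
p_new = np.sum(thetas * posterior)
print(p_new)  # every theta on the grid contributes, weighted by its posterior
```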
Six. What is the difference between the frequentist and the Bayesian approaches mentioned above?
A simple example to summarize (because this part is today's focus). Suppose you are the class monitor and have a question whose answer you want to know; you can ask the students in the class. One option is to ask the single best student. Another is to ask all the students and then combine their answers, weighting each answer by that student's academic performance. The first option is similar in spirit to ML and MAP; the second is similar to the Bayesian approach.
Seven. The difficulty of Bayesian
So the core technique of the entire Bayesian field is approximating p(\theta | D), which we call Bayesian inference. The central problem is approximating a complex integral, and one solution is the Monte Carlo method. For example, to compute the average height of all employees in a company, the simplest way is to measure everyone one by one and take the average. But how do we compute the average height of all Chinese people? (Measuring everyone is obviously impossible.)
The answer is sampling. We randomly select some people, measure their heights, and then estimate the national average from them. Of course, the more samples and the more representative the sampled data, the more accurate the estimate. This is the core idea of the Monte Carlo method.
Another example:
Suppose we don't know \pi but want to compute the area of a circle. This can also be approximated by sampling: randomly scatter points in the square, and let N1 be the number of points falling into the red (quarter-circle) area and N2 the number falling into the white area. Then the area of the quarter circle is approximately N1/(N1+N2) times the area of the square. This is the Monte Carlo idea.
So how do we estimate the integral of a continuous function? Sample n data points and use them to approximate the integral.
Suppose we want to compute the expected value of f(x) under a distribution p(x). We repeatedly sample from p(x), obtaining x1, x2, ..., xn, then evaluate f at these samples; the final estimate is (f(x1) + f(x2) + ... + f(xn))/n.
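The quarter-circle experiment above can be run directly; a minimal sketch:

```python
import random

# Monte Carlo estimate of pi: scatter points uniformly in the unit square
# and count how many land inside the quarter circle of radius 1.
# area(quarter circle) / area(square) = pi/4 ~= N1 / (N1 + N2).

def estimate_pi(num_points, seed=0):
    rng = random.Random(seed)
    inside = 0  # N1: points falling inside the quarter circle
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_points

print(estimate_pi(100_000))  # accuracy improves as num_points grows
```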
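That estimator is only a few lines of code. As a sanity check, E[x^2] under a standard normal should come out near 1 (the choice of f and p here is illustrative):

```python
import random

# Monte Carlo estimate of E[f(x)] under p(x):
# draw x1..xn from p, then average f(x1)..f(xn).

def mc_expectation(f, sample_p, n, seed=0):
    rng = random.Random(seed)
    return sum(f(sample_p(rng)) for _ in range(n)) / n

# Example: p(x) = standard normal, f(x) = x^2, true E[f(x)] = 1.
est = mc_expectation(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0), 50_000)
print(est)  # close to 1.0
```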
The samples in the example above are independent: each sample does not affect the others. However, in real-world problems we sometimes want to generate effective samples faster. How to optimize the sampling process is itself a fairly large topic in machine learning.
To repeat: with the sampling method above we can approximate complex integrals, compute the area of a circle, and estimate the national average height. But these samples are drawn independently, and sometimes we want fewer samples to approximate a target more accurately; hence there is a whole research area studying how to make the sampling process more efficient.
MCMC, short for Markov chain Monte Carlo, is a sampling method in which successive samples are correlated: each sample depends on the previous one.
However, each MCMC step needs to be computed over the entire data set: obtaining one sample requires iterating over all the data. This obviously does not scale when N is large, and this computational cost is the main factor restricting the development of Bayesian methods. So the question people care most about in this area now is: how can we optimize sampling so that Bayesian models can be learned in a big-data environment?
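A minimal Metropolis-Hastings sketch shows this "each sample depends on the previous one" structure; the target density here (a standard normal, known only up to a constant) is illustrative:

```python
import math
import random

# Metropolis-Hastings: a Markov chain whose states are the samples.
# Each new sample is proposed as a perturbation of the current sample,
# so consecutive samples are correlated, unlike plain Monte Carlo.

def metropolis(log_target, n_samples, step=1.0, x0=0.0, seed=0):
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)  # depends on the current x
        # Accept with probability min(1, target(proposal) / target(x)):
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples

# Target: unnormalized standard normal, log p(x) = -x^2/2 + const.
chain = metropolis(lambda x: -0.5 * x * x, 50_000)
mean = sum(chain) / len(chain)
print(mean)  # close to 0, the mean of the target distribution
```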
An example of reducing iteration complexity:
For logistic regression trained by gradient descent, there is batch gradient descent (using the whole data set to update the parameters); to reduce the computational cost, stochastic gradient descent instead updates the parameters using samples drawn randomly from the data set.
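The contrast is visible in the update step alone: batch gradient descent touches all N examples per update, while stochastic gradient descent touches one randomly chosen example. A sketch for logistic regression (the toy 1-D data is illustrative):

```python
import math
import random

# Batch vs stochastic gradient descent for 1-D logistic regression.
# Batch: one update costs a full pass over all N points.
# Stochastic: one update costs a single randomly chosen point.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_step(w, xs, ys, lr):
    grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

def sgd_step(w, xs, ys, lr, rng):
    i = rng.randrange(len(xs))  # one random sample per update
    return w - lr * (sigmoid(w * xs[i]) - ys[i]) * xs[i]

# Toy data: positive x -> label 1, negative x -> label 0.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

rng = random.Random(0)
w_batch = w_sgd = 0.0
for _ in range(2000):
    w_batch = batch_step(w_batch, xs, ys, lr=0.5)
    w_sgd = sgd_step(w_sgd, xs, ys, lr=0.5, rng=rng)
print(w_batch, w_sgd)  # both end up positive, separating the two classes
```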
So, can this idea be used in MCMC sampling?
Yes! Langevin dynamics (one of the MCMC algorithms) can be combined with stochastic optimization (such as stochastic gradient descent). This way, we can draw samples using only a small mini-batch; the cost of each sampling step then depends not on N but on the mini-batch size M, where M is far smaller than N.
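This combination, Stochastic Gradient Langevin Dynamics (SGLD), adds Gaussian noise of matched scale to a mini-batch gradient step, so each update touches only M << N points. A sketch for sampling the posterior mean of a simple Gaussian model (the toy data, fixed step size, and prior are illustrative; the full algorithm uses a decreasing step size):

```python
import math
import random

# Stochastic Gradient Langevin Dynamics (SGLD), a sketch:
# a mini-batch gradient ascent step on the log posterior, plus
# Gaussian noise with variance equal to the step size. Each update
# uses only m data points, not the whole data set of size N.

def sgld(data, n_iters, m, eps, seed=0):
    rng = random.Random(seed)
    N = len(data)
    mu, samples = 0.0, []
    for _ in range(n_iters):
        batch = [data[rng.randrange(N)] for _ in range(m)]
        # Log-posterior gradient: N(0, 10^2) prior on mu, unit-variance
        # Gaussian likelihood, mini-batch gradient rescaled by N/m.
        grad = -mu / 100.0 + (N / m) * sum(x - mu for x in batch)
        mu += 0.5 * eps * grad + rng.gauss(0.0, math.sqrt(eps))
        samples.append(mu)
    return samples

# Toy data: N = 1000 points around a true mean of 2.0; mini-batches of m = 10.
rng = random.Random(1)
data = [2.0 + rng.gauss(0.0, 1.0) for _ in range(1000)]
chain = sgld(data, n_iters=5000, m=10, eps=1e-3)
post_mean = sum(chain[1000:]) / len(chain[1000:])
print(post_mean)  # close to 2.0, the posterior mean
```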
(to be continued)
Bayesian Thinking -- notes from teacher Li Wenzhe's lecture