MLE (maximum likelihood) vs. LS (least squares) and MAP (maximum a posteriori)
Preface
Maximum likelihood estimation (MLE) is a basic idea that appears throughout machine learning; many specific algorithms and models are built on it, or can be explained in terms of it. For example:
- MLE explains why the commonly used linear regression minimizes the squared error (i.e., least squares) rather than, say, the fourth power of the error.
- The relation and difference between the MLE idea and the MAP idea: this touches on the frequentist vs. Bayesian schools of probability and statistics, and on how regularization in machine learning can be understood. (MAP, Bayesian estimation, the naive Bayes classifier, and even logistic regression (LR) are all related; these are discussed in other articles.)
- The MLE idea underlies the EM algorithm (expectation maximization), one of the classic "top ten" machine learning algorithms; K-means is in fact a special case of EM. (EM is discussed in another article.)
This article elaborates the idea of maximum likelihood and discusses how LS and MAP relate to it.

1. MLE (Maximum Likelihood Estimation)
Here is the first key question: what is the difference between likelihood and probability? Roughly speaking, likelihood refers to a reverse process: the result (the data) is known, and we infer the model or hypothesis behind it; the value of the likelihood is not meaningful on its own, only its relative size across different hypotheses is. Probability refers to a forward process: the model and its parameters are known, and we deduce how likely a result is; that value itself has a probabilistic meaning.

1.1 Problem definition (applicable scenario)

- We are given a set of samples (data), all drawn from the same distribution (identically distributed);
- each sample is drawn independently of the others (independent events);
- we do not know the specific distribution, but we assume it belongs to a known distribution family, so we only need to determine its parameters. In short: "the model is known, the parameters are unknown."
In this situation, maximum likelihood estimation can be used to estimate the model parameters: find the set of parameters under which the probability that the model produces the observed data is greatest. For example, if we assume the data follow a Gaussian distribution, the goal is simply to determine its mean and variance.
(The three parts of the definition above are precisely the three rather strong assumptions that maximum likelihood estimation relies on: identically distributed, independent, and a known model family.)
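As a concrete illustration of this setup, here is a minimal sketch (assuming NumPy; the parameter values and sample size are made up for the example) that generates Gaussian data and recovers the maximum likelihood estimates, which for a Gaussian are the sample mean and the biased sample variance:

```python
import numpy as np

# Made-up "true" parameters, used only to generate example data.
true_mu, true_sigma = 2.0, 1.5
rng = np.random.default_rng(0)
samples = rng.normal(true_mu, true_sigma, size=1000)  # i.i.d. samples from one distribution

# For a Gaussian, maximizing the likelihood has a closed-form solution:
#   mu_hat     = sample mean
#   sigma2_hat = biased sample variance (divide by n, not n - 1)
mu_hat = samples.mean()
sigma2_hat = ((samples - mu_hat) ** 2).mean()

print(f"MLE mean:     {mu_hat:.3f}  (true {true_mu})")
print(f"MLE variance: {sigma2_hat:.3f}  (true {true_sigma**2})")
```

Note that the biased variance (dividing by n rather than n-1) is what maximizing the likelihood itself gives; the unbiased n-1 version is a separate correction. The rest of this section shows where such estimates come from, by defining and then maximizing the likelihood function.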
1.2 Likelihood function

Having defined the problem, we use the likelihood function to quantify "the probability of the model producing the observed data". It can be understood as the conditional probability $P(X \mid \theta)$, where $\theta$ is the model parameter we want to estimate and $X$ is the data that has been observed. The likelihood function is defined precisely as follows:
$$L(\theta; x_1, x_2, \dots, x_n) = f(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Here the likelihood is expressed through the model's probability density function $f$; for example, the probability density function of the Gaussian distribution is

$$f(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Since we assume the samples are independent, the joint probability of all the samples splits into a product of $n$ individual probabilities. In practice the logarithm of the likelihood is often used instead: it simplifies the algebra, and maximizing it is equivalent to maximizing the likelihood. This is the log-likelihood:

$$\ln L(\theta; x_1, \dots, x_n) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$$
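To make the log-likelihood concrete, here is a small sketch (assuming SciPy is available and reusing made-up Gaussian data like the earlier snippet) that evaluates $\ln L$ for a few candidate means and shows it is largest near the value that generated the data:

```python
import numpy as np
from scipy.stats import norm

# Made-up i.i.d. Gaussian data, as in the earlier sketch.
rng = np.random.default_rng(0)
samples = rng.normal(2.0, 1.5, size=1000)

def log_likelihood(mu, sigma, x):
    # ln L(theta; x_1, ..., x_n) = sum_i ln f(x_i | theta)
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

# Evaluate the log-likelihood at several candidate means (sigma held at 1.5).
for mu in [0.0, 1.0, 2.0, 3.0]:
    print(f"mu = {mu:.1f}  ->  ln L = {log_likelihood(mu, 1.5, samples):.1f}")
# The largest value occurs near mu = 2.0, the mean that generated the data.
```

Maximum likelihood estimation, described next, turns this comparison into an optimization over all candidate parameters.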
1.3 Maximum likelihood estimation

Having defined the problem and chosen the objective function (the likelihood function), what remains is to maximize it, that is, to find the set of model parameters $\hat{\theta}_{MLE}$ under which the probability of the model producing the observed data is greatest: