MLE (maximum likelihood) vs. LS (least squares) and MAP (maximum a posteriori)
Preface
Maximum likelihood estimation (MLE) is a basic idea that appears throughout machine learning; many specific algorithms and models are built on it, or can be explained in terms of it. For example:
- MLE explains why the commonly used linear regression minimizes the squared error (i.e., least squares) rather than, say, the fourth power of the error.
- The relation and difference between the MLE idea and the MAP idea: this touches on the frequentist vs. Bayesian schools of probability and statistics, and on how regularization in machine learning can be understood. (MAP, Bayesian estimation, the naive Bayes classifier, and even logistic regression (LR) are all related; these are discussed in other articles.)
- The MLE idea underlies the EM algorithm (expectation maximization), one of the classic "top ten" machine learning algorithms; K-means is in fact a special case of EM. (EM is discussed in another article.)
This article elaborates the idea of maximum likelihood and discusses how LS and MAP relate to it.

1. MLE (Maximum Likelihood Estimation)
Here is the first key question: what is the difference between likelihood and probability? Roughly speaking, likelihood refers to a reverse process: the result (the data) is known, and we infer the model or hypothesis behind it; the value of the likelihood is not meaningful on its own, only its relative size across different hypotheses is. Probability refers to a forward process: the model and its parameters are known, and we deduce how likely a result is; that value itself has a probabilistic meaning.

1.1 Problem definition (applicable scenario)

- We are given a set of samples (data), all drawn from the same distribution (identically distributed);
- each sample is drawn independently of the others (independent events);
- we do not know the specific distribution, but we assume it belongs to a known distribution family, so we only need to determine its parameters. In short: "the model is known, the parameters are unknown."
In this situation, maximum likelihood estimation can be used to estimate the model parameters: find the set of parameters under which the probability that the model produces the observed data is greatest. For example, if we assume the data follow a Gaussian distribution, the goal is simply to determine its mean and variance.
(The three parts of the definition above are precisely the three rather strong assumptions that maximum likelihood estimation relies on: identically distributed, independent, and a known model family.)
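As a concrete illustration of this setup, here is a minimal sketch (assuming NumPy; the parameter values and sample size are made up for the example) that generates Gaussian data and recovers the maximum likelihood estimates, which for a Gaussian are the sample mean and the biased sample variance:

```python
import numpy as np

# Made-up "true" parameters, used only to generate example data.
true_mu, true_sigma = 2.0, 1.5
rng = np.random.default_rng(0)
samples = rng.normal(true_mu, true_sigma, size=1000)  # i.i.d. samples from one distribution

# For a Gaussian, maximizing the likelihood has a closed-form solution:
#   mu_hat     = sample mean
#   sigma2_hat = biased sample variance (divide by n, not n - 1)
mu_hat = samples.mean()
sigma2_hat = ((samples - mu_hat) ** 2).mean()

print(f"MLE mean:     {mu_hat:.3f}  (true {true_mu})")
print(f"MLE variance: {sigma2_hat:.3f}  (true {true_sigma**2})")
```

Note that the biased variance (dividing by n rather than n-1) is what maximizing the likelihood itself gives; the unbiased n-1 version is a separate correction. The rest of this section shows where such estimates come from, by defining and then maximizing the likelihood function.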
1.2 Likelihood function

Having defined the problem, we use the likelihood function to quantify "the probability of the model producing the observed data". It can be understood as the conditional probability $P(X \mid \theta)$, where $\theta$ is the model parameter we want to estimate and $X$ is the data that has been observed. The likelihood function is defined precisely as follows:
$$L(\theta; x_1, x_2, \dots, x_n) = f(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Here the likelihood is expressed through the model's probability density function $f$; for example, the probability density function of the Gaussian distribution is

$$f(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Since we assume the samples are independent, the joint probability of all the samples splits into a product of $n$ individual probabilities. In practice the logarithm of the likelihood is often used instead: it simplifies the algebra, and maximizing it is equivalent to maximizing the likelihood. This is the log-likelihood:

$$\ln L(\theta; x_1, \dots, x_n) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$$
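To make the log-likelihood concrete, here is a small sketch (assuming SciPy is available and reusing made-up Gaussian data like the earlier snippet) that evaluates $\ln L$ for a few candidate means and shows it is largest near the value that generated the data:

```python
import numpy as np
from scipy.stats import norm

# Made-up i.i.d. Gaussian data, as in the earlier sketch.
rng = np.random.default_rng(0)
samples = rng.normal(2.0, 1.5, size=1000)

def log_likelihood(mu, sigma, x):
    # ln L(theta; x_1, ..., x_n) = sum_i ln f(x_i | theta)
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

# Evaluate the log-likelihood at several candidate means (sigma held at 1.5).
for mu in [0.0, 1.0, 2.0, 3.0]:
    print(f"mu = {mu:.1f}  ->  ln L = {log_likelihood(mu, 1.5, samples):.1f}")
# The largest value occurs near mu = 2.0, the mean that generated the data.
```

Maximum likelihood estimation, described next, turns this comparison into an optimization over all candidate parameters.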
1.3 Maximum likelihood estimation

Having defined the problem and chosen the objective function (the likelihood function), what remains is to maximize it, that is, to find the set of model parameters $\hat{\theta}_{MLE}$ under which the probability of the model producing the observed data is greatest: