While taking Andrew Ng's machine learning course, I thought I had a completely clear understanding of maximum likelihood estimation, maximum a posteriori estimation, logistic regression, and Bayesian classification, but after taking my school's pattern recognition course, my whole outlook was overturned... Let me write down a few words.
First, the concepts of maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP) represent the two schools of statistics: frequentist and Bayesian. MLE (the frequentist view) holds that the unknown parameter in the model, while unknown, is a fixed constant. MAP (the Bayesian view) holds that the unknown parameter is a random variable, itself following some distribution; that distribution is not known to us, and it is what we need to work out.
Before the full derivation, some groundwork. Given a training set $\left\{ x^{(i)}, y^{(i)} \right\}, i = 1, \dots, m$, we first assume that any two training samples are independent of each other; all the reasoning below rests on this. Now let's describe MLE and MAP in turn.
For MLE, the unknown parameter in the model is a constant, as mentioned above, so the probability of $y^{(i)}$ occurring given $x^{(i)}$ can be written $p\left( y^{(i)} \mid x^{(i)}; \theta \right)$. The semicolon ";" is used because $\theta$ is a constant, which is what Ng means by "parameterized by $\theta$". Since the training samples are independent of each other, the probability of the entire training set occurring, i.e. the likelihood $L(\theta)$, is the product of these $p\left( y^{(i)} \mid x^{(i)}; \theta \right)$, giving the following formula:
$$L(\theta) = \prod_{i=1}^{m} p\left( y^{(i)} \mid x^{(i)}; \theta \right)$$
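As a small numeric sketch of this product form (my own toy example, not from the course), take the simplest possible model: a Bernoulli distribution over labels with parameter $\theta$ and no inputs $x$. The likelihood of a set of independent samples is just the product of the per-sample probabilities:

```python
# Toy illustration (assumption: a Bernoulli model p(y; theta) with no
# inputs x, chosen only to show the product structure of L(theta)).
def likelihood(theta, ys):
    """L(theta) = prod_i p(y_i; theta) for Bernoulli samples y_i in {0, 1}."""
    L = 1.0
    for y in ys:
        # p(y; theta) is theta for y = 1 and (1 - theta) for y = 0.
        p = theta if y == 1 else 1.0 - theta
        L *= p
    return L

ys = [1, 0, 1, 1]
print(likelihood(0.75, ys))  # 0.75 * 0.25 * 0.75 * 0.75 = 0.10546875
```

Because the samples are assumed independent, the joint probability factorizes into this product; that independence assumption is exactly why the formula above is a plain product over $i$.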
How do we determine $\theta$? Since the training set has already occurred, we take the $\theta$ that makes $L(\theta)$ largest to be the true $\theta$. So we differentiate $L(\theta)$ with respect to $\theta$, and the $\theta$ that makes the derivative equal to 0 is the $\theta$ we want.
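Continuing the Bernoulli toy example (again my own illustration, not from the course): in practice one maximizes $\log L(\theta)$, since the log turns the product into a sum and has the same maximizer. Setting the derivative of the log-likelihood to zero gives the familiar closed form $\hat{\theta} = k/m$ (the fraction of 1s), and a simple grid search agrees with it:

```python
import math

def log_likelihood(theta, ys):
    """log L(theta) = sum_i log p(y_i; theta); the log turns the product into a sum."""
    return sum(math.log(theta if y == 1 else 1.0 - theta) for y in ys)

ys = [1, 0, 1, 1, 0, 1]

# Closed form from d/dtheta log L(theta) = 0: theta_hat = (number of 1s) / m.
theta_closed = sum(ys) / len(ys)

# Numeric sanity check: maximize log L over a fine grid in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, ys))

print(theta_closed, theta_grid)  # both near 2/3
```

The grid maximizer lands on the grid point nearest $2/3$, matching the derivative-equals-zero solution described above.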
Now let's look at MAP, whose concept, as stated above, is based on the model