Stanford University Machine Learning Open Course (IV): Newton's Method, the Exponential Family, and Generalized Linear Models

Last Update:2016-04-21 Source: Internet

Author: User

(i) Newton's method for maximum likelihood estimation

Newton's method, like gradient descent, is a way of searching the solution space. The basic idea is as follows. Suppose we want to find a value of x at which a function f(x) equals zero:

$$f(x) = 0$$

We start from an arbitrary initial point and take the tangent line to f at that point (its slope is the derivative). We extend the tangent until it crosses the x-axis, and the x-value of the intersection becomes the next iterate. The update rule is:

$$x := x - \frac{f(x)}{f'(x)}$$
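The tangent-line update described above can be sketched in a few lines of Python. The function name `newton_root`, the stopping tolerance, and the example function f(x) = x^2 - 2 are illustrative assumptions, not part of the lecture.

```python
def newton_root(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Find x with f(x) = 0 by repeatedly following the tangent line to the x-axis."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)  # the tangent at x crosses the x-axis at x - step
        x -= step
        if abs(step) < tol:       # stop once the iterate barely moves
            break
    return x

# Hypothetical example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

The sketch assumes the derivative is nonzero at every iterate; a production routine would guard against division by zero and against divergence from a poor starting point.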
How does Newton's method apply to machine learning problems? There, the objective we optimize is the likelihood function. Write ℓ(θ) for its logarithm, which attains its maximum at the same θ. Where ℓ reaches its maximum, its derivative is zero, so maximizing ℓ is the same problem as finding a zero of a function, with f(θ) = ℓ'(θ). Substituting this into Newton's update rule gives:

$$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$$

This is the case where the parameter θ is a real number. When θ is a vector, the update rule becomes:

$$\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$$
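As a minimal sketch of the vector case, the code below applies Newton's method to the log-likelihood of a logistic regression with two parameters (an intercept and a slope). The toy data, the function names, the fixed iteration count, and the use of Cramer's rule to solve the 2x2 system are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_logistic(xs, ys, iters=25):
    """Fit h(x) = sigmoid(t0 + t1 * x) by Newton's method on the log-likelihood."""
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of the symmetric 2x2 Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(t0 + t1 * x)
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 -= w
            h01 -= w * x
            h11 -= w * x * x
        det = h00 * h11 - h01 * h01
        # d = H^{-1} g via Cramer's rule, then theta := theta - d
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        t0 -= d0
        t1 -= d1
    return t0, t1

# Hypothetical, deliberately non-separable toy data so that a finite maximum exists.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
t0, t1 = newton_logistic(xs, ys)
print(t0, t1)
```

For larger parameter vectors the same update would solve the linear system H d = ∇ℓ with a linear algebra library rather than forming the inverse by hand.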

In the vector form of the update rule, H is an n×n matrix, where n is the length of the parameter vector, that is, the number of features. H is the matrix of second derivatives of the function, known as the Hessian matrix, and its entries are:

$$H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \, \partial \theta_j}$$

In other words, the vector update is the analogue of dividing the first derivative by the second derivative: the vector of first derivatives (the gradient) is multiplied by the inverse of the matrix of second derivatives.

The advantage of Newton's method over gradient descent is fast convergence: it usually converges in roughly a dozen iterations. This behavior is called quadratic convergence, because once the iterate is close to the solution, each iteration roughly squares the error. The disadvantage is that every iteration requires computing the inverse of the Hessian, which is time-consuming when the parameter vector is large.

If the objective is to be minimized rather than maximized, the update rule does not change. How can you tell whether the resulting parameter is a maximum or a minimum of the objective? Check the second derivative: if it is less than 0, the point is a maximum; if it is greater than 0, the point is a minimum.

(ii) Exponential family

The exponential family is the set of probability distributions that can be written in exponential form. A distribution in the exponential family has the following form:

$$p(y;\eta) = b(y)\,\exp\left(\eta^T T(y) - a(\eta)\right)$$
Here η is called the natural parameter of the distribution, and T(y) is the sufficient statistic; usually T(y) = y. Once the functions a, b, and T are fixed, the formula defines a family of distributions parameterized by η.

In fact, many common probability distributions can be written in this form. For example:
1) Bernoulli distribution: models 0/1 outcomes;
2) Multinomial distribution: models events with k discrete outcomes;
3) Poisson distribution: models counting processes, such as the number of website visits, radioactive decays, or shoppers in a store;
4) Gamma and exponential distributions: model positive real-valued quantities such as time intervals, for example the waiting time for a bus;
5) Beta distribution: models fractions (values between 0 and 1);
6) Dirichlet distribution: models probability distributions;
7) Wishart distribution: models covariance matrices;
8) Gaussian distribution.

Next we write the Gaussian and Bernoulli distributions in exponential-family form. The Bernoulli distribution models 0/1 outcomes and, with φ the probability that y = 1, can be written as:

$$p(y;\phi) = \phi^{y}(1-\phi)^{1-y}$$

It is transformed as follows:

$$
\begin{aligned}
p(y;\phi) &= \exp\left(y\log\phi + (1-y)\log(1-\phi)\right) \\
&= \exp\left(\left(\log\frac{\phi}{1-\phi}\right) y + \log(1-\phi)\right)
\end{aligned}
$$

This expresses the Bernoulli distribution in exponential-family form, with:

$$\eta = \log\frac{\phi}{1-\phi}, \qquad T(y) = y, \qquad a(\eta) = -\log(1-\phi) = \log(1+e^{\eta}), \qquad b(y) = 1$$
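A quick numerical check can confirm that the exponential-family form reproduces the usual Bernoulli probabilities. This is only a sketch; the function names and the tested values of φ are assumptions chosen for illustration.

```python
import math

def bernoulli_pmf(y, phi):
    """Standard form: phi^y * (1 - phi)^(1 - y)."""
    return phi ** y * (1.0 - phi) ** (1 - y)

def bernoulli_exp_family(y, phi):
    """Exponential-family form b(y) * exp(eta * T(y) - a(eta)) with T(y) = y."""
    eta = math.log(phi / (1.0 - phi))  # natural parameter
    a = math.log(1.0 + math.exp(eta))  # a(eta) = -log(1 - phi)
    b = 1.0                            # b(y) = 1
    return b * math.exp(eta * y - a)

for phi in (0.1, 0.5, 0.9):
    for y in (0, 1):
        print(phi, y, bernoulli_pmf(y, phi), bernoulli_exp_family(y, phi))
```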
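As you can see, solving η = log(φ/(1-φ)) for φ gives φ = 1/(1+e^{-η}), which is the logistic function introduced earlier. This is because the logistic model treats the probability it estimates as the parameter of a Bernoulli distribution.

The linear model can be derived from the Gaussian distribution. The variance of the Gaussian does not affect the hypothesis function, so we set σ² = 1 to simplify the calculation. The Gaussian distribution is then brought into exponential-family form as follows:

$$
\begin{aligned}
p(y;\mu) &= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(y-\mu)^2\right) \\
&= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^2\right)\cdot\exp\left(\mu y - \frac{1}{2}\mu^2\right)
\end{aligned}
$$

That is:

$$\eta = \mu, \qquad T(y) = y, \qquad a(\eta) = \frac{\mu^2}{2} = \frac{\eta^2}{2}, \qquad b(y) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)$$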
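The same kind of check works for the unit-variance Gaussian. The sketch below compares the standard density with the product b(y) exp(η T(y) - a(η)); the function names and test points are illustrative assumptions.

```python
import math

def gaussian_pdf(y, mu):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)

def gaussian_exp_family(y, mu):
    """Exponential-family form with eta = mu, T(y) = y, a(eta) = eta^2 / 2."""
    eta = mu
    a = 0.5 * eta ** 2
    b = math.exp(-0.5 * y ** 2) / math.sqrt(2.0 * math.pi)  # the factor that depends only on y
    return b * math.exp(eta * y - a)

for mu in (-1.0, 0.0, 2.5):
    for y in (-0.3, 0.0, 1.7):
        print(mu, y, gaussian_pdf(y, mu), gaussian_exp_family(y, mu))
```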
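The key to the derivation is to separate the terms: the factor that depends only on y is moved outside the exponential and becomes b(y), the term that does not involve y gives a(η) (it appears in the exponent as -a(η)), and the mixed term becomes η^T T(y).

(iii) Generalized linear models
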
What is the use of defining the exponential family? It lets us derive generalized linear models (GLMs). When the Bernoulli distribution is written in exponential-family form, η and the parameter φ are related through the logistic function, and logistic regression follows from this (the derivation is given below). When the Gaussian distribution is written in exponential-family form, η is equal to the parameter μ of the normal distribution, and from this we can derive the ordinary least squares model. From these two examples we can conclude more generally that relating η to the parameters of other distributions through different mapping functions yields different models. The generalized linear model formally treats every member of the exponential family (each member has exactly one such canonical link) as an extension of the linear model: the linear function is mapped into other spaces through various nonlinear link functions, which greatly expands the range of problems that linear models can solve.

The formal definition of a GLM rests on three assumptions:
1) y | x; θ ~ ExpFamily(η): given the sample x and the parameter θ, the response y follows a distribution in the exponential family with natural parameter η;
2) Given x, the goal is to predict the expected value of T(y), so the hypothesis is h(x) = E[T(y) | x];
3) η = θ^T x: the natural parameter is a linear function of the input.
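Based on these three assumptions, we can derive the logistic model and the least squares model. The derivation of the logistic model is as follows:

$$
\begin{aligned}
h_\theta(x) &= E[y \mid x;\theta] = \phi \\
&= \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\theta^T x}}
\end{aligned}
$$

In the equation above, the first line uses assumption 2 together with the fact that the mean of a Bernoulli distribution is φ. The second line uses the exponential-family form of the Bernoulli distribution and assumption 3.

Similarly, for the least squares model, the derivation is as follows:

$$h_\theta(x) = E[y \mid x;\theta] = \mu = \eta = \theta^T x$$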
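The two derivations can be summarized in code: both hypotheses compute the same linear predictor η = θ^T x and differ only in the function applied to it. The function names below are assumptions chosen for this sketch.

```python
import math

def linear_predictor(theta, x):
    """eta = theta^T x (assumption 3)."""
    return sum(t * xi for t, xi in zip(theta, x))

def h_logistic(theta, x):
    """Bernoulli response: E[y | x] = phi = 1 / (1 + exp(-eta))."""
    eta = linear_predictor(theta, x)
    return 1.0 / (1.0 + math.exp(-eta))

def h_least_squares(theta, x):
    """Gaussian response: E[y | x] = mu = eta."""
    return linear_predictor(theta, x)

print(h_logistic([0.0, 0.0], [1.0, 2.0]))       # eta = 0 gives probability 0.5
print(h_least_squares([1.0, 2.0], [1.0, 3.0]))  # 1*1 + 2*3 = 7.0
```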
The function g(η) = E[T(y); η], which gives the mean of the distribution as a function of the natural parameter η, is called the canonical response function; for the Bernoulli distribution, for example, g(η) = 1/(1+e^{-η}). The inverse of the canonical response function, g^{-1}, is called the canonical link function.

So for a generalized linear model, the main decision is which distribution to choose. Choosing the Gaussian distribution gives the least squares model, and choosing the Bernoulli distribution gives the logistic model. The "model" here means the form of the hypothesis function h. To sum up, the generalized linear model obtains different models by assuming different probability distributions, whereas gradient descent and Newton's method, discussed earlier, are used to find the parameter θ of the linear part of the model.

(iv) GLM example: the multinomial distribution

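The GLM derived from the multinomial distribution solves multi-class classification problems and is an extension of the logistic model. Applications include email classification and predicting which disease a patient has.

The target value of the multinomial distribution is y ∈ {1, 2, 3, ..., k}, and its probability distribution is:

$$p(y = i;\phi) = \phi_i, \qquad i = 1, \dots, k$$

Because the probabilities sum to one, $\sum_{i=1}^{k}\phi_i = 1$, we only need to keep k-1 parameters and set:

$$\phi_k = 1 - \sum_{i=1}^{k-1}\phi_i$$

To write the multinomial distribution in exponential-family form, first define T(y) as the following (k-1)-dimensional vectors:

$$
T(1) = \begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix},\quad
T(2) = \begin{bmatrix}0\\1\\\vdots\\0\end{bmatrix},\quad \dots,\quad
T(k-1) = \begin{bmatrix}0\\0\\\vdots\\1\end{bmatrix},\quad
T(k) = \begin{bmatrix}0\\0\\\vdots\\0\end{bmatrix}
$$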
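The definition of T(y) amounts to a one-hot encoding that drops the last class. A minimal sketch, with the function name `T` and the choice k = 4 as assumptions:

```python
def T(y, k):
    """Return the (k-1)-dimensional vector T(y) for a class label y in {1, ..., k}."""
    return [1 if y == i else 0 for i in range(1, k)]

print(T(2, 4))  # [0, 1, 0]
print(T(4, 4))  # [0, 0, 0]: the last class maps to the zero vector
```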

We can also introduce the indicator function 1{·}, which equals 1 when its argument is true and 0 otherwise, so that the i-th element of the vector T(y) can be written as:

$$(T(y))_i = 1\{y = i\}$$

For example, when y = 2, (T(2))_2 = 1 and all other elements are 0. From this definition we also get:

$$E[(T(y))_i] = P(y = i) = \phi_i$$

The multinomial distribution is then brought into exponential-family form as follows:

$$
\begin{aligned}
p(y;\phi) &= \phi_1^{1\{y=1\}}\,\phi_2^{1\{y=2\}}\cdots\phi_k^{1\{y=k\}} \\
&= \phi_1^{(T(y))_1}\,\phi_2^{(T(y))_2}\cdots\phi_{k-1}^{(T(y))_{k-1}}\,\phi_k^{1-\sum_{i=1}^{k-1}(T(y))_i} \\
&= \exp\left((T(y))_1\log\phi_1 + \cdots + (T(y))_{k-1}\log\phi_{k-1} + \left(1-\sum_{i=1}^{k-1}(T(y))_i\right)\log\phi_k\right) \\
&= \exp\left((T(y))_1\log\frac{\phi_1}{\phi_k} + \cdots + (T(y))_{k-1}\log\frac{\phi_{k-1}}{\phi_k} + \log\phi_k\right) \\
&= b(y)\exp\left(\eta^T T(y) - a(\eta)\right)
\end{aligned}
$$

The probability p(y;φ) is written as a product of powers of the φ_i. Each exponent is the value of an indicator function, so it is either 0 or 1, and φ_i is the probability of class i. Each factor is therefore either φ_i or 1, and a factor equal to 1 has no effect on the final product. That is why the multinomial distribution can be written in this product form. In the derivation, we first take the exponential of the logarithm, which turns the product into a sum of logarithms. We then merge the terms that multiply log φ_k into the other terms, and finally rewrite the sum as a vector inner product, where T(y) is a (k-1)×1 vector.
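The components in the last step of the derivation are:

$$
\eta = \begin{bmatrix}\log(\phi_1/\phi_k)\\ \log(\phi_2/\phi_k)\\ \vdots\\ \log(\phi_{k-1}/\phi_k)\end{bmatrix},\qquad
a(\eta) = -\log\phi_k,\qquad b(y) = 1
$$

From the expression for η:

$$\eta_i = \log\frac{\phi_i}{\phi_k}, \qquad i = 1, \dots, k-1$$

For convenience, we also define η_k = log(φ_k/φ_k) = 0. As a result:

$$e^{\eta_i} = \frac{\phi_i}{\phi_k}, \qquad \phi_k e^{\eta_i} = \phi_i, \qquad \phi_k\sum_{i=1}^{k}e^{\eta_i} = \sum_{i=1}^{k}\phi_i = 1$$

The last identity gives $\phi_k = 1/\sum_{j=1}^{k}e^{\eta_j}$. Substituting this into $\phi_i = \phi_k e^{\eta_i}$, we get:

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k}e^{\eta_j}}$$

This is the response function that maps η back to the probabilities φ, and with it we can express the probabilities of the multinomial distribution. Substituting η_i = θ_i^T x gives:

$$p(y = i \mid x;\theta) = \phi_i = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k}e^{\theta_j^T x}}$$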
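Notice that each η_i in the formula is a linear function of x with its own parameter vector θ_i, so θ here is actually a two-dimensional matrix. With the convention θ_k = 0 (so that η_k = 0), the hypothesis function h is:

$$
h_\theta(x) = E[T(y) \mid x;\theta] =
\begin{bmatrix}\phi_1\\ \phi_2\\ \vdots\\ \phi_{k-1}\end{bmatrix} =
\begin{bmatrix}
\dfrac{e^{\theta_1^T x}}{\sum_{j=1}^{k}e^{\theta_j^T x}}\\
\dfrac{e^{\theta_2^T x}}{\sum_{j=1}^{k}e^{\theta_j^T x}}\\
\vdots\\
\dfrac{e^{\theta_{k-1}^T x}}{\sum_{j=1}^{k}e^{\theta_j^T x}}
\end{bmatrix}
$$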
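A minimal sketch of this hypothesis in Python follows. It stores θ as a list of k-1 parameter vectors and fixes η_k = 0. Two details are assumptions of the sketch rather than part of the lecture: it returns all k probabilities (including φ_k) for convenience, and it subtracts the largest η before exponentiating, a common numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax_hypothesis(theta, x):
    """theta: list of k-1 parameter vectors. Returns [phi_1, ..., phi_k]."""
    etas = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    etas.append(0.0)                        # eta_k = 0 by convention
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]  # shift by the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters for k = 3 classes and inputs x = (1, feature).
theta = [[0.5, -1.0], [0.2, 0.3]]
probs = softmax_hypothesis(theta, [1.0, 2.0])
print(probs, sum(probs))
```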
How do we compute the parameter θ? As before, by maximum likelihood. Given the hypothesis function h and m training samples, the likelihood function is:

$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)};\theta)$$

Taking the logarithm of this expression gives the log-likelihood function:

$$\ell(\theta) = \sum_{i=1}^{m}\log p(y^{(i)} \mid x^{(i)};\theta)$$

Substituting the expression for the class probabilities then gives:

$$\ell(\theta) = \sum_{i=1}^{m}\log\prod_{l=1}^{k}\left(\frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k}e^{\theta_j^T x^{(i)}}}\right)^{1\{y^{(i)}=l\}}$$

The parameters are then obtained by maximizing this log-likelihood with gradient ascent (equivalently, gradient descent on its negative) or with Newton's method. New samples can be predicted with the hypothesis function h, which completes the multi-class classification task. This solution to the multi-class classification problem is called softmax regression.
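The whole procedure can be sketched as batch gradient ascent on the log-likelihood, using the gradient Σ_i (1{y^(i) = j} - φ_j(x^(i))) x^(i) for each class j = 1, ..., k-1. The toy data, learning rate, iteration count, and function names below are assumptions chosen for illustration, not values from the lecture.

```python
import math

def softmax_probs(theta, x):
    """Class probabilities [phi_1, ..., phi_k] with eta_k fixed to 0."""
    etas = [sum(t * xi for t, xi in zip(row, x)) for row in theta] + [0.0]
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

def fit_softmax(xs, ys, k, lr=0.05, iters=5000):
    """Maximize the log-likelihood by batch gradient ascent."""
    n = len(xs[0])
    theta = [[0.0] * n for _ in range(k - 1)]
    for _ in range(iters):
        grad = [[0.0] * n for _ in range(k - 1)]
        for x, y in zip(xs, ys):
            p = softmax_probs(theta, x)
            for j in range(k - 1):
                err = (1.0 if y == j + 1 else 0.0) - p[j]  # 1{y = j} - phi_j
                for d in range(n):
                    grad[j][d] += err * x[d]
        for j in range(k - 1):
            for d in range(n):
                theta[j][d] += lr * grad[j][d]             # ascent step
    return theta

def predict(theta, x):
    """Return the most probable class label in {1, ..., k}."""
    p = softmax_probs(theta, x)
    return p.index(max(p)) + 1

# Hypothetical data: inputs are (1, feature), labels are in {1, 2, 3}.
xs = [[1.0, -3.0], [1.0, -2.5], [1.0, -0.25], [1.0, 0.25], [1.0, 2.5], [1.0, 3.0]]
ys = [1, 1, 2, 2, 3, 3]
theta = fit_softmax(xs, ys, k=3)
print([predict(theta, x) for x in xs])
```

Because this toy data is linearly separable, the parameters keep growing as training continues; in practice a stopping criterion or regularization would be added.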

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
