Deep Learning Study Notes (I)

Source: Internet
Author: User

More specifically, these are reading notes on the blog at http://www.cnblogs.com/tornadomeet/archive/2013/03/14/2959138.html, together with my own understanding. The blue font marks my own understanding.

The model expression gives the functional relationship between input and output; this function rests on a prior hypothesis and may contain parameters. Given many training samples, we can define an average error over them, generally called the loss function. Our goal is to find the parameters in the model expression by minimizing this loss function. Minimization is usually done by gradient descent: start from a random set of parameter values, then repeatedly update the parameters so that each update makes the loss function smaller, eventually reaching a minimum. In gradient descent, the objective function can be regarded as a function of the parameters: once the sample inputs and outputs are fixed, only the parameters remain free, so the parameters become the independent variables and the objective becomes a function of them. Each gradient-descent step updates every parameter in the same form: the new value is the previous value minus the learning rate times the partial derivative of the objective with respect to that parameter (if there is only one parameter, the ordinary derivative). Why does this work? By evaluating the objective at different parameter points, one can see that such a step lowers the objective value, which is what we want (to reach the minimum of the function). Even with a fixed learning rate (as long as it is not too large), gradient descent can converge to a local minimum, because the gradient value itself shrinks, so its product with the fixed learning rate also shrinks.
In the linear regression problem, we can use gradient descent to find the parameters of the regression equation. The method is sometimes called batch gradient descent, where "batch" means that every parameter update uses all of the training samples.
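The batch gradient descent update described above can be sketched in Python/NumPy (the blog itself uses MATLAB notation; the toy data here is made up for illustration):

```python
import numpy as np

# Toy data drawn from y = 1 + 2*x. First column of X is the intercept term.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = np.zeros(2)   # initial parameter values
alpha = 0.1           # learning rate
m = len(y)

for _ in range(2000):
    # Partial derivatives of the average squared-error loss w.r.t. theta,
    # computed over ALL training samples ("batch" gradient descent).
    grad = X.T @ (X @ theta - y) / m
    theta -= alpha * grad   # every parameter updated in the same form

print(theta)  # approaches [1, 2]
```

Each iteration touches every sample, which is exactly what "batch" refers to.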

      A vectorized implementation means writing the computation in vector form: many variables in real problems are vectors, so writing out each component is inconvenient, and we replace the per-variable formulas with one vector formula (the variables are replaced by vectors). For example, the gradient-descent parameter update above can also be written in vector form; the vector formula is simpler and easier to program in MATLAB. Since gradient descent converges toward the extremum along the gradient direction, if the dimensions of the input samples have different scales (i.e. different ranges), the contour lines of the loss are stretched differently in different directions, which makes convergence toward the extremum very slow. ( I don't know exactly why this slows convergence, but note that when the variables are collected into one vector, all components share the same learning rate; if the input dimensions have very different scales, some averaging in the hundreds and some mere fractions, the same learning rate gives very different effective learning speeds for different variables, so we should rescale each dimension. ) Therefore, before running gradient descent, we first perform feature scaling: typically make each dimension of the samples zero-mean, i.e. subtract the mean of that dimension, then divide by the range of the variable.
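The feature-scaling step (subtract the mean, divide by the range) can be sketched as follows; the two feature columns here are invented to mimic the "hundreds vs. fractions" situation mentioned above:

```python
import numpy as np

# Two features on very different scales (hundreds vs. fractions).
X_raw = np.array([[200.0, 0.1],
                  [300.0, 0.3],
                  [400.0, 0.2],
                  [500.0, 0.4]])

# Feature scaling: subtract each dimension's mean, divide by its range.
mu = X_raw.mean(axis=0)
rng = X_raw.max(axis=0) - X_raw.min(axis=0)
X = (X_raw - mu) / rng

print(X.mean(axis=0))  # each dimension is now zero-mean
print(X.max(axis=0) - X.min(axis=0))  # each dimension now has range 1
```

After scaling, both dimensions are comparable, so one shared learning rate behaves reasonably in every direction.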

      Next is the effect of the learning rate on gradient descent. If the learning rate is too large, every iteration may overshoot, diverging on both sides of the extremum point ( why it keeps diverging is not obvious to me: after a step the parameters should move toward the local basin; with a large rate we certainly cross the lowest point, but why the steps then grow larger and larger instead of falling back is not clear ), and the loss function value grows larger and larger rather than smaller. In the plot of loss value versus iteration number ( i.e. the cost function value obtained at each iteration ), the curve rises monotonically. When the learning rate is too large, the curve may also oscillate constantly. ( This is easy to understand: if the learning rate is too big, one iteration may land on a bump and the next in a dip; the rate is too large to slide smoothly into the basin. ) If the learning rate is too small, convergence is simply very slow.
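The overshoot behaviour is easy to demonstrate on a one-dimensional quadratic (a made-up example, not from the blog): with f(w) = w², the gradient step multiplies w by (1 - 2*alpha), so a small rate contracts toward 0 while a large rate makes |w| blow up.

```python
# Minimize f(w) = w**2, whose gradient is 2*w, with two learning rates.
def descend(alpha, steps=20, w0=1.0):
    w = w0
    history = []
    for _ in range(steps):
        w -= alpha * 2 * w   # gradient-descent step: w *= (1 - 2*alpha)
        history.append(abs(w))
    return history

small = descend(0.1)   # factor 0.8 per step: converges toward 0
large = descend(1.5)   # factor -2 per step: diverges, loss grows each iteration

print(small[-1])  # tiny
print(large[-1])  # huge
```

With alpha = 1.5 each step lands on the far side of the minimum, farther out than it started, which is exactly the divergent curve described above.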

      The same problem can be attacked with different features and different models. For features: a single area feature, for example, can be rewritten as two features, length and width. For models: with a polynomial fitting model, for example, you can choose the maximum exponent of x. For training, all the training data is generally organized into a matrix in which each row is one training sample ( if you want to use the formula below, each row of the X matrix must be a training sample, not each column ); such a matrix is sometimes called the "design matrix". Solving the polynomial model's parameters in matrix form gives W = inv(X'*X)*X'*y ( this is MATLAB notation ); this equation is also called the normal equations. Although X'*X is square, its inverse does not necessarily exist ( a square matrix whose inverse does not exist is called singular ). For example, when X is the single element 0, its reciprocal does not exist, so it is a singular matrix, though this example is too special. Another common case is when the number of parameters exceeds the number of training samples, which makes the matrix non-invertible. In that case, you need to introduce a regularization term, or remove some features ( typically by dimensionality reduction, removing strongly correlated features ). ( There is said to be a theoretical basis for this, though I don't know why. )
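The MATLAB formula inv(X'*X)*X'*y can be mirrored directly in NumPy (same toy data as before, invented for illustration; rows of X are samples):

```python
import numpy as np

# Design matrix: each ROW is one training sample; first column is the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equations, mirroring the MATLAB form W = inv(X'*X)*X'*y.
W = np.linalg.inv(X.T @ X) @ X.T @ y
print(W)  # [1. 2.] -- recovers y = 1 + 2*x exactly
```

In practice one would prefer np.linalg.lstsq (or MATLAB's backslash) over an explicit inverse for numerical stability, but the formula above matches the text.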

  The functions above are all regression, i.e. the predicted value is continuous. If the value to predict takes only two values, yes or no, i.e. the prediction is either 0 or 1, then we have a classification problem. We then need a function that maps the original prediction into the interval between 0 and 1; usually this is the logistic function, also called the sigmoid function. Because this function's value is still continuous, the logistic function is interpreted as the probability that the output y equals 1 given the input x. ( Here I want to note the difference between linear regression and logistic regression. The goal of linear regression is to make the regression equation's value fit each sample's y_i as closely as possible; the regression equation is a linear superposition of the dimensions of the sample x, and it is called linear regression precisely because it is linear. With least squares, minimizing the squared error, we get the linear regression model. Once we have the parameters of the regression equation, given a new sample x we can predict its y. But sometimes we want more than a point prediction of y: we want the probability that y takes a given value, not just a black-and-white answer; we want to know how confident the prediction is. If the predicted y takes only one of two values, the model built to meet this requirement is the logistic regression model. So how is this model built, and how are its parameters obtained? )
First, the model itself. Our earlier linear regression equation will not do, because a linear superposition of the dimensions of x can exceed the [0,1] interval no matter how it is combined, so it cannot be a probability. Instead, assuming y's two values are 0 and 1, given x we model the probabilities of y=1 and y=0 as P(y=1|x) = exp(w*x+b)/(1+exp(w*x+b)) and P(y=0|x) = 1/(1+exp(w*x+b)) respectively. Notice that these model values cannot escape [0,1], and the predicted probabilities of 1 and 0 sum to 1, which is exactly what a probability requires. The model also shows that the closer the linear superposition w*x+b gets to positive infinity, the closer the predicted probability of y=1 is to 1 (and of y=0 to 0), and the reverse toward negative infinity. As for parameter estimation: unlike the earlier linear regression model, whose output is the predicted y value and which is trained simply by minimizing the error between true and predicted values, here the model's output is a probability, so our cost function is the likelihood function from probability theory. To minimize this cost function, we certainly hope it is convex, since only a convex function has a single minimum, making it easy for an iterative algorithm to reach the parameters at that optimum. This is why the exp() form above was chosen for the logistic model.
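The two probability formulas can be sketched with a single sigmoid function, since P(y=0|x) = 1/(1+exp(z)) is just sigmoid(-z):

```python
import math

def sigmoid(z):
    """P(y=1|x) = exp(z)/(1+exp(z)) = 1/(1+exp(-z)), where z = w*x + b."""
    return 1.0 / (1.0 + math.exp(-z))

# As z -> +inf the probability of y=1 approaches 1; as z -> -inf it approaches 0.
print(sigmoid(10.0))                   # close to 1
print(sigmoid(-10.0))                  # close to 0
print(sigmoid(10.0) + sigmoid(-10.0))  # P(y=1|x) + P(y=0|x) = 1
```

The last line checks the property noted above: the two predicted probabilities always sum to 1.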

A convex function here refers to a function with only one extremum point, while a non-convex function may have several. In general we hope the loss function takes a convex form. For the classification problem, first consider the training samples whose label is 1: the loss function should be smallest (0) when the predicted value is 1, and largest (infinity) when the predicted value is 0. ( So the idea behind the logistic regression model differs from the linear regression model: linear regression minimizes the error between the true values and the model's predictions, whereas logistic regression is more ruthless: a correct prediction costs 0 and a wrong one costs infinity; my personal understanding. ) A loss function of the form -log(h(x)) meets exactly these requirements. Similarly, when the training label is 0, the loss is -log(1-h(x)). Combining the two gives -y*log(h(x)) - (1-y)*log(1-h(x)), which is the same as above but more compact; a loss function of this form is what maximum likelihood estimation (MLE) yields. Gradient descent can still be used to solve for the optimal parameters. When deriving the iterative parameter update we again need the derivative of the loss, and strangely the partial derivative has the same structural form as in multivariate linear regression ( I don't know why ); the only difference is that one prediction function is a plain linear function, while the other is the composition of a linear function with the sigmoid.
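The combined loss formula can be sketched directly; the hypothetical predictions below just illustrate the "correct costs ~0, wrong costs a lot" behaviour:

```python
import math

def logistic_loss(y, h):
    """Per-sample loss: -y*log(h(x)) - (1-y)*log(1-h(x)), with h in (0, 1)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# Label 1, confident correct prediction -> loss near 0.
print(logistic_loss(1, 0.99))
# Label 1, confident wrong prediction -> large loss (grows without bound as h -> 0).
print(logistic_loss(1, 0.01))
# Label 0 uses the -log(1-h) branch symmetrically.
print(logistic_loss(0, 0.99))
```

The y and (1-y) factors simply switch on whichever of the two -log terms applies to the sample's label, which is why the compact form equals the two-case definition.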

Gradient descent is used to find the minimum of a function's value, while Newton's method is used to find where a function's value equals zero; at first glance their purposes differ. But observe that if the function whose zero Newton's method finds is itself the derivative of some function A, then Newton's method is in effect finding the minimum of A (or possibly its maximum), so the two methods are similar in nature. The parameter update of Newton's method can also be written in vector form, involving the Hessian matrix and a vector of first partial derivatives.
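A minimal one-dimensional sketch of this connection (example data mine, not the blog's): take A(w) = (w-3)², whose derivative is f(w) = 2(w-3); applying Newton's root-finding update w_new = w - f(w)/f'(w) to f locates the minimizer of A.

```python
# Newton's method finds a zero of f. If f is the derivative of A,
# the zero it finds is an extremum of A. Here A(w) = (w - 3)**2, f(w) = 2*(w - 3).
def newton(f, fprime, w0, steps=10):
    w = w0
    for _ in range(steps):
        w -= f(w) / fprime(w)   # w_new = w - f(w)/f'(w)
    return w

w_star = newton(lambda w: 2 * (w - 3), lambda w: 2.0, w0=10.0)
print(w_star)  # 3.0, the minimizer of A
```

In the multivariate case f'(w) becomes the Hessian matrix of A and f(w) the gradient vector, matching the vector form mentioned above.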

Now compare gradient descent and Newton's method. The first difference is that gradient descent requires choosing a learning rate, while Newton's method needs no such parameter. The second is that gradient descent needs many iterations to find the minimum, while Newton's method needs only a few. However, each gradient-descent iteration is cheap, with complexity O(n), whereas each Newton iteration is expensive, O(n^3). Therefore Newton's method is suitable when the number of features n is small, and gradient descent is preferable when n is relatively large; a rough boundary is around n = 1000. ( I don't understand Newton's method very well yet; more on it later. )

If a system has many input features and relatively few training samples, overfitting arises very easily. ( The trained parameters do not generalize because the training samples are too few; or, with so few samples, the statistics pick up properties that the samples happen to have but the population does not, making some parameters too large and causing the overfitting problem. ) In this case, either reduce the number of features via dimensionality reduction (or via model selection), or use regularization. Regularization is usually most effective when there are many features, each contributing only a small part to the final prediction. Because the regularization term acts on the parameters and keeps them very small, and small parameters correspond to simple hypotheses, it counters the overfitting problem well. The regularization term carries a penalty coefficient, called the regularization parameter; if this coefficient is too large, all the system's parameters may be driven close to 0, producing underfitting. In multivariate linear regression, the regularization term generally penalizes parameters 1 through n ( parameter 0 can be penalized too, but this is not common ). As the number of training samples increases, the influence of the regularization term gradually weakens, so the learned parameters tend to grow slowly. Regularization terms come in many forms, and some do not involve the number of features; examples are L2-norm regularization and L1-norm regularization. Because there are so many kinds, these are known as the common variations of regularization.

For linear regression with a regularization term, if gradient descent is used, the parameter update formula is similar to before ( the formula for parameter 0 is identical, because the regularization term does not penalize parameter 0 ); the difference is that each other parameter's update does not subtract the gradient term from the bare parameter, but first multiplies the parameter by (1 - alpha*lambda/m) and then subtracts the gradient term ( alpha is the learning rate, lambda the regularization parameter ). This factor is in many cases nearly equal to 1, so the update resembles the unregularized gradient descent seen earlier. The corresponding normal equation is also similar to before, roughly W = inv(X'*X + lambda*A)*X'*y ( the final closed-form formula for linear regression with a regularization term ), where A is a diagonal matrix whose first element is 0 and whose other diagonal elements are all 1 (in the usual regularization scheme). In this case the matrix being inverted is generally invertible, i.e. the problem is solvable even when the number of samples is less than the number of features. For logistic regression ( whose loss function contains logarithmic terms ), if gradient descent is used, the parameter update equation is likewise similar to linear regression, again multiplying by (1 - alpha*lambda/m); there is also a corresponding matrix form, which resolves the non-invertibility problem in the same way. When solving with Newton's method, the first-derivative vector changes after adding the regularization term, and the Hessian matrix gains an extra lambda/m*A term at the end, with A as before. ( That is, whether for linear or logistic regression, the cost function is convex and there is a ready-made formula for computing the model parameters. )
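The regularized normal equation inv(X'*X + lambda*A)*X'*y can be sketched in NumPy (toy data invented for illustration; note A's first diagonal entry is 0 so the intercept is not penalized):

```python
import numpy as np

# Regularized normal equation: W = inv(X'X + lambda*A) X'y,
# with A = diag(0, 1, ..., 1) so parameter 0 (the intercept) is unpenalized.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
lam = 0.1

A = np.eye(X.shape[1])
A[0, 0] = 0.0   # do not penalize the intercept term

W = np.linalg.inv(X.T @ X + lam * A) @ X.T @ y
print(W)  # slope slightly shrunk below the unregularized value of 2
```

Adding lam*A to X'*X also makes the matrix invertible in cases where X'*X alone is singular, which is the point made in the text.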

Logistic regression and multiple linear regression actually have much in common; the biggest difference is their dependent variable, and otherwise they are basically similar. For this reason the two can be grouped into the same family, the generalized linear models (GLMs). The models in this family have essentially the same form; the difference lies in the dependent variable: if it is continuous, we have multiple linear regression; if it follows a binomial distribution, logistic regression; if a Poisson distribution, Poisson regression; if a negative binomial distribution, negative binomial regression; and so on. Just be careful to distinguish their dependent variables. The dependent variable of logistic regression can be binary or multi-class, but the binary case is more common and easier to interpret, so in practice binary logistic regression is the most common.
