Deep Learning: Part One (Basic Knowledge 1)


Preface:

Recently I have been meaning to learn some deep learning theory in a slightly systematic way, using Andrew Ng's web tutorial, the UFLDL Tutorial, which is said to be easy to read and not too long. But before that, it is worth reviewing the basics of machine learning; see the page: http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning. The content is actually quite short, a few minutes per clip, and very well explained.

  Some of the terminology in the tutorial:

  Model representation:

Refers to the form in which the learned function is expressed; it can be written in matrix form.

  Vectorized implementation:

Refers to implementing a function's formulas with vectors rather than individual components.

  Feature scaling:

Refers to rescaling each feature dimension of the samples, for example so that each dimension has zero mean.

  Normal equations:

Refers to the matrix-form solution for the parameters in multivariate linear regression; this solution is called the normal equations.

  Optimization objective:

Refers to the objective function to be optimized, such as the loss function derived for logistic regression, or the regularized objective function in multivariate linear regression.

  Gradient descent, Newton's method:

These are methods for finding the minimum of the objective function.

  Common variations:

Refers to the different forms the regularization term can take.

  Some notes:

The model expression gives the functional relation between input and output; this function rests on some prior assumptions and may contain parameters. Given many training samples, we can also define an average error over the training samples, generally called the loss function. Our goal is to find the parameters in the model expression, and they are obtained by minimizing the loss function.

The loss function is generally minimized with gradient descent: first assign the parameters a random set of values, then repeatedly update them so that each update makes the loss function smaller, eventually reaching a (local) minimum. In gradient descent the objective function can be regarded as a function of the parameters: once the sample inputs and outputs are plugged in, only the parameters remain, so they become the independent variables.

Each gradient-descent step updates every parameter, and every parameter is updated in the same form: the new value is the previous value minus the learning rate times the partial derivative of the objective function with respect to that parameter (with only one parameter, this is just the derivative). Why does this work? By examining the parameters at different points one can see that this step lowers the objective value, which is exactly what we want (to approach the function's minimum). Even with a fixed (but not too large) learning rate, gradient descent converges to a local minimum, because the gradient shrinks as the minimum is approached, so its product with the fixed learning rate also shrinks.
In the linear regression problem we can use gradient descent to find the parameters of the regression equation. This method is sometimes called batch gradient descent, where "batch" means that every parameter update uses all the training samples.
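The batch update described above can be sketched as follows; this is a minimal illustration, and the function name, step count, and toy data are my own rather than the tutorial's:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1500):
    """Batch gradient descent for linear regression (illustrative sketch).

    X is assumed to already contain a leading column of ones for the
    intercept term; alpha is the learning rate.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        predictions = X @ theta                     # h(x) for every sample
        gradient = (X.T @ (predictions - y)) / m    # uses all m samples: "batch"
        theta -= alpha * gradient                   # same update form for every parameter
    return theta

# Fit y = 1 + 2x on noise-free data
X = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
y = X @ np.array([1.0, 2.0])
theta = batch_gradient_descent(X, y)
```

Each iteration updates all parameters simultaneously from the previous values, which is exactly the "previous value minus learning rate times partial derivative" rule from the notes.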

Vectorized implementation means implementing the formulas with vectors. Many variables in real problems are vectors, so writing out each component separately would be very inconvenient; one should use the vector form wherever possible. For example, the gradient-descent parameter update above can also be implemented in vector form. The vector form of a formula is simple and easy to program in MATLAB. Since gradient descent converges toward the extremum along the gradient direction, if the input dimensions have very different ranges, the contours of the objective are stretched differently along different directions, which makes convergence toward the extremum very slow. Therefore, before running gradient descent one should first apply feature scaling: typically transform each dimension of the samples to zero mean, i.e. first subtract the mean of that dimension and then divide by its range.
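One possible form of the mean-and-range scaling described here is sketched below (the example housing-style data is made up for illustration; the document itself gives no numbers):

```python
import numpy as np

def feature_scale(X):
    """Zero-mean scaling: subtract each column's mean, then divide by
    its range (max - min), so all dimensions end up on a similar scale."""
    mean = X.mean(axis=0)
    rng = X.max(axis=0) - X.min(axis=0)   # assumes no constant column (rng > 0)
    return (X - mean) / rng

# Two features with very different ranges, e.g. area and room count
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_scaled = feature_scale(X)
```

After scaling, every column has zero mean and unit range, so the contours of the objective are no longer badly stretched in one direction.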

Next is the effect of the learning rate on gradient descent. If the learning rate is too large, each iteration is likely to overshoot, diverging back and forth across the extremum, so the loss value grows instead of shrinking; in a plot of loss value versus iteration count, the curve rises. With a too-large learning rate the curve may also oscillate. If the learning rate is too small, the curve drops very slowly, and over many iterations it may barely change. So how should the value be chosen? Generally by experience: try values such as ..., 0.0001, 0.001, 0.01, 0.1, 1.0, ... and pick the one for which the loss-versus-iteration curve drops fastest.
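The overshoot-versus-slow-convergence behaviour can be seen on a toy one-parameter objective; the quadratic J(theta) = theta^2 and the specific rates below are my own choices, not from the tutorial:

```python
def descend(theta0, alpha, n_iters=20):
    """Run gradient descent on the toy objective J(theta) = theta**2
    (gradient 2*theta) and return the sequence of J values."""
    theta = theta0
    history = []
    for _ in range(n_iters):
        theta -= alpha * 2 * theta    # update: theta <- theta - alpha * dJ/dtheta
        history.append(theta ** 2)
    return history

diverging = descend(1.0, alpha=1.1)    # too large: overshoots, J keeps growing
converging = descend(1.0, alpha=0.1)   # reasonable: J shrinks monotonically
```

Plotting each `history` against the iteration index reproduces the rising versus falling curves described above.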

The same problem can be tackled with different features and different models. For features: a single "area" feature, for example, can be rewritten as two features, length and width. For models: with a polynomial fitting model, for example, you can specify the maximum power of x. All the training data is generally organized into a matrix in which each row is one training sample; this matrix is sometimes called the design matrix. Solving the parameters of the polynomial model in matrix form gives w = inv(X'*X)*X'*y; this equation is called the normal equations. Although X'*X is a square matrix, its inverse does not necessarily exist (a square matrix without an inverse is called singular). For example, when X is the single element 0, the inverse does not exist and the matrix is singular, though this example is too special. A more common case is when the number of parameters exceeds the number of training samples, which makes X'*X non-invertible. In that case one must introduce a regularization term, or remove some features (typically by dimensionality reduction, dropping strongly correlated features). In addition, when solving the normal equations in linear regression, it is not necessary to apply feature scaling to the input samples (this has a theoretical basis).
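A minimal sketch of the normal equations w = inv(X'*X)*X'*y, on made-up data; I use `np.linalg.solve` instead of forming the explicit inverse, which is numerically preferable but computes the same solution:

```python
import numpy as np

# Design matrix: each ROW is one training sample; the first column of
# ones is the intercept term.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # exactly y = 1 + 2x

# Normal equations: solve (X'X) w = X'y
w = np.linalg.solve(X.T @ X, X.T @ y)
```

No feature scaling is applied here, consistent with the note that the normal-equation solution does not require it.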

The functions above are for regression, i.e. the predicted value is continuous. If the value to predict can take only two states, yes or no, i.e. the prediction is either 0 or 1, then we have a classification problem. We then need a function that maps the raw prediction into the interval between 0 and 1; usually this is the logistic function, also called the sigmoid function. Since its output is still a continuous value, the logistic function's output is interpreted as the probability that y equals 1 given the input x.
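The mapping into (0, 1) is the standard sigmoid; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into (0, 1),
    so the output can be read as P(y = 1 | x)."""
    return 1.0 / (1.0 + np.exp(-z))
```

Large positive inputs map close to 1, large negative inputs close to 0, and 0 maps to exactly 0.5.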

A convex function is one with only a single extremum, while a non-convex function may have multiple extrema; in general we want the loss function to be convex. In the classification setting, first consider the training samples with label 1. For these, the loss should be smallest (0) when the predicted value is 1 and largest (infinite) when the predicted value is 0, so one generally uses -log(h(x)), which meets exactly these requirements. Similarly, when the training label is 0, the loss is generally -log(1-h(x)). Combining the two gives -y*log(h(x)) - (1-y)*log(1-h(x)); the result is the same as above, but the expression is more compact. A loss function of this form can be derived by maximum likelihood estimation (MLE). Gradient descent can still be used to find the optimal parameters. Deriving the parameter-update formula again requires the derivative of the loss function; curiously, the resulting partial derivative has the same form as in multivariate linear regression, except that the prediction function is no longer a plain linear function but the composition of a linear function with the sigmoid.
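The combined loss -y*log(h) - (1-y)*log(1-h) can be sketched directly; the sample values below are illustrative:

```python
import numpy as np

def logistic_loss(h, y):
    """Cross-entropy loss -y*log(h) - (1-y)*log(1-h) for one prediction
    h strictly inside (0, 1) and a label y in {0, 1}."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# The closer h is to the true label, the smaller the loss:
near = logistic_loss(0.99, 1)   # prediction close to label 1: small loss
far = logistic_loss(0.01, 1)    # prediction far from label 1: large loss
```

At y = 1 the formula reduces to -log(h), and at y = 0 to -log(1-h), matching the two cases above.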

Gradient descent finds the minimum of a function, while Newton's method finds the parameter value at which a function equals 0, so at first sight their purposes differ. But observe that when Newton's method is used to find the zero of a function that is itself the derivative of some function A, it is effectively finding an extremum of A (which may of course be a maximum instead of a minimum), so the two methods are essentially similar. The parameter update of Newton's method can also be written in vector form, involving the Hessian matrix and a vector of first derivatives.
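In one dimension the connection is easy to see; below is a sketch with a made-up objective A(x) = (x - 3)^2, minimised by applying Newton's root finding to its derivative:

```python
def newton_root(f, f_prime, x0, n_iters=20):
    """Newton's method: find x with f(x) = 0 via x <- x - f(x)/f'(x).
    Minimising A amounts to applying this to f = A' (so f' = A''),
    the 1-d analogue of the Hessian / first-derivative vector form."""
    x = x0
    for _ in range(n_iters):
        x -= f(x) / f_prime(x)
    return x

# Minimise A(x) = (x - 3)**2 by finding the root of A'(x) = 2(x - 3)
x_min = newton_root(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0)
```

Because A' is linear here, Newton's method lands on the minimiser in a single step; in general it still needs far fewer iterations than gradient descent, as the comparison below notes.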

Comparing gradient descent and Newton's method: the first difference is that gradient descent requires choosing a learning rate, while Newton's method needs no such parameter. The second difference is that gradient descent needs many iterations to reach the minimum, while Newton's method needs only a few. But each gradient-descent iteration is cheap, with complexity O(n), while each Newton iteration is expensive, O(n^3). Therefore Newton's method suits problems where the number of features n is small, and gradient descent is preferable when n is large; a rough boundary is n around 1000.

If a system has many input features but relatively few training samples, over-fitting easily arises. In that case, either reduce the number of features by dimensionality reduction (or by model selection), or use regularization. Regularization is usually most effective when there are many features, each of which contributes only a small part to the final prediction. Because the regularization term acts on the parameters and keeps them small, and small parameters correspond to simple hypotheses, it counters over-fitting well. Regularization comes with a penalty coefficient called the regularization parameter; if this coefficient is too large, all the parameters may be driven close to 0, producing under-fitting instead. In multivariate linear regression, the regularization term generally penalizes parameters 1 through n (parameter 0 can be penalized too, but this is not common). As the number of training samples grows, the influence of the regularization term decreases, so the learned parameters tend to grow slowly. Regularization terms also come in many forms; some do not depend on the number of features, such as L2-norm regularization, and there is also L1-norm regularization. Because there are so many kinds, they are also known as the common variations of regularization.

Solving linear regression with a regularization term: with gradient descent, the parameter update formula is similar to before (the formula for parameter 0 is identical, since the regularization term does not penalize it); the difference is that for the other parameters, instead of subtracting the gradient term directly from the parameter itself, the parameter is first multiplied by (1 - alpha*lambda/m) and then the rest is subtracted. This factor is in many cases very close to 1, so the update is much like the unregularized gradient descent. The normal equation is also similar to before, roughly inv(X'*X + lambda*A)*X'*y, with one extra term, where A is a diagonal matrix whose first element is 0 and whose other elements are 1 (under the usual regularization). In this case the matrix is generally invertible, i.e. the system is solvable even when the number of samples is less than the number of features. For logistic regression (where the loss function contains logarithmic terms), with gradient descent the parameter update is again similar to linear regression, with the same (1 - alpha*lambda/m) factor; the normal equation likewise gains a matrix term, which likewise resolves the non-invertibility problem. With Newton's method, the first-derivative vector changes after the regularization term is added, and the Hessian matrix gains a lambda/m*A term at the end, with A the same as above.
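A sketch of the regularized normal equation inv(X'*X + lambda*A)*X'*y, with A as described above (first diagonal entry 0 so the intercept is not penalised); the data is made up for illustration:

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Regularised normal equation: solve (X'X + lam*A) w = X'y, where A
    is the identity with its first diagonal entry zeroed so that
    parameter 0 (the intercept) is not penalised."""
    n = X.shape[1]
    A = np.eye(n)
    A[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * A, X.T @ y)

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])        # exactly y = 1 + 2x
w_plain = ridge_normal_equation(X, y, lam=0.0)
w_reg = ridge_normal_equation(X, y, lam=1.0)
```

With lam = 0 this reduces to the plain normal equations; with lam > 0 the penalised slope shrinks toward 0, which is exactly the effect the regularization term is meant to have.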

Logistic regression and multivariate linear regression actually have much in common; the biggest difference is their dependent variable, and otherwise they are basically similar. Because of this, the two regressions can be grouped into one family, the generalized linear models. The models in this family have essentially the same form; the difference lies in the dependent variable: if it is continuous, we have multivariate linear regression; if it follows a binomial distribution, logistic regression; if a Poisson distribution, Poisson regression; if a negative binomial distribution, negative binomial regression; and so on. One just has to distinguish their dependent variables. The dependent variable of logistic regression can be binary or multi-class, but the binary case is more common and easier to interpret, so the most common form in practice is binary logistic regression.

  References:

http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning

http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial


Tornadomeet. Source: http://www.cnblogs.com/tornadomeet. Welcome to reprint or share, but be sure to declare the source of the article. (Sina Weibo: Tornadomeet, welcome to exchange!)

