Deep Learning 1: Basic Knowledge (1)


 

  Preface:

Recently I have planned to systematically study some theory of deep learning, using Andrew Ng's web tutorial, the UFLDL Tutorial. It is said to be easy to follow and not too long. Before that, however, I will review the basics of machine learning through the web course at http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning. The content is quite short: each section is only a few minutes long, and it is very good.

 

  Some terminologies in the Tutorial:

  Model Representation:

This refers to the mathematical expression of the function to be learned, which can be written in matrix form.

  Vectorized implementation:

The vector (matrix) implementation of the function's expression.

  Feature Scaling:

Rescaling each dimension of the feature vector, for example so that each dimension has zero mean.

  Normal equations:

The matrix-form closed solution for the parameters of multivariate linear regression; this equation is called the normal equation.

  Optimization objective:

The objective function to be optimized, for example the loss function derived for logistic regression, or the regularized objective function of multivariate linear regression.

  Gradient Descent, Newton's method:

Both are methods for finding the minimum of the objective function.

  Common variations:

This refers to the variety of forms that regularization terms can take.

 

  Notes:

The model expression is the functional relationship between input and output; of course, this function comes with an assumed form that contains parameters. If there are many training samples, we can also define an average error function over the training samples, also called the loss function. Our goal is to find the parameters in the model expression, which is achieved by minimizing the loss function.

The gradient descent method is usually used to minimize the loss function: a set of parameter values is initialized randomly, and the parameters are then updated repeatedly so that each update reduces the loss function, until a minimum is reached. In gradient descent, the objective function can be regarded as a function of the parameters, because once the sample inputs and outputs are given, only the parameter part of the objective function remains; the parameters can then be treated as the independent variables. Gradient descent updates every parameter at each step, and every parameter is updated in the same form: the learning rate times the partial derivative of the objective function with respect to that parameter (if there is only one parameter, the ordinary derivative) is subtracted from the parameter's previous value. Why does this work? By evaluating the objective function at different parameter values, we can see that such an update lowers the objective function's value, which is what we want (to find the minimum of the function). Even if the learning rate is fixed (but not too large), gradient descent converges to a local minimum, because the gradient becomes smaller and smaller, so its product with a fixed learning rate also becomes smaller and smaller. In linear regression, we can use gradient descent to obtain the parameters of the regression equation.

This method is sometimes called batch gradient descent; here, "batch" means that all training samples are used in every parameter update.
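The update loop described above can be sketched as follows (a minimal illustration for one-variable linear regression; the data set, learning rate, and iteration count are made up for the example, not taken from the tutorial):

```python
# A minimal sketch of batch gradient descent for one-variable linear
# regression, h(x) = theta0 + theta1 * x.

def batch_gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Each iteration uses ALL training samples (hence "batch")."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        # Partial derivatives of the mean squared error loss.
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errs) / m
        grad1 = sum(e * x for e, x in zip(errs, xs)) / m
        # Every parameter is updated in the same form: previous value
        # minus the learning rate times the partial derivative.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Fit y = 2x + 1 on a tiny synthetic data set.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
t0, t1 = batch_gradient_descent(xs, ys)
```

With this small, well-scaled data set and a moderate learning rate, the parameters converge close to the true values 1 and 2.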

Vectorized implementation means implementing formulas in vector form. In practical problems many variables are vectors, and it is inconvenient to write out every component, so we should try to write formulas as vectors; for example, the parameter update formula of gradient descent above can also be written in vector form. Vector formulas are simple and are convenient when programming in Matlab. Since gradient descent converges toward the extremum along the gradient direction, if the input features have very different scales (i.e., different ranges), the contour lines of the loss function are stretched differently in different directions, which makes convergence to the extremum extremely slow. Therefore feature scaling should be performed before using gradient descent to solve for the parameters. Usually each dimension of the samples is transformed to zero mean: first subtract the mean of that dimension, then divide by the range of the variable.
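The zero-mean scaling step just described can be sketched like this (the sample matrix is illustrative; the two columns stand for features on very different scales, such as house area versus room count):

```python
# A minimal sketch of feature scaling: for each feature dimension,
# subtract the column mean, then divide by the column range (max - min).

def feature_scale(X):
    """Scale each column of the sample matrix X to zero mean."""
    n = len(X[0])
    means = [sum(row[j] for row in X) / len(X) for j in range(n)]
    ranges = [max(row[j] for row in X) - min(row[j] for row in X)
              for j in range(n)]
    return [[(row[j] - means[j]) / ranges[j] for j in range(n)]
            for row in X]

# Two features with very different ranges.
X = [[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0]]
Xs = feature_scale(X)
```

After scaling, every column has zero mean and a comparable range, so the loss contours are closer to circular and gradient descent converges faster.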

Next is the influence of the learning rate on gradient descent. If the learning rate is too large, the iteration may overshoot and oscillate on both sides of the extremum, and the value of the loss function keeps increasing instead of decreasing; in the curve of loss value versus number of iterations, the curve rises. Of course, when the learning rate is too large the curve may also keep fluctuating. If the learning rate is too small, the curve decreases very slowly, and the loss value may barely change over many iterations. What value should be chosen? This is generally based on experience, for example trying ..., 0.0001, 0.001, 0.01, 0.1, 1.0, ... and seeing which one makes the loss-versus-iterations curve drop fastest.
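The trial-and-error selection described above can be sketched as a small sweep: run a fixed number of iterations for each candidate rate and compare the resulting loss (the data set and candidate list here are illustrative):

```python
# Sketch of a learning-rate sweep for gradient descent on a toy
# one-parameter regression problem, fitting y = 2x with h(x) = theta*x.

def loss_after(alpha, iters=50):
    """Mean squared error after `iters` gradient steps at rate alpha."""
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0.0, 2.0, 4.0, 6.0]
    theta = 0.0
    for _ in range(iters):
        grad = sum((theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        theta -= alpha * grad
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Too small converges slowly; too large makes the loss grow.
for alpha in [0.001, 0.01, 0.1, 0.3, 0.6]:
    print(alpha, loss_after(alpha))
```

On this toy problem a rate around 0.1-0.3 drops the loss quickly, 0.001 barely moves, and 0.6 overshoots so the loss increases, which is exactly the rising curve described in the text.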

Different features and different models can be used for the same problem. For example, a single "area" feature can be rewritten as separate "length" and "width" features. For models, for example, when using a polynomial fitting model you can choose the maximum exponent of x. When training samples are used, all the training data is usually organized into a matrix whose rows are the training samples; such a matrix is sometimes called the "design matrix". When the parameters of a linear model are obtained in closed matrix form, theta = inv(X' * X) * X' * y, this is called the normal equation. Although X' * X is a matrix, its inverse does not necessarily exist (a matrix whose inverse does not exist is called singular). For example, if X is the single element 0, its reciprocal does not exist and it is a singular matrix; of course, this example is too special. A more common case is that the number of parameters is greater than the number of training samples, which also gives a non-invertible matrix. In that case you need either to introduce a regularization term or to remove some features (typically by dimensionality reduction, removing highly correlated features). In addition, feature scaling is not required before solving the normal equation in linear regression (this has a theoretical basis).
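For a straight-line fit the normal equation theta = inv(X' * X) * X' * y reduces to a 2x2 system, which can be sketched directly (the inverse is written out by hand; the data is illustrative):

```python
# Sketch of the normal equation for a line fit h(x) = theta0 + theta1*x,
# where each design-matrix row is [1, x].

def normal_equation_line(xs, ys):
    """Solve the 2x2 system (X'X) theta = X'y exactly."""
    m = len(xs)
    # Entries of X'X ...
    a, b = float(m), sum(xs)
    c, d = sum(xs), sum(x * x for x in xs)
    # ... and of X'y.
    e, f = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = a * d - b * c
    if det == 0:
        raise ValueError("X'X is singular (not invertible)")
    theta0 = (d * e - b * f) / det
    theta1 = (a * f - c * e) / det
    return theta0, theta1

t0, t1 = normal_equation_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Unlike gradient descent, this gives the exact minimizer in one step and needs no feature scaling, at the cost of inverting X'X.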

The functions mentioned so far are all regression, i.e., the predicted value is continuous. If the value to be predicted takes only two values, 0 or 1, this is a classification problem. In that case we need a function that maps the raw prediction into the interval from 0 to 1; generally this is the logistic (sigmoid) function. Because such function values are continuous, the output of the logistic function is interpreted as the probability that y equals 1 given x.
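The mapping itself is one line:

```python
import math

# The sigmoid (logistic) function maps any real-valued score z into
# the open interval (0, 1), so it can be read as P(y = 1 | x).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))
```

sigmoid(0) is exactly 0.5, and large positive or negative scores saturate toward 1 or 0.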

A convex function here actually refers to a function with only one extremum, while a non-convex function may have multiple extrema. In general we want the loss function to be convex. In classification, first consider the training samples whose label is 1: we want the loss function to take its minimum value (0) when the predicted value is 1, and to take its largest value (infinity) when the predicted value is 0. Therefore -log(h(x)) is generally used, which meets these requirements. Similarly, when the training label is 0, the loss is -log(1 - h(x)). Combining the two cases gives -y*log(h(x)) - (1-y)*log(1 - h(x)), which matches the case-by-case form above but is more compact. The loss function in this form can also be obtained through maximum likelihood estimation (MLE). Gradient descent can still be used to find the optimal parameter values, and deriving a parameter's iteration formula again requires the partial derivative of the loss function. Curiously, the resulting update formula has the same structure as in multivariate linear regression; the only difference is that the prediction function in one case is a plain linear function, and in the other it is the composition of a linear function with the sigmoid.
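The combined per-sample loss can be sketched directly from the formula above (argument names are illustrative):

```python
import math

# The per-sample logistic loss: -y*log(h) - (1-y)*log(1-h),
# where h is the predicted probability and y is the 0/1 label.

def logistic_loss(h, y):
    """h must lie strictly in (0, 1); y is 0 or 1."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)
```

Note how the two cases fold into one expression: for y = 1 only the -log(h) term survives, so the loss is small when h is near 1 and blows up as h approaches 0, and symmetrically for y = 0.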

The gradient descent method finds the parameter value at which the function value is smallest, while Newton's method finds the parameter value at which the function value is zero; the purposes of the two differ. However, if Newton's method is applied to find the zeros of a function f' that is itself the derivative of a function f, then Newton's method is finding the minimum of f (of course, it may also find a maximum), so in that sense the two methods serve the same purpose. Newton's method can also be expressed in vector form; the expression involves the Hessian matrix and the vector of first partial derivatives.
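The one-dimensional version of this idea is a short loop: apply the root-finding update to f', i.e., x <- x - f'(x)/f''(x) (the test function is illustrative):

```python
# Sketch of Newton's method used as a minimizer: root-finding on the
# derivative f', with f'' playing the role of the Hessian.

def newton_minimize(fprime, fsecond, x0, iters=20):
    x = x0
    for _ in range(iters):
        x -= fprime(x) / fsecond(x)
    return x

# Minimize f(x) = (x - 3)^2, so f'(x) = 2(x - 3) and f''(x) = 2.
xmin = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
```

On a quadratic, a single Newton step lands exactly on the minimum at x = 3, which illustrates why Newton's method needs so few iterations compared with gradient descent.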

Next we compare the gradient method with Newton's method. The first difference is that the gradient method requires choosing a learning rate, while Newton's method needs no parameter selection. The second is that the gradient method needs many iterations to find the minimum, while Newton's method finishes in only a few. However, each iteration of the gradient method is cheap, with complexity O(n), while each iteration of Newton's method is expensive, O(n^3). Therefore, when the number of features n is relatively small, Newton's method is suitable; when n is large, the gradient method is recommended. The threshold here is roughly n around 1000.

If the system has many input features and few training samples, overfitting easily arises. In that case, either reduce the number of features with dimensionality reduction (or model selection), or use regularization. Regularization is generally the most effective approach when there are many features, each of which contributes only a little to the final prediction. The regularization term acts on the parameters and makes the final parameters very small; when all parameters are very small, the hypothesis is a simple one, which effectively mitigates overfitting. When a parameter is regularized, there is a penalty coefficient, called the regularization parameter. If this coefficient is too large, all the system's parameters may be pushed close to 0 and the model underfits. In multivariate linear regression, the regularization term generally penalizes parameters 1 through n (some formulations also add parameter 0 to the penalty, but this is not common). As the number of training samples increases, the effect of the regularization term gradually decreases, so the learned parameters tend to grow. Regularization terms also take many forms, based on different norms of the parameter vector: for example L2-norm regularization (also called 2-norm regularization), and also L1-norm regularization. Because there are so many kinds, these are referred to as the common variations of regularization.
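An L2-regularized cost for linear regression can be sketched as the usual squared error plus the penalty term, with theta[0] left unpenalized as described above (function and variable names are illustrative):

```python
# Sketch of an L2-regularized ("ridge") cost for linear regression:
# J = (1/2m) * sum of squared errors + (lam/2m) * sum(theta_j^2, j >= 1).

def ridge_cost(theta, X, y, lam):
    m = len(X)
    sq_err = sum(
        (sum(t * x for t, x in zip(theta, row)) - yi) ** 2
        for row, yi in zip(X, y)
    ) / (2 * m)
    # theta[0] (the intercept) is excluded from the penalty.
    penalty = lam * sum(t * t for t in theta[1:]) / (2 * m)
    return sq_err + penalty

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # first column is the intercept
y = [1.0, 3.0, 5.0]
```

With lam = 0 this is the plain least-squares cost; increasing lam charges a growing price for large non-intercept parameters, which is what pushes the fit toward a simpler hypothesis.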

When gradient descent is used to solve linear regression with a regularization term, the parameter update formula is similar (the formula for parameter 0 is unchanged, since the regularization term does not penalize parameter 0). The difference for the other parameters is that instead of subtracting the update term from the parameter itself, the parameter is first multiplied by (1 - alpha * lambda / m) and then the other terms are subtracted; of course, this factor is close to 1 in many cases, so the method behaves much like unregularized gradient descent. The corresponding normal equation is also similar to the earlier one: roughly theta = inv(X' * X + lambda * A) * X' * y, with one extra term, where A is a diagonal matrix whose first element is 0 and whose other diagonal elements are 1 (in the usual regularization scheme). In this case the matrix above is generally invertible, i.e., the system can be solved even when the number of samples is smaller than the number of features. For logistic regression (where the loss function contains logarithm terms), if gradient descent is used, the parameter update equation is similar to that of linear regression: the parameter is likewise multiplied by (1 - alpha * lambda / m). In Newton's method, the first-derivative vector changes with the regularization term, and the Hessian matrix must also have lambda / m * A added, with A the same as before; this added matrix also resolves the non-invertibility problem.
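For a line fit, the regularized normal equation theta = inv(X'X + lambda*A) X'y differs from the unregularized one only in one diagonal entry, since A = diag(0, 1) leaves the intercept unpenalized. A hand-solved 2x2 sketch (illustrative data):

```python
# Sketch of the regularized normal equation for rows [1, x]:
# theta = inv(X'X + lam*A) X'y, A = diag(0, 1), so only the
# bottom-right entry of X'X gains the lam term.

def regularized_normal_line(xs, ys, lam):
    m = len(xs)
    a, b = float(m), sum(xs)
    c, d = sum(xs), sum(x * x for x in xs) + lam   # + lam*A[1][1]
    e, f = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = a * d - b * c
    theta0 = (d * e - b * f) / det
    theta1 = (a * f - c * e) / det
    return theta0, theta1
```

With lam = 0 this reduces to the ordinary normal equation; as lam grows, the slope theta1 shrinks toward 0 while the intercept theta0 is untouched by the penalty, and the added diagonal term keeps the matrix invertible.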

In fact, logistic regression and multivariate linear regression have much in common; the biggest difference is that their dependent variables differ, while the rest is similar. The two kinds of regression therefore belong to the same family, the generalized linear model (GLM). The model forms in this family are basically the same; the difference lies in the dependent variable. If it is continuous, we have multivariate linear regression; if it follows a binomial distribution, logistic regression; if a Poisson distribution, Poisson regression; if a negative binomial distribution, negative binomial regression; and so on. You only need to distinguish their dependent variables. The dependent variable of logistic regression can be binary or multi-class, but binary classification is more common and easier to interpret, so binary logistic regression is the most common in practice.

 

  References:

http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning

http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

 

 

 

 
