Deep Learning - Optimization Notes

Source: Internet
Author: User
Tags: svm

Translator: Redot
Link: https://zhuanlan.zhihu.com/p/21360434
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.

Translator's note: This is the first part of the Optimization Note, translated from the Stanford CS231n course notes with the authorization of the course instructor, Andrej Karpathy. The translation was completed by Redot and proofread and revised by Kun Kun and Li Yiying. The note contains formulas and code, so reading on a PC is recommended. The translation follows.

Table of Contents:

    • Introduction
    • Visualizing the loss function
    • Optimization
      • Strategy #1: Random Search
      • Strategy #2: Random Local Search
      • Strategy #3: Following the Gradient (translator's note: part 1 of this note ends here)
    • Computing the gradient
      • Numerically, with finite differences
      • Analytically, with calculus
    • Gradient descent
    • Summary
Introduction

In the previous section, we covered two key parts of the image classification task:

    1. A parameterized score function that maps raw image pixels to class scores (for example, a linear function).
    2. A loss function that measures the quality of a particular set of parameters, based on how well the computed scores agree with the ground-truth labels of the training data. The loss function comes in several versions and implementations (for example, Softmax or SVM).

In the previous section, the linear score function had the form f(x_i, W) = W x_i, and the SVM loss was formulated as:

L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \lambda R(W)
We saw that a setting of the parameters W whose predictions agree with the ground-truth labels of the training examples also yields a very low loss. We now introduce the third and final key component: optimization. Optimization is the process of finding the set of parameters W that minimizes the loss function.
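As a concrete sketch of the multiclass SVM data loss just described (numpy; the function and variable names here are illustrative, not from the course code):

```python
import numpy as np

def svm_loss(W, X, y, delta=1.0):
    """Unregularized multiclass SVM data loss (illustrative names).

    W: weights of shape (C, D); X: data of shape (D, N), one column per
    example; y: integer labels of shape (N,).
    """
    num_examples = X.shape[1]
    scores = W.dot(X)                                  # (C, N) class scores
    correct = scores[y, np.arange(num_examples)]       # score of each true class
    margins = np.maximum(0, scores - correct + delta)  # hinge on every class
    margins[y, np.arange(num_examples)] = 0            # skip the j == y_i terms
    return margins.sum() / num_examples                # average data loss
```

Minimizing this scalar over W is exactly the optimization problem developed in the rest of this note.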

Foreshadowing: once we understand how these three components interact, we will return to the first component (the parameterized function mapping) and extend it to functions far more complex than a linear mapping: first entire neural networks, then convolutional neural networks. The loss function and the optimization process will remain relatively unchanged.

Visualizing the loss function

The loss functions discussed in this class are usually defined over very high-dimensional spaces (for example, in CIFAR-10 the weight matrix of a linear classifier has size [10 x 3073], for a total of 30,730 parameters), which makes them difficult to visualize. However, we can still gain some visual intuition by slicing the high-dimensional space along one or two dimensions. For example, we can generate a random weight matrix W, which corresponds to a single point in that space, and then record how the loss changes along one direction: generate a random direction W_1 and evaluate the loss along it as L(W + a W_1) for different values of a. This process produces a plot with a on the x-axis and the loss value on the y-axis. The same procedure works in two dimensions by evaluating L(W + a W_1 + b W_2) as a and b vary. In the resulting image, the x and y axes correspond to a and b, and the value of the loss function is encoded by color:
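The one-dimensional slicing procedure can be sketched in code as follows (the helper name is made up, and a simple quadratic stands in for the real SVM loss; in practice the (a, loss) pairs would be passed to a plotting library):

```python
import numpy as np

def loss_slice_1d(L, W, num_points=21, scale=1.0):
    """Sample L(W + a * W_1) along one random direction W_1.

    L is any scalar loss function of the weights; returns the sampled
    values of a and the corresponding losses, ready to be plotted.
    """
    W1 = np.random.randn(*W.shape)              # random direction in weight space
    a_values = np.linspace(-scale, scale, num_points)
    losses = [L(W + a * W1) for a in a_values]
    return a_values, losses

# Example with a quadratic "loss" standing in for the SVM loss:
a_vals, losses = loss_slice_1d(lambda W: float(np.sum(W ** 2)), np.zeros((10, 3073)))
```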


————————————————————————————————————————

Loss landscape of a multiclass SVM (without regularization) for a single example (left and middle) and for 100 examples from CIFAR-10 (right). Left: one-dimensional loss as a varies along one direction. Middle and right: two-dimensional slices of the loss; blue regions indicate low loss and red regions indicate high loss. Notice the piecewise-linear structure of the loss function. The loss over multiple examples is the average over the population, so the bowl shape on the right is an average of many piecewise-linear bowls (such as the one in the middle).


—————————————————————————————————————————

We can explain the piecewise-linear structure of the loss function mathematically. For a single example, the loss is:

L_i = \sum_{j \neq y_i} \left[ \max(0, w_j^T x_i - w_{y_i}^T x_i + 1) \right]

It is clear from the formula that the data loss of each example is a sum of linear functions of W, each thresholded at zero by the max(0, -) operation. Each row of W (that is, each w_j) sometimes appears with a positive sign (when it corresponds to a wrong class for the example) and sometimes with a negative sign (when it corresponds to the example's correct class). To make this concrete, consider a simple dataset containing three one-dimensional points and three classes. The full (unregularized) SVM loss then becomes:

L_0 = \max(0, w_1^T x_0 - w_0^T x_0 + 1) + \max(0, w_2^T x_0 - w_0^T x_0 + 1)
L_1 = \max(0, w_0^T x_1 - w_1^T x_1 + 1) + \max(0, w_2^T x_1 - w_1^T x_1 + 1)
L_2 = \max(0, w_0^T x_2 - w_2^T x_2 + 1) + \max(0, w_1^T x_2 - w_2^T x_2 + 1)
L = (L_0 + L_1 + L_2) / 3
Since these examples are one-dimensional, the data x_i and weights w_j are single numbers. Looking at, for example, w_0, some of the terms above are linear functions of w_0, and each is clamped at zero by the max. The loss can be visualized as follows:
——————————————————————————————————————


A one-dimensional illustration of the data loss. The x-axis is a single weight and the y-axis is the loss value. The data loss is a sum of several terms, each of which is either independent of that weight or a linear function of it thresholded at zero. The full SVM data loss is a 30,730-dimensional version of this shape.

——————————————————————————————————————
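The three-point, one-dimensional example can also be checked numerically (the weight and data values below are made up for illustration; each max(0, ...) term is piecewise linear in the scalar weights):

```python
def svm_loss_1d(w0, w1, w2, xs, ys):
    """Full unregularized SVM loss for 1-D data with 3 classes (scalar weights)."""
    w = [w0, w1, w2]
    total = 0.0
    for x, y in zip(xs, ys):
        for j in range(3):
            if j != y:  # skip the correct class of each example
                total += max(0.0, w[j] * x - w[y] * x + 1.0)
    return total / len(xs)

# Made-up data: three 1-D points, one from each of the three classes.
loss = svm_loss_1d(0.5, -0.5, 0.0, [1.0, -1.0, 0.5], [0, 1, 2])
```

Varying any single weight while the others are held fixed traces out exactly the kind of piecewise-linear curve shown in the figure above.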

As an aside, you may have guessed from its bowl-shaped appearance that the SVM loss is a convex function. There is a large body of literature on minimizing convex functions efficiently, and you can also take the Stanford course on the topic (convex optimization). But once we extend our score function to a neural network, the objective is no longer convex, and its landscape is not a bowl but a rugged, complex terrain.

Non-differentiable loss functions. As a technical note, the kinks introduced by the max operation make the loss function non-differentiable at some points, because the gradient is undefined there. However, the subgradient still exists and is commonly used instead. In this class we will use the terms subgradient and gradient interchangeably.

Optimization

To reiterate: the loss function quantifies the quality of a particular set of weights W. The goal of optimization is to find a W that minimizes the loss. We will now motivate and slowly develop an approach to optimizing the loss function. For experienced readers, this section may seem odd because the working example we use (the SVM loss) is a convex problem. But keep in mind that our ultimate goal is to optimize neural networks, where convex-optimization techniques cannot easily be applied.

Strategy #1: A first very bad idea: Random Search

Since checking whether a given set of parameters W is good or bad is quite simple, the first (very bad) idea that may come to mind is to try out many different random weights and keep track of whichever works best. The procedure is as follows:

# assume X_train is the data, where each column is an example (e.g. 3073 x 50,000)
# assume Y_train are the labels (e.g. a 1D array of 50,000)
# assume the function L evaluates the loss function

bestloss = float("inf")  # Python assigns the highest possible float value
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001  # generate random parameters
    loss = L(X_train, Y_train, W)  # get the loss over the entire training set
    if loss < bestloss:  # keep track of the best solution
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))

# prints:
# in attempt 0 the loss was 9.401632, best 9.401632
# in attempt 1 the loss was 8.959668, best 8.959668
# in attempt 2 the loss was 9.044034, best 8.959668
# in attempt 3 the loss was 9.278948, best 8.959668
# in attempt 4 the loss was 8.857370, best 8.857370
# in attempt 5 the loss was 8.943151, best 8.857370
# in attempt 6 the loss was 8.605604, best 8.605604
# ... (truncated: continues for 1000 lines)

In the code above, we try out several randomly generated weight matrices W, some of which yield lower losses than others. We can take the best W found by this random search and try it on the test set:

# assume X_test is [3073 x 10000], Y_test is [10000 x 1]
scores = Wbest.dot(Xte_cols)  # 10 x 10000, the class scores for all test examples
# find the index with the highest score in each column (i.e. the predicted class)
Yte_predict = np.argmax(scores, axis=0)
# and calculate the accuracy (fraction of predictions that are correct)
np.mean(Yte_predict == Yte)
# returns 0.1555

The best W found by this search achieves a test-set accuracy of 15.5%, whereas a completely random guess would score about 10%. That is not a bad result for such a brainless strategy.

Core idea: iterative refinement. Of course, we can do much better. The key insight is that while finding the best set of weights W directly is very difficult or even impossible (especially once W contains the weights of an entire neural network), the problem of refining a given W so that its loss decreases slightly is far easier. In other words, our approach is to start with a random W and then iteratively refine it, making the loss a little smaller each time.

Our strategy is to start with random weights and then iterate to get a lower loss value.

The blindfolded-hiker analogy: one helpful analogy is to think of yourself as a hiker walking on hilly terrain with a blindfold on, trying to slowly make your way to the bottom of the mountain. In the case of CIFAR-10, the hills are 30,730-dimensional (since W has size 10 x 3073). At every point on the hill we achieve a particular loss value, which can be thought of as the altitude at that point.

Strategy #2: Random Local Search

The first strategy can be thought of as trying a step in a random direction each time and only taking it if it leads downhill. This time, we start with a random W, generate a random perturbation of it, and update only if the perturbed weights give a lower loss. The code for this procedure is as follows:


W = np.random.randn(10, 3073) * 0.001  # generate a random starting W
bestloss = float("inf")
for i in range(1000):
    step_size = 0.0001
    Wtry = W + np.random.randn(10, 3073) * step_size
    loss = L(Xtr_cols, Ytr, Wtry)
    if loss < bestloss:
        W = Wtry
        bestloss = loss
    print('iter %d loss is %f' % (i, bestloss))

Using the same number of loss-function evaluations (1000), this approach achieves a test-set classification accuracy of 21.4%. This is better than Strategy #1, but still wasteful of computation.

Strategy #3: Following the Gradient

In the previous two strategies, we tried to find a direction in weight space that lowers the loss. It turns out there is no need to search for this direction at random: we can compute the best direction directly, the one that is mathematically guaranteed to be the direction of steepest descent (at least in the limit of a small step size). This direction is the gradient of the loss function. In the blindfolded-hiker analogy, this approach is like feeling the slope of the hill beneath our feet and stepping in the direction of steepest descent.

For one-dimensional functions, the slope is the instantaneous rate of change of the function at a point. The gradient generalizes the slope to functions that take a vector of inputs rather than a single number: it is not a single value but a vector. Concretely, the gradient is the vector of slopes (commonly called derivatives) along each dimension of the input space. The derivative of a one-dimensional function is defined as:

\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}
When the function of interest takes a vector of inputs rather than a single number, we call the derivatives partial derivatives; the gradient is simply the vector of partial derivatives along each dimension.
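Anticipating the numerical gradient computation listed in the table of contents, the limit definition above suggests a finite-difference approximation (a sketch with illustrative names; a centered difference is used for better accuracy than the one-sided formula):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar function f at x.

    For each dimension, (f(x + h*e_i) - f(x - h*e_i)) / (2h) approximates
    the partial derivative along the i-th coordinate direction e_i.
    """
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        fxph = f(x)                          # f(x + h) along this dimension
        x[idx] = old - h
        fxmh = f(x)                          # f(x - h) along this dimension
        x[idx] = old                         # restore the original value
        grad[idx] = (fxph - fxmh) / (2.0 * h)
        it.iternext()
    return grad
```

Note that this requires one pair of loss evaluations per parameter, which is why, for the 30,730 parameters of the CIFAR-10 linear classifier, the analytic gradient discussed in part 2 is preferred in practice.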


Optimization Note (part 1) ends here.



Translator's feedback:
      1. Reprints must reproduce the full text and note the original link; otherwise all rights are reserved.
      2. Corrections via comments and private messages are welcome and contributors will be credited.

