Optimization: Stochastic Gradient Descent


Original address: http://cs231n.github.io/optimization-1/


############################################################################


Table of Contents:

1. Introduction

2. Visualizing the loss function

3. Optimization

3.1. Strategy #1: Random Search

3.2. Strategy #2: Random Local Search

3.3. Strategy #3: Following the Gradient

4. Computing the gradient

4.1. Numerically with finite differences

4.2. Analytically with calculus

5. Gradient Descent

6. Summary


############################################################################


Introduction

In the previous section, we covered two key parts of the image classification task:

1. A (parameterized) score function that maps raw image pixels to class scores (e.g. a linear function)

2. A loss function that measures how well a given set of parameters agrees with the ground-truth labels in the training data. We saw several versions of this (e.g. Softmax/SVM)


Concretely, recall that the linear score function had the form

$$ f(x_i; W) = W x_i $$

and that the SVM loss was formulated as:

$$ L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \alpha R(W) $$
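
The code examples below repeatedly call a loss function L(X, Y, W); as a reference point, here is a minimal sketch of what such a function could look like for the (unregularized) SVM data loss, assuming the column-wise data layout used in those examples:

# a minimal sketch of the SVM data loss used by the examples below
# (assumptions: X holds one example per column, W holds one class per row,
#  and regularization is omitted for simplicity)
import numpy as np

def L(X, Y, W, delta=1.0):
  num_examples = X.shape[1]
  scores = W.dot(X)                                  # class scores, one column per example
  correct = scores[Y, np.arange(num_examples)]       # score of each example's correct class
  margins = np.maximum(0, scores - correct + delta)  # hinge loss margins
  margins[Y, np.arange(num_examples)] = 0            # the correct class contributes no loss
  return np.sum(margins) / num_examples              # average data loss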


If the parameters W are set well, the predictions computed for the training examples will agree with the ground-truth labels and the loss will be very low. We are now going to introduce the third and last key component: optimization. Optimization is the process of finding the set of parameters W that minimizes the loss function.


Foreshadowing (preview): Once we understand how these three core components interact, we will revisit the first component (the parameterized function mapping) and extend it to functions much more complicated than a linear mapping: first entire neural networks, then convolutional neural networks. The loss functions and the optimization process will remain relatively unchanged.


Visualizing the loss function

The loss functions we work with in this section are usually defined over very high-dimensional spaces (e.g. in CIFAR-10 a linear classifier's weight matrix is of size [10 x 3073], for a total of 30,730 parameters), making them difficult to visualize. However, we can still gain some intuition by slicing the high-dimensional space along a ray (1 dimension) or along a plane (2 dimensions). For example, we can generate a random weight matrix W (corresponding to a single point in the space), pick a random direction W_1, and record the loss along that ray. That is, we evaluate L(W + a W_1) for different values of a, which gives a plot with a on the x-axis and the loss value on the y-axis. We can carry out the same procedure in two dimensions by evaluating L(W + a W_1 + b W_2) as we vary a and b; with a and b along the x-axis and y-axis, the loss value can then be visualized with a color.
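
As an illustration, the 1-D slice could be produced with a short script like the following sketch (L, X_train and Y_train follow the conventions of this section's code; the range of a and the plotting details are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

W = np.random.randn(10, 3073) * 0.0001   # a random point in weight space
W1 = np.random.randn(10, 3073)           # a random direction

a_values = np.linspace(-1.0, 1.0, 100)   # how far to move along the direction
losses = [L(X_train, Y_train, W + a * W1) for a in a_values]

plt.plot(a_values, losses)               # the loss along the ray through W
plt.xlabel('a')
plt.ylabel('loss')
plt.show()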


We can explain the piecewise-linear structure of the loss function by examining the math. For a single example we have:

$$ L_i = \sum_{j \neq y_i} \left[ \max(0, w_j^T x_i - w_{y_i}^T x_i + 1) \right] $$


It is clear from the equation that the data loss for each example is a sum of linear functions of W, zero-thresholded due to the max(0, -) function. Moreover, each row of W (i.e. w_j) sometimes appears with a positive sign (when it corresponds to a wrong class for the example) and sometimes with a negative sign (when it corresponds to the example's correct class). To make this more explicit, consider a simple dataset that contains three 1-D points and three classes. The full SVM loss (without regularization) becomes:

$$ L_0 = \max(0, w_1^T x_0 - w_0^T x_0 + 1) + \max(0, w_2^T x_0 - w_0^T x_0 + 1) $$
$$ L_1 = \max(0, w_0^T x_1 - w_1^T x_1 + 1) + \max(0, w_2^T x_1 - w_1^T x_1 + 1) $$
$$ L_2 = \max(0, w_0^T x_2 - w_2^T x_2 + 1) + \max(0, w_1^T x_2 - w_2^T x_2 + 1) $$
$$ L = (L_0 + L_1 + L_2) / 3 $$


Since these examples are 1-dimensional, the data x_i and the weights w_j are single numbers. Looking at, for instance, w_0, some of the terms above are linear functions of w_0, and each is clamped at zero; plotted, the total loss is a bowl-shaped, piecewise-linear function of w_0.
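
To make the piecewise-linear terms tangible, the three-point loss written out above can be evaluated directly. The data, labels and weights in this sketch are made up purely for illustration:

# evaluating the three-point, 1-D SVM loss written out above
x = [1.0, -2.0, 3.0]   # three 1-D data points (hypothetical values)
y = [0, 1, 2]          # their correct classes
w = [0.1, -0.5, 0.3]   # one 1-D weight per class (hypothetical values)

def L_i(i):
  # sum of hinge terms over all wrong classes j != y_i
  return sum(max(0, w[j] * x[i] - w[y[i]] * x[i] + 1)
             for j in range(3) if j != y[i])

L_total = (L_i(0) + L_i(1) + L_i(2)) / 3.0  # the full (unregularized) SVM loss
print 'loss: %f' % L_total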


As an aside, you may have guessed from its bowl shape that the SVM loss function is an example of a convex function. There is a large body of literature devoted to efficiently minimizing these types of functions, and you can also take a Stanford class on the topic (convex optimization). Once we extend our score functions to neural networks, our loss functions will no longer be convex, and their visualizations are not bowls but complex, bumpy landscapes.


Non-differentiable loss functions. As a technical note, the kinks in the loss function (due to the max operation) make the loss function non-differentiable, because the gradient is not defined at these kinks. However, the subgradient still exists and is commonly used instead. In this section we will use the terms subgradient and gradient interchangeably.


Optimization

To reiterate, the loss function quantifies the quality of any particular set of weights W. The goal of optimization is to find the W that minimizes the loss function. We will now motivate and slowly develop an approach to optimizing the loss function. This section may seem odd to readers who have studied optimization before, because the working example we use (the SVM loss) is a convex problem; but keep in mind that our goal is ultimately to optimize neural networks, which cannot easily be handled with the tools developed in convex optimization theory.


Strategy #1: A first very bad idea solution: random search

Since it is so simple to check how good a given set of weights is, the first (very bad) idea that may come to mind is to try out many different random weights and keep track of what works best. This procedure looks as follows:

# assume X_train is the data where each column is an example (e.g. 3073 x 50,000)
# assume Y_train are the labels (e.g. 1D array of 50,000)
# assume the function L evaluates the loss function

bestloss = float("inf")  # Python assigns the highest possible float value
for num in xrange(1000):
  W = np.random.randn(10, 3073) * 0.0001  # generate random parameters
  loss = L(X_train, Y_train, W)  # get the loss over the entire training set
  if loss < bestloss:  # keep track of the best solution
    bestloss = loss
    bestW = W
  print 'in attempt %d the loss is %f, best %f' % (num, loss, bestloss)

# prints:
# in attempt 0 the loss is 9.401632, best 9.401632
# in attempt 1 the loss is 8.959668, best 8.959668
# in attempt 2 the loss is 9.044034, best 8.959668
# in attempt 3 the loss is 9.278948, best 8.959668
# in attempt 4 the loss is 8.857370, best 8.857370
# in attempt 5 the loss is 8.943151, best 8.857370
# in attempt 6 the loss is 8.605604, best 8.605604
# ... (truncated: continues)
In the code above we try out 1,000 random weight matrices W, some of which work better than others. We take the best weights W found by this search and try them on the test set:

# assume X_test is [3073 x 10000], Y_test [10000 x 1]
scores = Wbest.dot(Xte_cols)  # 10 x 10000, the class scores for all test examples
# find the index with max score in each column (the predicted class)
Yte_predict = np.argmax(scores, axis = 0)
# and calculate accuracy (fraction of predictions that are correct)
np.mean(Yte_predict == Yte)  # returns 0.1555
With the best W found this way, the test-set accuracy is about 15.5%. Given that guessing classes completely at random achieves only 10%, that is not a very bad outcome for such a brain-dead random-search solution!


Core idea: iterative refinement. It turns out that we can do better. The core idea is that finding the best set of weights in one shot is a very difficult or even impossible problem (especially once W contains the weights of an entire complex neural network), but refining a given set of weights so that it is slightly better is significantly less difficult. In other words, our approach is to start with random weights and iteratively refine them over time, getting a slightly better result each step.


Blindfolded hiker analogy. A helpful analogy is to think of yourself as a hiker on hilly terrain, blindfolded, trying to reach the bottom. In the CIFAR-10 example, since the dimensions of W are 10 x 3073, the hills live in a 30,730-dimensional space. At every point on the hill we achieve a particular loss (the height of the terrain).

Strategy #2: Random Local Search

The first strategy you may think of is to try to extend one foot in a random direction and take a step only if it leads downhill. Concretely, we start with a random W, generate a random perturbation of it, and if the loss at the perturbed weights is lower, we perform an update. The code is as follows:

W = np.random.randn(10, 3073) * 0.001  # generate random starting W
bestloss = float("inf")
for i in xrange(1000):
  step_size = 0.0001
  Wtry = W + np.random.randn(10, 3073) * step_size
  loss = L(Xtr_cols, Ytr, Wtry)
  if loss < bestloss:
    W = Wtry
    bestloss = loss
  print 'iter %d loss is %f' % (i, bestloss)
Using the same number of loss-function evaluations as before (1000), this approach achieves a test-set classification accuracy of 21.4%. This is better, but still wasteful and computationally expensive.

Strategy #3: Following the Gradient

In the previous section we tried to find a direction in the weight space that improves our weight vector (and gives a smaller loss). It turns out that there is no need to search for a good direction at random: we can compute the best direction, which is mathematically guaranteed to be the direction of steepest descent (at least in the limit as the step size goes to zero). This direction is the gradient of the loss function. In our hiking analogy, this approach roughly corresponds to feeling the slope of the hill below our feet and stepping down in the direction that feels steepest.


In one-dimensional functions, the slope is the instantaneous rate of change of the function at any point of interest. The gradient is a generalization of slope for functions that take a vector of numbers instead of a single number. Concretely, the gradient is just a vector of slopes (more commonly called derivatives), one for each dimension of the input space. The mathematical expression for the derivative of a 1-D function is:

$$ \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} $$


When the functions of interest take a vector of numbers rather than a single number, we call these derivatives partial derivatives, and the gradient is simply the vector of partial derivatives along each dimension.
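
As a quick worked example, for f(x) = x^2 the derivative is 2x, and the finite-difference formula above can be checked against it directly (the function and the evaluation point here are arbitrary choices):

def f(x):
  return x ** 2

x, h = 3.0, 1e-5
numeric = (f(x + h) - f(x)) / h  # finite-difference approximation of df/dx
analytic = 2 * x                 # the exact derivative from calculus
print 'numeric: %f, analytic: %f' % (numeric, analytic)  # both close to 6.0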


Computing the gradient

There are two ways to compute the gradient: a slow, approximate, but very easy way (the numerical gradient), and a fast, exact, but more error-prone way that requires calculus (the analytic gradient).


Computing the gradient numerically with finite differences

The formula given above allows us to compute the gradient numerically. Here is a generic function that takes a function f and a vector x and evaluates the gradient of f at x:

def eval_numerical_gradient(f, x):
  """
  a naive implementation of numerical gradient of f at x
  - f should be a function that takes a single argument
  - x is the point (numpy array) to evaluate the gradient at
  """

  fx = f(x)  # evaluate function value at original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

    # evaluate function at x+h
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h  # increment by h
    fxh = f(x)  # evaluate f(x + h)
    x[ix] = old_value  # restore to previous value (very important!)

    # compute the partial derivative
    grad[ix] = (fxh - fx) / h  # the slope
    it.iternext()  # step to next dimension

  return grad

Following the limit formula given above, the code iterates over all dimensions of the input vector x, makes a small change h along each one in turn, and computes that partial derivative of the loss function from how much the function value changed. The variable grad holds the full gradient in the end.


Practical considerations. In the mathematical formulation the gradient is defined in the limit as h goes to zero, but in practice it is often sufficient to use a very small value (such as the 1e-5 seen in the code above). Ideally, you want to use the smallest step size that does not lead to numerical issues. Additionally, in practice it often works better to compute the numerical gradient using the centered difference formula: [f(x+h) - f(x-h)] / 2h. See the wiki page on numerical differentiation for details.
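
For illustration, here is a minimal sketch of a centered-difference variant of eval_numerical_gradient; the loop structure is unchanged, only the inner computation differs:

import numpy as np

def eval_numerical_gradient_centered(f, x, h=1e-5):
  """
  a sketch of the centered-difference variant:
  (f(x+h) - f(x-h)) / 2h along each dimension
  """
  grad = np.zeros(x.shape)
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h
    fxph = f(x)  # evaluate f(x + h)
    x[ix] = old_value - h
    fxmh = f(x)  # evaluate f(x - h)
    x[ix] = old_value  # restore to previous value
    grad[ix] = (fxph - fxmh) / (2 * h)  # centered difference slope
    it.iternext()
  return grad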


We can use the function above to compute the gradient of any function at any point. Below we compute the gradient of the CIFAR-10 loss function at a random point in the weight space:

# to use the generic code above we want a function that takes a single argument
# (the weights in our case) so we close over X_train and Y_train
def CIFAR10_loss_fun(W):
  return L(X_train, Y_train, W)

W = np.random.rand(10, 3073) * 0.001  # random weight vector
df = eval_numerical_gradient(CIFAR10_loss_fun, W)  # get the gradient
The gradient tells us the slope of the loss function along every dimension, which we can use to perform an update:

loss_original = CIFAR10_loss_fun(W)  # the original loss
print 'original loss: %f' % (loss_original, )

# lets see the effect of multiple step sizes
for step_size_log in [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]:
  step_size = 10 ** step_size_log
  W_new = W - step_size * df  # new position in the weight space
  loss_new = CIFAR10_loss_fun(W_new)
  print 'for step size %f new loss: %f' % (step_size, loss_new)

# prints:
# original loss: 2.200718
# for step size 1.000000e-10 new loss: 2.200652
# for step size 1.000000e-09 new loss: 2.200057
# for step size 1.000000e-08 new loss: 2.194116
# for step size 1.000000e-07 new loss: 2.135493
# for step size 1.000000e-06 new loss: 1.647802
# for step size 1.000000e-05 new loss: 2.844355
# for step size 1.000000e-04 new loss: 25.558142
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036
Update in negative gradient direction. In the code above, notice that to compute W_new we make an update in the negative direction of the gradient df, since we want the loss function to decrease, not increase.


Effect of step size. The gradient tells us the direction in which the function has the steepest rate of increase, but it does not tell us how far along this direction we should step. As we will see later in the course, choosing the step size (also called the learning rate) will become one of the most important hyperparameter settings in training a neural network. In our blindfolded hill-descent analogy, we can feel the hill below our feet sloping in some direction, but the step length we should take is uncertain. If we shuffle our feet carefully we can expect to make consistent but very small progress (this corresponds to a small step size). Conversely, we can choose to make a large, confident step in an attempt to descend faster, but this may not pay off. As you can see in the code example above, at some point taking a bigger step gives a higher loss as we "overstep".


A problem of efficiency. You may have noticed that evaluating the numerical gradient has complexity linear in the number of parameters. In our example we had 30,730 parameters in total, so we had to perform 30,731 evaluations of the loss function to compute the gradient and perform just a single parameter update. This problem only gets worse, since modern neural networks can easily have millions of parameters. Clearly, this strategy does not scale, and we need something better.


Computing the gradient analytically with calculus

The numerical gradient is very simple to compute using the finite-difference approximation, but the downside is that it is approximate (since we pick a small value of h, while the true gradient is defined in the limit as h goes to zero) and very computationally expensive. The second way to compute the gradient is analytically, using calculus, which lets us derive a direct formula for the gradient (no approximations) that is also very fast to evaluate. However, unlike the numerical gradient, it is more error-prone to implement, so in practice it is very common to compute the analytic gradient and compare it to the numerical gradient to check the correctness of the implementation. This is called a gradient check.
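
As an illustration, a gradient check could look like the following minimal sketch; the relative-error formula is standard practice, though the 1e-7 threshold here is only a rough rule of thumb:

import numpy as np

def gradient_check(analytic_grad, numerical_grad, threshold=1e-7):
  # elementwise relative error |a - n| / max(|a|, |n|), guarded against 0/0
  numerator = np.abs(analytic_grad - numerical_grad)
  denominator = np.maximum(np.abs(analytic_grad), np.abs(numerical_grad))
  rel_error = np.max(numerator / np.maximum(denominator, 1e-12))
  print 'max relative error: %e' % rel_error
  return rel_error < threshold  # roughly: smaller is better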


The SVM loss function for a single data point is:

$$ L_i = \sum_{j \neq y_i} \left[ \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta) \right] $$


We can differentiate this function with respect to the weights. For example, taking the gradient with respect to w_{y_i} we obtain:

$$ \nabla_{w_{y_i}} L_i = - \left( \sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i $$


where 1(.) is the indicator function, which is one if the condition inside is true and zero otherwise. While the expression may look scary when written out, when you implement it in code you simply count the number of classes that did not meet the desired margin (and hence contributed to the loss function), and the data vector x_i scaled by this count is the gradient. Notice that this is the gradient only with respect to the row of W that corresponds to the correct class. For the other rows, where j != y_i, the gradient is:

$$ \nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \, x_i $$


Once you have derived the expression for the gradient, it is straightforward to implement it and use it to perform the gradient update; a sketch follows below.
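
A minimal sketch of the analytic gradient for a single data point might look as follows (the function name and the delta parameter are illustrative; x is one column of the data, W has one row per class):

import numpy as np

def svm_loss_grad_single(W, x, y, delta=1.0):
  """a sketch of the analytic gradient of the single-example SVM loss"""
  scores = W.dot(x)                     # one score per class
  margins = scores - scores[y] + delta  # margin of each class vs the correct one
  dW = np.zeros_like(W)
  num_violations = 0
  for j in range(W.shape[0]):
    if j == y:
      continue
    if margins[j] > 0:                  # indicator 1(w_j x - w_y x + delta > 0)
      dW[j] += x                        # gradient for a wrong-class row
      num_violations += 1
  dW[y] -= num_violations * x           # gradient for the correct-class row
  return dW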


Gradient descent

Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called gradient descent. Its vanilla version looks as follows:

# Vanilla Gradient Descent

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad  # perform parameter update
This simple loop is at the core of all neural network libraries. There are other ways of performing the optimization (e.g. L-BFGS), but gradient descent is currently by far the most common and established way of optimizing neural network loss functions. Throughout the class we will put some bells and whistles on the details of this loop (e.g. the exact details of the update equation), but the core idea of following the gradient until we are happy with the results will not change.


Mini-batch gradient descent. In large-scale applications (such as the ILSVRC challenge), the training data can have millions of examples. It is therefore wasteful to compute the loss function over the entire training set in order to perform a single parameter update. A very common way around this is to compute the gradient over batches of the training data. For example, in current state-of-the-art ConvNets, a typical batch contains 256 examples out of an entire training set of 1.2 million. This batch is then used to perform a parameter update:

# Vanilla Minibatch Gradient Descent

while True:
  data_batch = sample_training_data(data, 256)  # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad  # perform parameter update
The reason this works well is that the examples in the training data are correlated. To see this, consider the extreme case where all 1.2 million images in ILSVRC are in fact made up of exact duplicates of only 1,000 unique images (one for each class, i.e. 1,200 identical copies of each image). Then it is clear that the gradients we would compute for all 1,200 identical copies would be the same, and when we average the data loss over all 1.2 million images we would get the exact same loss as if we only evaluated on a small subset of 1,000. In practice, of course, datasets do not contain duplicate images, but the gradient of a mini-batch is still a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating mini-batch gradients to perform more frequent parameter updates.
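
The duplication argument is easy to verify numerically: averaging over exact copies gives the same mean as averaging over the unique examples. A toy sketch with made-up loss values:

import numpy as np

unique_losses = np.array([0.5, 2.0, 1.25])     # losses of 3 unique examples (hypothetical)
copies = np.repeat(unique_losses, 1200)        # 1200 identical copies of each
print np.mean(unique_losses), np.mean(copies)  # identical: 1.25 and 1.25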


The extreme case of this is a setting where the mini-batch contains only a single example. This process is called Stochastic Gradient Descent (SGD), or sometimes on-line gradient descent. It is less common in practice because of vectorized code optimizations: it can be computationally much more efficient to evaluate the gradient for 100 examples at once than the gradient for a single example 100 times. Although SGD technically refers to using a single example at a time to evaluate the gradient, people often use the term even when referring to mini-batch gradient descent (i.e. the terms MGD for "mini-batch gradient descent" and BGD for "batch gradient descent" are rarely used). The size of the mini-batch is a hyperparameter, but it is not very common to cross-validate over it. It is usually based on memory constraints (if any), or set to some value such as 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.
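
For contrast, true single-example SGD just shrinks the batch to one; a minimal sketch, reusing the same hypothetical helpers as above:

# true SGD: a parameter update from every single example
while True:
  example = sample_training_data(data, 1)  # a batch of size one
  weights_grad = evaluate_gradient(loss_fun, example, weights)
  weights += - step_size * weights_grad    # perform parameter update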


Summary


In this section:

1) We developed the intuition of the loss function as a high-dimensional optimization landscape in which we try to reach the bottom. The working analogy we developed is that of a blindfolded hiker who wishes to reach the bottom of a hill. In particular, we saw that the SVM loss function is piecewise-linear and bowl-shaped.

2) We motivated the idea of optimizing the loss function with iterative refinement, where we start with a random set of weights and refine them step by step until the loss value is minimized.

3) We saw that the gradient of a function gives its direction of steepest ascent, and we discussed a simple but inefficient way of computing it numerically using the finite-difference approximation (the finite difference being the value of h used in computing the numerical gradient).

4) We saw that the parameter update requires a tricky setting of the step size, or the learning rate: if it is too low, progress is steady but slow; if it is too high, progress can be faster but more risky. We will explore this tradeoff in much more detail later.

5) We discussed the tradeoff between computing the numerical gradient and the analytic gradient. The numerical gradient is simple, but it is approximate and expensive to compute. The analytic gradient is exact and fast to compute, but more error-prone since it requires deriving the gradient by hand. In practice we therefore always use the analytic gradient and then perform a gradient check, in which it is compared against the numerical gradient.

6) We introduced the gradient descent algorithm, which iteratively computes the gradient and performs a parameter update in a loop.


Coming up: The core takeaway of this section is that we can compute the gradient of a loss function with respect to its weights. In the next section we will use the chain rule to develop efficient analytic gradient computation, otherwise known as backpropagation. This will allow us to efficiently optimize relatively arbitrary loss functions that express all kinds of neural networks, including convolutional neural networks.

