From today onwards, I am formally stepping into the study of deep learning. I have touched on it before, but never systematically. This is the beginning of my CS231n course notes.
Why mini-batch gradient descent can work
http://cs231n.github.io/optimization-1/
Mini-batch gradient descent. In large-scale applications (such as the ILSVRC challenge), the training data can have on the order of millions of examples. Hence, it seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update. A very common approach to addressing this challenge is to compute the gradient over batches of the training data. For example, in current state-of-the-art ConvNets, a typical batch contains 256 examples from the entire training set of 1.2 million. This batch is then used to perform a parameter update:
# Vanilla mini-batch gradient descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad  # perform parameter update
This works because the examples in the training data are correlated. To see this, consider the extreme case where all 1.2 million images in ILSVRC are in fact made up of exact duplicates of only 1000 unique images (one for each class, or in other words 1200 identical copies of each image). Then it is clear that the gradients we would compute for all 1200 identical copies would all be the same, and when we average the data loss over all 1.2 million images we would get the exact same loss as if we evaluated on a small subset of 1000. In practice of course, the dataset would not contain duplicate images, but the gradient from a mini-batch is still a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.
The extreme case is a setting where the mini-batch contains only a single example. This process is called stochastic gradient descent (SGD) (or sometimes on-line gradient descent). This is relatively less common in practice because, due to vectorized code optimizations, it can be computationally much more efficient to evaluate the gradient for a batch of examples at once than to evaluate the gradient for one example many times. Even though SGD technically refers to using a single example at a time to evaluate the gradient, you will hear people use the term SGD even when referring to mini-batch gradient descent (i.e. mentions of MGD for "mini-batch gradient descent", or BGD for "batch gradient descent", are rare), where it is usually assumed that mini-batches are used. The size of the mini-batch is a hyperparameter, but it is not very common to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g. 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.
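As a concrete illustration, here is a minimal NumPy sketch of the sampling step assumed by the pseudocode above (the function signature and the labels array are my own illustration, not from the course code):

import numpy as np

def sample_training_data(data, labels, batch_size=256):
    # Pick batch_size random rows; each parameter update then sees a fresh mini-batch.
    idx = np.random.choice(data.shape[0], size=batch_size, replace=False)
    return data[idx], labels[idx]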
The three most commonly used gates in neural networks (add, mul, max)
http://cs231n.github.io/optimization-2/
It is interesting to note that in many cases the backward-flowing gradient can be interpreted on an intuitive level. For example, the three most commonly used gates in neural networks (add, mul, max) all have very simple interpretations in terms of how they act during backpropagation. Consider this example circuit:
Looking at the diagram above as an example, we can see:
The add gate always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass. This follows from the fact that the local gradient for the add operation is simply +1.0, so the gradients on all inputs will exactly equal the gradient on the output, because it is multiplied by 1.0 (and remains unchanged). In the example circuit above, the + gate routed the gradient of 2.00 to both of its inputs, equally and unchanged.
The max gate routes the gradient. Unlike the add gate, which distributes the gradient unchanged to all of its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the input that had the highest value during the forward pass). This is because the local gradient for a max gate is 1.0 for the highest value and 0.0 for all other values. In the example circuit above, the max operation routed the gradient of 2.00 to the z variable, which had a higher value than w, and the gradient on w remains zero.
The multiply gate is a little less easy to interpret. Its local gradients are the input values (except switched), and these are multiplied by the gradient on its output during the chain rule. In the example above, the gradient on x is -8.00, which is -4.00 x 2.00.
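To make the three rules concrete, here is a minimal hand-coded forward and backward pass. I am assuming the circuit computes f = 2 * (x*y + max(z, w)) with x = 3, y = -4, z = 2, w = -1, which is consistent with all the gradient values quoted above:

# Backprop through the add, max, and mul gates by hand
x, y, z, w = 3.0, -4.0, 2.0, -1.0

# Forward pass
q = x * y          # mul gate: -12.0
m = max(z, w)      # max gate:  2.0
s = q + m          # add gate: -10.0
f = 2.0 * s        # output:   -20.0

# Backward pass (chain rule, starting from df/df = 1.0)
ds = 2.0                     # local gradient of f = 2*s
dq = ds * 1.0                # add gate distributes the gradient equally
dm = ds * 1.0
dz = dm if z >= w else 0.0   # max gate routes the gradient to the larger input
dw = dm if w > z else 0.0
dx = dq * y                  # mul gate: local gradient is the other input
dy = dq * x

print(dx, dy, dz, dw)        # -8.0 6.0 2.0 0.0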
Unintuitive effects and their consequences. Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input. Note that in linear classifiers, where the weights are dot-producted with the inputs (w^T x_i), this implies that the scale of the data has an effect on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples x_i by 1000 during preprocessing, then the gradient on the weights would be 1000 times larger, and you would have to lower the learning rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle ways! Having an intuitive understanding of how the gradients flow can help you debug some of these cases.
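A quick numerical sanity check of this scaling effect (a sketch; the 1000x factor and variable names are only for illustration):

import numpy as np

np.random.seed(0)
x = np.random.randn(5)             # one input example
upstream = 1.0                     # gradient flowing into the score w.x
grad_w = upstream * x              # d(w.x)/dw = x
grad_w_scaled = upstream * (1000 * x)
print(np.allclose(grad_w_scaled, 1000 * grad_w))  # True: the weight gradient scales with the data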
Interpretation of the gradient
The derivative on each variable tells you the sensitivity of the whole expression to its value.
Commonly used activation functions. The advantages and disadvantages of each activation function are described more comprehensively at http://cs231n.github.io/neural-networks-1/
Every activation function (or non-linearity) takes a single number and performs a certain fixed mathematical operation on it. There are several activation functions you may encounter in practice. [Figure: Left: the sigmoid non-linearity squashes real numbers to the range [0,1]. Right: the tanh non-linearity squashes real numbers to the range [-1,1].]
Sigmoid. The sigmoid non-linearity has the mathematical form σ(x) = 1 / (1 + e^(-x)) and is shown in the image above on the left. As alluded to in the previous section, it takes a real-valued number and "squashes" it into the range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). In practice, the sigmoid non-linearity has recently fallen out of favor and is rarely ever used. It has two major drawbacks:

Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero. Recall that during backpropagation, this (local) gradient will be multiplied by the gradient of this gate's output with respect to the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron to its weights and, recursively, to its data. Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large then most neurons would become saturated and the network would barely learn.

Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a neural network (more on this soon) would be receiving data that is not zero-centered. This has implications for the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. x > 0 elementwise in f = w^T x + b), then the gradient on the weights w will during backpropagation become either all positive or all negative (depending on the gradient of the whole expression f). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data, the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences than the saturated activation problem above.
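To see the saturation numerically, here is a small sketch of the sigmoid and its gradient, using the standard identity σ'(x) = σ(x)(1 - σ(x)):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 near x = 0, vanishes in both tails

for v in (-10.0, 0.0, 10.0):
    print(v, sigmoid(v), sigmoid_grad(v))
# at x = +/-10 the gradient is ~4.5e-5: a saturated neuron passes almost no signal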
Tanh. The tanh non-linearity is shown in the image above on the right. It squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity. Also note that the tanh neuron is simply a scaled sigmoid neuron; in particular the following holds: tanh(x) = 2σ(2x) - 1. [Figure: Left: the rectified linear unit (ReLU) activation function, which is zero when x < 0 and then linear with slope 1 when x > 0. Right: a plot from the Krizhevsky et al. paper (pdf) indicating the 6x improvement in convergence with the ReLU unit compared to the tanh unit.]
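A one-line numerical check of that scaled-sigmoid identity:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 13)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True: tanh is a scaled sigmoid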
ReLU. The rectified linear unit has become very popular in the last few years. It computes the function f(x) = max(0, x). In other words, the activation is simply thresholded at zero (see the image above on the left). There are several pros and cons to using ReLUs:

(+) It was found to greatly accelerate (e.g. by a factor of 6 in Krizhevsky et al.) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.

(+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.

(-) Unfortunately, ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network is "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.
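A minimal sketch of the ReLU forward and backward pass (function names are my own), showing why a unit that never activates also receives no gradient:

import numpy as np

def relu(x):
    return np.maximum(0, x)        # thresholding activations at zero

def relu_grad(x, upstream):
    return upstream * (x > 0)      # local gradient is 1 where x > 0, else 0

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(a))                     # [0.  0.  0.  0.5 2. ]
print(relu_grad(a, 1.0))           # [0. 0. 0. 1. 1.] -> the inactive region passes no signal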
Leaky ReLU. Leaky ReLUs are one attempt to fix the "dying ReLU" problem. (Read the original text for the rest; my notes are truncated here.)
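For reference, a minimal sketch of the leaky ReLU idea: a small negative slope (commonly alpha = 0.01) keeps some gradient flowing when x < 0:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # instead of zeroing negative inputs, give them a small slope alpha
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, upstream, alpha=0.01):
    return upstream * np.where(x > 0, 1.0, alpha)  # never exactly zero, so units cannot fully "die"

print(leaky_relu(np.array([-2.0, 0.5])))  # [-0.02  0.5 ]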