Adit Deshpande
CS Undergrad at UCLA ('19)
A Beginner's Guide to Understanding Convolutional Neural Networks Part 2
Introduction
Link to Part 1
In this post, we'll go into a lot more of the specifics of ConvNets. Disclaimer: Now, I do realize that some of these topics are quite complex and could be made into whole posts by themselves. In an effort to remain concise yet retain comprehensiveness, I will provide links to research papers where the topic is explained in more detail.
Stride and Padding
Alright, let's look back at our good old conv layers. Remember the filters, the receptive fields, the convolving? Good. Now, there are 2 main parameters that we can change to modify the behavior of each layer. After we choose the filter size, we also have to choose the stride and the padding.
Stride controls how the filter convolves around the input volume. In the example we had in Part 1, the filter convolves around the input volume by shifting one unit at a time. The amount by which the filter shifts is the stride. In that case, the stride was implicitly set at 1. Stride is normally set in a way so that the output volume is an integer and not a fraction. Let's look at an example. Let's imagine a 7 x 7 input volume, a 3 x 3 filter (disregard the 3rd dimension for simplicity), and a stride of 1. This is the case that we're accustomed to.
Same old, same old, right? See if you can try to guess what would happen to the output volume as the stride increases to 2.
So, as you can see, the receptive field is shifting by 2 units now and the output volume shrinks as well. Notice that if we tried to set our stride to 3, then we'd have issues with spacing and with making sure the receptive fields fit on the input volume. Normally, programmers will increase the stride if they want receptive fields to overlap less and if they want smaller spatial dimensions.
Now, let's take a look at padding. Before getting into that, let's think about a scenario. What happens when you apply three 5 x 5 x 3 filters to a 32 x 32 x 3 input volume? The output volume would be 28 x 28 x 3. Notice that the spatial dimensions decrease. As we keep applying conv layers, the size of the volume will decrease faster than we would like. In the early layers of our network, we want to preserve as much information about the original input volume as possible so that we can extract those low level features. Let's say we want to apply the same conv layer but we want the output volume to remain 32 x 32 x 3. To do this, we can apply a zero padding of size 2 to that layer. Zero padding pads the input volume with zeros around the border. If we think about a zero padding of 2, then this would result in a 36 x 36 x 3 input volume.
If you have a stride of 1 and if you set the size of zero padding to (K - 1) / 2, where K is the filter size, then the input and output volume will always have the same spatial dimensions.
The formula for calculating the output size of any given conv layer is

O = (W - K + 2P) / S + 1

where O is the output height/length, W is the input height/length, K is the filter size, P is the padding, and S is the stride.
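To make these numbers concrete, here is a minimal Python helper (my own sketch, not something from the original post) that plugs values into that formula:

```python
def conv_output_size(w, k, p, s):
    """Spatial output size of a conv layer: O = (W - K + 2P) / S + 1."""
    assert (w - k + 2 * p) % s == 0, "these hyperparameters don't tile the input evenly"
    return (w - k + 2 * p) // s + 1

# The scenarios from above:
print(conv_output_size(w=7, k=3, p=0, s=1))   # 5  -- the 7 x 7 input, stride 1
print(conv_output_size(w=7, k=3, p=0, s=2))   # 3  -- same input, stride 2
print(conv_output_size(w=32, k=5, p=0, s=1))  # 28 -- the volume shrinks
print(conv_output_size(w=32, k=5, p=2, s=1))  # 32 -- zero padding of 2 preserves size
```

Trying a stride of 3 on the 7 x 7 input trips the assertion, which is exactly the spacing issue mentioned earlier.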
Choosing Hyperparameters
How do we know how many layers to use, how many conv layers, what the filter sizes should be, or the values for stride and padding? These are not trivial questions and there isn't a set standard that is used by all researchers. This is because the network will largely depend on the type of data that you have. Data can vary by size, complexity of the image, type of image processing task, and more. When looking at your dataset, one way to think about how to choose the hyperparameters is to find the right combination that creates abstractions of the image at a proper scale.
ReLU (Rectified Linear Units) Layers
After each conv layer, it is convention to apply a nonlinear layer (or activation layer) immediately afterward. The purpose of this layer is to introduce nonlinearity to a system that has basically just been computing linear operations during the conv layers (just element-wise multiplications and summations). In the past, nonlinear functions like tanh and sigmoid were used, but researchers found out that ReLU layers work far better because the network is able to train a lot faster (because of the computational efficiency) without making a significant difference to the accuracy. ReLU also helps to alleviate the vanishing gradient problem, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers (explaining this might be out of the scope of this post, but see here and here for good descriptions). The ReLU layer applies the function f(x) = max(0, x) to all of the values in the input volume. In basic terms, this layer just changes all the negative activations to 0. This layer increases the nonlinear properties of the model and the overall network without affecting the receptive fields of the conv layer.
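Since f(x) = max(0, x) acts element-wise, the whole layer is a one-liner in NumPy; here is an illustrative sketch:

```python
import numpy as np

def relu(volume):
    """Apply f(x) = max(0, x) element-wise: negatives become 0, positives pass through."""
    return np.maximum(0, volume)

activations = np.array([[-2.0,  0.5],
                        [ 3.0, -0.1]])
print(relu(activations))
# [[0.  0.5]
#  [3.  0. ]]
```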
Paper by the great Geoffrey Hinton (aka the father of deep learning).
Pooling Layers
After some ReLU layers, programmers may choose to apply a pooling layer. It is also referred to as a downsampling layer. In this category, there are also several layer options, with maxpooling being the most popular. This basically takes a filter (normally of size 2 x 2) and a stride of the same length. It then applies it to the input volume and outputs the maximum number in every subregion that the filter convolves around.
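Here is a minimal NumPy sketch of 2 x 2 maxpooling (my own illustration, assuming even height and width; real frameworks implement this far more efficiently):

```python
import numpy as np

def maxpool_2x2(volume):
    """2 x 2 max pooling with stride 2 on a (height, width, depth) volume."""
    h, w, d = volume.shape
    # Group the pixels into 2 x 2 blocks, then keep the max of each block.
    return volume.reshape(h // 2, 2, w // 2, 2, d).max(axis=(1, 3))

x = np.random.randn(4, 4, 3)
print(maxpool_2x2(x).shape)  # (2, 2, 3) -- height and width halve, depth is untouched
```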
Other options for pooling layers are average pooling and L2-norm pooling. The intuitive reasoning behind this layer is that once we know that a specific feature is in the original input volume (there will be a high activation value), its exact location is not as important as its relative location to the other features. As you can imagine, this layer drastically reduces the spatial dimension (the length and the width change but not the depth) of the input volume. This serves two main purposes. The first is that the amount of parameters or weights is reduced by 75%, thus lessening the computation cost. The second is that it will control overfitting. This term refers to when a model is so tuned to the training examples that it is not able to generalize well for the validation and test sets. A symptom of overfitting is having a model that gets 100% or 99% on the training set, but only 50% on the test data.
Dropout Layers
Now, dropout layers have a very specific function in neural networks. In the last section, we discussed the problem of overfitting, where after training, the weights of the network are so tuned to the training examples they are given that the network doesn't perform well when given new examples. The idea of dropout is simplistic in nature. This layer "drops out" a random set of activations in that layer by setting them to zero in the forward pass. Simple as that. Now, what are the benefits of such a simple and seemingly unnecessary and counterintuitive process? Well, in a way, it forces the network to be redundant. By that I mean the network should be able to provide the right classification or output for a specific example even if some of the activations are dropped out. It makes sure that the network isn't getting too "fitted" to the training data and thus helps alleviate the overfitting problem. An important note is that this layer is only used during training, and not during test time.
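Here is a sketch of that forward pass (using the common "inverted dropout" variant, which rescales the surviving activations so that nothing special is needed at test time; the names are my own):

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, training=True):
    """Zero out a random set of activations during training only."""
    if not training:
        return activations  # dropout is skipped at test time
    # Each activation survives with probability 1 - p_drop; survivors are
    # scaled up so the expected activation stays the same.
    mask = (np.random.rand(*activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask
```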
Paper by Geoffrey Hinton.
Network in Network Layers
A network in network layer refers to a conv layer where a 1 x 1 size filter is used. Now, at first look, you might wonder why this type of layer would even be helpful since receptive fields are normally larger than the space they map to. However, we must remember that these 1 x 1 convolutions span a certain depth, so we can think of it as a 1 x 1 x N convolution where N is the number of filters applied in the layer. Effectively, this layer is performing an N-D element-wise multiplication where N is the depth of the input volume into the layer.
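One way to see this is that a 1 x 1 convolution collapses into a plain matrix multiply across the depth axis; a toy NumPy illustration (all sizes here are arbitrary):

```python
import numpy as np

h, w, depth, num_filters = 28, 28, 64, 32
volume = np.random.randn(h, w, depth)               # input volume
filters_1x1 = np.random.randn(depth, num_filters)   # one column per 1 x 1 filter

# At every spatial position, take a dot product across the depth axis.
output = volume @ filters_1x1
print(output.shape)  # (28, 28, 32) -- spatial size unchanged, depth remapped
```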
Paper by Min Lin.
Classification, Localization, Detection, Segmentation
In the example we used in Part 1 of this series, we looked at the task of image classification. This is the process of taking an input image and outputting a class number out of a set of categories. However, when we take a task like object localization, our job is not only to produce a class label but also a bounding box that describes where the object is in the picture.
We also have the task of object detection, where localization needs to be done on all of the objects in the image. Therefore, you will have multiple bounding boxes and multiple class labels.
Finally, we also have object segmentation, where the task is to output a class label as well as an outline of every object in the input image.
More detail on how these are implemented is to come in Part 3, but for those who can't wait...
Detection/Localization: RCNN, Fast RCNN, Faster RCNN, MultiBox, Bayesian Optimization, Multi-region, RCNN Minus R, Image Windows
Segmentation: Semantic Seg, Unconstrained Video, Shape Guided, Object Regions, Shape Sharing
Yeah, there's a lot more.
Transfer Learning
Now, a common misconception in the DL community is that without a Google-esque amount of data, you can't possibly hope to create effective deep learning models. While data is a critical part of creating the network, the idea of transfer learning has helped to lessen the data demands. Transfer learning is the process of taking a pre-trained model (the weights and parameters of a network that has been trained on a large dataset by somebody else) and "fine-tuning" the model with your own dataset. The idea is that this pre-trained model will act as a feature extractor. You will remove the last layer of the network and replace it with your own classifier (depending on what your problem space is). You then freeze the weights of all the other layers and train the network normally (freezing the layers means not changing the weights during gradient descent/optimization).
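In PyTorch, that recipe might look like the sketch below (my own illustration, not something from the original post; note that newer torchvision versions prefer a weights= argument over pretrained=True):

```python
import torch.nn as nn
from torchvision import models

# Grab a model pre-trained on ImageNet; ResNet-18 is just an example choice.
model = models.resnet18(pretrained=True)

# Freeze every layer: these weights won't change during gradient descent.
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer with our own classifier; num_classes (a hypothetical
# value here) is whatever your problem space needs. Only this new layer trains.
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)
```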
Let's investigate why this works. Let's say the pre-trained model that we're talking about was trained on ImageNet (for those that aren't familiar, ImageNet is a dataset that contains 14 million images with over 1,000 classes). When we think about the lower layers of the network, we know that they will detect features like edges and curves. Now, unless you have a very unique problem space and dataset, your network is going to need to detect curves and edges as well. Rather than training the whole network through a random initialization of weights, we can use the weights of the pre-trained model (and freeze them) and focus on the more important layers (ones that are higher up) for training. If your dataset is quite different from something like ImageNet, then you'd want to train more of your layers and freeze only a couple of the low layers.
Paper by Yoshua Bengio (another deep learning pioneer).
Paper by Ali Sharif Razavian.
Paper by Jeff Donahue.
Data Augmentation Techniques
By now, we're all probably numb to the importance of data in ConvNets, so let's talk about ways that you can make your existing dataset even larger, just with a couple of easy transformations. Like we've mentioned before, when a computer takes an image as an input, it will take in an array of pixel values. Let's say that the whole image is shifted left by 1 pixel. To you and me, this change is imperceptible. However, to a computer, this shift can be fairly significant as the classification or label of the image doesn't change, while the array does. Approaches that alter the training data in ways that change the array representation while keeping the label the same are known as data augmentation techniques. They are a way to artificially expand your dataset. Some popular augmentations people use are grayscales, horizontal flips, vertical flips, random crops, color jitters, translations, rotations, and much more. By applying just a couple of these transformations to your training data, you can easily double or triple the number of training examples.
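As an illustration (my own sketch; the exact transforms and parameter values are arbitrary), torchvision can chain several of the augmentations mentioned above into a single pipeline that is applied as each image is loaded:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),           # horizontal flips
    transforms.RandomCrop(32, padding=4),        # random crops / translations
    transforms.ColorJitter(brightness=0.2,
                           contrast=0.2,
                           saturation=0.2),      # color jitters
    transforms.RandomRotation(15),               # rotations up to 15 degrees
    transforms.ToTensor(),
])

# augmented = augment(pil_image)  # each epoch sees a different array, same label
```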
Link to Part 3
Deuces.
Written on July 29, 2016