ResNet, AlexNet, VGG, Inception: Understanding Various Architectures of Convolutional Networks
by Koustubh
This blog is from: http://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/
Convolutional neural networks are fantastic for visual recognition tasks. Good ConvNets are beasts with millions of parameters and many hidden layers. In fact, a bad rule of thumb is: 'higher the number of hidden layers, better the network'. AlexNet, VGG, Inception, and ResNet are some of the popular networks. Why do these networks work so well? How are they designed? Why do they have the structures they have? One wonders. The answers to these questions are not trivial and certainly can't be covered in one blog post. However, in this blog, I shall try to discuss some of them. Network architecture design is a complicated process: it takes a while to learn, and even longer to experiment with designs of your own. But first, let's put things in perspective:
Why do ConvNets beat traditional computer vision?
Image classification is the task of classifying a given image into one of a set of predefined categories. In traditional methods, this process is divided into two modules: feature extraction and classification.
Feature extraction involves extracting a higher level of information from raw pixel values that can capture the distinction among the categories involved. This feature extraction is done in an unsupervised manner, wherein the classes of the images have nothing to do with the information extracted from the pixels. Some of the traditional and widely used features are GIST, HOG, SIFT, LBP, etc. After the features are extracted, a classification module is trained with the images and their associated labels. A few examples of such modules are SVM, Logistic Regression, Random Forest, decision trees, etc.
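To make the two-module pipeline concrete, here is a deliberately toy sketch: a normalized intensity histogram stands in for a hand-crafted feature like HOG or SIFT, and a nearest-centroid rule stands in for a classifier like an SVM. The feature extractor knows nothing about the labels; only the second module sees them. All names and data here are illustrative, not from any real system.

```python
import numpy as np

def extract_feature(image, bins=8):
    """Toy feature extractor: a normalized intensity histogram.
    (A stand-in for hand-crafted features like HOG or SIFT;
    note it never looks at the class labels.)"""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

def train_classifier(features, labels):
    """Toy classification module: nearest class centroid
    (a stand-in for an SVM or Logistic Regression)."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(centroids, feature):
    return min(centroids, key=lambda c: np.linalg.norm(feature - centroids[c]))

# Synthetic "dark" (class 0) vs "bright" (class 1) images.
rng = np.random.default_rng(0)
dark = rng.uniform(0.0, 0.4, size=(10, 16, 16))
bright = rng.uniform(0.6, 1.0, size=(10, 16, 16))
images = np.concatenate([dark, bright])
labels = np.array([0] * 10 + [1] * 10)

feats = np.array([extract_feature(im) for im in images])
model = train_classifier(feats, labels)
```

If the histogram feature could not separate the categories (say, two classes with the same brightness), no choice of classifier in the second module could fix it; that is exactly the limitation discussed above.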
But the problem with this pipeline is that the feature extraction cannot be fine-tuned according to the classes and images. So if the chosen features lack the expressiveness to distinguish the categories, the accuracy of the classification model suffers, regardless of the classification strategy you adopt. A common theme among the state of the art following the traditional pipeline has been to pick multiple feature extractors and club them inventively to get a better feature. This involves too many heuristics, as well as manual labor to tweak parameters according to the domain, to reach a decent level of accuracy. By decent, I mean reaching close to human-level accuracy. That's why it took years, using traditional computer vision, to build good computer vision systems (like OCR, face verification, image classifiers, object detectors, etc.) that could work with the wide variety of data encountered in practical applications. We once produced better results using ConvNets for a company (a client of my start-up) in 6 weeks, which had taken them close to a year to achieve using traditional computer vision.
Another problem with this method is that it is completely different from how we humans learn to recognize things. Just after birth, a child is incapable of perceiving his surroundings, but as he progresses and processes data, he learns to identify things. This is the philosophy behind deep learning, wherein no hard-coded feature extractor is built in. It combines the extraction and classification modules into one integrated system: it learns to extract discriminating representations from the images and to classify them based on supervised data.
One such system is the multilayer perceptron, aka neural network, which consists of multiple layers of neurons densely connected to each other. A deep vanilla neural network has such a large number of parameters that it is impossible to train such a system without overfitting, due to the lack of a sufficient number of training examples. But with convolutional neural networks (ConvNets), the task of training the whole network from scratch can be carried out using a large dataset like ImageNet. The reason behind this is the sharing of parameters between the neurons and the sparse connections in convolutional layers, as can be seen in Figure 2. In the convolution operation, the neurons in one layer are only locally connected to the input neurons, and the set of parameters is shared across the 2-D feature map.
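The effect of parameter sharing and sparse connectivity can be made concrete with a quick count (the layer sizes below are illustrative choices, not from any particular network), plus a minimal NumPy convolution showing the same small kernel slid over every position:

```python
import numpy as np

# Dense vs. convolutional parameter counts for a 32x32x3 input mapped
# to 16 output feature maps of the same spatial size (illustrative sizes):
h, w, c_in, c_out, k = 32, 32, 3, 16, 3
dense_params = (h * w * c_in) * (h * w * c_out)  # every input feeds every output
conv_params = k * k * c_in * c_out + c_out       # one shared 3x3 kernel per map
print(dense_params, conv_params)                 # 50331648 vs 448

def conv2d_single(image, kernel):
    """Valid 2-D correlation, stride 1: the SAME kernel weights are
    reused at every spatial position (parameter sharing), and each
    output depends only on a local patch (sparse connectivity)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

Roughly 50 million weights collapse to a few hundred, which is why the whole network can be trained from scratch on ImageNet without hopeless overfitting.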
In order to understand the design philosophy of convnets, one must ask: What's the objective here?
A. Accuracy:
If you are building an intelligent machine, it is absolutely critical that it be as accurate as possible. One fair caveat here is that 'accuracy not only depends on the network but also on the amount of data available for training'. Hence, these networks are compared on a standard dataset called ImageNet.
The ImageNet project is an ongoing effort and currently has 14,197,122 images from 21,841 different categories. Since 2010, ImageNet has been running an annual competition in visual recognition in which participants are provided with 1.2 million images belonging to 1000 different classes from the ImageNet dataset. So, each network architecture reports its accuracy on these 1.2 million images across 1000 classes.
B. Computation:
Most ConvNets have huge memory and computation requirements, especially while training. Hence, this becomes an important concern. Similarly, the size of the final trained model becomes an important consideration if you are looking to deploy a model to run locally on mobile. As you can guess, a more computationally intensive network is needed to produce more accuracy. So, there is always a trade-off between accuracy and computation.
Apart from these, there are many other factors, like ease of training, the ability of a network to generalize well, etc. The networks described below are the most popular ones; they are presented in the order in which they were published, and each had increasingly better accuracy than the earlier ones.
AlexNet
This architecture was one of the first deep networks to push ImageNet classification accuracy by a significant stride in comparison to traditional methodologies. It is composed of 5 convolutional layers followed by 3 fully connected layers, as depicted in Figure 1.
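A quick back-of-the-envelope sketch of those 5 + 3 layers makes the structure tangible. Note the caveat: the original paper split the conv layers across two GPUs (grouped convolutions), so the counts below, for a single-stream variant, are approximate rather than the paper's exact figures.

```python
# AlexNet sketch: (num_filters, kernel_size, in_channels) per conv layer.
# Single-stream variant; the original two-GPU grouping is ignored here,
# so these are approximate counts, not the paper's exact figures.
conv_layers = [
    (96, 11, 3),    # conv1 (stride 4)
    (256, 5, 96),   # conv2
    (384, 3, 256),  # conv3
    (384, 3, 384),  # conv4
    (256, 3, 384),  # conv5
]
conv_params = sum(n * (k * k * cin + 1) for n, k, cin in conv_layers)

# Three FC layers: 256*6*6 = 9216 inputs -> 4096 -> 4096 -> 1000 classes.
fc_layers = [(9216, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(i * o + o for i, o in fc_layers)

print(conv_params, fc_params)  # the FC layers hold the bulk of the weights
```

Running this shows the fully connected layers dominate the parameter budget by more than an order of magnitude, which is also why the regularization tricks below (dropout) are applied to the FC layers.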
AlexNet, proposed by Alex Krizhevsky, uses ReLu (Rectified Linear Unit) for the non-linear part, instead of a Tanh or Sigmoid function, which was the earlier standard for traditional neural networks. ReLu is given by
f(x) = max(0, x)
The advantage of ReLu over sigmoid is that it trains much faster, because the derivative of sigmoid becomes very small in the saturating region and therefore the updates to the weights almost vanish (Figure 4). This is called the vanishing gradient problem.
In the network, a ReLu layer is put after each and every convolutional and fully connected (FC) layer.
Another problem that this architecture addressed was over-fitting, which it reduced by using a dropout layer after every FC layer. A dropout layer has a probability, p, associated with it and is applied at every neuron of the response map separately. It randomly switches off the activation with probability p, as can be seen in Figure 5.
Why does dropout work?
The idea behind dropout is similar to model ensembles. Due to the dropout layer, different sets of neurons are switched off at different times; each such set represents a different architecture, and all these different architectures are trained in parallel, with weight given to each subset and the summation of the weights being one. For n neurons attached to dropout, the number of subset architectures formed is 2^n. So it amounts to predictions being averaged over this ensemble of models. This provides a structured model regularization which helps in avoiding over-fitting. Another view of why dropout is helpful is that, since neurons are randomly chosen, they tend to avoid developing co-adaptations among themselves, thereby enabling them to develop meaningful features independent of the others.
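The mechanism can be sketched in a few lines of NumPy. One caveat: this sketch uses the modern "inverted" formulation, which scales the surviving activations by 1/(1-p) at training time; the original AlexNet instead scaled activations at test time, but the expected behavior is the same.

```python
import numpy as np

def dropout(activations, p, training=True, rng=None):
    """Inverted dropout: switch off each activation with probability p and
    scale the survivors by 1/(1-p), so the expected activation at test
    time matches training. (AlexNet originally scaled at test time
    instead; the two formulations are equivalent in expectation.)"""
    if not training or p == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p   # keep with probability 1 - p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
hidden = np.ones((2, 8))                 # a toy FC response map
print(dropout(hidden, p=0.5, rng=rng))   # roughly half the units are zeroed
```

Each training step samples a fresh mask, so each step effectively trains a different one of the 2^n sub-networks described above, all sharing the same weights.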
VGG16
This architecture is from the VGG group, Oxford. It improves over AlexNet by replacing the large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3x3 kernel-sized filters, one after another. For a given receptive field (the effective area of the input image on which the output depends), multiple stacked smaller-size kernels are better than one larger-size kernel, because the multiple non-linear layers increase the depth of the network, which enables it to learn more complex features, and that too at a lower cost.
For example, three 3x3 filters on top of each other with stride 1 have a receptive field of size 7, but the number of parameters involved is 3*(9C^2), in comparison to the 49C^2 parameters of a kernel of size 7. Here, it is assumed that the number of input and output channels of each layer is C. Also, 3x3 kernels help in retaining finer-level properties of the image. The network architecture is given in the table.
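That arithmetic is easy to verify directly (C = 64 below is just an illustrative choice):

```python
C = 64  # assumed number of input and output channels per layer

stack_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3x3 layers: 27*C^2 weights
single_7x7 = 7 * 7 * C * C       # one 7x7 layer: 49*C^2 weights

def receptive_field(n_layers, kernel=3):
    """Receptive field of n stacked stride-1 convolutions:
    each layer adds (kernel - 1) pixels."""
    rf = 1
    for _ in range(n_layers):
        rf += kernel - 1
    return rf

print(receptive_field(3), stack_3x3, single_7x7)  # 7 110592 200704
```

So the stack sees the same 7x7 region with roughly 45% fewer weights, and gets three non-linearities instead of one.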
You can see that in VGG-D, there are blocks in which the same filter size is applied multiple times to extract more complex and representative features. This concept of blocks/modules became a common theme in the networks after VGG.
The VGG convolutional layers are followed by 3 fully connected layers. The width of the network starts at a small value of 64 and increases by a factor of 2 after every sub-sampling/pooling layer. It achieves a top-5 accuracy of 92.3% on ImageNet.
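As a sanity check on that doubling-width pattern, here is a short pure-Python tally of VGG-16 (configuration D) parameters; the layer widths are taken from the VGG paper's table, with 'P' marking the pooling layers after which the width doubles.

```python
# VGG-16 (configuration D): 3x3 conv widths, 'P' = 2x2 max-pool.
# Width starts at 64 and doubles after each pooling stage (capped at 512).
cfg = [64, 64, 'P', 128, 128, 'P', 256, 256, 256, 'P',
       512, 512, 512, 'P', 512, 512, 512, 'P']

params, c_in = 0, 3
for v in cfg:
    if v == 'P':
        continue
    params += 3 * 3 * c_in * v + v   # 3x3 conv weights + biases
    c_in = v

# Three fully connected layers: 512*7*7 -> 4096 -> 4096 -> 1000 classes.
for i, o in [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]:
    params += i * o + o

print(params)  # 138357544 -- the ~138M figure usually quoted for VGG-16
```

Those ~138 million parameters, most of them in the first FC layer, are exactly why VGG is expensive in memory and why later architectures moved away from large fully connected heads.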