AlexNet
- Contribution: Winner of ILSVRC 2012. It demonstrated the astonishing performance of deep CNNs on image tasks, set off the surge in CNN research, and is an important reason for today's rapid development of deep learning and AI. The ImageNet competition provided a stage for Hinton, who had been studying neural networks all along; AlexNet was published by Hinton and two of his students, and before AlexNet deep learning had been quiet for a long time.
- Network structure: As shown in the figure, an 8-layer network with roughly 60 million parameters, using the ReLU activation function and 0.5 dropout on the first two fully connected layers. It uses LRN and overlapping pooling; LRN is rarely used nowadays, with BN generally used for normalization instead. Multi-GPU training was used at the time.
- Preprocessing: First downsample each image so that its shorter side is 256, then crop out the central 256x256 patch, and subtract the mean (computed over the training set) for normalization. During training, data augmentation is applied: for each image, random 227x227 crops and their horizontal mirrors are extracted. On top of this augmentation, PCA over the RGB pixel values is used to jitter the color intensities, which further reduces overfitting.
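As a rough illustration, the training-time pipeline described above could look like the following sketch with torchvision (my choice of library; the mean/std values are the common ImageNet statistics and the PCA color jitter is omitted, both assumptions on my part):

```python
import torchvision.transforms as T

# Minimal sketch of the AlexNet-style training pipeline described above.
train_transform = T.Compose([
    T.Resize(256),             # shorter side to 256
    T.CenterCrop(256),         # central 256x256 patch
    T.RandomCrop(227),         # random 227x227 crop
    T.RandomHorizontalFlip(),  # horizontal mirror
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # common ImageNet statistics (assumption)
                std=[0.229, 0.224, 0.225]),
])
```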
- Prediction: Extract 5 patches (the four corners and the center) and their horizontal mirrors from each image, 10 in total, and use the average of the 10 predictions as the final prediction.
- Hyperparameters: SGD with learning rate 0.01, batch size 128, momentum 0.9, and weight decay 0.0005 (the paper gives the weight update formula). Whenever the validation error stops dropping, the learning rate is divided by 10. Weights are initialized from a Gaussian distribution N(0, 0.01); the biases of the 2nd, 4th and 5th convolutional layers and of the fully connected layers are initialized to 1 (feeding positive values into the ReLU to accelerate early training), and the remaining biases are initialized to 0.
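A sketch of this optimizer setup in PyTorch (again my choice of library; the plateau patience is left at the framework default, an assumption):

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(10, 10)  # stand-in for the actual network

optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=0.0005)
# Divide the learning rate by 10 whenever the monitored validation error plateaus.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

# after each epoch: scheduler.step(validation_error)
```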
ZFNet
- Contribution: Winner of the ILSVRC 2013 classification task. It uses deconvolution to visualize the CNN's intermediate feature maps, analyzes the feature behavior to find ways to improve the model, and fine-tunes AlexNet to improve performance. The Z and F in ZFNet stand for Zeiler and Fergus: Zeiler, a former student of Hinton who later went to New York University, proposed ZFNet together with Fergus of New York University.
- The champion?: Strictly speaking, the classification champion was Clarifai, but the ILSVRC 2013 winner we usually talk about is ZFNet. Zeiler, the Z of ZF, is the founder and CEO of Clarifai.
- Network structure: As shown in the figure, as in AlexNet, the first two fully connected layers use 0.5 dropout. Compared with AlexNet, the main difference is the use of a smaller convolution kernel and stride: the 11x11 kernel becomes a 7x7 kernel, and the stride goes from 4 to 2. In addition, because the first-layer kernels strongly affect the visualizations, they are normalized: if a kernel's RMS (root mean square) exceeds 0.1, it is renormalized so that its RMS is fixed at 0.1.
- Other datasets: ZFNet also ran transfer learning experiments on Caltech-101, Caltech-256, and PASCAL VOC 2012.
- Preprocessing and hyperparameters: Basically the same as AlexNet. The weight initialization differs: weights are initialized to 0.01 and biases to 0.
- More: I covered the details in another paper note: Visualizing CNNs.
OverFeat
- Contribution: Winner of the ILSVRC 2013 localization task. It uses a single CNN to integrate the three tasks of classification, localization, and detection, and proposes a multi-scale approach. OverFeat was proposed by Yann LeCun's team. LeCun's LeNet can be said to be the beginning of CNNs, but it did not take off when proposed, because machine performance was low at the time and SVMs could achieve similar or even better results.
- Network structure: Compared with AlexNet, LRN is no longer used, pooling is non-overlapping, and smaller strides are used; larger strides increase speed but hurt accuracy. As with AlexNet, the first two fully connected layers use 0.5 dropout.
- Preprocessing and hyperparameters: Basically the same as AlexNet. The weight initialization differs: all weights are initialized from a Gaussian distribution N(0, 0.01). Momentum is 0.6 and the learning rate is 0.005. At epochs (30, 50, 60, 70, 80) the learning rate is halved (multiplied by a factor of 0.5).
- Prediction: In the test phase, AlexNet's ten-view method (4 corners and center, with horizontal flips) is no longer used; instead a multi-scale method that averages predictions is explored. For multi-scale prediction, the original image is rescaled to multiple scales and fed directly into the network.
- Multi-scale (and full convolution): As shown in the figure, the fully connected layers are rewritten as convolutions (e.g., a 5x5 convolution), and a global max pool is placed at the end of the network, so the network can take multi-scale inputs. For example, a 14x14 input finally yields a 1x1 classification feature map, while a 16x16 input yields a 2x2 classification feature map, which the global max pool reduces to 1x1, so the output is consistent across input scales. From the blue blocks one can also see that the convolution over the 16x16 input is equivalent to sliding a 14x14 window with stride 2, giving 4 convolution results.
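A toy sketch of the idea (the layer sizes here are made up; the point is only that a fully convolutional head plus global max pooling makes different input scales produce the same output shape):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),     # stand-in feature extractor
    nn.ReLU(),
    nn.Conv2d(16, 1000, kernel_size=5),  # former "fully connected" layer as a 5x5 conv
    nn.AdaptiveMaxPool2d(1),             # global max pooling
)

for size in (14, 16):
    out = head(torch.randn(1, 3, size, size))
    print(size, tuple(out.shape))        # both inputs end up as (1, 1000, 1, 1)
```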
- More: I covered the details in another paper note: OverFeat.
VGG
- Contribution: Winner of the ILSVRC 2014 localization task and runner-up of the classification task. The network is characterized by a very regular structure: it deepens the network by repeatedly stacking 3x3 convolutions while gradually doubling the number of kernels. Many later CNN architectures adopted this 3x3 convolution idea, so its impact was large. ZFNet and OverFeat both improved on AlexNet with smaller kernels and smaller strides, whereas VGG explores the depth of CNNs by fixing the other design choices and steadily stacking more layers.
- Network structure
- As shown in the figure, VGG-16 has 16 layers and approximately 138 million parameters. Experiments found that adding LRN brought no improvement and actually hurt, so it was dropped. Experiments also found that 1x1 convolutions performed worse here, so they were not used; the 1x1 convolution, popularized by Network in Network, is nevertheless a very important idea and is used in GoogLeNet and ResNet.
- Using the small 3x3 kernel (the smallest size that still captures left/right and up/down context) benefits stacking depth (it keeps the parameter count from growing too large). The stride is 1, with "same" padding.
- Two stacked 3x3 convolutions have the same receptive field as a 5x5 convolution, and three stacked 3x3 convolutions have the same receptive field as a 7x7 convolution. The advantages of using three 3x3s are that three nonlinear transformations are applied and the number of parameters is reduced at the same time: assuming the numbers of input and output channels are both c, we have
\[3(3^2c^2) = 27c^2 < 7^2c^2 = 49c^2\]
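To make the count concrete, here is a quick check in PyTorch (c = 64 is an arbitrary choice of mine):

```python
import torch.nn as nn

c = 64  # channel count, chosen only for illustration

three_3x3 = nn.Sequential(*[nn.Conv2d(c, c, 3, padding=1, bias=False) for _ in range(3)])
one_7x7 = nn.Conv2d(c, c, 7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(three_3x3), count(one_7x7))  # 27*c^2 = 110592 vs 49*c^2 = 200704
```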
- As with AlexNet, the first two fully connected layers use 0.5 dropout.
- Hyperparameters: Basically the same as AlexNet. Batch size is 256. Initialization also uses a Gaussian distribution N(0, 0.01), but VGG first trains a shallow network and then uses some of its parameters to initialize part of the deep network, with the remaining parameters drawn from the Gaussian distribution. It is worth noting that after submitting the paper, the VGG authors found that Glorot initialization works without this pre-training.
- Preprocessing: Unlike AlexNet, when downsampling, the shorter side is rescaled not to 256 but to S, and there are two ways to set S. The first is a fixed S (single-scale), either 256 or 384; to speed up the S = 384 network, its weights are initialized from the model pre-trained at S = 256, and a smaller learning rate of 0.001 is used. The second method samples S randomly from [256, 512] (multi-scale; note this is multi-scale training, as opposed to the multi-scale testing in OverFeat), which can be seen as scale jittering to augment the training set; to speed this up, the S = 384 pre-trained model is used for weight initialization.
- Prediction: Combines AlexNet's ten-view method (called multi-crop evaluation in the VGG paper) with OverFeat's multi-scale prediction method (called dense evaluation in the VGG paper). OverFeat already pointed out that multi-crop is flawed because of redundant convolution computation, hence the dense evaluation; but the Inception-v1 paper notes that multi-crop improves accuracy by using a large number of crops for finer sampling. VGG argues the accuracy gain does not compensate for the loss in speed, but still runs the multi-crop method for reference. In the experiments, the combination of the two is better than multi-crop, which is better than dense, although the differences are small.
- Ensembling: The final experiments fused multiple models (an ensemble); the best result fused VGG-16 and VGG-19, trained with multi-scale training and tested with both multi-crop and dense evaluation. Ensembling is also used in the final experiments of AlexNet, ZFNet, and OverFeat, and the best models are generally ensemble results.
- Localization: VGG's localization model is a modification of OverFeat. There are two kinds of bounding-box prediction: SCR (single-class regression), where all classes share one box, so the final output is a 4-dimensional vector, and PCR (per-class regression), where each class gets its own box, so the final output is 4x1000, where 1000 is the number of classes.
- Generalization: Like ZFNet, VGG also did transfer learning: pre-trained on ILSVRC data and then transferred to other datasets: VOC-2007, VOC-2012, Caltech-101, and Caltech-256.
GoogLeNet (Inception-v1)
- Contribution: Winner of the ILSVRC 2014 classification task. The network designs the inception block instead of hand-picking convolution types, then stacks inception blocks (increasing depth) to form the Inception network. The fully connected layer (which holds most of a network's parameters) is removed and replaced with global average pooling (an idea from Network in Network), greatly reducing the parameter count. These two ideas are reflected in some of Google's later papers: one line is the automatic selection of network structures like the inception block (Google later published a number of papers on automatically selecting network architectures, optimizers, and activation functions), and the other is reducing model parameters and compute (Google's MobileNet, and the similar ShuffleNet from Face++).
- Network structure: The inception block is shown in the figure. The network has 22 layers in total; the full diagram is too large, so a table is given here. Note that although the fully connected layer is replaced by global average pooling (followed by 0.4 dropout), the network diagram still ends with a fully connected layer; this is there to make it easy to fine-tune the network on other datasets.
- Parameters: To improve model performance, the typical way is to enlarge the model (increase depth or width), but this brings too many parameters, which in turn means more computation and more data (and high-quality data is often expensive), so reducing parameters is considered instead. Although Inception-v1 has 22 layers, it has only 5 million parameters, 1/27 of the contemporaneous VGG-16 (138 million) and 1/12 of AlexNet (60 million), with accuracy far better than AlexNet.
- Benefits of 1x1 convolution: It reduces parameters, allowing greater depth; it can reduce dimensionality, building bottleneck layers that cut computational cost (the inception block adds 1x1 convolutions before the 3x3 and 5x5 to reduce computation); and it enhances the network's expressive power (one can compress, increase, or keep the number of channels as desired). Together with replacing the fully connected layer by global average pooling, this greatly reduces the model's parameters. The 1x1 idea also comes from Network in Network.
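A minimal sketch of such a block (the channel counts below are illustrative; the real network fixes them per stage in a table):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Inception-v1 style block: 1x1 convs reduce channels before the 3x3/5x5."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(),
                                nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(),
                                nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, 1), nn.ReLU())

    def forward(self, x):
        # concatenate the four branches along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)  # illustrative channel counts
print(block(torch.randn(1, 192, 28, 28)).shape)       # (1, 256, 28, 28)
```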
- Hyperparameters and preprocessing: Because many changes were made during the competition, including sampling methods and various parameters, it is hard to give a definitive guide to training the network. Only a few hyperparameters are given: a fixed schedule in which the learning rate decreases by 4% every 8 epochs, and momentum 0.9.
- Prediction: The image is first resized to four scales with shorter side 256, 288, 320 and 352; from each, the left, center and right squares are taken (for portrait images, top, center and bottom). From each square, the 4 corner and center 224x224 crops are taken, plus the square itself resized to 224, together with their horizontal mirrors. This gives 4x3x6x2 = 144 crops, and the final prediction is the average over the crops.
- Ensembling: As with the earlier networks, an ensemble is used in the end: 7 versions of the network are trained and combined, averaged with Polyak averaging. The 7 networks use the same initialization and learning-rate settings and differ only in the sampling method and the order in which they see the training data.
- Object detection: Inception-v1's object detection uses an approach similar to R-CNN.
- Auxiliary outputs: Inception-v1 has two auxiliary outputs. The stated reason is the network's limited ability to propagate gradients all the way back (vanishing gradients), so two extra branches are attached to intermediate layers to exploit intermediate features, increase the gradient flowing back, and provide additional regularization. Later, the V3 paper suggests that the idea of exploiting intermediate features is probably wrong, since removing the lower auxiliary branch (the first auxiliary output) has no effect on the final result; instead it emphasizes the regularization effect of the auxiliary output, since adding BN or dropout to the auxiliary branch improves the main output's performance.
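A sketch of how the two auxiliary outputs enter training (the 0.3 discount weight is the one reported in the Inception-v1 paper; the logits and targets here are placeholders, and the auxiliary branches are simply discarded at test time):

```python
import torch
import torch.nn.functional as F

def total_loss(main_logits, aux1_logits, aux2_logits, target, aux_weight=0.3):
    """Main cross-entropy loss plus down-weighted auxiliary losses."""
    loss_main = F.cross_entropy(main_logits, target)
    loss_aux = (F.cross_entropy(aux1_logits, target) +
                F.cross_entropy(aux2_logits, target))
    return loss_main + aux_weight * loss_aux
```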
GoogLeNet (Inception-v2)
- Contribution: Learns from VGG by using two 3x3 convolutions instead of the large 5x5 convolution (keeping the receptive field while reducing parameters), and introduces the well-known BN. Note that the V4 paper calls this Inception-BN network V2, while the V3 paper has another V2 (a low-spec version of V3); the V2s mentioned in these two papers are not the same. Usually "V2" refers to this Inception-BN.
- Network structure: As shown in the figure, the main changes are: two 3x3s replace the 5x5; the number of 28x28 inception blocks goes from 2 to 3; some pooling layers are average pooling and some are max pooling; there is no extra max pooling between inception blocks, instead the stride of the convolution and pooling is set to 2 directly. BN is applied to the input of each nonlinearity (convolution, then BN, then activation). Batch size is 32. The network was trained with DistBelief (the predecessor of TensorFlow).
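The convolution → BN → activation ordering used here, as a small sketch (channel counts are arbitrary):

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size, stride=1, padding=0):
    """Convolution, then BN on its output (the input to the nonlinearity), then ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# e.g. the "two 3x3 instead of one 5x5" factorization, with BN:
five_by_five_replacement = nn.Sequential(
    conv_bn_relu(64, 96, 3, padding=1),
    conv_bn_relu(96, 96, 3, padding=1),
)
```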
- Other changes: Increase the learning rate and accelerate its decay (to suit BN-normalized data); remove dropout and reduce the L2 weight decay (BN has some regularizing effect); remove LRN (found to be unnecessary once BN is used); shuffle training samples more thoroughly; reduce photometric distortions in data augmentation (because BN trains faster, each sample is seen fewer times, so the model should focus on more realistic samples).
- Ensembling: The best results are, of course, from an ensemble (not mentioned in my earlier BN note). Six networks were ensembled, all based on BN-x30 and modified with some of the following: increased initial weights in the convolutional layers; dropout 5%; dropout 10% (versus Inception-v1's 40%); and per-activation (non-convolutional) BN on the last hidden layers. The ensemble also uses the same multi-crop procedure as Inception-v1.
- More: For BN (including the BN-x30 mentioned above), see my other paper note: BN; that note covers this Inception-BN paper.
GoogLeNet (Inception-v3)
- Contribution: Improves the model through a series of convolution factorizations and regularization methods. This is the V3 paper; it also defines a low-spec version of V3 called V2, but that V2 exists only within this paper. In this section "V2" always refers to this paper's V2, whereas the usual V2 refers to the V2 of the BN paper.
- Training configuration: Trained with TensorFlow; learning rate 0.045, decayed exponentially by a factor of 0.94 every two epochs. Gradient clipping with threshold 2.
- V2 network structure: 42 layers in total; the network diagram is omitted here. The main changes are listed below; the diagram of each modified module can be found in the V4 structure diagrams further down.
- Modify some inception blocks to decompose the 5x5 into two 3x3 convolutions (see Inception-A of V4 below).
- Modify some inception blocks into asymmetric convolutions (an nxn convolution decomposed into a 1xn and an nx1 convolution, here with n = 7; note the original structure has no 7x7 convolution) (see Inception-B of V4 below; a sketch follows this list).
- Modify some inception blocks to widen the number of convolution kernels (the number of branches to concatenate) (see Inception-C of V4 below).
- Modify some inception blocks to reduce the feature map size (using a parallel stride-2 convolution and pooling) (see the Reduction blocks of V4 below).
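A sketch of the asymmetric factorization mentioned above (n = 7; the channel count and spatial size here are arbitrary choices of mine):

```python
import torch
import torch.nn as nn

# A 7x7 convolution factorized into a 1x7 followed by a 7x1 convolution.
asymmetric = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=(1, 7), padding=(0, 3)),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=(7, 1), padding=(3, 0)),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 128, 17, 17)
print(asymmetric(x).shape)  # spatial size preserved: (1, 128, 17, 17)
```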
- V3 network structure: On top of the V2 above, add the following changes
- Train with RMSProp, decay 0.9, \(\epsilon\) 1.0
- Regularize the model with label smoothing (see the sketch after this list)
- Decompose the first 7x7 layer into three 3x3 convolutions
- Add BN to the auxiliary classifier
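A sketch of the label smoothing loss referenced above (eps = 0.1 is the value I recall from the paper; treat the exact number as an assumption):

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross entropy against a smoothed target: (1 - eps) on the true class,
    with eps spread uniformly over all classes."""
    n = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_p, eps / n)
    smooth.scatter_(-1, target.unsqueeze(-1), 1 - eps + eps / n)
    return -(smooth * log_p).sum(dim=-1).mean()

# Recent PyTorch versions also expose this directly:
# F.cross_entropy(logits, target, label_smoothing=0.1)
```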
GoogLeNet (Inception-v4, Inception-ResNet)
- Contribution: Building on V3, the residual structure is introduced, giving Inception-ResNet-v1 and Inception-ResNet-v2. At the same time, the inception blocks themselves are modified to give Inception-v4. It was found that Inception-v4 can reach results similar to Inception-ResNet-v2, suggesting that the residual structure is not necessary for training deep networks (a fractal network paper I read earlier also argued that "the residual block is not a necessary component for training deep networks"; I mentioned this in my paper note: Fractal Network).
- V4 network structure: The first figure below is V4.
- Inception-ResNet: Various Inception-ResNet variants were explored; the paper elaborates only two. Inception-ResNet-v1 has roughly the same computational cost as Inception-v3, and Inception-ResNet-v2 roughly the same as Inception-v4, although in practice Inception-v4 runs much slower, possibly because it has too many layers. In the Inception-with-ResNet variants, another difference from the pure Inception networks is that BN is used only on top of the traditional layers, not on top of the summations, so that computation is reduced and more inception blocks can be stacked.
- Inception-ResNet-v2 structure: The second figure below is Inception-ResNet-v2 (the output shapes shown are those of Inception-ResNet-v1).
- Training configuration: TensorFlow, RMSProp with decay 0.9 and \(\epsilon\) 1.0, learning rate 0.045, decayed exponentially by a factor of 0.94 every two epochs.
ResNet
- Contribution: Winner of ILSVRC 2015 (classification, detection, localization), by Kaiming He and others at MSRA. By using residual blocks, a 152-layer network was trained, reducing the error rate. It resolves the degradation problem (with plain networks, the error rate rises as the network deepens); with residuals, the error rate can keep dropping as the network deepens.
- Network deepening: As networks deepen, gradients can vanish or explode; this problem can be addressed with careful initialization (He initialization, etc.) and BN.
- Degradation: However, once a deep network reaches a certain depth, accuracy approaches saturation, and deepening it further reduces accuracy; this is called degradation. The problem is not caused by overfitting (with overfitting, training-set performance would be better), nor by vanishing gradients (the paper examines the gradients).
- Residual block: To solve the degradation problem, residual learning is proposed. As shown in the residual block, suppose the mapping to learn is H(x); after adding an identity shortcut, what we have to learn becomes F(x) = H(x) - x, and the assumption is that learning F(x) is easier than learning H(x). The most extreme case: if the mapping to learn is the identity x, it is easier to push F(x) toward 0 than to learn H(x) as an identity mapping. The motivation is that if the added layers can be made to behave like an identity mapping, then a deeper network's accuracy should be at least no worse than that of the shallower network.
- Residual block addition: When the input and output of a residual block do not have the same dimensions, there are two ways to make them consistent: one is zero-padding, the other is multiplying by a matrix Ws to do the projection (implemented as a 1x1 convolution). In both options, when the shortcut crosses feature maps of two sizes, it uses a stride-2 convolution.
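A minimal sketch of a basic residual block with the projection shortcut used only where dimensions change (configuration B described below; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Post-activation residual block (ResNet-v1 style). When the spatial size is
    halved and the channels grow, the shortcut is a stride-2 1x1 projection."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:  # project only when dimensions change
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))  # F(x) + x, then the final ReLU

block = BasicBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 56, 56)).shape)  # (1, 128, 28, 28)
```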
- Training configuration: For preprocessing, the scale is randomly sampled from [256, 480] as in VGG, then a 224x224 crop with horizontal flip is taken as in AlexNet, and the per-pixel mean is subtracted. For prediction, AlexNet's 10-crop test is used; the best results follow VGG's fully convolutional multi-scale evaluation, with scales {224, 256, 384, 480, 640}. BN is used after each convolution and before the activation; dropout is not used. He initialization. SGD with batch size 256; the learning rate starts at 0.1 and is divided by 10 each time the error rate plateaus; the model trains for 600,000 iterations; weight decay 0.0001, momentum 0.9.
- Identity vs. projection: For the shortcut in the residual block there are three configurations. Configuration A: identity shortcuts, with zero-padding when increasing dimensions. Configuration B: identity shortcuts, but projections when increasing dimensions. Configuration C: projections for all shortcuts. Performance is C > B > A, but the differences are small; in practice C is not used, because it adds parameters and computation.
- Network structure: The paper describes ResNet-18/34/50/101/152. ResNet-50/101/152 use configuration B and also use the bottleneck structure, as shown on the right of the first figure below.
- Other experiments: Besides ImageNet, experiments were also done on CIFAR-10, as well as object detection experiments on PASCAL VOC and MS COCO.
ResNet-v2
- Contribution: Modifies V1 and improves performance.
- Analysis: The formulation of ResNet-v1 is as follows. The paper analyzes the choice of the h function and the f function, i.e., the function on the shortcut path and the operation after the addition. In ResNet-v1, h is the identity mapping and f is ReLU, as shown in (a).
\[y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)\]
- Choice of the h function: The paper analyzes the performance of choosing h as the identity mapping, constant scaling, gating, 1x1 convolution, and dropout, and finds, mainly through experiments, that the identity mapping performs best.
- Choice of the f function: Since h performs best as an identity mapping, the analysis fixes h as the identity. (a): f is ReLU, the practice of ResNet-v1. (b): f is BN + ReLU. (c): f is the identity (ReLU moved before the addition). (d): f is the identity, with the last ReLU moved onto the F path of the next residual block. (e) is similar to (d), except that the BN is moved before the weight layers as well (full pre-activation), so nothing remains after the addition. ResNet-v2 adopts the structure of figure (e), which experiments show performs best.
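A sketch of the figure (e) structure, i.e. the full pre-activation block that ResNet-v2 adopts (equal input/output channels assumed for simplicity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Full pre-activation residual block: BN and ReLU come before each
    convolution, and nothing follows the addition."""
    def __init__(self, ch):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out  # both h and f are identity mappings

y = PreActBlock(64)(torch.randn(1, 64, 32, 32))
print(y.shape)  # (1, 64, 32, 32)
```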
- Both f and h as identity mappings: ResNet-v2 adopts the structure of figure (e). When f and h are both identity mappings, the formulation above can be written as the formulas below, which have several notable properties. First, for any deeper layer L and shallower layer l, \(x_L\) and \(x_l\) always differ only by a sum of residual functions. Second, whereas a plain network's input-output relation is a product of many Wx terms (ignoring activations and BN), here it is a sum of residual functions. Third, looking at the derivative, the term added to 1 cannot always be -1 for all samples in a mini-batch, so the gradient is unlikely to vanish.
\[x_{l+1} = x_l + F(x_l, W_l), \qquad x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i), \qquad \frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\Bigl(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\Bigr)\]
- Training configuration: Basically the same as ResNet-v1. For CIFAR, a learning rate of 0.01 is used for the first 400 iterations (warm-up) and then restored to 0.1, although it was observed that this is not necessary for the pre-activation residual block. For the ImageNet experiments, the learning rate is 0.1 (no warm-up), divided by 10 at epochs 30 and 60. The first and last residual blocks of the network are special cases: the pre-activation of the first residual block is placed right after the standalone convolution at the start of the network, and an extra activation is placed after the addition of the last residual block.
Reference documents
- AlexNet (NIPS): ImageNet Classification with Deep Convolutional Neural Networks
- ZFNet (ECCV): Visualizing and Understanding Convolutional Networks
- OverFeat (ICLR): OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
- VGG (ICLR): Very Deep Convolutional Networks for Large-Scale Image Recognition
- Inception-v1 (CVPR): Going Deeper with Convolutions
- Inception-v2 (ICML): Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Inception-v3 (CVPR): Rethinking the Inception Architecture for Computer Vision
- Inception-v4 (ICLR): Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
- ResNet-v1 (CVPR): Deep Residual Learning for Image Recognition
- ResNet-v2 (ECCV): Identity Mappings in Deep Residual Networks