In the previous article we walked through the network structure of GoogLeNet Inception V1. In this article we cover the development of Inception V2/V3/V4, their network structures, and their highlights.
GoogLeNet Inception V2
GoogLeNet Inception V2 appeared in "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". Its biggest highlight is the batch normalization (BN) method, which brings the following benefits:
- Larger learning rates can be used, without worrying as much about optimization issues such as exploding or vanishing gradients;
- Convergence is accelerated; to some extent dropout (a method that slows convergence) can be dropped, since BN itself acts as a regularizer and improves the model's generalization.
In machine learning we usually assume that training samples are independent and identically distributed (IID) and that training and test samples come from the same distribution; if the actual data satisfies this assumption the model tends to work well, and if not, it may not. This mismatch is called covariate shift when viewed from the sample (external) perspective. The same holds for neural networks from the structural (internal) perspective: because a network is composed of many layers, a sample's features propagate forward layer by layer, and if the input distribution of each layer keeps changing, the model either performs poorly or learns slowly. The paper calls this internal covariate shift.
Suppose $y$ is the sample label and $X = \{x_{1}, x_{2}, x_{3}, \ldots\}$ is the set of inputs to each layer as the sample $x$ propagates through the network.
Theoretically, the joint probability distribution $p(x, y)$ should be consistent with the joint distribution involving any layer input in $X$, e.g. $p(x, y) = p(x_{1}, y)$.
In practice, $p(x, y) = p(y|x)\,p(x)$. The conditional probability stays consistent, i.e. $p(y|x) = p(y|x_{1}) = p(y|x_{2}) = \ldots$, but because the input distribution of each layer changes during training, the marginal probabilities do not match, i.e. $p(x) \neq p(x_{1}) \neq p(x_{2}) \neq \ldots$. Moreover, as the network gets deeper, a slight change in the early layers can cause a huge change in the later layers.
The full BN algorithm proceeds as follows:
- Training is done in mini-batches; the mean and variance of the m samples in a batch are used to whiten (normalize) the layer inputs. Whitening removes feature correlation and scales the data onto a sphere, which not only speeds up the optimizer but may also improve optimization accuracy. An intuitive illustration:
The left shows the original feasible region before whitening; the right shows the feasible region after whitening;
- A learnable scale and shift are applied after normalization, so the original input can be restored whenever restoring it is more favourable to learning (somewhat reminiscent of residual networks):
Here the parameters $\gamma$ and $\beta$ need to be learned.
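To make the transform concrete, here is a minimal NumPy sketch of the BN forward pass for a mini-batch of fully connected activations (the function name, the epsilon value, and the toy data are illustrative, not taken from the paper's reference code):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Minimal BN sketch for a (batch, features) activation matrix."""
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

# Toy usage: normalized outputs have roughly zero mean and unit variance;
# choosing gamma ~ sqrt(var) and beta ~ mu would approximately restore x.
x = np.random.randn(32, 4) * 3.0 + 1.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))
```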
BN in convolutional neural networks
Convolutional networks use weight sharing, so BN is applied per feature map: each feature map (channel) has only one pair of $\gamma$ and $\beta$ to learn, as sketched below.
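As a sketch of what this per-feature-map normalization looks like (assuming NHWC layout; the function name is illustrative), the statistics are averaged over the batch and both spatial dimensions, leaving a single $\gamma$/$\beta$ pair per channel:

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """BN for conv activations in NHWC layout: one (gamma, beta) per channel."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)   # average over batch, height, width
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta                  # gamma, beta have shape (channels,)

feature_maps = np.random.randn(8, 35, 35, 320)   # toy NHWC mini-batch
out = batch_norm_conv(feature_maps, np.ones(320), np.zeros(320))
```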
GoogLeNet Inception V3
GoogLeNet Inception V3 is presented in "Rethinking the Inception Architecture for Computer Vision" (note that in this paper the authors call the network the v2 version; we follow the naming used in the final V4 paper). The highlights of this paper are:
- General design guidelines for network structure are proposed
- Convolution factorization is introduced to improve efficiency
- An efficient way of reducing the feature map size is introduced
Guidelines for network structure design
As mentioned earlier, exploring deep learning networks is largely an experimental science. From their experiments, the authors summarize a number of structural design guidelines, though honestly I don't think all of them are necessarily practical:
- Avoid representational bottlenecks, especially in the first few layers of the network
A neural network extracts features automatically, e.g. through multiple convolutional layers, and this guideline is intuitive and consistent with common sense: if feature extraction is too coarse at the beginning of the network, the details are lost and no later structure can express them effectively, no matter how fine it is. An extreme example: to recognize a planet in the universe, you would normally zoom out gradually, from houses and trees to oceans and continents to the whole universe; if you start directly at the scale of the universe, every planet is just a sphere and there is no way to tell which is Earth and which is Mercury. Therefore the feature map size should shrink gradually as the layers deepen, while the number of channels gradually increases so that the features can still be represented effectively.
The structure below violates this principle: it downsamples directly from 35x35x320 to 17x17x320 right at the start, so feature details are lost, and no amount of Inception-style feature extraction and recombination afterwards can recover them.
- For a given layer, more activation branches produce feature representations that are more decoupled from one another, yielding higher-order sparse features and accelerating convergence; note the 1x3 and 3x1 activation outputs:
- Used reasonably, dimensionality reduction does not hurt the network's feature representation and can accelerate convergence. A typical example is replacing one 5x5 convolution with two 3x3 convolutions: ignoring padding, the two 3x3 convolutions save $1 - \frac{3\times3 + 3\times3}{5\times5} = 28\%$ of the computation.
- Likewise, an nxn convolution can be factorized into a 1xn convolution followed by an nx1 convolution (a bit like matrix decomposition). For n = 3 the computation drops by $1 - \frac{3 + 3}{3\times3} = 33\%$, although on high-performance hardware this decomposition may raise the L1 cache miss rate.
- Optimize the network's computational cost by reasonably balancing its width and depth (this guideline in particular is not very actionable).
- For downsampling, the traditional approach is pooling followed by convolution. To avoid a representational bottleneck, this usually requires more convolution kernels: for example, given n feature maps of size dxd, k convolution kernels, and pooling with stride 2, avoiding the bottleneck often requires k to be about 2n. By introducing an Inception-style module, the computational cost can be reduced without creating a representational bottleneck; this can be implemented in the following two ways (a rough code sketch of factorization and grid reduction follows this list):
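Below is a rough TensorFlow 1.x sketch of the factorization and grid-reduction ideas from the guidelines above; it uses plain tf.layers rather than the official Slim implementation, and the filter counts and shapes are illustrative only:

```python
import tensorflow as tf

def factorized_5x5(x, filters):
    """Two stacked 3x3 convs cover a 5x5 receptive field with roughly 28% fewer
    multiplies than a single 5x5 conv (for equal channel counts)."""
    x = tf.layers.conv2d(x, filters, 3, padding='same', activation=tf.nn.relu)
    return tf.layers.conv2d(x, filters, 3, padding='same', activation=tf.nn.relu)

def factorized_nxn(x, filters, n=7):
    """Factorize an nxn conv into a 1xn conv followed by an nx1 conv."""
    x = tf.layers.conv2d(x, filters, (1, n), padding='same', activation=tf.nn.relu)
    return tf.layers.conv2d(x, filters, (n, 1), padding='same', activation=tf.nn.relu)

def reduction_block(x, conv_filters):
    """Shrink the grid without a representational bottleneck: a stride-2 conv
    branch in parallel with a stride-2 max pool, concatenated along channels."""
    conv = tf.layers.conv2d(x, conv_filters, 3, strides=2,
                            padding='valid', activation=tf.nn.relu)
    pool = tf.layers.max_pooling2d(x, pool_size=3, strides=2, padding='valid')
    return tf.concat([conv, pool], axis=-1)   # channels grow as the grid shrinks

# Toy usage: 35x35x320 -> 35x35x320 (factorized convs) -> 17x17x640 (reduction)
inputs = tf.placeholder(tf.float32, [None, 35, 35, 320])
reduced = reduction_block(factorized_5x5(inputs, 320), conv_filters=320)
```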
Label Smoothing
Labels for multi-class samples are usually one-hot, e.g. [0,0,0,1]. With a cross-entropy-like loss, the model learns to assign excessively confident probability to the ground-truth label, and because the logit of the ground-truth label becomes much larger than those of the other labels, the model overfits and generalization suffers. One solution is label smoothing, which turns the sample label into a probability distribution, making the label "soft", e.g. [0.1,0.2,0.1,0.6]; in the paper's experiments this reduced the top-1 and top-5 error rates by 0.2%.
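A minimal sketch of label smoothing as described above (the smoothing factor of 0.1 and the helper name are illustrative; in the usual formulation the one-hot target is mixed with a uniform distribution rather than hand-picked values like [0.1,0.2,0.1,0.6]):

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over the classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

y = np.array([0., 0., 0., 1.])
print(smooth_labels(y))   # [0.025, 0.025, 0.025, 0.925]
```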
Network Structure
GoogLeNet Inception V4
GoogLeNet Inception V4 and Inception-ResNet V1/V2 are presented in "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". The highlights of the paper are: a more effective GoogLeNet Inception V4 network structure is proposed, along with Inception structures combined with residual connections that reach accuracy comparable to V4 while training faster.
GoogLeNet Inception V4 network structure
GoogLeNet Inception-ResNet network structure
Code practices
The Slim module in TensorFlow has a complete implementation; for PaddlePaddle, you can write it yourself by referring to the Inception V1 code from the previous article.
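For reference, a hedged sketch of how the Slim implementation is commonly used (this assumes the nets package from the tensorflow/models research/slim directory is on the Python path and that a TF 1.x version with tf.contrib.slim is installed; exact function names and defaults may differ between versions):

```python
import tensorflow as tf
from nets import inception_v3  # from tensorflow/models, research/slim/nets

slim = tf.contrib.slim

images = tf.placeholder(tf.float32, [None, 299, 299, 3])
with slim.arg_scope(inception_v3.inception_v3_arg_scope()):
    # num_classes=1001 assumes the public ImageNet checkpoints' extra background class
    logits, end_points = inception_v3.inception_v3(
        images, num_classes=1001, is_training=False)
```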
Summary
This article is somewhat heavy on theory. It mainly covers the development of GoogLeNet's Inception module: batch normalization proposed in V2, the convolution factorization and the more general network design guidelines proposed in V3, and V4's combination with residual networks. In practice, you can run the same data through the different network structures, compare the results, and get a feel for how the loss decreases and the accuracy improves with each structure.
"Deep Learning Series" with Paddlepaddle and TensorFlow for Googlenet inceptionv2/v3/v4