**GoogLeNet Inception V1**
This is the earliest version of GoogLeNet, which appeared in the 2014 paper "Going Deeper with Convolutions". It is written "GoogLeNet" rather than "GoogleNet"; the paper says this is a salute to the early LeNet.

**Motivation**
With the rapid development of deep learning and neural networks, people are no longer focused only on more hardware, larger datasets, and larger models, but pay more attention to new ideas, new algorithms, and architectural improvements.

In general, the most straightforward way to improve network performance is to increase network depth and width, but this means a huge number of parameters. A large number of parameters is prone to **overfitting** and also greatly increases the **computational cost**.

The paper argues that the fundamental way to address both shortcomings is to convert fully connected layers, and even general convolutions, into sparse connections. On the one hand, the connections of real biological nervous systems are sparse; on the other hand, [1] shows that for large-scale sparse neural networks, an optimal network can be built layer by layer by analyzing the statistical properties of activations and clustering highly correlated outputs. **This indicates that a bloated sparse network can be simplified without sacrificing performance.** Although the mathematical proof holds only under strict conditions, the Hebbian principle strongly supports it: neurons that fire together, wire together.

Earlier, in order to break network symmetry and improve learning ability, traditional networks used random sparse connections. However, computer software and hardware are very inefficient at computing on non-uniform sparse data, so fully connected layers were re-enabled in AlexNet to better optimize parallel computation.

So the question now is whether there is a way to **keep the network structure sparse while taking advantage of the high computational performance of dense matrices**. A large body of literature indicates that clustering sparse matrices into denser sub-matrices can improve computing performance, and the structure named Inception is proposed to achieve this goal.

**Architectural Details**
The main idea of the **Inception** structure is to approximate the optimal local sparse structure with readily available dense components.

The author first proposes the following basic structure:

A few notes on this structure:

**1.** Using convolution kernels of different sizes means receptive fields of different sizes; the final concatenation fuses features at different scales;

**2.** The kernel sizes are 1, 3, and 5 mainly for easy alignment: with stride=1, setting pad=0, 1, and 2 respectively yields feature maps of the same spatial dimensions, which can then be concatenated directly;

**3.** The article notes that many works have shown pooling to be very effective, so a pooling path is embedded in the Inception module as well;

**4.** The deeper into the network, the more abstract the features and the larger the receptive field each feature covers, so the proportion of 3x3 and 5x5 convolutions increases with depth.
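The alignment in point 2 can be checked with the standard convolution output-size formula; a quick sketch with a hypothetical 28x28 input (the formula is general, the input size is just an example):

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    # Standard formula: floor((in + 2*pad - kernel) / stride) + 1
    return (in_size + 2 * pad - kernel) // stride + 1

# Kernel sizes 1, 3, 5 with pad 0, 1, 2 (stride=1) all preserve the spatial
# size, so the branch outputs can be concatenated along the channel axis.
for kernel, pad in [(1, 0), (3, 1), (5, 2)]:
    print(kernel, conv_output_size(28, kernel, stride=1, pad=pad))
```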

**However, the use of 5x5 convolution kernels still results in a huge amount of computation.** For this reason, following NIN [2], the article uses 1x1 convolution kernels to **reduce dimensionality**.

For example, suppose the output of the previous layer is 100x100x128. After a 5x5 convolution layer with 256 outputs (stride=1, pad=2), the output is 100x100x256, and the convolution layer has 128x5x5x256 parameters. If the previous output instead passes through a 1x1 convolution layer with 32 outputs and then the 5x5 convolution layer with 256 outputs, the final output is still 100x100x256, but the number of convolution parameters is reduced to 128x1x1x32 + 32x5x5x256, about 4 times fewer.
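The parameter arithmetic in this example can be verified directly (weights only, biases ignored):

```python
# Direct 5x5 convolution: 128 input channels -> 256 output channels.
direct_5x5 = 128 * 5 * 5 * 256
# 1x1 reduction to 32 channels first, then the 5x5 convolution to 256 channels.
with_reduction = 128 * 1 * 1 * 32 + 32 * 5 * 5 * 256

print(direct_5x5, with_reduction, round(direct_5x5 / with_reduction, 2))
# 819200 208896 3.92
```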

The improved Inception module with dimensionality reduction looks like this:

**GoogLeNet**
The overall structure of GoogLeNet is as follows:

A few notes:

**1.** GoogLeNet clearly adopts a modular structure, which makes it convenient to add and modify components;

**2.** The network replaces the final fully connected layer with average pooling, an idea from NIN; it turns out this improves top-1 accuracy by 0.6%. In practice a fully connected layer is still added at the end, mainly for everyone's convenience when fine-tuning;

**3.** Although the fully connected layers are removed, dropout is still used in the network;

**4.** To avoid vanishing gradients, the network adds 2 auxiliary softmax classifiers that help conduct gradients to earlier layers. The article says the losses of these two auxiliary classifiers should be weighted by an attenuation factor, but the Caffe model does not apply any attenuation. In addition, the two auxiliary softmax branches are removed at actual test time.
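The global average pooling in point 2 simply averages each channel's spatial map down to a single value; a minimal numpy sketch (the 1024x7x7 shape matches GoogLeNet's final feature map):

```python
import numpy as np

def global_average_pool(feature_maps):
    # (channels, height, width) -> (channels,): one value per channel.
    return feature_maps.mean(axis=(1, 2))

feat = np.random.rand(1024, 7, 7)   # final GoogLeNet feature-map shape
pooled = global_average_pool(feat)
print(pooled.shape)                 # (1024,), fed to the classifier
```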

Here is a relatively clear structure diagram:

**Conclusion**
GoogLeNet was carefully prepared by the Google team for the ILSVRC 2014 competition. To achieve the best performance, in addition to the network structure above, they also did a lot of auxiliary work, including averaging multiple trained models and evaluating on crops of the image at different scales. These are detailed in the experimental section of the article.

The main idea of the paper is to approximate the optimal sparse structure with dense building blocks, improving performance without increasing computational cost. The caffemodel of GoogLeNet is only about 50 MB, yet its performance is excellent.

**GoogLeNet Inception V2**
V2 introduced Batch Normalization (BN): http://blog.csdn.net/app_12062011/article/details/57083447. In addition, backpropagation through BN: http://www.jianshu.com/p/4270f5acc066. Softmax gradient calculation: http://blog.csdn.net/u014313009/article/details/51045303

**GoogLeNet Inception V3**
GoogLeNet, with its excellent performance, has been studied and used by many researchers, so the Google team explored it further, producing upgraded versions. The version described in this section is V3, from the paper "Rethinking the Inception Architecture for Computer Vision".

**Introduction**
Since 2014, building deeper networks has become mainstream, but larger models also make computation less efficient. Here, the article tries to find a way to **expand the network while keeping computation as efficient as possible**.

First of all, among the architectures that appeared in the same period as GoogLeNet V1, only VGGNet achieved comparable performance, and both have been successfully applied in many areas beyond image classification. In contrast, GoogLeNet's computational efficiency is significantly higher than VGGNet's: it has only about 5 million parameters, equivalent to 1/12 of AlexNet (the GoogLeNet caffemodel is about 50 MB, while the VGGNet caffemodel is more than 600 MB).

GoogLeNet's performance is good, but if you want to build a larger network by simply scaling up the Inception structure, computation immediately becomes much more expensive. In addition, the V1 paper gives no clear description of the considerations behind building the Inception structure. Therefore, in this article, the authors **first give some general criteria and optimization methods that have been proved effective for scaling up networks**. These guidelines and methods apply to, but are not limited to, Inception structures.

**General Design Principles**
The following criteria are derived from a large number of experiments and therefore involve some speculation, but in practice they have proved largely effective.

**1. Avoid representational bottlenecks, especially early in the network.** The forward flow of information should not pass through a layer that compresses it too heavily, i.e., a representational bottleneck. From input to output, the width and height of the feature maps should shrink gradually, not all at once. For example, opening with kernel=7, stride=5 is obviously inappropriate.
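The kernel=7, stride=5 example can be made concrete with the output-size formula (the 112x112 input size is hypothetical):

```python
def conv_output_size(in_size, kernel, stride, pad=0):
    # Standard formula: floor((in + 2*pad - kernel) / stride) + 1
    return (in_size + 2 * pad - kernel) // stride + 1

# A single kernel=7, stride=5 layer collapses 112x112 down to 22x22 in one
# step: a sudden, severe spatial compression, i.e. a representational bottleneck.
print(conv_output_size(112, 7, 5))  # 22
```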

In addition, the number of output channels (num_output per layer) should generally increase gradually; otherwise the network will be difficult to train. (The feature dimension does not truly represent the amount of information; it only serves as a rough estimate of it.)

This situation generally occurs at pooling layers: the feature map becomes smaller after pooling, but useful information must not be lost; the funnel-shaped structure of the network should not create a bottleneck. The solution is the feature map reduction method proposed by the author, a more elaborate form of pooling.

**2. High-dimensional representations are easier to process.** High-dimensional features are easier to disentangle and speed up training.

**3. Spatial aggregation can be done over low-dimensional embeddings without worrying about losing much information.** For example, before a 3x3 convolution, the input can be reduced in dimension without serious consequences. If the information compresses easily, training is also accelerated.

**4. Balance the width and depth of the network.**

These principles are not meant to be applied directly and mechanically to improve network quality, but to serve as guidance in the larger design context.

**Factorizing Convolutions with Large Filter Size**
A large convolution kernel gives a larger receptive field, but it also means more parameters; for example, a 5x5 kernel has 25/9 ≈ 2.78 times as many parameters as a 3x3 kernel. For this reason, the authors propose that a small network of 2 consecutive 3x3 convolution layers (stride=1) can replace a single 5x5 convolution layer, maintaining the receptive field while reducing the number of parameters, as shown below. (This was actually already raised in VGG.)
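The saving can also be counted per layer rather than per kernel (the channel count here is illustrative, not from the paper):

```python
c = 64  # assume c input and c output channels for both variants

params_one_5x5 = c * 5 * 5 * c        # a single 5x5 layer
params_two_3x3 = 2 * (c * 3 * 3 * c)  # two stacked 3x3 layers

# Two stacked 3x3 layers (stride=1) cover a 5x5 receptive field: 3 + (3-1) = 5,
# with only 18/25 of the parameters of the single 5x5 layer.
print(round(params_one_5x5 / params_two_3x3, 2))  # 1.39
```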

This raises 2 questions:

**1. Does this substitution reduce expressive power?**

Extensive experiments later in the paper show that expressive power is not lost.

**2. Should a non-linear activation be added after the first 3x3 convolution?**

The authors also ran a comparison, which suggests that **adding a non-linear activation improves performance**.

From the above, a large convolution kernel can be replaced by a series of 3x3 kernels; but could the decomposition go a little further? The paper considers the **nx1 convolution kernel**.

Replacing a 3x3 convolution, as shown:

Therefore, any nxn convolution can be replaced by a 1xn convolution followed by an nx1 convolution. In practice, the authors found that **this decomposition does not work well in the early layers of the network; it works better on medium-sized feature maps** (for an mxm feature map, m is recommended to be between 12 and 20).
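The saving of the asymmetric factorization is easy to count (n=7 as in the 17x17 stage; the channel count is illustrative):

```python
n, c = 7, 32  # kernel extent and an assumed channel count (in = out = c)

params_nxn = c * n * n * c                           # a single nxn layer
params_1xn_nx1 = c * (1 * n) * c + c * (n * 1) * c   # 1xn followed by nx1

# The factorized pair needs 2n weights per position instead of n^2,
# i.e. a fraction 2/n of the original parameters.
print(round(params_1xn_nx1 / params_nxn, 2))  # 0.29
```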

To summarize:

**(1)** Figure 4 is the Inception structure used in GoogLeNet V1;

**(2)** Figure 5 replaces the large convolution kernel with a sequence of 3x3 convolutions;

**(3)** Figure 6 replaces the large convolution kernel with nx1 convolutions; here n=7 to handle 17x17 feature maps. This structure is formally used in GoogLeNet V2. The asymmetric kernels decompose the 2-D convolution into two 1-D computations, which improves computational speed.

**Optimizing the Auxiliary Classifiers**
The author finds that the auxiliary classifiers in V1 are somewhat problematic: they do not accelerate convergence at the beginning of training, and only slightly improve network accuracy near the end of training.

Auxiliary classifiers

So Szegedy removed the first auxiliary classifier! He also noted that auxiliary classifiers can act as a regularizer. The original text reads as follows:

Instead, we argue that the auxiliary classifiers act as regularizer. This is supported by the fact that the main classifier of the network performs better if the side branch is batch-normalized or has a dropout layer. This also gives a weak supporting evidence for the conjecture that batch normalization acts as a regularizer.

**Optimizing Pooling**
Traditionally, to prevent information loss, an expansion layer is added before pooling, as in the right half of the figure below:

Inefficient grid size reduction

The problem is that this increases the computational cost, so Szegedy came up with the following pooling structure.

Efficient Grid Size reduction

As you can see, Szegedy uses two parallel branches to perform the grid size reduction: the conv and pool branches in the right half. The left half shows the internal structure of the right part.
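Following the paper's cost accounting for a d×d grid with k filters, the motivation is cheaper computation without a representational bottleneck; a sketch using the paper's 35×35, 320-filter example:

```python
d, k = 35, 320  # grid size and filter count from the paper's example

# Expand k -> 2k channels at full resolution, then stride-2 pool:
cost_conv_then_pool = 2 * d**2 * k**2
# Stride-2 pool first, then expand at half resolution (much cheaper, but the
# pooling step itself becomes a representational bottleneck):
cost_pool_then_conv = 2 * (d // 2)**2 * k**2

# The parallel conv+pool module targets roughly the cheaper cost while
# avoiding the bottleneck.
print(round(cost_conv_then_pool / cost_pool_then_conv, 2))  # 4.24
```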

Why do it this way? That is, how was this structure arrived at? Szegedy does not say; perhaps that is the charm of deep learning.

**Optimizing Labels**
Szegedy spends nearly a page describing label smoothing, which shows how hard this method is to grasp.

In deep learning, labels are generally one-hot vectors indicating the single output class of the classifier. Such a label is a bit like the impulse function in signal processing, also called a "Dirac delta": it is 1 at one position and 0 everywhere else.

The impulse-like nature of these labels can lead to two adverse effects: one is overfitting, the other is reduced adaptability of the network. I did not fully understand this passage, so the original is attached:

First, it may result in over-fitting: if the model learns to assign full probability to the groundtruth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient, reduces the ability of the model to adapt.

After explaining these two shortcomings, Szegedy adds a sentence saying the bad results arise because the network becomes too confident about its predictions:

Intuitively, this happens because the model becomes too confident on its predictions.

OK, so how is label smoothing actually implemented? With the following formula:

Label smoothing: q'(k) = (1 − ε) · δ(k, y) + ε / K, where y is the ground-truth class, δ(k, y) is 1 when k = y and 0 otherwise, ε is the smoothing factor, and K is the number of classes.

To make it easier to understand, converted into Python code it is:

`new_labels = (1.0 - label_smoothing) * one_hot_labels + label_smoothing / num_classes`

In the implementation, Szegedy set label_smoothing = 0.1 and num_classes = 1000. Label smoothing improved network accuracy by 0.2%.
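The one-liner above, expanded into a runnable numpy sketch (10 classes instead of 1000, purely for readability):

```python
import numpy as np

def smooth_labels(one_hot, label_smoothing=0.1):
    # Mix the one-hot distribution with the uniform distribution.
    num_classes = one_hot.shape[-1]
    return (1.0 - label_smoothing) * one_hot + label_smoothing / num_classes

one_hot = np.zeros(10)
one_hot[3] = 1.0
smoothed = smooth_labels(one_hot)
print(round(smoothed[3], 4), round(smoothed[0], 4))  # 0.91 0.01, still sums to 1
```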

As I understand it, label smoothing just softens the originally abrupt one_hot_labels a little: the peak is lowered a bit and every other class is raised a bit, so the network avoids the drawbacks of over-learning the hard labels.

- [1] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.
- [2] Min Lin, Qiang Chen, and Shuicheng Yan. Network in Network. CoRR, abs/1312.4400, 2013.

V4 will be added later, after covering ResNet, because the author proposed V4 by combining Inception with ResNet: "[V4] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning", 3.08% test error, http://arxiv.org/abs/1602.07261
