Deep Learning Methods (10): Convolutional Neural Network Structure Changes -- Maxout Networks, Network in Network, Global Average Pooling


Reprinting is welcome; please credit the source: Bin's column, blog.csdn.net/xbinworld.
Technical exchange QQ group: 433250724. Students interested in algorithms and technology are welcome to join.

The next few posts will return to the topic of neural network architectures. In an earlier article, "Deep Learning Methods (5): Classic CNN Models -- LeNet, AlexNet, GoogLeNet, VGG, Deep Residual Learning", I went through the well-known classic CNN architectures, and at the end of that article I mentioned that "it is time to change the form of the convolution computation". A lot of work has shown that many simplifications, extensions, and changes to the basic CNN convolution can make it more effective: fewer parameters, less computation, better accuracy, and so on.

The next few articles will cover the following topics:

    1. Maxout Networks
    2. Network in Network
    3. Inception Net (Google)
    4. ResNeXt
    5. Xception (depth-wise convolution)
    6. Spatial Transformer Networks
    7. ...

This article first covers two works from 2013 and 2014: Maxout Networks and Network in Network. There is plenty of material about them online, but I believe many of those authors did not fully understand them; here I will describe them as clearly as possible. The focus of this article is Network in Network. The text is written by myself based on the papers and material collected from the web, and is intended to be understandable to any beginner.

1. Maxout Networks

Frankly speaking, maxout itself is not a change to the convolution structure. What it puts forward is a concept: a set of linear transformations followed by a max operation can fit an arbitrary convex function, including common activation functions (such as ReLU). It is closely related to the NIN discussed below, so maxout is introduced first.

Maxout appeared at ICML 2013. Goodfellow (who later proposed GANs) combined maxout with dropout and reported state-of-the-art recognition rates on four datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.

As can be seen from the paper, maxout is really a form of activation function. Normally, if the activation function is the sigmoid, the output of a hidden-layer node in the forward pass is:

    h_i(x) = sigmoid(x^T W_{..i} + b_i)

This is the case for an ordinary MLP. Here W is 2-dimensional; the subscript means taking the i-th column (which corresponds to the i-th output node), and the ellipsis before i stands for all rows of that column. With a maxout activation function, the output of a hidden-layer node becomes:

    h_i(x) = max_{j in [1,k]} z_{ij},   where z_{ij} = x^T W_{..ij} + b_{ij}

Here W is 3-dimensional, of size d*m*k, where d is the number of input-layer nodes, m is the number of hidden-layer nodes, and k means that each hidden-layer node is expanded into k intermediate nodes. Each of the k intermediate nodes is a linear output, and each maxout node takes the maximum over the outputs of its k intermediate nodes. The slide below, taken from a Japanese presentation on maxout, illustrates this:

What the picture shows is that the hidden node in the purple circle is expanded into 5 yellow intermediate nodes, over which the max is taken. Maxout's fitting ability is very strong: it can fit an arbitrary convex function. From left to right, the figures show it fitting ReLU, the absolute value function, and then two curved functions.

The authors also prove this conclusion mathematically: just two maxout nodes, taking their difference, can fit an arbitrary continuous function, provided the number of intermediate nodes can be arbitrarily large, as shown in the figure; see paper [1] for details. A strong assumption of maxout is that the learned function is convex over the input space; is this assumption always justified? Although ReLU is a special case of maxout, the point is not really ReLU itself: what is being learned is the nonlinear transformation, namely a combination of several linear transformations plus a max operation.
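To make the "k intermediate nodes per hidden node" idea concrete, here is a minimal sketch of a maxout hidden layer in PyTorch. The class and parameter names (MaxoutLayer, d, m, k) are just illustrative choices, not from the paper.

```python
import torch
import torch.nn as nn

class MaxoutLayer(nn.Module):
    """A maxout hidden layer: k linear transformations followed by an element-wise max."""
    def __init__(self, d, m, k):
        super().__init__()
        self.m, self.k = m, k
        # A single linear map produces all m*k intermediate nodes
        # (the weight tensor W of size d*m*k from the text, flattened).
        self.linear = nn.Linear(d, m * k)

    def forward(self, x):
        z = self.linear(x)              # (batch, m*k)
        z = z.view(-1, self.m, self.k)  # (batch, m, k): k intermediate nodes per hidden node
        h, _ = z.max(dim=2)             # take the max over the k intermediate nodes
        return h                        # (batch, m); no extra nonlinearity is needed

# Example: 256 inputs, 128 maxout hidden nodes, each expanded into k=5 intermediate nodes.
layer = MaxoutLayer(d=256, m=128, k=5)
out = layer(torch.randn(32, 256))       # -> shape (32, 128)
```

Note that the max itself is the nonlinearity, so no ReLU or sigmoid is applied afterwards.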

2. Network in Network

OK, that was maxout [1]. Next, the focus is Network in Network, a 2014 work by Min Lin from Shuicheng Yan's group at NUS in Singapore. To be honest, whether intentionally or not, several ideas in this paper, including the 1*1 convolution and global average pooling, later became standard building blocks of network design; it is a genuinely original piece of work.


Figure 1: a conventional convolution layer (left) vs. the mlpconv layer (right)

First look at traditional convolution (Figure 1, left), which the paper writes as equation (1):

    f_{i,j,k} = max(w_k^T x_{i,j}, 0)

Many readers do not look carefully at what the subscripts mean, which causes confusion. x_{i,j} denotes the patch inside one convolution window (typically of size k_h * k_w * input_channels), k is the index of the k-th kernel, and the activation function is ReLU. It does not mean there is only one kernel; k refers to an arbitrary kernel.

Now look at the mlpconv layer proposed in this paper, i.e. the "network in network" (Figure 1, right). It simply adds a small fully connected MLP on top of the convolution; what does that mean? The authors call it a "cascaded cross channel parametric pooling" layer, a parametric pooling cascaded across channels in which:

Each pooling layer performs weighted linear recombination on the input feature maps.

Equation (2) in the paper makes this very clear:

    f^1_{i,j,k_1} = max((w^1_{k_1})^T x_{i,j} + b_{k_1}, 0)
    ...
    f^n_{i,j,k_n} = max((w^n_{k_n})^T f^{n-1}_{i,j} + b_{k_n}, 0)

The first layer is still the traditional convolution. After that convolution, for each pixel f_{i,j} of the output feature map, another MLP is applied across all of its channels, with ReLU as the activation. The superscript n denotes the n-th MLP layer, and k_n is an index because the n-th layer has many kernels, just as in equation (1) above. With that, let us look at the whole NIN network below:

Look at the first mlpconv of NIN. Originally, an 11*11*3*96 convolution (11*11 kernels, 96 output maps) turns one patch into 96 values: the 96 channels of one pixel in the output feature map. Now an extra MLP layer is added, which fully connects these 96 values and again outputs 96 values. The clever part is that this new MLP layer is equivalent to a 1*1 convolution layer, which is very convenient for network design: simply append a 1*1 convolution layer after the original convolution layer, and the output size does not change. Note that every convolution layer is followed by a ReLU. So effectively the network becomes deeper, and in my understanding this extra depth is the main factor behind the improvement.
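As an illustration, here is a minimal PyTorch sketch of such an mlpconv-style block built from an ordinary convolution followed by 1*1 convolutions. The channel count (96) and kernel size (11*11) follow the example above; the stride and everything else are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# An mlpconv-style block: an ordinary convolution followed by 1x1 convolutions,
# each with ReLU. The 1x1 convolutions act as a small MLP shared across
# all spatial positions of the feature map.
mlpconv = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),  # ordinary 11x11 convolution, 96 output maps
    nn.ReLU(inplace=True),
    nn.Conv2d(96, 96, kernel_size=1),            # "MLP" layer 1 == 1x1 convolution
    nn.ReLU(inplace=True),
    nn.Conv2d(96, 96, kernel_size=1),            # "MLP" layer 2 == 1x1 convolution
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 224, 224)
y = mlpconv(x)
print(y.shape)  # the 1x1 convolutions leave the spatial size unchanged
```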

Example explanation

Suppose there is a 3x3 input patch, denoted X; the convolution kernel is also 3x3, with weights denoted W; the input has c1 channels and the output has c2 channels. The picture below is my own hand drawing, rather rough, please bear with me :)

    • For an ordinary convolution layer, convolve X with W directly to get a single 1*1 value; with c2 kernels the output is 1*1*c2;
    • For maxout, there are k different 3x3 W's (k is chosen freely); each is convolved with X to give k 1x1 outputs, and taking the max over these k values gives one 1*1 value; this is done for each output channel;
    • For NIN, there are also k 3x3 W's (k again chosen freely); each is convolved to give k 1x1 outputs, all of which go through ReLU, are then linearly recombined, and the result goes through ReLU again. (This process is equivalent to a small fully connected network; see the sketch after this list.)
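Here is a minimal sketch, under assumed shapes (a single-channel 3x3 patch, k=5 kernels, made-up second-level weights v), of the three cases just listed:

```python
import torch

torch.manual_seed(0)
X = torch.randn(3, 3)              # a single-channel 3x3 input patch
k = 5
W = torch.randn(k, 3, 3)           # k candidate 3x3 kernels (k is chosen freely)
b = torch.randn(k)

z = (W * X).sum(dim=(1, 2)) + b    # k linear responses, each a single 1x1 value

out_conv = z[0]                    # ordinary convolution: the response of one kernel
out_maxout = z.max()               # maxout: the max over the k linear responses

# NIN: ReLU the k responses, then linearly recombine them and apply ReLU again
# (v plays the role of the second-level weights of the tiny MLP; it is illustrative).
v = torch.randn(k)
out_nin = torch.relu(torch.relu(z) @ v)

print(out_conv.item(), out_maxout.item(), out_nin.item())
```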

At this point a key concept has been established: a fully connected layer can be converted into an equivalent 1*1 convolution. This is used in many later networks, for example FCN [5].
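Below is a small sketch (with made-up sizes) showing this equivalence numerically: a fully connected layer applied to each pixel's channel vector and a 1*1 convolution carrying the same weights produce the same output.

```python
import torch
import torch.nn as nn

c_in, c_out = 96, 96
fc = nn.Linear(c_in, c_out)                     # a fully connected layer over the channels
conv1x1 = nn.Conv2d(c_in, c_out, kernel_size=1)

# Copy the FC weights into the 1x1 convolution so the two compute the same function.
with torch.no_grad():
    conv1x1.weight.copy_(fc.weight.view(c_out, c_in, 1, 1))
    conv1x1.bias.copy_(fc.bias)

x = torch.randn(2, c_in, 7, 7)                  # a feature map: batch 2, 96 channels, 7x7
out_conv = conv1x1(x)
# Apply the FC layer to every pixel's channel vector independently.
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(out_conv, out_fc, atol=1e-5))   # True: the two are equivalent
```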

Global Average Pooling

Global average pooling is also used in the GoogLeNet network, and that use was in fact inspired by Network in Network. Global average pooling is typically placed at the end of the network to replace the fully connected (FC) layers. Why replace FC? Because in networks such as AlexNet and VGG, the FC layers sandwiched between the convolutions and the softmax have several drawbacks:

(1) The number of parameters is very large; sometimes more than 80~90% of a network's parameters sit in the last few FC layers;
(2) It overfits easily; much of the overfitting in CNNs comes from the final FC layers, because they have too many parameters but no suitable regularizer, and overfitting weakens the generalization ability of the model;
(3) A practical point the paper does not mention: FC layers require fixed input and output sizes, so images must be resized to a given size, while real images come in all sizes, which makes FC layers inconvenient.

The authors propose global average pooling, which is very simple: take the overall average of each individual feature map. The network is arranged so that the number of output feature maps equals the number of classes, so the softmax can follow directly.
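A minimal sketch of the idea (the channel count, class count, and spatial sizes are made up): the last convolution produces one feature map per class, global average pooling reduces each map to a single number, and the result feeds the softmax.

```python
import torch
import torch.nn as nn

num_classes = 10

# Classification head in the NIN style: one feature map per class,
# then global average pooling turns each map into a single value.
head = nn.Sequential(
    nn.Conv2d(96, num_classes, kernel_size=1),  # one feature map per class
    nn.AdaptiveAvgPool2d(1),                    # global average pooling: each map -> one number
    nn.Flatten(),                               # (batch, num_classes, 1, 1) -> (batch, num_classes)
)

x = torch.randn(4, 96, 6, 6)   # feature maps coming out of the last mlpconv block
logits = head(x)               # shape (4, 10); works for any spatial size of x
probs = torch.softmax(logits, dim=1)
```

Because the pooling averages over whatever spatial size it receives, this head also removes the fixed-input-size constraint mentioned in point (3) above.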

The authors note that the benefits of global average pooling are:

    • Because the number of final feature maps is forced to equal the number of classes, each feature map can be interpreted as a confidence map for its class;
    • It has no parameters, so it does not contribute to overfitting;
    • Averaging over a whole plane makes use of spatial information, which makes the result more robust to spatial changes in the image.

Dropout

Finally, a brief word on dropout, which Hinton proposed in "Improving neural networks by preventing co-adaptation of feature detectors" [9]. The method is: during training, randomly set a proportion p (e.g. 0.5) of a hidden layer's node outputs to 0, and do not update the weights connected to those zeroed nodes in that training iteration. Dropout is a very powerful regularization method. Why? Because in each iteration some weights are not updated, which reduces overfitting, and each training pass can be seen as using a different network model; the final result is therefore effectively a mixture of many models, and a mixture of models usually generalizes better. Dropout is generally applied to the FC layers, mainly because FC layers overfit easily.
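As a small illustration (a hand-rolled sketch of the training-time behaviour described above, not the exact implementation of any framework or of the original paper), dropout can be written as a random binary mask applied to a layer's outputs:

```python
import torch

def dropout_train(h, p=0.5):
    """Training-time dropout: zero each hidden activation with probability p.

    The surviving activations are scaled by 1/(1-p) ("inverted dropout"),
    a common convention so that no rescaling is needed at test time.
    """
    mask = (torch.rand_like(h) > p).float()   # 1 with probability 1-p, otherwise 0
    return h * mask / (1.0 - p)

h = torch.randn(4, 8)           # outputs of some hidden (e.g. FC) layer
h_dropped = dropout_train(h)    # roughly a fraction p of the entries are zeroed
```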

OK, that is it for this article. Students interested in deep learning are welcome to share and discuss; if you have questions, you can leave a comment below.

Resources

[1] Maxout Networks, 2013
[2] http://www.jianshu.com/p/96791a306ea5
[3] Deep Learning: 45 (A simple understanding of Maxout)
[4] Paper notes: "Maxout Networks" && "Network in Network"
[5] Fully Convolutional Networks for Semantic Segmentation, 2015
[6] http://blog.csdn.net/u010402786/article/details/50499864
[7] Deep Learning (26): Network in Network learning notes
[8] Network in Network, 2014
[9] Improving neural networks by preventing co-adaptation of feature detectors
