Convolutional Neural Networks

Source: Internet
Author: User

Convolutional Neural Networks (CNNs / ConvNets)

Convolutional neural networks are very similar to ordinary neural networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally applies a nonlinearity to produce its output. The whole network still expresses a single differentiable score function: raw image pixels go in at one end, and class scores come out at the other. ConvNets still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer, and all the tips and tricks we developed for ordinary neural networks still apply.

So what is different about convolutional neural networks? A ConvNet architecture makes the explicit assumption that the inputs are images, which allows us to encode certain properties into the network architecture. These make the forward function more efficient to implement and vastly reduce the number of parameters in the network.

Architecture Overview

Recap: a conventional neural network receives an input (a single vector) and transforms it through a series of hidden layers. Each hidden layer consists of a set of neurons, where every neuron is fully connected to all neurons in the previous layer, and where neurons within a single layer are completely independent of one another and share no connections.

Conventional neural networks do not scale well to full-sized images. In CIFAR-10, for example, the images are only \(32\times32\times3\) (32 wide, 32 high, 3 color channels), so a single neuron in the first hidden layer of a fully-connected network would have \(32\times32\times3 = 3072\) weights. This number still seems manageable, but clearly the fully-connected structure does not scale to larger images. For example, an image of the more respectable size \(200\times200\times3\) would give each neuron 120,000 weights, and we would want many such neurons, so the number of parameters would add up very quickly. Clearly, full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.

3D volumes of neurons. Convolutional neural networks take full advantage of the fact that the input consists of images, and constrain the architecture in a sensible way. Unlike a conventional neural network, the neurons of a ConvNet are arranged in 3 dimensions: width, height, and depth. For example, an input image in CIFAR-10 is an input volume of activations with dimensions \(32\times32\times3\). The neurons in a layer connect only to a small region of the layer before it, rather than being fully connected. Moreover, the final output layer for CIFAR-10 has dimensions \(1\times1\times10\), because the ConvNet reduces the full image into a single vector of class scores arranged along the depth dimension. The following is a visual diagram:


Left: A conventional 3-layer neural network.
Right: A ConvNet arranges its neurons in 3 dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms a 3D input volume into a 3D output volume of neuron activations. In this example, the red input layer holds the input image, so its width and height equal the dimensions of the image, and its depth is 3 (the RGB channels).

A convolutional neural network consists of a number of layers, and every layer has a simple API: it transforms an input 3D volume into an output 3D volume with some differentiable function that may or may not have parameters.

Layers used to build ConvNets

As described above, a simple convolutional neural network is a sequence of layers, each of which transforms one 3D volume of activations into another through a differentiable function. We use three main types of layers to build ConvNet architectures: the convolutional layer, the pooling layer, and the fully-connected layer. We build a ConvNet architecture by stacking these layers.

As an example, a simple convolutional neural network for the CIFAR-10 classification task could have the architecture
\[\text{input}\to\text{conv}\to\text{relu}\to\text{pool}\to\text{fc}\]

    • INPUT [\(32\times32\times3\)]: the input layer holds the raw pixel values of the image, in this case an image of width 32, height 32, and 3 color channels
    • The CONV layer computes the output of neurons connected to local regions of the input, each computing a dot product between its weights and the small region of the input volume it is connected to. If we use 12 filters, each of size \(1\times1\), the result is a [\(32\times32\times12\)] output volume
    • The RELU layer applies an elementwise activation function, such as \(\max(0,x)\); it leaves the size of the volume unchanged at [\(32\times32\times12\)]
    • The POOL layer performs a downsampling operation along the spatial dimensions (width, height), resulting in a volume of [\(16\times16\times12\)]
    • The FC (fully-connected) layer computes the class scores, resulting in a volume of size [\(1\times1\times10\)]. As the name implies, each neuron in a fully-connected layer, as in an ordinary neural network, is connected to all the neurons in the previous volume
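As a sanity check on the sizes above, the shape arithmetic of this example can be sketched in plain Python. This is a minimal sketch that tracks only volume shapes, not the actual computation; the helper names `conv_shape` and `pool_shape` are our own:

```python
# Track how each layer of the INPUT -> CONV -> RELU -> POOL -> FC example
# transforms the (width, height, depth) shape of the volume.

def conv_shape(shape, num_filters, f=1, stride=1, pad=0):
    """Output shape of a conv layer on a (width, height, depth) volume."""
    w, h, _ = shape
    out_w = (w - f + 2 * pad) // stride + 1
    out_h = (h - f + 2 * pad) // stride + 1
    return (out_w, out_h, num_filters)

def pool_shape(shape, f=2, stride=2):
    """Output shape of a pooling layer; depth is unchanged."""
    w, h, d = shape
    return ((w - f) // stride + 1, (h - f) // stride + 1, d)

shape = (32, 32, 3)                 # INPUT
shape = conv_shape(shape, 12, f=1)  # CONV, 12 filters of size 1x1 -> (32, 32, 12)
# RELU leaves the shape unchanged:                                  (32, 32, 12)
shape = pool_shape(shape)           # POOL, 2x2 with stride 2     -> (16, 16, 12)
num_scores = 10                     # FC                          -> (1, 1, 10)
print(shape, num_scores)
```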

In this way, ConvNets transform the original image layer by layer from the raw pixel values into the final class scores. Note that some layers contain parameters while others do not. The parameters of the convolutional and fully-connected layers are trained with gradient descent, so that the class scores the ConvNet computes are consistent with the labels of the images in the training set.

We now describe the individual layers in detail, including their hyperparameters and connectivity.

Convolutional layer

The convolutional layer is the core building block of a convolutional network, and it does most of the computational heavy lifting.
A convolutional layer contains a set of learnable filters. Each filter is small spatially (along width and height) but extends through the full depth of the input volume; a typical filter on the first layer of a ConvNet might have size \(5\times5\times3\).

During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute the dot product between the filter and the input at every position. As we slide the filter over the width and height of the input volume, we produce a 2-D activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. In the CIFAR-10 case, we have an entire set of filters in each CONV layer, and each of them produces a separate 2-D activation map. We stack these activation maps along the depth dimension to produce the output volume.
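A naive sketch of this forward pass for a single filter (our own illustrative code, assuming NumPy; no padding, stride 1) makes the "dot product at every position" concrete:

```python
import numpy as np

# Slide one (f, f, D) filter over an (H, W, D) input volume: at every spatial
# position, take the dot product with the local patch across the full input
# depth, collecting a 2-D activation map.
def conv_single_filter(volume, filt, bias=0.0):
    """volume: (H, W, D) input; filt: (f, f, D) filter -> 2-D activation map."""
    h, w, d = volume.shape
    f = filt.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for y in range(h - f + 1):
        for x in range(w - f + 1):
            patch = volume[y:y + f, x:x + f, :]      # local receptive field
            out[y, x] = np.sum(patch * filt) + bias  # dot product + bias
    return out

volume = np.random.randn(7, 7, 3)
filt = np.random.randn(3, 3, 3)
activation_map = conv_single_filter(volume, filt)
print(activation_map.shape)  # (5, 5)
```

A real CONV layer would repeat this for every filter and stack the resulting maps along the depth dimension.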

Local connectivity. When dealing with high-dimensional inputs such as images, it is impractical to connect each neuron to all neurons in the previous volume. Instead, we connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter of the neuron called the receptive field. The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is worth emphasizing again that we treat the spatial dimensions (width and height) and the depth dimension differently: the connections are local in space, but always extend along the entire depth of the input volume.


Left: The red volume is the input, and the blue volume is an example first convolutional layer. Each neuron in the convolutional layer is connected only to a local region of the input volume spatially, but to the full depth (i.e. all color channels).
Right: The neurons themselves are unchanged: they still compute a dot product between their weights and the input, followed by a nonlinear activation function \(f(x)\), but their connectivity is now restricted to a local region in space.

Spatial arrangement. We have discussed the connectivity of each neuron in the convolutional layer to the input volume, but not how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the depth, the stride, and the amount of zero-padding.

    1. First, the depth of the output volume is a hyperparameter: it corresponds to the number of filters we use. We refer to a set of neurons that all look at the same region of the input as a depth column
    2. Second, we must specify the stride with which we slide the filter. When the stride is 1, we move the filter one pixel at a time; when the stride is 2, the filter jumps 2 pixels at a time; and so on. In practice, strides of 3 or more are rarely used. Larger strides produce smaller output volumes
    3. Sometimes it is convenient to pad the input volume with zeros around the border. The amount of zero-padding is also a hyperparameter; its nice property is that it lets us control the spatial size of the output volume

Let the receptive field size of the convolutional layer be \(f\), the stride of the filter be \(s\), the amount of zero-padding on the border be \(p\), and the size of the input volume be \(w\). Then we can compute the size of the output volume as
\[(w-f+2p)/s+1\]
For example, a \(7\times7\) input with a \(3\times3\) filter and no zero-padding (\(p=0\)) yields a \(5\times5\) output with stride 1 and a \(3\times3\) output with stride 2, as shown below:

The weights shared between the neurons in this example are [1, 0, -1], and the bias is 0.

Use of zero-padding. In the example on the left, the input dimension was 5 and the output dimension was also 5; without zero-padding, the spatial dimension of the output volume would have been only 3. In general, when the stride is \(s=1\), setting the zero-padding to \(p=\frac{f-1}{2}\) ensures that the input and output volumes have the same spatial size. It is very common to use zero-padding in this way; we will discuss the full reasons when we talk about ConvNet architectures.

Constraints on strides. Note that the spatial arrangement hyperparameters have mutual constraints. For example, when the input size is \(w=10\), no zero-padding is used (\(p=0\)), and the filter size is \(f=3\), a stride of \(s=2\) is impossible, since \((w-f+2p)/s+1 = (10-3+0)/2+1 = 4.5\) is not an integer, indicating that the neurons do not "fit" neatly and symmetrically across the input. This setting of the hyperparameters is therefore invalid, and a ConvNet library might throw an exception, zero-pad the rest to make it fit, or crop the input to make it fit. Sizing ConvNets so that all the dimensions "work out" can be a real headache; using zero-padding and some design guidelines significantly reduces the burden.
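Both the output-size formula and this divisibility constraint can be captured in a small helper (an illustrative sketch; the function name is ours):

```python
# Implements the output-size formula (W - F + 2P)/S + 1, raising an error
# when the stride does not "fit" (i.e. the result would not be an integer).
def conv_output_size(w, f, p, s):
    numer = w - f + 2 * p
    if numer % s != 0:
        raise ValueError(
            f"stride {s} does not fit: ({w} - {f} + 2*{p})/{s} + 1 is not an integer")
    return numer // s + 1

print(conv_output_size(7, 3, 0, 1))  # 5: the 7x7 input, 3x3 filter, stride-1 example
print(conv_output_size(7, 3, 0, 2))  # 3: the same setup with stride 2
# conv_output_size(10, 3, 0, 2) raises ValueError: (10 - 3)/2 + 1 = 4.5
```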

Parameter sharing. Convolutional layers use a parameter-sharing scheme to control the number of parameters. It turns out that we can dramatically reduce the parameter count by making one reasonable assumption:

If a feature is useful to compute at some spatial position \((x, y)\), then it should also be useful to compute at a different position \((x_2, y_2)\).

With the parameter-sharing scheme, the number of parameters shrinks considerably. During backpropagation, every neuron in the volume computes the gradient for its weights, but these gradients are added up across each depth slice, so only a single set of weights per slice is updated, and the weight updates of different slices are independent of one another.
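To see how much sharing saves, here is some illustrative arithmetic. The layer sizes (an output volume of 55x55x96 produced by 11x11x3 filters, an AlexNet-style first layer) are an assumed example, not taken from the text above:

```python
# Assumed example layer: output volume 55x55x96, filters of size 11x11x3.
receptive_field = 11 * 11 * 3  # weights per neuron (plus 1 bias)
neurons = 55 * 55 * 96         # number of neurons in the output volume

# Without sharing: every neuron has its own weights and bias.
params_no_sharing = neurons * (receptive_field + 1)

# With sharing: one set of weights (plus a bias) per depth slice, i.e. per filter.
params_shared = 96 * (receptive_field + 1)

print(params_no_sharing)  # 105705600
print(params_shared)      # 34944
```

Roughly 105 million parameters collapse to about 35 thousand, which is why parameter sharing is the default in convolutional layers.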


Pooling layer

It is common to periodically insert a pooling layer between successive convolutional layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation, reducing the number of parameters and the amount of computation in the network, and hence also controlling overfitting. The pooling layer operates independently on every depth slice of the input using the MAX operation. The most common form uses a 2x2 filter with a stride of 2, downsampling every depth slice of the input by 2 along both width and height and discarding 75% of the activations. Each MAX operation here takes the max over 4 numbers. The depth dimension remains unchanged.

    • Suppose the size of the input volume to the pooling layer is \(w_1\times h_1\times d_1\)
    • The layer has two hyperparameters:
    • the spatial extent of the filter \(f\)
    • the stride \(s\)
    • The output volume then has size \(w_2\times h_2\times d_2\), where
    • \(w_2 = (w_1-f)/s + 1\)
    • \(h_2 = (h_1-f)/s + 1\)
    • \(d_2 = d_1\)
    • Because the layer computes a fixed function of the input, it introduces no parameters
    • It is relatively uncommon to use zero-padding in pooling layers
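A minimal NumPy sketch of max pooling (our own illustrative code) that follows the size formulas above:

```python
import numpy as np

# 2x2 max pooling with stride 2 on an (H, W, D) volume; each depth slice is
# pooled independently, so the depth dimension is unchanged.
def max_pool(volume, f=2, s=2):
    h, w, d = volume.shape
    out_h, out_w = (h - f) // s + 1, (w - f) // s + 1
    out = np.zeros((out_h, out_w, d))
    for y in range(out_h):
        for x in range(out_w):
            window = volume[y * s:y * s + f, x * s:x * s + f, :]
            out[y, x, :] = window.max(axis=(0, 1))  # max over the f x f window
    return out

volume = np.random.randn(8, 8, 12)
print(max_pool(volume).shape)  # (4, 4, 12)
```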

It is worth noting that in practice, only two common variants of the max pooling layer are found:

    • Overlapping pooling: \(f=3, s=2\), where the pooling window is larger than the stride
    • Non-overlapping pooling: \(f=2, s=2\)

Pooling with larger receptive fields is too destructive.

Common pooling operations. Besides max pooling, pooling units can perform other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has fallen out of favor compared to max pooling, mainly because max pooling has been shown to work better in practice.


The pooling layer downsamples each depth slice of the input volume independently in space.
Left: In this example, an input volume of size [224x224x64] is pooled with filter size 2 and stride 2 into an output volume of size [112x112x64]. Notice that the depth of the volume is preserved.
Right: The max pooling operation shown with a 2x2 filter and a stride of 2.

Getting rid of the pooling layer. Many people dislike the pooling operation and look for ways to do without it. For example, Striving for Simplicity: The All Convolutional Net proposes discarding the pooling layer in favor of an architecture that contains only convolutional layers; to reduce the size of the representation, it suggests using larger strides in the conv layers. Discarding pooling layers has also been found important in training generative models with good generalization, such as variational autoencoders (VAEs) and generative adversarial networks (GANs).

Convolutional Neural Network Architectures (convnet architectures)

Convolutional neural networks are commonly built from three types of layers: the convolutional layer, the pooling layer (max pooling unless stated otherwise), and the fully-connected layer. We will also explicitly write the RELU activation function as a layer. In this section we discuss how these layers are commonly stacked together to form whole convolutional neural networks.

Layer patterns (Layer Patterns)

The most common ConvNet architecture stacks a few CONV-RELU layers and follows them with a POOL layer, repeating this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In summary, the most common ConvNet architecture follows the pattern
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

In the pattern above, * indicates repetition and POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, and K >= 0 (and usually K < 3). The following common convolutional neural network structures all fit this pattern:

    • INPUT -> FC implements a linear classifier. Here N = M = K = 0.
    • INPUT -> CONV -> RELU -> FC
    • INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here there is a single CONV layer between every POOL layer.
    • INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC. Here there are two CONV layers stacked before every POOL layer. This is usually a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
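The pattern can be made concrete with a small sketch (the function name and list representation are ours) that expands N, M, and K into a layer list:

```python
# Expand the INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
# pattern into a concrete list of layer names.
def convnet_pattern(n, m, k, use_pool=True):
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if use_pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("FC")
    return layers

print(convnet_pattern(0, 0, 0))  # ['INPUT', 'FC']: the linear classifier
print(convnet_pattern(1, 2, 1))  # the [CONV -> RELU -> POOL]*2 example above
```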

We generally prefer a stack of small-filter CONV layers to a single CONV layer with a large receptive field. Suppose we stack three 3x3 CONV layers on top of each other (with nonlinear activation functions between them, of course). In this arrangement, each neuron of the first CONV layer has a 3x3 view of the input volume; each neuron of the second CONV layer has a 3x3 view of the first layer, and hence a 5x5 view of the input volume; and each neuron of the third layer likewise has a 7x7 view of the input. Suppose that instead of the three 3x3 CONV layers we used a single CONV layer with a 7x7 receptive field. These neurons would have the same spatial extent of view of the input, but the single layer has some drawbacks:

    1. The single 7x7 layer computes a linear function over its input, while the stack of CONV layers interleaves nonlinearities, which makes the extracted features more expressive.
    2. Suppose the input has \(c\) channels, and every layer also outputs \(c\) channels. A single 7x7 CONV layer then contains \(c\times(7\times7\times c) = 49c^2\) parameters, while the three 3x3 CONV layers contain only \(3\times(c\times(3\times3\times c)) = 27c^2\) parameters. Intuitively, stacking CONV layers with small filters extracts more expressive features from the input than a single CONV layer with a large filter, and with fewer parameters. One practical drawback is that during backpropagation we need more memory to hold the intermediate results of the stacked CONV layers.
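Both claims can be checked with a few lines of arithmetic (an illustrative sketch; the helper names and the choice of \(c=64\) are ours):

```python
# Claim 1 (spatial extent): stacking layers of f x f filters grows the
# effective receptive field by (f - 1) per layer, starting from a single pixel.
def effective_receptive_field(num_layers, f=3):
    field = 1
    for _ in range(num_layers):
        field += f - 1
    return field

# Claim 2 (parameters): each layer maps c channels to c channels with f x f
# filters, i.e. c filters of size f x f x c; biases are ignored as in the text.
def params_stacked(num_layers, f, c):
    return num_layers * (c * (f * f * c))

c = 64
print(effective_receptive_field(3))  # 7: three 3x3 layers see a 7x7 region
print(params_stacked(3, 3, c))       # 27 * c^2 = 110592
print(params_stacked(1, 7, c))       # 49 * c^2 = 200704
```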

Recent departures.
It should be noted that the conventional paradigm of a linear stack of layers has recently been challenged, notably by Google's Inception architecture and by the residual networks (ResNets) from Microsoft Research Asia (currently the state of the art). Both feature more intricate and different connectivity structures.

Layer Size

Until now we have omitted mention of the common hyperparameter settings used in a ConvNet. Here are the usual rules of thumb for sizing the layers.

The input layer (containing the image)

The size of the input layer should be divisible by 2 many times. Common sizes include 32 (CIFAR-10), 64, 96 (STL-10), 224 (common ImageNet ConvNets), 384, and 512.

Convolution layer

The convolutional layers should use small filters (3x3, or at most 5x5) with a stride of 1. Crucially, the input volume should be zero-padded so that the conv layer does not alter the spatial dimensions of the input. That is, when \(f=3\), a padding of \(p=1\) preserves the original size of the input; likewise \(f=5\) needs \(p=2\). In general, \(p=(f-1)/2\) preserves the input size. If you must use larger filter sizes (such as 7x7), it is common to see them only on the very first conv layer that looks at the input image.

Pooling Layer

The pooling layers are in charge of downsampling the spatial dimensions of the input. The most common setting is max pooling with a 2x2 receptive field and a stride of 2, which discards exactly 75% of the activations of the previous layer. Another, slightly less common setting is a 3x3 receptive field with a stride of 2. It is very rare to see pooling with a receptive field larger than 3, because the pooling is then too lossy and aggressive, which usually degrades performance.

In the pattern described above, the conv layers preserve the spatial dimensions of their input, while the pooling layers alone are responsible for downsampling the volumes spatially, which relieves us of worrying about sizes. In an alternative scheme, the conv layers use strides greater than 1 or do not zero-pad the input; in that case we have to carefully track the input volumes throughout the whole architecture and make sure that all strides and filters "work out".

Why use a stride of 1 in the conv layers?
Smaller strides work better in practice. A stride of 1 also lets us leave all spatial downsampling to the pooling layers, with the conv layers only transforming the input volume depth-wise.

Compromising based on memory constraints.
In some cases (especially in the early layers of a ConvNet), the rules above make the amount of memory grow very quickly. For example, filtering a [224x224x3] image with three 3x3 conv layers of 64 filters each (with padding 1) would create three activation volumes of size [224x224x64], for a total of about 10 million activations. Since memory is the usual bottleneck on GPUs, it may be necessary to compromise, and in practice the compromise is usually made at the first conv layer. For example, ZF Net uses 7x7 filters with a stride of 2 in its first conv layer, and AlexNet uses 11x11 filters with a stride of 4.
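The activation-count arithmetic for this example can be checked in a few lines (an illustrative sketch; the byte figure assumes 4-byte float32 activations, which is our assumption):

```python
# Three 3x3 conv layers with 64 filters each, zero-padded so a 224x224 input
# keeps its spatial size: each layer outputs a [224x224x64] volume.
activations_per_layer = 224 * 224 * 64         # 3,211,264 per layer
total_activations = 3 * activations_per_layer  # ~10 million in total
bytes_float32 = total_activations * 4          # assuming 4-byte float32 values

print(total_activations)  # 9633792
print(bytes_float32)      # 38535168, i.e. roughly 38 MB before gradients
```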

Why use padding?
In addition to keeping the spatial sizes constant after CONV, as discussed above, zero-padding actually improves performance. If the CONV layers did not zero-pad the inputs and performed only valid convolutions, the size of the volumes would shrink by a small amount after every CONV, and the information at the borders would be "washed away" too quickly.
