Translator's note: This article is translated from the Stanford CS231n course notes (the ConvNet notes), with the authorization of the course instructor Andrej Karpathy. The translation was completed by Duke and monkey, with proofreading and revision by Kun kun and Li Yiying.
The original text is as follows
Content list:
- Structure overview
- Layers used to build a convolutional neural network
  - Convolutional layer
  - Pooling layer
  - Fully connected layer
  - Converting fully connected layers to convolutional layers
- Convolutional network structures
  - Layer arrangement patterns
  - Layer sizing patterns
  - Case studies (LeNet / AlexNet / ZFNet / GoogLeNet / VGGNet)
  - Computational considerations
- Additional resources
Convolutional Neural Networks (CNNs / ConvNets)
Convolutional neural networks are very similar to the conventional neural networks of the previous chapter: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product, and then applies an activation function. The whole network still expresses a single differentiable score function: the input of the function is the raw image pixels, and the output is the scores for the different categories. On the last layer (usually the fully connected layer), the network still has a loss function (such as SVM or Softmax), and the various techniques and tips we developed for conventional neural networks still apply to convolutional neural networks.
So what changes? The architecture of a convolutional neural network is built on the assumption that the input data is an image, and based on that assumption we add certain unique properties to the architecture. These properties make the forward propagation function more efficient to implement and significantly reduce the number of parameters in the network.
Structure Overview
Review: conventional neural networks. In the previous chapter, the input to a neural network is a vector, which is then transformed through a series of hidden layers. Each hidden layer is made up of a set of neurons, each connected to all the neurons in the previous layer; within a single hidden layer, however, the neurons are independent of each other and share no connections. The final fully connected layer is called the "output layer", and its output values are interpreted as the scores for the different categories.
Conventional neural networks scale poorly to large images. In CIFAR-10, the images are of size 32x32x3 (32 wide, 32 high, 3 color channels), so in the first hidden layer of a corresponding regular neural network, each individual fully connected neuron would have 32x32x3 = 3072 weights. This number seems acceptable, but it is clear that this fully connected structure does not scale to larger images. For example, an image of size 200x200x3 would give each neuron 200x200x3 = 120,000 weights. Moreover, the network must contain more than one such neuron, so the number of parameters adds up quickly. Clearly, this full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.
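The weight counts above follow directly from multiplying out the input dimensions; a minimal arithmetic check (the function name is ours, for illustration only):

```python
# Each fully connected neuron in the first hidden layer has one weight
# per input value, i.e. width * height * channels weights in total.
def first_layer_weights_per_neuron(width, height, channels):
    return width * height * channels

print(first_layer_weights_per_neuron(32, 32, 3))    # CIFAR-10 image: 3072
print(first_layer_weights_per_neuron(200, 200, 3))  # larger image: 120000
```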
Three-dimensional arrangement of neurons. Convolutional neural networks take advantage of the fact that the input consists entirely of images, and adjust the architecture in a more sensible way, which yields no small advantage. Unlike a conventional neural network, the neurons in each layer of a convolutional neural network are arranged in 3 dimensions: width, height, and depth (here depth refers to the third dimension of an activation volume, not the depth of the whole network; the depth of the whole network refers to its total number of layers). For example, an image in CIFAR-10 is an input to the convolutional neural network, and the dimensions of this input volume are 32x32x3 (width, height, and depth). As we will soon see, the neurons in a layer are connected only to a small region of the previous layer, rather than being fully connected. For a convolutional network used to classify images in CIFAR-10, the final output layer has dimensions 1x1x10, because the end of the convolutional network architecture compresses the full image into a single vector of class scores, arranged along the depth dimension. Here is an example: on the left is a regular 3-layer neural network. On the right is a convolutional neural network, which arranges its neurons in 3 dimensions (width, height, and depth). Every layer of the convolutional network transforms its 3D input volume into a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height are the width and height of the image, and its depth is 3 (representing the red, green, and blue color channels).
Convolutional neural networks are composed of layers. Each layer has a simple API: it transforms an input 3D volume into an output 3D volume with some differentiable function that may or may not have parameters.
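The layer API described above can be sketched as a function from a 3D volume to a 3D volume. This is a toy illustration of ours, not the course's code; the per-channel bias layer is a made-up example of a layer "with parameters":

```python
import numpy as np

# Illustrative sketch of the layer API: a layer maps a 3D input volume
# to a 3D output volume, optionally using learned parameters.
def shift_layer(volume, bias):
    # A toy parameterized layer: adds a per-depth-slice bias.
    # Broadcasting applies the bias across width and height.
    return volume + bias

x = np.zeros((4, 4, 3))
b = np.array([1.0, 2.0, 3.0])  # one "parameter" per depth slice
y = shift_layer(x, b)
print(y.shape)  # (4, 4, 3) -- a 3D volume in, a 3D volume out
```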
Layers used to build a convolutional network
A simple convolutional neural network is a sequence of layers, and each layer in the network uses a differentiable function to transform one volume of activations to another. Convolutional neural networks are built from three types of layers: the convolutional layer, the pooling layer, and the fully connected layer (exactly as in conventional neural networks). Stacking these layers together builds a complete convolutional neural network.
Network architecture example: this is just an overview; the details are described further below. The structure of a convolutional neural network for classifying CIFAR-10 image data could be [input layer - convolutional layer - ReLU layer - pooling layer - fully connected layer]. In more detail: the input [32x32x3] holds the raw pixel values of the image, in this case an image 32 wide, 32 high, with 3 color channels.
In the convolutional layer, each neuron is connected to a local region of the input volume, and each neuron computes a dot product between its own weights and the small region it is connected to. The convolutional layer computes the outputs of all such neurons. If we use 12 filters (also called kernels), the resulting output volume has dimensions [32x32x12].
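The [32x32x12] output size follows from the standard output-size formula. A minimal sketch, assuming a filter size of 5 with stride 1 and zero-padding 2 (the article states only the 12 filters; the other settings are our assumptions that preserve the 32x32 spatial size):

```python
# Standard convolution output-size formula:
# out = (in - filter_size + 2 * pad) // stride + 1, applied per spatial dim.
def conv_output_shape(in_h, in_w, filter_size, stride, pad, num_filters):
    out_h = (in_h - filter_size + 2 * pad) // stride + 1
    out_w = (in_w - filter_size + 2 * pad) // stride + 1
    return (out_h, out_w, num_filters)

print(conv_output_shape(32, 32, filter_size=5, stride=1, pad=2,
                        num_filters=12))
# (32, 32, 12)
```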
The ReLU layer applies an elementwise activation function, such as thresholding at zero with max(0, x). This layer leaves the size of the volume unchanged at [32x32x12].
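The elementwise ReLU can be written in one line of NumPy; the shape of the volume is untouched:

```python
import numpy as np

# Elementwise ReLU: max(0, x) applied to every entry of the volume.
# The volume's shape is unchanged: [32x32x12] in, [32x32x12] out.
x = np.random.randn(32, 32, 12)
y = np.maximum(0, x)
print(y.shape)  # (32, 32, 12); all entries are now >= 0
```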
The pooling layer performs downsampling along the spatial dimensions (width and height), giving a volume of dimensions [16x16x12].
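A minimal max-pooling sketch, assuming a 2x2 window with stride 2 (a common but here assumed setting): each spatial dimension is halved, so 32x32x12 becomes 16x16x12.

```python
import numpy as np

# 2x2 max pooling with stride 2, via a reshape trick: group each spatial
# dimension into pairs, then take the max within each 2x2 window.
def max_pool_2x2(volume):
    h, w, d = volume.shape
    return volume.reshape(h // 2, 2, w // 2, 2, d).max(axis=(1, 3))

x = np.random.randn(32, 32, 12)
print(max_pool_2x2(x).shape)  # (16, 16, 12)
```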
The fully connected layer computes the class scores, giving a volume of size [1x1x10], where each of the 10 numbers corresponds to the score of one of the 10 CIFAR-10 categories. As the name implies, and as in conventional neural networks, each neuron in this layer is connected to all the neurons in the previous layer.
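A sketch of the fully connected layer: flatten the incoming volume and multiply by a weight matrix to get one score per class. The 16x16x12 input size follows the running example; W and b here are random stand-ins for learned parameters, not trained values.

```python
import numpy as np

x = np.random.randn(16, 16, 12)        # output of the pooling layer
W = np.random.randn(10, 16 * 16 * 12)  # one row of weights per class
b = np.random.randn(10)

# Each class score is a dot product of the flattened volume with that
# class's weight row, plus a bias.
scores = W @ x.ravel() + b
print(scores.shape)  # (10,), i.e. the [1x1x10] output volume
```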
In this way, the convolutional neural network transforms the image layer by layer from the raw pixel values to the final class scores. Some of these layers contain parameters and some do not. Specifically, the convolutional and fully connected layers (CONV/FC) perform transformations that are functions not only of the input activations but also of parameters (the weights and biases of the neurons), while the ReLU and pooling layers implement a fixed function. The parameters in the convolutional and fully connected layers are trained with gradient descent, so that the class scores computed by the network are consistent with the label of each image in the training set.
Summary: in the simplest case, the structure of a convolutional neural network is a series of layers that transform the input data into output data (such as class scores).
There are a few distinct types of layers in a convolutional network architecture (currently the most popular are the convolutional layer, the fully connected layer, the ReLU layer, and the pooling layer).
Each layer accepts 3D input data and transforms it into 3D output data through a differentiable function.
Some layers have parameters and some do not (the convolutional and fully connected layers do, the ReLU and pooling layers do not). Some layers have additional hyperparameters and some do not (the convolutional, fully connected, and pooling layers do, the ReLU layer does not).

An example of the activations of a convolutional neural network. The input layer on the left holds the raw image pixels, and the output layer on the right holds the class scores. Each activation volume along the processing path is laid out in a column. Because 3D volumes are difficult to draw, the slices of each volume are stacked into a column. The final volume holds the scores for the different categories; only the top 5 scores and their corresponding categories are shown.