Issues covered:
1. How does each layer convolve:
(1) How does one feature map become several?
(2) How are the convolution kernels chosen?
2. How are the connections between nodes arranged? (weight sharing and parameter reduction)
3. S2-C3: how are the connections allocated?
4. 16-120 (S4-C5): how does the full connection work?
5. What is the final output form?
① Layer-by-layer explanation:
Let's be clear first: each layer has multiple feature maps, each feature map extracts one feature of the input via a convolution filter, and each feature map consists of multiple neurons.
The C1 layer is a convolutional layer (why convolution? An important property of the convolution operation is that it enhances the features of the original signal while reducing noise). It consists of 6 feature maps. Each neuron in a feature map is connected to a 5*5 neighborhood of the input. The feature maps are 28*28, which keeps input connections from falling outside the boundary (this matters for the BP feedback computation, avoiding gradient loss; a personal insight). C1 has 156 trainable parameters (each filter has 5*5=25 unit parameters plus one bias; with 6 filters in total, that is (5*5+1)*6=156 parameters) and 156*(28*28)=122,304 connections.
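As a quick sanity check, the C1 numbers can be reproduced with a few lines of arithmetic (a minimal sketch; the sizes are the ones quoted above):

```python
# C1: 6 filters of size 5*5 over a 32*32 input, producing 28*28 maps.
kernel, n_maps, out_h, out_w = 5, 6, 28, 28

params = (kernel * kernel + 1) * n_maps  # 25 weights + 1 bias per filter
connections = params * out_h * out_w     # every output unit reuses the shared filter

print(params)       # 156
print(connections)  # 122304
```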
The S2 layer is a sub-sampling layer (why sub-sample? By exploiting the local correlation of images, sub-sampling reduces the amount of data to process while preserving useful information). It has 6 feature maps of size 14*14. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C1. The 4 inputs of each S2 unit are added, multiplied by a trainable coefficient, and a trainable bias is added; the result is passed through the sigmoid function. The trainable coefficient and bias control the nonlinearity of the sigmoid. If the coefficient is small, the unit operates nearly linearly and the sub-sampling amounts to blurring the image. If the coefficient is large, then depending on the bias the sub-sampling behaves like a noisy "or" or a noisy "and" operation. The 2*2 receptive fields do not overlap, so each feature map in S2 is 1/4 the size of the corresponding map in C1 (1/2 in each of rows and columns). The S2 layer has 12 trainable parameters and 5,880 connections.
Figure: the convolution and sub-sampling process. Convolution stage: convolve the input image (the raw input image at the first stage; convolutional feature maps at later stages) with a trained filter fx and add a bias bx to obtain the convolutional layer Cx. Sub-sampling stage: sum the four pixels of each 2*2 neighborhood into one pixel, weight by a scalar Wx+1, add a bias bx+1, and pass through a sigmoid activation function to produce a feature map Sx+1 reduced roughly by a factor of four.
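Below is a minimal numpy sketch of one such convolution + sub-sampling stage. The names fx, bx, wx1, bx1 mirror the figure; the random filter and the coefficient values are stand-ins, not trained values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(img, fx, bx):
    """Valid convolution (cross-correlation, as is conventional in CNNs) plus bias -> Cx."""
    h, w = img.shape
    k = fx.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * fx) + bx
    return out

def subsample_layer(cx, wx1, bx1):
    """Sum each non-overlapping 2*2 neighborhood, scale, bias, squash -> Sx+1."""
    h, w = cx.shape
    pooled = cx.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))
    return sigmoid(wx1 * pooled + bx1)

img = np.random.randn(32, 32)
cx = conv_layer(img, np.random.randn(5, 5), 0.1)  # 28*28
sx = subsample_layer(cx, 0.5, 0.1)                # 14*14
print(cx.shape, sx.shape)                         # (28, 28) (14, 14)
```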
So the mapping from one plane to the next can be regarded as a convolution operation, and the S layer can be regarded as a blur filter performing a second stage of feature extraction. From hidden layer to hidden layer the spatial resolution decreases while the number of planes per layer increases, which allows more feature information to be detected.
The C3 layer is also a convolutional layer. It likewise convolves layer S2 with 5x5 kernels, so each resulting feature map has only 10x10 neurons, but it uses 16 different convolution kernels, so there are 16 feature maps. One thing to note here: each feature map in C3 is connected to all 6, or to several, of the feature maps in S2, meaning each feature map of this layer is a different combination of the feature maps extracted by the previous layer (this is not the only possible choice). (Notice the combination here, just like the human visual system discussed earlier: lower-level structures form more abstract higher-level structures, e.g. edges form shapes or parts of a target.)
As just said, each feature map in C3 combines all 6, or several, of the feature maps in S2. Why not connect every feature map in S2 to every feature map in C3? There are two reasons. First, the incomplete connection scheme keeps the number of connections within a reasonable range. Second, and more important, it breaks the symmetry of the network: because different feature maps receive different inputs, they are forced to extract different (hopefully complementary) features.
For example, one possible scheme is: the first 6 C3 feature maps take subsets of 3 adjacent feature maps in S2 as input; the next 6 take subsets of 4 adjacent feature maps in S2; the next 3 take subsets of 4 non-adjacent feature maps; and the last one takes all the feature maps in S2 as input. This gives the C3 layer 1,516 trainable parameters and 151,600 connections.
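Under that scheme, the 1,516 parameters and 151,600 connections can be verified directly (a sketch; the subset sizes are the ones listed above):

```python
# Number of S2 maps feeding each of the 16 C3 maps in the scheme above:
fan_in = [3] * 6 + [4] * 6 + [4] * 3 + [6]

params = sum(n * 5 * 5 + 1 for n in fan_in)  # one 5*5 kernel per input map + 1 bias
connections = params * 10 * 10               # each C3 map is 10*10

print(params)       # 1516
print(connections)  # 151600
```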
The S4 layer is a sub-sampling layer consisting of 16 feature maps of size 5*5. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C3, in the same way as between C1 and S2. The S4 layer has 32 trainable parameters (1 coefficient and 1 bias per feature map) and 2,000 connections.
The C5 layer is a convolutional layer with 120 feature maps. Each unit is connected to the 5*5 neighborhood of all 16 feature maps in S4. Since the S4 feature maps are also 5*5 (the same size as the filter), each C5 feature map is 1*1: this constitutes a full connection between S4 and C5. C5 is still labeled a convolutional layer rather than a fully-connected layer because, if the input to LeNet-5 were larger with everything else unchanged, the feature maps would end up larger than 1*1. The C5 layer has 48,120 trainable connections.
The F6 layer has 84 units (this number is chosen to match the design of the output layer) and is fully connected to C5. It has 10,164 trainable parameters. Like a classical neural network, each F6 unit computes the dot product between its input vector and its weight vector, adds a bias, and passes the result through the sigmoid function to produce the state of unit i.
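The remaining layer sizes check out the same way (a sketch of the arithmetic behind the figures quoted above):

```python
# S4: 16 maps, 1 coefficient + 1 bias each; each 5*5 output unit sees 2*2 inputs + bias.
s4_params = 16 * 2                          # 32
s4_connections = 16 * 5 * 5 * (2 * 2 + 1)   # 2000

# C5: 120 units, each fully connected to all 16 S4 maps through 5*5 kernels.
c5_params = 120 * (16 * 5 * 5 + 1)          # 48120 (equal to connections: maps are 1*1)

# F6: 84 units fully connected to C5's 120 outputs.
f6_params = 84 * (120 + 1)                  # 10164

print(s4_params, s4_connections, c5_params, f6_params)
```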
Finally, the output layer consists of Euclidean radial basis function (RBF) units, one per class, each with 84 inputs. In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector. The farther the input is from the parameter vector, the larger the RBF output. An RBF output can be interpreted as a penalty measuring how well the input pattern matches a model of the class associated with that RBF unit. In probabilistic terms, the RBF output can be understood as the negative log-likelihood of a Gaussian distribution in the configuration space of the F6 layer. Given an input pattern, the loss function should drive the F6 configuration close enough to the RBF parameter vector of the pattern's expected class. The parameters of these units are chosen by hand and kept fixed (at least initially). The components of these parameter vectors are set to -1 or +1. Although these parameters could be chosen with equal probability of -1 and +1, or as an error-correcting code, they are instead designed as stylized images of size 7*12 (i.e. 84 pixels) of the corresponding character class. Such a representation is not particularly useful for recognizing isolated digits, but it is useful for recognizing strings of characters drawn from the full printable ASCII set.
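A minimal sketch of the RBF output computation. The parameter matrix W of ±1 entries is drawn randomly here purely for illustration (in LeNet-5 it encodes the 7*12 character bitmaps), and np.tanh stands in for the symmetric F6 squashing function:

```python
import numpy as np

n_classes, n_inputs = 10, 84
W = np.random.choice([-1.0, 1.0], size=(n_classes, n_inputs))  # fixed +/-1 parameter vectors
x = np.tanh(np.random.randn(n_inputs))                         # stand-in for F6 activations

# Each RBF unit outputs the squared Euclidean distance to its parameter vector;
# the class whose output is smallest is the best match.
y = np.sum((x - W) ** 2, axis=1)
print(y.argmin())
```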
Another reason to use this distributed code rather than the more common "1 of N" code for the output is that a non-distributed code works poorly when the number of classes is large: most of the time, the outputs of a non-distributed code must be 0, which is hard to achieve with sigmoid units. Yet another reason is that the classifier is used not only to recognize characters but also to reject non-characters. RBF units with distributed codes are better suited to this goal because, unlike sigmoids, they are activated within a well-confined region of the input space, outside of which atypical patterns are more likely to fall.
The RBF parameter vectors play the role of target vectors for the F6 layer. It is worth noting that the components of these vectors are +1 or -1, which lies within the range of the F6 sigmoid and thus prevents the sigmoid from saturating. In fact, +1 and -1 are the points of maximum curvature of the sigmoid, so the F6 units operate in their maximally nonlinear range. Saturation of the sigmoid must be avoided, as it leads to slow convergence and an ill-posed loss function.
② Explanation of each question
Question 1:
(1) Input-C1
The 32*32 image is convolved with 6 patches of size 5*5 (i.e. the weights, which are trainable, randomly initialized, and adjusted during training), yielding 6 feature maps.
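The 28*28 size of those maps follows from the valid-convolution output formula (output side = input side - kernel side + 1); a one-line check:

```python
# Valid convolution: output side = input side - kernel side + 1
print(32 - 5 + 1)  # 28, so each of the 6 feature maps is 28*28
```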
(2) S2-C3
How are the 16 feature maps of size 10*10 in C3 obtained?
The S2 feature maps are convolved by a network whose input layer has 150 nodes (=5*5*6, not 5*5) and whose output layer has 16 nodes.
How is the value of the 3rd feature map of C3 obtained?

First, split the 150 input nodes of the 150-16 network (input layer: 150 nodes, hidden layer: 16 nodes) into 6 parts, each a contiguous block of 25 nodes, one part per S2 feature map. Suppose this C3 map is fed by three S2 maps. Take the 25 weights connecting the first contributing part to hidden node 3 (the 4th node, since we count from 0), reshape them to 5*5, and convolve the corresponding S2 feature map with this 5*5 patch; call the result H1.

Similarly, take the 25 weights connecting the second contributing part to the same hidden node, reshape them to 5*5, and convolve the second contributing S2 feature map with this patch; call the result H2.

Then take the 25 weights connecting the third contributing part to the same hidden node, reshape them to 5*5, and convolve the third contributing S2 feature map with this patch; call the result H3.

Finally, add the three matrices H1, H2, H3 to get a new matrix H, add a bias B to every element of H, and pass the result through the sigmoid activation function; this gives the value of the C3 feature map.
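Here is a minimal numpy sketch of this procedure for a C3 map fed by three S2 maps. The names H1, H2, H3, B follow the text; the random kernels stand in for the trained weights of the 150-16 network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(img, k):
    h, w = img.shape
    n = k.shape[0]
    out = np.zeros((h - n + 1, w - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + n, j:j + n] * k)
    return out

s2_maps = [np.random.randn(14, 14) for _ in range(3)]            # the 3 S2 maps feeding this C3 map
kernels = [np.random.randn(25).reshape(5, 5) for _ in range(3)]  # 25 weights reshaped to 5*5
B = 0.1                                                          # shared bias for this C3 map

# Convolve each input map with its own kernel, sum the results (H1+H2+H3), add the bias, squash.
H = sum(conv2d_valid(s, k) for s, k in zip(s2_maps, kernels))
c3_map = sigmoid(H + B)
print(c3_map.shape)  # (10, 10)
```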
Question 2:
Why are there 150 input nodes from S2? (This relates to weight sharing and parameter reduction.)
One of the most awesome parts of CNNs is that they reduce the number of parameters the network needs to train through local receptive fields and weight sharing.
Left: Suppose we have a 1000x1000-pixel image and 1 million hidden neurons. If they are fully connected (each hidden neuron connects to every pixel of the image), there are 1000x1000x1000000 = 10^12 connections, i.e. 10^12 weight parameters. However, the spatial structure of images is local: just as a person perceives the outside world through a local receptive field, each neuron does not need to see the whole image; it only needs to see a local region, and at a higher level these neurons with different local views can be combined to obtain global information. This reduces the number of connections, i.e. the number of weight parameters the network must train. Right: if the local receptive field is 10x10, each hidden neuron only needs to connect to a 10x10 local patch, so 1 million hidden neurons need only 10^8 connections, i.e. 10^8 parameters. That is four orders of magnitude fewer than before, so training is far less laborious, but it still feels like a lot. Is there anything more we can do?
We know each hidden neuron connects to a 10x10 image region, which means each neuron has 10x10=100 connection weights. What if all neurons shared the same 100 parameters? That would mean every neuron uses the same convolution kernel to convolve the image. How many parameters would we have then? Only 100! No matter how many neurons the hidden layer has, the connection between the two layers needs just 100 parameters. This is weight sharing, and it is the main selling point of convolutional neural networks. You may ask: is this a reliable thing to do? Why does it work? That we will have to learn together.
Well, you might think: doesn't that make feature extraction unreliable, since only one feature has been extracted? Right, clever of you: we need to extract many features. If one filter, i.e. one convolution kernel, extracts one feature of the image, such as an edge in a certain direction, then to extract different features we simply add more filters. So suppose we add 100 filters, each with different parameters, representing different features of the input image, such as different edges. Each filter convolved with the image produces a projection of a different image feature, which we call a feature map. So 100 convolution kernels give 100 feature maps, and these 100 feature maps form one layer of neurons. By now it should be clear: how many parameters does this layer have? 100 kernels x 100 shared parameters per kernel = 100x100 = 10,000 parameters. Only 10,000! (See right: different colors denote different filters.)
Oh, one question was missed. It was said that the number of parameters of the hidden layer is independent of the number of hidden neurons; it depends only on the filter size and the number of filter types. So how do we determine the number of hidden neurons? It depends on the size of the original image (the number of inputs), the filter size, and the filter's sliding stride across the image. For example, if my image is 1000x1000 pixels and the filter is 10x10 with no overlap, i.e. stride 10, then the number of hidden neurons is (1000x1000)/(10x10) = 100x100 neurons. If the stride were 8, i.e. kernels overlapping by two pixels, the count changes accordingly; I'll leave that out, since the idea is what matters. Note that this is the neuron count for just one filter, i.e. one feature map; with 100 feature maps it is 100 times as many. Thus the larger the image, the wider the gulf between the number of neurons and the number of weights that need training.
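These counts are easy to reproduce (a sketch of the arithmetic in the paragraphs above):

```python
pixels = 1000 * 1000

fully_connected = pixels * 1_000_000           # 10**12 weights, fully connected
local_10x10 = 1_000_000 * 10 * 10              # 10**8 weights with 10x10 receptive fields
shared_one_filter = 10 * 10                    # 100 weights once the filter is shared
shared_100_filters = 100 * shared_one_filter   # 10,000 weights for 100 feature maps

# Neurons per feature map: non-overlapping 10x10 fields, stride 10
neurons_per_map = (1000 // 10) * (1000 // 10)  # 100x100 = 10,000

print(fully_connected, local_10x10, shared_one_filter, shared_100_filters, neurons_per_map)
```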
From this we can see where the 150 comes from: each C3 unit is connected to a 5*5 patch in each of the 14*14 S2 feature maps, i.e. 5*5=25 input nodes per map, and multiplying by the 6 S2 maps gives the 150 input nodes.
It is important to note that the discussion above does not account for each neuron's bias, so the weight count needs 1 added per filter; the bias, too, is shared by the same filter.
In short, the core idea of convolutional networks is to combine three structural ideas (local receptive fields, weight sharing or weight replication, and temporal or spatial sub-sampling) to obtain some degree of invariance to shift, scale, and deformation.
Question 3:
If the C1 layer is reduced to 4 feature maps, and S2 is likewise reduced to 4 feature maps, with C3 and S4 correspondingly having 11 feature maps, what does the connection scheme between S2 and C3 look like?
Question 4:
Full connection:
C5 convolves the S4 layer using a full connection: each C5 convolution kernel performs a convolution over all 16 feature maps of S4.
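Since each S4 map is 5*5, a valid convolution with a 5*5 kernel collapses to a single value, which is exactly why this amounts to a full connection; a quick numpy check:

```python
import numpy as np

s4_map = np.random.randn(5, 5)
kernel = np.random.randn(5, 5)

# Valid convolution of two equal-sized arrays collapses to one number:
out = np.sum(s4_map * kernel)
print(np.shape(out))  # () - a single scalar, i.e. a 1*1 feature map
```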
Question 5:
Using the one-of-C scheme, the position of the largest component in the network's 1*10 output vector is the classification result. The training-set labels are encoded the same way; for example, 1000000000 denotes the class of the digit 0.
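A small sketch of this one-of-C readout (the output vector here is made up for illustration):

```python
import numpy as np

output = np.array([0.9, 0.01, 0.02, 0.01, 0.0, 0.01, 0.0, 0.03, 0.01, 0.01])
print(output.argmax())  # 0 -> the network classifies the input as the digit 0

# Training labels use the same one-of-C code, e.g. the digit 0:
label = np.zeros(10)
label[0] = 1.0
print(label)  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
```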
I have only just begun learning deep learning, so as a novice these were some of my own doubts; I have organized them here in the hope of helping other beginners. There may be mistakes in places, and corrections are welcome.
Deep Learning (Convolutional Neural Networks): a summary of some issues (repost)