I just joined the lab and was told to study CNNs. I read some blogs and papers by those who came before me and learned a lot, but I think some of the blogs contain errors. I try to correct them here, and I have also added my own thinking and derivations. After all, the theory of CNNs is already established; I just want to describe it objectively. If you think something in this article is wrong, be sure to tell me in the comments below.
A convolutional neural network (CNN) is a foundation of deep learning. A traditional fully connected neural network takes numerical values as input. If you want to work with images, you must first extract features from the image and down-sample them. A CNN combines feature extraction, down-sampling and a traditional neural network into a new kind of network. This article assumes you already know the basic concepts of a simple neural network, such as "layer" and "neuron".
I. Theoretical basis
Figure 1
Α. Simple network topology:
As shown in Figure 1, this is a simple convolutional neural network. A C layer is a layer obtained by filtering its input image, also called a "convolution layer". An S layer is a layer obtained by down-sampling (subsampling) its input image. Here C1 and C3 are convolution layers, and S2 and S4 are down-sampling layers. Each C or S layer consists of several two-dimensional planes, and each plane is a feature map.
Taking the CNN in Figure 1 as an example, here is how an image is processed:
After the image enters the network, it is convolved with three filters to produce the three feature maps of layer C1. The three feature maps of C1 are each down-sampled to give the three feature maps of layer S2. These three feature maps are convolved with filters to give the three feature maps of layer C3, and then, as before, down-sampling yields the three feature maps of layer S4. Finally, the feature maps of S4 are rasterized (flattened pixel by pixel) into a vector. This vector is fed into a traditional fully connected neural network for classification.
Every feature map in layers C1, S2, C3 and S4 in the figure has a size that can be given as pixels x pixels. You might ask: isn't the size of an image always given as pixels x pixels? Yes, but it is a bit special here, because these feature maps make up the convolution and down-sampling layers of a neural network, and in a neural network each layer has the concept of a "neuron". Each pixel in these feature maps is exactly one neuron. The total number of pixels over all feature maps in a layer is the number of neurons in that layer, and it can be calculated. How? Read on.
Β. Hidden layers:
Having introduced the concept of neurons, we can now talk about a) the hidden layers in the C-S part of a CNN (the hidden layers of the fully connected NN part are not discussed here) and b) the connections between neurons in adjacent layers. A hidden layer is like a black veil: it hides the truth, but various clues (the filter size, the stride, the down-sampling range, the bias, and so on) tell you that you are very close to it. Well, to put it plainly: with those values you can work out the number of neurons, the number of trainable parameters, and the number of connections. There are two kinds of hidden layer: the hidden layer between some other layer and a convolution layer, and the hidden layer between a convolution layer and a down-sampling layer.
1. The hidden layer between another layer and a convolution layer:
Definition of a filter (convolution kernel):
filter_width, filter_height → width and height of the filtered area; filter_channels → number of image channels the filter covers; filter_types → number of filter types.
For example, 5x5x3→20: the filter is 5 pixels wide and 5 pixels high, covers 3 channels, and there are 20 filter types in total.
The concepts of local receptive fields and weight sharing are covered below.
In general, if every neuron in a C layer were connected to every pixel of the input image, the number of connections would be extremely large, and the number of parameters to learn would be jaw-dropping.
For example, suppose the input image is 200x200 and layer C1 consists of 6 feature maps, each of size 20x20 (20x20 = 400 neurons per map). Also suppose the filter is a 10x10 single-channel filter (channels = 1) with a stride of 10, so that adjacent filter areas exactly do not overlap, which makes the calculation easy. Then the total number of connections is 200x200 x (6x20x20) = 96,000,000. Good grief, ninety-six million connections! With one parameter per connection, that is more than 90 million trainable parameters (learnable parameters). A network this complex is practically impossible to compute; the Earth would be destroyed before the parameters were trained...
In everyday life, when we look at something we usually see parts of it first; we generally cannot take in the whole object at once. This is local perception (the local receptive field), and CNNs use this strategy. Apart from the additive bias of the convolution layer, each neuron is connected only to the pixels in one local area of the input image. (The size of this local area is the product of the filter's width and height.)
A feature map of a C layer is what the input image becomes after passing through one filter. Suppose the filter extracts a feature "T". Each neuron on the feature map is connected to its corresponding local area of the original image (that is, to the pixels in that area). So once each neuron has the T feature of its own region, don't they together amount to the T feature of the whole original image?
Let's work out the number of connections now! Each neuron is connected to only one 10x10 area, so there are 10x10 x (6x20x20) = 240,000 connections. With one parameter per connection, that is still 240,000 trainable parameters. It can be reduced further!
PS: The value of each neuron is computed by convolving the filter with all the pixel values in the area of the original image that the neuron points to.
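The two connection counts above can be checked with a few lines of Python. This is only an arithmetic sketch of the 200x200 example; the variable names are made up for illustration.

```python
# Numbers from the example: 200x200 input, 6 feature maps of 20x20, 10x10 filter.
input_w, input_h = 200, 200
num_maps, map_w, map_h = 6, 20, 20
filter_w, filter_h = 10, 10

# Total neurons in C1.
neurons = num_maps * map_w * map_h

# Fully connected: every neuron is wired to every input pixel.
full_connections = input_w * input_h * neurons

# Local receptive fields: every neuron is wired to one 10x10 area only.
local_connections = filter_w * filter_h * neurons

print(neurons)            # 2400
print(full_connections)   # 96000000
print(local_connections)  # 240000
```

Local connectivity alone cuts the count by a factor of 400 here, which is simply (200x200) / (10x10).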
Next, weight sharing. We already know that the input image yields a feature map by convolution with a filter. Each neuron on the feature map is connected to the pixels in a rectangular filter area of the original image, which is 10x10 in the example above. So what? Each neuron is connected to 10x10 input neurons on the original image. Because these neurons all look for the same feature, they all use the same filter. Therefore the parameters on the 10x10 connections of every neuron are exactly the same. Makes sense, right? In fact, these 10x10 parameters are shared by all neurons on the feature map. That is weight sharing! So even with 6 feature maps, only 6x10x10 = 600 parameters need to be trained (assuming the input layer has only one image, and not counting biases).
Going further, these 10x10 parameters depend only on the filter: it is as if there were 6 filters, each with 100 parameters, and those parameters need to be trained. It's like a filter that learns by itself!
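To make weight sharing concrete, here is a minimal pure-Python sketch, assuming a single-channel image stored as a list of lists and no padding. The function name convolve_valid is made up for illustration. Like most CNN code it slides the kernel without flipping it, and it leaves out any activation function.

```python
def convolve_valid(image, kernel, bias=0.0, stride=1):
    """Slide ONE shared kernel over a 2D image and return one feature map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (img_h - k_h) // stride + 1
    out_w = (img_w - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Every output neuron reuses the same kernel weights and bias;
            # only the local area it looks at changes.
            total = bias
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[top + di][left + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# The 200x200 example: a 10x10 kernel with stride 10.
image = [[1.0] * 200 for _ in range(200)]
kernel = [[0.01] * 10 for _ in range(10)]
fmap = convolve_valid(image, kernel, stride=10)

weights_per_filter = len(kernel) * len(kernel[0])
print(len(fmap), len(fmap[0]))   # 20 20
print(6 * weights_per_filter)    # 600
```

All 400 neurons of the feature map are produced from the same 100 weights, so 6 such filters need only 600 weights however large the image is.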
For example, take a 5x5x3→20 filter and a raw input image; convolution gives the 20 feature maps of the C layer.
Trainable parameters = (5x5x3 + 1) x 20 = 1,520
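The same count as a small helper, assuming every filter type covers all input channels and carries one additive bias. The function name and its arguments are illustrative only.

```python
def conv_trainable_params(filter_w, filter_h, filter_channels, filter_types,
                          biases_per_filter=1):
    """(filter size + number of biases) x number of filter types."""
    filter_size = filter_w * filter_h * filter_channels
    return (filter_size + biases_per_filter) * filter_types


print(conv_trainable_params(5, 5, 3, 20))  # 1520
```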
The number of neurons on each feature map depends on the width and height of the input image, the width and height of the filter, and the stride of the filter, and equals the number of legal filter areas on the original image. In the example above the filter is 10x10 and the stride is 10, so no two filter areas overlap, which is the ideal case. In general, you need the following formula. Let the width and height of the feature map be N and M respectively.
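$$N = \left\lfloor \frac{\text{input\_width} - \text{filter\_width}}{\text{stride}} \right\rfloor + 1, \qquad M = \left\lfloor \frac{\text{input\_height} - \text{filter\_height}}{\text{stride}} \right\rfloor + 1 \qquad (1)$$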
The derivation is simple. The main idea is that for the rightmost/bottommost legal placement of the filter, the right/lower boundary of the filter area must not exceed the width/height of the input image. Writing down the following two sets of inequalities and solving them gives (1).
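$$\begin{aligned} (N-1)\cdot\text{stride} + \text{filter\_width} &\le \text{input\_width} < N\cdot\text{stride} + \text{filter\_width} \\ (M-1)\cdot\text{stride} + \text{filter\_height} &\le \text{input\_height} < M\cdot\text{stride} + \text{filter\_height} \end{aligned} \qquad (2)$$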
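The feature map size rule can be sketched as a small function, assuming no padding and an input at least as large as the filter. The name feature_map_size is made up for illustration; the check inside restates the boundary condition from the derivation.

```python
def feature_map_size(input_w, input_h, filter_w, filter_h, stride=1):
    """Return (N, M): how many legal filter positions fit along width and height."""
    n = (input_w - filter_w) // stride + 1
    m = (input_h - filter_h) // stride + 1
    # The last legal position must not run past the image border,
    # and one more step would run past it.
    assert (n - 1) * stride + filter_w <= input_w < n * stride + filter_w
    assert (m - 1) * stride + filter_h <= input_h < m * stride + filter_h
    return n, m


print(feature_map_size(200, 200, 10, 10, stride=10))  # (20, 20)
print(feature_map_size(32, 32, 5, 5, stride=1))       # (28, 28)
print(feature_map_size(7, 7, 3, 3, stride=2))         # (3, 3)
```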
Figure 2
If the input layer (meaning the layer that feeds the C layer, not only the very first input layer) has only one feature map, the number of parameters of the hidden layer is simple to calculate: trainable parameters of the hidden layer = (filter size + number of additive biases) x number of filter types. Otherwise things are different. When the input layer has only one map, each C-layer neuron points to one area of that map. If the input layer has P maps (P > 1), especially when the number Q of C-layer feature maps differs from P, a C-layer neuron may point to areas on R of the input layer's maps (R ≤ P). Many allocation strategies are possible: for example, the neurons of one feature map may point to areas on three maps of the input layer while the neurons of another feature map point to areas on six, and this directly affects the parameter count. The calculation depends on the specific strategy. This situation arises at the end of the article, when the LeNet-5 parameters are calculated.
Number of feature maps = number of filter types
Number of trainable parameters of the hidden layer:
1. The input layer has only one feature map: (filter size + number of additive biases) x number of filter types
I. Filter size = filter_width x filter_height x filter_channels
II. Number of additive biases: generally 1
2. The input layer has more than one feature map: depends on the situation
Number of trainable parameters of the hidden layer = (filter size + number of additive biases) x number of filter types
Total number of neurons in the convolution layer = number of feature maps x number of legal filter areas = number of feature maps x N x M
Number of connections = total number of neurons x size of the local area of the input image that each neuron is connected to
Figure 3
As shown in Figure 3, each neuron (green box) in the output feature map points to its own unique filter area (red box) of the input image. I was lazy and only drew the first row.
You can see that the red boxes, that is, the legal filter areas, overlap. This is intentional, because the stride of the filter over the input image is configurable. The last red box also stops a small distance short of the right edge of the input image, again intentionally, because it is the rightmost legal filter area: shift the red box right by one more stride and its right edge would go past the right edge of the input image. Moving the filter downward works the same way.
In the end you will find that the number of neurons on each feature map is the number of legal filter areas on the input image. Although it looks as if each neuron has only one line pointing to its area, that line actually stands for filter_width x filter_height connections. Let k = filter_width x filter_height; with an additive bias, a neuron has (k+1) parameters. All neurons on a feature map share these k+1 parameters. Figure 3 has three feature maps, so the hidden layer in Figure 3 has 3k+3 parameters in total.
2. The hidden layer between a convolution layer and a down-sampling layer:
Why do we need down-sampling? When computing the convolution layer we already used strategies such as local receptive fields and weight sharing, which greatly reduce the amount of computation, but the resulting feature maps may still hold a lot of information. Down-sampling generally halves the width and height: the 4 pixel values in each 2x2 area of a feature map in the input convolution layer are added up, multiplied by a weight w, added to a bias b, and passed through a sigmoid function to give the value of one pixel (neuron) of the S layer. On each feature map, w and b are shared by all its neurons, and they are the only parameters that need to be learned. Condensing a 2x2 area into one value is like summarizing a higher-level feature. Also, the 2x2 sampling areas do not overlap.
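The down-sampling step described above can be sketched in pure Python as follows. It assumes a feature map with even width and height, stored as a list of lists; the function names and the sample values of w and b are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def subsample(feature_map, w, b):
    """Sum each non-overlapping 2x2 area, multiply by w, add b, apply sigmoid."""
    out = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            area_sum = (feature_map[i][j] + feature_map[i][j + 1]
                        + feature_map[i + 1][j] + feature_map[i + 1][j + 1])
            # The same w and b are used for every neuron of this map.
            row.append(sigmoid(w * area_sum + b))
        out.append(row)
    return out


c_map = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
s_map = subsample(c_map, w=0.1, b=-1.0)
print(len(s_map), len(s_map[0]))  # 2 2
```

The top-left output neuron, for instance, is sigmoid(0.1 x (1+2+5+6) - 1.0).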
Down-sampling halves the width and height of a C-layer feature map to get the S-layer feature map, but the width or height of a C-layer feature map may well be odd.
To avoid losing data, some data may be generated to fill the gap; for example, if two cells are missing, duplicate the existing data to fill them. And what about a spot where only a single cell can be taken? That takes some small strategies.
To keep it simple and crude, use the most straightforward formula:
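$$\text{S\_width} = \left\lceil \frac{\text{C\_width}}{2} \right\rceil, \qquad \text{S\_height} = \left\lceil \frac{\text{C\_height}}{2} \right\rceil \qquad (3)$$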
So the total number of neurons in the down-sampling layer = total number of pixels = number of feature maps x S_width x S_height
Number of trainable parameters of the hidden layer = number of feature maps x (1 + 1) = number of feature maps x 2 (one w plus one b)
So how is the number of connections counted? As said earlier, the 4 pixels of a 2x2 area of the convolution layer are summed, multiplied by w and added to b to give the value of one neuron in the down-sampling layer. So that neuron actually has 5 connections. Why? It has one connection to each of the 4 pixels (all four carry the same parameter w), and one connection to the additive bias b, 5 in total.
So number of connections = total number of neurons x (2x2 + 1) = total number of neurons x 5
Down-sampling does not change the number of feature maps
Number of trainable parameters of the hidden layer = number of feature maps x 2
Total number of neurons in the down-sampling layer = number of feature maps x S_width x S_height
Number of connections = total number of neurons x 5
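These counting rules for a down-sampling layer fit in one small helper. It assumes 2x2 sampling areas and one w plus one b per feature map, as described above; the function name and the dictionary keys are illustrative.

```python
import math


def subsampling_layer_counts(num_maps, c_width, c_height):
    """Parameters, neurons and connections of a 2x2 down-sampling layer."""
    s_width = math.ceil(c_width / 2)
    s_height = math.ceil(c_height / 2)
    neurons = num_maps * s_width * s_height
    return {
        "map_size": (s_width, s_height),
        "params": num_maps * 2,              # one w and one b per feature map
        "neurons": neurons,
        "connections": neurons * (2 * 2 + 1),  # 4 pixels + 1 bias per neuron
    }


print(subsampling_layer_counts(6, 28, 28))
# {'map_size': (14, 14), 'params': 12, 'neurons': 1176, 'connections': 5880}
```

An odd input size such as 7x7 gives a 4x4 map under the ceiling rule.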
II. Calculation practice
The following takes Yann LeCun's CNN for handwritten digit recognition, "LeNet-5", as an example and calculates some of its parameters:
Figure 4
LeNet-5 network structure: INPUT-C1-S2-C3-S4-C5-F6
INPUT-C1:
There is one single-channel filter 5x5x1→6 with stride = 1. The filter size is 5x5x1 = 25, there is one additive bias, and the 6 filter types give the 6 feature maps of C1. Using the formulas above:
Trainable parameters = (5x5x1 + 1) x 6 = 156
Number of neurons per C1 feature map = floor((32-5)/1 + 1) x floor((32-5)/1 + 1) = 28x28, which is also the size of a C1 feature map.
Total number of neurons = 28x28x6 = 4,704
Number of connections per neuron = 5x5 + 1 (5x5 filter area + one additive bias)
Total number of connections = 28x28x6x (5x5+1) = 122,304
Also, do not confuse the filter area (filter_width x filter_height) with the filter size (filter_width x filter_height x filter_channels). In this example the two happen to have the same value because there is only one channel, but "5x5" and "5x5x1" mean very different things.
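The INPUT-C1 numbers can be reproduced with plain arithmetic. This is only a check of the figures above for a 32x32 input; the variable names are illustrative.

```python
# INPUT -> C1 of LeNet-5: 32x32 input, 5x5x1 filter, 6 filter types, stride 1.
input_size = 32
filter_w = filter_h = 5
filter_channels = 1
filter_types = 6
stride = 1

map_size = (input_size - filter_w) // stride + 1       # width = height
filter_area = filter_w * filter_h                      # 5x5
filter_size = filter_area * filter_channels            # 5x5x1

params = (filter_size + 1) * filter_types
neurons = map_size * map_size * filter_types
connections = neurons * (filter_area + 1)

print(map_size)     # 28
print(params)       # 156
print(neurons)      # 4704
print(connections)  # 122304
```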
C1-S2:
Trainable parameters = 6 x (1 + 1) = 12 (6 feature maps x (1 w + 1 b))
Number of neurons per S2 feature map = ceiling(28/2) x ceiling(28/2) = 14x14, which is also the size of an S2 feature map
Total number of neurons = 14x14x6 = 1,176
Number of connections per neuron = 2x2 + 1 (2x2 sampling area +1 bias)
Total number of connections = 14x14x6x (2x2 + 1) = 5,880
S2-C3:
Looking at Figure 4, the 6 feature maps of layer S2 are convolved to give the 16 feature maps of layer C3. The strategy used here is combination. First, think of the 6 feature maps of S2 as arranged in a ring, joined end to end.
Then there are 6 combinations of 3 adjacent feature maps, 6 combinations of 4 adjacent feature maps, and 9 combinations of 4 feature maps that are not all adjacent. This is the case where the neurons on a C3 feature map point to areas on several S2 feature maps.
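The three combination counts can be verified by brute force. The sketch below assumes the 6 S2 maps are numbered 0 to 5 around a ring and treats a subset as "adjacent" when it forms one contiguous run on that ring; the helper name is made up for illustration.

```python
from itertools import combinations

NUM_S2_MAPS = 6  # S2 feature maps, arranged in a ring


def is_adjacent_run(subset, n=NUM_S2_MAPS):
    """True if the maps in `subset` form one contiguous run on a ring of n maps."""
    members = set(subset)
    k = len(members)
    return any(members == {(start + i) % n for i in range(k)}
               for start in range(n))


triples = list(combinations(range(NUM_S2_MAPS), 3))
quads = list(combinations(range(NUM_S2_MAPS), 4))

adjacent_3 = [c for c in triples if is_adjacent_run(c)]
adjacent_4 = [c for c in quads if is_adjacent_run(c)]
non_adjacent_4 = [c for c in quads if not is_adjacent_run(c)]

print(len(adjacent_3), len(adjacent_4), len(non_adjacent_4))  # 6 6 9
```

The 9 comes from the 15 possible sets of 4 maps minus the 6 adjacent ones.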
For example, one allocation: the first 6 feature maps of C3 each take a set of 3 adjacent S2 feature maps as input, the next 6 each take a set of 4 adjacent S2 feature maps, the next 3 each take a set of 4 not-all-adjacent S2 feature maps (3 chosen from the 9), and the last feature map takes all the S2 feature maps as input. The filter is 5x5x1→16, stride = 1.
In this way, the total number of parameters to be trained = (5x5x1x3 + 1) x 6 + (5x5x1x4 + 1) x 6 + (5x5x1x4 + 1) x 3 + (5x5x1x6 + 1) x 1 = 1,516
The calculation is simple. Take the first term: (5x5x1x3 + 1) x 6 covers the first 6 feature maps. The neurons on one such map each point to a 5x5 area on three S2 feature maps, plus one bias, and the neurons of one feature map share weights. So one map has (5x5x1x3 + 1) parameters, and there are 6 maps that are similar but not identical (they connect to different S2 feature maps, so their parameters differ).
Total number of neurons in C3 = 16 x floor((14-5)/1 + 1) x floor((14-5)/1 + 1) = 16 x 10 x 10 = 1,600
Number of connections = (5x5x1x3+1) x100x6 + (5x5x1x4+1) x100x6 + (5x5x1x4+1) x100x3 + (5x5x1x6+1) x100x1 = 151,600
Why is there a "x1" in each expression above? Because the filter size includes not only width and height but also the number of channels. You really should think of the convolution kernel as three-dimensional; LeNet-5 just uses single-channel filters, so the thickness is 1. As for why 16 convolution kernels are used, perhaps that was settled by experiment?
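The S2-C3 totals can be recomputed from the allocation described above: 6 maps fed by 3 S2 maps, 6 + 3 maps fed by 4, and 1 map fed by all 6. The list below only records how many S2 maps feed each C3 map, not which ones, and the variable names are illustrative.

```python
# Number of S2 maps feeding each of the 16 C3 feature maps.
inputs_per_c3_map = [3] * 6 + [4] * 6 + [4] * 3 + [6] * 1

filter_size = 5 * 5 * 1                       # width x height x channels
s2_size, filter_w, stride = 14, 5, 1
c3_size = (s2_size - filter_w) // stride + 1  # 10
neurons_per_map = c3_size * c3_size           # 100

# One shared parameter set per C3 map: a 5x5x1 filter per input map plus a bias.
params_per_map = [filter_size * k + 1 for k in inputs_per_c3_map]

total_params = sum(params_per_map)
total_neurons = len(inputs_per_c3_map) * neurons_per_map
total_connections = sum(p * neurons_per_map for p in params_per_map)

print(total_params)       # 1516
print(total_neurons)      # 1600
print(total_connections)  # 151600
```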
C3-S4:
Trainable parameters = (1 + 1) x 16 = 32
Number of neurons per S4 feature map = ceiling(10/2) x ceiling(10/2) = 5x5, which is also the size of an S4 feature map
Total number of neurons = 25x16 = 400
Number of connections per neuron = 2x2 + 1 = 5 (2x2 sampling area + one additive bias)
Total number of connections = 5 x 400 = 2,000
S4-C5:
The filter for the S4-C5 layer is 5x5x1→120, so layer C5 has 120 feature maps. The size of each feature map is floor((5-5)/1 + 1) x floor((5-5)/1 + 1) = 1x1, so each feature map has only 1 neuron. The filter area is 5x5, which is exactly the size of an S4 feature map, so a C5 feature map is fully connected to each S4 feature map it connects to. Also, in LeNet-5 each feature map in C5 points to all the feature maps of S4. In other words, every neuron in C5 is connected to every pixel of layer S4.
Since different feature maps do not share weights, number of trainable parameters = number of connections = (5x5x1x16 + 1) x 120 = 48,120
Total number of neurons in the C5 layer: 120
C5-F6:
Layer F6 has 84 units, and C5 and F6 are likewise fully connected: number of trainable parameters = number of connections = (120 + 1) x 84 = 10,164
After that comes the traditional neural network part. What follows the CNN layers is covered in the paper, or in the blog linked at the end. For now I don't know much about that part. I need to study it properly.
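The S4-C5 and C5-F6 counts can be checked the same way. This is a plain arithmetic sketch of the two fully connected steps, with illustrative variable names.

```python
# S4 -> C5: 16 maps of 5x5, filter 5x5x1, 120 filter types, stride 1.
s4_maps, s4_size = 16, 5
filter_w = filter_h = 5
c5_maps = 120
f6_units = 84

c5_map_size = (s4_size - filter_w) // 1 + 1   # 1, so one neuron per C5 map
c5_neurons = c5_maps * c5_map_size * c5_map_size

# Each C5 map sees a 5x5 area on all 16 S4 maps, plus one bias.
c5_params = (filter_w * filter_h * 1 * s4_maps + 1) * c5_maps

# C5 -> F6: each F6 unit sees all 120 C5 neurons, plus one bias.
f6_params = (c5_neurons + 1) * f6_units

print(c5_map_size)  # 1
print(c5_params)    # 48120
print(f6_params)    # 10164
```

Because every C5 map has a single neuron, the number of connections equals the number of parameters in both steps.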
References:
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[2] http://blog.csdn.net/zouxy09/article/details/8781543