The biggest problem with fully connected neural networks is that the fully connected layers have too many parameters. Besides slowing down computation, this easily leads to overfitting. A more reasonable network structure is therefore needed to effectively reduce the number of parameters, and the convolutional neural network (CNN) does exactly that.

1. Convolutional Neural Network Composition

Figure 1: Convolutional neural network

The input of the whole network generally represents the pixel matrix of a picture. The leftmost three-dimensional matrix in Figure 1 represents an input image: the length and width of the matrix represent the size of the image, and its depth represents the image's color channels. A black-and-white image has a depth of 1, while an image in RGB color mode has a depth of 3.

The convolutional layer is the most important part of a CNN. Unlike a fully connected layer, each node in a convolutional layer takes as input only a small block of the previous layer, commonly of size 3x3 or 5x5. In general, the node matrix becomes deeper after passing through a convolutional layer.

The pooling layer does not change the depth of the three-dimensional matrix, but it can reduce the matrix's length and width. Pooling can be thought of as converting a high-resolution picture into a lower-resolution one. Through the pooling layer, the number of nodes entering the final fully connected layers is further reduced, thereby reducing the number of parameters in the whole network. The pooling layer itself has no trainable parameters.

The fully connected layer comes last; the activation function of the final layer uses Softmax.

After processing by multiple rounds of convolutional and pooling layers, the final classification result is produced at the end of the CNN by one or two fully connected layers. After several rounds of convolution and pooling, the information in the image can be assumed to have been abstracted into higher-level features. Convolution and pooling can thus be viewed as an automatic feature-extraction process; once feature extraction is complete, fully connected layers are still needed to complete the classification task.

For multi-class classification problems, choosing Softmax as the activation function of the last layer yields a probability distribution over the categories for each sample.

2. Convolutional Layer

2.1 Filter

The most important component of the convolutional network structure is the filter, shown as the yellow and orange 3x3x3 matrices in Figure 2. For the details of the convolution operation, refer to the convolution demo in Convolutional Neural Networks (CNNs/ConvNets) or to Figure 3.

Figure 2: Convolution operation

A filter transforms a sub-node matrix of the current layer into a unit node matrix of the next layer. A unit node matrix is a node matrix whose length and width are both 1 but whose depth is not limited.

For the convolution operation, pay attention to the number of filters $K$, the filter size $F$, the convolution stride $S$, and the padding size $P$. In Figure 2, $K = 2$, $F = 3$, $S = 1$, $P = 0$.

Commonly used filter sizes are 3x3 or 5x5; this corresponds to the first two dimensions of the yellow and orange matrices in Figure 2 and is set manually. The depth of a single filter's node matrix, the last dimension of the yellow and orange matrices in Figure 2, must equal the depth of the current layer's node matrix (for an RGB image, a depth of 3). The depth of the convolutional layer's output matrix (also called the filter depth of the layer) is determined by the number of filters in the layer; this parameter is also set manually and generally grows as convolution proceeds through the network.

The filters in Figure 2 are of size 3x3x3, and the filter depth (number of filters) is 2.

In the convolution operation, a 3x3x3 sub-node matrix and a 3x3x3 filter are multiplied element-wise, yielding a 3x3x3 matrix. All elements of this matrix are then summed to obtain a 1x1x1 matrix; the filter's bias is added, and the result is passed through the activation function to obtain the final value, which is filled into the corresponding position of the output matrix. The first element of the output matrix, $g(0, 0, 0)$, is computed as follows:

\begin{equation} g(0, 0, 0) = f\left(\sum_{x=0}^{2}\sum_{y=0}^{2}\sum_{z=0}^{2} a_{x,y,z} \times w_{x,y,z}^{(0)} + b^{(0)}\right) \end{equation}

In formula (1), $a_{x,y,z}$ denotes a sub-node matrix of the current layer, namely the 3x3x3 block in the upper-left corner of the 6x6x3 matrix; $w_{x,y,z}^{(0)}$ denotes the weights of the first filter, i.e. the value of the first filter at each position; $b^{(0)}$ denotes the bias of the first filter, which is a real number; and $f$ denotes an activation function, such as ReLU.
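As a sanity check, the forward pass of formula (1) can be sketched in plain NumPy. This is a naive, illustrative implementation (real frameworks compute convolutions very differently), and the function name `conv_forward` is made up for this sketch:

```python
import numpy as np

def conv_forward(a, w, b, stride=1, padding=0):
    """Naive convolution forward pass with ReLU.

    a: input node matrix of shape (N, N, D)
    w: K filters of shape (F, F, D, K)
    b: biases of shape (K,)
    """
    if padding > 0:
        a = np.pad(a, ((padding, padding), (padding, padding), (0, 0)))
    n = a.shape[0]
    f, _, _, k = w.shape
    n_out = (n - f) // stride + 1
    g = np.zeros((n_out, n_out, k))
    for i in range(n_out):
        for j in range(n_out):
            # the sub-node matrix covered by the filter at this position
            patch = a[i * stride:i * stride + f, j * stride:j * stride + f, :]
            for c in range(k):
                # element-wise multiply, sum everything, add bias, apply ReLU
                g[i, j, c] = max(0.0, np.sum(patch * w[:, :, :, c]) + b[c])
    return g

x = np.random.rand(6, 6, 3)       # input as in Figure 2: 6x6x3
w = np.random.rand(3, 3, 3, 2)    # K = 2 filters of size F = 3
b = np.zeros(2)
out = conv_forward(x, w, b, stride=1, padding=0)
print(out.shape)                  # (4, 4, 2)
```

Note that `out[0, 0, 0]` is exactly the $g(0, 0, 0)$ of formula (1): the element-wise product of the upper-left 3x3x3 block with the first filter, summed, plus the bias, passed through ReLU.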

"The forward propagation of the convolutional layer is done by moving each filter from the upper-left corner of the current layer's node matrix to the lower-right corner, computing the corresponding unit node matrix at each position along the way."

Figure 3: Convolution operation flow

(Note: the current-layer inputs in Figure 2 and Figure 3 are not the same; Figure 2 is 6x6x3, while Figure 3 is 5x5x3 with $P = 1$ padding.)

In Figure 3, the convolution stride is $S = 2$ and the padding size is $P = 1$.

2.2 Padding

Padding, as the name implies, fills in values around the image, most often zero padding, i.e. filling with zeros. When $P = 1$, one ring is filled around the image; when $P = 2$, two rings are filled.
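For instance, zero padding with $P = 1$ can be done directly with NumPy's `np.pad` (a tiny illustration on a 2D "image"):

```python
import numpy as np

x = np.ones((3, 3))      # a 3x3 "image" of ones
padded = np.pad(x, 1)    # zero padding with P = 1 (default constant mode)
print(padded.shape)      # (5, 5): one ring of zeros around the image
print(padded[0, 0])      # 0.0, a border pixel added by padding
print(padded[2, 2])      # 1.0, an original interior pixel
```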

Q: Why padding?

A: Two reasons. 1) Shrinking output: without padding, the image becomes smaller as convolution operations are applied. 2) Throwing away information from the edges of the image: a filter does not treat edge and interior information equally; some edge pixels are covered by the filter only once, while interior pixels are covered many times. In other words, without padding, fewer pixels in the next layer are influenced by edge information than by interior information.

Q: What are "valid" and "same" convolutions?

A: "Valid": no padding;

"Same": pad so that the output size is the same as the input size.
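A quick sketch comparing the two modes (assuming stride 1 and an odd filter size; the helper name `output_size` is made up):

```python
def output_size(n, f, p, s=1):
    # floor((N + 2P - F) / S) + 1
    return (n + 2 * p - f) // s + 1

n, f = 5, 3
print(output_size(n, f, p=0))             # "valid": 3, the image shrinks
print(output_size(n, f, p=(f - 1) // 2))  # "same": 5, size is preserved
```

For stride 1, "same" padding requires $P = (F - 1) / 2$, which is why odd filter sizes like 3x3 and 5x5 are convenient.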

2.3 Stride

The convolution stride is the step by which the filter moves, as in Figure 3, where the stride is $S = 2$.

The convolution stride applies only to the length and width dimensions of the input matrix.

The output matrix size $\mbox{size}_{output}$ is related to the input image size $N$, the filter size $F$, the padding size $P$, and the convolution stride $S$ (assuming the input image is square):

\begin{equation} \mbox{size}_{output} = \left\lfloor \frac{N + 2P - F}{S} \right\rfloor + 1 \end{equation}

When $S \neq 1$, $\frac{N + 2P - F}{S}$ may not be an integer, in which case it is rounded down (floor division).

According to "TensorFlow: Google Deep Learning Framework in Practice", $\mbox{size}_{output}$ can also be written as follows:

\begin{equation} \mbox{size}_{output} = \left\lceil \frac{N + 2P - F + 1}{S} \right\rceil \end{equation}

Formulas (2) and (3) always give the same final result.
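That the two formulas agree can be checked numerically with a small brute-force sweep (an illustrative check; the helper names are made up):

```python
import math

def output_size_floor(n, f, p, s):
    # formula (2): floor((N + 2P - F) / S) + 1
    return (n + 2 * p - f) // s + 1

def output_size_ceil(n, f, p, s):
    # formula (3): ceil((N + 2P - F + 1) / S)
    return math.ceil((n + 2 * p - f + 1) / s)

# Figure 3's setting: N = 5, F = 3, P = 1, S = 2
print(output_size_floor(5, 3, 1, 2))  # 3

# the two formulas agree across a range of settings
for n in range(3, 12):
    for f in (3, 5):
        for p in (0, 1, 2):
            for s in (1, 2, 3):
                assert output_size_floor(n, f, p, s) == output_size_ceil(n, f, p, s)
```

The equality follows from the integer identity $\lfloor m / S \rfloor + 1 = \lceil (m + 1) / S \rceil$ with $m = N + 2P - F$.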

3. Pooling Layer

The pooling layer can reduce the size of the matrix very effectively (mainly reducing the length and width, generally not the depth), thereby reducing the number of parameters in the final fully connected layers. "Using a pooling layer can both speed up computation and help prevent overfitting."

Similar to the convolutional layer, the forward propagation of the pooling layer is done through a filter-like structure. However, the calculation in a pooling filter is not a weighted sum of nodes, but a simpler maximum or average operation. The pooling layer using the maximum operation is called the max pooling layer, which is the most commonly used pooling structure. The pooling layer using the averaging operation is called the average pooling layer.

Like the filters of a convolutional layer, the filter of a pooling layer also requires manually setting the filter size, whether to use zero padding, and the stride by which the filter moves; these settings have the same meanings as before.

Filter movement in a convolutional layer and a pooling layer is similar, except that a convolutional layer's filter spans the entire input depth, while a pooling layer's filter acts on only one depth slice at a time. So besides moving along the length and width dimensions, a pooling filter also moves along the depth dimension. That is, the max or average operation is performed within a single depth slice, never across depth slices.

Figure 4: Max pooling

In Figure 4, the pooling filter size is 2x2, i.e. $F = 2$; the padding size is $P = 0$; and the filter stride is $S = 2$.

The pooling layer generally does not change the depth of the matrix, only its length and width.

The pooling layer has no trainable parameters, only some hyper-parameters that need to be manually set.
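The max pooling operation described above can be sketched in NumPy as well (an illustrative implementation with a made-up function name; note the maximum is taken within each depth slice independently):

```python
import numpy as np

def max_pool(a, f=2, stride=2):
    """Naive max pooling over an (N, N, D) matrix, per depth slice."""
    n, _, d = a.shape
    n_out = (n - f) // stride + 1
    out = np.zeros((n_out, n_out, d))
    for i in range(n_out):
        for j in range(n_out):
            patch = a[i * stride:i * stride + f, j * stride:j * stride + f, :]
            # max over length and width only, never across depth
            out[i, j, :] = patch.max(axis=(0, 1))
    return out

x = np.arange(32, dtype=float).reshape(4, 4, 2)  # a 4x4x2 input
y = max_pool(x, f=2, stride=2)                   # F = 2, S = 2 as in Figure 4
print(y.shape)  # (2, 2, 2): length and width halved, depth unchanged
```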

4. Characteristics of convolutional neural networks

**Local connection (sparse connection, sparsity of connections):** a position in the convolution output matrix is related only to part of the input matrix, not the whole input matrix. A feature in the convolutional layer's output may depend on only a portion of the input picture, with no connection to information elsewhere; local connection lets a feature focus only on the part it should attend to. It also reduces the number of parameters in the network.
**Parameter sharing:** the parameters of a filter within the same convolutional layer are shared; a filter's values are the same regardless of where the convolution is applied. (Of course, different filters in the same layer have different parameters, as do filters in different layers.) Sharing filter parameters makes the recognition of image content insensitive to position. Taking MNIST handwritten digit recognition as an example, whether the digit "1" appears in the upper left or the lower right corner, the class of the picture is unchanged. Sharing filter parameters also drastically reduces the number of parameters in the network.

The convolutional layer in Figure 2 has 3x3x3x2+2 trainable parameters, where "3x3x3" is the filter size, "x2" is the depth/number of filters, and "+2" is the biases of the 2 filters. A convolutional layer has far fewer parameters than a fully connected layer in the same situation. Moreover, the number of convolutional layer parameters is independent of the input image size, which lets convolutional neural networks scale well to larger image data.

The number of trainable parameters of a convolutional layer is related only to the filter size (the length, width, and depth of a single filter matrix) and the depth (number) of the filters. The depth of a single filter matrix equals the number of channels of the input image (i.e. the depth of the input matrix).
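The counts above can be reproduced with a one-line formula (the helper name `conv_params` is made up; the fully connected comparison maps the 6x6x3 input of Figure 2 to its 4x4x2 output):

```python
def conv_params(f, in_depth, k):
    # F x F x depth weights per filter, times K filters, plus K biases
    return f * f * in_depth * k + k

# the convolutional layer of Figure 2: 3x3x3 filters, K = 2
print(conv_params(3, 3, 2))                 # 56 = 3x3x3x2 + 2

# a fully connected layer from a 6x6x3 input to a 4x4x2 output
print(6 * 6 * 3 * (4 * 4 * 2) + 4 * 4 * 2)  # 3488 parameters
```

The gap only widens with larger images: the convolutional layer's 56 parameters stay fixed, while the fully connected count grows with both input and output sizes.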

The pooling layer has no trainable parameters.

Note: "trainable parameters" in this article refers to parameters that can be updated by gradient descent in a deep learning model, such as the values in each filter matrix and the filters' biases. Hyper-parameters are settings fixed before the model runs, such as the length and width in the filter size, the depth (number) of filters, the filter stride, and the padding size.

References

Convolutional Neural Networks (CNNs/ConvNets)

Course 4: Convolutional Neural Networks, by Andrew Ng

"TensorFlow: Google Deep Learning Framework in Practice"