TensorFlow Deep Learning: Convolutional Neural Networks (CNN)
I. Convolutional Neural Network Overview
The convolutional neural network (CNN) was originally designed for image recognition and similar problems. Its applications today are not limited to images and video; it can also be used on time-series signals such as audio, and on text data. As a deep learning architecture, the CNN was first proposed to reduce the need for image preprocessing and to avoid complex feature engineering. In a CNN, the first convolution layer receives pixel-level input directly from the image, and each convolution (filter) layer extracts the most effective features from its input. In this way the network extracts the most basic features of the image and then combines and abstracts them into higher-order features. As a result, a CNN is in theory invariant to image scaling, translation, and rotation.
The key ideas of a CNN are local connectivity, weight sharing, and down-sampling in the pooling layer. Local connectivity and weight sharing reduce the number of parameters, which greatly lowers training complexity and reduces overfitting. Weight sharing also gives the convolutional network tolerance to translation, while down-sampling in the pooling layer further reduces the number of outputs and gives the model tolerance to mild deformation, which improves generalization. The convolution operation can be understood as extracting similar features from many positions in the image using a small number of parameters.
Spatial arrangement of the convolution layer: so far we have covered how each neuron in the convolution layer connects to the input volume, but not how many neurons the output volume contains or how they are arranged. Three hyperparameters control the size of the output volume: depth, stride, and zero-padding. First, the depth of the output volume is a hyperparameter: it equals the number of filters used, and each filter looks for something different in the input. Second, the stride must be specified when sliding the filter. Third, it is sometimes convenient to pad the border of the input volume with zeros; the amount of this zero-padding is a hyperparameter. Zero-padding has a useful property: it lets us control the spatial size of the output volume (most commonly it is used to preserve the spatial size of the input, so that input and output have equal width and height).
The spatial size of the output volume is a function of the input size (W), the receptive field size of the neurons in the convolution layer (F), the stride (S), and the amount of zero-padding (P). Assuming the input is square (height equals width), the output size is (W - F + 2P)/S + 1. The formula is applied to the width and the height of the input, while the output depth is set by the number of filters.
Constraint on the stride: note that these spatial hyperparameters constrain one another. For example, with input size W = 10, no zero-padding (P = 0), and filter size F = 3, a stride of S = 2 does not work: (10 - 3 + 0)/2 + 1 = 4.5 is not an integer, which means the neurons cannot be tiled across the input volume neatly and symmetrically.
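The output-size formula and the stride constraint can be checked with a few lines of Python. This is only a minimal sketch: the function name conv_output_size is made up for illustration and is not part of TensorFlow, and it assumes a square input as in the text.

```python
def conv_output_size(w: int, f: int, s: int, p: int = 0) -> int:
    """Spatial output size of a convolution layer: (W - F + 2P)/S + 1.

    w: input width/height, f: filter (receptive field) size,
    s: stride, p: zero-padding added on each side.
    """
    if w <= 0 or f <= 0 or s <= 0 or p < 0:
        raise ValueError("w, f, s must be positive and p non-negative")
    span = w - f + 2 * p
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % s != 0:
        # The filter cannot be tiled evenly across the input.
        raise ValueError(
            f"(W - F + 2P)/S + 1 = {span / s + 1} is not an integer"
        )
    return span // s + 1


print(conv_output_size(10, 3, 1))        # 8
print(conv_output_size(10, 3, 1, p=1))   # 10, padding preserves the size

try:
    conv_output_size(10, 3, 2)           # the W=10, F=3, P=0, S=2 case
except ValueError as err:
    print("invalid setting:", err)
```

Raising an error for a non-integer result is a design choice of this sketch; real frameworks may instead crop or pad the input to make the sizes fit.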
The pooling layer uses the MAX operation, acting on each depth slice of the input volume independently to change its spatial size. The most common form uses a 2x2 filter with a stride of 2 to down-sample every depth slice, discarding 75% of the activations. Each MAX operation takes the maximum of four numbers (a 2x2 region of the depth slice). The depth stays unchanged.
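The 2x2, stride-2 MAX operation on a single depth slice can be sketched in plain Python. The function name and the sample values below are made up for illustration; a real model would use the pooling op of its framework.

```python
def max_pool_2x2(depth_slice):
    """2x2 max pooling with stride 2 on one depth slice (a list of rows)."""
    height, width = len(depth_slice), len(depth_slice[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")
    return [
        [
            # Maximum of the four numbers in one 2x2 region.
            max(
                depth_slice[r][c], depth_slice[r][c + 1],
                depth_slice[r + 1][c], depth_slice[r + 1][c + 1],
            )
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


# Hypothetical 4x4 depth slice: 16 activations go in, 4 come out.
depth_slice = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool_2x2(depth_slice))  # [[6, 8], [3, 4]]
```

Applying the same function to every depth slice in turn leaves the depth of the volume unchanged, as described above.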
II. Convolutional Neural Network Structure
A convolutional neural network is usually built from three types of layers: the convolution layer, the pooling layer (max pooling unless stated otherwise), and the fully connected layer (FC). The ReLU activation function should also be counted as a layer, one that applies the activation element by element.
The most common form of convolutional neural network stacks a few convolution and ReLU layers, follows them with a pooling layer, and repeats this pattern until the image has been reduced spatially to a small enough size. At some point it is common to transition to fully connected layers. The last fully connected layer produces the output, such as class scores.
The most common convolutional neural network structure is as follows:
INPUT -> [[CONV -> RELU] * N -> POOL?] * M -> [FC -> RELU] * K -> FC
Here * denotes repetition and POOL? denotes an optional pooling layer, where N >= 0 (usually N <= 3), M >= 0, and K >= 0 (usually K < 3).
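To make the pattern concrete, the sketch below expands it into a flat list of layer names for given N, M, and K. The function build_layer_pattern is hypothetical and only manipulates strings; it does not build a trainable model.

```python
from typing import List


def build_layer_pattern(n: int, m: int, k: int, pool: bool = True) -> List[str]:
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC."""
    if min(n, m, k) < 0:
        raise ValueError("n, m and k must be non-negative")
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n   # N convolution + ReLU pairs
        if pool:                         # POOL? is optional
            layers.append("POOL")
    layers += ["FC", "RELU"] * k         # K hidden fully connected layers
    layers.append("FC")                  # final layer producing the output
    return layers


print(" -> ".join(build_layer_pattern(n=2, m=2, k=1)))
```

With n=0, m=0, k=0 the pattern collapses to INPUT -> FC, which is a plain linear classifier.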
A stack of several small-filter convolution layers is better than a single large-filter convolution layer. The stack can express more powerful features of the input while using fewer parameters. The only drawback is that the intermediate convolution layers may need more memory during backpropagation.
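A rough count illustrates the parameter saving. The sketch below compares three stacked 3x3 layers with one 7x7 layer, which cover the same 7x7 input region when the stride is 1. The channel count of 64, the choice of these two configurations, and the omission of bias terms are all assumptions made for the example.

```python
def conv_weights(filter_size: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in one convolution layer (biases ignored)."""
    return filter_size * filter_size * in_channels * out_channels


def stacked_receptive_field(num_layers: int, filter_size: int) -> int:
    """Input region seen by one output neuron after stacking stride-1 layers."""
    return 1 + num_layers * (filter_size - 1)


channels = 64  # assumed: same number of channels in and out of every layer

stacked_small = 3 * conv_weights(3, channels, channels)
single_large = conv_weights(7, channels, channels)

print(stacked_receptive_field(3, 3))  # 7, same coverage as one 7x7 filter
print(stacked_small)                  # 110592
print(single_large)                   # 200704
```

The stack also applies a nonlinearity after each of its three layers, which is one reason it can express richer features than the single large layer.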
Input layer (the one holding the image): its size should be divisible by 2 many times. Common sizes include 32 (for example, CIFAR-10), 64, 96 (for example, STL-10), 224 (for example, common ImageNet convolutional networks), 384, and 512.
Convolution layer: use small filters (such as 3x3, or at most 5x5) with stride S = 1. It is also important to zero-pad the input so that the convolution layer does not change its spatial size. In general, with S = 1 and any odd F, the input size is preserved when P = (F - 1)/2. If a larger filter (such as 7x7) is needed, it is usually used only in the first convolution layer, which looks at the raw image.
Pooling layer: down-samples the spatial dimensions of the input, which improves the model's tolerance to distortion. The most common setting is max pooling with a 2x2 receptive field and a stride of 2. Note that this discards 75% of the activations in the input (because both width and height are down-sampled by a factor of 2). Another, less common setting uses a 3x3 receptive field with a stride of 2. The receptive field of max pooling rarely exceeds 3, because pooling then becomes too aggressive and loses too much information, which usually hurts performance.
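These sizing rules are easy to verify numerically. The sketch below checks that P = (F - 1)/2 preserves the input width at stride 1, and that 2x2 pooling with stride 2 halves the width and drops 75% of the activations. The helper names and the input width of 32 are assumptions chosen for the example.

```python
def conv_output_size(w: int, f: int, s: int, p: int) -> int:
    """Convolution output size, (W - F + 2P)/S + 1."""
    return (w - f + 2 * p) // s + 1


def same_padding(f: int) -> int:
    """Zero-padding P = (F - 1)/2; only an integer when F is odd."""
    if f % 2 == 0:
        raise ValueError("P = (F - 1)/2 needs an odd filter size")
    return (f - 1) // 2


def pool_output_size(w: int, f: int, s: int) -> int:
    """Pooling output size without zero-padding, (W - F)/S + 1."""
    return (w - f) // s + 1


width = 32  # assumed input width, e.g. a CIFAR-10 sized image

for f in (3, 5, 7):
    p = same_padding(f)
    print(f"F={f} P={p} -> output width {conv_output_size(width, f, 1, p)}")

pooled = pool_output_size(width, 2, 2)
kept = (pooled * pooled) / (width * width)
print(f"2x2 pool, stride 2: {width} -> {pooled}, discards {1 - kept:.0%}")
```

Because the convolution layers keep the spatial size fixed, all down-sampling in such a design is left to the pooling layers, which is why an input size divisible by 2 many times is convenient.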
III. CNNs share convolution weights (parameter sharing), which greatly reduces the number of parameters in the network, helping to prevent overfitting while lowering model complexity. How should this be understood?
Assume the input image is 1000*1000 and grayscale, that is, it has a single color channel. The image then has 1 million pixels, so the input dimension is 1 million. If it is connected to a fully connected layer whose hidden layer is the same size as the input (1 million hidden nodes), there are 1 million * 1 million = 1 trillion connections. That single layer alone has 1 trillion parameters to train, which is impractical.
Consider the concept of the receptive field in human vision: each receptive field only receives signals from a small region. A neuron does not need information from every pixel; it only needs local pixels as input, and combining the local information gathered by all of these neurons yields global information. The fully connected scheme can therefore be replaced by local connections. Assume the local receptive field is 10*10, so each hidden node connects to only 10*10 pixels. We now need only 10*10 * 1 million = 100 million connections, 10,000 times fewer than the previous 1 trillion.
Now assume the local connections take the form of a convolution, that is, every hidden node uses identical parameters. The number of parameters then drops to 10*10 = 100. Regardless of the image size, there are only these 100 parameters, which is the size of the convolution kernel; this is how convolution reduces the parameter count, and it is what weight sharing means. To extract more features, we increase the number of convolution kernels. The image produced by filtering with one convolution kernel is a mapping of one type of feature, that is, a feature map. In general, 100 convolution kernels in the first convolution layer are enough, giving 100 * 100 = 10,000 parameters, 10,000 times fewer than the previous 100 million.
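The arithmetic in this example can be reproduced directly. The sketch below uses the same numbers as the text (1000*1000 grayscale image, 10*10 receptive field, 100 kernels) and, like the text, ignores bias terms; the function and key names are made up for illustration.

```python
def connection_counts(side: int = 1000, field_side: int = 10, kernels: int = 100):
    """Parameter counts for the three connection schemes discussed above."""
    pixels = side * side             # 1 million input values
    hidden_nodes = pixels            # hidden layer as large as the input
    field = field_side * field_side  # 10*10 local receptive field
    return {
        "fully_connected": pixels * hidden_nodes,   # every pixel to every node
        "locally_connected": field * hidden_nodes,  # each node sees 10*10 pixels
        "one_shared_kernel": field,                 # all nodes share one kernel
        "all_shared_kernels": field * kernels,      # 100 kernels, 100 weights each
    }


counts = connection_counts()
for name, value in counts.items():
    print(f"{name:>20}: {value:,}")

# Both reductions are by a factor of 10,000.
print(counts["fully_connected"] // counts["locally_connected"])     # 10000
print(counts["locally_connected"] // counts["all_shared_kernels"])  # 10000
```

Changing side in the call leaves the two shared-kernel counts untouched, which is the point of the example: with weight sharing the parameter count no longer depends on the image size.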
The advantage of convolution is that, regardless of the image size, the number of parameters to train depends only on the size and number of convolution kernels. Note that although the number of parameters drops sharply, the number of hidden nodes does not: it depends only on the convolution stride.