The Composition of a Convolutional Neural Network
Image classification can be described as follows: given a test image I ∈ R^{W×H×C} as input, decide which category the image belongs to. Here W is the width of the image, H is the height, and C is the number of channels; C = 3 for a color image and C = 1 for a grayscale image. The total number of categories is fixed in advance, for example 1000 categories in the ImageNet contest and 10 in CIFAR10. A convolutional neural network can be viewed as such a black box: the input is the original image I and the output is an L-dimensional vector v ∈ R^L, where L is the number of preset categories. Each element of v indicates how likely the image is to belong to the corresponding category. For a single-label recognition problem, where each image is assigned exactly one of the L labels, the elements of v are compared and the label corresponding to the largest value is taken as the classification result. v can take the form of a probability distribution, that is, each element satisfies 0 ≤ v_i ≤ 1 and Σ_i v_i = 1, where v_i denotes the i-th element of v. Alternatively, each element can be any real number from negative infinity to positive infinity, with larger values indicating a higher likelihood of belonging to the corresponding category. Internally, a convolutional neural network is composed of many layers. Each layer can be regarded as a function whose input is a signal x and whose output is a signal y = f(x); the output y can in turn serve as the input to other layers. Below, the definitions of commonly used layers are surveyed from the perspective of the front, middle, and end of the network. The front mainly concerns the preprocessing of images, the middle consists of the various neuron layers, and the end mainly concerns the loss functions used for training the network.
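As a minimal sketch of this black-box view (the softmax function used here is an assumption for obtaining the probability-distribution form; it is not named in the text): the network's L-dimensional output v can be mapped to probabilities that sum to 1, and for a single-label problem the predicted category is the index of the largest element.

```python
import numpy as np

def softmax(scores):
    """Map arbitrary real scores to a probability distribution:
    each element lies in [0, 1] and all elements sum to 1."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

v = np.array([1.2, -0.3, 4.0, 0.5, 2.2])   # toy output for L = 5 categories
p = softmax(v)
predicted_label = int(np.argmax(p))        # label with the largest probability
print(p.sum(), predicted_label)            # approximately 1.0, and label 2
```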
2 The Front of the Network
The front of the network refers to the processing of the image data, and this part can be called the data layer.
2.1 Data Cropping
The size of the input images may vary: some have a higher resolution, some a lower one, and their aspect ratios are not necessarily the same. In theory such inconsistencies can be handled directly, but this requires the other layers of the network to support variable-sized input. In most cases, the images are instead cropped to a fixed resolution. During training, the crop position is chosen at random within the original image, with the only constraint that the cropped sub-image must lie entirely inside the image. Cropping randomly effectively adds extra data, which helps alleviate overfitting.
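A minimal sketch of such a random crop, assuming the image is a NumPy array of shape (H, W, C) (the function name and interface are illustrative, not from the original text):

```python
import numpy as np

def random_crop(image, crop_h, crop_w, rng=np.random):
    """Randomly crop an (H, W, C) image to (crop_h, crop_w, C).

    The offset is drawn uniformly so that the cropped sub-image
    always lies completely inside the original image.
    """
    h, w, _ = image.shape
    assert h >= crop_h and w >= crop_w, "crop must fit inside the image"
    top = rng.randint(0, h - crop_h + 1)    # valid vertical offsets
    left = rng.randint(0, w - crop_w + 1)   # valid horizontal offsets
    return image[top:top + crop_h, left:left + crop_w, :]
```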
2.2 Color Disturbance
After the original image is cropped, each pixel value is a fixed integer between 0 and 255. Further processing includes subtracting the mean and proportionally scaling the pixel values so that they fall in the range [−1, 1]. Besides these routine operations, the colors of the image can also be randomly perturbed, which amounts to data augmentation; see, for example, the CIFAR10 data preprocessing in [9, 18, 17]: for each pixel, one of the three RGB channels is chosen at random, and a value drawn at random from [−20, 20] is added to the original pixel value.
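A minimal sketch of this preprocessing, assuming uint8 pixels in [0, 255]; the mean of 127.5 used for normalization and the function names are assumptions, while the per-pixel channel perturbation follows the [−20, 20] rule described above:

```python
import numpy as np

def normalize(image):
    """Scale uint8 pixels from [0, 255] to floats in [-1, 1] by subtracting
    an assumed mean of 127.5 and scaling proportionally."""
    return (image.astype(np.float32) - 127.5) / 127.5

def random_color_perturbation(image, rng=np.random):
    """For each pixel, pick one RGB channel at random and add a random
    offset drawn from [-20, 20], as in the CIFAR10-style preprocessing."""
    h, w, c = image.shape
    out = image.astype(np.int16)
    channel = rng.randint(0, c, size=(h, w))      # one channel per pixel
    offset = rng.randint(-20, 21, size=(h, w))    # value in [-20, 20]
    rows, cols = np.indices((h, w))
    out[rows, cols, channel] += offset
    return np.clip(out, 0, 255).astype(np.uint8)
```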
3 The Middle of the Network
The following defines the layers commonly used in convolutional neural networks, that is, what dimensions the input data x has, what dimensions the output y has, and how the output is obtained from the input.
3.1 Basic Components of Convolutional Neural Networks
[Figure: the basic components of a convolutional neural network]
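The layer-as-a-function view described in the introduction can be sketched in code. This is a hypothetical illustration (the specific layers used here, an elementwise nonlinearity and a fully connected layer, are placeholders and are not defined in this section): each layer maps an input signal x to an output y = f(x), and layers are chained so that one layer's output becomes the next layer's input.

```python
import numpy as np

def relu_layer(x):
    """An elementwise nonlinearity: output y has the same shape as input x."""
    return np.maximum(x, 0.0)

def fully_connected_layer(x, weights, bias):
    """Maps an input vector of size D_in to an output vector of size D_out."""
    return weights @ x + bias

# Chaining layers: the output of one layer serves as the input of the next.
x = np.random.randn(8)                          # a toy input signal
W1, b1 = np.random.randn(16, 8), np.zeros(16)   # maps 8 -> 16
W2, b2 = np.random.randn(4, 16), np.zeros(4)    # maps 16 -> 4
y = fully_connected_layer(relu_layer(fully_connected_layer(x, W1, b1)), W2, b2)
print(y.shape)                                  # (4,)
```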
3.2 Convolution Layer
The input to the convolution layer is represented as x ∈ R^{W×