Building on traditional polynomial regression, neural networks take inspiration from the "activation" phenomenon of biological neural networks: the machine-learning model is built up from activation functions.
In image processing, the sheer amount of data means the number of network parameters becomes very large, and convolution kernels are used to address this problem. A convolution kernel scans the image locally and extracts its features. Because a small kernel cannot capture global features on its own, the number of network layers is increased: as layers of small convolution kernels are stacked, the receptive field of the later kernels gradually enlarges. Moreover, a ReLU activation function is applied after each convolution, which introduces more nonlinearity into the model and enhances the fitting ability of the network.
Convolution
There are a number of explanations of what convolution means. The most classic one: the "dimensionality-reduction strike".
Convolution "rolls" a two-variable function U(x, y) = f(x)g(y) into a one-variable function V(t), hence the popular nickname.
Image.png
1) How do we "roll"?
Since f and g play symmetric roles (as do the variables x and y), a natural choice is to roll up along the line x + y = t.
2) What is convolution useful for?
It can be used for multi-digit (polynomial) multiplication, for example:
Image.png
Note the sequence (14, 34, 14, 4) of coefficients in the brackets to the right of the second equals sign: it is exactly the convolution of the sequences (2, 4) and (7, 3, 1).
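As a quick sanity check, NumPy's `np.convolve` reproduces this coefficient sequence:

```python
import numpy as np

# Multiplying the polynomials (2x + 4) and (7x^2 + 3x + 1):
# the coefficient sequence of the product is the convolution
# of the coefficient sequences (2, 4) and (7, 3, 1).
a = [2, 4]
b = [7, 3, 1]
product = np.convolve(a, b)
print(product)  # [14 34 14  4]
```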
There is a more intuitive picture of this "multiply and add" operation:
Image.png
The sequence in the left figure stays fixed; the sequence in the right figure is reversed and then rotated step by step. At each step, the overlapping positions of the two sequences are multiplied and summed, yielding one element of the circular convolution. Taking the convolution above as an example:
Image.png
(This is reminiscent of a mechanical calculator, which may well work on a similar principle.)
A popular understanding of signal convolution
With this understanding, we can interpret convolution from the point of view of a linear time-invariant (LTI) system. We take discrete signals as the example; the continuous case is analogous.
Given x[0] = a, x[1] = b, x[2] = c:
Image.png
Given y[0] = i, y[1] = j, y[2] = k:
Image.png
The following demonstrates the physical meaning of convolution by working through the computation of x[n]*y[n].
Step 1: multiply x[n] by y[0] and translate it to position 0:
Image.png
Step 2: multiply x[n] by y[1] and translate it to position 1:
Image.png
Step 3: multiply x[n] by y[2] and translate it to position 2:
Image.png
Finally, by stacking the above three graphs, you get:
Image.png
From this you can see the important physical meaning of convolution: the weighted superposition of one function (such as a unit response) along another function (such as an input signal).
For a linear time-invariant system, if the unit response of the system is known, then convolving the unit response with the input signal is equivalent to weighting the unit response by each sample of the input signal and superimposing the results, which directly yields the output signal. In popular terms:
In each position of the input signal, a unit response is superimposed, and the output signal is obtained.
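This "shift, scale, and superimpose" view can be sketched numerically (the signal and response values below are made up for illustration):

```python
import numpy as np

# Sketch of the "weighted superposition" view: the output of an LTI
# system is the sum of unit responses h, each shifted to position n
# and scaled by the input sample x[n].
x = np.array([3.0, 1.0, 2.0])   # input signal (example values)
h = np.array([1.0, 0.5, 0.25])  # unit (impulse) response (example values)

# Build the output by explicit shift-and-scale superposition.
out = np.zeros(len(x) + len(h) - 1)
for n, xn in enumerate(x):
    out[n:n + len(h)] += xn * h

# This matches the convolution x * h computed directly.
print(np.allclose(out, np.convolve(x, h)))  # True
```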
This is why the unit response is so important.
Convolutional neural networks
In image recognition, the convolution kernels (filters) of a convolutional neural network are used to extract features from the image.
How exactly are features extracted? Take cats: the salient features are round eyes, triangular ears, and a pointed chin. But a fox looks much the same, only with bigger ears and an even sharper chin. Such small distinctions are hard to describe by hand.
Introducing the convolution kernel
The first problem image recognition must solve is: what counts as a feature? The second is keeping the number of parameters under control.
Since an image can be viewed as a data matrix, it can be fed into a traditional neural network; the network output is passed through a sigmoid function, or the logits through a softmax function, and compared with the label via cross-entropy to obtain the probability that the image belongs to each category. Gradient descent is then used to train the network parameters.
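A minimal sketch of the softmax + cross-entropy step described above, with made-up logits:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability before exponentiating.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(probs, label):
    # Negative log-probability assigned to the true class.
    return -np.log(probs[label])

logits = np.array([2.0, 1.0, 0.1])  # made-up network outputs
probs = softmax(logits)
print(round(probs.sum(), 6))        # 1.0 -- softmax yields a probability distribution
# The loss is smaller when the true class gets higher probability.
print(cross_entropy(probs, label=0) < cross_entropy(probs, label=2))  # True
```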
However, given the size of an image matrix, a traditional network does not scale: for a 1000x1000 image, taking 10^6 hidden-layer nodes as an example, the number of parameters is 10^3 x 10^3 x 10^6 = 10^12 (left figure below).
Image.png
However, if a convolution kernel is used to extract features, taking a 10x10 kernel as an example, the number of parameters drops to 10x10x10^6 = 10^8. The dimensionality reduction is clearly significant.
Parameter sharing
Another effective way to reduce the number of parameters is parameter sharing.
How should we understand the value of weight sharing? The kernel's 100 parameters (i.e., one convolution operation) are treated as a way of extracting a feature, independent of position. The implicit assumption is that the statistical properties of one part of an image are the same as those of any other part. This means a feature learned in one region can also be applied in another, so the same learned feature can be used at every position of the image. The figure below gives a visual picture: the same convolution kernel is scanned over the whole image, and where the result is larger, the local data better matches the kernel; local image features are extracted in this way.
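The parameter reductions described above, spelled out:

```python
# Parameter counts for a 1000x1000 image with 10^6 hidden nodes.
pixels = 1000 * 1000

fully_connected = pixels * 10**6     # every pixel connected to every node
locally_connected = 10 * 10 * 10**6  # each node sees only a 10x10 patch
shared_kernel = 10 * 10              # one 10x10 kernel reused at every position

print(fully_connected, locally_connected, shared_kernel)
# 1000000000000 100000000 100
```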
Image.png
As the figure shows, a convolution kernel with fixed values scans across the image matrix. The specific computation is shown in the next figure: the kernel and the scanned region are multiplied element-wise and summed, and the result becomes the value at the corresponding position of the convolved output. This is why the features extracted by a convolutional network retain their spatial location.
Image.png
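A minimal sketch of this scan-multiply-sum computation (the 2x2 "diagonal" kernel and 3x3 image below are toy values): the response is largest exactly where the covered patch matches the kernel's pattern.

```python
import numpy as np

def scan(image, kernel):
    """Slide the kernel over the image; at each position, multiply
    element-wise with the covered region and sum (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1., 0.],
                   [0., 1.]])      # "diagonal" feature detector
image = np.array([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.]])   # contains a diagonal
print(scan(image, kernel))
# [[2. 0.]
#  [0. 2.]]  -- strongest response where the diagonal pattern sits
```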
Obviously, the closer the data distribution of an image region is to the convolution kernel, the larger the resulting value. During training, the values of the kernel gradually approach the distribution of the data in the images, and in this way the feature shared by similar images (those with the same label) is gradually extracted.
Multiple convolution kernels
When there are multiple convolution kernels, as the following figure shows:
Image.png
On the right side of the figure, different colors indicate different convolution kernels. Each kernel turns the image into a different feature map. For example, two kernels produce two maps, which can be viewed as different channels of a single image.
Pooling
After features have been obtained through convolution, the next step is to use them for classification. In theory one could train a classifier, such as a softmax classifier, on all the extracted features, but the computational cost is a challenge. For example: for a 96x96-pixel image, suppose we have learned 400 features over 8x8 inputs. Convolving each feature with the image yields a (96-8+1) x (96-8+1) = 7921-dimensional convolved feature, and with 400 features each sample gets a 7921 x 400 = 3,168,400-dimensional convolved feature vector. Learning a classifier with more than 3 million input features is unwieldy and prone to over-fitting.
To solve this problem, first recall why we chose convolution in the first place: images have a "stationarity" property, meaning features useful in one image region are likely to be equally useful in another. Therefore, to describe a large image, a natural idea is to aggregate statistics of the features at different locations; for example, one can compute the mean (or maximum) of a particular feature over a region of the image. These summary statistics not only have much lower dimensionality (compared with using all the extracted features) but also tend to improve results (less over-fitting). This aggregation operation is called pooling, sometimes mean pooling or max pooling depending on how the aggregate is computed.
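A small sketch of the pooling step, assuming non-overlapping 2x2 regions and a made-up 4x4 feature map:

```python
import numpy as np

def pool(feature_map, size=2, mode="max"):
    """Aggregate non-overlapping size x size regions by max or mean."""
    h, w = feature_map.shape
    blocks = (feature_map[:h - h % size, :w - w % size]
              .reshape(h // size, size, w // size, size))
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [9., 8., 3., 2.],
               [7., 6., 1., 0.]])
print(pool(fm, 2, "max"))   # [[4. 8.]
                            #  [9. 3.]]
print(pool(fm, 2, "mean"))  # [[2.5 6.5]
                            #  [7.5 1.5]]
```

Either way, the 4x4 map shrinks to 2x2, cutting the feature dimension by a factor of four.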
Image.png
Fully connected layer
A fully connected layer is a matrix multiplication, equivalent to a feature-space transformation that can extract and aggregate useful information. Fully connected layers generally sit at the end of a convolutional network: the preceding convolution layers, pooling layers, and activation functions map the raw data into a hidden feature space, and the fully connected layer maps the learned "distributed feature representation" into the sample label space. Put differently, its main purpose is a dimension transformation from high-dimensional data (the distributed feature representation) to low-dimensional data (the sample labels); useful information is retained in the process, but the positional information of the features is lost. In addition, a fully connected layer can be replaced with a convolution whose kernel size is HxW (reducing to a 1x1 convolution when the input is 1x1), where H and W are the height and width of the preceding layer's convolution output.
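A sketch of the equivalence just described: a fully connected layer over an HxW feature map computes the same numbers as HxW-sized kernels applied to the whole map (single channel; the sizes are made up):

```python
import numpy as np

# A fully connected layer is a matrix multiplication: it flattens the
# HxW feature map and mixes all positions, so location is lost.
H, W, classes = 4, 4, 3
feature_map = np.random.rand(H, W)
weights = np.random.rand(classes, H * W)  # one row per output unit

fc_out = weights @ feature_map.reshape(-1)

# The same result as an HxW "convolution": each output unit is the
# element-wise product of an HxW kernel with the whole map, summed.
conv_out = np.array([np.sum(weights[c].reshape(H, W) * feature_map)
                     for c in range(classes)])
print(np.allclose(fc_out, conv_out))  # True
```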
Because of the large number of parameters a fully connected layer introduces, it has recently been found that removing it has no obvious effect on results: global average pooling (GAP) is used instead of the FC layer to fuse the deep features, followed by softmax or another loss function as the network's objective to guide learning. This approach has achieved very good prediction results on ResNet and GoogLeNet.
On the other hand, recent studies by Wei Xiu-Shen (see the references) found that FC layers can act as a "firewall" when transferring a model's representation capability.
Classic convolutional network models
In the development of convolutional neural networks there have been many milestone models, such as AlexNet, VGG, and GoogLeNet.
AlexNet
AlexNet is composed of 7 hidden layers: layers 1-5 are convolutional and layers 6-7 are fully connected.
The diagram below is Alex's CNN architecture. Note that the model adopts a 2-GPU parallel structure: the 1st, 2nd, 4th, and 5th convolution layers are each split into 2 parts whose parameters are trained on separate GPUs. Going a step further, parallel structures divide into data parallelism and model parallelism. In data parallelism, the model structure is identical on each GPU but the training data is partitioned; the separately trained models are then fused. In model parallelism, the parameters of certain layers are partitioned, the same data is trained on different GPUs, and the results are concatenated directly as the input of the next layer.
Image.png
The basic parameters of the model above are:
Input: 224x224 images, 3 channels.
First convolution layer: 96 convolution kernels of size 5x5, 48 on each GPU.
First max-pooling layer: 2x2 kernel.
Second convolution layer: 256 convolution kernels of size 3x3, 128 on each GPU.
Second max-pooling layer: 2x2 kernel.
Third convolution layer: fully connected to the previous layer, 384 convolution kernels of size 3x3, split across the two GPUs with 192 each.
Fourth convolution layer: 384 convolution kernels of size 3x3, 192 on each GPU. This layer connects to the previous one without an intervening pooling layer.
Fifth convolution layer: 256 convolution kernels of size 3x3, 128 on each GPU.
Fifth max-pooling layer: 2x2 kernel.
First fully connected layer: 4096 dimensions; the output of the fifth max-pooling layer is flattened into a one-dimensional vector as this layer's input.
Second fully connected layer: 4096 dimensions.
Softmax layer: the output is 1000-dimensional; each dimension is the probability that the picture belongs to that category.
VGG
VGG's greatest contribution is demonstrating the potential of small convolution kernels + small pooling layers + deep networks. By repeatedly stacking 3x3 convolution kernels and 2x2 pooling layers, VGG extends the depth of the convolutional neural network to 19 layers. More importantly, VGG generalizes very well: its features transfer well to other image data.
The VGGNet structure is shown in the figure below. Starting from the 11-layer convolutional network of scheme A, the authors gradually increase the number of convolution layers up to the 19 layers of scheme E.
Image.png
As the figure shows, VGGNet has 5 convolutional stages, each containing 1-3 convolution layers with 3x3 kernels. The number of kernels per stage grows with depth: 64-128-256-512-512. As the number of layers increases, small kernels become stacked on top of one another, and the stacking effectively enlarges the receptive field. For example, 2 stacked layers of 3x3 kernels are equivalent to a single 5x5 kernel, but with fewer parameters.
Image.png
In addition, a ReLU activation function is applied after every convolution, which also increases the nonlinear expressive power of the model.
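The receptive-field arithmetic for stacked small kernels can be checked directly (assuming stride 1 and no dilation):

```python
# Receptive field of stacked conv layers (stride 1, no dilation):
# each extra kxk layer grows the field by k - 1.
def receptive_field(kernel_sizes):
    r = 1
    for k in kernel_sizes:
        r += k - 1
    return r

print(receptive_field([3, 3]))     # 5 -> two 3x3 layers see like one 5x5
print(receptive_field([3, 3, 3]))  # 7 -> three 3x3 layers see like one 7x7

# Parameter comparison (single channel): two 3x3 kernels vs one 5x5.
print(2 * 3 * 3, 5 * 5)  # 18 25
```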
To be continued ...
Finally, my thanks to the experts for their wonderful explanations of convolution. My own level is limited; if there are oversights in the wording, or places where my understanding is incomplete or incorrect, please point them out. Thank you!
Reference:
[Convolution explained]
https://www.zhihu.com/question/54677157/answer/141316355
https://www.zhihu.com/question/22298352/answer/34267457
Convolutional neural network models: http://www.36dsj.com/archives/24006
Fully connected layer: https://www.zhihu.com/question/41037974/answer/150522307
Deep Learning: https://github.com/exacity/deeplearningbook-chinese