This article was first published at: HTTPS://JIZHI.IM/BLOG/POST/INTUITIVE_EXPLANATION_CNN
What is a convolutional neural network, and why is it important?
Convolutional neural networks (ConvNets or CNNs) are a class of neural networks that have proved particularly effective in image recognition and classification. Convolutional networks have successfully identified faces, objects, and traffic signs, and they provide vision for machines such as robots and drones.
Figure 1
In Figure 1 above, a convolutional network recognizes the scene, and the system automatically suggests related tags such as "bridge", "railroad", and "tennis". Figure 2 shows a convolutional network recognizing everyday objects, people, and animals. Recently, convolutional networks have also proved effective in natural language processing tasks such as sentence classification.
Figure 2
Convolutional networks are an important tool in most of today's machine learning applications. However, understanding them and learning to use them for the first time can be an unfriendly experience. The main purpose of this article is to help readers understand how convolutional neural networks work on images.
If you are completely unfamiliar with neural networks, it is recommended that you first read "9 lines of Python code to build a neural network" to master some basic concepts. In this article, multilayer perceptrons (MLPs) are also referred to as fully connected layers.
The LeNet architecture (1990s)
LeNet was one of the very first convolutional neural networks in the field of deep learning. Yann LeCun named this pioneering work LeNet5 after a series of successful iterations since 1988. At that time the LeNet architecture was used primarily for character recognition tasks such as reading postal codes.
Below we will build an intuition for how LeNet learns to recognize images. Many new architectures have been built on LeNet in recent years, but they all share its basic concepts, so understanding LeNet makes the rest much easier to learn.
Figure 3: A simple ConvNet
The convolutional neural network in Figure 3 is very close to the original LeNet architecture. It classifies pictures into four categories: dog, cat, boat, and bird (the original LeNet was used mainly for character recognition). As shown in the figure above, when a picture of a boat is given as input, the network correctly assigns the highest probability (0.94) to the boat category. The probabilities in the output layer should sum to 1.
The convolutional neural network in Figure 3 performs four main operations:
1. Convolution
2. Non-linearity (ReLU)
3. Pooling, or down-sampling
4. Classification (fully connected layer)
These operations are the cornerstones of every convolutional neural network, so understanding how they work is critical to understanding the network as a whole. Next we will try to understand each operation as intuitively as possible.
A picture is a matrix of pixel values
Essentially, every picture can be represented as a matrix of pixel values.
Figure 4: Pixel value matrix
Channel is a conventional term for a particular component of a picture. A photo from an ordinary digital camera has three channels (red, green, and blue), which can be pictured as three 2D matrices stacked together, one per color, with every value between 0 and 255.
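To make the idea of stacked channels concrete, here is a minimal sketch using NumPy. The 2x2 image and its pixel values are made up for illustration and do not come from any figure in this article.

```python
import numpy as np

# A hypothetical 2x2 colour image stored as height x width x channels.
# The pixel values are invented; each one lies in the range 0-255.
rgb = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

# Selecting one channel gives an ordinary 2D matrix of pixel values.
red = rgb[:, :, 0]

print(rgb.shape)  # (2, 2, 3): three 2D matrices stacked together
print(red.shape)  # (2, 2)
print(red)
```

A single-channel picture is simply one such 2D matrix on its own.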
A grayscale image, on the other hand, has only a single channel. For simplicity, this article considers only grayscale images, each of which is a single 2D matrix. Each pixel value in the matrix lies between 0 and 255, where 0 is black and 255 is white.
Convolution
Convolutional networks are named after the "convolution" operation. The fundamental purpose of convolution is to extract features from the input image. Convolution learns image features from small squares of input data, which preserves the spatial relationship between pixels. We will not delve into the mathematics of convolution here, but will try to understand how it works on images.
As mentioned above, every picture is a matrix of pixel values. Consider a 5x5 image whose pixel values are only 0 and 1 (the green matrix below is a special case of a grayscale image, whose pixel values normally range from 0 to 255). Also consider the following 3x3 matrix:
The convolution of the 5x5 image with the 3x3 matrix is shown by the animation in the following figure:
Figure 5: The convolution operation. The output matrix is called the convolved feature or feature map
Consider how this computation is done. We slide the orange matrix over the original image (green) one pixel at a time (the step size is called the "stride"). At each position we multiply the corresponding elements of the two matrices and add the products up to get a single integer, which becomes one element of the output matrix (pink). Note that at each step the 3x3 matrix "sees" only part of the input image.
The 3x3 matrix is called a "filter", "kernel", or "feature detector", and the matrix obtained by sliding the filter over the original image is called the "convolved feature", "activation map", or "feature map". The key point is that a filter acts as a feature detector on the original input image.
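The sliding computation described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the code behind the figures. The 5x5 image and 3x3 filter values are assumptions chosen for the example, since the original matrices appear only as images. Like most deep learning libraries, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = image[r:r + kh, c:c + kw]
            # Multiply matching elements and sum them into one output value.
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# Assumed example values: a 5x5 image of 0s and 1s and a 3x3 filter.
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

With a 5x5 input, a 3x3 filter, and a stride of 1, the filter fits in three positions along each axis, which is why the feature map is 3x3.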
For the same picture, different filters produce different feature maps. For example, consider the following input image:
The following table shows the effect of various convolution kernels on the above image. Just by changing the values of the filter, we can achieve effects such as edge detection, sharpening, and blurring. In other words, different filters detect different features in an image, such as edges and curves.
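As a rough illustration of how the filter values alone decide which feature is detected, the sketch below applies two hand-written filters to the same tiny image. The image and both filters are assumptions made for this example (a simple vertical-edge filter and a 3x3 averaging blur); they are not the kernels from the table.

```python
import numpy as np

def convolve2d(image, kernel):
    """Stride-1 sliding-window filter without padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    return np.array([
        [np.sum(image[i:i + kh, j:j + kw] * kernel) for j in range(out_w)]
        for i in range(out_h)
    ])

# Made-up 5x5 image: dark (0) on the left, bright (10) on the right.
image = np.array([[0, 0, 10, 10, 10]] * 5, dtype=float)

# Assumed filters: one responds to vertical edges, one averages its window.
vertical_edge = np.array([[-1, 0, 1]] * 3, dtype=float)
box_blur = np.full((3, 3), 1 / 9)

edges = convolve2d(image, vertical_edge)
blurred = convolve2d(image, box_blur)

print(edges)    # large values where dark meets bright, 0 in the flat bright area
print(blurred)  # the sharp step is smoothed into a gradual ramp
```

The same input yields two very different feature maps, which is the sense in which each filter acts as a detector for one kind of feature.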
Another good way to understand the convolution operation is to look at the animation in Figure 6:
Figure 6: The convolution operation
A filter (red box) slides over the picture (the convolution operation) to produce a feature map. Convolving the same picture with another filter (green box) produces a different feature map. It is important to note that the convolution operation captures local dependencies in the original image. Also notice how the two filters produce different feature maps. In fact, both the picture and the two filters are just matrices of numbers, as we saw above.
In practice, a convolutional neural network learns the values of its filters during training, although we still need to specify some parameters before training: the number of filters, the size of each filter, the network architecture, and so on. The more filters we use, the more features are extracted from the image, and the stronger the network's pattern recognition becomes.
The size of the feature map is controlled by three parameters, which we need to set before the convolution step:
Depth: the depth is the number of filters used in the convolution operation. As shown in Figure 7, we apply three different filters to the original boat picture and obtain three feature maps. You can think of these three feature maps as stacked 2D matrices, so the "depth" of the feature map here is 3.
Figure 7
Stride: the stride is the number of pixels the filter moves at each step. With a stride of 1, the filter slides one pixel at a time; with a stride of 2, it jumps 2 pixels at a time. The larger the stride, the smaller the feature map.
Zero-padding: sometimes it is convenient to pad the border of the input matrix with zeros so that the filter can also be applied to the edge pixels of the image. A benefit of zero-padding is that it lets us control the size of the feature map. Convolution with zero-padding is also called wide convolution, and convolution without it is called narrow convolution.
Non-linearity
As shown in Figure 3, every convolution operation is followed by an additional operation called ReLU. ReLU stands for rectified linear unit; it is a non-linear operation whose output is max(0, input):
Figure 8: ReLU
ReLU is applied element-wise (to each pixel) and replaces every negative value in the feature map with 0. Its purpose is to introduce non-linearity into the convolutional network, because most real-world problems the network needs to learn are non-linear. Convolution itself is a linear operation (element-wise matrix multiplication and addition), so an additional operation is needed to introduce non-linearity.
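A minimal sketch of the ReLU step in NumPy follows. The feature map values are invented for the example; in a real network they would come from the preceding convolution.

```python
import numpy as np

def relu(feature_map):
    """Replace every negative value with 0 and leave the rest unchanged."""
    return np.maximum(feature_map, 0)

# Made-up feature map containing both negative and positive values.
feature_map = np.array([
    [3, -1, 0],
    [-5, 2, -2],
    [1, -4, 6],
])

rectified = relu(feature_map)
print(rectified)
# [[3 0 0]
#  [0 2 0]
#  [1 0 6]]
```

The shape of the feature map is unchanged; only the negative entries are affected.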
Figure 9 makes this clear: it shows ReLU applied to one of the feature maps obtained in Figure 6. The resulting feature map is also called the "rectified" feature map. (The black areas have been turned gray.)
Figure 9: ReLU
Other non-linear functions such as tanh or sigmoid can be used in place of ReLU, but ReLU performs better in most cases.
Pooling
Spatial pooling (also called subsampling or downsampling) reduces the dimensions of each feature map while retaining the most important information. Spatial pooling can take several forms: max, average, sum, and so on.
Taking max pooling as an example, we define a spatial neighborhood (a 2x2 window) and take the largest element of the rectified feature map within that window. Instead of taking the largest element, we could also take the average (average pooling) or the sum of all elements in the window. In practice, max pooling has been shown to work best.
Figure 10 shows max pooling applied to a rectified feature map (obtained after convolution + ReLU) using a 2x2 window.
Figure 10: Max pooling
We slide the 2x2 window with a stride of 2 and take the maximum value in each region. Figure 10 also shows that pooling reduces the dimensions of the feature map.
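The max pooling step with a 2x2 window and a stride of 2 can be sketched as follows. The 4x4 rectified feature map is a made-up example and its values are not taken from Figure 10.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Take the maximum of each size x size window, moving `stride` pixels at a time."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = feature_map[r:r + size, c:c + size].max()
    return pooled

# Made-up 4x4 rectified feature map (non-negative values only).
rectified = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

pooled = max_pool2d(rectified)
print(pooled)
# [[6 8]
#  [3 4]]
```

Each 2x2 block of the input contributes exactly one value, so the 4x4 map shrinks to 2x2. Replacing `.max()` with `.mean()` or `.sum()` would give average or sum pooling instead.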
In the network shown in Figure 11, the pooling operation is applied separately to each feature map (note that three input maps therefore give three output maps).
Figure 11: Applying pooling to the rectified feature maps
Figure 12 shows the effect of pooling on the rectified feature map obtained in Figure 9.
Figure 12: Pooling
The function of pooling is to progressively reduce the spatial size of the input representation. In particular, pooling:
- makes the input representation (feature dimension) smaller and easier to handle;
- reduces the number of parameters and computations in the network, which helps control overfitting;
- makes the network more robust to small transformations, distortions, and translations of the input image (a small distortion in the input does not change the pooled output, because we take the maximum or average over a local neighborhood);
- helps us obtain a representation of the image that is almost invariant to scale. This is very useful because we can then detect an object in the image wherever it is located.
So far:
Figure 13
So far we have seen how convolution, ReLU, and pooling work; these layers are the basic building blocks of every convolutional neural network. As shown in Figure 13, we have two sets of convolution + ReLU + pooling layers. The second convolution layer applies six filters to the output of the first set, producing six feature maps. ReLU is then applied to each of the six feature maps individually, and max pooling is applied to each of the resulting rectified feature maps.
Together, these layers extract useful features from the image, introduce non-linearity into the network, and reduce the feature dimensions, while making the features reasonably invariant to scale and translation.
The output of the second pooling layer serves as the input to the fully connected layer, which we explore in the next section.
Fully connected layer
The fully connected layer is a traditional multilayer perceptron that uses the softmax activation function in its output layer (other classifiers, such as support vector machines, could be used instead). "Fully connected" means that every neuron in one layer is connected to every neuron in the next layer.
The outputs of the convolution and pooling layers represent high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image into the classes defined by the training set. For example, the image classification task in Figure 14 has four possible output categories. (Note that Figure 14 does not show all of the neuron nodes.)
Figure 14: Fully connected layer; each node is connected to every node in the adjacent layer
In addition to classification, adding a fully connected layer is an effective way to learn non-linear combinations of the features. The features extracted by the convolution and pooling layers are already good for classification, but combinations of those features can work even better.
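A fully connected layer with a softmax output, as described above, can be sketched with a single matrix multiplication. Everything numeric here is an assumption for illustration: the 8 input features stand in for a flattened pooling output, the weights are random rather than trained, and the 4 outputs correspond to the four categories of the running example.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real-valued scores into values in (0, 1) that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

rng = np.random.default_rng(0)

features = rng.normal(size=8)      # hypothetical flattened pooling output
weights = rng.normal(size=(4, 8))  # one weight per (output, input) pair
bias = np.zeros(4)

# "Fully connected": every output score depends on every input feature.
scores = weights @ features + bias
probabilities = softmax(scores)

print(probabilities)
print(probabilities.sum())
```

Because the weights are random, the individual probabilities are meaningless here; the point is only the shape of the computation and the normalization applied by softmax.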
The output probabilities of the fully connected layer sum to 1, which is guaranteed by using softmax as the activation function of the output layer. The softmax function takes a vector of arbitrary real values and squashes it into a vector of values between 0 and 1 that sum to 1.
Putting it together: training with backpropagation
To summarize, the convolution and pooling layers act as feature extractors, and the fully connected layer acts as a classifier.
Note that since the input picture is a boat, the target probability is 1 for the boat class and 0 for the other three classes:
Input image = boat
Target vector = [0, 0, 1, 0]
Figure 15: Training the convolutional neural network
The training process for a convolutional network can be summarized as follows:
Step 1: Initialize all filters and parameters/weights with random values.
Step 2: The network takes a training picture as input, performs the forward propagation step (convolution, ReLU, and pooling, followed by forward propagation through the fully connected layer), and computes the output probability for each class. Suppose the output probabilities for the boat picture are [0.2, 0.4, 0.1, 0.3]. Because the weights are random for the first training sample, the output probabilities are also essentially random.
Step 3: Calculate the total error at the output layer (summed over all 4 classes):
$$\text{Total error} = \sum \frac{1}{2} (\text{target probability} - \text{output probability})^2$$
Step 4: Use backpropagation to compute the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values, weights, and parameter values so as to minimize the output error. Each weight is adjusted in proportion to its contribution to the total error. When the same image is input again, the output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0]. This means the network has learned to classify this particular image correctly by adjusting its weights/filters so that the output error is reduced. Parameters such as the number of filters, the filter sizes, and the network architecture are fixed before Step 1 and do not change during training; only the values of the filter matrices and the connection weights are updated.
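To see that the update in Step 4 really lowers the error defined in Step 3, the short sketch below evaluates the total error for the two probability vectors quoted in the steps. The function name `total_error` is chosen for this example; the numbers are the ones given in the text.

```python
import numpy as np

def total_error(target, output):
    """Half the sum of squared differences between target and output probabilities."""
    return 0.5 * np.sum((target - output) ** 2)

target = np.array([0.0, 0.0, 1.0, 0.0])         # the input image is a boat
before_update = np.array([0.2, 0.4, 0.1, 0.3])  # output with random initial weights
after_update = np.array([0.1, 0.1, 0.7, 0.1])   # output after the weights were adjusted

error_before = total_error(target, before_update)
error_after = total_error(target, after_update)

print(round(float(error_before), 2))  # 0.55
print(round(float(error_after), 2))   # 0.06
```

The error falls from about 0.55 to about 0.06, matching the statement that the second output is closer to the target vector.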
The above steps train the convolutional network, which essentially means that all weights and parameters are optimized so that the network correctly classifies the pictures in the training set.
When a new (previously unseen) picture is fed into the convolutional network, the network performs the forward propagation step and outputs a probability for each class (for a new image, the output probabilities are computed with the weights learned during training). If the training set is large enough, the network can be expected to generalize well and classify new images correctly.
Note 1: The steps above have been greatly simplified and the mathematical details omitted, in order to make the training process more intuitive.
Note 2: In the example above we used two sets of convolution + pooling layers, but these operations can be repeated any number of times in a single convolutional network. Some of the best-performing convolutional networks today have tens of convolution and pooling layers. Also, not every convolution layer has to be followed by a pooling layer. As the figure shows, we can have several convolution + ReLU layers in a row before a pooling layer.
Visualizing convolutional neural networks
In general, the more convolution layers a network has, the more complex the features it can learn. For example, in image classification, the first layer of a convolutional neural network may learn to detect edges from raw pixels, the second layer may use those edges to detect simple shapes, and later layers may use the shapes to detect higher-level features such as the shapes of faces, as shown in Figure 17. These features were learned by a convolutional deep belief network. This is only a simple illustration; in practice, convolution filters may detect features that have no obvious meaning to a human.
Figure 17: Features learned by a convolutional deep belief network
Adam Harley has created a stunning visualization of a convolutional neural network trained on the MNIST database of handwritten digits. I strongly recommend playing with it to understand the details of how a convolutional neural network works.
Below we will see how the network recognizes the input digit "8". Note that the ReLU operation is not shown separately in Figure 18.
Figure 18: Visualization of a convolutional neural network
The input image has 1024 pixels (a 32x32 picture), and the first convolution layer (convolution layer 1) applies six different 5x5 filters (stride = 1). As the figure shows, the six filters produce a feature map of depth 6.
Convolution layer 1 is followed by pooling layer 1, which applies 2x2 max pooling (stride = 2) to each of the six feature maps separately. On the interactive web page, you can hover the mouse pointer over any pixel in the pooling layer to see the 2x2 grid it corresponds to in the preceding convolution layer (Figure 19). It is easy to see that the brightest pixel (the maximum) of each 2x2 grid is the one that makes it into the pooling layer.
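The feature-map sizes implied by these settings can be checked with the usual output-size formula, (input + 2 x padding - window) / stride + 1. The article does not state this formula, so treat the helper below as an assumption brought in for illustration; `output_size` is a name chosen for this sketch.

```python
def output_size(input_size, window_size, stride=1, padding=0):
    """Side length of the output for a square input and a square sliding window."""
    return (input_size + 2 * padding - window_size) // stride + 1

# Convolution layer 1: 32x32 input, 5x5 filters, stride 1, no padding.
conv1 = output_size(32, 5, stride=1)
# Pooling layer 1: 2x2 window, stride 2.
pool1 = output_size(conv1, 2, stride=2)
print(conv1, pool1)  # 28 14

# Effect of the other two parameters on the same 32x32 input:
print(output_size(32, 5, stride=2))             # 14: a larger stride shrinks the map
print(output_size(32, 5, stride=1, padding=2))  # 32: zero-padding preserves the size
```

Under these assumptions, convolution layer 1 produces six 28x28 feature maps (depth 6, one per filter), and pooling layer 1 reduces each of them to 14x14.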
Figure 19: Visualizing the pooling operation
Next come three fully connected (FC) layers:
FC 1: 120 neurons
FC 2: 100 neurons
FC 3: 10 neurons, one for each of the 10 digits; this is also the output layer
In Figure 20, each of the 10 nodes in the output layer is connected to all 100 nodes of the second fully connected layer (hence the name "fully connected").
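Because every node connects to every node in the previous layer, the number of weights grows with the product of the two layer sizes. The sketch below counts them for the layer sizes given above. It assumes one bias per neuron, which the article does not mention, and it skips FC 1 because the size of its input is not stated here.

```python
def dense_parameters(inputs, outputs, use_bias=True):
    """One weight per connection, plus (by assumption) one bias per output neuron."""
    return inputs * outputs + (outputs if use_bias else 0)

fc2 = dense_parameters(120, 100)  # FC 1 (120 neurons) -> FC 2 (100 neurons)
fc3 = dense_parameters(100, 10)   # FC 2 (100 neurons) -> FC 3 (10 neurons)

print(fc2)  # 12100
print(fc3)  # 1010
print(dense_parameters(100, 10, use_bias=False))  # 1000 connections into the output layer
```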
Note that the only bright node in the output layer corresponds to "8". This indicates that the network correctly identifies the handwritten digit (a brighter node indicates a higher output probability, so here 8 has the highest probability of all the digits).
Figure 20: Visualization of the fully connected layers
A 3D version of the visualization is also available here.
Other convolutional network architectures
Convolutional neural networks have been around since the early 1990s. We have already met the earliest of them, LeNet; some other highly influential architectures are listed below:
1990s to 2012: From the 1990s to the early 2010s, convolutional neural networks were in an incubation period. As data volumes and computing power grew, the problems that convolutional neural networks could tackle became more and more interesting.
AlexNet (2012): In 2012, Alex Krizhevsky released AlexNet, a deeper and wider version of LeNet, which won that year's ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a large margin. This was a very important breakthrough, and today's widespread applications of convolutional neural networks owe much to it.
ZF Net (2013): The ILSVRC 2013 winner was a convolutional network from Matthew Zeiler and Rob Fergus, known as ZF Net. It improved on AlexNet by tuning the architecture's hyperparameters.
GoogLeNet (2014): The ILSVRC 2014 winner came from Szegedy et al. at Google. Its main contribution was the Inception module, which significantly reduced the number of parameters in the network (4 million, compared with AlexNet's 60 million).
VGGNet (2014): The ILSVRC 2014 runner-up was VGGNet. Its outstanding contribution was to show that the depth of the network (the number of layers) is a key factor in good performance.
ResNet (2015): The residual network developed by Kaiming He won ILSVRC 2015. ResNets represent the state of the art in convolutional neural networks and are the default choice for practical use (as of May 2016).
DenseNet (August 2016): Published by Gao Huang, the Densely Connected Convolutional Network connects each layer directly to all the layers before it. DenseNet has shown marked improvements on five challenging object recognition benchmarks.
(translation partially completed)
After this article and the earlier articles in this series, you should have mastered the fundamentals of convolutional neural networks. Next we will try building a LeNet ourselves using Keras, a popular Python deep learning library.
Practice is the only standard for neural networks
Review the LeNet architecture and fill in the second set of convolution + activation + pooling layers to build a classic LeNet network. Hints:
- The second convolution layer has 16 filters, each 8x8 in size.
- The second activation layer uses the tanh function.
- The second max pooling layer has a size and a stride of 3x3.
Please complete the code in our Python development environment at HTTPS://JIZHI.IM/BLOG/POST/INTUITIVE_EXPLANATION_CNN, then click the blue button to run the check and see whether your answer is correct.
# Import the required modules (Keras 1.x API, as used in the original listing)
from keras.models import Sequential
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.core import Activation, Flatten, Dense

# Build the convolutional network
model = Sequential()

# First set: convolution + activation + pooling
model.add(Convolution2D(nb_filter=6, nb_row=5, nb_col=5, input_shape=(32, 32, 1)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

# Second set: convolution + activation + pooling
# (complete the code here)

# Fully connected layers
# The unit counts of the Dense layers are missing from the source listing
# and must be filled in before the code will run.
model.add(Flatten())
model.add(Dense(...))
model.add(Dense(...))
model.add(Dense(...))
model.add(Activation('softmax'))