Transferred from: http://dataunion.org/11692.html
Zhang Yushi
Since July this year, I have been responsible for convolutional neural network (CNN) work in our laboratory, during which I configured and used Theano, cuda-convnet, and cuda-convnet2. I am writing this post to deepen my own understanding and use of CNN, and to exchange ideas with others for mutual benefit. Before the main text, a few personal impressions of CNN. First, to be clear: deep learning is the general term for a whole family of deep learning algorithms, and CNN is one application of deep learning algorithms to the field of image processing.
First, before studying deep learning and CNN, I always thought they were some profound body of knowledge that could solve a great many problems. Only after studying them did I realize that they are similar to other machine learning algorithms such as SVM: you can still think of a CNN as a classifier, and you can still use it as you would a black box.
Second, the real power of deep learning is that the output of one layer of the network can be used as another representation of the data, which can be regarded as features learned by the network. Based on these features, further tasks such as similarity comparison can be carried out.
Third, the key to making deep learning algorithms effective is large-scale data. The reason is that every deep learning network has a huge number of parameters, and a small amount of data cannot train them sufficiently.
With that said, let us get straight to the topic and begin the tour of CNN.
Introduction to Convolutional Neural Networks (CNN)
Convolutional neural networks are an efficient recognition method developed in recent years that has attracted wide attention. In the 1960s, while studying neurons responsible for local sensitivity and direction selection in the cat visual cortex, Hubel and Wiesel found that their unique network structure could reduce the complexity of feedback neural networks, which later inspired the convolutional neural network (CNN). CNN has since become a research hotspot in many scientific fields, especially pattern classification: because the network avoids complicated pre-processing of the image and can take the original image directly as input, it has found wide application. The Neocognitron proposed by K. Fukushima in 1980 was the first implemented network of the convolutional kind. Later, more researchers improved the network. Among the representative results is the "improved cognition machine" proposed by Alexander and Taylor, which combines the advantages of various improvements and avoids time-consuming error back-propagation.
In general, the basic structure of a CNN consists of two kinds of layers. One is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, from which local features are extracted. Once a local feature has been extracted, its positional relationship to other features is also fixed. The other is the feature mapping layer: each computational layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons on the plane share equal weights. The feature mapping structure uses a sigmoid function with a small influence kernel as the activation function of the convolutional network, giving the feature maps shift invariance. In addition, since the neurons on one mapping plane share weights, the number of free parameters of the network is reduced. Each convolutional layer in a convolutional neural network is followed by a computational layer used for local averaging and secondary feature extraction; this distinctive two-stage feature extraction structure reduces the feature resolution.
CNN is mainly used to recognize two-dimensional patterns that are invariant to shift, scaling, and other forms of distortion. Since CNN's feature detection layers learn from training data, explicit feature extraction is avoided when using CNN: the features are learned implicitly from the training data. Moreover, because the neurons on a single feature map share the same weights, the network can learn in parallel; this is a major advantage of convolutional networks over networks in which neurons are fully connected to one another. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing. Its layout is closer to that of a real biological neural network; weight sharing reduces the complexity of the network; and in particular, multidimensional input such as an image can be fed directly into the network, avoiding the complexity of data reconstruction during feature extraction and classification.
1. Neural networks
Let us first introduce neural networks; for details of this step, refer to Resource 1. A brief introduction follows. Each unit of a neural network looks like the following:
Its corresponding formula is as follows:
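(The original formula image was lost in transfer; assuming the standard three-input unit drawn in such figures, as in the UFLDL notes this section appears to follow, with weights W, bias b, and sigmoid activation f, it reads:)

$$ h_{W,b}(x) = f\big(W^{T}x + b\big) = f\Big(\sum_{i=1}^{3} W_{i} x_{i} + b\Big), \qquad f(z) = \frac{1}{1+e^{-z}} $$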
This unit can also be called a logistic regression model. When multiple units are combined into a hierarchical structure, they form a neural network model. The figure below shows a neural network with one hidden layer.
Its corresponding formula is as follows:
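(This formula was likewise lost; assuming the standard three-input, three-hidden-unit, one-output network usually drawn for this figure, the forward pass is:)

$$
\begin{aligned}
a_1^{(2)} &= f\big(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}\big) \\
a_2^{(2)} &= f\big(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}\big) \\
a_3^{(2)} &= f\big(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}\big) \\
h_{W,b}(x) &= a_1^{(3)} = f\big(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big)
\end{aligned}
$$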
Similarly, the network can be extended to 2, 3, 4, 5, or more hidden layers.
Neural networks are trained much like logistic regression, but because of the multiple layers, differentiating with respect to the hidden-layer nodes requires the chain rule; that is, gradient descent plus the chain rule, known professionally as back-propagation. This article does not cover the training algorithm.
2 Convolutional neural networks
In image processing, an image is often represented as a vector of pixels. For example, a 1000x1000 image can be represented as a vector of 1,000,000 elements. In the neural network of the previous section, if the number of hidden units equals the number of inputs, also 1,000,000, then the input-to-hidden parameters number 1000000x1000000 = 10^12. This is far too many; such a network is basically untrainable. So before practicing the great art of image processing with neural networks, we must first reduce the number of parameters to speed things up. It is like the Evil-Warding Sword Manual of wuxia fiction: ordinary practitioners make frustrating progress, but after the drastic sacrifice it demands, inner strength surges, the swordplay quickens, and the swordsman becomes formidable.
2.1 Local perception
Convolutional neural networks have two kinds of magic weapons for reducing the number of parameters. The first is the local receptive field. It is generally believed that human perception of the outside world proceeds from local to global, and the spatial relationships in an image are likewise stronger between nearby pixels and weaker between distant ones. Therefore each neuron does not actually need to perceive the whole image; it only needs to perceive a local region, and the local information is then combined at higher levels to obtain global information. This idea of partial connectivity is also inspired by the structure of the biological visual system: neurons in the visual cortex receive information locally (that is, they respond only to stimuli in certain regions). As shown below, the left image is a full connection and the right image is a local connection.
In the right image above, if each neuron is connected to only a 10x10 patch of pixels, then the weights number 1000000x100 = 10^8, one ten-thousandth of the original. And the 10x10 weights corresponding to that 10x10 patch of pixels are in fact equivalent to a convolution operation.
2.2 Parameter sharing
But this is actually still too many parameters, so we bring out the second magic weapon: weight sharing. In the local connection above, each neuron has 100 parameters, and there are 1,000,000 neurons in total. If the 100 parameters of all 1,000,000 neurons are made equal, the number of parameters drops to just 100.
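As a sanity check on the three counts discussed in this and the previous subsection, here is a minimal Python sketch; the sizes are the ones used in the text, not a realistic design:

```python
# Sanity check of the parameter counts above (sizes from the text).
n_pixels = 1000 * 1000               # 1000x1000 input image

full   = n_pixels * n_pixels         # fully connected: 10^12 weights
local  = n_pixels * 10 * 10          # 10x10 local receptive fields: 10^8
shared = 10 * 10                     # shared weights: one 10x10 kernel

print(full, local, shared)           # 1000000000000 100000000 100
```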
How should we understand weight sharing? We can regard these 100 parameters (that is, the convolution operation) as a way of extracting features that is independent of position. The underlying principle is that the statistics of one part of the image are the same as those of any other part. This means that features learned in one part can also be used in another part, so the same learned features can be applied at every location in the image.
More intuitively, when a small patch, say 8x8, is sampled at random from a large image and some features are learned from that small sample, we can apply the features learned from this 8x8 sample as a detector anywhere in the image. In particular, we can convolve the original large image with the features learned from the 8x8 sample, obtaining an activation value of each feature at every position of the large image.
The figure below shows a 3x3 convolution kernel being convolved over a 5x5 image. Each convolution is a feature extraction method that, like a sieve, filters out the parts of the image that satisfy its condition (the larger the activation value, the better the condition is satisfied).
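Since the animation is not reproduced here, the following is a minimal numpy sketch of the same "valid" convolution (following the cross-correlation convention common in CNNs, i.e. without kernel flipping):

```python
import numpy as np

# A minimal "valid" convolution of a 3x3 kernel over a 5x5 image.
def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # each output value sums an element-wise product of the kernel
            # with one image patch -- the "sieve" described above
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0            # a simple averaging kernel
print(conv2d_valid(image, kernel).shape)  # (3, 3)
```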
2.3 Multiple convolution kernels
The 100 parameters described above correspond to only one 10x10 convolution kernel, and obviously one kernel does not extract enough features. We can add multiple convolution kernels, say 32 kernels, to learn 32 features. The case of multiple convolution kernels is shown below:
In the figure on the right, different colors indicate different convolution kernels. Each kernel turns the image into another image. For example, two convolution kernels generate two images, which can be viewed as two different channels of one image. Note that the figure contains a small error: w1 should read w0, and w2 should read w1. They are still referred to as w1 and w2 below.
The figure below shows a convolution operation on four channels with two convolution kernels, generating two channels. Note that each of the four channels corresponds to its own kernel slice. Ignoring w2 for now and looking only at w1: the value at position (i,j) of w1's output is obtained by summing the convolution results at (i,j) on all four channels and then applying the activation function.
Therefore, in the process of convolving 4 channels down to 2 channels, the number of parameters is 4x2x2x2, where 4 is the number of input channels, the first 2 is the number of output channels generated, and the final 2x2 is the convolution kernel size.
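A small numpy sketch may make this bookkeeping concrete; the 5x5 spatial size and the tanh activation are illustrative assumptions, not taken from the original figure:

```python
import numpy as np

# Sketch of the 4-channel -> 2-channel convolution described above:
# one 2x2 kernel slice per (input channel, output channel) pair,
# summed across input channels, then an activation.
in_ch, out_ch, k = 4, 2, 2
x = np.random.randn(in_ch, 5, 5)            # 4 input channels
w = np.random.randn(out_ch, in_ch, k, k)    # 4*2*2*2 = 32 parameters
oh = ow = 5 - k + 1

y = np.zeros((out_ch, oh, ow))
for o in range(out_ch):                     # w1, w2 in the text
    for i in range(oh):
        for j in range(ow):
            # sum the per-channel convolution results at (i, j) ...
            s = np.sum(x[:, i:i+k, j:j+k] * w[o])
            y[o, i, j] = np.tanh(s)         # ... then apply the activation
print(w.size, y.shape)                      # 32 (2, 4, 4)
```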
2.4 Pooling (down-sampling)
After features have been obtained by convolution, the next step is to use these features for classification. In theory, one could train a classifier, such as a softmax classifier, on all the extracted features, but this poses a computational challenge. For example, for a 96x96 pixel image, suppose we have learned 400 features, each defined over an 8x8 input. Convolving each feature with the image yields a (96 − 8 + 1) x (96 − 8 + 1) = 7921-dimensional convolved feature, and since there are 400 features, each example yields a convolved feature vector of 7921 x 400 = 3,168,400 dimensions. Learning a classifier with more than 3 million input features is unwieldy and prone to over-fitting.
To solve this problem, recall first that we decided to use convolution because images have a "stationarity" property: features that are useful in one image region are very likely to be equally useful in another region. Therefore, to describe a large image, a natural idea is to aggregate statistics of features at different locations. For example, one can compute the mean (or maximum) value of a particular feature over a region of the image. These summary statistics not only have a much lower dimension (compared with using all the extracted features) but also improve the results (less over-fitting). This aggregation operation is called pooling, sometimes mean pooling or max pooling depending on how the pooling is computed.
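The arithmetic from the example above, plus a minimal pooling routine, sketched in numpy (non-overlapping pooling regions are assumed; overlapping variants also exist):

```python
import numpy as np

# The dimension arithmetic from the example above:
conv_dim = 96 - 8 + 1                    # 89
print(conv_dim * conv_dim * 400)         # 3168400 features before pooling

# A minimal pooling routine over non-overlapping p x p regions.
def pool(feature_map, p, mode="max"):
    h, w = feature_map.shape
    out = np.zeros((h // p, w // p))
    for i in range(h // p):
        for j in range(w // p):
            region = feature_map[i*p:(i+1)*p, j*p:(j+1)*p]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

fmap = np.random.randn(8, 8)             # one small feature map
print(pool(fmap, 2).shape)               # (4, 4): 4x fewer values
```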
At this point, the basic structure and principles of convolutional neural networks have been presented.
2.5 Multi-layer convolution
In practical applications, multiple convolutional layers are often used, followed by fully connected layers for training. The purpose of multi-layer convolution is that the features learned by a single convolutional layer tend to be local; the higher the layer, the more global the learned features become.
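A short sketch of why this is so: under stride-1, no-padding convolutions (an assumption for simplicity), each stacked kxk layer widens a unit's receptive field by k−1 pixels, so deeper units see larger, more global patches of the input:

```python
# Receptive-field growth for stacked stride-1 convolutions.
def receptive_field(kernel_sizes):
    rf = 1
    for k in reversed(kernel_sizes):
        rf = rf + (k - 1)          # each kxk layer widens the field by k-1
    return rf

print(receptive_field([3]))        # 3 -- one 3x3 layer: local
print(receptive_field([3, 3]))     # 5 -- two layers see a 5x5 patch
print(receptive_field([3, 3, 3]))  # 7 -- three layers: more global
```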
3 ImageNet-2012 network structure
ImageNet LSVRC is an image classification competition with a training set of 1.27+ million images, a validation set of 50,000 images, and a test set of 150,000 images. This section presents the CNN structure of Alex Krizhevsky, which won the championship in 2012 with a top-5 error rate of 15.3%. It is worth mentioning that in this year's ImageNet LSVRC competition, the winning GoogLeNet has already reached a top-5 error rate of 6.67%. Clearly, there is still huge room for improvement in deep learning.
The figure below is Alex's CNN diagram. Note that the model uses a 2-GPU parallel structure: the 1st, 2nd, 4th, and 5th convolutional layers are trained with the model parameters split into two parts. Going a step further, parallel structures divide into data parallelism and model parallelism. In data parallelism, the model structure is identical on each GPU but the training data is partitioned; the models are trained separately and then fused. In model parallelism, the model parameters of certain layers are partitioned, the same data is used for training on different GPUs, and the results are concatenated as the input of the next layer.
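A toy numpy illustration of the two schemes, simulating the two GPUs with array splits (a sketch of the idea only, not real multi-GPU code):

```python
import numpy as np

x = np.random.randn(8, 100)        # a batch of 8 inputs, 100-dim each
w = np.random.randn(100, 64)       # one layer's weights

# Data parallelism: same full weights on each device, the batch is split;
# each half computes independently, and results (or gradients) are merged.
y_data = np.vstack([half @ w for half in np.split(x, 2, axis=0)])

# Model parallelism: the weights are split across devices, the same batch
# is fed to both, and the partial outputs are concatenated for the next layer.
y_model = np.hstack([x @ part for part in np.split(w, 2, axis=1)])

print(y_data.shape, y_model.shape)  # (8, 64) (8, 64)
```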
The basic parameters of the model are as follows (a feature-map size walkthrough follows the list):
Input: 224x224 image, 3 channels.
First convolutional layer: 96 convolution kernels of size 5x5, 48 on each GPU.
First max-pooling layer: 2x2 kernels.
Second convolutional layer: 256 convolution kernels of size 3x3, 128 on each GPU.
Second max-pooling layer: 2x2 kernels.
Third convolutional layer: fully connected to the previous layer across GPUs, 384 convolution kernels of size 3x3, split as 192 on each of the two GPUs.
Fourth convolutional layer: 384 convolution kernels of size 3x3, 192 on each GPU. This layer is connected to the previous layer without an intervening pooling layer.
Fifth convolutional layer: 256 convolution kernels of size 3x3, 128 on each GPU.
Fifth max-pooling layer: 2x2 kernels.
First fully connected layer: 4096 dimensions; the output of the fifth max-pooling layer is flattened into a one-dimensional vector as the input of this layer.
Second fully connected layer: 4096 dimensions.
Softmax layer: 1000 outputs; each output dimension is the probability that the image belongs to that category.
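Tracing feature-map sizes through the layers listed above: since the text does not give strides or padding, stride-1 "valid" convolutions and stride-2 pooling are assumed, so the numbers are only an illustrative guess:

```python
# Feature-map size walkthrough for the layer list above (assumed strides).
def conv(size, k):      # stride-1 "valid" convolution
    return size - k + 1

def pool(size, p=2):    # 2x2 pooling with stride 2
    return size // p

s = 224                 # input
s = pool(conv(s, 5))    # conv1 (5x5, 96 kernels) + max-pool -> 110
s = pool(conv(s, 3))    # conv2 (3x3, 256 kernels) + max-pool -> 54
s = conv(s, 3)          # conv3 (3x3, 384 kernels)            -> 52
s = conv(s, 3)          # conv4 (3x3, 384 kernels), no pooling -> 50
s = pool(conv(s, 3))    # conv5 (3x3, 256 kernels) + max-pool -> 24
print(s, 24 * 24 * 256) # flattened vector feeding the 4096-d FC layer
```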
4 DeepID network structure
The DeepID network structure is a convolutional neural network developed by Yi Sun of the Chinese University of Hong Kong for learning face features. Each input face is represented as a 160-dimensional vector; the learned vectors are classified by other models, achieving 97.45% accuracy on a face verification test. Going further, the original author improved the CNN and obtained 99.15% accuracy.
As shown above, the structure's specific parameters are similar to those of the ImageNet network, so only the differences are explained.
At the end of this structure there is only one fully connected layer, followed by the softmax layer. In the paper, this fully connected layer is used as the representation of the image. The fully connected layer takes the outputs of both the fourth convolutional layer and the third max-pooling layer as its input, so that both local and global features can be learned.
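A minimal numpy sketch of this multi-scale connection; the feature-map shapes below are invented for illustration (only the 160-dimensional output comes from the text):

```python
import numpy as np

# The fully connected layer reads from BOTH the third max-pooling output
# (lower resolution, more global) and the fourth convolutional layer's
# output (more local). Shapes here are hypothetical.
pool3 = np.random.randn(60, 5, 5)     # hypothetical pool3 feature maps
conv4 = np.random.randn(80, 4, 4)     # hypothetical conv4 feature maps

fc_input = np.concatenate([pool3.ravel(), conv4.ravel()])  # flatten + join
w_fc = np.random.randn(fc_input.size, 160)  # a 160-d face representation
feature = np.maximum(0, fc_input @ w_fc)    # the learned feature vector
print(fc_input.shape, feature.shape)        # (2780,) (160,)
```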