
Convolutional Neural Networks

Reprinted from: http://blog.csdn.net/stdcoutzyx/article/details/41596663

Since July of this year, I have been responsible for convolutional neural networks (CNNs) in our laboratory, during which time I configured and used Theano, cuda-convnet, and cuda-convnet2. To deepen my understanding and use of CNNs, I wrote this post in the hope of exchanging ideas with others for mutual benefit. Before the main text, a few of my own impressions of CNNs. First, to be clear: deep learning is a general term for all deep learning algorithms, and a CNN is an application of deep learning algorithms to the field of image processing.

    • First, before studying deep learning and CNNs, I always thought they were some profound body of knowledge that could solve all sorts of problems. Only after studying them did I realize that they are similar to other machine learning algorithms such as SVM: you can still think of a CNN as a classifier, and you can still use it as you would a black box.

    • Second, a powerful aspect of deep learning is that the output of one layer of the network can be used as another representation of the data, which can be regarded as features learned by the network. Based on these features, further similarity comparisons and the like can be made.

    • Third, the key to the effectiveness of deep learning algorithms is actually large-scale data. The reason is that every deep model contains a huge number of parameters, and a small amount of data cannot train those parameters fully.

Enough preamble; let's get straight to the point and begin the tour of CNNs.

1. Neural Networks

First, a brief introduction to neural networks; for details, refer to Resource [1]. Each unit of a neural network is as follows:

Its corresponding formula is as follows:
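
(The original figure and formula are not reproduced in this reprint; the following is a reconstruction from the UFLDL tutorial [1] on which this section is based, with inputs $x_1, x_2, x_3$, an intercept term, and a sigmoid activation $f$.)

    h_{W,b}(x) = f(W^\top x + b) = f\left(\sum_{i=1}^{3} W_i x_i + b\right), \qquad f(z) = \frac{1}{1 + e^{-z}}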

This unit can also be called a logistic regression model. When many such units are combined in a layered structure, a neural network model is formed. The figure shows a neural network with one hidden layer.

Its corresponding formula is as follows:
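
(Again the figure is missing; reconstructing from [1] in compact matrix form, and hedging on the exact notation: $W^{(l)}$ and $b^{(l)}$ are the weights and biases of layer $l$, and $f$ is applied element-wise.)

    a^{(2)} = f(W^{(1)} x + b^{(1)}), \qquad h_{W,b}(x) = f(W^{(2)} a^{(2)} + b^{(2)})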

Similarly, the network can be expanded to have 2, 3, 4, 5, ... hidden layers.

The training method of a neural network is similar to that of logistic regression, but because the network has multiple layers, the chain rule of differentiation must be used to compute gradients for the hidden-layer nodes; that is, gradient descent plus the chain rule, known professionally as backpropagation. This article does not cover the training algorithm in detail.
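
To make the idea concrete, here is a minimal sketch (not from the original post) of gradient descent plus the chain rule on a one-hidden-layer network with sigmoid units and squared error; the toy data, learning rate, and layer sizes are all illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)                                 # 100 samples, 3 inputs
    y = (X.sum(axis=1, keepdims=True) > 0).astype(float)  # toy labels

    W1 = rng.randn(3, 4) * 0.1; b1 = np.zeros(4)          # input -> hidden
    W2 = rng.randn(4, 1) * 0.1; b2 = np.zeros(1)          # hidden -> output

    lr = 0.5
    for step in range(1000):
        # forward pass
        a1 = sigmoid(X @ W1 + b1)              # hidden activations
        out = sigmoid(a1 @ W2 + b2)            # network output
        # backward pass: apply the chain rule layer by layer
        d_out = (out - y) * out * (1 - out)    # squared-error gradient at output
        d_a1 = (d_out @ W2.T) * a1 * (1 - a1)  # propagated to the hidden layer
        # gradient descent update
        W2 -= lr * a1.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_a1 / len(X);   b1 -= lr * d_a1.mean(axis=0)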

2 Convolutional Neural Networks

In image processing, an image is often represented as a vector of pixels; for example, a 1000x1000 image can be represented as a vector with 1,000,000 elements. In the neural network of the previous section, if the hidden layer has as many units as the input layer, also 1,000,000, then the parameters from the input layer to the hidden layer number 1000000x1000000 = 10^12. That is far too many; such a network basically cannot be trained. Therefore, to practice the great art of image processing with neural networks, we must first reduce the number of parameters to speed things up, much like the Evil-Warding Sword manual of the wuxia novels: ordinary people practice it with great frustration, but once the sacrifice is made, the internal energy grows, the sword speeds up, and one becomes formidable.

2.1 Local Receptive Fields

Convolutional neural networks have two tricks for reducing the number of parameters. The first trick is the local receptive field. It is generally believed that human perception of the outside world proceeds from local to global, and the spatial relationships of an image are similar: nearby pixels are closely related, while distant pixels are only weakly correlated. Therefore, each neuron does not actually need to perceive the whole image; it only needs to perceive a local region, and the local information is then combined at higher layers to obtain global information. The idea of a partially connected network is also inspired by the structure of the biological visual system: neurons in the visual cortex have local receptive fields (that is, they respond only to stimuli in certain regions). In the figure, the left image is fully connected, and the right image is locally connected.

In the locally connected image on the right, if each neuron is connected to only a 10x10 pixel region, then the weights number 1000000x100, a reduction to one ten-thousandth of the original. And the 10x10 parameters corresponding to each 10x10 pixel patch are in fact equivalent to a convolution operation.

2.2 Parameter Sharing

But in fact this is still too many parameters, so we bring in the second trick: weight sharing. In the local connection above, each neuron has 100 parameters, and there are 1,000,000 neurons in total. If all 1,000,000 neurons share the same 100 parameters, the number of parameters drops to just 100, as the sketch below confirms.
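
As a quick sanity check on these numbers (a sketch, not from the original post):

    # parameter counts for a 1000x1000 image with one hidden unit per pixel
    n_pixels = 1000 * 1000

    fully_connected   = n_pixels * n_pixels    # every unit sees every pixel: 10**12
    locally_connected = n_pixels * (10 * 10)   # each unit sees a 10x10 patch: 10**8
    weight_shared     = 10 * 10                # one shared 10x10 kernel: 100

    print(fully_connected, locally_connected, weight_shared)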

How should weight sharing be understood? We can regard these 100 parameters (that is, the convolution kernel) as a way of extracting features that is independent of position. The implicit principle is that the statistics of one part of an image are the same as those of any other part. This means that features learned in one part can also be used in another part, so the same learned features can be applied at every position in the image.

More intuitively, when a small patch, say 8x8, is randomly sampled from a large image and some features are learned from this small sample, the features learned from the 8x8 patch can be applied as a detector anywhere in the image. In particular, we can convolve the original large image with the features learned from the 8x8 sample, obtaining, for every position in the large image, an activation value for each feature.

As shown in the figure, a 3x3 convolution kernel is convolved over a 5x5 image. Each convolution is a feature-extraction method that, like a sieve, filters out the parts of the image that satisfy its condition (the larger the activation value, the better the condition is satisfied).
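
The following is a minimal sketch of such a "valid" convolution in NumPy (strictly speaking a cross-correlation, i.e. without kernel flipping, as in most CNN implementations); the averaging kernel is just an illustrative choice:

    import numpy as np

    def conv2d_valid(image, kernel):
        kh, kw = kernel.shape
        oh = image.shape[0] - kh + 1   # output height
        ow = image.shape[1] - kw + 1   # output width
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                # activation value of the kernel at position (i, j)
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)
    kernel = np.ones((3, 3)) / 9.0             # a simple averaging "sieve"
    print(conv2d_valid(image, kernel).shape)   # (3, 3)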

2.3 Multiple Convolution Kernels

With only 100 parameters as described above, there is just one 10x10 convolution kernel, and feature extraction is obviously insufficient. We can add more convolution kernels, say 32, to learn 32 features. The case of multiple convolution kernels is shown in the figure:

On the right, different colors indicate different convolution kernels. Each kernel turns the image into another image. For example, two kernels generate two images, which can be viewed as two channels of a single image. As shown, the figure contains a small error: W1 should be W0, and W2 should be W1. They are still referred to as W1 and W2 below.

The figure shows a convolution operation on four channels with two convolution kernels, generating two output channels. Note that each of the four channels corresponds to one kernel slice. Ignoring W2 for now and looking only at W1: the value at position (i,j) of the W1 output is obtained by adding up the convolution results at (i,j) of all four channels and then taking the value of the activation function.

Therefore, in convolving 4 channels into 2 channels, the number of parameters is 4x2x2x2, where 4 is the number of input channels, the first 2 is the number of output channels, and the final 2x2 is the size of the convolution kernel.
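
A small sketch of this multi-channel case (illustrative shapes, not from the original post): each output channel sums the per-channel convolution results at (i, j) before any activation function is applied:

    import numpy as np

    def multi_channel_conv(x, kernels):
        # x: (in_channels, H, W); kernels: (out_channels, in_channels, kh, kw)
        oc, ic, kh, kw = kernels.shape
        oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
        out = np.zeros((oc, oh, ow))
        for o in range(oc):
            for i in range(oh):
                for j in range(ow):
                    # sum over all input channels' patches at (i, j)
                    out[o, i, j] = np.sum(x[:, i:i+kh, j:j+kw] * kernels[o])
        return out

    x = np.random.randn(4, 5, 5)             # 4 input channels
    w = np.random.randn(2, 4, 2, 2)          # 2 kernels of size 2x2
    print(w.size)                            # 32 = 4 x 2 x 2 x 2 parameters
    print(multi_channel_conv(x, w).shape)    # (2, 4, 4): two output channels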

2.4 Pooling (Down-sampling)

After features have been obtained by convolution, the next step is to use them for classification. In theory, one could use all the extracted features to train a classifier such as a softmax classifier, but this poses a computational challenge. For example, for a 96x96 pixel image, suppose we have learned 400 features, each defined over an 8x8 input. Convolving each feature with the image yields a (96 - 8 + 1) x (96 - 8 + 1) = 7921-dimensional convolved feature map; since there are 400 features, each example yields a convolved feature vector of 7921 x 400 = 3,168,400 dimensions. Learning a classifier with more than 3 million input features is inconvenient and prone to over-fitting.

To solve this problem, first recall that we decided to use convolution because images have a "stationarity" property, meaning that features useful in one region of the image are likely to be equally useful in another region. Therefore, to describe a large image, a natural idea is to aggregate statistics of the features at different locations; for example, one can compute the mean (or maximum) value of a particular feature over a region of the image. These summary statistics not only have a much lower dimension (compared with using all the extracted features) but also tend to improve the results (less over-fitting). This aggregation operation is called pooling, sometimes mean pooling or max pooling (depending on how the pooling is computed).
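
A sketch of both the dimension arithmetic above and a non-overlapping max pooling (the image and feature sizes are from the example above; the pooling region of 8 is an arbitrary illustrative choice):

    import numpy as np

    conv_dim = 96 - 8 + 1                    # 89
    print(conv_dim * conv_dim)               # 7921 dimensions per feature
    print(conv_dim * conv_dim * 400)         # 3168400 dimensions in total

    def max_pool(fmap, size):
        # crop so the map divides evenly, then take the max of each block
        h = (fmap.shape[0] // size) * size
        w = (fmap.shape[1] // size) * size
        blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
        return blocks.max(axis=(1, 3))

    fmap = np.random.randn(89, 89)           # one convolved feature map
    print(max_pool(fmap, 8).shape)           # (11, 11): 121 values instead of 7921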

At this point, the basic structure and principles of convolutional neural networks have been explained.

2.5 Multi-layer Convolution

In practical applications, multiple convolutional layers are often used, followed by fully connected layers for training. The purpose of multi-layer convolution is that the features learned by a single convolutional layer are often local; the higher the layer, the more global the learned features become.
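
One way to see why higher layers are more global (a rough sketch, assuming stride-1 3x3 convolutions and ignoring pooling and padding): the receptive field grows by k - 1 = 2 pixels with every stacked layer, so each unit in a deeper layer "sees" a larger region of the original image:

    # receptive field of n stacked 3x3, stride-1 convolutional layers
    rf = 1
    for layer in range(1, 6):
        rf += 3 - 1                          # each 3x3 layer adds (k - 1)
        print(f"after layer {layer}: {rf}x{rf}")
    # after layer 1: 3x3 ... after layer 5: 11x11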

3 ImageNet-2012 Network Structure

ImageNet LSVRC is an image classification competition with a training set of more than 1.27 million images, a validation set of 50,000 images, and a test set of 150,000 images. This section describes the CNN structure of Alex Krizhevsky [4], which won the 2012 championship with a top-5 error rate of 15.3%. It is worth mentioning that GoogLeNet, the winner of this year's ImageNet LSVRC, has reached a top-5 error rate of 6.67%. Clearly there is still huge room for improvement in deep learning.

The figure shows Alex's CNN structure. Note that the model uses a two-GPU parallel structure: in the 1st, 2nd, 4th, and 5th convolutional layers, the model parameters are split into two parts for training. Going a step further, parallel structures can be divided into data parallelism and model parallelism. In data parallelism, the model structure is the same on each GPU but the training data is split; the models are trained separately and then fused. In model parallelism, the model parameters of certain layers are split, the same data is used for training on different GPUs, and the results are concatenated as the input of the next layer.

The basic parameters of the model are:
    • Input: 224x224 image, 3 channels.
    • First convolutional layer: 96 convolution kernels of size 11x11, 48 on each GPU.
    • First max-pooling layer: 2x2 kernel.
    • Second convolutional layer: 256 kernels of size 5x5, 128 on each GPU.
    • Second max-pooling layer: 2x2 kernel.
    • Third convolutional layer: fully connected to the previous layer, 384 kernels of size 3x3, split as 192 on each of the two GPUs.
    • Fourth convolutional layer: 384 kernels of size 3x3, 192 on each of the two GPUs. This layer is connected to the previous one without an intervening pooling layer.
    • Fifth convolutional layer: 256 kernels of size 3x3, 128 on each of the two GPUs.
    • Fifth max-pooling layer: 2x2 kernel.
    • First fully connected layer: 4096 dimensions; the output of the fifth max-pooling layer is flattened into a one-dimensional vector as the input of this layer.
    • Second fully connected layer: 4096 dimensions.
    • Softmax layer: 1000 outputs, where each dimension is the probability that the image belongs to the corresponding category.

4 DeepID Network Structure

The DeepID network structure is a convolutional neural network developed by Yi Sun at the Chinese University of Hong Kong to learn facial features [5]. Each input face is represented as a 160-dimensional vector; the learned vectors are classified by other models, achieving a 97.45% accuracy rate on the face verification test. Furthermore, the original authors improved the CNN and obtained a 99.15% accuracy rate.

As shown in the figure, the structure and specific parameters are similar to those of the ImageNet model, so only the differences are explained.

In this structure there is only one fully connected layer at the end, followed by the softmax layer. The paper uses this fully connected layer as the representation of the image. At the fully connected layer, the outputs of both the third max-pooling layer and the fourth convolutional layer are used as input, so that both local and global features can be learned.
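
A sketch of this multi-scale input to the fully connected layer (the shapes are illustrative only, not the paper's actual dimensions): the two sets of feature maps are flattened and concatenated into a single input vector:

    import numpy as np

    pool3 = np.random.randn(60, 5, 4)        # earlier, more local feature maps
    conv4 = np.random.randn(80, 4, 3)        # later, more global feature maps
    fc_input = np.concatenate([pool3.ravel(), conv4.ravel()])
    print(fc_input.shape)                    # (2160,): one vector feeds the FC layer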

5 Reference Resources
    • [1] http://deeplearning.stanford.edu/wiki/index.php/UFLDL%E6%95%99%E7%A8%8B — a translation of the Stanford deep learning research team's UFLDL deep learning tutorial.
    • [2] http://blog.csdn.net/zouxy09/article/details/14222605 — CSDN blogger zouxy09's deep learning tutorial series.
    • [3] http://deeplearning.net/tutorial/ — implementing deep learning with Theano.
    • [4] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems. 2012: 1097-1105.
    • [5] Sun Y, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes[C]//Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014: 1891-1898.
