An Introduction to Convolutional Neural Networks for Deep Learning (2)


An introduction to convolutional neural networks

Original address : http://blog.csdn.net/hjimce/article/details/47323463

Author : HJIMCE

The convolutional neural network is an algorithm from many years ago. In recent years deep learning has provided new methods for training multi-layer networks, computing power is far beyond what was available back then, and large amounts of training data have become available. As a result, neural network algorithms have become popular again, and convolutional neural networks have come back to life with them.

Before we begin, note that tutorials on convolutional neural networks generally describe the forward propagation process. Backpropagation is used for training with gradient descent, and most deep learning libraries already encapsulate the computation of the gradients. If you want to study backpropagation itself, you will need to work through it separately.

The classic convolutional neural network model is LeNet-5. Once you understand its forward propagation process you have basically understood CNNs, so the rest of this article mainly explains how LeNet-5 works.

I. Theory

As an introductory CNN article, this one will not dwell on theory. Topics such as weight sharing and local receptive fields come from biologically inspired theory, and most beginners get bored reading about them. There are many blog posts about convolutional neural networks, but most of them copy from one another. For example, I did not understand how to get from the S2 layer to the C3 layer; I read many online tutorials and none of them answered the question. In my experience, the step from S2 to C3 is the hardest part of the whole process to understand, so I will explain it in the most understandable way I can.

1. Convolution

Anyone who has done image processing knows the concept of convolution, so I will not explain it in detail. Given an image and a convolution kernel, convolution computes, for each position of the convolution window, the weighted sum of the pixels under the window.
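The weighted sum can be written out directly. The following is a minimal sketch, not taken from the original post: the helper name `conv2d_valid`, the 4*4 test image and the 3*3 mean kernel are all illustrative assumptions. The kernel is applied without flipping, as is common in CNN descriptions; for a symmetric kernel such as this one the result is the same either way.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`; each output pixel is the weighted sum
    of the pixels under the window (no padding, stride 1, no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * kernel)
    return out

# Hypothetical data: a 4*4 image holding 0..15 and a 3*3 mean filter.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.full((3, 3), 1.0 / 9.0)
result = conv2d_valid(image, kernel)
print(result)  # a 2*2 image, since 4 - 3 + 1 = 2
```

The output side length follows the rule used throughout this article: input size minus kernel size plus one.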


Convolution in a convolutional neural network differs from the image convolution we learned before. As I understand it, in image processing the convolution kernel is generally known in advance: for the various edge detection operators, Gaussian blur and so on, the kernel is given and we simply convolve it with the image. In deep learning, however, the convolution kernels are unknown. Training the network means training these kernels, which play the same role as the parameters W of a single-layer perceptron. So you can think of the convolution kernels to be learned as the trainable parameters W of the neural network.

2. Pooling

When I first started learning about CNNs, the word "pooling" sounded sophisticated, so I looked up a lot of material. There was plenty of theory, but nothing about practice or how the algorithm is implemented, and I still did not know how to pool. In fact, pooling is just downsampling of the image. You will notice that building the layers of a CNN resembles building a Gaussian image pyramid, so if you already know the image pyramid blending algorithm, this becomes easier to understand. In a Gaussian pyramid each level is obtained by convolution followed by downsampling, then convolution again, and a CNN follows the same process. Enough preamble; here is how pooling works in a CNN.

There are many pooling (downsampling) methods for CNNs: mean pooling, max pooling, overlapping pooling, L2 pooling, local contrast normalization, stochastic pooling, and def-pooling (deformation-constrained pooling). The most classic is max pooling, so I will explain how it is implemented:


Original picture

For simplicity, take the picture above as an example. Assume the image is 4*4 and the value of each pixel is the number in the corresponding grid cell. I pool this 4*4 image with a pool size of (2,2) and a stride of 2. Max pooling splits the 4*4 image into blocks of size 2*2 and takes the maximum value of each block as the pixel value of the downsampled image. The calculation is shown in the following figure:


So we finally get the following downsampled image:


This is max pooling. You will also meet other pooling methods, such as mean pooling, which takes the average of each block as the new pixel value. There is also overlapping pooling. My example is non-overlapping: the stride of 2 mentioned above is what keeps the blocks from overlapping. The other common pooling methods will be explained later. For now it is enough to remember max pooling, because it is the most commonly used.
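Non-overlapping pooling with a (2,2) block and stride 2 can be sketched in a few lines of NumPy. The pixel values below are hypothetical, since the numbers in the original figure are not reproduced here, and the function name `pool2d` is an assumption made for this illustration. The sketch assumes the image height and width are divisible by the block size.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Non-overlapping pooling: block size `size`, stride equal to `size`."""
    h, w = image.shape
    # After the reshape, axes 1 and 3 index the pixels inside each block.
    blocks = image.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

# Hypothetical 4*4 image.
image = np.array([[1, 3, 2, 1],
                  [4, 6, 5, 2],
                  [7, 2, 9, 8],
                  [1, 0, 3, 4]], dtype=float)

pooled_max = pool2d(image, 2, "max")    # maximum of each 2*2 block
pooled_mean = pool2d(image, 2, "mean")  # average of each 2*2 block
print(pooled_max)
print(pooled_mean)
```

Switching between max pooling and mean pooling only changes the reduction applied to each block; the way the image is split into blocks stays the same.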

3. Feature maps

"Feature map" is a technical term. So what is a feature map? When an image is convolved with a convolution kernel, the result is another image, and that image is a feature map. In a CNN we train more than one convolution kernel per layer. The kernels are used to extract features: the more kernels there are, the more features they extract and, in theory, the higher the accuracy. However, more kernels also means more parameters to train. The classic LeNet-5 structure uses 6 kernels in the first layer, and AlexNet uses 96 in the first layer. How many are appropriate has to be learned from experience.

Back to feature maps: for each convolution layer of a CNN we have to choose the number of convolution kernels and the kernel size by hand. Convolving each kernel with the image gives one feature map. In the classic LeNet-5 structure, for example, the first layer has 6 kernels, so we get 6 feature maps, which are the input to the next layer. We can also treat the input image itself as a feature map that serves as the input to the first layer.
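As a rough illustration of "one kernel, one feature map", the sketch below applies 6 kernels of size 5*5 to a single 32*32 image and stacks the results. The image and kernel values are random placeholders, and `conv2d_valid` is a helper written for this example, not a library function.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Weighted sum of the pixels under each window position (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

rng = np.random.RandomState(0)                # arbitrary seed
image = rng.rand(32, 32)                      # placeholder input image
kernels = rng.uniform(-1, 1, size=(6, 5, 5))  # 6 kernels of size 5*5

# One feature map per kernel.
feature_maps = np.stack([conv2d_valid(image, k) for k in kernels])
print(feature_maps.shape)  # 6 feature maps of size 28*28
print(kernels.size)        # number of kernel weights: 6 * 5 * 5
```

Doubling the number of kernels doubles both the number of feature maps and the number of weights to train, which is the trade-off described above.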

4. Classic CNN architectures

Those of us who are just getting started with CNNs first need to know some of the classic architectures:

(1) LeNet-5. This is a classic CNN structure from many years ago, used mainly for handwriting recognition. It is also the network that newcomers need to become familiar with, and it is the main subject of this post.


(2) AlexNet.


Alex's AlexNet model won the 2012 ImageNet image classification challenge using a CNN. The result was inspiring: others had pushed traditional machine learning algorithms as far as they could go, and Alex's CNN achieved accuracy far beyond those traditional methods.

There are other architectures as well, such as "Network in Network", GoogLeNet and Deconvolution Network, which we will meet in later study. For example, a deconvolution network can be used to deblur a blurred picture, which is very cool.

That is enough theory. Next comes LeNet-5, a classic CNN for handwriting recognition:


LeNet-5 structure

Input: a 32*32 picture of a handwritten digit from 0 to 9, so there are 10 categories of pictures.

Output: the classification result, a digit from 0 to 9.

So this is a multi-class classification problem with 10 classes. The final output layer of the neural network is therefore a softmax layer with 10 neurons. The LeNet-5 structure, layer by layer:

Input layer: a 32*32 picture, which is equivalent to 1024 neurons.

C1 layer: the authors of the paper chose 6 convolution kernels of size 5*5, which gives 6 feature maps. Each feature map has side length 32-5+1=28, so the number of neurons per feature map drops from 1024 to 28*28=784.

S2 layer: this is the downsampling layer, which uses max pooling with a pool size of (2,2). Each 28*28 picture from the C1 layer is split into blocks of size 2*2, which gives 14*14 blocks. We take the maximum value of each block as the new pixel of the downsampled picture, so the result of S2 is 6 pictures of size 14*14.

C3 layer: a convolution layer. The convolution kernel size is still 5*5, so the new picture size is 14-5+1=10, and we want 16 feature maps. Here is the problem, and this layer is the hardest to understand. S2 contains 6 pictures of size 14*14, and we want this layer to produce 16 pictures of size 10*10. Each of these 16 pictures is a weighted combination of the 6 pictures in S2, but how exactly are they combined? The problem is shown in the following illustration:


To explain this, let's start with a simple case. Assume the 6 input feature maps are each 5*5 and we convolve them with 6 convolution kernels of size 5*5, which gives 6 result pictures of size 1*1, as shown in the following figure:


For simplicity, let me first define some notation. Let the pixel values of the i-th input feature map be x1i, x2i, ..., x25i, since each feature map has 25 pixels, and let w1i, w2i, ..., w25i be the weights of the 5*5 kernel applied to it. After convolving the 5*5 picture, the pixel value pi of the result picture is:

pi = w1i*x1i + w2i*x2i + ... + w25i*x25i

This is just the convolution formula, so I will not explain it further. p1~p6 are all computed directly from this formula. We then add p1~p6 together:

p = p1 + p2 + ... + p6

Substituting the formula for pi into this sum gives:

p = wx

where x holds the pixel values of the 6 input 5*5 feature maps and w holds the parameters we need to learn. w is equivalent to 6 convolution kernels of size 5*5, so it contains 6*(5*5) parameters. The output feature map is then:

out = f(p + b)

This is how C3 is computed from S2, where b is the bias and f is the activation function.
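The simple case can be checked numerically. In the sketch below the inputs and weights are random placeholders, the bias value 0.1 is arbitrary, and tanh is only an assumed choice for the activation f, which the text leaves unspecified. The point is that summing the six per-map results p1~p6 is the same as one dot product between all 150 weights and all 150 input pixels.

```python
import numpy as np

rng = np.random.RandomState(0)                # arbitrary seed
x = rng.rand(6, 5, 5)                         # six 5*5 input feature maps
w = rng.uniform(-1, 1, size=(6, 5, 5))        # one 5*5 kernel per input map
b = 0.1                                       # arbitrary bias

# p_i: weighted sum of map i with its own kernel (a 1*1 convolution result).
p_i = np.array([np.sum(w[i] * x[i]) for i in range(6)])
p_sum = p_i.sum()                             # p = p1 + p2 + ... + p6

# The same value as a single dot product over all 6*(5*5) weights.
p_dot = np.dot(w.ravel(), x.ravel())          # p = wx

out = np.tanh(p_dot + b)                      # out = f(p + b), f assumed to be tanh
print(p_sum, p_dot, out)
```

Because the two values agree, the six kernels can be treated as one block of 6*(5*5) = 150 learnable parameters.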

Back to the original question: there are 6 input feature maps of size 14*14, we use 5*5 convolution kernels, and we want to end up with one output feature map of size 10*10.

Following the process above, we convolve each input feature map with a 5*5 convolution kernel. Each input feature map has its own kernel parameters, that is, the kernels are not shared across input maps, so we need 6*(5*5) parameters. Convolving each input feature map gives 6 new 10*10 pictures. We add these 6 pictures together, add a bias term b, and apply the activation function to get one 10*10 output feature map.

We want 16 output feature maps of size 10*10, so the number of convolution parameters needed is 16*(6*(5*5)) = 16*6*(5*5). In short, each picture in the C3 layer is obtained by convolving every S2 picture with its own kernel, adding the results together, adding the bias b, and applying the activation function.
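The full S2 to C3 step, as described above, might look like the following sketch. It uses random placeholder data, assumes tanh as the activation function, and the names `conv2d_valid` and `conv_layer` are made up for this illustration; it is not the Theano tutorial's implementation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Weighted sum of the pixels under each window position (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def conv_layer(inputs, weights, biases, activation=np.tanh):
    """inputs: (n_in, H, W); weights: (n_out, n_in, kh, kw); biases: (n_out,)."""
    n_out, n_in, kh, kw = weights.shape
    out_h = inputs.shape[1] - kh + 1
    out_w = inputs.shape[2] - kw + 1
    outputs = np.zeros((n_out, out_h, out_w))
    for o in range(n_out):
        total = np.zeros((out_h, out_w))
        for i in range(n_in):
            # Each (output map, input map) pair has its own 5*5 kernel.
            total += conv2d_valid(inputs[i], weights[o, i])
        outputs[o] = activation(total + biases[o])
    return outputs

rng = np.random.RandomState(0)                     # arbitrary seed
s2 = rng.rand(6, 14, 14)                           # 6 feature maps from S2
w = rng.uniform(-0.1, 0.1, size=(16, 6, 5, 5))     # 16*6 kernels of size 5*5
b = np.zeros(16)                                   # one bias per output map

c3 = conv_layer(s2, w, b)
print(c3.shape)  # 16 feature maps of size 10*10
print(w.size)    # 16 * 6 * (5 * 5) kernel weights
```

The weight array has exactly the shape (16, 6, 5, 5), which matches the parameter count 16*6*(5*5) given above; the 16 biases are counted separately.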

S4 layer: a downsampling layer, which is relatively simple. It applies max pooling with a 2*2 pool block to each of the 16 10*10 pictures from C3, so the S4 layer ends up with 16 pictures of size 5*5. The number of neurons has now been reduced to 16*5*5=400.

C5 layer: we again convolve with 5*5 kernels, and this time we want 120 feature maps. The picture size in the C5 layer is 5-5+1=1, so each feature map is a single neuron, and with 120 feature maps only 120 neurons remain. This number is small enough that we can feed these 120 neurons directly into a fully connected neural network. What follows is standard for anyone who knows the multilayer perceptron, so I will not explain it.
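For readers who want to see the last step anyway, here is a minimal sketch of a fully connected softmax output layer with 10 neurons fed by the 120 C5 values. Connecting C5 directly to the 10 outputs is a simplifying assumption made for this example, and the weights and activations are random placeholders rather than trained values.

```python
import numpy as np

def softmax(scores):
    """Turn 10 raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

rng = np.random.RandomState(0)                 # arbitrary seed
c5 = rng.rand(120)                             # stand-in for the 120 C5 neurons
W = rng.uniform(-0.1, 0.1, size=(10, 120))     # one weight row per digit class
b = np.zeros(10)                               # one bias per digit class

probs = softmax(W.dot(c5) + b)
predicted_digit = int(np.argmax(probs))        # class with the highest probability
print(probs)
print(predicted_digit)
```

The predicted class is simply the index of the largest probability, which is how the 10 output neurons map to a digit from 0 to 9.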

The structure above is only a reference. In real use, the number of feature maps in each layer, the convolution kernel sizes and the pooling sampling rate can all be changed. This is called tuning the CNN, and we need to learn to be flexible about it.

For example, we could change the structure above so that the C1 layer uses 7*7 kernels and the C3 layer uses 3*3 kernels, and choose the number of feature maps ourselves. The resulting handwriting recognition accuracy might even be higher than with the structure above. In short: stay flexible and learn how to tune a CNN.
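When trying out such variations it helps to recompute the feature map sizes automatically. The helper below is a small sketch written for this purpose; the function name `layer_sizes` and the tuple layout are assumptions. It uses the two rules from this article: a convolution reduces the side length to size - kernel + 1, and non-overlapping pooling divides it by the pool size.

```python
def layer_sizes(input_size, layers):
    """Return [(name, n_maps, side)] for a stack of 'conv' and 'pool' layers."""
    size = input_size
    result = [("input", 1, size)]
    for name, kind, n_maps, k in layers:
        if kind == "conv":
            size = size - k + 1      # convolution without padding
        else:
            size = size // k         # non-overlapping pooling
        result.append((name, n_maps, size))
    return result

# The LeNet-5 structure described in this article.
lenet5 = [("C1", "conv", 6, 5), ("S2", "pool", 6, 2),
          ("C3", "conv", 16, 5), ("S4", "pool", 16, 2),
          ("C5", "conv", 120, 5)]

# The variation mentioned above: 7*7 kernels in C1 and 3*3 kernels in C3.
variant = [("C1", "conv", 6, 7), ("S2", "pool", 6, 2),
           ("C3", "conv", 16, 3)]

lenet5_shapes = layer_sizes(32, lenet5)
variant_shapes = layer_sizes(32, variant)

for name, n_maps, side in lenet5_shapes:
    print("%s: %d maps of %d*%d = %d neurons"
          % (name, n_maps, side, side, n_maps * side * side))
```

For the LeNet-5 configuration the side lengths come out as 32, 28, 14, 10, 5 and 1, matching the walkthrough above.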

II. Practice

The CNN source code implementation studied here comes from: http://deeplearning.net/tutorial/lenet.html#lenet

1. Training Data Acquisition

The Theano deep learning tutorials use a library of handwritten digits that can be downloaded from the Internet, named mnist.pkl.gz. It contains three parts: the training set train_set with 50000 training samples, the validation set valid_set, and the test set test_set. We can read the data with the following Python 2 code and then display one of the pictures with matplotlib:

    import cPickle
    import gzip
    import numpy as np
    import matplotlib.pyplot as plt

    f = gzip.open('mnist.pkl.gz', 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()

    tx, ty = train_set

    # Inspect the training samples
    print np.shape(tx)  # tx is a two-dimensional matrix of size (50000, 28*28)
    print np.shape(ty)  # ty holds the 50000 labels

    # Display one picture
    a = tx[8].reshape(28, 28)  # the training sample at index 8
    y = ty[8]
    print y
    plt.imshow(a, cmap='gray')  # show the handwritten digit
    plt.show()

The code above displays the training sample at index 8, and you can see the following result:


The sample at index 8 is the digit 1.

2. LeNet-5 implementation

First, note that the pictures in mnist.pkl.gz are 28*28. With a 5*5 convolution kernel the first convolution gives 24*24 feature maps, and here we let the C1 layer produce 20 feature maps, and so on. The code is as follows:

    import os
    import sys
    import timeit

    import numpy

    import theano
    import theano.tensor as T
    from theano.tensor.signal import downsample
    from theano.tensor.nnet import conv

    from logistic_sgd import LogisticRegression, load_data
    from mlp import HiddenLayer


    # One layer of the convolutional neural network: convolution + downsampling.
    # Order of operations: convolution -> downsampling -> activation function
    class LeNetConvPoolLayer(object):

        # image_shape describes the input data, filter_shape describes the kernels
        def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2)):
            """
            :type rng: numpy.random.RandomState
            :param rng: a random number generator used to initialize weights

            input: the input feature map data, i.e. n feature maps

            filter_shape: (number of filters, num input feature maps,
                           filter height, filter width)
                number of filters: the number of convolution kernels; this layer
                    produces the same number of output feature maps.
                num input feature maps: the number of input feature maps.
                filter height, filter width: the kernel size, e.g. 5*5, 9*9, ...
                filter_shape is a list, so filter_shape[0] is the number of kernels.

            image_shape: (batch size, num input feature maps,
                          image height, image width)
                batch size: number of samples in one training batch
                num input feature maps: number of input feature maps
                image height, image width: size of each input feature map
                image_shape is a list, so the 4 values can be accessed by
                index 0~3, e.g. image_shape[2] = image height,
                image_shape[3] = image width

            poolsize: block size for pooling, typically (2, 2)
            """

            # The number of input feature maps must match, otherwise raise an error
            assert image_shape[1] == filter_shape[1]
            self.input = input

            # fan_in = num input feature maps * filter height * filter width
            # numpy.prod(x) computes the product of the elements of x,
            # so fan_in is the number of connection weights needed for each
            # output feature map
            fan_in = numpy.prod(filter_shape[1:])
            # fan_out = num output feature maps * filter height * filter width
            #           / pool size
            fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) /
                       numpy.prod(poolsize))
            # Initialize the weights by sampling uniformly from [-a, a],
            # where a = sqrt(6. / (fan_in + fan_out)).
            # Number of weights: number of kernels * number of input feature maps
            # * kernel width * kernel height. This does not include any
            # connection weights for the sampling layer.
            W_bound = numpy.sqrt(6. / (fan_in + fan_out))
            self.W = theano.shared(
                numpy.asarray(
                    rng.uniform(low=-W_bound, high=W_bound, size=filter_shape),
                    dtype=theano.config.floatX
                ),
                borrow=True
            )

            # b is the bias, a one-dimensional vector. Each output feature map i
            # has one bias b[i], so the length of b is the number of output
            # feature maps, filter_shape[0]
            b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX)
            self.b = theano.shared(value=b_values, borrow=True)

            # Convolution. The first argument of conv.conv2d is the input feature
            # maps, the second is the randomly initialized kernel weights, the
            # third describes the kernels and the fourth describes the input
            # feature maps
            conv_out = conv.conv2d(
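The weight initialization described in the listing's comments can be checked in isolation with plain NumPy, without Theano. In this sketch the filter_shape (20, 1, 5, 5) follows the text above (20 feature maps, 5*5 kernels, one grayscale input map), poolsize is the default (2, 2), and the random seed is an arbitrary choice made for the example.

```python
import numpy as np

filter_shape = (20, 1, 5, 5)   # 20 kernels, 1 input feature map, 5*5 kernels
poolsize = (2, 2)

# Weights feeding each output unit: 1 * 5 * 5
fan_in = np.prod(filter_shape[1:])
# 20 * 5 * 5 / (2 * 2)
fan_out = filter_shape[0] * np.prod(filter_shape[2:]) / np.prod(poolsize)
# Bound of the uniform distribution the weights are drawn from
w_bound = np.sqrt(6.0 / (fan_in + fan_out))

rng = np.random.RandomState(1234)  # arbitrary seed for this example
W = rng.uniform(low=-w_bound, high=w_bound, size=filter_shape)
b = np.zeros((filter_shape[0],))   # one bias per output feature map

print(fan_in, fan_out, w_bound)
print(W.shape, b.shape)
```

With these shapes fan_in is 25 and fan_out is 125, so every initial weight lies between -0.2 and 0.2.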
