(Reproduced) Convolutional Neural Networks


Convolutional Neural Networks: Contents
    1. One: Lead-in: the back-propagation (BP) algorithm
    2. Network structure
    3. Learning algorithm
    4. Two: Convolutional neural networks
    5. Three: LeCun's LeNet-5
    6. Four: The training process of CNNs
    7. Five: Summary

This is my weekly report for 2014-08-22. Parts of it draw on the following blog posts and papers, which you can consult if anything in the text is unclear. My thanks to Yann LeCun, celerychen2009, and Zouxy09.

    1. Deep Learning Study Notes Series (VII)
    2. Deep Learning paper notes (IV): Derivation and implementation of CNNs (convolutional neural networks)
    3. Convolutional Neural Networks
    4. The back-propagation ("reverse conduction") algorithm
    5. Yann LeCun's famous 1998 publication "Gradient-Based Learning Applied to Document Recognition"
    6. The back-propagation (BP) algorithm

One: Lead-in: the back-propagation (BP) algorithm

Network structure

The classic BP network has a three-layer structure: an input layer X, a hidden layer Y, and an output layer O.

Input vector: $X = (x_1, x_2, \ldots, x_n)^T$

Hidden layer output: $Y = (y_1, y_2, \ldots, y_m)^T$, with input-to-hidden weights $V = (V_1, V_2, \ldots, V_m)$

Output vector: $O = (o_1, o_2, \ldots, o_l)^T$, with hidden-to-output weights $W = (W_1, W_2, \ldots, W_l)$

Expected output: $D = (d_1, d_2, \ldots, d_l)^T$

Learning Algorithms

The computation from the input layer to the hidden layer is:

$y_j = f\left(\sum_{i=1}^{n} v_{ij} x_i\right), \quad j = 1, 2, \ldots, m$

The computation from the hidden layer to the output layer is:

$o_k = f\left(\sum_{j=1}^{m} w_{jk} y_j\right), \quad k = 1, 2, \ldots, l$

where $f$ is the transfer function, typically the sigmoid.

The error function at the network output is:

$E = \frac{1}{2}(D - O)^2 = \frac{1}{2}\sum_{k=1}^{l}(d_k - o_k)^2$

Expanding the error function to the hidden layer:

$E = \frac{1}{2}\sum_{k=1}^{l}\left[d_k - f\left(\sum_{j=1}^{m} w_{jk} y_j\right)\right]^2$

The training process makes the final $E$ as small as possible. To reach the optimum, we take the partial derivative of $E$ with respect to each weight and adjust the weight along the negative gradient:

$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}}, \qquad \Delta v_{ij} = -\eta \frac{\partial E}{\partial v_{ij}}$

$\eta$ is a proportional coefficient (the learning rate). After a series of calculations, with a sigmoid $f$ the formulas reduce to:

$\Delta w_{jk} = \eta\,(d_k - o_k)\,o_k(1 - o_k)\,y_j$

$\Delta v_{ij} = \eta\left[\sum_{k=1}^{l}(d_k - o_k)\,o_k(1 - o_k)\,w_{jk}\right] y_j(1 - y_j)\,x_i$

The weight matrices are adjusted to reduce the error, and the process loops until the optimum is reached.
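As a concrete illustration, here is a minimal NumPy sketch of this three-layer BP procedure, assuming sigmoid activations and the squared-error loss above; the layer sizes, learning rate, and all names are illustrative, not from the original post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m, l = 4, 8, 3                          # input, hidden, output sizes (illustrative)
eta = 0.5                                  # proportional coefficient (learning rate)
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(n, m))     # input -> hidden weights
W = rng.normal(scale=0.1, size=(m, l))     # hidden -> output weights

def train_step(x, d, V, W):
    # Forward pass: input layer -> hidden layer -> output layer.
    y = sigmoid(x @ V)                     # hidden layer output Y
    o = sigmoid(y @ W)                     # actual output O
    E = 0.5 * np.sum((d - o) ** 2)         # error function E

    # Backward pass: partial derivatives of E, scaled by eta.
    delta_o = (d - o) * o * (1 - o)            # output-layer error signal
    delta_y = (delta_o @ W.T) * y * (1 - y)    # hidden-layer error signal
    W = W + eta * np.outer(y, delta_o)         # Δw_jk = η δ_k y_j
    V = V + eta * np.outer(x, delta_y)         # Δv_ij = η δ_j x_i
    return V, W, E

x, d = rng.random(n), np.array([1.0, 0.0, 0.0])
for _ in range(1000):                      # loop until the error is small enough
    V, W, E = train_step(x, d, V, W)
```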

Two: Convolutional neural networks

In a BP neural network, the nodes of each layer form a one-dimensional linear arrangement, and the nodes of adjacent layers are fully connected. If the connections between adjacent layers are no longer full but local, we obtain the simplest one-dimensional convolutional network. Extending this idea to two dimensions gives the convolutional neural network seen in most references, as shown in Figure 2:

Figure 2: A. Fully connected network; B. Locally connected network

Figure 2.a: a fully connected network. If the L1 layer is a 1000×1000-pixel image and the L2 layer has 1,000,000 hidden neurons, each connected to every pixel of the L1 image, there are 1000×1000×1,000,000 = 10^12 connections, that is, 10^12 weight parameters.

Figure 2.b: a locally connected network. Each node of the L2 layer is connected only to a 10×10 window of the L1 layer, so the 1,000,000 hidden neurons need only 1,000,000×100 = 10^8 connections. The number of weight connections is reduced by four orders of magnitude.

Another feature of convolutional neural networks is weight sharing. In Figure 2.b, weight sharing does not mean that all the connections drawn in red share one identical weight; rather, for each connection of any other color there is a red connection with an equal weight. As a result, every node in the second layer applies the same convolution kernel to its window of the previous layer.

Each hidden neuron in Figure 2.b is connected to a 10×10 image region, so each neuron has 10×10 = 100 connection weight parameters. What if the 100 parameters of every neuron were the same? That would mean every neuron convolves the image with the same convolution kernel, and this layer would then have only 100 parameters. But doesn't that extract only one feature of the image? To extract different features, add more convolution kernels: with, say, 100 kernels, there are 100×100 = 10,000 parameters.

The parameters of each kernel differ, so each kernel responds to a different characteristic of the input image (for example, different edges). Convolving the image with each kernel therefore filters out a different feature of the image; the resulting output is called a feature map.

One thing to note: the discussion above does not take into account the bias of each neuron. With the bias included, the number of weights each neuron needs increases by 1.
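A small NumPy sketch of the weight-sharing idea (illustrative only; a 100×100 image is used instead of 1000×1000 so the plain-Python loop stays fast): one 10×10 kernel slides over the whole image, so every hidden neuron reuses the same 100 weights, and stacking 100 kernels yields 100 feature maps:

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    # 'Valid' convolution with one shared kernel: every output neuron
    # reuses the same kernel weights and the same bias.
    H, W = image.shape
    k = kernel.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel) + bias
    return out

rng = np.random.default_rng(0)
image = rng.random((100, 100))
kernels = rng.normal(size=(100, 10, 10))        # 100 kernels -> 100 feature maps
feature_maps = [conv2d_valid(image, kern) for kern in kernels]

# Shared weights: 100 kernels x (10*10) = 10,000 parameters (plus one bias
# per kernel), no matter how many neurons each feature map contains.
```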

What has been described so far is only a single-layer network structure. In 1998, Yann LeCun's paper "Gradient-Based Learning Applied to Document Recognition" proposed LeNet-5, a character-recognition system based on a convolutional neural network, which was subsequently used to recognize handwritten digits for banks.

Three: LeCun's LeNet-5

Not counting the input, LeNet-5 has 7 layers, each containing connection weights (trainable parameters). The input image is 32×32. To be clear about the terminology: each layer has multiple feature maps, each feature map extracts one feature of its input through a convolution filter, and each feature map contains multiple neurons.

C1, C3, and C5 are convolution layers; S2 and S4 are subsampling layers. Using the principle of local image correlation, subsampling reduces the amount of data to process while keeping the useful information.

The C1 layer is a convolution layer composed of 6 feature maps. Each neuron in a feature map is connected to a 5×5 neighborhood of the input. Each feature map is 28×28, a size that keeps input connections from falling outside the boundary. C1 has 156 trainable parameters and a total of 122,304 connections.

The number of trainable parameters is the number of trainable kernel parameters plus one bias parameter, multiplied by the number of feature maps:

$\text{parameters} = (\text{kernel width} \times \text{kernel height} + 1) \times \text{number of feature maps}$

The number of connections is the number of trainable parameters multiplied by the feature map size:

$\text{connections} = \text{parameters} \times \text{map width} \times \text{map height}$

For C1:

(5×5+1)×6 = 156 parameters

156×(28×28) = 122,304 connections
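These counting formulas can be checked in a few lines of Python (a sketch; the helper name is ours, and the extra n_input_maps argument anticipates layers such as C3 whose kernels see several input maps):

```python
def conv_layer_counts(kernel_size, n_input_maps, n_maps, out_h, out_w):
    # Each feature map owns one (kernel_size x kernel_size) kernel per
    # input map it sees, plus a single bias.
    params = (kernel_size * kernel_size * n_input_maps + 1) * n_maps
    connections = params * out_h * out_w
    return params, connections

print(conv_layer_counts(5, 1, 6, 28, 28))   # C1: (156, 122304)
```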

The S2 layer is a subsampling layer with 6 feature maps of size 14×14. Each unit in a feature map is connected to a 2×2 neighborhood of the corresponding feature map in C1. The 4 inputs of each S2 unit are added, multiplied by a trainable coefficient, and a trainable bias is added; the result is passed through the sigmoid function. The trainable coefficient and bias control the nonlinearity of the sigmoid. If the coefficient is small, the unit operates in a quasi-linear mode, and subsampling merely blurs the image. If the coefficient is large, subsampling can be seen, depending on the bias, as a noisy "or" or a noisy "and" operation. The 2×2 receptive fields of the units do not overlap, so each map in S2 is half the size of the corresponding map in C1 in each dimension.

For S2:

Each feature map is 1/4 the size of the corresponding feature map in C1 (1/2 in each of the row and column dimensions). The S2 layer has (1+1)×6 = 12 trainable parameters and 14×14×(4+1)×6 = 5880 connections.
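A minimal sketch of one S2-style subsampling step as just described (the coefficient and bias values here are illustrative; in LeNet-5 they are learned):

```python
import numpy as np

def subsample(feature_map, coeff, bias):
    # Sum each non-overlapping 2x2 neighborhood, multiply by one trainable
    # coefficient, add one trainable bias, then squash with a sigmoid.
    H, W = feature_map.shape
    pooled = feature_map.reshape(H // 2, 2, W // 2, 2).sum(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(coeff * pooled + bias)))

c1_map = np.random.default_rng(0).random((28, 28))
s2_map = subsample(c1_map, coeff=0.5, bias=0.0)   # 28x28 -> 14x14
```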

The C3 layer is also a convolution layer. It convolves S2 with 5×5 kernels, so each resulting feature map has only 10×10 neurons, but it has 16 different sets of kernels, hence 16 feature maps. Each feature map in C3 is connected to all 6, or to several, of the feature maps in S2, meaning each feature map of this layer is a different combination of the feature maps extracted by the previous layer (this is not the only possible approach).

Why not connect every feature map of S2 to every feature map of C3? There are 2 reasons. First, the incomplete connection scheme keeps the number of connections within a reasonable range. Second, it breaks the symmetry of the network: because different feature maps receive different inputs, they are forced to extract different features.

LeCun used the following scheme: the first 6 feature maps of C3 take as input subsets of 3 adjacent feature maps in S2; the next 6 take subsets of 4 adjacent feature maps; the next 3 take subsets of 4 non-adjacent feature maps; and the last one takes all the feature maps of S2 as input.

The C3 layer therefore has (25×3+1)×6 + (25×4+1)×6 + (25×4+1)×3 + (25×6+1)×1 = 1516 trainable parameters and 1516×(10×10) = 151,600 connections.
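The grouped counts can be verified numerically (a sketch following the grouping just described):

```python
# Number of S2 input maps feeding each of the 16 C3 feature maps.
groups = [3] * 6 + [4] * 6 + [4] * 3 + [6] * 1

params = sum(25 * g + 1 for g in groups)   # 5x5 weights per input map + 1 bias
connections = params * 10 * 10             # each C3 feature map is 10x10
print(params, connections)                 # 1516 151600
```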

The S4 layer is a subsampling layer composed of 16 feature maps of size 5×5. Each unit in a feature map is connected to a 2×2 neighborhood of the corresponding feature map in C3, in the same way S2 connects to C1. The S4 layer has 16×(1+1) = 32 trainable parameters (one coefficient and one bias per feature map) and 5×5×(4+1)×16 = 2000 connections. (If the formulas are unclear, it may help to read all the convolution layers first and then all the subsampling layers, rather than reading in order.)

The C5 layer is a convolution layer with 120 feature maps. Each unit is connected to a 5×5 neighborhood on all 16 feature maps of the S4 layer. Since the S4 feature maps are also 5×5 (the same size as the filter), each C5 feature map is 1×1: this constitutes a full connection between S4 and C5.

C5 is still labeled a convolution layer rather than a fully connected layer because, if the input to LeNet-5 were larger with everything else unchanged, the feature maps would be larger than 1×1. The C5 layer has (5×5×16+1)×120 = 48,120 trainable parameters; since each C5 feature map is 1×1, there are 48,120×1×1 = 48,120 connections. (Yann's original paper says only that there are 48,120 trainable connections; although the terminology differs from that used above, this amounts to 48,120 trainable parameters and 48,120 connections, which matches our calculation.)

The F6 layer has 84 units (the reason this number is chosen comes from the design of the output layer) and is fully connected to C5. As in a classical neural network, each F6 unit computes the dot product between its input vector and its weight vector, adds a bias, and passes the result through the sigmoid function to produce the state of unit i. F6 has (120+1)×84 = 10,164 trainable parameters and likewise 10,164 connections.

Finally, the output layer consists of Euclidean Radial Basis Function (RBF) units, one unit per class, each with 84 inputs. In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector. The farther the input is from the parameter vector, the larger the RBF output. An RBF output can be interpreted as a penalty measuring how well the input pattern matches the model of the class associated with that RBF unit. In probabilistic terms, the RBF output can be understood as the negative log-likelihood of a Gaussian distribution in the configuration space of the F6 layer. Given an input pattern, the loss function should drive the configuration of F6 close enough to the RBF parameter vector of the pattern's desired class. The parameters of these units are chosen by hand and remain fixed (at least initially). The components of these parameter vectors are set to +1 or −1. Although they could be chosen at random with equal probabilities of +1 and −1, or chosen to form an error-correcting code, they are instead designed as stylized 7×12 images (i.e., 84 values) of the corresponding character class. Such a representation is not particularly useful for recognizing isolated digits, but it is useful for recognizing strings of characters drawn from the full printable ASCII set.
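A minimal sketch of one Euclidean RBF output unit as described (84 inputs; here the fixed ±1 parameter vector is drawn at random for illustration, whereas LeNet-5 uses stylized 7×12 character bitmaps):

```python
import numpy as np

def rbf_output(x, w):
    # Squared Euclidean distance between the 84-dim input (the F6 state)
    # and this unit's fixed parameter vector; farther -> larger output.
    return np.sum((x - w) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=84)         # F6 activations
w = rng.choice([-1.0, 1.0], size=84)    # fixed +1/-1 parameter vector
print(rbf_output(x, w))
# Classification picks the class whose RBF unit gives the smallest output.
```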

Another reason for using this distributed code, rather than the more common "1 of N" code, for the outputs is that non-distributed codes behave poorly when the number of classes is large. The reason is that output units of a non-distributed code must be off most of the time, which is difficult to achieve with sigmoid units. Yet another reason is that the classifier is used not only to recognize characters but also to reject non-characters. Distributed-code RBFs are better suited to this goal because, unlike sigmoids, they are activated within a well-circumscribed region of the input space, outside of which atypical patterns are more likely to fall.

The RBF parameter vectors play the role of target vectors for the F6 layer. It is worth noting that the components of these vectors are +1 or −1, which is well within the range of the F6 sigmoid and thus prevents the sigmoid from saturating. In fact, +1 and −1 are the points of maximum curvature of the sigmoid, which keeps the F6 units operating in their maximally nonlinear range. Saturation of the sigmoid must be avoided, since it leads to slow convergence and an ill-conditioned loss function.

Four: The training process of the CNNs

The CNN training algorithm is similar to the traditional BP algorithm. It consists of 4 steps, divided into two stages:

The first stage, the forward propagation phase:

a) Take a sample (X, Yp) from the sample set and feed X into the network;

b) Compute the corresponding actual output Op.

In this stage, information is transformed step by step from the input layer to the output layer. This is also the process the network executes when it runs normally after training is complete. Here the network performs the computation (in effect, the input is multiplied by each layer's weight matrix in turn, yielding the final output):

$O_p = F_n(\cdots F_2(F_1(X_p W^{(1)}) W^{(2)}) \cdots W^{(n)})$
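A minimal sketch of this composed forward pass, taking each $F_i$ to be a sigmoid and using small random weight matrices (all sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_p, weights):
    # O_p = F_n( ... F_2( F_1(X_p W^(1)) W^(2) ) ... W^(n) )
    for W in weights:
        x_p = sigmoid(x_p @ W)
    return x_p

rng = np.random.default_rng(0)
weights = [rng.normal(size=(32, 16)), rng.normal(size=(16, 10))]
o_p = forward(rng.random(32), weights)   # the actual output O_p
```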

The second stage, the backward propagation phase:

a) Compute the difference between the actual output Op and the corresponding ideal output Yp;

b) Back-propagate the error and adjust the weight matrices so as to minimize it.

Five: Summary

The CNN algorithm is widely used in image recognition and processing. In the ImageNet 2014 large-scale visual recognition competition, CNNs were used extensively, and the best-performing algorithm, with an error rate of only 6.656%, was based on CNNs.

Yann LeCun built LeNet in the 1990s, and today it has grown into the most important technique in visual recognition. On the one hand, this is inseparable from his efforts; on the other, his willingness to keep working in this direction through the trough period of neural-network research is itself a spirit worth learning from.
