Convolutional Neural Networks (II)

Source: Internet
Author: User

Reposted from http://blog.csdn.net/zouxy09/article/details/8781543

CNNs were the first learning algorithms to truly succeed at training a multi-layer network structure. They exploit spatial relationships to reduce the number of parameters that must be learned, which improves training over the general feedforward BP (back-propagation) algorithm. In a CNN, a small part of the image (the local receptive field) serves as the input to the lowest layer of the hierarchy. The information is then passed through successive layers, each of which applies a digital filter to extract the most salient features of the observed data. This method can obtain salient features that are, to some degree, invariant to translation, scaling, and rotation, because the local receptive field gives neurons or processing units access to the most basic features, such as oriented edges or corners.

2) Network structure of convolutional neural networks

A convolutional neural network is a multilayer neural network. Each layer consists of several two-dimensional planes, and each plane consists of several independent neurons.

Figure: Conceptual demonstration of a convolutional neural network. The input image is convolved with three trainable filters plus trainable biases, which produces three feature maps in layer C1. Each group of four pixels in a feature map is then summed, weighted, and offset by a bias, and the result is passed through a sigmoid function to give the three feature maps of layer S2. These maps are filtered again to obtain layer C3. The hierarchy then produces S4 in the same way as S2. Finally, the pixel values are rasterized and concatenated into a vector that is fed to a conventional neural network, which produces the output.

Generally, a C layer is a feature extraction layer: the input of each neuron is connected to a local receptive field in the previous layer, from which it extracts a local feature. Once a local feature has been extracted, its positional relationship to other features is also determined. An S layer is a feature mapping layer: each computing layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons in a plane share the same weights. The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, which gives the feature maps shift invariance.

In addition, because the neurons on one feature map share weights, the number of free parameters is reduced, which lowers the complexity of choosing network parameters. Each feature extraction layer (C layer) in a convolutional neural network is followed by a computing layer (S layer) that performs local averaging and a second extraction. This distinctive two-stage feature extraction structure makes the network more tolerant of distortion in the input samples during recognition.

3) Parameter reduction and weight sharing

One attractive property of CNNs, as mentioned above, is that local receptive fields and weight sharing reduce the number of parameters the network needs to train. So what exactly does that mean?

Left panel: suppose we have a 1000x1000-pixel image and 1 million hidden neurons. If they are fully connected (every hidden neuron is connected to every pixel of the image), there are 1000x1000x1000000 = 10^12 connections, that is, 10^12 weight parameters. However, the spatial structure of an image is local. Just as a human perceives the outside world through local receptive fields, a neuron does not need to sense the whole image. Each neuron only senses a local image region, and at higher levels the neurons that sense different local regions are combined to obtain global information. This reduces the number of connections, that is, the number of weight parameters the network needs to train. Right panel: if the local receptive field is 10x10, each hidden neuron only needs to connect to its 10x10 local image patch, so 1 million hidden neurons have only 100 million connections, that is, 10^8 parameters. That is four orders of magnitude fewer than before, so training is less laborious. It still feels like a lot, though. Is there anything more we can do?
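The arithmetic in this comparison can be checked with a few lines of Python. This is only a sketch of the counting argument; the variable names are illustrative and do not come from the original post.

```python
# Connection counts for a 1000x1000 image and 1,000,000 hidden neurons.
image_pixels = 1000 * 1000
hidden_neurons = 1_000_000
receptive_field = 10 * 10

# Fully connected: every hidden neuron is wired to every pixel.
fully_connected = image_pixels * hidden_neurons

# Locally connected: every hidden neuron is wired to one 10x10 patch only.
locally_connected = receptive_field * hidden_neurons

print(fully_connected)                       # 1000000000000 (10^12)
print(locally_connected)                     # 100000000 (10^8)
print(fully_connected // locally_connected)  # 10000, four orders of magnitude
```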

We know that each hidden neuron is connected to a 10x10 image region, so each neuron has 10x10 = 100 connection weights. What if all neurons used the same 100 parameters? That would mean every neuron convolves the image with the same convolution kernel. Then we have only 100 parameters: no matter how many neurons the hidden layer has, the connections between the two layers involve just 100 parameters. This is weight sharing, the main selling point of convolutional neural networks.

You might object that extracting features this way is not reliable, because only one feature is extracted. That is correct: we need to extract many features. If one filter, that is, one convolution kernel, extracts one feature of the image, such as edges in a certain orientation, then to extract different features we simply add more filters. Suppose we use 100 filters, each with different parameters, so that each one extracts a different feature of the input image, such as edges of different orientations. Convolving the image with each filter yields a map of a different feature of the image, which we call a feature map. So 100 convolution kernels give 100 feature maps, and these 100 feature maps form one layer of neurons. How many parameters does this layer have? 100 kernels x 100 shared parameters per kernel = 100x100 = 10K, that is, 10,000 parameters. In the right panel of the figure, different colors denote different filters.
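To make the filter-bank idea concrete, here is a minimal NumPy sketch. It assumes a stride of 1 and no padding, uses a small 32x32 random array as a stand-in for the 1000x1000 image, and implements the sliding window as cross-correlation (no kernel flip), as deep-learning code usually does. The helper `conv2d_valid` is hypothetical, not a library function.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Every output neuron reuses the same kernel weights.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32))                   # stand-in for the input image
filters = rng.standard_normal((100, 10, 10))   # 100 filters, 10x10 weights each

# One feature map per filter.
feature_maps = np.stack([conv2d_valid(image, f) for f in filters])

print(feature_maps.shape)  # (100, 23, 23)
print(filters.size)        # 10000 shared weights
```

The number of weights (`filters.size`) depends only on the filter size and the number of filters. Changing the image size changes the size of each feature map but not the number of parameters.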

One question is still open. We said that the number of parameters in the hidden layer is independent of the number of hidden neurons and depends only on the filter size and the number of filter types. So how is the number of hidden neurons determined? It depends on the size of the input image (the number of input neurons), the filter size, and the stride with which the filter slides over the image. For example, if the image is 1000x1000 pixels, the filter is 10x10, and the filters do not overlap (a stride of 10), then the hidden layer has (1000x1000)/(10x10) = 100x100 neurons. If the stride is 8, adjacent kernel positions overlap by two pixels, so ... I will not work that out here; understanding the idea is enough. Note that this is the number of neurons for a single filter, that is, in a single feature map. With 100 feature maps it is 100 times as many. Thus the larger the image, the wider the gap between the number of neurons and the number of weights that need to be trained.
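The dependence on image size, filter size, and stride can be written as a one-line formula. The sketch below assumes no padding and that filter positions which would run past the image border are dropped; the original post does not state these conventions, and the function name is illustrative.

```python
def neurons_per_side(input_size, filter_size, stride):
    """Filter positions along one side (no padding, partial windows dropped)."""
    return (input_size - filter_size) // stride + 1

# Stride 10: non-overlapping 10x10 filters on a 1000x1000 image.
side = neurons_per_side(1000, 10, 10)
print(side, side * side)      # 100 10000

# Stride 8 under the same assumptions: adjacent positions overlap by 2 pixels.
side8 = neurons_per_side(1000, 10, 8)
print(side8, side8 * side8)   # 124 15376

# With 100 feature maps the neuron count is 100 times larger.
print(100 * side * side)      # 1000000
```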

Note that the discussion above ignores the bias of each neuron, so each filter's parameter count must be increased by 1. The bias is likewise shared by all neurons that use the same filter.

In short, the core idea of convolutional networks is to combine three architectural ideas: local receptive fields, weight sharing (or weight replication), and temporal or spatial sub-sampling, in order to obtain some degree of shift, scale, and distortion invariance.

Example:

LeNet-5 has 7 layers, not counting the input, and every layer contains trainable parameters (connection weights). The input image is 32*32. This is larger than the largest character in the MNIST database (a well-known handwritten digit database). The reason is the hope that potential salient features, such as stroke endpoints or corners, can appear in the center of the receptive field of the highest-level feature detectors.

To be clear: each layer has multiple feature maps, each feature map extracts one feature of its input with a convolution filter, and each feature map contains multiple neurons.

The C1 layer is a convolutional layer with 6 feature maps. (Why convolution? An important property of the convolution operation is that it can enhance features of the original signal and reduce noise.) Each neuron in a feature map is connected to a 5*5 neighborhood in the input. The feature maps are 28*28, which prevents input connections from falling outside the boundary (so that the BP feedback computation does not lose gradient; this is my personal interpretation). C1 has 156 trainable parameters (each filter has 5*5 = 25 unit parameters and one bias parameter, and there are 6 filters, for (5*5+1)*6 = 156 parameters) and 156*(28*28) = 122,304 connections.
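The C1 figures can be reproduced with a small helper. This is a sketch that follows the counting convention used in the text (connections = trainable parameters x output positions); the function name and its arguments are illustrative.

```python
def conv_layer_counts(num_maps, kernel_size, inputs_per_map, out_size):
    """Trainable parameters and connections of a shared-weight conv layer."""
    # Each output map has one kernel per input map it reads, plus one bias.
    params_per_map = kernel_size * kernel_size * inputs_per_map + 1
    params = params_per_map * num_maps
    # Every output position uses all of its map's parameters once.
    connections = params * out_size * out_size
    return params, connections

out_size = 32 - 5 + 1   # 5x5 kernel on a 32x32 input, stride 1, no padding
c1_params, c1_connections = conv_layer_counts(6, 5, 1, out_size)
print(out_size, c1_params, c1_connections)  # 28 156 122304
```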

The S2 layer is a sub-sampling layer with 6 feature maps of size 14*14. (Why sub-sample? Exploiting the local correlation of images, sub-sampling reduces the amount of data to process while preserving useful information.) Each unit in a feature map is connected to a 2*2 neighborhood in the corresponding feature map of C1. The 4 inputs of each S2 unit are summed, multiplied by a trainable coefficient, and added to a trainable bias, and the result is passed through a sigmoid function. The trainable coefficient and bias control the degree of nonlinearity of the sigmoid. If the coefficient is small, the unit operates in a nearly linear regime and sub-sampling amounts to blurring the image. If the coefficient is large, then depending on the bias the sub-sampling can be viewed as a noisy "OR" or a noisy "AND" operation. The 2*2 receptive fields of the units do not overlap, so each feature map in S2 is 1/4 the size of a feature map in C1 (half the rows and half the columns). The S2 layer has 12 trainable parameters and 5,880 connections.
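The S2 operation described here (sum each 2*2 block, scale, add a bias, apply a sigmoid) can be sketched in NumPy as follows. The input values, the coefficient 0.25, and the bias 0.0 are arbitrary illustrative choices, and the standard logistic function is assumed for the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def subsample(feature_map, coeff, bias):
    """Sum each non-overlapping 2x2 block, scale, add bias, apply sigmoid."""
    h, w = feature_map.shape
    # blocks[i, a, j, b] == feature_map[2*i + a, 2*j + b]
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    summed = blocks.sum(axis=(1, 3))
    return sigmoid(coeff * summed + bias)

c1_map = np.arange(28 * 28, dtype=float).reshape(28, 28) / 784.0
s2_map = subsample(c1_map, coeff=0.25, bias=0.0)
print(s2_map.shape)               # (14, 14)

# Parameter and connection counts for the 6 maps of S2.
s2_params = 6 * (1 + 1)                      # one coefficient + one bias per map
s2_connections = 6 * 14 * 14 * (2 * 2 + 1)   # 4 inputs + 1 bias per unit
print(s2_params, s2_connections)  # 12 5880
```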

Figure: The convolution and sub-sampling process. Convolution: a trainable filter f_x is convolved with the input (the input image in the first stage, a feature map in later stages), and a bias b_x is added to give the convolutional layer C_x. Sub-sampling: the four pixels of each neighborhood are summed into one pixel, weighted by a scalar W_{x+1}, offset by a bias b_{x+1}, and passed through a sigmoid activation function, giving a feature map S_{x+1} that is roughly four times smaller.

So the mapping from one plane to the next can be regarded as a convolution operation, and an S layer can be regarded as a blurring filter that performs a second feature extraction. From one hidden layer to the next, the spatial resolution decreases while the number of planes per layer increases, which allows more feature information to be detected.

The C3 layer is also a convolutional layer. It likewise convolves layer S2 with 5x5 kernels, so the resulting feature maps have only 10x10 neurons, but it has 16 different convolution kernels and therefore 16 feature maps. One thing to note is that each feature map in C3 is connected to all 6 or to several of the feature maps in S2, which means that the feature maps of this layer are different combinations of the feature maps extracted by the previous layer (this is not the only possible design). Note the combination here: as with the human visual system discussed earlier, lower-level structures combine into more abstract higher-level structures, such as edges forming shapes or parts of an object.

As just mentioned, each feature map in C3 is built from all 6 or from several of the feature maps in S2. Why not connect every feature map in S2 to every feature map in C3? There are two reasons. First, the incomplete connection scheme keeps the number of connections within a reasonable range. Second, and more importantly, it breaks the symmetry of the network. Because different feature maps receive different inputs, they are forced to extract different (hopefully complementary) features.

For example, one possible scheme is the following. The first 6 feature maps of C3 take as input subsets of 3 adjacent feature maps in S2. The next 6 take subsets of 4 adjacent feature maps in S2. The next 3 take subsets of 4 non-adjacent feature maps. The last one takes all the feature maps in S2 as input. With this scheme the C3 layer has 1,516 trainable parameters and 151,600 connections.
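The C3 totals follow directly from this connection scheme. A short sketch of the count, assuming one 5x5 kernel per connected S2 map plus one bias per C3 map, and 10x10 output positions per map (the list name `c3_groups` is illustrative):

```python
# (number of C3 maps in the group, number of S2 maps each of them reads)
c3_groups = [(6, 3), (6, 4), (3, 4), (1, 6)]

c3_maps = sum(n for n, _ in c3_groups)
# Per C3 map: one 5x5 kernel for each connected S2 map, plus one bias.
c3_params = sum(n * (k * 5 * 5 + 1) for n, k in c3_groups)
# Each of the 10x10 output positions uses all of its map's parameters.
c3_connections = c3_params * 10 * 10

print(c3_maps, c3_params, c3_connections)  # 16 1516 151600
```

For comparison, connecting all 16 C3 maps to all 6 S2 maps would need 16 * (6 * 25 + 1) = 2,416 parameters.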

The S4 layer is a sub-sampling layer with 16 feature maps of size 5*5. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C3, just like the connection between C1 and S2. The S4 layer has 32 trainable parameters (one coefficient and one bias per feature map) and 2,000 connections.

The C5 layer is a convolutional layer with 120 feature maps. Each unit is connected to a 5*5 neighborhood in all 16 feature maps of the S4 layer. Because the S4 feature maps are also 5*5 (the same size as the filter), the C5 feature maps are 1*1, which amounts to a full connection between S4 and C5. C5 is nevertheless labeled a convolutional layer rather than a fully connected layer, because if the input to LeNet-5 were made larger with everything else unchanged, the feature maps would be larger than 1*1. The C5 layer has 48,120 trainable connections.

The F6 layer has 84 units (the number comes from the design of the output layer) and is fully connected to C5. It has (120+1)*84 = 10,164 trainable parameters. As in a classical neural network, each F6 unit computes the dot product between its input vector and its weight vector and adds a bias. The result is then passed through a sigmoid function to produce the state of unit i.
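The counts quoted for S4, C5, and F6 can be checked the same way. This is plain arithmetic under the conventions stated in the text; the variable names are illustrative.

```python
# S4: one coefficient and one bias per map; each unit has 4 inputs plus a bias.
s4_params = 16 * (1 + 1)
s4_connections = 16 * 5 * 5 * (2 * 2 + 1)

# C5: each of the 120 units reads a 5x5 window in all 16 S4 maps, plus a bias.
c5_params = 120 * (16 * 5 * 5 + 1)

# F6: 84 units, each connected to all 120 C5 units, plus a bias.
f6_params = (120 + 1) * 84

print(s4_params, s4_connections)  # 32 2000
print(c5_params)                  # 48120
print(f6_params)                  # 10164
```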

Finally, the output layer is composed of Euclidean radial basis function (RBF) units.

An RBF (radial basis function) network treats learning as a surface-fitting (approximation) problem in a high-dimensional space: learning means finding the surface that best fits the training data in a multidimensional space, and that surface is then used to process new data (for example, for regression or classification). By contrast, the back-propagation learning algorithm can be seen as an application of a recursive technique known in statistics as stochastic approximation. The hidden units of an RBF network provide a set of functions that form an arbitrary basis for the input patterns (vectors) when they are expanded into the hidden space. The functions in this set are called radial basis functions.

Reasons for mapping to a high-dimensional space:

1. A pattern classification problem is more likely to be linearly separable when mapped into a high-dimensional space than in a low-dimensional space.

2. The higher the dimension of the hidden space, the more accurate the approximation.

Note: the mapping into the high-dimensional space is nonlinear, and its purpose is to make classification easier and more accurate.

As noted above, the output layer consists of Euclidean radial basis function units, one per class, each with 84 inputs. In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector. The farther the input is from the parameter vector, the larger the RBF output. An RBF output can be interpreted as a penalty term measuring how well the input pattern matches a model of the class associated with the RBF. In probabilistic terms, the RBF output can be interpreted as the negative log-likelihood of a Gaussian distribution in the configuration space of layer F6. Given an input pattern, the loss function should be designed so that the configuration of F6 is as close as possible to the parameter vector of the RBF that corresponds to the pattern's desired class. The parameters of these units are chosen by hand and kept fixed.
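A minimal sketch of such an output layer is shown below. It assumes the squared Euclidean distance (taking the square root would not change which class is closest) and uses made-up +1/-1 parameter vectors purely for illustration; they are not the hand-chosen vectors of LeNet-5. The function name `rbf_outputs` is hypothetical.

```python
import numpy as np

def rbf_outputs(f6_state, class_vectors):
    """Squared Euclidean distance from the F6 state to each class vector."""
    diff = class_vectors - f6_state          # shape (num_classes, 84)
    return np.sum(diff ** 2, axis=1)

num_classes, num_inputs = 10, 84
# Hypothetical fixed parameter vectors with components +1 or -1.
class_vectors = -np.ones((num_classes, num_inputs))
for c in range(num_classes):
    class_vectors[c, 8 * c:8 * (c + 1)] = 1.0

# An F6 state that matches the parameter vector of class 3 exactly.
f6_state = class_vectors[3].copy()
penalties = rbf_outputs(f6_state, class_vectors)

print(penalties[3])               # 0.0
print(int(np.argmin(penalties)))  # 3
```

The predicted class is the one with the smallest penalty, which is why the RBF output reads as a mismatch score rather than a confidence.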

The RBF parameter vectors play the role of target vectors for the F6 layer. Note that the components of these vectors are +1 or -1, which is well within the range of the F6 sigmoid and therefore prevents the sigmoid from saturating. In fact, +1 and -1 are the points of maximum curvature of the sigmoid. This forces the F6 units to operate in their maximally nonlinear range. Saturation of the sigmoid must be avoided, because it leads to slow convergence and ill-conditioning of the loss function.

5) Training Process

The mainstream neural networks for pattern recognition are supervised learning networks; unsupervised learning networks are used mostly for cluster analysis. In supervised pattern recognition, the class of every sample is known, so the distribution of samples in space is no longer partitioned according to its natural tendency. Instead, the goal is to find an appropriate partition of the space, or a classification boundary, based on the spatial distribution of samples of the same class and the degree of separation between samples of different classes, so that samples of different classes lie in different regions. This requires a long and complex learning process that keeps adjusting the position of the classification boundaries that divide the sample space, so that as few samples as possible fall into regions of the wrong class.

A convolutional network is essentially an input-to-output mapping. It can learn a large number of mapping relationships between inputs and outputs without requiring any precise mathematical expression relating them: once the network has been trained on known patterns, it is able to map inputs to outputs. Convolutional networks are trained with supervision, so the sample set consists of vector pairs of the form (input vector, ideal output vector). All of these vectors should come from actual "running" results of the system that the network is meant to simulate, and they can be collected from the real system in operation. Before training starts, all weights should be initialized with different small random numbers. "Small" ensures that the network does not saturate because of overly large weights, which would make training fail. "Different" ensures that the network can learn properly: if the weight matrix is initialized with identical values, the network is unable to learn.

The training algorithm is similar to the traditional BP algorithm. It consists of 4 steps, divided into two phases:

Phase 1, forward propagation:

a) Take a sample (X_p, Y_p) from the sample set and feed X_p into the network;

b) Compute the corresponding actual output O_p.

In this phase, information is transformed step by step as it passes from the input layer to the output layer. This is also the process the network carries out in normal operation after training is complete. During it, the network performs the following computation (in effect, the input is multiplied by the weight matrix of each layer in turn to produce the final output):

$$O_p = F_n(\dots(F_2(F_1(X_p W^{(1)}) W^{(2)})\dots) W^{(n)})$$
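The nested formula is simply a loop over the layers. Here is a minimal NumPy sketch, assuming fully connected layers without biases (as in the formula) and the logistic sigmoid for every activation F_i. The layer sizes and random weights are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, activation=sigmoid):
    """Compute F_n(...F_2(F_1(x W1) W2)... Wn) layer by layer."""
    out = x
    for w in weights:
        out = activation(out @ w)
    return out

rng = np.random.default_rng(0)
# Two layers: 4 inputs -> 5 hidden units -> 3 outputs, small random weights.
weights = [rng.normal(0.0, 0.1, (4, 5)), rng.normal(0.0, 0.1, (5, 3))]
x = rng.random(4)

o = forward(x, weights)
print(o.shape)  # (3,)
```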

Phase 2, backward propagation:

a) Compute the difference between the actual output O_p and the corresponding ideal output Y_p;

b) Propagate the error backward and adjust the weight matrices so as to minimize it.
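Both phases fit into a single training step. The sketch below is not the LeNet-5 training procedure: it assumes a tiny two-layer fully connected sigmoid network, a squared-error loss, plain gradient descent, and a single made-up training pair, just to show the forward pass, the error computation, and the backward weight adjustment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, y, w1, w2, lr=0.5):
    """One forward and one backward pass; updates w1 and w2 in place."""
    # Phase 1: forward propagation.
    h = sigmoid(x @ w1)
    o = sigmoid(h @ w2)
    # Phase 2: propagate the error 0.5 * ||o - y||^2 backward.
    delta_o = (o - y) * o * (1.0 - o)
    delta_h = (delta_o @ w2.T) * h * (1.0 - h)
    w2 -= lr * np.outer(h, delta_o)
    w1 -= lr * np.outer(x, delta_h)
    return 0.5 * np.sum((o - y) ** 2)

rng = np.random.default_rng(0)
# Different small random numbers, as the text recommends.
w1 = rng.normal(0.0, 0.1, (4, 5))
w2 = rng.normal(0.0, 0.1, (5, 3))
x = np.array([0.2, 0.4, 0.6, 0.8])   # made-up input vector
y = np.array([1.0, 0.0, 0.0])        # made-up ideal output

errors = [train_step(x, y, w1, w2) for _ in range(200)]
print(errors[0] > errors[-1])  # True
```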

6) Advantages of convolutional neural networks

Convolutional neural networks (CNNs) are mainly used to recognize two-dimensional patterns that are invariant to shift, scaling, and other forms of distortion. Because the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided: features are learned implicitly from the training data. Moreover, because the neurons on the same feature map share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks in which neurons are fully connected to each other. With their special structure of locally shared weights, convolutional neural networks have distinctive advantages in speech recognition and image processing. Their layout is closer to that of real biological neural networks, and weight sharing reduces the complexity of the network. In particular, images, as multidimensional input vectors, can be fed directly into the network, which avoids the complexity of data reconstruction during feature extraction and classification.

Mainstream classification methods are almost all based on statistical features, which means that certain features must be extracted before classification. However, explicit feature extraction is not easy and is not always reliable in some applications. Convolutional neural networks avoid explicit feature extraction and learn implicitly from the training data. This clearly distinguishes them from other neural-network-based classifiers: the feature extraction function is built into the multilayer perceptron through structural reorganization and a reduction in the number of weights. They can process grayscale images directly and can be applied directly to image-based classification.

Compared with general neural networks, convolutional networks have the following advantages in image processing: a) the topology of the network matches that of the input image well; b) feature extraction and pattern classification are performed at the same time, and both are produced by training; c) weight sharing reduces the number of trainable parameters, making the network structure simpler and more adaptable.

7) Summary

The close relationship between the layers of a CNN and spatial information makes CNNs well suited to image processing and understanding. They also perform well at automatically extracting salient features from images. In some cases, Gabor filters have been used in an initial pre-processing step to mimic the response of the human visual system to visual stimuli. In most current work, researchers have applied CNNs to a variety of machine learning problems, including face recognition, document analysis, and language detection. To find coherence between successive frames of a video, CNNs are currently trained using temporal coherence, although this is not specific to CNNs.
