Deep Learning Study Notes Series (VII)


Deep Learning Study Notes Series

[Email protected]

http://blog.csdn.net/zouxy09

Zouxy

Version 1.0 2013-04-08

Statement:

1) This Deep Learning series collects material generously shared online by experts in machine learning. Please see the references for the specific sources; version statements for the material are given in the original literature.

2) This article is for academic exchange only and is non-commercial, so the specific references are not matched to each part in detail. If I have inadvertently infringed on anyone's interests, please forgive me and contact me so I can delete the material.

3) My own knowledge is limited, so errors are inevitable in a summary like this. I hope more experienced readers will point them out. Thank you.

4) Reading this article requires some background in machine learning, computer vision, neural networks, and so on (if you lack it, that's fine; you can still read along and get the gist, hehe).

5) This is the first version; errors will need to be corrected, amended, or deleted over time. I welcome all suggestions. If we each share a little, together we can advance scientific research (hehe, a noble goal indeed). Please contact: [Email protected]

Directory:

I. Overview

II. Background

III. The visual mechanism of the human brain

IV. About features

4.1 The granularity of feature representation

4.2 Primary (shallow) feature representation

4.3 Structural feature representation

4.4 How many features are needed?

V. The basic idea of deep learning

VI. Shallow learning and deep learning

VII. Deep learning and neural networks

VIII. The deep learning training process

8.1 Training methods of traditional neural networks

8.2 The deep learning training process

IX. Common models and methods of deep learning

9.1 AutoEncoder

9.2 Sparse Coding

9.3 Restricted Boltzmann Machine (RBM)

9.4 Deep Belief Networks (DBN)

9.5 Convolutional Neural Networks (CNN)

X. Summary and outlook

XI. References and deep learning resources

(Continued from the previous post.)

9.5 Convolutional Neural Networks (CNN)

A convolutional neural network is a kind of artificial neural network that has become a research hotspot in speech analysis and image recognition. Its weight-sharing network structure makes it more similar to a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is especially apparent when the network input is a multi-dimensional image: the image can be fed into the network directly, avoiding the complicated feature extraction and data reconstruction of traditional recognition algorithms. A convolutional network is a multilayer perceptron specially designed for recognizing two-dimensional shapes; it is highly invariant to translation, scaling, tilting, and other forms of deformation.

CNNs were influenced by the earlier time-delay neural network (TDNN). The TDNN reduces learning complexity by sharing weights along the time dimension and is well suited to processing speech and other time-series signals.

CNNs were the first learning algorithm to successfully train a truly multilayer network structure. They exploit spatial relationships to reduce the number of parameters that must be learned, improving on the training performance of the general feedforward BP algorithm. As a deep learning architecture, CNNs were proposed to minimize data preprocessing requirements. In a CNN, a small portion of the image (the local receptive field) serves as the input to the lowest layer of the hierarchy; information is then passed through the successive layers, each of which applies digital filters to extract the most salient features of the observed data. This approach captures salient features that are invariant to translation, scaling, and rotation, because the local receptive field gives neurons (processing units) access to the most basic features, such as oriented edges or corner points.

1) History of convolutional neural networks

In 1962, Hubel and Wiesel proposed the concept of the receptive field through their study of cells in the cat's visual cortex. In 1984, the Japanese scholar Fukushima proposed the neocognitron, a neural model based on the receptive-field concept. It can be regarded as the first network implementation of a convolutional neural network and the first application of the receptive-field concept in artificial neural networks. The neocognitron decomposes a visual pattern into many sub-patterns (features), which are then processed by hierarchically connected feature planes. It attempts to model the visual system so that recognition still succeeds even when the object is displaced or slightly deformed.

Typically, the neocognitron contains two kinds of neurons: S-cells, which carry out feature extraction, and C-cells, which provide tolerance to deformation. An S-cell involves two important parameters: the receptive field, which determines the number of input connections, and a threshold, which controls the strength of the response to its feature sub-pattern. Many scholars have worked to improve the neocognitron's performance. In the traditional neocognitron, the visual blur contributed by the C-cells within each S-cell's photosensitive region is normally distributed. If the edge of the photosensitive region is blurred more strongly than the center, the S-cell will tolerate greater deformation arising from this non-normal blur. What we want is for the difference between the effect of the training pattern and that of a deformed stimulus pattern to grow as we move from the center of the receptive field to its edge. To produce this non-normal blur effectively, an improved neocognitron with a double C-cell layer was proposed.

Van Ooyen and Niehuis introduced a new parameter to improve the neocognitron's discriminative ability. In effect, this parameter acts as an inhibitory signal, suppressing a neuron's excitation by repeatedly excited features. Most neural networks memorize training information in their weights; according to Hebbian learning, the more often a particular feature is trained on, the more easily it is detected during later recognition. Some scholars have also combined evolutionary computation with the neocognitron: by weakening the training of repeatedly excited features, the network is made to attend to different features, which helps improve its discriminative ability. All of this traces the development of the neocognitron. The convolutional neural network can be regarded as a generalized form of the neocognitron, and the neocognitron as a special case of the convolutional neural network.

2) Network structure of convolutional neural networks

A convolutional neural network is a multilayer neural network in which each layer is composed of several two-dimensional planes, and each plane consists of several independent neurons.

Figure: A conceptual demonstration of a convolutional neural network. The input image is convolved with three trainable filters (with additive biases), producing three feature maps at the C1 layer. Then, in each feature map, every group of four pixels is summed, weighted, given a bias, and passed through a sigmoid function, yielding the three feature maps of the S2 layer. These maps are then filtered to obtain the C3 layer, and the same hierarchy produces S4 from C3 just as S2 came from C1. Finally, the pixel values are rasterized and concatenated into a vector, which is fed into a traditional neural network to produce the output.

In general, each C layer is a feature-extraction layer: each neuron's input is connected to a local receptive field in the previous layer, and it extracts that local feature. Once a local feature has been extracted, its positional relationship to the other features is also fixed. Each S layer is a feature-mapping layer: every computational layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on a plane share equal weights. The feature-mapping structure uses the sigmoid function, with its small influence-function kernel, as the activation function of the convolutional network, which gives the feature maps shift invariance.

In addition, because neurons on the same mapping plane share weights, the number of free parameters is reduced, lowering the complexity of network parameter selection. Each feature-extraction layer (C layer) in a convolutional neural network is followed by a computational layer (S layer) that performs local averaging and a second extraction. This distinctive twofold feature-extraction structure makes the network more tolerant of distortions in the input samples during recognition.

3) About parameter reduction and weight sharing

This brings us to one of CNN's most attractive properties: local receptive fields and weight sharing reduce the number of parameters the neural network has to train. What does that actually mean?

Left: suppose we have a 1000x1000-pixel image and 1 million hidden neurons. If they are fully connected (every hidden neuron connects to every pixel of the image), there are 1000x1000x1000000 = 10^12 connections, i.e. 10^12 weight parameters. However, spatial structure in images is local: just as a person perceives the outside world through local receptive fields, each neuron need not sense the whole image, only a local region; then, at higher layers, combining the neurons that sense different local regions recovers the global information. In this way we reduce the number of connections, i.e. the number of weight parameters the network has to train. Right: if the local receptive field is 10x10, each hidden neuron only needs to connect to a 10x10 local patch, so 1 million hidden neurons have only 100 million connections, i.e. 10^8 parameters. Four orders of magnitude fewer than before, so training is much less laborious. But that still feels like a lot; is there anything else we can do?
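The arithmetic above can be checked directly (just the numbers from this example, nothing more):

```python
pixels = 1000 * 1000      # input image: 10^6 pixels
hidden = 1_000_000        # 10^6 hidden-layer neurons

# Fully connected: every hidden neuron sees every pixel.
fully_connected = pixels * hidden     # 10^12 weights

# Local receptive fields: every hidden neuron sees only a 10x10 patch.
local = hidden * (10 * 10)            # 10^8 weights

print(fully_connected)   # 1000000000000
print(local)             # 100000000
```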

We know that each hidden neuron connects to a 10x10 image region, which means each neuron has 10x10 = 100 connection weights. What if the 100 parameters of every neuron were the same, meaning every neuron uses the same convolution kernel to convolve the image? How many parameters would we have then? Only 100! No matter how many neurons the hidden layer has, the connection between the two layers needs only 100 parameters. This is weight sharing, and it is the main selling point of convolutional neural networks. (A bit repetitive, hehe.) You may ask: is this a reliable thing to do? Why does it work? That is something we will explore as we go.

Well, you might think: done this way, haven't we only extracted a single feature? Right, so we need to extract many features. If one filter, i.e. one convolution kernel, extracts one kind of image feature, say an edge of a certain orientation, then to extract different features we simply add more filters. Suppose we add 100 filters, each with different parameters, representing different features of the input image, such as differently oriented edges. Each filter convolved over the image produces a projection of one feature of the image, which we call a feature map. So 100 convolution kernels yield 100 feature maps, and these 100 feature maps together form one layer of neurons. By now it should be clear: how many parameters does this layer have? 100 convolution kernels x 100 shared parameters per kernel = 100x100 = 10,000 parameters. Only 10,000! See the right-hand figure: different colors denote different filters.

Oh, and one remaining question. We just said the number of parameters in the hidden layer is independent of the number of hidden neurons; it depends only on the size and number of filters. So what determines the number of neurons in the hidden layer? It depends on the size of the original image (the number of input neurons), the filter size, and the filter's sliding stride across the image. For example, if my image is 1000x1000 pixels and the filter is 10x10, and the filter positions do not overlap, i.e. the stride is 10, then the hidden layer has (1000x1000)/(10x10) = 100x100 neurons. With a stride of 8, adjacent filter positions overlap by two pixels, and so on; I'll leave that arithmetic to you, the idea is what matters. Note that this is the neuron count for just one filter, i.e. one feature map; with 100 feature maps it is 100 times that. Thus the larger the image, the wider the gap grows between the number of neurons and the number of weights that need to be trained.
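The relation just described can be packaged as a small helper (the function name is mine; the formula is the standard output size of a valid, unpadded convolution):

```python
def conv_output_size(input_size, filter_size, stride):
    """Number of filter positions along one dimension (valid convolution, no padding)."""
    return (input_size - filter_size) // stride + 1

# Non-overlapping case from the text: 1000x1000 image, 10x10 filter, stride 10.
n = conv_output_size(1000, 10, 10)
print(n, n * n)   # 100 positions per side -> 100x100 = 10,000 neurons per feature map

# Stride 8: adjacent filter positions overlap by 2 pixels.
m = conv_output_size(1000, 10, 8)
print(m, m * m)
```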

Note that the discussion above ignores each neuron's bias term. With the bias, each filter's parameter count increases by 1. The bias, too, is shared by all the neurons that use the same filter.

In short, the core idea of convolutional networks is to combine three structural ideas, namely local receptive fields, weight sharing (weight replication), and temporal or spatial subsampling, to obtain some degree of invariance to shift, scale, and deformation.

4) An illustrative example

A typical convolutional network used for digit recognition is LeNet-5 (see its results and the paper). Most American banks once used it to recognize handwritten digits on checks. That it reached that level of commercial deployment says something about its accuracy; after all, applications combining academia and industry attract the most scrutiny.

Let's use this example to illustrate.

LeNet-5 has 7 layers, not counting the input, and every layer contains trainable parameters (connection weights). The input image is 32*32, larger than the largest character in the MNIST database (a widely used handwritten-digit database). The reason is to ensure that potential salient features, such as stroke end-points or corners, can appear at the center of the receptive field of the highest-level feature detectors.

To be clear: each layer has multiple feature maps, each feature map extracts one feature of its input via a convolution filter, and each feature map contains multiple neurons.

The C1 layer is a convolution layer. (Why convolution? An important property of the convolution operation is that it can enhance the original signal's features while reducing noise.) It consists of 6 feature maps. Each neuron in a feature map is connected to a 5*5 neighborhood of the input. The feature maps are 28*28, which keeps input connections from falling outside the boundary (so BP feedback can be computed without gradient loss; my personal understanding). C1 has 156 trainable parameters (each filter has 5*5 = 25 weight parameters plus one bias, and there are 6 filters, for (5*5+1)*6 = 156 parameters in total), giving 156*(28*28) = 122,304 connections.
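The C1 bookkeeping can be verified in a few lines (this is only the arithmetic from the paragraph above, not framework code):

```python
k = 5 * 5                          # 25 weights per 5x5 filter
filters = 6                        # six feature maps, one filter each
params = (k + 1) * filters         # +1 shared bias per filter
connections = params * 28 * 28     # the same weights are reused at every output position
print(params, connections)         # 156 122304
```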

The S2 layer is a subsampling layer. (Why subsample? Exploiting the local correlation of images, subsampling reduces the amount of data to process while preserving useful information.) It has 6 feature maps of size 14*14. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C1. The 4 inputs of each S2 unit are summed, multiplied by a trainable coefficient, and given a trainable bias; the result is passed through the sigmoid function. The trainable coefficient and bias control the nonlinearity of the sigmoid. If the coefficient is small, the unit operates in a quasi-linear regime, and subsampling amounts to blurring the image. If the coefficient is large, then depending on the bias the subsampling can act like a noisy "or" or a noisy "and" operation. The 2*2 receptive fields do not overlap, so each feature map in S2 is 1/4 the size of the corresponding map in C1 (1/2 in each of the row and column directions). S2 has 12 trainable parameters and 5,880 connections.

Figure: the convolution and subsampling process. Convolution: a trainable filter fx convolves the input (the input image in the first stage, a feature map in later stages), a bias bx is added, and the convolution layer Cx is obtained. Subsampling: each 2*2 neighborhood of pixels is summed into one value, weighted by a scalar Wx+1, given a bias bx+1, and passed through a sigmoid activation function, producing a feature map Sx+1 roughly four times smaller.
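A minimal NumPy sketch of one convolution-plus-subsampling stage as the caption describes it. The filter values, coefficient, and bias here are made up for illustration, and the function names are mine:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(img, kernel, bias):
    """C_x: convolve the input with a trainable filter f_x and add a bias b_x."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel) + bias
    return out

def subsample_layer(fmap, weight, bias):
    """S_{x+1}: sum each 2x2 neighborhood, scale by W_{x+1}, add b_{x+1}, apply sigmoid."""
    h, w = fmap.shape[0] // 2, fmap.shape[1] // 2
    pooled = fmap[:2*h, :2*w].reshape(h, 2, w, 2).sum(axis=(1, 3))
    return sigmoid(weight * pooled + bias)

rng = np.random.default_rng(0)
img = rng.random((32, 32))                                # LeNet-5-sized input
c1 = conv_layer(img, rng.standard_normal((5, 5)), 0.1)    # -> 28x28
s2 = subsample_layer(c1, 0.5, 0.0)                        # -> 14x14
print(c1.shape, s2.shape)   # (28, 28) (14, 14)
```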

So the mapping from one plane to the next can be regarded as a convolution operation, and the S layer can be regarded as a blur filter, playing the role of a second feature extraction. The spatial resolution decreases from one hidden layer to the next, while the number of planes per layer increases, which allows more kinds of feature information to be detected.

C3 is also a convolution layer. It likewise convolves layer S2 with 5x5 kernels, so each resulting feature map has only 10x10 neurons, but it has 16 different convolution kernels and hence 16 feature maps. One thing to note is that each feature map in C3 is connected to all 6, or several, of the feature maps in S2, meaning each feature map of this layer is a different combination of the feature maps extracted by the previous layer (this is not the only possible design). (See, it's combination again, just like the human visual system discussed earlier: lower-level structures compose higher-level, more abstract structures, such as edges composing shapes or parts of an object.)

As just said, each feature map in C3 is composed from all 6, or several, of the feature maps in S2. Why not connect every feature map in S2 to every feature map in C3? There are two reasons. First, an incomplete connection scheme keeps the number of connections within a reasonable range. Second, and more importantly, it breaks the symmetry of the network: because different feature maps receive different inputs, they are forced to extract different (hopefully complementary) features.

For example, one scheme is: the first 6 feature maps of C3 take as input subsets of 3 adjacent feature maps in S2; the next 6 take subsets of 4 adjacent feature maps; the following 3 take subsets of 4 non-adjacent feature maps; and the last one takes all the feature maps in S2 as input. This gives the C3 layer 1,516 trainable parameters and 151,600 connections.
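Tallying that scheme confirms the totals: each C3 map carries one 5x5 filter per S2 map it reads, plus a single bias, and the parameters are reused at each of the 10x10 output positions.

```python
k = 5 * 5                                    # weights per 5x5 filter
groups = [(6, 3), (6, 4), (3, 4), (1, 6)]    # (how many C3 maps, S2 inputs each uses)
params = sum(n * (s * k + 1) for n, s in groups)
connections = params * 10 * 10               # shared weights applied at every position
print(params, connections)                   # 1516 151600
```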

The S4 layer is a subsampling layer consisting of 16 feature maps of size 5*5. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C3, just as between C1 and S2. S4 has 32 trainable parameters (one coefficient and one bias per feature map) and 2,000 connections.

The C5 layer is a convolution layer with 120 feature maps. Each unit is connected to a 5*5 neighborhood of all 16 feature maps in S4. Since the S4 feature maps are also 5*5 (the same size as the filter), each C5 feature map is 1*1, which amounts to a full connection between S4 and C5. C5 is still labeled a convolution layer rather than a fully connected layer because, if the LeNet-5 input were made larger with everything else unchanged, the feature maps' dimension would be larger than 1*1. The C5 layer has 48,120 trainable connections.

The F6 layer has 84 units (this number comes from the design of the output layer) and is fully connected to C5. It has 10,164 trainable parameters. Like a classical neural network, each F6 unit computes the dot product between its input vector and its weight vector, adds a bias, and passes the result through the sigmoid function to produce the state of unit i.

Finally, the output layer is composed of Euclidean radial basis function (RBF) units, one per class, each with 84 inputs. In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector: the farther the input is from the parameter vector, the larger the RBF output. An RBF output can be interpreted as a penalty measuring how poorly the input pattern matches a model of the class associated with that RBF unit. In probabilistic terms, the RBF output can be understood as the negative log-likelihood of a Gaussian distribution in the configuration space of the F6 layer. Given an input pattern, the loss function should drive the F6 configuration close enough to the RBF parameter vector of the pattern's desired class. The parameters of these units are chosen by hand and kept fixed (at least initially). The components of these parameter vectors are set to -1 or +1. Although they could be chosen at random with equal probability of -1 and +1, or could form an error-correcting code, they are instead designed as stylized 7*12 (= 84) bitmap images of the corresponding character class. Such a representation is not particularly useful for recognizing isolated digits, but it is quite useful for recognizing strings of characters drawn from the full printable ASCII set.

Another reason for using this distributed code, rather than the more common "1-of-N" code, for the outputs is that a non-distributed code performs poorly when the number of categories is large: most of the time, the outputs of a non-distributed code must be 0, which is difficult to achieve with sigmoid units. Yet another reason is that the classifier is used not only to recognize characters but also to reject non-characters. RBFs with a distributed code are better suited to this goal because, unlike sigmoids, they are activated within a well-constrained region of the input space, one that atypical patterns are more likely to fall outside.

The RBF parameter vectors play the role of target vectors for the F6 layer. It is worth noting that their components are +1 or -1, well within the range of the F6 sigmoid, which keeps the sigmoid units from saturating. In fact, +1 and -1 are the points of maximum curvature of the sigmoid, so the F6 units operate in their maximally nonlinear range. Saturation of the sigmoids must be avoided, because it leads to slow convergence and an ill-posed loss function.

5) Training Process

The mainstream of neural networks for pattern recognition is supervised learning; unsupervised networks are used more for cluster analysis. For supervised pattern recognition, since the class of every sample is known, the layout of samples in space is no longer driven by their natural distribution. Instead, one seeks a suitable partition of the space, or a classification boundary, based on how samples of the same class are distributed and how well samples of different classes are separated, so that different classes occupy different regions. This requires a long and complex learning process that continually adjusts the classification boundary partitioning the sample space, so that as few samples as possible fall into regions of the wrong class.

A convolutional network is, in essence, an input-to-output mapping. It can learn a large number of mappings between inputs and outputs without needing any precise mathematical expression relating them; it only needs to be trained on a set of known patterns to acquire the mapping between input-output pairs. Convolutional networks are trained with supervision, so the sample set consists of vector pairs of the form (input vector, ideal output vector). All of these vectors should be actual "running" results of the system the network is to emulate, and they can be collected from that system in operation. Before training begins, all weights should be initialized with different small random numbers. "Small" ensures the network does not enter saturation because of overly large weights, which would cause training to fail; "different" ensures the network can learn normally. In fact, if the weight matrices are initialized with identical values, the network is incapable of learning.
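The "small and different" initialization rule can be illustrated as follows (the range of ±0.05 and the matrix shape are my own illustrative choices, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(42)

# "Small": keeps sigmoid units near their linear region, away from saturation.
# "Different": breaks symmetry so units can learn different features.
W = rng.uniform(-0.05, 0.05, size=(120, 84))

# By contrast, a constant initialization leaves every unit computing the same
# function, and identical gradient updates keep them identical forever.
W_bad = np.full((120, 84), 0.05)

print(W.min() >= -0.05 and W.max() <= 0.05)   # True
```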

The training algorithm is similar to the traditional BP algorithm. It consists of 4 steps, divided into two phases:

Phase one: the forward propagation phase.

a) Take a sample (Xp, Yp) from the sample set and feed Xp into the network;

b) Compute the corresponding actual output Op.

In this phase, information is transformed stage by stage from the input layer to the output layer. This is also the process the network executes when running normally after training is complete. The network simply computes (in effect, the input is multiplied by each layer's weight matrix in turn to produce the final output):

Op = Fn(...(F2(F1(Xp W(1)) W(2))...) W(n))

Phase two: the backward propagation phase.

a) Compute the difference between the actual output Op and the corresponding ideal output Yp;

b) Back-propagate and adjust the weight matrices so as to minimize the error.
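The two phases can be sketched for a tiny fully connected stand-in network (the shapes, learning rate, and single training pair are illustrative; a real CNN would back-propagate through the convolution and subsampling layers as well):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.1, 0.1, (4, 3))   # layer-1 weight matrix W(1)
W2 = rng.uniform(-0.1, 0.1, (3, 2))   # layer-2 weight matrix W(2)
Xp = rng.random((1, 4))               # one input sample
Yp = np.array([[0.0, 1.0]])           # its ideal output
lr = 0.5                              # learning rate (illustrative)

for _ in range(1000):
    # Phase one: forward propagation, Op = F2(F1(Xp W1) W2)
    H = sigmoid(Xp @ W1)
    Op = sigmoid(H @ W2)
    # Phase two: back-propagate the error Op - Yp and adjust the weights
    d2 = (Op - Yp) * Op * (1 - Op)
    d1 = (d2 @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ d2
    W1 -= lr * Xp.T @ d1

print(np.round(Op, 2))   # after training, Op approaches Yp
```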

6) Advantages of convolutional neural networks

Convolutional neural networks are used mainly to recognize two-dimensional patterns invariant to shift, scaling, and other forms of distortion. Because a CNN's feature-detection layers learn from training data, explicit feature extraction is avoided: the network learns its features implicitly. Moreover, because neurons on the same feature map share weights, the network can learn in parallel, a major advantage of convolutional networks over networks in which neurons are fully connected to one another. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing. Its layout is closer to that of a real biological neural network; weight sharing reduces network complexity; and in particular, images, as multi-dimensional input vectors, can be fed directly into the network, avoiding the complexity of data reconstruction during feature extraction and classification.

Mainstream classifiers are almost all based on statistical features, meaning certain features must be extracted before classification can proceed. However, explicit feature extraction is not easy and is not always reliable in some applications. Convolutional neural networks avoid explicit feature sampling and instead learn implicitly from the training data. This clearly distinguishes them from other neural-network classifiers: through structural reorganization and weight reduction, the feature-extraction function is merged into the multilayer perceptron. They can directly handle grayscale images and can be used directly for image-based classification.

Compared with a general neural network, a convolutional network has the following advantages in image processing: a) the topology of the input image matches the network's topology well; b) feature extraction and pattern classification proceed simultaneously and are produced jointly during training; c) weight sharing reduces the network's training parameters, making the neural network structure simpler and more adaptable.

7) Summary

The tight coupling between the layers of a CNN and spatial information makes CNNs well suited to image processing and understanding, and they perform well at automatically extracting salient image features. In some cases, Gabor filters have been used in an initial preprocessing step to mimic the human visual system's response to visual stimuli. In most current work, researchers have applied CNNs to a variety of machine learning problems, including face recognition, document analysis, and language detection. To find coherence between successive frames of video, CNNs are currently trained with a temporal coherence objective, though this is not specific to CNNs.

Ah, this part has run a bit long-winded without quite getting to the point. No way around it: I haven't worked through this process myself, so my level here is limited. Please read critically; I'll revise it later, hehe.

