Deep Learning notes, compiled (very good)


Disclaimer: This article is not original to the author; it is reproduced from: http://www.sigvc.org/bbs/thread-2187-1-3.html

4.2 Primary (shallow) feature representation

Since pixel-level features turn out to be ineffective, what kind of representation is useful?

Around 1995, two scholars at Cornell University, Bruno Olshausen and David Field, tried to use both physiological and computational methods to study visual problems.

They collected many black-and-white photographs of natural scenes and extracted 400 small patches from them, each 16x16 pixels; label these 400 patches S[i], i = 0, ..., 399. Next, from the same black-and-white photographs, they randomly extracted another patch, also 16x16 pixels; label this patch T.

The question they posed was: how to select a set of patches S[k] from these 400 and superimpose them to synthesize a new patch, such that the new patch is as similar as possible to the randomly chosen target patch T, while the number of patches S[k] used is as small as possible. In mathematical language:

Sum_k (a[k] * S[k]) --> T, where a[k] is the weight coefficient of patch S[k] in the superposition.

To solve this problem, Bruno Olshausen and David Field invented an algorithm called sparse coding.

Sparse coding is an iterative process, with two steps per iteration:

1) Fix a group of patches S[k] and adjust the a[k] so that Sum_k (a[k] * S[k]) is as close to T as possible.

2) Fix the a[k] and, among the 400 patches, select more suitable patches S'[k] to replace the original S[k], so that Sum_k (a[k] * S'[k]) is as close to T as possible.
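The two alternating steps can be illustrated with a toy NumPy sketch. This is not Olshausen and Field's actual algorithm (which optimizes a sparsity-penalized reconstruction objective); random vectors stand in for the 400 patches, and the greedy swap rule in step 2 is invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 400 dictionary "patches" and one target "patch",
# each flattened from 16x16 to a 256-dimensional vector.
S = rng.normal(size=(400, 256))
T = rng.normal(size=256)

k = 5                           # how many patches the combination may use
idx = list(rng.choice(400, size=k, replace=False))

for _ in range(10):             # repeat the two steps a few times
    # Step 1: fix the chosen patches S[idx], fit the weights a by least squares
    A = S[idx].T                                  # 256 x k
    a, *_ = np.linalg.lstsq(A, T, rcond=None)
    # Step 2: fix a, try to swap each chosen patch for a better one
    for j in range(k):
        residual = T - A @ a + a[j] * S[idx[j]]   # target minus the other terms
        errs = np.linalg.norm(residual[None, :] - a[j] * S, axis=1)
        idx[j] = int(np.argmin(errs))             # best patch for weight a[j]
        A = S[idx].T

# Refit the weights for the final patch selection and reconstruct
a, *_ = np.linalg.lstsq(S[idx].T, T, rcond=None)
approx = S[idx].T @ a
print(np.linalg.norm(T - approx) < np.linalg.norm(T))  # error shrank: True
```

On real image patches the selected S[idx] would tend toward oriented edges, as described below; with random vectors the sketch only shows the mechanics of the alternation.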

After several iterations, the best combination of S[k] was selected. Surprisingly, the selected S[k] were basically the edges of different objects in the photographs: similar in shape, differing in orientation.

The results of Bruno Olshausen and David Field's algorithm coincided with the physiological discoveries of David Hubel and Torsten Wiesel.

In other words, complex images are often composed of basic structures. For example, an image can be expressed as a linear combination of 64 orthogonal edges (basic structures that can be understood as an orthogonal basis): a patch X might be reconstructed from three of the 64 edges with weights 0.8, 0.3, and 0.5, while the remaining basic edges contribute nothing, i.e., their weights are all 0.
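This orthogonal-basis picture can be checked numerically. A sketch with a random orthonormal basis standing in for the 64 edges (the indices 10, 20, 30 are arbitrary choices for the example):

```python
import numpy as np

# 64 orthonormal "edges": rows of a random orthogonal matrix stand in for the basis.
basis, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(64, 64)))

# A patch X built from just three of the basis edges with weights 0.8, 0.3, 0.5.
x = 0.8 * basis[10] + 0.3 * basis[20] + 0.5 * basis[30]

# Because the basis is orthonormal, each weight is recovered by a dot product;
# all other coefficients come out (numerically) zero.
coeffs = basis @ x
print(np.round(coeffs[[10, 20, 30]], 6))   # recovers [0.8 0.3 0.5]
print(int(np.sum(np.abs(coeffs) > 1e-8)))  # 3 nonzero coefficients
```

The representation is sparse: only 3 of the 64 coefficients are nonzero, which is exactly the situation sparse coding seeks.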

In addition, researchers found that this regularity exists not only in images but also in sound. From unlabeled audio they found 20 basic sound structures, and the remaining sounds could be synthesized from these 20 basic structures.

4.3 Structural feature representation

Small patches of an image can be composed from basic edges; but how do we represent larger, more structured, more complex, conceptual graphics? This calls for higher levels of feature representation, such as V2 and V4. Pixels are V1's pixel level; V1's output is, in effect, V2's pixel level. The levels are progressive: high-level expressions are formed by combining lower-level ones. The technical term is basis. The basis of V1 consists of edges; the V2 layer then takes V1's output as input, so V2's basis is a combination of the bases of the layer below, and each higher layer is in turn a combination of the bases of the layer before it. (Hence the joke among experts that Deep Learning is all about "finding bases"; more formally it is called Deep Learning or Unsupervised Feature Learning.)

Intuitively, the idea is to find meaningful small patches, then combine them to obtain the features of the layer above, recursively learning features upward.

If we train on different objects, the learned edge bases are very similar, but the object parts and models are completely different (and then distinguishing, say, a car from a face becomes much easier).

The same holds for text. A doc represents something; how do we describe it most appropriately? With individual words? No: words are the pixel level. At least we should use terms; in other words, every doc is composed of terms. But whether terms express concepts well enough is doubtful, so we may need to go a step further, up to the topic level; with topics, representing a doc becomes reasonable. The number of items at each level differs greatly, e.g., concept -> topic (on the order of thousands to tens of thousands) -> term (on the order of 100,000) -> word (on the order of millions).

When a person reads a doc, the eyes see words; the brain automatically segments these words into terms, organizes them according to concepts learned beforehand, arrives at topics, and then carries out higher-level learning.

4.4 How many features are needed?

We know that features need to be built hierarchically, but how many features should each layer have?

For any method, more features means more reference information, and accuracy will improve. But more features also means higher computational complexity and a larger search space, and the data available to train each feature becomes sparse, which brings various problems. So more features is not necessarily better.

Well, at this point we can finally talk about deep learning. Above we discussed why deep learning exists (so that machines automatically learn good features, eliminating the manual selection process; and also with reference to the human hierarchical visual processing system), and we reached the conclusion that deep learning requires multiple layers to obtain more abstract feature representations. So how many layers are appropriate? What architecture should be used for modeling? And how do we do unsupervised training?


V. The basic idea of Deep Learning

Suppose we have a system S with n layers (S1, ..., Sn), input I, and output O, pictured as: I => S1 => S2 => ... => Sn => O. If the output O equals the input I, then the input passes through the system without any loss of information. (Experts note this is actually impossible. Information theory has a "data-processing inequality": if processing A yields B, and processing B yields C, then it can be proved that the mutual information between A and C does not exceed the mutual information between A and B. This means that processing never adds information, and most processing loses some; of course, if what is lost is useless information, so much the better.) Keeping the information unchanged means that at every layer Si, the representation is just another expression of the original information (the input I). Now back to our topic, Deep Learning. We want to learn features automatically. Suppose we have a pile of inputs I (such as a heap of images or text), and we design a system S with n layers; by adjusting the parameters of the system so that its output is still the input I, we can automatically obtain a series of hierarchical features of the input I, namely S1, ..., Sn.

For deep learning, the idea is to stack multiple layers: the output of one layer serves as the input of the next. In this way, a hierarchical expression of the input information can be achieved.

In addition, the assumption above, that the output strictly equals the input, is too strict. We can relax it slightly, for example by only requiring the difference between input and output to be as small as possible; this relaxation leads to another class of deep learning methods. This is the basic idea of deep learning.
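The relaxed objective, making the difference between input and output as small as possible, can be sketched with a minimal linear "autoencoder" trained by gradient descent (the sizes, learning rate, and random data are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # a pile of 8-dimensional inputs I

# One hidden layer of size 3: encoder W1, decoder W2 (a linear autoencoder).
W1 = rng.normal(size=(8, 3)) * 0.1
W2 = rng.normal(size=(3, 8)) * 0.1

def loss():
    # mean squared difference between the system's output and its input
    return np.mean((X @ W1 @ W2 - X) ** 2)

before = loss()
lr = 0.1
for _ in range(1000):
    H = X @ W1                            # the learned representation of I
    G = 2 * (H @ W2 - X) / X.size         # gradient of the reconstruction error
    gW2 = H.T @ G
    gW1 = X.T @ (G @ W2.T)
    W1 -= lr * gW1
    W2 -= lr * gW2
after = loss()
print(after < before)                     # reconstruction error decreased
```

The hidden activations H are exactly the "another expression of the input" that the text describes; stacking several such layers gives the hierarchical features S1, ..., Sn.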

VI. Shallow Learning and Deep Learning

Shallow learning was the first wave of machine learning.

In the late 1980s, the invention of the back-propagation algorithm (the BP algorithm) for artificial neural networks (ANN) brought hope to machine learning and set off a wave of machine learning based on statistical models; that wave continues to this day. It was found that, using the BP algorithm, statistical regularities can be learned from large numbers of training samples and used to predict unknown events. This statistics-based approach to machine learning showed its superiority in many respects over earlier systems based on hand-crafted rules. Although the artificial neural networks of this period were called multilayer perceptrons, they were in fact shallow models containing only a single hidden layer.

In the 1990s, various shallow machine learning models were proposed, such as Support Vector Machines (SVM), Boosting, and maximum-entropy methods (e.g., Logistic Regression, LR). The structure of these models can basically be regarded as having a single hidden layer (SVM, Boosting) or no hidden layer at all (LR). These models achieved great success in both theoretical analysis and application. By contrast, because of the difficulty of theoretical analysis and the experience and skill required for training, shallow artificial neural networks were relatively quiet during this period.

Deep learning is the second wave of machine learning.

In 2006, Professor Geoffrey Hinton of the University of Toronto, Canada, and his student Ruslan Salakhutdinov published an article in Science that opened the wave of deep learning in academia and industry. The article makes two main points: 1) artificial neural networks with many hidden layers have excellent feature-learning ability, and the learned features characterize the data more essentially, which benefits visualization and classification; 2) the difficulty of training deep neural networks can be effectively overcome by layer-wise pre-training; in that article, layer-wise initialization is achieved through unsupervised learning.

At present, most learning methods for classification and regression are shallow-structure algorithms. Their limitation is that, with finite samples and computational units, their ability to represent complex functions is limited, and their generalization on complex classification problems is restricted. Deep learning can approximate complex functions by learning a deep nonlinear network structure, characterize distributed representations of the input data, and demonstrate a powerful ability to learn the essential features of a dataset from a small number of samples. (The advantage of multiple layers is that complex functions can be represented with fewer parameters.)

The essence of deep learning is to learn more useful features by building machine learning models with many hidden layers and massive training data, ultimately improving the accuracy of classification or prediction. Thus the "deep model" is the means, and "feature learning" is the goal. Deep learning differs from traditional shallow learning in that: 1) it emphasizes the depth of the model structure, usually with 5, 6, or even 10 or more hidden layers; 2) it explicitly highlights the importance of feature learning: by transforming a sample's feature representation in the original space, layer by layer, into a new feature space, classification or prediction becomes easier. Compared with constructing features by hand-crafted rules, using big data to learn features better captures the rich intrinsic information of the data.

VII. Deep Learning and neural networks

Deep learning is a new field in machine learning whose motivation is to build and simulate neural networks that analyze and learn like the human brain, imitating the mechanisms of the brain to interpret data such as images, sound, and text. Deep learning is a kind of unsupervised learning.

The concept of deep learning originates from the study of artificial neural networks. A multilayer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, in order to discover distributed feature representations of data.

Deep Learning itself is a branch of machine learning and can be simply understood as the development of neural networks. Twenty or thirty years ago, neural networks were a particularly hot direction in the ML field, but they slowly faded out for the following reasons:

1) they overfit easily, the parameters are hard to tune, and many tricks are needed;

2) training is relatively slow, and with few layers (3 or fewer) the results are no better than those of other methods.

So for roughly twenty years in between, neural networks received very little attention; that period basically belonged to the SVM and Boosting algorithms. But one devoted old gentleman, Hinton, persisted, and eventually (together with Bengio, Yann LeCun, and others) proposed a practical deep learning framework.

Deep learning and traditional neural networks have both similarities and many differences.

The similarity is that deep learning adopts a layered structure similar to that of neural networks: the system is a multilayer network consisting of an input layer, hidden layers (multiple), and an output layer; only nodes in adjacent layers are connected, while nodes within the same layer and across non-adjacent layers are not connected to each other; each layer can be regarded as a logistic regression model. This layered structure is closer to the structure of the human brain.

To overcome the problems of neural network training, DL adopts a training mechanism very different from that of neural networks. A traditional neural network uses back propagation: simply put, an iterative algorithm trains the whole network, initial values are set randomly, the current network output is computed, and the parameters of the earlier layers are changed according to the difference between the current output and the label, until convergence (overall, a gradient descent method). Deep learning, by contrast, is overall a layer-wise training mechanism. The reason is that with the back propagation mechanism, for a deep network (more than 7 layers), the residual propagated to the front layers becomes too small; this is the so-called gradient diffusion (vanishing gradient). We will discuss this question next.
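The gradient diffusion effect can be seen in a small numerical experiment: back-propagating an error signal through a stack of sigmoid layers and watching its norm shrink toward the input layer (the layer sizes and weight scales here are arbitrary choices for the sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=16)
weights = [rng.normal(size=(16, 16)) * 0.1 for _ in range(10)]

# Forward pass through 10 sigmoid layers, keeping the pre-activations.
pre, a = [], x
for W in weights:
    z = W @ a
    pre.append(z)
    a = sigmoid(z)

# Back-propagate a unit error from the top; record the gradient norm per layer.
grad = np.ones(16)
norms = []
for W, z in zip(reversed(weights), reversed(pre)):
    grad = (grad * sigmoid(z) * (1 - sigmoid(z))) @ W   # one chain-rule step
    norms.append(np.linalg.norm(grad))

print(norms[0] > norms[-1])   # the signal shrinks toward the input layer
```

Each backward step multiplies the signal by a sigmoid derivative (at most 0.25) and a small weight matrix, so the correction reaching the front layers is orders of magnitude weaker than at the top. This is why purely gradient-based training of very deep networks was considered impractical.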

VIII. The training process of Deep Learning

8.1 Why traditional neural network training methods cannot be used in deep neural networks

Although the BP algorithm is the typical algorithm for training multi-layer networks, in practice it is already not ideal for networks with only a few layers. The local minima that are ubiquitous in the non-convex objective cost function of a deep structure (involving multiple layers of nonlinear processing units) are the main source of training difficulty.

Problems with the BP algorithm:

(1) The gradients become sparser and sparser: from the top layer downward, the error-correction signal gets smaller and smaller;

(2) It converges to a local minimum: especially when starting far from the optimal region (random initialization can cause this);

(3) Generally, it can only be trained with labeled data: but most data is unlabeled, whereas the brain can learn from unlabeled data.

8.2 The deep learning training process

If all layers are trained simultaneously, the complexity is too high; if one layer is trained at a time, the deviations are passed on layer by layer. This faces the opposite problem to the supervised learning above: severe underfitting (because a deep network has too many neurons and parameters).

In 2006, Hinton proposed an effective method for building multilayer neural networks on unsupervised data. In short, there are two steps: first, train one layer of the network at a time; second, tune, so that the high-level representation r generated upward from the original representation x, and the x' generated downward from the high-level representation r, are as consistent as possible. The method is:

1) Build the layers of neurons one at a time, so that each time a single-layer network is trained.

2) When all the layers have been trained, Hinton uses the wake-sleep algorithm for tuning.

He makes the weights of all layers except the topmost bidirectional, so that the topmost layer remains a single-layer neural network while the other layers become graphical models. The upward weights are used for "cognition" and the downward weights for "generation". The wake-sleep algorithm then adjusts all the weights to make cognition and generation agree, that is, to ensure that what the topmost layer generates can recover the underlying nodes as correctly as possible. For example, if a node in the top layer represents a human face, then all face images should activate this node, and the image generated downward from it should be able to represent a rough face image. The wake-sleep algorithm has two parts: wake and sleep.

(1) Wake phase: the cognitive process. The external features and the upward (cognitive) weights produce the abstract representation (node states) at each layer, and gradient descent modifies the downward (generative) weights between layers. That is: "if reality differs from what I imagined, change my weights so that what I imagine becomes like this."

(2) Sleep phase: the generative process. The top-level representation (the concepts learned while awake) and the downward weights generate the states of the lower layers, while the upward (cognitive) weights are modified. That is: "if the scene in the dream is not the corresponding concept in my mind, change my cognitive weights so that this scene is, to me, that concept."

The deep learning training process is specifically as follows:

1) Use bottom-up unsupervised learning (that is, start from the bottom and train layer by layer toward the top):

Train the parameters of each layer in turn using unlabeled data (labeled data can also be used). This step can be seen as an unsupervised training process and is the part that most distinguishes deep learning from traditional neural networks (it can be seen as the feature learning process):

Specifically, first train the first layer with unlabeled data, learning its parameters (this layer can be seen as the hidden layer of a three-layer neural network that minimizes the difference between output and input). Because of the limits on model capacity and the sparsity constraint, the resulting model can learn the structure of the data itself and obtain features with more representational power than the input. After the (n-1)-th layer has been learned, its output is used as the input to train the n-th layer, thereby obtaining the parameters of each layer in turn.

2) Top-down supervised learning (that is, train with labeled data, propagating the error from top to bottom to fine-tune the network):

Based on the first step, further fine-tune the parameters of the whole multilayer model; this step is a supervised training process. The first step is analogous to the random initialization of a neural network, but because DL's first step is not random initialization and instead comes from learning the structure of the input data, the initial values are closer to the global optimum and better results can be achieved. So the effectiveness of deep learning is largely attributable to the feature learning of the first step.
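The two steps above can be sketched in a few lines. This toy uses small linear autoencoders as the per-layer learner (the 2006 recipe uses RBMs or sparsity-constrained autoencoders; the function `pretrain_layer` and the layer sizes are invented for the example), and the supervised fine-tuning step is only indicated in a comment:

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(data, hidden, steps=300, lr=0.1):
    """Train one layer as a small linear autoencoder; return the encoder weights."""
    n, d = data.shape
    W_enc = rng.normal(size=(d, hidden)) * 0.1
    W_dec = rng.normal(size=(hidden, d)) * 0.1
    for _ in range(steps):
        H = data @ W_enc                      # the layer's representation
        G = 2 * (H @ W_dec - data) / data.size
        gdec = H.T @ G                        # gradients of reconstruction error
        genc = data.T @ (G @ W_dec.T)
        W_dec -= lr * gdec
        W_enc -= lr * genc
    return W_enc

X = rng.normal(size=(200, 16))

# Step 1: bottom-up unsupervised pre-training, one layer at a time;
# each layer's output becomes the next layer's input.
layers = []
inp = X
for hidden in (8, 4):
    W = pretrain_layer(inp, hidden)
    layers.append(W)
    inp = inp @ W          # representation fed to the next layer

# Step 2 (indicated only): with labels y, propagate the supervised error
# top-down through `layers` to fine-tune all the weights, starting from
# these pretrained values instead of random ones.
print([W.shape for W in layers])   # [(16, 8), (8, 4)]
```

The key point the sketch illustrates is the data flow: each layer is trained only against the output of the layer below it, so no deep gradient ever has to travel from the top to the bottom during pre-training.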


IX. Common models or methods of Deep Learning

9.1 AutoEncoder (automatic encoder)

The simplest approach to Deep Learning uses the properties of artificial neural networks. An artificial neural network (ANN) is itself a hierarchical system; if we are given a neural network, assume its output and input are the same, and then train it to adjust its parameters, we obtain the weights of each layer. Naturally, we then get several different representations of the input I (each layer yields one representation), and these representations are the features. An autoencoder is a neural network that reproduces its input signal as faithfully as possible. To achieve this, the autoencoder must capture the most important factors that can represent the input data; like PCA, it finds the principal components that can represent the original information.

The specific process is simply described as follows: