Reposted from: http://blog.csdn.net/zouxy09/article/details/8775518

With that background, we can finally talk about deep learning itself. Above we discussed why deep learning exists: to let the machine learn good features automatically and remove the manual feature-selection step. We also drew on the hierarchical visual processing system of humans and concluded that deep learning requires multiple layers to obtain more abstract feature representations. So how many layers are appropriate? What architecture should be used to model them? And how should unsupervised training be carried out?

**V. The basic idea of deep learning**

Suppose we have a system S with n layers (S1, ..., Sn), input I, and output O, pictured as: I => S1 => S2 => ... => Sn => O. Suppose further that the output O equals the input I, that is, the input passes through the system without any information loss. (Experts will say this is impossible. Information theory has a statement about information being lost layer by layer, the data processing inequality: if processing A yields B, and processing B yields C, then it can be proved that the mutual information between A and C cannot exceed the mutual information between A and B. In other words, processing never adds information, and most processing loses some. Of course, if what is lost is useless information, so much the better.) If the input is preserved, then I loses no information at any layer Si, so the output of every layer Si is just another representation of the original information (the input I).

Now back to our topic. In deep learning we want to learn features automatically. Suppose we have a collection of inputs I (such as a pile of images or text) and we design a system S with n layers. If we adjust the parameters of the system so that its output is still the input I, then we automatically obtain a series of hierarchical features of I, namely S1, ..., Sn.
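The data processing inequality mentioned above can be checked numerically. The following is a minimal sketch, assuming a chain A => B => C over small discrete alphabets with randomly drawn conditional tables; the names `p_a`, `p_b_given_a`, `p_c_given_b` and the helper `mutual_information` are illustrative and do not come from the original article.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Y) in bits, from a joint probability table p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, ny)
    mask = joint > 0                        # skip zero cells to avoid log(0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
p_a = rng.dirichlet(np.ones(4))                    # p(a)
p_b_given_a = rng.dirichlet(np.ones(4), size=4)    # row a holds p(b | a)
p_c_given_b = rng.dirichlet(np.ones(4), size=4)    # row b holds p(c | b)

joint_ab = p_a[:, None] * p_b_given_a              # p(a, b)
joint_ac = joint_ab @ p_c_given_b                  # p(a, c) = sum_b p(a, b) p(c | b)

i_ab = mutual_information(joint_ab)
i_ac = mutual_information(joint_ac)
print(f"I(A;B) = {i_ab:.4f} bits, I(A;C) = {i_ac:.4f} bits")
```

Because C depends on A only through B, I(A;C) can never exceed I(A;B), whatever tables are drawn.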

The idea of deep learning is to stack multiple layers, so that the output of one layer is the input to the next. In this way, the input information is expressed hierarchically.
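As a rough illustration of stacking, the sketch below feeds an input through several layers and keeps every intermediate representation. The sigmoid nonlinearity, the layer sizes, and the random weights are assumptions made for the example; the article does not prescribe them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Feed x through the stack and return the representation produced by each layer."""
    representations = []
    for W, b in layers:
        x = sigmoid(W @ x + b)      # the output of this layer is the input to the next
        representations.append(x)
    return representations

rng = np.random.default_rng(0)
sizes = [8, 6, 4, 2]                # input size followed by three layer sizes (illustrative)
layers = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

reps = forward(rng.normal(size=8), layers)
for i, r in enumerate(reps, start=1):
    print(f"S{i}: vector of length {r.shape[0]}")
```

Each entry of `reps` plays the role of one Si in the text: a different expression of the same input.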

In addition, the earlier assumption that the output is strictly equal to the input is too strict. We can relax it slightly, for example by only requiring the difference between input and output to be as small as possible. This relaxation leads to another class of deep learning methods. That is the basic idea of deep learning.

**VI. Shallow learning and deep learning**

**Shallow learning is the first wave of machine learning.**

In the late 1980s, the invention of the backpropagation algorithm (also called the BP algorithm) for artificial neural networks brought new hope to machine learning and set off a wave of machine learning based on statistical models that continues to this day. It was found that the BP algorithm lets an artificial neural network model learn statistical regularities from a large number of training samples and thereby make predictions about unknown events. This statistics-based approach proved superior in many respects to earlier systems based on hand-written rules. The artificial neural networks of this period, although called multi-layer perceptrons, were in practice shallow models with only one layer of hidden nodes.

In the 1990s, a variety of shallow machine learning models were proposed, such as support vector machines (SVM), boosting, and maximum entropy methods (such as logistic regression, LR). Structurally, these models can be viewed as having one layer of hidden nodes (SVM, boosting) or no hidden nodes at all (LR). They achieved great success in both theoretical analysis and applications. By contrast, shallow artificial neural networks were relatively quiet during this period, because they were hard to analyze theoretically and their training required a good deal of experience and tricks.

**Deep learning is the second wave of machine learning.**

In 2006, Professor Geoffrey Hinton of the University of Toronto in Canada and his student Ruslan Salakhutdinov published an article in Science that opened the wave of deep learning in academia and industry. The article makes two main points: 1) an artificial neural network with multiple hidden layers has excellent feature learning ability, and the learned features capture the data more fundamentally, which benefits visualization and classification; 2) the difficulty of training a deep neural network can be effectively overcome by layer-wise pre-training, which in this article is done through unsupervised learning.

At present, most classification and regression methods are shallow-structure algorithms. Their limitation is that, given finite samples and computational units, their ability to represent complex functions is limited, which restricts their generalization on complex classification problems. Deep learning can approximate complex functions by learning a deep nonlinear network structure, learn distributed representations of the input data, and shows a powerful ability to learn the essential characteristics of a dataset from a small number of samples. (The benefit of multiple layers is that complex functions can be represented with fewer parameters.)

The essence of deep learning is to learn more useful features by building machine learning models with many hidden layers and training them on massive data, ultimately improving the accuracy of classification or prediction. The "deep model" is thus the means, and "feature learning" is the goal. Deep learning differs from traditional shallow learning in that it: 1) emphasizes the depth of the model structure, usually with 5, 6, or even 10 hidden layers; 2) explicitly highlights the importance of feature learning, that is, transforming the representation of a sample from the original space into a new feature space, which makes classification or prediction easier. Compared with constructing features by hand-written rules, learning features from big data is better able to capture the rich intrinsic information of the data.

**VII. Deep learning and neural networks**

Deep learning is a new field in machine learning research. Its motivation is to build neural networks that simulate the way the human brain analyzes and learns, mimicking the mechanisms of the brain to interpret data such as images, sounds, and text. As described in this article, deep learning relies on unsupervised learning for its feature-learning stage.

The concept of deep learning comes from research on artificial neural networks. A multi-layer perceptron with multiple hidden layers is one kind of deep learning structure. Deep learning combines lower-level features to form more abstract higher-level representations of attribute categories or features, in order to discover distributed feature representations of data.

Deep learning is itself a branch of machine learning and can be understood simply as a development of neural networks. Twenty or thirty years ago, neural networks were a particularly hot direction in the ML field, but they slowly faded out for several reasons, including the following:

1) they overfit relatively easily, their parameters are hard to tune, and they need a lot of tricks;

2) training is relatively slow, and with few layers (3 or fewer) the results are no better than those of other methods.

So for more than 20 years in between, neural networks received very little attention, and this period was basically dominated by SVM and boosting algorithms. However, one devoted old gentleman, Hinton, persisted, and eventually (together with others such as Bengio and Yann LeCun) produced a practical deep learning framework.

Deep learning and traditional neural networks have things in common, and also many differences.

What they share is that deep learning uses a layered structure similar to that of a neural network: the system is a multi-layer network consisting of an input layer, hidden layers (several of them), and an output layer. Only nodes in adjacent layers are connected; nodes within the same layer and across non-adjacent layers are not connected to each other. Each layer can be regarded as a logistic regression model. This layered structure is relatively close to the structure of the human brain.

To overcome the problems in neural network training, DL adopts a training mechanism very different from that of traditional neural networks. A traditional neural network is trained with backpropagation. Put simply, an iterative algorithm trains the entire network: set the initial values randomly, compute the current network output, and then change the parameters of the preceding layers according to the difference between the current output and the label, until convergence (the whole procedure is a gradient descent method). Deep learning, by contrast, uses a layer-wise training mechanism overall. The reason is that if backpropagation is applied to a deep network (more than 7 layers), the residual propagated back to the front layers becomes too small, producing the so-called gradient diffusion (now usually called the vanishing gradient problem). We discuss this next.
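The shrinking residual can be observed directly in a small experiment. The sketch below is an illustration under assumptions that are not in the original article: a 10-layer network of 10 sigmoid units per layer, random weights with standard deviation 1/sqrt(width), and an error signal of all ones at the top. It records the norm of the error signal after each backward step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_signal_norms(depth=10, width=10, seed=0):
    """Return the norm of the backpropagated error at each layer (index 0 = nearest the input)."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
               for _ in range(depth)]

    # Forward pass: keep the activations of every layer.
    a = rng.normal(size=width)
    activations = []
    for W in weights:
        a = sigmoid(W @ a)
        activations.append(a)

    # Backward pass: start from an error signal of ones at the top layer.
    delta = np.ones(width)
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        # Chain rule through the sigmoid (derivative a * (1 - a) <= 0.25), then through W.
        delta = W.T @ (delta * a * (1.0 - a))
        norms.append(np.linalg.norm(delta))
    return norms[::-1]

norms = backprop_signal_norms()
for layer, n in enumerate(norms, start=1):
    print(f"layer {layer:2d}: |error signal| = {n:.3e}")
```

Since the sigmoid derivative never exceeds 0.25, each backward step tends to scale the signal down, so the layers nearest the input receive a far weaker correction than the layers near the output.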

**VIII. The deep learning training process**

**8.1 Why traditional neural network training methods cannot be used for deep neural networks**

The BP algorithm is the classic algorithm for training multi-layer networks, but in practice it already performs poorly on networks with only a few layers. Local minima, which are pervasive in the non-convex objective cost function of a deep structure (one involving multiple layers of nonlinear processing units), are the main source of training difficulty.

**Problems with the BP algorithm:**

(1) The gradient becomes increasingly sparse: from the top layer downward, the error correction signal gets smaller and smaller;

(2) It converges to local minima, especially when starting far from the optimal region (random initialization can cause this);

(3) In general, it can only be trained with labeled data, yet most data is unlabeled, and the brain can learn from unlabeled data.

**8.2 The deep learning training process**

If all layers are trained at the same time, the time complexity is too high; if one layer is trained at a time, the error is passed on from layer to layer. This leads to the opposite problem from the supervised case above: severe underfitting (because a deep network has so many neurons and parameters).

In 2006, Hinton proposed an effective method for building multi-layer neural networks on unsupervised data. Put simply, it has two steps: first, train the network one layer at a time; second, fine-tune so that the high-level representation r generated upward from the original representation x and the x' generated downward from r are as consistent as possible. The method is:

1) First build the network one layer of neurons at a time, so that each training step trains a single-layer network.

2) After all layers have been trained, Hinton uses the wake-sleep algorithm for fine-tuning.

All weights except those of the topmost layer are made bidirectional, so that the topmost layer remains a single-layer neural network while the other layers become a graphical model. The upward weights are used for "recognition" and the downward weights for "generation". The wake-sleep algorithm is then used to adjust all weights so that recognition and generation agree, that is, so that the top-level representation can reconstruct the bottom-level nodes as accurately as possible. For example, if a node in the top layer represents a face, then all images of faces should activate that node, and the image generated downward from it should look like an approximate face. The wake-sleep algorithm has two parts, wake and sleep.

**1) Wake stage**: the recognition process. The external features and the upward weights (recognition weights) produce an abstract representation (node states) at each layer, and gradient descent modifies the downward weights between layers (generative weights). That is, "if reality is different from what I imagined, change my weights so that what I imagine looks like reality."

**2) Sleep stage**: the generation process. The top-level representation (the concepts learned while awake) and the downward weights generate the bottom-level states, while the upward weights between layers are modified. That is, "if the scene in my dream is not the corresponding concept in my mind, change my recognition weights so that this scene looks to me like that concept."
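The two stages can be sketched in a few lines. This is a heavily simplified illustration, not Hinton's full procedure: it assumes a single hidden layer of binary stochastic units, no biases, a fixed 50/50 prior over the top-level units in place of a learned top layer, and simple delta-rule updates. The names `R` (recognition weights), `G` (generative weights), `wake_sleep_step`, and `train_wake_sleep` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p, rng):
    """Draw binary states with firing probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v, R, G, rng, lr=0.05):
    # Wake: recognize real data bottom-up, then adjust the generative
    # (downward) weights so the hidden state reconstructs the data better.
    h = sample(sigmoid(R @ v), rng)
    G = G + lr * np.outer(v - sigmoid(G @ h), h)

    # Sleep: dream top-down from a random hidden state, then adjust the
    # recognition (upward) weights so the dream maps back to its cause.
    h_dream = sample(np.full(R.shape[0], 0.5), rng)
    v_dream = sample(sigmoid(G @ h_dream), rng)
    R = R + lr * np.outer(h_dream - sigmoid(R @ v_dream), v_dream)
    return R, G

def train_wake_sleep(data, n_hidden, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    R = rng.normal(0.0, 0.1, (n_hidden, n_visible))  # recognition weights (upward)
    G = rng.normal(0.0, 0.1, (n_visible, n_hidden))  # generative weights (downward)
    for _ in range(steps):
        v = data[rng.integers(len(data))]            # pick one training pattern
        R, G = wake_sleep_step(v, R, G, rng)
    return R, G

# Two toy binary patterns (illustrative data only).
patterns = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
R, G = train_wake_sleep(patterns, n_hidden=3)
```

Note the symmetry: the wake phase only changes the downward weights, using real data as the target, while the sleep phase only changes the upward weights, using the dreamed hidden state as the target.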

**The deep learning training process in detail:**

1) Bottom-up unsupervised learning (that is, start from the bottom layer and train upward one layer at a time):

Use unlabeled data (labeled data also works) to train the parameters of each layer in turn. This step can be seen as an unsupervised training process and is the part that differs most from traditional neural networks (it can also be seen as a feature learning process):

Specifically, first train the first layer with unlabeled data and learn its parameters (this layer can be viewed as the hidden layer of a three-layer neural network that minimizes the difference between its output and its input). Because of the limits on model capacity and the sparsity constraint, the resulting model can learn the structure of the data itself and thus obtain features with more representational power than the input. After layer n-1 has been learned, its output is used as the input to layer n, and layer n is trained. In this way the parameters of every layer are obtained one by one;

2) Top-down supervised learning (that is, train with labeled data, propagating the error from top to bottom to fine-tune the network):

Starting from the per-layer parameters obtained in the first step, the parameters of the whole multi-layer model are further fine-tuned. This step is a supervised training process. The first step plays the role of the random initialization in a traditional neural network, but because the initial values in DL are obtained by learning the structure of the input data rather than chosen at random, they are closer to the global optimum and lead to better results. The effectiveness of deep learning is therefore largely due to the feature learning of the first step.
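The two steps above can be sketched as follows. This is a minimal illustration under several assumptions not taken from the article: each layer is pre-trained as a plain autoencoder (sigmoid encoder, linear decoder, no sparsity constraint), fine-tuning uses ordinary backpropagation with a logistic output unit, and the helper names `train_autoencoder`, `pretrain_stack`, and `fine_tune` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, rng, lr=0.2, steps=300):
    """Train one autoencoder on X by gradient descent and return its encoder (W, b)."""
    n_in = X.shape[1]
    W1, b1 = rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden)
    W2, b2 = rng.normal(0, 0.1, (n_hidden, n_in)), np.zeros(n_in)
    for _ in range(steps):
        H = sigmoid(X @ W1 + b1)
        err = (H @ W2 + b2) - X                 # reconstruction minus input
        dH = (err @ W2.T) * H * (1 - H)         # error pushed back to the hidden layer
        W2 -= lr * H.T @ err / len(X)
        b2 -= lr * err.mean(axis=0)
        W1 -= lr * X.T @ dH / len(X)
        b1 -= lr * dH.mean(axis=0)
    return W1, b1

def pretrain_stack(X, layer_sizes, rng):
    """Step 1: greedy layer-wise pre-training; layer n is trained on the output of layer n-1."""
    params, inp = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(inp, n_hidden, rng)
        params.append((W, b))
        inp = sigmoid(inp @ W + b)
    return params

def fine_tune(X, y, params, rng, lr=0.5, steps=300):
    """Step 2: supervised fine-tuning of all layers, starting from the pre-trained weights."""
    params = [(W.copy(), b.copy()) for W, b in params]
    w_out, b_out = rng.normal(0, 0.1, params[-1][0].shape[1]), 0.0
    for _ in range(steps):
        acts = [X]
        for W, b in params:
            acts.append(sigmoid(acts[-1] @ W + b))
        p = sigmoid(acts[-1] @ w_out + b_out)
        d_out = (p - y) / len(X)                # gradient of mean cross-entropy w.r.t. the logit
        delta = np.outer(d_out, w_out) * acts[-1] * (1 - acts[-1])
        w_out -= lr * acts[-1].T @ d_out
        b_out -= lr * d_out.sum()
        for k in range(len(params) - 1, -1, -1):  # error travels from top to bottom
            W, b = params[k]
            gW, gb = acts[k].T @ delta, delta.sum(axis=0)
            if k > 0:
                delta = (delta @ W.T) * acts[k] * (1 - acts[k])
            params[k] = (W - lr * gW, b - lr * gb)
    return params, w_out, b_out

rng = np.random.default_rng(0)
X = rng.random((20, 4))                         # toy unlabeled inputs
y = (X[:, 0] > 0.5).astype(float)               # toy labels, used only in step 2
pretrained = pretrain_stack(X, [3, 2], rng)
tuned, w_out, b_out = fine_tune(X, y, pretrained, rng)
```

The labels `y` never enter `pretrain_stack`; they are used only in `fine_tune`, which starts from the learned weights instead of random ones.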