http://blog.csdn.net/zouxy09
Zouxy
Version 1.0 2013-04-08
1) This Deep Learning Learning Notes series is compiled from material generously shared online by experts and machine learning researchers. Please see the references for the specific sources; detailed attributions appear in the original literature.
2) This article is for academic exchange only and is non-commercial, so the specific references are not matched to each part in detail. If any section inadvertently infringes on anyone's interests, please forgive the oversight and contact the blogger to have it deleted.
3) My knowledge is limited, and errors are inevitable in compiling this summary; I hope more experienced readers will point them out. Thank you.
4) Reading this article requires some background in machine learning, computer vision, neural networks, and so on (if you lack it, that's fine; you can still read along and get the general idea, hehe).
5) This is the first version; if there are errors, it will need further revision, addition, and deletion. Your suggestions are very welcome. If we each share a little, together we can advance our country's scientific research (hehe, a noble goal indeed). Please contact: [Email protected]
IV. The Basic Idea of Deep Learning
Deep learning differs from shallow learning. Shallow learning covers methods such as the back-propagation algorithm (BP), support vector machines (SVM), boosting, and maximum entropy methods (e.g., logistic regression, LR); structurally, these models can basically be seen as having a single layer of hidden nodes (e.g., SVM, boosting) or no hidden nodes at all (e.g., LR).
The essence of deep learning is to learn more useful features by building machine learning models with many hidden layers and massive amounts of training data, and thereby ultimately improve the accuracy of classification or prediction. Hence, the "depth model" is the means, and "feature learning" is the goal.
Compared with traditional shallow learning, deep learning differs in two respects:
1) It emphasizes the depth of the model structure; there are usually 5, 6, or even 10 or more hidden layers (the advantage of multiple layers is that complex functions can be represented with fewer parameters);
2) It explicitly highlights the importance of feature learning: through layer-by-layer feature transformations, the representation of a sample in the original space is mapped into a new feature space, making classification or prediction easier. Compared with constructing features by hand-crafted rules, learning features from big data captures the rich intrinsic information of the data better.
Similarities and differences between deep learning and neural networks
What they have in common: deep learning adopts a hierarchical structure similar to that of a neural network. The system is a multi-layer network consisting of an input layer, hidden layers (several of them), and an output layer. Only nodes in adjacent layers are connected; nodes within the same layer, or across non-adjacent layers, are not connected to each other. Each layer can be regarded as a logistic regression model, and this hierarchical structure is relatively close to the structure of the human brain.
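The "each layer is a logistic regression" view can be sketched in a few lines. This is an illustrative snippet, not from the original post; the function names (`layer_forward`, `forward`) and the layer sizes are made up for the demo:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, W, b):
    # One layer behaves like a logistic regression unit: affine map + sigmoid.
    return sigmoid(W @ x + b)

def forward(x, layers):
    # Only adjacent layers are connected, so the input simply flows
    # through the stack one layer at a time.
    for W, b in layers:
        x = layer_forward(x, W, b)
    return x

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # input, two hidden layers, output (arbitrary demo sizes)
layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = forward(rng.standard_normal(4), layers)
print(y.shape)  # (2,)
```

Each `(W, b)` pair plays the role of one logistic-regression layer; stacking them gives the multi-layer structure described above.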
To overcome the problems of training neural networks, DL adopts a training mechanism that is very different from that of neural networks.
1) A traditional neural network uses back propagation. In short, an iterative algorithm trains the whole network: the initial values are set randomly, the current network output is computed, and the parameters of the preceding layers are then changed according to the difference between the current output and the label, until convergence (the whole procedure is gradient descent).
2) Deep learning, by contrast, uses a layer-wise (layered initialization) training mechanism overall.
The reason is that if the back-propagation mechanism is applied to a deep network (more than 7 layers), the residual propagated to the front layers becomes too small, giving rise to so-called gradient diffusion (the vanishing gradient). We will discuss this issue next.
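The gradient diffusion effect can be seen with a back-of-the-envelope calculation. In a chain of sigmoid units the error signal reaching the bottom layer is scaled by a product of per-layer derivatives, each at most 0.25 (a simplified sketch that ignores the weights themselves):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

grad, a = 1.0, 0.5  # arbitrary starting activation
for layer in range(10):
    a = sigmoid(a)
    grad *= a * (1.0 - a)  # sigma'(z) = sigma(z) * (1 - sigma(z)) <= 0.25
    print(f"after layer {layer + 1}: gradient scale {grad:.3e}")
```

After ten layers the scale has collapsed by six to seven orders of magnitude, which is why the residual that reaches the front layers "becomes too small".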
The following has not been reorganized yet; it is just pasted in as-is.
V. The Deep Learning Training Process
5.1 Why traditional neural network training methods cannot be used on deep neural networks
The BP algorithm, the classic algorithm for training multi-layer networks, in fact works poorly once the network has more than a few layers. The local minima that pervade the non-convex objective function of a deep architecture (one involving multiple layers of nonlinear processing units) are the main source of training difficulty.
Problems with the BP algorithm:
(1) Gradients become increasingly sparse: moving down from the top layer, the error-correction signal gets smaller and smaller;
(2) It converges to local minima: especially when starting far from the optimal region (random initialization can cause this);
(3) In general, it can only train on labeled data: but most data are unlabeled, whereas the brain can learn from unlabeled data.
5.2 The Deep Learning Training Process
If all layers are trained simultaneously, the time complexity is too high; if the layers are trained one at a time, the bias is passed on from layer to layer. This runs into the opposite problem from the supervised learning above: it severely underfits (because a deep network has so many neurons and parameters).
In 2006, Hinton proposed an effective method of building multi-layer neural networks on unsupervised data. Briefly, it has two steps: first, train one layer of the network at a time; second, tune, so that the high-level representation r generated upward from the original representation x, and the x' generated downward from the high-level representation r, are as consistent as possible. The method is:
1) First build single layers of neurons, so that a single-layer network is trained at a time.
2) After all layers are trained, Hinton uses the wake-sleep algorithm for tuning.
He turns the weights between all layers except the topmost into bidirectional ones, so that the topmost layer remains a single-layer neural network while the other layers become graphical models. The upward weights are used for "cognition" (recognition) and the downward weights for "generation". The wake-sleep algorithm is then used to adjust all the weights so that cognition and generation agree, i.e., to ensure that what the top layer generates can restore the lower-level nodes as correctly as possible. For example, if a node in the top layer represents a human face, then all face images should activate that node, and the image generated downward from it should look like a rough face image. The wake-sleep algorithm consists of two parts: wake and sleep.
1) Wake phase: the cognitive process. External features and the upward (cognitive) weights produce an abstract representation (node states) at each layer, and gradient descent is used to modify the downward (generative) weights between layers. That is: "If reality differs from what I imagined, change my generative weights so that what I imagine is like this."
2) Sleep phase: the generative process. The top-level representation (the concepts learned while awake) and the downward weights generate the states of the lower layers, while the upward weights between layers are modified. That is: "If the scene in my dream does not match the corresponding concept in my mind, change my cognitive weights so that this scene appears to me as that concept."
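As a rough illustration only, the two phases can be sketched for a single hidden layer with binary stochastic units. This is a heavily simplified toy, not Hinton's full algorithm: no biases, random toy data, and only one hidden layer (the real algorithm operates on a multi-layer belief net); the names `R`, `G`, `wake`, `sleep` are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)  # binary units

n_v, n_h, lr = 6, 3, 0.1
R = rng.standard_normal((n_h, n_v)) * 0.1  # upward "cognitive" weights
G = rng.standard_normal((n_v, n_h)) * 0.1  # downward "generative" weights

def wake(v):
    # Wake phase: recognise upward, then nudge the generative weights so the
    # top-down reconstruction matches what was actually seen.
    global G
    h = sample(sigmoid(R @ v))
    v_recon = sigmoid(G @ h)
    G += lr * np.outer(v - v_recon, h)
    return h

def sleep(h):
    # Sleep phase: "dream" downward, then nudge the cognitive weights so the
    # inferred cause matches the cause that generated the dream.
    global R
    v_dream = sample(sigmoid(G @ h))
    h_recon = sigmoid(R @ v_dream)
    R += lr * np.outer(h - h_recon, v_dream)

for v in sample(np.full((20, n_v), 0.5)):  # toy binary "data"
    sleep(wake(v))
```

Note how the sketch mirrors the quoted slogans: the wake update changes `G` (generation) to match reality, and the sleep update changes `R` (cognition) to match the dream's true cause.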
The deep learning training process is detailed as follows:
1) Use bottom-up unsupervised learning (that is, start from the bottom layer and train toward the top, layer by layer):
Use unlabeled data (labeled data can also be used) to train the parameters of each layer. This step can be viewed as an unsupervised training process and is the biggest difference from a traditional neural network (this process can be viewed as feature learning):
Specifically, first train the first layer with unlabeled data, learning its parameters (this layer can be viewed as the hidden layer of a three-layer neural network that minimizes the difference between output and input). Because of constraints on model capacity and sparsity constraints, the resulting model can learn the structure of the data itself and thereby obtain features that are more representative than the raw input. After the first n-1 layers have been learned, the output of layer n-1 is used as the input of layer n to train layer n, thereby obtaining the parameters of each layer in turn;
2) Top-down supervised learning (that is, train with labeled data, propagating the error from top to bottom to fine-tune the network):
Based on the parameters obtained in the first step, further fine-tune the parameters of the whole multi-layer model; this is a supervised training process. The first step is analogous to the random initialization of a neural network, but because the first step of DL is not random initialization and instead learns the structure of the input data, the initial values are closer to the global optimum, which leads to better results. The effectiveness of deep learning is therefore due in large part to the feature learning of the first step.
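The two-step procedure can be sketched end to end, with tiny autoencoders standing in for the pretrained layers (an illustrative simplification: untied weights, plain batch gradient descent, and synthetic data; the name `pretrain_layer` and all sizes are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, lr=0.5, epochs=300):
    # Step 1 (unsupervised): train one autoencoder layer so that the hidden
    # code can reconstruct its own input -- no labels are used here.
    n_in = X.shape[1]
    We = rng.standard_normal((n_in, n_hidden)) * 0.1  # encoder
    Wd = rng.standard_normal((n_hidden, n_in)) * 0.1  # decoder
    be, bd = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ We + be)
        R = sigmoid(H @ Wd + bd)            # reconstruction of X
        dR = (R - X) * R * (1 - R)          # squared-error gradient
        dH = (dR @ Wd.T) * H * (1 - H)
        Wd -= lr * H.T @ dR / len(X); bd -= lr * dR.mean(0)
        We -= lr * X.T @ dH / len(X); be -= lr * dH.mean(0)
    return We, be

# synthetic data: labels exist, but step 1 ignores them
X = rng.random((64, 8))
y = (X.sum(axis=1) > 4).astype(float)[:, None]

W1, b1 = pretrain_layer(X, 6)               # train layer 1 on the raw input
H1 = sigmoid(X @ W1 + b1)
W2, b2 = pretrain_layer(H1, 4)              # layer n-1's output feeds layer n

# Step 2 (supervised fine-tuning): add an output layer and backpropagate the
# labelled error through the whole stack, starting from pretrained weights.
W3, b3 = rng.standard_normal((4, 1)) * 0.1, np.zeros(1)
lr = 0.5
for _ in range(500):
    H1 = sigmoid(X @ W1 + b1)
    H2 = sigmoid(H1 @ W2 + b2)
    out = sigmoid(H2 @ W3 + b3)
    d3 = out - y                            # cross-entropy + sigmoid output
    d2 = (d3 @ W3.T) * H2 * (1 - H2)
    d1 = (d2 @ W2.T) * H1 * (1 - H1)
    W3 -= lr * H2.T @ d3 / len(X); b3 -= lr * d3.mean(0)
    W2 -= lr * H1.T @ d2 / len(X); b2 -= lr * d2.mean(0)
    W1 -= lr * X.T @ d1 / len(X); b1 -= lr * d1.mean(0)

acc = ((out > 0.5).astype(float) == y).mean()
print(f"training accuracy after fine-tuning: {acc:.2f}")
```

The point of the sketch is the structure, not the numbers: `pretrain_layer` never sees labels (step 1), and the fine-tuning loop starts from the pretrained `W1`, `W2` rather than from random values (step 2).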