Hinton "Reducing the dimensionality of Data with neural Networks" Reading Note


In 2006, Geoffrey Hinton, a professor of computer science at the University of Toronto, published an article in Science on an unsupervised, layer-wise greedy training algorithm based on deep belief networks (DBNs), which brought new hope for training deep neural networks.
If Hinton's 2006 Science paper [1] merely set off a craze for deep learning in academia, then the scramble in recent years by major companies to recruit top talent from academia into industry marks deep learning's arrival on a truly practical stage, where it will profoundly affect a range of products and services and become the powerful technology engine behind them.
Deep learning is now gaining ground in several key areas. In speech recognition, replacing the Gaussian mixture model (GMM) in the acoustic model with a deeper model yields a relative error-rate reduction of about 30%. In image recognition, a deep convolutional neural network (CNN) [3] reduced the top-5 error rate from 26% to 15%, and enlarging the network structure reduced it further to 11%. In natural language processing, deep learning obtains roughly the same results as other methods but avoids tedious feature-engineering steps. So far, deep learning is the machine learning approach that comes closest to the human brain.

The difficulty of training deep models

The limitation of shallow models lies in their finite parameters and computational units: their ability to represent complex functions is limited, and their generalization on complex classification problems is therefore restricted. Deep models can overcome this weakness, but training them with backpropagation and gradient descent faces several outstanding problems:

    1. Local optima. Unlike the cost function of a shallow model, each neuron in a deep model applies a nonlinear transformation, so the cost function is highly non-convex, and gradient descent easily gets stuck in a local optimum.
    2. Vanishing gradients. When the backpropagation algorithm propagates the gradient, its magnitude shrinks sharply as the propagation depth increases, so the weights of the early layers update very slowly and cannot be learned effectively. The deep model then degenerates: the first few layers stay nearly fixed, and only the last few layers change, behaving like a shallow model (see the sketch after this list).
    3. Data acquisition. Deep models are highly expressive, and the number of parameters grows accordingly. For a model with so many parameters, a small training set is not enough and leads to severe overfitting; massive amounts of labeled data are required.
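
The vanishing-gradient problem can be seen in a few lines of code. Below is a minimal sketch (not from the paper; the depth, width, and small-variance random initialization are illustrative assumptions) that pushes a signal through a stack of sigmoid layers and prints the gradient magnitude as it is backpropagated toward the input; the norm shrinks by orders of magnitude.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    depth, width = 20, 64
    # Hypothetical small-variance random weights, as was common before 2006.
    weights = [rng.normal(0.0, 0.1, (width, width)) for _ in range(depth)]

    # Forward pass, caching activations for the backward pass.
    a = rng.normal(0.0, 1.0, width)
    activations = [a]
    for W in weights:
        a = sigmoid(W @ a)
        activations.append(a)

    # Backward pass from an arbitrary upstream gradient of ones.
    grad = np.ones(width)
    for layer in reversed(range(depth)):
        out = activations[layer + 1]
        grad = weights[layer].T @ (grad * out * (1.0 - out))  # chain rule through the sigmoid
        if layer % 5 == 0:
            print(f"layer {layer:2d}: ||grad|| = {np.linalg.norm(grad):.3e}")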


In 2006, Hinton published the article in Science that set off the wave of deep learning in academia and industry. The article makes two main points:
1. An artificial neural network with many hidden layers has excellent feature-learning ability: the features it learns characterize the data more essentially, which benefits visualization and classification.
Why construct a deep network with so many hidden layers? For many training tasks, features have a natural hierarchy. In speech, images, and text, for example, the hierarchy looks roughly like the table below.

Table 1: Feature hierarchies in several task areas


Taking image recognition as an example, the raw input of an image is pixels; neighboring pixels compose lines, multiple lines form textures, textures form patterns, and patterns form parts of an object, up to the appearance of the whole object. It is not hard to find the connection from the raw input to shallow features, and then, via intermediate features, to reach the high-level features step by step; going directly from the raw input to the high-level features is undoubtedly difficult.
2. The difficulty of training deep neural networks can be effectively overcome by layer-wise pre-training; the article gives an unsupervised layer-wise initialization method.

Figure 7: Method of layer-wise initialization

Given the raw input, we first train the first layer of the model: the black box on the left of the figure. The black box can be seen as an encoder, which encodes the raw input into the first layer's primary features; it plays the role of the model's "cognition". To verify that these features are indeed an abstract representation of the input without losing much information, a corresponding decoder is added: the gray box on the left of the figure, which plays the role of the model's "generation". For cognition and generation to agree, the raw input, after being encoded and then decoded, should roughly recover the raw input. The reconstruction error between the raw input and its encoded-then-decoded output is therefore defined as the cost function, and the encoder and decoder are trained together. When training converges, the encoder is the first layer of the model we want, and the decoder is no longer needed; at this point we have the first-layer abstraction of the raw data. With the first layer fixed, the raw input is mapped to the first-layer abstraction, which serves as input for training the second layer in the same way; then the first two layers are used to train the third layer, and so on, until the highest layer has been trained.
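
As a concrete illustration, here is a minimal PyTorch sketch of the encoder/decoder scheme described above, rendered with plain autoencoders (note that the paper itself pretrains with restricted Boltzmann machines; the random data, layer widths, learning rate, and iteration counts here are all illustrative assumptions, not the paper's settings):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    data = torch.rand(256, 64)            # hypothetical unlabeled inputs in [0, 1]

    sizes = [64, 32, 16]                  # assumed layer widths
    encoders = []
    x = data
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())   # "cognition"
        dec = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())   # "generation"
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=0.5)
        for _ in range(500):              # train this layer's encoder/decoder pair alone
            opt.zero_grad()
            loss = nn.functional.mse_loss(dec(enc(x)), x)   # reconstruction error
            loss.backward()
            opt.step()
        encoders.append(enc)              # keep the encoder, discard the decoder
        x = enc(x).detach()               # fixed features become the next layer's input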
After this initialization is done, we can use labeled data and the backpropagation algorithm to train the whole model with supervision. This step can be seen as fine-tuning the multi-layer model as a whole. Because a deep model has many local optima, where the model starts largely determines how good the final model is; the purpose of layer-wise initialization is to start the model from a position closer to the global optimum, which yields better results.
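
Continuing the sketch above, fine-tuning amounts to stacking the pretrained encoders, adding an output layer (here a hypothetical 10-class classifier head trained on random labels, purely for illustration), and training everything end to end with backpropagation:

    labels = torch.randint(0, 10, (256,))                 # hypothetical labels
    model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(data), labels)
        loss.backward()
        opt.step()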

Comparison of shallow and deep models

Shallow models have one important characteristic: they rely on human experience to extract sample features, so the model's input is these pre-selected features, and the model itself is only responsible for classification and prediction. In a shallow model, what matters most is not the quality of the model but the quality of the feature selection. Most of the human effort is therefore invested in developing and screening features, which requires both a deep understanding of the problem domain and a great deal of time spent on repeated experiments; this also limits how effective shallow models can be.
In fact, the layer-wise initialization of a deep model can itself be seen as a feature-learning process: the hidden layers abstract the raw input step by step, learning the structure of the input data and finding more useful features, which ultimately improves the accuracy of the classification. Once effective features are obtained, the model can be trained as a whole.

Hinton "Reducing the dimensionality of Data with neural Networks" Reading Note
