Deep Learning Series (15): Supervised and Unsupervised Training

1. Preface

In learning deep learning, I mainly referred to four resources: National Taiwan University's machine learning techniques open course; Andrew Ng's deep learning tutorial; Li Feifei's CNN tutorial; and the Caffe official website tutorial.

Comparing these materials, I ran into a point of confusion: Andrew Ng's tutorial devotes a lot of space to unsupervised self-encoding (autoencoder) neural networks, yet they are hardly mentioned in Li Feifei's tutorial or in the Caffe implementation. I could not understand the reason for this until I looked a little into the history of deep learning.

The development of deep learning can be broadly divided into several periods: the germination period, from the invention of the BP algorithm (1970s-1980s) to 2006; the rapid development period, starting in 2006 when the stacked autoencoder + BP fine-tuning approach was proposed; and the outbreak period, starting in 2012 when the Hinton team's AlexNet model achieved astonishing results in the ImageNet competition.

2. Germination Period

The Nature review article on deep learning by Yann LeCun, Yoshua Bengio, and Geoffrey Hinton mentions that during this period neural network models were abandoned by mainstream computer vision and academia.

During this time, scholars tried to train deep neural networks with supervised learning, but the methods were not very effective and the field ran into trouble. Andrew Ng's tutorial gives some of the reasons:

- Data acquisition problem. Supervised training relies on labeled data, and labeled data is often scarce, so for many problems it is difficult to obtain enough samples to fit the parameters of a complex model. Given the strong expressive power of deep networks, training on insufficient data leads to overfitting.
- Local extremum problem. Using supervised learning to train a shallow network (with only one hidden layer) usually lets the parameters converge to a reasonable solution, but the same method does not work well for deep networks. In particular, training a neural network with supervised learning means solving a highly non-convex optimization problem, and for deep networks the search space of this problem is flooded with "bad" local extrema, so gradient descent (or conjugate gradient descent, L-BFGS, etc.) performs poorly.
- Gradient dispersion problem. The technical reason why gradient descent performs poorly on a deep network with randomly initialized weights is that the gradient becomes very small. Specifically, when backpropagation is used to compute the derivatives, the magnitude of the backpropagated gradient decreases sharply from the output layer toward the first layers as the network gets deeper. As a result, the derivative of the overall loss function with respect to the weights of the first layers is very small, so under gradient descent those weights change very slowly and the first layers cannot learn effectively from the samples. This problem is often called "gradient dispersion" (vanishing gradients); a small numerical sketch of the effect follows this list.
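The following is a minimal numerical sketch of gradient dispersion, my own illustration rather than anything from the tutorials above: the network depth, layer width, weight scale, and the use of NumPy are all arbitrary assumptions. It pushes a unit gradient backward through a stack of sigmoid layers and prints how quickly its magnitude shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth, width = 20, 64                      # arbitrary depth and layer width
x = rng.normal(size=(width, 1))

# Forward pass through `depth` sigmoid layers with small random weights.
weights = [rng.normal(scale=0.1, size=(width, width)) for _ in range(depth)]
activations = [x]
for W in weights:
    activations.append(sigmoid(W @ activations[-1]))

# Backward pass: push a unit gradient from the output toward the input
# and watch its magnitude collapse layer by layer.
grad = np.ones((width, 1))
for W, a in zip(reversed(weights), reversed(activations[1:])):
    grad = W.T @ (grad * a * (1.0 - a))    # sigmoid'(z) = a * (1 - a)
    print(f"mean |grad| = {np.abs(grad).mean():.3e}")
```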

Because there was no effective way to solve these problems, the development of deep neural networks stayed tepid for a long time. After these problems were raised, for example in Hochreiter et al.'s 2001 article "Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies", neural network research fell into a trough for several years. The popular machine learning algorithms of those years were SVMs and ensemble models (random forests, AdaBoost, etc.), as shown in the following figure.


3. Rapid Development Period

In 2006, Hinton published the article "Reducing the Dimensionality of Data with Neural Networks" in Science, which proposed a stacked autoencoder + BP fine-tuning solution to the three deep learning problems mentioned above. To some extent, it addresses all three: the autoencoder network is an unsupervised learning algorithm, so a large number of labeled samples is not needed. After autoencoder pretraining has moved the parameters to a good region, BP fine-tuning starts from that region, so the local extremum problem is less of a worry. Autoencoder pretraining also gives the first layers of the deep network real expressive power, for example the ability to extract image edges and local shapes, so even if gradient dispersion occurs and the first few layers' parameters are barely updated, the final expressive power of the deep network is not badly affected.
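As a concrete illustration of this recipe, here is a hedged sketch of greedy layer-wise autoencoder pretraining followed by supervised BP fine-tuning. It is my own toy example, not Hinton's code or anything from the tutorials: the layer sizes, learning rates, random stand-in data, and the choice of PyTorch are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sizes = [784, 256, 64]                      # input dim and two hidden layers (arbitrary)
encoders = [nn.Linear(a, b) for a, b in zip(sizes, sizes[1:])]

def pretrain_layer(encoder, data, epochs=5):
    """Train one layer as an autoencoder: encode, decode, minimize reconstruction error."""
    decoder = nn.Linear(encoder.out_features, encoder.in_features)
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.SGD(params, lr=0.1)
    for _ in range(epochs):
        recon = decoder(torch.sigmoid(encoder(data)))
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(encoder(data)).detach()   # features fed to the next layer

# Unsupervised stage: greedily pretrain each layer on the previous layer's features.
x = torch.rand(512, sizes[0])               # stand-in for (unlabeled) training data
features = x
for enc in encoders:
    features = pretrain_layer(enc, features)

# Supervised stage: stack the pretrained encoders, add a classifier, fine-tune with BP.
y = torch.randint(0, 10, (512,))            # stand-in labels for a 10-class task
model = nn.Sequential(encoders[0], nn.Sigmoid(),
                      encoders[1], nn.Sigmoid(),
                      nn.Linear(sizes[-1], 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(5):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

In the unsupervised stage each layer only has to reconstruct its own input, so no labels are needed; the supervised stage then starts BP from the pretrained weights instead of a random initialization.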

For these reasons, after the neural network trough around 2001, deep learning started a new wave and entered the fast lane of development, as can be clearly seen from the red line in the figure above.

4. Outbreak Period

In the 2012 ILSVRC competition, the Hinton team's AlexNet model ("ImageNet Classification with Deep Convolutional Neural Networks") reduced the top-5 error rate on the 1000-class classification task to 15.3%, crushing the second-place entry's 26.2%, which used an SVM-based approach, and opened the deep learning revolution. Since then, deep learning has been on an exponential development path. Among the CVPR 2015 papers, I focused on two directions, scene semantic labeling and salient object detection, and a significant proportion of the papers involved CNNs or the word "deep"; the proportion of deep learning papers at next year's CVPR will presumably be even higher. The industry's enthusiasm needs no mention: from the three giants Yann LeCun, Yoshua Bengio, and Geoffrey Hinton to leading researchers in the vision field such as Mishing and Li Feifei, many have been recruited by Internet companies.

Returning to the Hinton team's AlexNet model: only supervised training was used, apparently without any unsupervised pretraining. It is not that the many problems with supervised deep learning training had disappeared; rather, the following factors made purely supervised training feasible:

- The advent of large-scale labeled data. The dataset used in ILSVRC includes 1.2 million training images, 50,000 validation images, and 150,000 test images, all labeled with 1000 categories. Labeled data of this scale did not exist before ImageNet appeared.
- On the local extremum problem, the three authors of the Nature review argue that for deep networks local extrema are never really a problem: starting from almost any initial parameter values, training arrives at a similar classification performance. This has also been supported by recent theory and practice.
- The slow convergence caused by gradient dispersion remains an issue, but the AlexNet model had two great weapons: the ReLU activation function and GPU parallel acceleration. The former made SGD about 6 times faster, and the latter, using two GTX 580 GPUs, also greatly accelerated SGD's convergence. With the two effects multiplied, unsupervised pretraining became almost superfluous and gradient dispersion was no longer a big problem. (A small sketch contrasting ReLU and sigmoid gradients follows this list.)
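Here is that sketch, again my own illustration with an arbitrary depth and arbitrary inputs: the sigmoid derivative never exceeds 0.25, so stacking layers keeps multiplying the backpropagated gradient by small factors, while the ReLU derivative is exactly 1 on every active unit.

```python
import numpy as np

z = np.linspace(-4.0, 4.0, 9)
sig = 1.0 / (1.0 + np.exp(-z))
d_sigmoid = sig * (1.0 - sig)            # never exceeds 0.25 (maximum at z = 0)
d_relu = (z > 0).astype(float)           # exactly 1 for every active unit

depth = 20
print("best-case sigmoid factor over 20 layers:", 0.25 ** depth)   # ~9.1e-13
print("ReLU factor over 20 layers (active path):", 1.0 ** depth)   # 1.0
print("sigmoid'(z):", np.round(d_sigmoid, 3))
print("relu'(z):   ", d_relu)
```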

5. Summary

As can be seen from the above, Andrew Ng's tutorial is a product of the 2006-2012 period, when unsupervised pretraining was mainstream. Li Feifei's CNN tutorial and the Caffe official website tutorial were produced after 2012, by which time the datasets were large enough (millions of labeled images), the models were advanced enough (ReLU activation, dropout, and so on), and computation was fast enough (GPU acceleration), so unsupervised pretraining (autoencoder networks) lost its value in many applications, and supervised training alone is sufficient to complete the task.

In a nutshell, unsupervised pretraining in 2006 opened the era of deep learning; in the course of deep learning's rapid development, the availability of big data, the progress of computer hardware, and the improvement of deep models brought supervised training back to center stage, and unsupervised pretraining completed its historical mission.

Is pretraining still useful, then? The answer is yes. For example, if we have a classification task with a very small database, pretraining is necessary to avoid overfitting the deep model; but that pretraining is done with supervised training on a large database (such as ImageNet). This pattern of supervised pretraining plus fine-tuning on a small database is called transfer learning, and it is described in detail in Li Feifei's CNN tutorial and Caffe's official website tutorials.
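Below is a hedged sketch of that transfer-learning pattern. It is not the Caffe example from those tutorials; it uses a torchvision ResNet-18 pretrained on ImageNet, and the 10-class head, learning rate, and stand-in data are assumptions (downloading the pretrained weights requires torchvision and network access).

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained backbone (weights are downloaded on first use).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained feature extractor so the small dataset only adapts the head.
for p in model.parameters():
    p.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)   # new head for a hypothetical 10-class task

opt = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
x = torch.rand(8, 3, 224, 224)                   # stand-in batch of images
y = torch.randint(0, 10, (8,))                   # stand-in labels
model.train()
for _ in range(3):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

Only the new head is trained here; when the small dataset allows it, the usual next step is to unfreeze some later backbone layers and fine-tune them with a smaller learning rate.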

In addition, Andrew Ng's tutorial contains several other details that were common before 2012 but are now rarely used. For example, the activation function it describes is the sigmoid, which is now rare and has been almost entirely replaced by ReLU, and the optimization algorithm is L-BFGS, whereas the mainstream today is SGD with momentum (sketched below). These differences between the tutorials were very confusing while I was learning, until I understood the history of deep learning and where the differences came from.
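For reference, SGD with momentum is just the classical momentum update rule; the sketch below is my own, with an arbitrary toy loss and hyperparameters.

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.01, mu=0.9):
    """One update of the classical momentum rule: v <- mu*v - lr*grad, then w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

# Toy quadratic loss ||w||^2, whose gradient is 2*w.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for step in range(3):
    w, v = sgd_momentum_step(w, 2.0 * w, v)
    print(step, w)
```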
