(vi) 6.12 Neural Networks: From Self-Taught Learning to Deep Networks


Self-taught learning uses completely unlabeled data for feature extraction. When labeled data is also available, it can be combined with supervised learning: the parameters obtained in the unsupervised step are fine-tuned to produce a more accurate model.

In self-taught learning, a sparse autoencoder is first trained on the unlabeled data. The trained autoencoder is then applied to a raw input x to compute the hidden-layer feature a:
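
As an illustration, here is a minimal NumPy sketch of this feature-extraction step; the weights W1 and b1 stand for hypothetical parameters already learned by the sparse autoencoder, and a sigmoid activation is assumed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(x, W1, b1):
    # Hidden-layer activation a = f(W1 x + b1) of a trained sparse autoencoder.
    return sigmoid(W1 @ x + b1)
```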

For a classification problem, the goal is to predict the class label y of a sample, and the labeled dataset consists of pairs (x, y). As stated above, the features a obtained from the sparse autoencoder can be used in place of the original features x, which yields a new training set of pairs (a, y). Finally, a logistic classifier is trained to map these features to class labels; in the original figure, this classifier is drawn as a logistic regression unit (orange).

Combining these two steps gives the overall model:
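
Continuing the sketch above (the classifier weights W2 and b2 and the softmax output are assumptions for illustration), the combined model first computes the autoencoder features and then applies the classifier to them:

```python
def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

def predict(x, W1, b1, W2, b2):
    a = extract_features(x, W1, b1)  # layer 1: features from the sparse autoencoder
    return softmax(W2 @ a + b2)      # layer 2: class probabilities from the classifier
```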

The parameters of the model are trained in two steps. In the first layer of the network, the weights that map the input to the hidden-unit activations are obtained by training the sparse autoencoder. In the second layer, the weights that map the hidden units to the output are obtained by logistic regression or softmax regression.
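
A hedged sketch of this two-step procedure, reusing the sigmoid function from the sketch above and assuming W1 and b1 have already been obtained by training a sparse autoencoder on the unlabeled data (that training itself is omitted here); scikit-learn's LogisticRegression is used for the second step:

```python
from sklearn.linear_model import LogisticRegression

def train_classifier(X_labeled, y, W1, b1):
    # Step 1 (already done): W1, b1 come from the sparse autoencoder.
    # Replace the raw inputs with the learned features.
    A = sigmoid(X_labeled @ W1.T + b1)
    # Step 2: train a logistic/softmax classifier on the features.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(A, y)
    return clf
```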

Taken as a whole, this final classifier is clearly a large neural network. Therefore, after training to obtain the initial model parameters (using the autoencoder to train the first layer and logistic/softmax regression to train the second layer), we can further adjust the parameters to reduce the training error. In particular, we can fine-tune them: starting from the existing parameters, gradient descent or L-BFGS is run to reduce the training error on the labeled sample set.
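
A minimal PyTorch sketch of the fine-tuning step, assuming NumPy arrays W1, b1 (from the autoencoder) and W2, b2 (from the logistic/softmax regression) are available; the stacked model is treated as one network and all parameters are updated by gradient descent on the labeled set:

```python
import torch
import torch.nn as nn

def fine_tune(W1, b1, W2, b2, X_labeled, y, lr=0.01, epochs=100):
    hidden = nn.Linear(W1.shape[1], W1.shape[0])
    output = nn.Linear(W2.shape[1], W2.shape[0])
    # Initialize the network from the pre-trained parameters.
    with torch.no_grad():
        hidden.weight.copy_(torch.as_tensor(W1, dtype=torch.float32))
        hidden.bias.copy_(torch.as_tensor(b1, dtype=torch.float32))
        output.weight.copy_(torch.as_tensor(W2, dtype=torch.float32))
        output.bias.copy_(torch.as_tensor(b2, dtype=torch.float32))
    model = nn.Sequential(hidden, nn.Sigmoid(), output)

    X = torch.as_tensor(X_labeled, dtype=torch.float32)
    t = torch.as_tensor(y, dtype=torch.long)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()          # softmax + negative log-likelihood

    for _ in range(epochs):                  # plain gradient descent on the labeled set
        optimizer.zero_grad()
        loss = loss_fn(model(X), t)
        loss.backward()
        optimizer.step()
    return model
```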

When fine-tuning is used, the initial training steps (i.e., training the autoencoder and the logistic classifier) are often referred to as pre-training. The effect of fine-tuning is that the labeled dataset is also used to adjust the weights, so that the features extracted by the hidden units can be further refined.

It is important to note that fine-tuning is usually worthwhile only when a large amount of labeled training data is available; in that case it can significantly improve the classifier's performance. If there is a large amount of unlabeled data (for unsupervised feature learning/pre-training) but only a relatively small labeled training set, the benefit of fine-tuning is very limited.

The networks discussed so far have generally had three layers. We now gradually move to multilayer networks: by adding more layers, the network can compute increasingly complex features of the input. Because each hidden layer applies a nonlinear transformation to the output of the previous layer, deep neural networks have more expressive power than "shallow" networks (for example, they can learn more complex functional relationships).

Note that the network must use nonlinear activation functions. A linear activation has limited expressive power: the composition of several linear functions is still linear, so with linear activations, adding more layers does not increase the network's expressive power.
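
This can be verified directly: composing two linear layers yields a single linear map, so stacking them adds nothing. A small NumPy check (the weights are random values used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2          # two stacked linear layers
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)    # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))     # True
```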

Benefits of deepening your network:

First, deepening the network by one layer can multiply its expressive power. For example, there are functions that a K-layer neural network can represent with a number of nodes per layer that is polynomial in the input size, but that a (K-1)-layer network can only represent with an exponentially larger number of nodes.

Second, the features learned at different layers of the network rise gradually from low level to high level. For example, in image learning, the first hidden layer may learn edge features, the second hidden layer may learn contours, and later layers may learn higher-level features such as parts of the target object. In other words, lower hidden layers learn low-level features and higher hidden layers learn high-level features.

Earlier learning algorithms used by researchers would randomly initialize the weights of a deep network and then train it on the labeled training set with a supervised objective.

This has some drawbacks:

First, data acquisition: with the method above, training relies entirely on labeled data. However, labeled data is often scarce, so for many problems it is difficult to obtain enough samples to fit the parameters of a complex model. For example, given the strong expressive power of deep networks, training on insufficient data can lead to overfitting.

Second, training a shallow network (with only one hidden layer) using supervised learning can usually make the parameters converge to a reasonable range, but when the same method is used to train a deep network, the results are often poor. Training a neural network with supervised learning involves solving a highly non-convex optimization problem (for example, minimizing the training error as a function of the network weights). For deep networks, the search space of this non-convex problem is riddled with "bad" local optima, so gradient descent (or conjugate gradient, L-BFGS, etc.) does not work well.

Third, gradient diffusion. The technical reason why gradient descent (and related methods such as L-BFGS) works poorly on deep networks with randomly initialized weights is that the gradients become very small. When backpropagation is used to compute the derivatives, the magnitude of the backpropagated gradient (from the output layer back to the first layers of the network) shrinks drastically as the depth increases. As a result, the derivative of the overall loss with respect to the weights of the first layers is very small, so under gradient descent those weights change very slowly and the early layers fail to learn effectively from the samples. This problem is often referred to as "gradient diffusion" (the vanishing gradient problem).
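
A small PyTorch sketch to observe this (the depth, width, and loss are arbitrary choices for illustration): in a deep stack of sigmoid layers, the gradient norm at the first layer is typically orders of magnitude smaller than at the last layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 20, 50
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*layers)

x = torch.randn(32, width)
loss = model(x).pow(2).mean()   # arbitrary scalar loss, just to backpropagate something
loss.backward()

print("first-layer grad norm:", model[0].weight.grad.norm().item())
print("last-layer grad norm: ", model[-2].weight.grad.norm().item())
```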

A problem closely related to gradient diffusion is that when the last layers of a neural network contain enough neurons, those layers alone may be sufficient to model the labeled data without any help from the first layers. Therefore, a network whose layers are all trained from random initialization tends to perform similarly to a shallow network built only from the last few layers of the deep network.

Finally, with the development of deep learning, pre-training has become less important: researchers have found that as long as there is enough data, end-to-end training of a deep network can reach a good solution, and architectures that are not plain fully connected stacks, such as LSTMs or CNNs, are difficult to pre-train in this way.
