Deep Learning (3) Analysis of a single-layer unsupervised learning network
Zouxy09@qq.com
http://blog.csdn.net/zouxy09
I read papers from time to time, but I always feel that I slowly forget them afterwards, as if I had never read them at all. So I want to summarize the useful knowledge points from the papers I read. On the one hand this deepens my own understanding, and on the other hand it makes the material easier to look up later. I also share it on my blog so that others can discuss it with me. Because my background is limited, some of my understanding of the paper may be incorrect; I hope readers will not hesitate to point out mistakes. Thank you.
The paper discussed in this post is:
An Analysis of Single-Layer Networks in Unsupervised Feature Learning, Adam Coates, Honglak Lee, and Andrew Y. Ng. In AISTATS 14, 2011. The authors also provide demo code for the paper, but I have not gone through it yet.
The following is my understanding of some of the knowledge points in the paper:
An analysis of single-layer networks in unsupervised feature learning
Recently, many studies have focused on algorithms that learn features from unlabeled data, achieving good results on several benchmark datasets by adopting increasingly complex unsupervised algorithms and deeper models. This paper shows that some simple factors, such as the number of hidden nodes in the model, can matter more for reaching high performance than the choice of learning algorithm or the depth of the model. The paper analyzes a single-hidden-layer network structure and compares four feature learning algorithms: sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures. It also analyzes four factors that affect performance: the receptive field size, the number of hidden nodes (i.e., the number of features to learn), the convolution step size (stride), and whitening.
Finally, the author draws the following conclusions:
1. The number of hidden nodes in the network (the number of features to learn), the sampling density (i.e., the stride used when computing features by convolution, in other words where features are computed), and the receptive field size have a large impact on the final feature extraction performance, even more important than the number of layers in the network.
2. Whitening is a necessary preprocessing step.
3. Regardless of which unsupervised learning algorithm is chosen, whitening, a large number of features, and a small stride all lead to better performance.
4. Among the four algorithms compared, K-means works best. The author's final recommendation is therefore to whiten the data, learn as many features per layer as possible, and sample the data as densely as possible (small stride).
I. Overview
One of the main disadvantages of many feature learning systems is their complexity and cost. In addition, many hyper-parameters need to be tuned, for example learning rates, momentum (needed for RBMs), sparsity penalties, and weight decay, and their final values are usually chosen by cross-validation. Training such a structure already takes a long time, and having so many parameters to cross-validate makes it even slower. That is why the conclusion of this paper is appealing: K-means is very effective and has essentially none of these hyper-parameters to tune. (For an analysis of K-means, see "Deep Learning paper notes (1): K-means feature learning".)
II. Unsupervised feature learning framework
1. A feature representation is learned with the following steps (steps 1 and 2 are sketched in code after this list):
1) Randomly extract small patches from the unlabeled training images;
2) Preprocess these patches (subtract the mean of each patch, i.e., remove the DC component, and divide by the standard deviation to normalize; for images this amounts to local brightness and contrast normalization), and then whiten them;
3) Use an unsupervised learning algorithm to learn the feature mapping, i.e., the function that maps an input to a feature vector.
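Below is a minimal sketch of steps 1) and 2) in Python/NumPy. The patch size, the number of patches, and the eps constant are illustrative choices, not values from the paper; whitening is sketched separately in section V.

```python
import numpy as np

def extract_random_patches(images, num_patches=10000, patch_size=6, rng=None):
    """Randomly crop small square patches from unlabeled images.
    images: (num_images, height, width); returns (num_patches, patch_size**2)."""
    rng = rng or np.random.default_rng(0)
    n, h, w = images.shape
    patches = np.empty((num_patches, patch_size * patch_size))
    for i in range(num_patches):
        img = images[rng.integers(n)]
        r = rng.integers(h - patch_size + 1)
        c = rng.integers(w - patch_size + 1)
        patches[i] = img[r:r + patch_size, c:c + patch_size].ravel()
    return patches

def normalize_patches(patches, eps=1e-8):
    """Per-patch normalization: subtract the mean (remove the DC component)
    and divide by the standard deviation (local brightness/contrast
    normalization). eps guards against division by zero."""
    patches = patches - patches.mean(axis=1, keepdims=True)
    return patches / (patches.std(axis=1, keepdims=True) + eps)
```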
2. After the feature mapping function has been learned, given a labeled training image set, we use it to extract features and then train a classifier (see the sketch after this list):
1) For each image, apply the learned feature mapping to every sub-patch of the image (convolutionally) to obtain the features of that image;
2) Pool the resulting convolutional feature maps to reduce the number of features and to gain some translation invariance;
3) Train a linear classifier on the pooled features and the corresponding labels, and use it to predict labels for new inputs.
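A minimal sketch of this feature extraction and pooling pipeline, assuming f is any learned feature mapping that takes a flattened patch and returns a 1-D NumPy array of length K. Following the paper, pooling here sums over the four image quadrants; the patch size and stride defaults are illustrative.

```python
import numpy as np

def extract_convolutional_features(image, f, patch_size=6, stride=1):
    """Apply the learned feature mapping f to every (strided) sub-patch
    of the image, producing a (rows, cols, K) feature map."""
    h, w = image.shape
    rows = (h - patch_size) // stride + 1
    cols = (w - patch_size) // stride + 1
    # probe f once to find the feature dimension K
    k = f(image[:patch_size, :patch_size].ravel()).shape[0]
    feats = np.empty((rows, cols, k))
    for i in range(rows):
        for j in range(cols):
            patch = image[i * stride:i * stride + patch_size,
                          j * stride:j * stride + patch_size].ravel()
            feats[i, j] = f(patch)
    return feats

def pool_quadrants(feats):
    """Sum-pool the feature map over the four image quadrants,
    giving a fixed-length 4*K vector per image."""
    rows, cols, k = feats.shape
    r2, c2 = rows // 2, cols // 2
    quads = [feats[:r2, :c2], feats[:r2, c2:], feats[r2:, :c2], feats[r2:, c2:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])
```

The pooled 4K-dimensional vectors, together with the image labels, can then be fed into any linear classifier (for example a linear SVM).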
III. Feature learning
After preprocessing the data, we can use an unsupervised learning algorithm to learn features. We can regard the unsupervised learning algorithm as a black box: it accepts an input and produces an output. It can be represented as a function f: R^n -> R^k that maps an n-dimensional input vector x(i) to a k-dimensional feature vector. Here, we compare and analyze four different unsupervised learning algorithms:
1) sparse auto-encoders
We use backpropagation to train an auto-encoder with K hidden nodes. The cost function is the reconstruction mean squared error plus a penalty term that encourages the hidden nodes to maintain a low average activation. The algorithm outputs a weight matrix W (K x n) and a bias vector b (K-dimensional), and the feature mapping function is f(x) = g(Wx + b), where g(z) = 1/(1 + exp(-z)) is the sigmoid function applied element-wise to the vector z.
Here many parameters need to be tuned, such as the weight decay coefficient and the target activation value. For a given receptive field size, these parameters are selected by cross-validation.
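A minimal sketch of the feature mapping and of one plausible form of the cost, assuming a KL-divergence sparsity penalty on the mean hidden activations (the paper only specifies "a sparsity penalty"; the penalty form and all coefficients below are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_feature_mapping(W, b):
    """Given weights W (K x n) and biases b (K,) from a trained sparse
    auto-encoder, return the feature mapping f(x) = g(Wx + b)."""
    return lambda x: sigmoid(W @ x + b)

def autoencoder_cost(W, b, W_dec, b_dec, X,
                     weight_decay=1e-4, sparsity_target=0.05, sparsity_weight=3.0):
    """Reconstruction MSE plus weight decay plus a KL-divergence sparsity
    penalty on the mean hidden activations.
    W: (K, n) encoder, b: (K,), W_dec: (n, K) decoder, b_dec: (n,),
    X: (m, n) mini-batch of preprocessed patches."""
    H = sigmoid(X @ W.T + b)                 # hidden activations, (m, K)
    X_hat = H @ W_dec.T + b_dec              # linear reconstruction, (m, n)
    mse = np.mean(np.sum((X_hat - X) ** 2, axis=1))
    decay = weight_decay * (np.sum(W ** 2) + np.sum(W_dec ** 2))
    rho_hat = H.mean(axis=0)                 # mean activation of each hidden unit
    rho = sparsity_target
    kl = np.sum(rho * np.log(rho / rho_hat) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return mse + decay + sparsity_weight * kl
```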
2) sparse restricted Boltzmann machines
An RBM is an undirected graphical model with K binary hidden random variables. Sparse RBMs can be trained with the contrastive divergence approximation, using the same kind of sparsity penalty as the auto-encoder above. Training again yields a weight matrix W and biases b, so we can use exactly the same feature mapping function as the auto-encoder; only the training procedure is completely different.
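For reference, a minimal sketch of one CD-1 update for an RBM with binary hidden units and real-valued visible units; the sparsity and weight-decay terms are omitted for brevity, and the learning rate is an illustrative choice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b_hid, b_vis, X, lr=0.01, rng=None):
    """One contrastive-divergence (CD-1) update.
    W: (K, n), b_hid: (K,), b_vis: (n,), X: (m, n) mini-batch."""
    rng = rng or np.random.default_rng(0)
    m = X.shape[0]
    # positive phase
    p_h = sigmoid(X @ W.T + b_hid)                     # (m, K)
    h = (rng.random(p_h.shape) < p_h).astype(float)    # sample hidden states
    # negative phase (one Gibbs step, mean-field visible reconstruction)
    v_recon = h @ W + b_vis                            # (m, n)
    p_h_recon = sigmoid(v_recon @ W.T + b_hid)
    # parameter updates from the difference of correlations
    W += lr * (p_h.T @ X - p_h_recon.T @ v_recon) / m
    b_hid += lr * (p_h - p_h_recon).mean(axis=0)
    b_vis += lr * (X - v_recon).mean(axis=0)
    return W, b_hid, b_vis
```

After training, features are extracted with the same mapping as before, f(x) = g(Wx + b_hid).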
3) k-means clustering
We use the K-means clustering algorithm to learn K cluster centers c(k) from the input data. After learning these K centers, there are two possible feature mappings f. The first is the standard 1-of-K hard assignment encoding: f_k(x) = 1 if k = argmin_j ||c(j) - x||^2, and f_k(x) = 0 otherwise.
Here f_k(x) denotes the k-th element of the feature vector f of sample x, i.e., the k-th feature component. Why is it a hard assignment? Because each sample can belong to only one cluster; that is, each sample x has exactly one 1 in its feature vector and all other entries are 0. There are K clusters with K centers, and whichever center sample x is closest to, the corresponding entry is 1 and all the others are 0. This is an extreme case of sparse coding, with very high sparsity.
The second approach is a nonlinear mapping, a soft encoding: f_k(x) = max{0, μ(z) - z_k}.
Here z_k = ||x - c(k)||_2 and μ(z) is the mean of the elements of the vector z, i.e., the average distance from the sample to the K cluster centers. If the distance from the sample to a cluster center is greater than this average, the corresponding feature is set to 0; if it is smaller, the feature is the difference between the average and that distance. In this way roughly half of the features become 0, so the code is still sparse, and centers closer to the sample receive larger values. Why is it a soft assignment? Because it says the sample belongs to several clusters to different degrees, a bit like fuzzy clustering. The hard assignment only looks at the nearest center: the sample belongs to that cluster and has nothing to do with the others.
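A minimal sketch of both K-means encodings above in Python/NumPy, assuming the cluster centers have already been learned:

```python
import numpy as np

def kmeans_hard_encoding(X, centers):
    """1-of-K hard assignment: a 1 at the nearest center, 0 elsewhere.
    X: (m, n) samples, centers: (K, n)."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (m, K)
    F = np.zeros_like(dists)
    F[np.arange(X.shape[0]), dists.argmin(axis=1)] = 1.0
    return F

def kmeans_triangle_encoding(X, centers):
    """Soft 'triangle' encoding: f_k(x) = max(0, mean(z) - z_k),
    where z_k is the distance from x to center k."""
    z = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)      # (m, K)
    mu = z.mean(axis=1, keepdims=True)                                   # mean distance
    return np.maximum(0.0, mu - z)
```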
4) Gaussian mixtures
The Gaussian mixture model (GMM) uses K Gaussian distributions to describe the density of the input data. GMMs can be trained with the EM algorithm; it is usually necessary to run K-means first to initialize the K Gaussian components, which helps avoid poor local minima. The feature mapping f maps each input sample x to the posterior probabilities f_k(x) = φ_k N(x; c(k), Σ_k) / Σ_j φ_j N(x; c(j), Σ_j), where N(x; c(k), Σ_k) is the Gaussian density with mean c(k) and covariance Σ_k.
Σ_k is a diagonal covariance matrix and φ_k is the prior probability of component k learned by EM. This is somewhat like the soft assignment of K-means above: we compute the probability that each sample x belongs to each component, and these posterior probabilities are used as the feature vector encoding x.
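A minimal sketch of this posterior encoding for a diagonal-covariance GMM, computed in log-space for numerical stability; the means, variances, and priors are assumed to come from a previously fitted GMM:

```python
import numpy as np

def gmm_posterior_encoding(X, means, variances, priors):
    """Posterior responsibilities under a diagonal-covariance GMM.
    X: (m, n); means: (K, n); variances: (K, n) diagonal entries of each
    Sigma_k; priors: (K,) mixing weights phi_k learned by EM."""
    diff = X[:, None, :] - means[None, :, :]                       # (m, K, n)
    log_det = np.sum(np.log(variances), axis=1)                    # (K,)
    quad = np.sum(diff ** 2 / variances[None, :, :], axis=2)       # (m, K)
    n = X.shape[1]
    log_pdf = -0.5 * (n * np.log(2 * np.pi) + log_det[None, :] + quad)
    log_post = np.log(priors)[None, :] + log_pdf
    # normalize: f_k(x) = phi_k N_k / sum_j phi_j N_j
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```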
IV. Feature extraction and classification
Through the steps above we obtain a function f that maps an input patch x (n-dimensional) to a new representation y = f(x) (k-dimensional). At this point we can use this feature extractor on labeled image data to extract features and train the classifier.
This procedure is described in many places, so I will not repeat it at length here; see "Convolutional Feature Extraction" and "Pooling" in UFLDL.
V. Whitening
For sparse auto-encoders and RBMs, whitening does not seem to matter much. When the number of features to learn is small (few hidden nodes), whitening helps sparse RBMs somewhat, but with a large number of features the benefit of whitening fades. For clustering algorithms, however, whitening is a key and essential preprocessing step, because clustering algorithms are blind to the correlations in the data; we therefore need to remove those correlations by whitening before clustering. The paper's visualizations show that more detailed features are learned after whitening, and that after whitening several of the algorithms learn Gabor-filter-like features. So such features are not something that only deep architectures can learn.
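A minimal sketch of ZCA whitening, the variant commonly used for image patches (the regularization constant eps is an illustrative choice):

```python
import numpy as np

def zca_whiten(X, eps=0.1):
    """ZCA whitening: decorrelate the data dimensions and give them
    (roughly) unit variance. X: (m, n) patches, one per row."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # rotate into the eigenbasis, rescale each direction, rotate back
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, mean, W
```

New patches are whitened with (x - mean) @ W, reusing the mean and W computed from the training patches.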
This post also draws on the blog post "Deep Learning: 20 (An analysis of single-layer networks in unsupervised feature learning)".