The topic of this lecture is deep learning. Personally, I find the treatment of deep learning itself fairly shallow; the content is much closer to autoencoders and PCA.
Lin starts by noting that deep learning has attracted a lot of attention in recent years: the idea of a deep nnet is actually quite old, but it was held back by limited hardware computing power and by the lack of good methods for learning the parameters.
There are two main reasons why deep learning has made progress in recent years:
1) pre-training techniques have improved
2) regularization techniques have improved
Next, Lin introduces the motivation behind the autoencoder.
Each hidden layer can be seen as a transformation (a re-encoding) of the original input.
What makes a transformation good? One that does not lose much information: after encoding, the original input can still be (approximately) recovered by decoding.
Therefore, when learning the parameters of a deep nnet, using this kind of autoencoding idea in the pre-training phase seems like a good choice.
Below is an example of an autoencoder. Simply put, it is a single-hidden-layer neural network, with a d-d'-d structure, trained so that its output is as close as possible to its input.
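The slide itself is not reproduced here. As I recall, with tanh hidden units the basic autoencoder hypothesis looks roughly like this (the exact form, and the choice of a linear output layer, are my own reconstruction):

$$
h_k(\mathbf{x}) \;=\; \sum_{j=1}^{d'} w^{(2)}_{jk}\,\tanh\!\Big(\sum_{i=0}^{d} w^{(1)}_{ij}\, x_i\Big), \qquad k = 1,\dots,d, \qquad h(\mathbf{x}) \approx \mathbf{x} .
$$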
What can such an autoencoder do for machine learning?
1) For supervised learning: the hidden layer (structure + weights) of such an information-preserving NN is a reasonable transformation of the original input, which amounts to learning an informative representation of the data.
2) For unsupervised learning: it can be used for density estimation or outlier detection. I don't understand this part very well, perhaps because examples are lacking.
An autoencoder can be regarded as a single-hidden-layer NN and can be trained by backprop; if stronger regularization is desired, the weight-tying constraint w_ij^(1) = w_ji^(2) can be added.
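As a concrete illustration (my own numpy sketch, not code from the lecture), here is a single-hidden-layer autoencoder with the tied-weight constraint, i.e. the decoder matrix is forced to be the transpose of the encoder matrix, trained by plain gradient descent on the squared reconstruction error:

```python
import numpy as np

def train_tied_autoencoder(X, d_hidden, lr=0.01, epochs=200, seed=0):
    """Train a d -> d_hidden -> d autoencoder with tied weights (decoder = encoder.T).

    X: (N, d) data matrix. Returns the learned weight matrix W of shape (d, d_hidden).
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(scale=0.1, size=(d, d_hidden))  # shared encoder/decoder weights

    for _ in range(epochs):
        Z = np.tanh(X @ W)          # hidden activations, (N, d_hidden)
        X_hat = Z @ W.T             # reconstruction, (N, d)
        err = X_hat - X             # reconstruction error

        # Gradient of (1/N) * sum ||x_hat - x||^2 w.r.t. the shared W.
        # W appears twice (encoder and decoder), so both contributions are summed.
        dZ = err @ W                       # backprop through the decoder
        dPre = dZ * (1.0 - Z ** 2)         # through tanh
        grad = (X.T @ dPre + err.T @ Z) * (2.0 / N)
        W -= lr * grad

    return W

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(500, 10))
    W = train_tied_autoencoder(X, d_hidden=3)
    X_hat = np.tanh(X @ W) @ W.T
    print("reconstruction MSE:", np.mean((X_hat - X) ** 2))
```

Note that tying the weights halves the number of free parameters, which is exactly the regularization effect mentioned above.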
The basic autoencoder above can then be used, layer by layer, as the pre-training step for a deep nnet.
Next, Lin turns to the regularization of deep nnets.
The regularization methods mentioned in earlier lectures all still apply (structural constraints, weight decay/elimination regularizers, early stopping); a new regularization technique is described below.
The method is: adding noise to the data.
Simply put, Gaussian noise is added to the inputs of the autoencoder during training, while the clean, noise-free data is used as the target output; as a result the autoencoder acquires some robustness to noise.
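A minimal sketch of the idea (my own illustration in numpy): the network is fed corrupted inputs but asked to reconstruct the clean ones.

```python
import numpy as np

def make_denoising_batch(X, noise_std=0.1, seed=None):
    """Build an (input, target) pair for denoising-autoencoder training.

    The input is the data corrupted with Gaussian noise; the target is the
    original, clean data, so the autoencoder learns to undo the corruption.
    """
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(scale=noise_std, size=X.shape)
    return X_noisy, X  # train by encoding/decoding X_noisy and comparing against X

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(5, 3))
    X_in, X_target = make_denoising_batch(X, noise_std=0.2, seed=1)
    print(np.abs(X_in - X_target).mean())  # average corruption magnitude
```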
Next, we start introducing PCA-related content.
The autoencoder described above can be classified as a nonlinear autoencoder (the hidden-layer outputs go through tanh, so it is nonlinear).
So what if we use a linear autoencoder instead? (Here the bias unit of the hidden layer is also removed.)
The resulting hypothesis of the linear autoencoder is: h(x) = W W^T x
From this, the error function can be written down.
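The slide with the error function is not reproduced here; reconstructing it from h(x) = W W^T x, the squared reconstruction error should be:

$$
E_{\text{in}}(W) \;=\; \frac{1}{N}\sum_{n=1}^{N}\big\|\,\mathbf{x}_n - W W^{\top}\mathbf{x}_n\,\big\|^{2} .
$$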
This is a 4th-order polynomial in W, so an analytic solution is not straightforward.
Lin therefore presents a solution along the following lines:
The core observation is: W W^T is a real symmetric matrix.
The properties of real symmetric matrices are summarized here: (http://wenku.baidu.com/view/1470f0e8856a561252d36f5d.html)
Let's look at the matrix W: W is a d×d' matrix, so W W^T is a d×d matrix.
Here is a quick review of the relevant property of matrix rank:
Thus the rank of W W^T is at most d' (d is the original dimension of the data, d' is the number of hidden neurons; in general d' < d).
Since the rank of W W^T is at most d', W W^T has at most d' nonzero eigenvalues → the diagonal matrix Γ in its eigendecomposition has at most d' nonzero entries on its diagonal.
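In symbols, the rank argument is:

$$
\operatorname{rank}(W W^{\top}) \;\le\; \operatorname{rank}(W) \;\le\; d'
\quad\Longrightarrow\quad
W W^{\top}\ \text{has at most}\ d'\ \text{nonzero eigenvalues}.
$$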
Here we need to review a concept from linear algebra:
If a matrix is diagonalizable, then its number of nonzero eigenvalues equals its rank; if the matrix is not diagonalizable, this conclusion does not necessarily hold.
Here W W^T is a real symmetric matrix, and since real symmetric matrices are always diagonalizable, the number of nonzero eigenvalues of W W^T is exactly equal to its rank.
With the above, W W^T x can also be written as V Γ V^T x (the decomposition is written out after the list below):
1) V^T x can be viewed as a rotation (change of basis) of the original input
2) Γ can be seen as setting the components corresponding to zero eigenvalues to 0 and scaling the remaining components
3) V then rotates the result back
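Written out, the eigendecomposition used here is (V orthogonal, Γ diagonal with at most d' nonzero entries):

$$
W W^{\top}\mathbf{x} \;=\; V \Gamma V^{\top}\mathbf{x}, \qquad V V^{\top} = V^{\top} V = I .
$$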
With this, the optimization objective function can be written down (a sketch of it appears below).
The leading V can be dropped (this is a property of orthogonal transformations: an orthogonal transformation does not change inner products, and hence lengths, of vectors; see https://zh.wikipedia.org/wiki/orthogonal for details).
As a result, the problem simplifies: make as many entries of I − Γ equal to 0 as possible, i.e. use the freedom in Γ's diagonal to place as many 1s there as possible, at most d' of them. The rest is left for V to take care of.
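Putting the last few remarks together, a sketch of the objective (my reconstruction of the slide), using x_n = V I V^T x_n to write the input in the same basis:

$$
\min_{V,\,\Gamma}\; \frac{1}{N}\sum_{n=1}^{N} \big\|\, V I V^{\top}\mathbf{x}_n \;-\; V \Gamma V^{\top}\mathbf{x}_n \,\big\|^{2}
\;=\;
\min_{V,\,\Gamma}\; \frac{1}{N}\sum_{n=1}^{N} \big\|\, (I-\Gamma)\, V^{\top}\mathbf{x}_n \,\big\|^{2} ,
$$

where the leading V is dropped because it preserves lengths; the optimal Γ then puts 1s on (at most) d' diagonal positions.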
1) The minimization is first converted into an equivalent maximization problem.
2) Consider the case with only one nonzero component: maximize Σ_n v^T x_n x_n^T v subject to v^T v = 1.
3) In this optimization problem, at the optimum the gradient of the objective and the gradient of the constraint must be parallel (a Lagrange-multiplier argument).
4) Looking carefully at the resulting form Σ_n x_n x_n^T v = λ v (written out right below): v is exactly an eigenvector of X^T X.
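In symbols, the single-direction problem and its stationarity condition are (again my reconstruction):

$$
\max_{\mathbf{v}}\; \sum_{n=1}^{N} \mathbf{v}^{\top}\mathbf{x}_n \mathbf{x}_n^{\top}\mathbf{v}
\quad \text{s.t.}\quad \mathbf{v}^{\top}\mathbf{v} = 1
\qquad\Longrightarrow\qquad
\Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top}\Big)\mathbf{v} \;=\; \lambda\,\mathbf{v} .
$$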
Therefore, the optimal v is the eigenvector corresponding to the largest eigenvalue of X^T X; to keep d' dimensions, take the top d' eigenvectors.
Finally, Lin mentions PCA: in fact, before carrying out the steps above, each dimension of the data should first have its mean subtracted (i.e. the data is centered).
Here's a look at PCA.
http://blog.codinglabs.org/articles/pca-tutorial.html
The blog post above is very good and basically explains the ins and outs of PCA completely.
1) The purpose of PCA is to reduce the dimensionality of the data while keeping as much of the original information as possible (i.e. spreading the data out: large variance along the retained directions).
2) With a suitable arrangement of the data from all dimensions, the variances and covariances can be represented by a single matrix.
The paragraph above makes the point clearly: the core trick of PCA is to pack the variances and covariances of all dimensions of the input data into one matrix (the covariance matrix).
The optimization goal is: large variances and small (zero) covariances, so the goal is equivalent to diagonalizing the covariance matrix.
Diagonalizing a real symmetric matrix is basic linear algebra: http://wenku.baidu.com/view/1470f0e8856a561252d36f5d.html
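As a concrete illustration (my own numpy sketch, not code from either reference): PCA via eigendecomposition of the covariance matrix.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (shape N x d) onto the top-k principal components.

    Steps: center each dimension, build the covariance matrix, diagonalize it
    (it is real symmetric, so eigh applies), keep the k eigenvectors with the
    largest eigenvalues, and project.
    """
    X_centered = X - X.mean(axis=0)                 # subtract the mean of each dimension
    cov = X_centered.T @ X_centered / X.shape[0]    # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]                   # top-k eigenvectors (d x k)
    return X_centered @ top, top

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(200, 5))
    Z, components = pca(X, k=2)
    print(Z.shape, components.shape)  # (200, 2) (5, 2)
```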
OK, that largely takes care of PCA.
Along the way I also looked at Stanford's http://ufldl.stanford.edu/wiki/index.php/PCA.
A thought came to mind: if the covariance matrix is full rank and we do not reduce the dimensionality at all (keep as many dimensions as we started with), what is the difference between the data before and after the transformation?
In terms of equations, the change is that the covariance matrix of the transformed data becomes diagonal. In terms of geometry, compare the following two figures:
Before transformation:
After transformation:
The intuitive impression is that the whole point cloud gets "laid flat".
Before the transformation: the larger x1 is, the larger x2 tends to be, and vice versa.
After the transformation: because the cloud has been laid flat, the size of x1 no longer tells us anything about the size of x2.
Therefore, this flattening removes the correlation between x1 and x2, which is exactly the effect of setting the off-diagonal elements of the covariance matrix to 0.
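A quick numerical check of this intuition (my own example): generate correlated 2-D data, rotate it with the eigenvectors of its covariance matrix without dropping any dimension, and the off-diagonal entries of the covariance matrix become (numerically) zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=1000)   # x2 grows with x1: correlated data
X = np.column_stack([x1, x2])
X -= X.mean(axis=0)

cov_before = X.T @ X / len(X)
_, V = np.linalg.eigh(cov_before)             # full set of eigenvectors (2 x 2)
Y = X @ V                                     # same dimensionality, just rotated
cov_after = Y.T @ Y / len(Y)

print(np.round(cov_before, 3))                # noticeable off-diagonal entries
print(np.round(cov_after, 3))                 # off-diagonal entries ~ 0
```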
"Deep learning" heights field machine learning techniques