"Reprint please indicate the source" Chenrudan.github.io
I recently looked into dimensionality reduction algorithms. This article first gives an information table comparing seven algorithms, summarizing each algorithm's parameters and main purpose. It then introduces some basic concepts of dimensionality reduction, including what dimensionality reduction is, why we reduce dimensionality, and how dimensionality reduction alleviates the curse of dimensionality; next it analyzes the angles from which dimensionality can be reduced; and finally it walks through the concrete flow of each algorithm. The outline is as follows:
1. Basic concepts of dimensionality reduction
2. From what angle to reduce dimensionality
3. Dimensionality reduction algorithms
3.1 Principal component analysis (PCA)
3.2 Multidimensional scaling (MDS)
3.3 Linear discriminant analysis (LDA)
3.4 Isometric mapping (ISOMAP)
3.5 Locally linear embedding (LLE)
3.6 t-SNE
3.7 Deep autoencoder networks
4. Summary
As usual, here is the information table for all the algorithms first. X denotes the high-dimensional input matrix of size D (the high dimension) times the number of samples N, and C = XX^T; Z denotes the dimensionality-reduced output matrix of size d (the low dimension) times N, and E = ZZ^T; the linear mapping is Z = W^T X; A is the matrix of pairwise distances in the high-dimensional space; S_w and S_b are LDA's within-class and between-class scatter matrices respectively; k means that a point in manifold learning is adjacent to k points; F is the matrix of linear-combination coefficients expressing a point in the high-dimensional space in terms of the points around it, and M = (I−F)(I−F)^T; "−" indicates not applicable or uncertain; P is the probability matrix in the high-dimensional space, giving the proportion each pairwise distance occupies among all distances, and Q is the corresponding probability matrix in the low-dimensional space; L is the number of layers of a fully connected neural network and D_l the number of nodes in layer l.
Fig. 1 Comparison of different dimensionality reduction algorithms
I am still a little unsure whether the autoencoder should be counted as centering the data: when processing image data, the input image is preprocessed by subtracting the mean, but that operation subtracts the per-sample mean [1], whereas centering here refers to subtracting the mean of each dimension, which is not the same thing. The following discusses dimensionality reduction in detail. 1. Basic concepts of dimensionality reduction
Dimensionality reduction means being able to use a set of vectors z_i of dimension d to represent the useful information contained in vectors x_i of dimension D, where d < D. Suppose we have a 512*512 image and want to classify it with an SVM; the most direct way is to flatten the image by rows or columns into an input vector x_i of length 512*512 and multiply it with the SVM's parameters. If that 512*512-length vector could be reduced to 100 dimensions while preserving the useful information, the space needed to store inputs and parameters would shrink and the time to compute the vector multiplication would drop considerably, so dimensionality reduction effectively reduces computation time. Moreover, high-dimensional data is likely to be sparsely distributed: 100 samples scattered in a 100-dimensional space are certainly very sparse, and the number of samples required grows exponentially with every added dimension. This problem of sparse samples in high-dimensional space is called the curse of dimensionality, and dimensionality reduction can alleviate it.
Dimensionality can be reduced because the data is redundant: it either contains information that is never used or information that is repeated. For example, a 512*512 image may have non-zero values only in the central 100*100 region, with the remaining area being unused information; or an image may be centrally symmetric, in which case the symmetric half is repeated information. After a proper dimensionality reduction the data generally retains most of the important information of the original data and can completely replace the input for other tasks, which greatly reduces the amount of computation. For example, reducing to two or three dimensions enables visualization. 2. From what angle to reduce dimensionality
In general, data dimensionality reduction can be considered from two angles. One is feature selection, extracting a subset of the features, for example taking only the central region of the 512*512 image; the other is transforming the original high-dimensional space into a new space through a linear or non-linear mapping, which is what this article mainly discusses. The latter angle generally has two kinds of approaches [2]. One is the projection approach that maps from the high-dimensional space to a low-dimensional space; its representative algorithm is PCA, and LDA and autoencoders also belong to this kind. The main goal is to learn or compute a transformation matrix W and multiply it with the high-dimensional data to obtain the low-dimensional data. The other is manifold learning, whose purpose is to find a low-dimensional description of the samples in the high-dimensional space. It assumes that data in the high-dimensional space exhibits a regular, low-dimensional manifold arrangement, but this arrangement cannot be measured directly by the Euclidean distance in the high-dimensional space: as shown in the left figure below, the actual distance between the two marked points should be the one shown in the bottom-right figure. If there were a way to describe the manifold in the high-dimensional space, that spatial relationship could be preserved during dimensionality reduction. Manifold learning therefore assumes that a local region of the high-dimensional space still has the properties of Euclidean space, i.e. distances within it can be computed with the Euclidean distance (Isomap), or that a point's coordinates can be computed as a linear combination of its neighbouring points (LLE). In this way a relationship in the high-dimensional space is obtained and preserved in the low-dimensional space, and the dimensionality reduction is based on this relationship, so manifold learning can be used to compress data, visualize it, obtain an effective distance matrix, and so on.
Fig. 2 Manifold Learning
3. The flow of several dimensionality reduction methods 3.1 Principal component analysis (PCA)
PCA, invented by Karl Pearson in 1901, is a linear dimensionality reduction method. A point x_i = (x_1, x_2, ..., x_D) of the high-dimensional space (dimension D) is mapped to a point z_i = W^T x_i of the low-dimensional space (dimension d, d < D) by multiplying with a matrix W of size D×d, where i indexes the sample. We thus map N points of the D-dimensional space to the d-dimensional space, and the goal of PCA is to make the points z_i as spread out as possible, i.e. to make the variance of the N points z_i as large as possible. If each dimension of the data in the D-dimensional space has zero mean, i.e. Σ_i x_i = 0, then multiplying both sides by W^T shows that each dimension of the reduced data also has zero mean. Consider the matrix C = (1/N) X X^T: it is the covariance matrix of this group of D-dimensional data; the values on the diagonal are the variances within each of the D dimensions, and the off-diagonal elements are the covariances between pairs of dimensions.
$$\frac{1}{N}XX^T=\begin{pmatrix}\frac{1}{N}\sum_{i=1}^{N}x_{1i}^2 & \cdots & \frac{1}{N}\sum_{i=1}^{N}x_{1i}x_{Di}\\ \vdots & \ddots & \vdots\\ \frac{1}{N}\sum_{i=1}^{N}x_{Di}x_{1i} & \cdots & \frac{1}{N}\sum_{i=1}^{N}x_{Di}^2\end{pmatrix}$$
Now consider the covariance matrix B = (1/N) Z Z^T of the d-dimensional data after reduction. If we want the points within each reduced dimension to be as spread out as possible, we want the values on B's diagonal to be as large as possible for each dimension: a large variance means the data is well differentiated on that dimension. We also want the d dimensions to be mutually orthogonal; orthogonal dimensions are uncorrelated and contain no overlapping information, so the data is expressed most efficiently, with each dimension both discriminative and informative. In that case B's off-diagonal values are all 0. Since B = (1/N) Z Z^T = W^T ((1/N) X X^T) W = W^T C W, this expression actually says that the role of the linear transformation matrix W in PCA is to diagonalize the original covariance matrix C. Because diagonalization in linear algebra is obtained by solving for eigenvalues and the corresponding eigenvectors, the PCA algorithm flow follows (the flow is mainly excerpted from Zhou Zhihua's book "Machine Learning", to which I added the goal and assumptions so it can be compared with the algorithms below; Zhou's book derives it with the Lagrange multiplier method, which is essentially the same, and the blog on the mathematical principles of PCA [3] is recommended). Input: N D-dimensional vectors x_1, ..., x_N, reduced to d dimensions. Output: projection matrix W = (w_1, ..., w_d), where each w_i is a D-dimensional column vector. Goal: the projected dimensions are as spread out as possible, max_W tr(W^T X X^T W) (the trace appears because, as noted above, B's off-diagonal elements are 0 and the elements on its diagonal are exactly the variances of each dimension). Assumptions: after reduction the variance of each dimension is as large as possible and the dimensions are mutually orthogonal. 1. Center each dimension of the input to zero mean. 2. Compute the covariance matrix of the input, C = X X^T. 3. Do an eigenvalue decomposition of the covariance matrix C. 4. Take the eigenvectors corresponding to the d largest eigenvalues as w_1, ..., w_d.
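The flow above can be written out directly in a few lines of numpy. The sketch below is mine, not the original author's code, and follows the notation of the table at the top (X is D×N, W is D×d, Z = W^T X):

```python
import numpy as np

def pca(X, d):
    """Reduce a D x N data matrix X to d x N.

    X: array of shape (D, N), one sample per column.
    Returns (Z, W) with Z = W^T X of shape (d, N).
    """
    # 1. Center each dimension (row) to zero mean.
    Xc = X - X.mean(axis=1, keepdims=True)
    # 2. Covariance matrix C = (1/N) X X^T, shape (D, D).
    C = Xc @ Xc.T / Xc.shape[1]
    # 3. Eigendecomposition of the symmetric matrix C.
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Keep the eigenvectors of the d largest eigenvalues as W (D x d).
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d]]
    Z = W.T @ Xc
    return Z, W

# Example: 500 samples in 50 dimensions reduced to 2.
X = np.random.randn(50, 500)
Z, W = pca(X, 2)
print(Z.shape)  # (2, 500)
```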
In addition, PCA has many variants such as kernel PCA and probabilistic PCA; this article only considers the simplest version of PCA for now. 3.2 Multidimensional scaling (MDS)
The objective of MDS is to preserve the dissimilarity of the data during dimensionality reduction; it can be understood as requiring the distance relationships in the high-dimensional space to remain the same in the low-dimensional space. The distances are represented by a matrix: the pairwise distances of the N samples are the entries a_ij of a matrix A, and the distance in the low-dimensional space is assumed to be Euclidean. Writing the reduced data as z_i, we have a_ij^2 = ||z_i − z_j||^2 = ||z_i||^2 + ||z_j||^2 − 2 z_i^T z_j; the three terms on the right can be expressed uniformly through the inner-product matrix E with e_ij = z_i^T z_j. After centering, the sum of every row of E is 0, from which one can derive
$$e_{ij}=-\frac{1}{2}\left(a_{ij}^2-a_{i\cdot}^2-a_{\cdot j}^2+a_{\cdot\cdot}^2\right)=-\frac{1}{2}(PAP)_{ij}$$
where P = I − (1/N)·11^T, i.e. the identity matrix minus the all-ones matrix scaled by 1/N, a_{i·}^2 and a_{·j}^2 denote the row and column means of the squared distances and a_{··}^2 their overall mean (A here is taken to hold the squared distances). This establishes the relationship between the distance matrix A and the inner-product matrix E. Therefore, knowing A, we can solve for E and then do an eigenvalue decomposition of E, giving E = V Λ V^T, where Λ is the diagonal matrix of E's eigenvalues λ_1 ≥ ... ≥ λ_d. The data under all eigenvalues can then be expressed as Z = Λ^{1/2} V^T; when only the d largest eigenvalues are kept, the distance matrix of the d-dimensional space approximates the distance matrix of the high-dimensional space. The MDS flow is as follows [2]. Input: distance matrix A_{N×N} = (a_ij), where the subscript denotes the matrix size; the original data is D-dimensional and is reduced to d dimensions. Output: the reduced matrix Z_{d×N} = Λ̃^{1/2} Ṽ^T. Goal: reduce dimensionality while keeping the relative relationships between the data unchanged. Assumption: the distance matrix of the N samples is known. 1. Compute a_{i·}^2, a_{·j}^2, a_{··}^2. 2. Compute the inner-product matrix E. 3. Do an eigenvalue decomposition of E. 4. Take the d largest eigenvalues to form Λ̃ and the corresponding eigenvectors, in order, to form Ṽ.
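A minimal sketch of this classical MDS flow, assuming A holds the pairwise Euclidean distances (this is illustrative code, not the original author's; the double-centering and eigendecomposition follow the steps above):

```python
import numpy as np

def mds(A, d):
    """Classical MDS: A is the N x N matrix of pairwise distances,
    returns a d x N embedding Z."""
    N = A.shape[0]
    # Inner-product matrix E = -1/2 * P (A squared elementwise) P,
    # with P = I - (1/N) 11^T.
    P = np.eye(N) - np.ones((N, N)) / N
    E = -0.5 * P @ (A ** 2) @ P
    # Eigendecomposition of E, keep the d largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(E)
    idx = np.argsort(eigvals)[::-1][:d]
    L = np.diag(np.sqrt(np.maximum(eigvals[idx], 0)))  # Lambda^{1/2}
    V = eigvecs[:, idx]
    return L @ V.T  # Z = Lambda^{1/2} V^T, shape (d, N)

# Example: distances computed from random 3-D points, embedded into 2-D.
X = np.random.randn(3, 100)
A = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
Z = mds(A, 2)
print(Z.shape)  # (2, 100)
```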
3.3 Linear discriminant analysis (LDA) LDA was first proposed by Fisher in 1936 to solve the two-class classification problem; because its computation effectively reduces the dimensionality of the data, it can also be used as a supervised linear dimensionality reduction method. It projects high-dimensional data into a low-dimensional space and determines the class of each sample in that low-dimensional space; the K-class case is considered here. Its goal is to classify the samples into the K classes as correctly as possible: projections of samples of the same class should be as close as possible, and those of different classes as far apart as possible. This is different from PCA: PCA wants all samples spread out as much as possible on each dimension, whereas LDA's low-dimensional projections may overlap, which PCA does not want. LDA uses the same kind of mapping as PCA, a linear projection by matrix multiplication, z_i = W^T x_i. The centers of the two classes in the original high-dimensional space are μ_1, μ_2, projected along the direction shown in the figure below. Since we want samples of different classes to be as far apart as possible, the projected class centers should be as far apart as possible, i.e. the objective is max_W ||W^T μ_1 − W^T μ_2||^2. But pushing the centers apart is not enough: for example, projecting perpendicular to the x_1 axis in the figure below keeps the two centers far apart, yet samples overlap in the projected space. So an additional objective is needed: projections of samples of the same class should be as close as possible, i.e. the within-class covariance of each class should be as small as possible. The within-class covariance matrix of one class is as follows.
Fig. 3 The projection of LDA (figure Source [4])
$$\min_W\ \sum_{x\in X_1}\left(W^Tx-W^T\mu_1\right)\left(W^Tx-W^T\mu_1\right)^T=W^T\left(\sum_{x\in X_1}(x-\mu_1)(x-\mu_1)^T\right)W=W^T\Sigma_{X_1}W$$
where μ_1 is the mean of the samples of class 1, W = (w_1, ..., w_d), X_1 denotes the set of samples belonging to class 1, and the middle matrix Σ_{X_1} is the covariance matrix of the samples in X_1. Adding up the covariance matrices of the K classes of the original data gives the within-class scatter matrix, S_w = Σ_{k=1}^{K} S_k = Σ_{k=1}^{K} Σ_{x∈X_k} (x−μ_k)(x−μ_k)^T. The center distance of the two classes above was obtained by directly subtracting the centers; for K classes, the projected center distances require first computing the center of all samples, μ = (1/N) Σ_{k=1}^{K} n_k μ_k (where n_k is the number of samples of class k), and are measured by the between-class scatter matrix, S_b = Σ_{k=1}^{K} n_k (μ_k−μ)(μ_k−μ)^T. Putting these together, the optimization goal of the LDA algorithm is to maximize the following cost function:
$$\max_W J(W)=\frac{\mathrm{tr}\left(W^T\left(\sum_{k=1}^{K}n_k(\mu_k-\mu)(\mu_k-\mu)^T\right)W\right)}{\mathrm{tr}\left(W^T\left(\sum_{k=1}^{K}\sum_{x\in X_k}(x-\mu_k)(x-\mu_k)^T\right)W\right)}=\frac{\mathrm{tr}(W^TS_bW)}{\mathrm{tr}(W^TS_wW)}$$
In the two-class case W has size D×1, so J(W) itself is a scalar; in the K-class case W has size D×(K−1), and the optimization objective takes the trace of the matrices in the numerator and denominator. Note that LDA does not center the data: if each class were centered, the class centers would coincide, so this algorithm is not centered. Setting the derivative of J(W) with respect to W to zero shows that the solution for W is the matrix of eigenvectors corresponding to the K−1 largest eigenvalues of S_w^{-1} S_b. W can then be used to reduce the dimensionality of X.
$$\frac{\partial J(W)}{\partial W}=(W^TS_wW)\,2S_bW-(W^TS_bW)\,2S_wW=0\ \Rightarrow\ S_bW=J(W)\,S_wW\ \Rightarrow\ S_w^{-1}S_bW=J(W)\,W$$
Input: N D-dimensional vectors x_1, ..., x_N, with the data divided into K classes. Output: projection matrix W = (w_1, ..., w_{K−1}), where each w_i is a D-dimensional column vector. Goal: after the projection, the covariance of samples of the same class is as small as possible and the distance between the centers of different classes is as large as possible. Assumption: the optimization goal is to maximize tr(W^T S_b W) / tr(W^T S_w W). 1. Compute the within-class scatter matrix S_w and the between-class scatter matrix S_b. 2. Do a singular value decomposition S_w = U Σ V^T, giving S_w^{-1} = V Σ^{-1} U^T. 3. Do an eigenvalue decomposition of the matrix S_w^{-1} S_b. 4. Take the eigenvectors corresponding to the K−1 largest eigenvalues as w_1, ..., w_{K−1}.
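A sketch of this flow in numpy (not a reference implementation; I use a pseudo-inverse in place of the SVD inversion of step 2 for numerical safety, and the example data is synthetic):

```python
import numpy as np

def lda(X, y, d):
    """Supervised projection to d dimensions (d <= K-1 in practice).

    X: (D, N) data matrix, y: length-N integer class labels."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)                   # overall mean
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for k in np.unique(y):
        Xk = X[:, y == k]
        mu_k = Xk.mean(axis=1, keepdims=True)
        Sw += (Xk - mu_k) @ (Xk - mu_k).T                # within-class scatter
        Sb += Xk.shape[1] * (mu_k - mu) @ (mu_k - mu).T  # between-class scatter
    # Eigendecomposition of Sw^{-1} Sb (pinv instead of explicit SVD inverse).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    idx = np.argsort(eigvals.real)[::-1][:d]
    W = eigvecs[:, idx].real                             # projection matrix (D x d)
    return W.T @ X, W

# Example: three Gaussian classes in 10-D projected to 2-D.
X = np.hstack([np.random.randn(10, 50) + i for i in range(3)])
y = np.repeat([0, 1, 2], 50)
Z, W = lda(X, y, 2)
print(Z.shape)  # (2, 150)
```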
Personally, I think the optimization goal here actually embodies an assumption, namely that the matrices in the numerator and denominator of the objective are diagonal: the transformation W turns both S_b and S_w into diagonal matrices. 3.4 Isometric mapping (ISOMAP)
The MDS described above is used only for data dimensionality reduction: it needs to know the distance relationships in the high-dimensional space, and it does not reflect the potential manifold of the high-dimensional data itself. However, the basic idea of manifold learning can be combined with MDS to reduce dimensionality [5]: distances in a local region of the high-dimensional space can be computed with the Euclidean distance. For the MDS distance matrix A, the distance between two adjacent points is a_ij = ||x_i − x_j||, their Euclidean distance; all other entries are initialized to a_ij = ∞, and the distances between non-adjacent points are then determined by a shortest-path algorithm, which determines the matrix A. The remaining question is which points count as adjacent; Isomap determines the adjacent points by KNN. The overall algorithm flow is as follows. Input: N D-dimensional vectors x_1, ..., x_N, each point having k nearest neighbours, mapped to d dimensions. Output: the reduced matrix Z_{d×N} = Λ̃^{1/2} Ṽ^T. Goal: reduce dimensionality while preserving the manifold of the high-dimensional data. Assumption: the distance between two points in a local region of the high-dimensional space can be computed by the Euclidean distance. 1. Construct part of A by KNN, i.e. find the adjacent points and fill in their Euclidean distances, initializing all other positions to infinity. 2. Use a shortest-path algorithm (e.g. Dijkstra's algorithm) to find the paths between non-adjacent points and fill in those distances. 3. Use the distance matrix A as the input to MDS to obtain the output.
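A sketch of steps 1–3, combining a KNN graph, Dijkstra shortest paths via scipy, and the classical MDS from above; the neighbourhood size k and the random test data are illustrative only, and the neighbourhood graph is assumed to be connected:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, k, d):
    """Isomap sketch: X is (D, N), k nearest neighbours, embed to d dims."""
    N = X.shape[1]
    # Euclidean distances between all pairs of points.
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    # Keep only the k nearest neighbours of each point, infinity elsewhere.
    A = np.full((N, N), np.inf)
    for i in range(N):
        nn = np.argsort(dist[i])[1:k + 1]
        A[i, nn] = dist[i, nn]
        A[nn, i] = dist[i, nn]
    # Geodesic distances along the neighbourhood graph (Dijkstra).
    G = shortest_path(A, method='D', directed=False)
    # Classical MDS on the geodesic distance matrix (graph assumed connected).
    P = np.eye(N) - np.ones((N, N)) / N
    E = -0.5 * P @ (G ** 2) @ P
    eigvals, eigvecs = np.linalg.eigh(E)
    idx = np.argsort(eigvals)[::-1][:d]
    return np.diag(np.sqrt(np.maximum(eigvals[idx], 0))) @ eigvecs[:, idx].T

# Example: random 5-D data embedded into 2-D.
Z = isomap(np.random.randn(5, 200), k=10, d=2)
print(Z.shape)  # (2, 200)
```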
3.5 Locally linear embedding (LLE) As mentioned before, a local region of the manifold has the properties of Euclidean space. In LLE it is assumed that the coordinates of a point x_i can be obtained as a linear combination of the coordinates of some points around it, i.e. x_i = Σ_{j∈X_i} f_ij x_j (where X_i denotes the set of points in the neighbourhood of x_i), which is also a way of expressing relationships in the high-dimensional space. Since this relationship is preserved in the low-dimensional space as well, the same weights appear in the low-dimensional formula z_i = Σ_{j∈Z_i} f_ij z_j.
Based on the above assumption, the weights are solved for first. Assuming each sample point is reconstructed from its k surrounding samples, the linear combination weights of one sample form a 1×k vector, obtained by minimizing the reconstruction error; the solution then follows from taking the derivative of the objective with respect to F.
$$\min_{f_{i1},\dots,f_{ik}}\ \sum_{i=1}^{N}\left\|x_i-\sum_{j\in X_i}f_{ij}x_j\right\|^2\quad \text{s.t.}\ \sum_{j\in X_i}f_{ij}=1$$
After the weights are computed, the objective optimized in the low-dimensional space is min_{z_1,...,z_N} Σ_{i=1}^{N} ||z_i − Σ_{j∈Z_i} f_ij z_j||^2 = tr((Z − Z F)(Z − Z F)^T) = tr(Z M Z^T), s.t. Z Z^T = I, solved for Z, where F is the N×k weight matrix (arranged as N×N with zeros for non-neighbours) and a constraint on Z is added. Using the Lagrange multiplier method this takes the form M Z^T = λ Z^T, and Z is then obtained from the eigenvalue decomposition of M. Input: N D-dimensional vectors x_1, ..., x_N, each point having k neighbouring points, mapped to d dimensions. Output: the reduced matrix Z. Goal: reduce dimensionality while keeping the manifold of the high-dimensional data unchanged. Assumptions: a point in a local region of the high-dimensional space is a linear combination of its k adjacent points, and the dimensions of the low-dimensional space are orthogonal. 1. Construct part of A by KNN, i.e. find the k adjacent points, then compute the matrices F and M. 2. Do an eigenvalue decomposition of M. 3. Take the eigenvectors corresponding to the d smallest non-zero eigenvalues to form Z (small eigenvalues are taken here because the objective is minimized).
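A sketch of these steps, with a small regularization term added to the local Gram matrix so that solving for the weights stays stable (a common practical trick, not part of the flow above; the data and parameters are illustrative):

```python
import numpy as np

def lle(X, k, d):
    """LLE sketch: X is (D, N), k neighbours per point, embed to d dims."""
    D, N = X.shape
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    F = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(dist[i])[1:k + 1]              # k nearest neighbours
        # Local Gram matrix of the neighbours relative to x_i.
        G = (X[:, nn] - X[:, [i]]).T @ (X[:, nn] - X[:, [i]])
        G += 1e-3 * np.trace(G) * np.eye(k)            # regularization
        w = np.linalg.solve(G, np.ones(k))
        F[i, nn] = w / w.sum()                         # weights sum to 1
    # M built from (I - F); its smallest eigenvectors give the embedding.
    M = (np.eye(N) - F).T @ (np.eye(N) - F)
    eigvals, eigvecs = np.linalg.eigh(M)
    # Discard the constant eigenvector (eigenvalue ~0), keep the next d.
    return eigvecs[:, 1:d + 1].T                       # Z of shape (d, N)

Z = lle(np.random.randn(5, 200), k=10, d=2)
print(Z.shape)  # (2, 200)
```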
3.6 t-SNE t-SNE is also a method of reducing dimensionality to a two- or three-dimensional space; it was proposed by Maaten in 2008 [6] and is an improvement on stochastic neighbor embedding (SNE), proposed by Hinton in 2002. Its main idea is to assume that, for any two points in the high-dimensional space, the value of x_j follows a Gaussian distribution centered on x_i with variance σ_i, and likewise x_i follows a Gaussian distribution centered on x_j with variance σ_j. The conditional probability of x_j given x_i is then p_{j|i} = exp(−||x_i−x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i−x_k||^2 / 2σ_i^2), i.e. the probability of x_j under x_i's Gaussian as a proportion of the total probability of all samples under x_i's Gaussian, which expresses the similarity of the two points from x_i's point of view. Then p_ij = (p_{i|j} + p_{j|i}) / 2N is used as the joint probability describing the similarity of this pair among all sample pairs. The joint-probability formula below appears in the paper, which does not say whether σ is a scalar or a vector; however, in the actual solution p_ij is not computed directly from the joint formula below but through the preceding conditional probabilities: for each sample i, a σ_i is found such that, given a fixed perplexity value, Perp(P_i) = 2^{H(P_i)}, where H(P_i) = −Σ_j p_{j|i} log p_{j|i}; a binary search is used to determine the σ_i of x_i that makes the perplexity equal to the given value. So σ here should be a vector; it is unlikely that all samples share one Gaussian distribution parameter.
$$p_{ij}=\frac{\exp\left(-\|x_i-x_j\|^2/2\sigma^2\right)}{\sum_{k\neq l}\exp\left(-\|x_k-x_l\|^2/2\sigma^2\right)}$$
At the same time, the relationship or similarity between two points in the low-dimensional space is also expressed by a joint probability: the Euclidean distance between two points in the low-dimensional space is assumed to follow a Student's t-distribution with one degree of freedom, and the proportion a pair's distance probability occupies among all pairs is taken as the joint probability of those two points.
$$q_{ij}=\frac{\left(1+\|z_i-z_j\|^2\right)^{-1}}{\sum_{k\neq l}\left(1+\|z_k-z_l\|^2\right)^{-1}}$$
If the similarity p_ij of x_i, x_j in the high-dimensional space equals the similarity q_ij of z_i, z_j in the low-dimensional space, then the low-dimensional points correctly reflect the relative positions in the high-dimensional space. So the goal of t-SNE is to find the set of low-dimensional data that minimizes the difference between p_ij and q_ij. t-SNE therefore uses the Kullback-Leibler divergence to construct the objective function J = KL(P||Q) = Σ_i Σ_j p_ij log(p_ij / q_ij); the KL divergence measures the difference between two probability distributions. It uses gradient descent to find the low-dimensional representation z_i of the input data, i.e. it differentiates the objective with respect to z_i, treating z_i as the optimization variable; the gradient is ∂J/∂z_i = 4 Σ_j (p_ij − q_ij)(z_i − z_j)(1 + ||z_i − z_j||^2)^{-1}, and z_i is then updated iteratively. In the actual update a momentum term is added, as in neural network training, to accelerate optimization. The approximate algorithm flow is as follows. Input: N D-dimensional vectors x_1, ..., x_N, mapped to two or three dimensions; a fixed perplexity Perp, iteration count T, learning rate η, momentum factor α(t). Output: the reduced representations z_1, ..., z_N. Goal: reduce to two or three dimensions for visualization (the emphasis is visualization). Assumptions: in the high-dimensional space, the value of a point x_j follows a Gaussian distribution centered on another point x_i; in the low-dimensional space, the Euclidean distance between two points follows a t-distribution with one degree of freedom. 1. First determine the σ_i of each x_i by binary search. 2. Compute p_{j|i} and p_ij = (p_{j|i} + p_{i|j}) / 2N. 3. Initialize z_1, ..., z_N. 4. Compute q_ij. 5. Compute the gradient ∂J/∂z_i. 6. Update z_i(t) = z_i(t−1) + η ∂J/∂z_i + α(t)(z_i(t−1) − z_i(t−2)). 7. Repeat steps 4–6 until convergence or until the iteration count T is reached.
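A stripped-down sketch of this loop: for brevity it uses one fixed σ for every point instead of the per-point binary search on perplexity, and omits tricks like early exaggeration, so it only illustrates the P/Q construction and the gradient update described above:

```python
import numpy as np

def tsne(X, d=2, sigma=1.0, iters=500, lr=100.0, momentum=0.8):
    """Simplified t-SNE sketch: X is (D, N), returns a (d, N) embedding."""
    N = X.shape[1]
    # High-dimensional affinities: conditional Gaussians, then symmetrised.
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    P = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)        # p_{j|i}
    P = (P + P.T) / (2 * N)                  # joint p_{ij}
    P = np.maximum(P, 1e-12)

    Z = 1e-4 * np.random.randn(d, N)
    update = np.zeros_like(Z)
    for t in range(iters):
        # Low-dimensional affinities: Student t with one degree of freedom.
        num = 1.0 / (1.0 + np.sum((Z[:, :, None] - Z[:, None, :]) ** 2, axis=0))
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / num.sum(), 1e-12)
        # Gradient of KL(P || Q) with respect to each z_i.
        PQ = (P - Q) * num
        grad = 4 * (np.diag(PQ.sum(axis=1)) - PQ) @ Z.T   # (N, d)
        update = momentum * update - lr * grad.T          # momentum step
        Z += update
    return Z

Z = tsne(np.random.randn(10, 150))
print(Z.shape)  # (2, 150)
```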
Note that this algorithm treats the low-dimensional data as the variables being iterated, so if new data needs to be added there is no way to operate on it directly; the new data must be appended to the original data and everything recomputed. Hence the main use of t-SNE is visualization. 3.7 Deep autoencoder networks
An autoencoder is a kind of neural network; it is an unsupervised algorithm that can be used to reduce dimensionality and to automatically learn features from the data. Its principle: feed in a set of values; after passing through the network a set of outputs is obtained, and these outputs are required to be as close as possible to the inputs. The network consists of fully connected layers, where every node is connected to all nodes of the previous layer. The structure of the autoencoder is shown in Fig. 4: the encoder network is an ordinary forward propagation z = Wx + b, and the propagation parameters of the decoder network are the transposes of the parameters of the symmetric layers of the encoder, so the decoder computes x′ = W^T z + b′. Finally, when propagation reaches a layer with the same number of nodes as the input layer, a set of values x′ is obtained; the network expects the two to be equal, x′ = x, and the reconstruction error between x′ and the true input x, measured with cross entropy or mean squared error, is the cost function. The network learns the correct parameters by minimizing this cost with gradient descent. So the network first projects high-dimensional data to the low-dimensional space through the "encoder" network, then restores the low-dimensional data back to the high-dimensional space through the "decoder" network.
Figure 4 Autoencoder Network structure diagram
However, when the network is actually implemented, the number of layers is only half of what Fig. 4 shows, i.e. a 4-layer network with a 2000-1000-500-30 fully connected structure, because the weight parameters are shared between the encoder and the decoder: the encoder step multiplies the previous layer's node values by the weights to get this layer's node values, while the decoder step multiplies this layer's node values by the transpose of the weight matrix to get the previous layer's values. The figure below [7] shows the actual structure of each layer more clearly, including a forward pass and a backward pass; the values at the topmost layer can then be used as the network's dimensionality-reduced output for further analysis, e.g. visualization, or as compressed features.
Fig. 5 Autoencoder Layer Structure
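To make the shared-weight forward/backward computation concrete, here is a sketch of a single hidden layer, tied-weight, linear autoencoder trained with mean squared error; it is my illustration only, while real deep versions stack several such layers, add non-linearities, and use the RBM pre-training discussed below:

```python
import numpy as np

def train_autoencoder(X, d, iters=200, lr=0.01):
    """Linear tied-weight autoencoder: encoder z = Wx + b, decoder x' = W^T z + c."""
    D, N = X.shape
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((d, D))   # encoder weight, decoder uses W^T
    b = np.zeros((d, 1))
    c = np.zeros((D, 1))
    for t in range(iters):
        Z = W @ X + b                        # encoder: project to d dimensions
        X_rec = W.T @ Z + c                  # decoder: reconstruct with W^T
        err = X_rec - X                      # reconstruction error
        # Gradients of the mean squared error; W gets an encoder and a decoder term.
        gW = (Z @ err.T + (W @ err) @ X.T) / N
        gb = (W @ err).sum(axis=1, keepdims=True) / N
        gc = err.sum(axis=1, keepdims=True) / N
        W -= lr * gW
        b -= lr * gb
        c -= lr * gc
    return W @ X + b                         # low-dimensional codes Z

Z = train_autoencoder(np.random.randn(20, 300), d=5)
print(Z.shape)  # (5, 300)
```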
In 2006 Hinton published an article in Science on how to use autoencoder networks for dimensionality reduction in deep learning [8], mainly by using stacked RBMs to pre-train the weight parameters; it addresses the fact that the quality of an autoencoder depends on the initialization of its weights, i.e. the main purpose was to propose an effective method of initializing the weights. The expressions above do not include non-linear transformations; in a real network a non-linear transformation follows the matrix multiplication of each layer. In addition, a sparsity property can be added to the autoencoder model [9]: for N D-dimensional inputs, the average activation ρ̂_j^(l) of node j in layer l should be close to 0, i.e. ρ̂_j^(l) = (1/N) Σ_{i=1}^{N} [a_j^(l)(x^(i))], where l denotes the layer and i denotes the i-th input; regularization terms on the weights can also be added as needed. Input: N D-dimensional vectors x_1, ..., x_N; the network structure, i.e. the number of nodes per layer; iteration count T; learning rate η. Output: the reduced representations z_1, ..., z_N. Goal: the network learns properties or structure inside the data so that it can reconstruct the input data. Assumption: the neural network is powerful enough to learn the features. 1. Set the number of layers and the number of nodes per layer. 2. Initialize the weight parameters. 3. Forward-propagate to compute the next layer's node values, z = Wx + b. 4. Propagate back through the decoder to compute the reconstruction x′ = W^T z + b′. 5. Compute the gradient of the cost with respect to each layer's input and that layer's W, and propagate the error through the whole network by backpropagation. 6. Sum the gradients of W and W^T and update W. 7. Repeat steps 3–6 until convergence or until the iteration count T is reached. 4. Summary
This article has focused on what each algorithm's flow is and what each step specifically does; the theoretical exposition in some places may not be clear enough. An interesting observation, though, is that apart from t-SNE and the autoencoder, the other dimensionality reduction algorithms are all based on constructing some matrix and then doing an eigenvalue decomposition of it to obtain the relevant Z or W. I have not studied Laplacian eigenmaps (Laplacian feature mapping) carefully, but that algorithm also selects the d smallest non-zero eigenvalues, which is quite interesting; my mathematics is not strong enough to say for now why methods based on eigenvalues work so well. Comparing a single-layer autoencoder with PCA, and assuming the autoencoder's objective is to minimize the mean squared error: although the autoencoder's constraint is not as strong as PCA's (which requires every dimension to be orthogonal), the autoencoder may learn something similar, because maximizing the trace of the covariance and minimizing the mean squared reconstruction error are equivalent. These methods always feel as if they have some underlying connection; I do not know whether a unified model could be extracted to solve the dimensionality reduction problem.