Dimensionality reduction: what is it? Simply put, it replaces a large number of feature fields with a few derived fields, which makes subsequent analysis easier and enables 2- and 3-dimensional visualization. There are many dimensionality-reduction methods, such as principal component analysis, principal factor analysis, random forests, decision trees, Lasso regression, and t-SNE; in fact, dimensionality reduction can also be understood as a form of variable selection. This article does not cover all of these methods; it mainly introduces two of them, principal component analysis and t-SNE.
First, principal component analysis (PCA). The basic idea of PCA is to study how a few linear combinations of the original variables can explain as much of the information in those variables as possible. Its theory rests on the internal structure of the correlation matrix or covariance matrix of the original variables: linear combinations of the original variables are formed into several composite indices (the principal components), which reduce the dimension and simplify the problem while preserving most of the information in the original variables. In general, the principal components obtained from PCA have the following basic relationships with the original variables:
1. Each principal component is a linear combination of the original variables.
2. The number of retained principal components is significantly smaller than the number of original variables. (Strictly speaking, PCA produces as many components as there are original variables; it is the selection described below, based on the cumulative variance contribution rate, that keeps far fewer components than original variables.)
3. The principal components retain most of the information in the original variables.
4. There is no correlation between the principal components.
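To make these relationships concrete, here is a minimal sketch on synthetic data (the three made-up variables are purely illustrative, not from the original post): each row of pca.components_ holds the coefficients of one linear combination of the original variables, and the resulting component scores are uncorrelated.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Three correlated original variables (illustrative data only).
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     0.8 * x1 + 0.2 * rng.normal(size=200),
                     rng.normal(size=200)])

pca = PCA()
scores = pca.fit_transform(X)

# Relationship 1: row k of components_ gives the coefficients of the
# linear combination that defines principal component k.
print(pca.components_)

# Relationship 4: the component scores are pairwise uncorrelated, so the
# off-diagonal correlations are (numerically) zero.
print(np.round(np.corrcoef(scores, rowvar=False), 6))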
Principal component analysis decomposes the total variance of p random variables into the sum of the variances of p uncorrelated random variables. The first principal component is the linear combination of the original variables with the largest variance, and the ratio of its variance to the total variance is called its contribution rate; the larger this value, the more of the original variables' information the first component captures. To choose the number of components, we look at the cumulative contribution rate of the first k components and take the smallest k for which it exceeds 85%. This keeps the loss of information from the original variables small while still reducing the number of variables and simplifying the problem.

Note that the principal component transform is sensitive to the scale of the variables, so the data should be normalized before the transformation. Note also that the principal components are not quantities produced by the actual system, so interpretability of the data is lost after the PCA transform; if the ability to interpret the data matters for your analysis, PCA may not be appropriate.
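A short sketch of the 85% rule just described (the wine dataset and the StandardScaler choice are stand-ins for illustration): standardize first, then keep the smallest k whose cumulative contribution rate reaches 85%.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Normalize first: PCA is sensitive to the scale of the variables.
X = StandardScaler().fit_transform(load_wine().data)

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)  # cumulative contribution rate

# Smallest k whose cumulative contribution rate is at least 85%.
k = int(np.searchsorted(cum, 0.85)) + 1
print(cum.round(3), k)

X_reduced = PCA(n_components=k).fit_transform(X)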
Second, this post introduces t-SNE (t-distributed Stochastic Neighbor Embedding), a nonlinear dimensionality-reduction algorithm used to explore high-dimensional data. It maps multidimensional data down to two or three dimensions suitable for human inspection. t-SNE is a manifold-learning method and belongs to nonlinear dimensionality reduction; its main goal is that data points that are similar in the high-dimensional space stay as close as possible in the low-dimensional space. It evolved from SNE, which uses a Gaussian distribution to measure the similarity between data points in both the high-dimensional and the low-dimensional space; t-SNE was mainly designed to solve SNE's "crowding problem" by defining the similarity between points in the low-dimensional space with a t-distribution instead. t-SNE is not a general-purpose dimensionality-reduction method, however, and its time complexity is high.
The principle of t-SNE is to convert distances into probabilities, so that points that are near each other in the original data are also near each other in the map. For two points, SNE defines the conditional probabilities

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \qquad q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}

where \|x_i - x_j\| is the Euclidean distance between two data points and \|y_i - y_j\| is the distance between the corresponding mapped points. Minimizing the Kullback-Leibler divergence between the two distributions makes the mapped points close wherever the data points are close. t-SNE replaces the Gaussian in the low-dimensional space with a Student t-distribution with one degree of freedom:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}
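To make the formulas concrete, here is a small NumPy sketch (my own illustration, not code from the original post) that computes the high-dimensional conditional probabilities p_{j|i} for a single fixed bandwidth sigma, and the t-SNE low-dimensional similarities q_{ij}; a real implementation tunes each sigma_i to match a target perplexity.

import numpy as np

def p_conditional(X, sigma=1.0):
    # p_{j|i}: Gaussian similarities in the high-dimensional space.
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)  # squared distances
    logits = -d2 / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)        # a point is not its own neighbor
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)  # normalize each row over k != i

def q_tsne(Y):
    # q_{ij}: Student-t (one degree of freedom) similarities in the map.
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)
    return w / w.sum()                       # normalize over all pairs

rng = np.random.RandomState(0)
X = rng.normal(size=(5, 10))   # 5 points in 10 dimensions
Y = rng.normal(size=(5, 2))    # their (random) 2-D map positions
print(p_conditional(X).round(3))
print(q_tsne(Y).round(3))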
Finally, an application to real data. The data are app user-behavior data, and we reduce their dimension and visualize the result. The code, and the plots produced by principal component analysis and t-SNE, are as follows:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

mbs = MaxAbsScaler()
e_cl = LabelEncoder()
col = ['visit_order', 'stay_time', 'device_brand', 'device_type',
       'wdevice_resolution', 'hdevice_resolution', 'network_type',
       'network_operator', 'location_gps_long', 'location_gps_lat',
       'extra_data']

df = pd.read_csv('out1.csv', engine='c')
df = df[col]
df = df.dropna()

# Encode the categorical columns as integers.
df['device_brand'] = e_cl.fit_transform(df['device_brand'].values)
df['device_type'] = e_cl.fit_transform(df['device_type'].values)
df['network_type'] = e_cl.fit_transform(df['network_type'].values)

# Coerce everything to numeric and drop rows that fail to parse.
df = df.apply(pd.to_numeric, errors='coerce')
df = df.dropna()

# Scale to [-1, 1] and keep the first 6000 rows.
x = mbs.fit_transform(df.values)
x = x[:6000]

digits_proj = TSNE(random_state=20150101).fit_transform(x)
pca_y = PCA(n_components=2).fit_transform(x)

plt.subplot(211)
plt.scatter(digits_proj[:, 0], digits_proj[:, 1])
plt.subplot(212)
plt.scatter(pca_y[:, 0], pca_y[:, 1])
plt.show()
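(Two design notes on this code: MaxAbsScaler scales every column into [-1, 1], which satisfies the normalization requirement mentioned in the PCA section, and the x[:6000] slice presumably exists to keep t-SNE's runtime manageable, since its cost grows quickly with the number of points.)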
The results are as follows:
The plots show that the two reductions differ considerably: the principal component analysis result exhibits clusters of user behavior, while the t-SNE result shows no obvious regularity in the behavior. In principle almost any high-dimensional dataset can be fed to t-SNE, but it is most widely used in image processing, NLP, genomic data, and speech processing. t-SNE is also time-consuming: it runs on a single machine, so when the amount of data is large the computation becomes particularly slow.
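One common way to soften the runtime problem (my suggestion, not something the original post does) is to compress the data with PCA first and run t-SNE on the compressed representation; scikit-learn's TSNE also accepts init='pca' for a PCA-based initialization. A sketch, continuing from the variables above:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 50 components is a conventional pre-reduction target, capped here by
# the actual feature count of x.
n_comp = min(50, x.shape[1])
x_small = PCA(n_components=n_comp).fit_transform(x)
proj = TSNE(random_state=20150101, init='pca').fit_transform(x_small)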
References:
http://tech.idcquan.com/78484.shtml
He Xiaoqun, Multivariate Statistical Analysis
http://blog.csdn.net/u012162613/article/details/45920827
http://blog.csdn.net/lzl1663515011/article/details/46328337
https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/