How to use PCA in scikit-learn



@ Author: wepon

@ Blog: http://blog.csdn.net/u012162613/article/details/42192293


In the previous article, Principal Component Analysis (PCA), I implemented the PCA algorithm with Python and NumPy, mainly to deepen my understanding of the algorithm; that implementation was rough, and in practice a mature package is generally used instead. This article summarizes how to use PCA in scikit-learn and the details that need attention. For more information, see sklearn.decomposition.PCA.


1. Function prototype and parameter description
sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)


Parameter description:

n_components:

Type: int or string. The default value is None.

Meaning: the number of principal components n to keep in the PCA algorithm, i.e. the number of features retained after dimensionality reduction. With the default None, all components are kept. If an int is given, for example n_components=1, the original data is reduced to one dimension. If the string 'mle' is given, the number of components n is selected automatically (scikit-learn uses Minka's MLE to infer the dimension).

copy:

Type: bool, True or False. The default value is True.

Meaning: whether to copy the original training data before running the algorithm. If True, the computation is performed on a copy of the data, so the original training data is unchanged after PCA runs. If False, the dimensionality-reduction computation is performed directly on the original data, so the original training data is modified.

whiten:

Type: bool. The default value is False.

Meaning: whitening, which rescales each component so that all components have the same (unit) variance. For more on whitening, see the UFLDL tutorial.
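Whitening is not demonstrated in the example later in this article, so here is a minimal sketch (the data is made up for illustration): with whiten=True the transformed components come out with roughly unit variance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 3) * [5.0, 1.0, 0.2]  # three features with very different scales

# Without whitening, each component keeps its own variance;
# with whiten=True, every component is rescaled to unit variance.
Xw = PCA(n_components=2, whiten=True).fit_transform(X)
print(Xw.std(axis=0))  # both values are close to 1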



2. Attributes of PCA objects
components_: the principal components, i.e. the directions of maximum variance in the training data.
explained_variance_ratio_: the percentage of the variance explained by each of the n retained components.
n_components_: the number of retained components n.
mean_: the per-feature empirical mean, estimated from the training data.
noise_variance_: the estimated noise variance, following the probabilistic PCA model.
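A quick sketch (fitting on a small random matrix, just for illustration) that prints these attributes after training:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(20, 3)
pca = PCA(n_components=2).fit(X)

print(pca.components_)                # 2 x 3 array, one row per retained component
print(pca.explained_variance_ratio_)  # fraction of variance each component explains
print(pca.n_components_)              # 2
print(pca.mean_)                      # per-feature mean of X
print(pca.noise_variance_)            # variance of the one discarded component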


3. Methods of PCA objects
  • fit(X, y=None)
fit() is a common method in scikit-learn: every estimator that needs training has a fit() method, which is the "training" step of the algorithm. Because PCA is an unsupervised learning algorithm, y is None here.
fit(X) trains the PCA model on the data X.
Return value: the object on which fit was called. For example, pca.fit(X) trains the pca object with X.
  • fit_transform(X)
Trains the PCA model with X and returns the dimensionality-reduced data: newX = pca.fit_transform(X), where newX is the reduced data.
  • inverse_transform()
Maps the reduced data back to the original space, X = pca.inverse_transform(newX); note that the result is an approximation of the original data whenever components have been discarded. A round-trip sketch follows this list.
  • transform(X)
Reduces the dimensionality of X. Once the model has been trained, transform can be applied to new input data.
In addition, there are methods such as get_covariance(), get_precision(), get_params(deep=True), and score(X, y=None), which you can look up as needed.
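The fit/transform/inverse_transform round trip can be seen in a minimal sketch (made-up 2-D points, similar to the example below):

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [0.9, 0.95], [2.0, 2.0], [3.0, 3.05], [4.0, 4.02]])

pca = PCA(n_components=1)
newX = pca.fit_transform(X)           # fit the model and reduce X to one dimension
X_back = pca.inverse_transform(newX)  # map back to the original 2-D space

# The reconstruction error is small because the two features are nearly identical.
print(np.abs(X - X_back).max())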

4. Example
Take a set of two-dimensional data as an example. The data is shown below: 12 samples (x, y) that are in fact points distributed near the straight line y = x, clustered around x = 1, 2, 3, and 4.
>>> data
array([[ 1.  ,  1.  ],
       [ 0.9 ,  0.95],
       [ 1.01,  1.03],
       [ 2.  ,  2.  ],
       [ 2.03,  2.06],
       [ 1.98,  1.89],
       [ 3.  ,  3.  ],
       [ 3.03,  3.05],
       [ 2.89,  3.1 ],
       [ 4.  ,  4.  ],
       [ 4.06,  4.02],
       [ 3.97,  4.01]])

This data set has two features, but since the two are approximately equal it can be represented by a single feature, i.e. reduced to one dimension. Let's see how to do this with the PCA package in sklearn.
(1) n_components is set to 1, and copy keeps its default value True. We can see that the raw data is unchanged, newData is one-dimensional, and the samples clearly fall into four clusters.
>>> from sklearn.decomposition import PCA
>>> pca=PCA(n_components=1)
>>> newData=pca.fit_transform(data)
>>> newData
array([[-2.12015916],
       [-2.22617682],
       [-2.09185561],
       [-0.70594692],
       [-0.64227841],
       [-0.79795758],
       [ 0.70826533],
       [ 0.76485312],
       [ 0.70139695],
       [ 2.12247757],
       [ 2.17900746],
       [ 2.10837406]])
>>> data
array([[ 1.  ,  1.  ],
       [ 0.9 ,  0.95],
       [ 1.01,  1.03],
       [ 2.  ,  2.  ],
       [ 2.03,  2.06],
       [ 1.98,  1.89],
       [ 3.  ,  3.  ],
       [ 3.03,  3.05],
       [ 2.89,  3.1 ],
       [ 4.  ,  4.  ],
       [ 4.06,  4.02],
       [ 3.97,  4.01]])

(2) Set copy to False, and the raw data is modified in place; what remains in data is the mean-centered version of the original values.
>>> pca=PCA(n_components=1,copy=False)
>>> newData=pca.fit_transform(data)
>>> data
array([[-1.48916667, -1.50916667],
       [-1.58916667, -1.55916667],
       [-1.47916667, -1.47916667],
       [-0.48916667, -0.50916667],
       [-0.45916667, -0.44916667],
       [-0.50916667, -0.61916667],
       [ 0.51083333,  0.49083333],
       [ 0.54083333,  0.54083333],
       [ 0.40083333,  0.59083333],
       [ 1.51083333,  1.49083333],
       [ 1.57083333,  1.51083333],
       [ 1.48083333,  1.50083333]])


(3) n_components is set to 'mle'; the dimension is selected automatically, and the data is indeed reduced to one dimension.
>>> pca=PCA(n_components='mle')
>>> newData=pca.fit_transform(data)
>>> newData
array([[-2.12015916],
       [-2.22617682],
       [-2.09185561],
       [-0.70594692],
       [-0.64227841],
       [-0.79795758],
       [ 0.70826533],
       [ 0.76485312],
       [ 0.70139695],
       [ 2.12247757],
       [ 2.17900746],
       [ 2.10837406]])

(4) Attribute values of the object
>>> pca.n_components
1
>>> pca.explained_variance_ratio_
array([ 0.99910873])
>>> pca.explained_variance_
array([ 2.55427003])
>>> pca.get_params
<bound method PCA.get_params of PCA(copy=True, n_components=1, whiten=False)>

The n_components value of the trained pca object is 1, i.e. one feature is retained. The variance of this feature is 2.55427003, which accounts for 0.99910873 of the total variance of all features, so almost all of the information is preserved. get_params() returns the values of the parameters; note that the session above omitted the parentheses, so Python printed the bound method itself instead of calling it.
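The ratio can be checked by hand: explained_variance_ratio_ is explained_variance_ divided by the total sample variance of the input features. A small sketch of that check, reusing data and pca from the session above:

import numpy as np

# Total sample variance of the two original features (ddof=1, as scikit-learn uses).
total_var = np.var(data, axis=0, ddof=1).sum()
print(pca.explained_variance_ / total_var)  # reproduces explained_variance_ratio_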
(5) Methods of the object
>>> newA=pca.transform(A)
For new data A, use the trained pca model to reduce its dimensionality.
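A is not defined in the session above; a minimal sketch, with A as a hypothetical array of new 2-D samples:

import numpy as np

# Hypothetical new samples (A is not defined in the original text).
A = np.array([[1.5, 1.52],
              [2.5, 2.48]])
newA = pca.transform(A)  # project A onto the component learned from data
print(newA)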
>>> pca.set_params(copy=False)
PCA(copy=False, n_components=1, whiten=False)
set_params() sets the parameters of the pca object.


