How to use PCA in scikit-learn



@ Author: wepon

@ Blog: http://blog.csdn.net/u012162613/article/details/42192293


In the previous article, Principal Component Analysis (PCA), I implemented the PCA algorithm with Python and NumPy, mainly to deepen my understanding of the algorithm; that implementation was rough, and in practice a mature package is generally used instead. This article summarizes how to use PCA in scikit-learn and the details that need attention. For more information, see sklearn.decomposition.PCA.


1. Function prototype and parameter description
sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)


Parameter description:

n_components:

Type: int or string. The default value is None.

Meaning: the number of principal components n to keep in the PCA algorithm, i.e. the number of features retained after dimensionality reduction. With the default None, all components are kept. If an int is given, for example n_components=1, the original data is reduced to one dimension. If the string 'mle' is given, the number of components n is selected automatically (scikit-learn uses Minka's MLE to infer the dimension).

copy:

Type: bool, True or False. The default value is True.

Meaning: whether to copy the original training data before running the algorithm. If True, the computation is performed on a copy of the data, so the original training data is unchanged after PCA runs. If False, the dimensionality-reduction computation is performed directly on the original data, so the original training data is modified.

whiten:

Type: bool. The default value is False.

Meaning: whitening, which rescales each component so that all components have the same (unit) variance. For more on whitening, see the UFLDL tutorial.
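Whitening is not demonstrated in the example later in this article, so here is a minimal sketch (the data is made up for illustration): with whiten=True the transformed components come out with roughly unit variance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 3) * [5.0, 1.0, 0.2]  # three features with very different scales

# Without whitening, each component keeps its own variance;
# with whiten=True, every component is rescaled to unit variance.
Xw = PCA(n_components=2, whiten=True).fit_transform(X)
print(Xw.std(axis=0))  # both values are close to 1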



2. Attributes of PCA objects
components_: the principal components, i.e. the directions of maximum variance in the training data.
explained_variance_ratio_: the percentage of the variance explained by each of the n retained components.
n_components_: the number of retained components n.
mean_: the per-feature empirical mean, estimated from the training data.
noise_variance_: the estimated noise variance, following the probabilistic PCA model.
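A quick sketch (fitting on a small random matrix, just for illustration) that prints these attributes after training:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(20, 3)
pca = PCA(n_components=2).fit(X)

print(pca.components_)                # 2 x 3 array, one row per retained component
print(pca.explained_variance_ratio_)  # fraction of variance each component explains
print(pca.n_components_)              # 2
print(pca.mean_)                      # per-feature mean of X
print(pca.noise_variance_)            # variance of the one discarded component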


3. Methods of PCA objects
  • fit(X, y=None)
fit() is a common method in scikit-learn: every estimator that needs training has a fit() method, which is the "training" step of the algorithm. Because PCA is an unsupervised learning algorithm, y is None here.
fit(X) trains the PCA model on the data X.
Return value: the object on which fit was called. For example, pca.fit(X) trains the pca object with X.
  • fit_transform(X)
Trains the PCA model with X and returns the dimensionality-reduced data: newX = pca.fit_transform(X), where newX is the reduced data.
  • inverse_transform()
Maps the reduced data back to the original space, X = pca.inverse_transform(newX); note that the result is an approximation of the original data whenever components have been discarded. A round-trip sketch follows this list.
  • transform(X)
Reduces the dimensionality of X. Once the model has been trained, transform can be applied to new input data.
In addition, there are methods such as get_covariance(), get_precision(), get_params(deep=True), and score(X, y=None), which you can look up as needed.
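The fit/transform/inverse_transform round trip can be seen in a minimal sketch (made-up 2-D points, similar to the example below):

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.0], [0.9, 0.95], [2.0, 2.0], [3.0, 3.05], [4.0, 4.02]])

pca = PCA(n_components=1)
newX = pca.fit_transform(X)           # fit the model and reduce X to one dimension
X_back = pca.inverse_transform(newX)  # map back to the original 2-D space

# The reconstruction error is small because the two features are nearly identical.
print(np.abs(X - X_back).max())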

4. Example
Take a set of two-dimensional data as an example. The data is shown below: 12 samples (x, y) that are in fact points distributed near the straight line y = x, clustered around x = 1, 2, 3, and 4.
>>> data
array([[ 1.  ,  1.  ],
       [ 0.9 ,  0.95],
       [ 1.01,  1.03],
       [ 2.  ,  2.  ],
       [ 2.03,  2.06],
       [ 1.98,  1.89],
       [ 3.  ,  3.  ],
       [ 3.03,  3.05],
       [ 2.89,  3.1 ],
       [ 4.  ,  4.  ],
       [ 4.06,  4.02],
       [ 3.97,  4.01]])

This data set has two features, but since the two are approximately equal it can be represented by a single feature, i.e. reduced to one dimension. Let's see how to do this with the PCA package in sklearn.
(1) n_components is set to 1, and copy keeps its default value True. We can see that the raw data is unchanged, newData is one-dimensional, and the samples clearly fall into four clusters.
>>> from sklearn.decomposition import PCA
>>> pca=PCA(n_components=1)
>>> newData=pca.fit_transform(data)
>>> newData
array([[-2.12015916],
       [-2.22617682],
       [-2.09185561],
       [-0.70594692],
       [-0.64227841],
       [-0.79795758],
       [ 0.70826533],
       [ 0.76485312],
       [ 0.70139695],
       [ 2.12247757],
       [ 2.17900746],
       [ 2.10837406]])
>>> data
array([[ 1.  ,  1.  ],
       [ 0.9 ,  0.95],
       [ 1.01,  1.03],
       [ 2.  ,  2.  ],
       [ 2.03,  2.06],
       [ 1.98,  1.89],
       [ 3.  ,  3.  ],
       [ 3.03,  3.05],
       [ 2.89,  3.1 ],
       [ 4.  ,  4.  ],
       [ 4.06,  4.02],
       [ 3.97,  4.01]])

(2) Set copy to False, and the raw data is modified in place; what remains in data is the mean-centered version of the original values.
>>> pca=PCA(n_components=1,copy=False)
>>> newData=pca.fit_transform(data)
>>> data
array([[-1.48916667, -1.50916667],
       [-1.58916667, -1.55916667],
       [-1.47916667, -1.47916667],
       [-0.48916667, -0.50916667],
       [-0.45916667, -0.44916667],
       [-0.50916667, -0.61916667],
       [ 0.51083333,  0.49083333],
       [ 0.54083333,  0.54083333],
       [ 0.40083333,  0.59083333],
       [ 1.51083333,  1.49083333],
       [ 1.57083333,  1.51083333],
       [ 1.48083333,  1.50083333]])


(3) n_components is set to 'mle'; the dimension is selected automatically, and the data is indeed reduced to one dimension.
>>> pca=PCA(n_components='mle')
>>> newData=pca.fit_transform(data)
>>> newData
array([[-2.12015916],
       [-2.22617682],
       [-2.09185561],
       [-0.70594692],
       [-0.64227841],
       [-0.79795758],
       [ 0.70826533],
       [ 0.76485312],
       [ 0.70139695],
       [ 2.12247757],
       [ 2.17900746],
       [ 2.10837406]])

(4) Attribute values of the object
>>> pca.n_components
1
>>> pca.explained_variance_ratio_
array([ 0.99910873])
>>> pca.explained_variance_
array([ 2.55427003])
>>> pca.get_params
<bound method PCA.get_params of PCA(copy=True, n_components=1, whiten=False)>

The n_components value of the trained pca object is 1, i.e. one feature is retained. The variance of this feature is 2.55427003, which accounts for 0.99910873 of the total variance of all features, so almost all of the information is preserved. get_params() returns the values of the parameters; note that the session above omitted the parentheses, so Python printed the bound method itself instead of calling it.
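The ratio can be checked by hand: explained_variance_ratio_ is explained_variance_ divided by the total sample variance of the input features. A small sketch of that check, reusing data and pca from the session above:

import numpy as np

# Total sample variance of the two original features (ddof=1, as scikit-learn uses).
total_var = np.var(data, axis=0, ddof=1).sum()
print(pca.explained_variance_ / total_var)  # reproduces explained_variance_ratio_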
(5) Methods of the object
>>> newA=pca.transform(A)
For new data A, use the trained pca model to reduce its dimensionality.
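A is not defined in the session above; a minimal sketch, with A as a hypothetical array of new 2-D samples:

import numpy as np

# Hypothetical new samples (A is not defined in the original text).
A = np.array([[1.5, 1.52],
              [2.5, 2.48]])
newA = pca.transform(A)  # project A onto the component learned from data
print(newA)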
>>> pca.set_params(copy=False)
PCA(copy=False, n_components=1, whiten=False)
set_params() sets the parameters of the pca object.


