ML: Dimensionality reduction algorithm - PCA


PCA (Principal Component Analysis), also known as the Karhunen-Loève transform, is a technique used to explore the structure of high-dimensional data. PCA is often used for the exploration and visualization of high-dimensional datasets, and can also be used for data compression, data preprocessing, and so on. PCA combines high-dimensional variables that may be correlated into linearly independent low-dimensional variables called principal components. The new low-dimensional data set preserves as much of the variance of the original data as possible. PCA projects the data into a lower-dimensional subspace to achieve dimensionality reduction. For example, a two-dimensional dataset can be reduced to a single line, so that each sample in the dataset is represented by one value instead of two; a three-dimensional dataset can be reduced to two dimensions, i.e. the variables are mapped onto a plane. In general, an n-dimensional dataset can be reduced by mapping it onto a k-dimensional subspace (k < n).

Directory:

    • Dimensionality reduction problem
    • stats::princomp
    • stats::prcomp
    • Other R packages

Vector representation and dimensionality reduction of data

In data mining and machine learning, data is represented as vectors. For example, suppose daily commodity transactions are recorded in the following format:

    • (date, page views, number of visitors, number of orders, number of deals, transaction amount)

where "date" is a record flag, not a measure, and data mining is mostly concerned with metrics, so if we omit the date field, we get a set of records that each record can be represented as a five-dimensional vector, one of which looks something like this:

    • $(500, 240, 25, 13, 2312.15)^\mathsf{T}$

Of course, this group of five-dimensional vectors can be analyzed and mined as-is, but we know that the complexity of many machine learning algorithms is closely related to the dimensionality of the data, sometimes even exponentially. Five dimensions may not seem like much, but in real machine learning it is not uncommon to deal with thousands or even hundreds of thousands of dimensions, in which case the resource consumption of machine learning becomes unacceptable and the data must be reduced in dimensionality. Dimensionality reduction naturally means loss of information, but since real data are often correlated, we can look for ways to reduce the loss of information while reducing the dimension.

From the transaction data above, experience tells us that "page views" and "number of visitors" tend to be strongly correlated, as do "number of orders" and "number of deals". Intuitively, when a day's page views are high (or low), we should largely expect the number of visitors on that day to be high (or low) as well. This suggests that if we delete either the page-view metric or the visitor metric, we should not lose too much information, so we can remove one of them to reduce the complexity of the machine learning algorithm. This is a simple, intuitive description of dimensionality reduction, which helps explain its motivation and feasibility, but it offers no operational guidance. For example, which column should we delete to minimize the loss of information? Or, instead of simply deleting a few columns, should we transform the original data into fewer columns while minimizing the information loss? How do we measure the amount of information lost? How do we determine the concrete dimensionality reduction steps from the original data?

PCA is a dimensionality reduction method with a rigorous mathematical basis that has been widely adopted. For the principle of PCA and the derivation of the algorithm, refer to the following material:

    • http://blog.jobbole.com/109015/
    • http://www.360doc.com/content/13/1124/02/9482_331688889.shtml

The PCA algorithm can be divided into 6 main steps (a minimal R sketch of these steps follows this list):

    1. Arrange the original data into an n*p matrix X, with one sample per row and one attribute field (variable) per column
    2. Zero-center each column of X (subtract each variable's mean)
    3. Compute the covariance matrix C of the centered data
    4. Compute the eigenvalues and corresponding eigenvectors of the covariance matrix
    5. Arrange the eigenvectors as columns, ordered by their eigenvalues from largest to smallest, and take the first k columns to form the matrix P
    6. Y = XP is the data after dimensionality reduction to k dimensions
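
The steps above can be reproduced directly with base R. The snippet below is only an illustrative sketch on the iris measurements (the variable names are chosen for illustration), not the implementation used by princomp()/prcomp():

# Step 1: n x p data matrix (one sample per row, one variable per column)
X <- as.matrix(iris[, 1:4])
# Step 2: zero-center each column (variable)
X <- scale(X, center = TRUE, scale = FALSE)
# Step 3: covariance matrix
C <- cov(X)
# Step 4: eigenvalues and eigenvectors (eigen() returns them in decreasing order)
e <- eigen(C)
# Step 5: take the first k eigenvectors as the columns of P
k <- 2
P <- e$vectors[, 1:k]
# Step 6: project the data onto the first k principal components
Y <- X %*% P
head(Y)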

stats::princomp

Usage: princomp(x, cor = FALSE, scores = TRUE, ...)

    • cor: logical; TRUE means the principal components are computed from the correlation matrix, FALSE means the covariance matrix is used
    • scores: logical; indicates whether the principal component scores are calculated

Return value:

    • loadings: a matrix whose columns are the eigenvectors, i.e. the rotation factors of the original features
    • scores: the scores of the supplied data on each principal component

Other functions (a combined usage sketch follows this list):

    • princomp(): principal component analysis, computed from either the correlation matrix or the covariance matrix
    • summary(): extracts the principal component information
    • loadings(): displays the loadings from a principal component analysis or factor analysis
    • predict(): predicts the principal component scores; predict(object, newdata, ...), where object is the object produced by princomp() and newdata is a data frame of values to predict for; when newdata is omitted, the scores of the existing data are returned
    • screeplot(): draws the scree plot of the principal components
    • biplot(): draws a scatter plot of the data on the principal components together with the original coordinate axes in the principal component directions
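
As a combined illustration of these helpers, the following short sketch runs them on the iris data (the object name pr is chosen for illustration):

pr <- princomp(iris[, 1:4], cor = TRUE)
summary(pr)                     # variance contribution of each component
loadings(pr)                    # loadings (eigenvectors)
head(predict(pr))               # principal component scores of the existing data
screeplot(pr, type = "lines")   # scree plot
biplot(pr)                      # observations and variable axes on the first two components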

Determining the number of principal components (see the sketch after this list):

    • Contribution rate: the proportion of the total variance accounted for by the variance of the i-th principal component, reflecting how much of the original information the i-th principal component captures
    • Cumulative contribution rate: the proportion of the total variance accounted for by the first k principal components
    • Determining the number of principal components: choose k such that the cumulative contribution rate > 0.85
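
In code, this rule is just a cumulative sum over the variance contribution rates; a minimal sketch (using the 0.85 threshold above) might look like this:

pr <- princomp(iris[, 1:4], cor = TRUE)
vars <- pr$sdev^2                  # variances (eigenvalues) of the components
contrib <- vars / sum(vars)        # contribution rate of each component
cum_contrib <- cumsum(contrib)     # cumulative contribution rate
k <- which(cum_contrib > 0.85)[1]  # smallest k exceeding the threshold
k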

Example

> library(stats)
> test <- iris[, 1:4]
> data.pr <- princomp(test, cor = TRUE)
> summary(data.pr, loadings = TRUE)
Importance of components:
                          Comp.1    Comp.2     Comp.3      Comp.4
Standard deviation     1.7083611 0.9560494 0.38308860 0.143926497
Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
Cumulative Proportion  0.7296245 0.9581321 0.99482129 1.000000000

Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.521 -0.377  0.720  0.261
Sepal.Width  -0.269 -0.923 -0.244 -0.124
Petal.Length  0.580        -0.142 -0.801
Petal.Width   0.565        -0.634  0.524

Here "Standard deviation" squared is the variance, i.e. the eigenvalue; "Proportion of Variance" is the variance contribution rate; "Cumulative Proportion" is the cumulative variance contribution rate. The results show that the cumulative contribution rate of the first two principal components already reaches 96%, so the other two principal components can be dropped to achieve dimensionality reduction.

As we can see from summary(), principal component analysis generates four new variables for us; the first explains 73% of the variance of the original data and the second explains another 23%. Together these two already explain 96% of the information in the original data, so we can use these two new variables in place of the original four variables.

Principal component scree plot: screeplot(data.pr, type = "lines")

The scree plot shows that the curve flattens out from the second principal component onward, so the first two principal components can be selected for analysis.

Calculate the reduced-dimension data set according to steps 5 and 6 of the algorithm above:

> newfeature <- as.matrix(test) %*% as.matrix(data.pr$loadings[, 1:2])
> head(newfeature)
       Comp.1    Comp.2
[1,] 2.640270 -5.204041
[2,] 2.670730 -4.666910
[3,] 2.454606 -4.773636
[4,] 2.545517 -4.648463
[5,] 2.561228 -5.258629
[6,] 2.975946 -5.707321

Question: what is the relationship between the principal component scores obtained through predict() and the reduced data set newfeature computed above? (A possible explanation is sketched after the output below.)

> p <- predict(data.pr)
> head(p)
        Comp.1     Comp.2      Comp.3      Comp.4
[1,] -2.264703 -0.4800266  0.12770602  0.02416820
[2,] -2.080961  0.6741336  0.23460885  0.10300677
[3,] -2.364229  0.3419080 -0.04420148  0.02837705
[4,] -2.299384  0.5973945 -0.09129011 -0.06595556
[5,] -2.389842 -0.6468354 -0.01573820 -0.03592281
[6,] -2.075631 -1.4891775 -0.02696829  0.00660818
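
A likely explanation (a sketch, not stated in the original text): predict() first centers and scales the data using the center and scale values stored in the princomp object (scaling is applied here because cor = TRUE was used) and then multiplies by all of the loadings, whereas newfeature above multiplies the raw, unstandardized data by only the first two loading columns. Reproducing predict() by hand:

# Standardize with the stored center/scale, then project onto the loadings
# (assumes test and data.pr from the example above)
scores_manual <- scale(as.matrix(test),
                       center = data.pr$center,
                       scale  = data.pr$scale) %*% data.pr$loadings
head(scores_manual)   # should match head(predict(data.pr))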

stats::prcomp

prcomp() and princomp() differ in method: the former uses singular value decomposition of the observation matrix, while the latter uses eigenvalue decomposition of the correlation (or covariance) matrix. The output, including eigenvalues, loadings and principal component scores, is essentially similar. Related terminology:

    • Observation: the number of records observed, i.e. the number of rows in the data set
    • Variable: the number of factors, i.e. the number of columns
    • R mode: analysis based on the variables, i.e. across all observations, studying the relationships among the variables (which are dominant, which are typical)
    • Q mode: analysis based on the observations, i.e. across all variables, studying the relationships among the observations (which records are similar to each other)

Usage: prcomp(formula, data = NULL, subset, na.action, ...)

    • formula: a formula with no response variable, indicating the columns of the data frame to be used in the analysis (see the sketch after this list)
    • data: the data frame containing the variables specified in formula
    • subset: an optional vector specifying the observations to be used in the analysis
    • na.action: a function specifying how missing values are handled
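
A short sketch of the formula interface (columns taken from iris for illustration):

# Formula method: no response variable on the left-hand side
prc_f <- prcomp(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris, na.action = na.omit)
summary(prc_f)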

prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)

    • x: in the default method, the numeric or complex matrix to be analyzed
    • retx: logical; whether to return the rotated variables (the scores)
    • center: logical; whether to center the variables
    • scale.: logical; whether to scale the variables to unit variance
    • tol: numeric tolerance; components whose standard deviation falls below this value (relative to the first component) are omitted

Example:

> prc <- prcomp(iris[, 1:4])
> summary(prc)
Importance of components:
                          PC1     PC2    PC3     PC4
Standard deviation     2.0563 0.49262 0.2797 0.15439
Proportion of Variance 0.9246 0.05307 0.0171 0.00521
Cumulative Proportion  0.9246 0.97769 0.9948 1.00000
> prc$rotation
                     PC1         PC2         PC3        PC4
Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574
> newfeature <- as.matrix(iris[, 1:4]) %*% as.matrix(prc$rotation[, 1:2])
> head(newfeature)
          PC1       PC2
[1,] 2.818240 -5.646350
[2,] 2.788223 -5.149951
[3,] 2.613375 -5.182003
[4,] 2.757022 -5.008654
[5,] 2.773649 -5.653707
[6,] 3.221505 -6.068283
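
Note that this example runs prcomp() on the raw data (centered but not scaled), which is why the contribution rates differ from the earlier princomp(cor = TRUE) results; setting scale. = TRUE standardizes the variables and makes the two approaches comparable. A minimal sketch:

prc_s <- prcomp(iris[, 1:4], scale. = TRUE)  # comparable to princomp(..., cor = TRUE)
summary(prc_s)        # contribution rates now match the princomp example above
head(prc_s$x[, 1:2])  # scores on the first two principal components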

Other R Packages

    • pcomp() in the SciViews package
    • principal() in the psych package
    • pca() in the labdsv package
    • pca() in the ade4 package
    • PCA() in the FactoMineR package
    • pca() in the rrcov package
    • pca() in the seacarb package

