[ZZ] Principal Components Analysis (PCA)


http://matlabdatamining.blogspot.com/2010/02/principal-components-analysis.html

This English-language blog post on Principal Components Analysis is very well written; it is reproduced here in full in case the original link stops working.

Principal Components Analysis Introduction

Real-world data sets usually exhibit relationships among their variables. These relationships are often linear, or at least approximately so, making them amenable to common analysis techniques. One such technique is Principal Components Analysis ("PCA"), which rotates the original data to new coordinates, making the data as "flat" as possible.

Given a table of variables, PCA generates a new table with the same number of variables, called the principal components. Each principal component is a linear transformation of the entire original data set. The coefficients of the principal components are calculated so that the first principal component contains the maximum variance (which we may tentatively think of as the "maximum information"). The second principal component is calculated to have the second most variance, and, importantly, is uncorrelated (in a linear sense) with the first principal component. Further principal components, if there are any, exhibit decreasing variance and are uncorrelated with all other principal components.

PCA is completely reversible (the original data may be recovered exactly from the principal components), making it a versatile tool, useful for data reduction, noise rejection, visualization and data compression, among other things. This article walks through the specific mechanics of calculating the principal components of a data set in MATLAB, using either the MATLAB Statistics Toolbox, or just the base MATLAB product.


Performing Principal Components Analysis


Performing PCA will be illustrated using the following data set, which consists of 3 measurements taken of a particular subject over time:


>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];


We determine the size of this data set thus:


>> [n m] = size(A)

n =

15


m =

3


To summarize the data, we calculate the sample mean vector and the sample standard deviation vector:


>> Amean = mean(A)

Amean =

269.9733 38.9067 50.4800

>> Astd = std(A)

Astd =

1.7854 0.3751 0.3144


Most often, the first step in PCA is to standardize the data. Here, "standardization" means subtracting the sample mean from each observation, then dividing by the sample standard deviation. This centers and scales the data. Sometimes there are good reasons for modifying or not performing this step, but I will recommend that you standardize unless you have a good reason not to. This is easy to perform, as follows:


>> B = (A - repmat(Amean,[n 1])) ./ repmat(Astd,[n 1])

B =

   -0.0971   -0.0178    0.0636
    1.3591    1.5820   -1.5266
    0.0149   -0.0178    0.0636
    1.1351    1.0487   -0.8905
   -0.0971   -0.0178    0.0636
   -0.0971   -0.0178    0.0636
   -0.9932   -0.8177   -0.8905
   -0.9932   -0.8177    1.0178
   -1.6653   -1.8842    1.9719
   -1.2173   -1.3509    1.6539
    2.0312    1.8486   -1.5266
    0.6870    0.5155   -0.2544
   -0.0971   -0.0178    0.0636
    0.0149   -0.0178    0.0636
    0.0149   -0.0178    0.0636


This calculation can also be carried out using the zscore function from the Statistics Toolbox:


>> B = zscore(A)

B =

   -0.0971   -0.0178    0.0636
    1.3591    1.5820   -1.5266
    0.0149   -0.0178    0.0636
    1.1351    1.0487   -0.8905
   -0.0971   -0.0178    0.0636
   -0.0971   -0.0178    0.0636
   -0.9932   -0.8177   -0.8905
   -0.9932   -0.8177    1.0178
   -1.6653   -1.8842    1.9719
   -1.2173   -1.3509    1.6539
    2.0312    1.8486   -1.5266
    0.6870    0.5155   -0.2544
   -0.0971   -0.0178    0.0636
    0.0149   -0.0178    0.0636
    0.0149   -0.0178    0.0636


Calculating the coefficients of the principal components and their respective variances is done by finding the eigenvectors and eigenvalues of the sample covariance matrix:


>> [V D] = eig(cov(B))

V =

    0.6505    0.4874   -0.5825
   -0.7507    0.2963   -0.5904
   -0.1152    0.8213    0.5587


D =

    0.0066         0         0
         0    0.1809         0
         0         0    2.8125


The matrix V contains the coefficients of the principal components. The diagonal elements of D store the variances of the respective principal components. We can extract the diagonal like this:


>> diag(D)

ans =

0.0066
0.1809
2.8125


The coefficients and respective variances of the principal components can also be found using the princomp function from the Statistics Toolbox:


>> [coeff score latent] = princomp(B)

coeff =

    0.5825   -0.4874    0.6505
    0.5904   -0.2963   -0.7507
   -0.5587   -0.8213   -0.1152


score =

   -0.1026    0.0003   -0.0571
    2.5786    0.1226   -0.1277
   -0.0373   -0.0543    0.0157
    1.7779   -0.1326    0.0536
   -0.1026    0.0003   -0.0571
   -0.1026    0.0003   -0.0571
   -0.5637    1.4579    0.0704
   -1.6299   -0.1095   -0.1495
   -3.1841   -0.2496    0.1041
   -2.4306   -0.3647    0.0319
    3.1275   -0.2840    0.1093
    0.8467   -0.2787    0.0892
   -0.1026    0.0003   -0.0571
   -0.0373   -0.0543    0.0157
   -0.0373   -0.0543    0.0157


latent =

2.8125
0.1809
0.0066


Note three important things about the above:

1. The order of the principal components from princomp is the opposite of that from eig(cov(B)). princomp orders the principal components so that the first one appears in column 1, whereas eig(cov(B)) stores it in the last column (see the sketch after this list).

2. Some of the coefficients from each method have the opposite sign. This is fine: there is no "natural" orientation for principal components, so you can expect different software to produce different mixes of signs.

3. score contains the actual principal components, as calculated by princomp.
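
For reference, here is a minimal sketch (my addition, not from the original post; the names lambda, order and Vsorted are mine) of how the eig output can be reordered to match the princomp convention:


>> [V D] = eig(cov(B));
>> [lambda, order] = sort(diag(D), 'descend');  % variances, largest first
>> Vsorted = V(:, order)                        % coefficient columns reordered to match coeff


Up to a possible sign flip in each column, Vsorted should match coeff, and lambda should match latent.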

To calculate the principal components without princomp, simply multiply the standardized data by the matrix of principal component coefficients:


>> B * coeff

ans =

   -0.1026    0.0003   -0.0571
    2.5786    0.1226   -0.1277
   -0.0373   -0.0543    0.0157
    1.7779   -0.1326    0.0536
   -0.1026    0.0003   -0.0571
   -0.1026    0.0003   -0.0571
   -0.5637    1.4579    0.0704
   -1.6299   -0.1095   -0.1495
   -3.1841   -0.2496    0.1041
   -2.4306   -0.3647    0.0319
    3.1275   -0.2840    0.1093
    0.8467   -0.2787    0.0892
   -0.1026    0.0003   -0.0571
   -0.0373   -0.0543    0.0157
   -0.0373   -0.0543    0.0157


To reverse this transformation, simply multiply by the transpose of the coefficient matrix. (This works because the coefficient matrix is orthonormal, so its transpose is its inverse.)


>> (B * coeff) * coeff'

ans =

   -0.0971   -0.0178    0.0636
    1.3591    1.5820   -1.5266
    0.0149   -0.0178    0.0636
    1.1351    1.0487   -0.8905
   -0.0971   -0.0178    0.0636
   -0.0971   -0.0178    0.0636
   -0.9932   -0.8177   -0.8905
   -0.9932   -0.8177    1.0178
   -1.6653   -1.8842    1.9719
   -1.2173   -1.3509    1.6539
    2.0312    1.8486   -1.5266
    0.6870    0.5155   -0.2544
   -0.0971   -0.0178    0.0636
    0.0149   -0.0178    0.0636
    0.0149   -0.0178    0.0636
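
As a quick sanity check (my addition, not part of the original walkthrough), we can confirm the orthonormality of the coefficient matrix that makes this inversion work:


>> norm(coeff' * coeff - eye(m))   % should be ~0, up to floating-point round-off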


Finally, to get back to the original data, multiply each observation by the sample standard deviation vector and add the mean vector:


>> ((B * coeff) * coeff') .* repmat(Astd,[n 1]) + repmat(Amean,[n 1])

ans =

269.8000 38.9000 50.5000
272.4000 39.5000 50.0000
270.0000 38.9000 50.5000
272.0000 39.3000 50.2000
269.8000 38.9000 50.5000
269.8000 38.9000 50.5000
268.2000 38.6000 50.2000
268.2000 38.6000 50.8000
267.0000 38.2000 51.1000
267.8000 38.4000 51.0000
273.6000 39.6000 50.0000
271.2000 39.1000 50.4000
269.8000 38.9000 50.5000
270.0000 38.9000 50.5000
270.0000 38.9000 50.5000


This completes the round trip: from the original data, to the principal components, and back to the original data. In some applications, the principal components are modified before the return trip.

Let's consider what we've gained by moving to the principal component coordinate system. First, more variance has indeed been squeezed into the first principal component, which we can see by taking the sample variance of the principal components:


>> var(score)

ans =

2.8125 0.1809 0.0066


The cumulative variance contained in the first several principal components can easily be calculated thus:


>> cumsum(var(score)) / sum(var(score))

ans =

0.9375 0.9978 1.0000
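
A common follow-on step (my addition, not from the original post; threshold and k are names of my choosing) is to pick the smallest number of components that captures some target fraction of the variance:


>> threshold = 0.95;                                       % keep 95% of the variance
>> k = find(cumsum(var(score)) / sum(var(score)) >= threshold, 1)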


Interestingly, in this case the first principal component contains nearly 94% of the variance of the original table. A lossy data compression scheme which discarded the second and third principal components would compress 3 variables into 1, while losing only about 6% of the variance.
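
As a sketch of such a scheme (my addition; the names k, scoreK and Aapprox are not from the original post), we can zero out the discarded components before making the return trip described above:


>> k = 1;                                % number of principal components to keep
>> scoreK = score;
>> scoreK(:, k+1:end) = 0;               % discard the second and third components
>> Aapprox = (scoreK * coeff') .* repmat(Astd,[n 1]) + repmat(Amean,[n 1]);
>> max(abs(Aapprox(:) - A(:)))           % worst-case reconstruction error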

The other important thing to note about the principal components is that they are completely uncorrelated (as measured by the usual Pearson correlation), which we can test by calculating their correlation matrix:


>> corrcoef(score)

ans =

    1.0000   -0.0000    0.0000
   -0.0000    1.0000   -0.0000
    0.0000   -0.0000    1.0000


Discussion

PCA "squeezes" as much information (as measured by variance) as possible to the first principal components. In some cases the number of principal, needed to store, the vast majority of variance is shockingly small:a trem Endous feat of data manipulation. This transformation can being performed quickly on contemporary hardware and are invertible, permitting any number of useful a Pplications.

For the most part, PCA really is as wonderful as it seems. There are a few caveats, however:

1. PCA doesn't always work well, in terms of compressing the variance. Sometimes variables just aren't related in a way which is easily exploited by PCA. This means that all, or nearly all, of the principal components will be needed to capture the multivariate variance in the data, making the use of PCA moot.

2. Variance may not be what we want condensed into a few variables. For example, if we are using PCA to reduce data for predictive model construction, then it is not necessarily the case that the first principal components yield a better model than the last principal components (though it often works out more or less that way).

3. PCA is built from components, such as the sample covariance, which are not statistically robust. This means that PCA may be thrown off by outliers and other data pathologies. How seriously this affects the results is specific to the data and application.

4. Though PCA can cram much of the variance in a data set into fewer variables, it still requires all of the variables to generate the principal components of future observations. Note that this is true regardless of how many principal components are retained for the application (see the sketch below). PCA is not a subset selection procedure, and this can have important logistical implications.
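
To make the last caveat concrete, here is a minimal sketch (the new observation Xnew is hypothetical, not from the original post) of projecting a future observation onto the principal components learned above. All 3 original variables are required, and the training-set mean and standard deviation must be reused:


>> Xnew = [270.5 39.0 50.4];        % hypothetical new observation: all variables needed
>> Bnew = (Xnew - Amean) ./ Astd;   % standardize with the *training* mean and std
>> scoreNew = Bnew * coeff;         % project onto the principal components
>> scoreNew(1)                      % e.g., keep only the first principal component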
