http://matlabdatamining.blogspot.com/2010/02/principal-components-analysis.html
An English-language blog post on principal components analysis. It is very well written; in case the link stops working, the full text is reproduced here.
Principal Components Analysis
Introduction
Real-world data sets usually exhibit relationships among their variables. These relationships are often linear, or at least approximately so, making them amenable to common analysis techniques. One such technique is principal component analysis ("PCA"), which rotates the original data to new coordinates, making the data as "flat" as possible.
Given a table of variables, PCA generates a new table with the same number of variables, called the principal components. Each principal component is a linear transformation of the entire original data set. The coefficients of the principal components are calculated so that the first principal component contains the maximum variance (which we may tentatively think of as the "maximum information"). The second principal component is calculated to have the second most variance and, importantly, is uncorrelated (in a linear sense) with the first principal component. Further principal components, if there are any, exhibit decreasing variance and are uncorrelated with all other principal components.
PCA is completely reversible (the original data can be recovered exactly from the principal components), making it a versatile tool, useful for data reduction, noise rejection, visualization and data compression, among other things. This article walks through the specific mechanics of calculating the principal components of a data set in MATLAB, using either the MATLAB Statistics Toolbox or just the base MATLAB product.
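In symbols, the construction described above can be sketched as follows. The notation is chosen here for illustration and is not from the original post: B is the centered (here, standardized) n-by-m data table, S its sample covariance matrix, V and Λ the eigenvectors and eigenvalues of S, and Z the table of principal components.

```latex
\begin{aligned}
S &= \tfrac{1}{n-1}\, B^{\top} B, \qquad
S = V \Lambda V^{\top}, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0, \\
Z &= B V, \qquad
\operatorname{cov}(Z) = V^{\top} S V = \Lambda, \\
B &= Z V^{\top} \qquad \text{(because } V V^{\top} = I\text{)}.
\end{aligned}
```

Because Λ is diagonal, the principal components are uncorrelated and their variances are the eigenvalues; because V is orthogonal, the transformation can be undone exactly.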
Performing Principal Components Analysis
Performing PCA will be illustrated using the following data set, which consists of 3 measurements taken of a particular subject over time:
>> A = [269.8 38.9 50.5
272.4 39.5 50.0
270.0 38.9 50.5
272.0 39.3 50.2
269.8 38.9 50.5
269.8 38.9 50.5
268.2 38.6 50.2
268.2 38.6 50.8
267.0 38.2 51.1
267.8 38.4 51.0
273.6 39.6 50.0
271.2 39.1 50.4
269.8 38.9 50.5
270.0 38.9 50.5
270.0 38.9 50.5
];
We determine the size of this data set thus:
>> [n m] = size(A)
n =
15
m =
3
To summarize the data, we calculate the sample mean vector and the sample standard deviation vector:
>> Amean = mean(A)
Amean =
269.9733 38.9067 50.4800
>> Astd = std(A)
Astd =
1.7854 0.3751 0.3144
Most often, the first step in PCA is to standardize the data. Here, "standardization" means subtracting the sample mean from each observation and then dividing by the sample standard deviation. This centers and scales the data. Sometimes there are good reasons for modifying or not performing this step, but I recommend that you standardize unless you have a good reason not to. This is easy to perform, as follows:
>> B = (A - repmat(Amean,[n 1])) ./ repmat(Astd,[n 1])
B =
-0.0971 -0.0178  0.0636
 1.3591  1.5820 -1.5266
 0.0149 -0.0178  0.0636
 1.1351  1.0487 -0.8905
-0.0971 -0.0178  0.0636
-0.0971 -0.0178  0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177  1.0178
-1.6653 -1.8842  1.9719
-1.2173 -1.3509  1.6539
 2.0312  1.8486 -1.5266
 0.6870  0.5155 -0.2544
-0.0971 -0.0178  0.0636
 0.0149 -0.0178  0.0636
 0.0149 -0.0178  0.0636
This calculation can also be carried out using the zscore function from the Statistics Toolbox:
>> B = zscore(A)
B =
-0.0971 -0.0178  0.0636
 1.3591  1.5820 -1.5266
 0.0149 -0.0178  0.0636
 1.1351  1.0487 -0.8905
-0.0971 -0.0178  0.0636
-0.0971 -0.0178  0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177  1.0178
-1.6653 -1.8842  1.9719
-1.2173 -1.3509  1.6539
 2.0312  1.8486 -1.5266
 0.6870  0.5155 -0.2544
-0.0971 -0.0178  0.0636
 0.0149 -0.0178  0.0636
 0.0149 -0.0178  0.0636
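A quick sanity check on the standardized data can be sketched using only base MATLAB functions. After standardization every column should have mean 0 and standard deviation 1, and the covariance matrix of the standardized data should equal the correlation matrix of the raw data, which is why PCA on standardized data is often described as PCA on the correlation matrix. The names colMeans, colStds and maxDiff are illustrative and do not appear in the original post:

```matlab
% Data set from the article (15 observations, 3 variables)
A = [269.8 38.9 50.5; 272.4 39.5 50.0; 270.0 38.9 50.5; 272.0 39.3 50.2; 269.8 38.9 50.5; ...
     269.8 38.9 50.5; 268.2 38.6 50.2; 268.2 38.6 50.8; 267.0 38.2 51.1; 267.8 38.4 51.0; ...
     273.6 39.6 50.0; 271.2 39.1 50.4; 269.8 38.9 50.5; 270.0 38.9 50.5; 270.0 38.9 50.5];
n = size(A, 1);

% Standardize: subtract column means, divide by column standard deviations
B = (A - repmat(mean(A), [n 1])) ./ repmat(std(A), [n 1]);

colMeans = mean(B);   % each entry should be 0 up to rounding error
colStds  = std(B);    % each entry should be 1 up to rounding error

% Covariance of the standardized data vs. correlation of the raw data
maxDiff = max(max(abs(cov(B) - corrcoef(A))));   % should be close to 0
```

If maxDiff is not close to zero, the standardization step was probably applied along the wrong dimension or with a different normalization than the sample standard deviation.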
Calculating the coefficients of the principal components and their respective variances is done by finding the eigenvectors and eigenvalues of the sample covariance matrix:
>> [V D] = eig(cov(B))
V =
 0.6505  0.4874 -0.5825
-0.7507  0.2963 -0.5904
-0.1152  0.8213  0.5587
D =
0.0066 0 0
0 0.1809 0
0 0 2.8125
The matrix V contains the coefficients for the principal components. The diagonal elements of D store the variances of the respective principal components. We can extract the diagonal like this:
>> diag(D)
ans =
0.0066
0.1809
2.8125
The coefficients and respective variances of the principal components can also be found using the princomp function from the Statistics Toolbox:
>> [COEFF SCORE latent] = princomp(B)
COEFF =
 0.5825 -0.4874  0.6505
 0.5904 -0.2963 -0.7507
-0.5587 -0.8213 -0.1152
SCORE =
-0.1026  0.0003 -0.0571
 2.5786  0.1226 -0.1277
-0.0373 -0.0543  0.0157
 1.7779 -0.1326  0.0536
-0.1026  0.0003 -0.0571
-0.1026  0.0003 -0.0571
-0.5637  1.4579  0.0704
-1.6299 -0.1095 -0.1495
-3.1841 -0.2496  0.1041
-2.4306 -0.3647  0.0319
 3.1275 -0.2840  0.1093
 0.8467 -0.2787  0.0892
-0.1026  0.0003 -0.0571
-0.0373 -0.0543  0.0157
-0.0373 -0.0543  0.0157
latent =
2.8125
0.1809
0.0066
Note three important things about the above:
1. The order of the principal components from princomp is opposite of that from eig(cov(B)). princomp orders the principal components so that the first one appears in column 1, whereas eig(cov(B)) stores it in the last column.
2. Some of the coefficients from each method have the opposite sign. This is fine: there is no "natural" orientation for principal components, so you can expect different software to produce different mixes of signs.
3. SCORE contains the actual principal components, as calculated by princomp.
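The ordering and sign differences noted above can be handled in base MATLAB with a few extra lines. The following is a minimal sketch, not the toolbox's own implementation: it sorts the eigenvalues in descending order and then applies one possible sign convention (making the largest-magnitude coefficient in each column positive). That convention, and the names coeff, latent, order, idx and signs, are assumptions made here for illustration:

```matlab
% Data set from the article (15 observations, 3 variables)
A = [269.8 38.9 50.5; 272.4 39.5 50.0; 270.0 38.9 50.5; 272.0 39.3 50.2; 269.8 38.9 50.5; ...
     269.8 38.9 50.5; 268.2 38.6 50.2; 268.2 38.6 50.8; 267.0 38.2 51.1; 267.8 38.4 51.0; ...
     273.6 39.6 50.0; 271.2 39.1 50.4; 269.8 38.9 50.5; 270.0 38.9 50.5; 270.0 38.9 50.5];
n = size(A, 1);
B = (A - repmat(mean(A), [n 1])) ./ repmat(std(A), [n 1]);

% Eigen-decomposition of the sample covariance matrix
[V, D] = eig(cov(B));

% Reorder so the component with the largest variance comes first
[latent, order] = sort(diag(D), 'descend');
coeff = V(:, order);

% Illustrative sign convention: flip each column so that its
% largest-magnitude coefficient is positive
m = size(coeff, 2);
[~, idx] = max(abs(coeff), [], 1);
signs = sign(coeff(sub2ind(size(coeff), idx, 1:m)));
coeff = coeff .* repmat(signs, [size(coeff, 1) 1]);

% Principal components in the reordered, sign-fixed coordinate system
score = B * coeff;
```

Flipping the sign of a column of coeff flips the sign of the matching column of score, but leaves the variances and the reconstruction of B unchanged, so any consistent convention is acceptable.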
To calculate the principal components without princomp, simply multiply the standardized data by the principal component coefficients:
>> B * COEFF
ans =
-0.1026  0.0003 -0.0571
 2.5786  0.1226 -0.1277
-0.0373 -0.0543  0.0157
 1.7779 -0.1326  0.0536
-0.1026  0.0003 -0.0571
-0.1026  0.0003 -0.0571
-0.5637  1.4579  0.0704
-1.6299 -0.1095 -0.1495
-3.1841 -0.2496  0.1041
-2.4306 -0.3647  0.0319
 3.1275 -0.2840  0.1093
 0.8467 -0.2787  0.0892
-0.1026  0.0003 -0.0571
-0.0373 -0.0543  0.0157
-0.0373 -0.0543  0.0157
To reverse this transformation, simply multiply by the transpose of the coefficient matrix:
>> (B * COEFF) * COEFF'
ans =
-0.0971 -0.0178  0.0636
 1.3591  1.5820 -1.5266
 0.0149 -0.0178  0.0636
 1.1351  1.0487 -0.8905
-0.0971 -0.0178  0.0636
-0.0971 -0.0178  0.0636
-0.9932 -0.8177 -0.8905
-0.9932 -0.8177  1.0178
-1.6653 -1.8842  1.9719
-1.2173 -1.3509  1.6539
 2.0312  1.8486 -1.5266
 0.6870  0.5155 -0.2544
-0.0971 -0.0178  0.0636
 0.0149 -0.0178  0.0636
 0.0149 -0.0178  0.0636
Finally, to get back to the original data, multiply each observation by the sample standard deviation vector and add the mean vector:
>> ((B * COEFF) * COEFF') .* repmat(Astd,[n 1]) + repmat(Amean,[n 1])
ans =
269.8000 38.9000 50.5000
272.4000 39.5000 50.0000
270.0000 38.9000 50.5000
272.0000 39.3000 50.2000
269.8000 38.9000 50.5000
269.8000 38.9000 50.5000
268.2000 38.6000 50.2000
268.2000 38.6000 50.8000
267.0000 38.2000 51.1000
267.8000 38.4000 51.0000
273.6000 39.6000 50.0000
271.2000 39.1000 50.4000
269.8000 38.9000 50.5000
270.0000 38.9000 50.5000
270.0000 38.9000 50.5000
This completes the round trip from the original data to the principal components and back to the original data. In some applications, the principal components are modified before the return trip.
Let's consider what we've gained by making the trip to the principal component coordinate system. First, more variance has indeed been squeezed into the first principal component, which we can see by taking the sample variance of the principal components:
>> var(SCORE)
ans =
2.8125 0.1809 0.0066
The cumulative variance contained in the first so many principal components can be easily calculated thus:
>> cumsum(var(SCORE)) / sum(var(SCORE))
ans =
0.9375 0.9978 1.0000
Interestingly, in this case the first principal component contains nearly 94% of the variance of the original table. A lossy data compression scheme which discarded the second and third principal components would compress 3 variables into 1, while losing only about 6% of the variance.
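The lossy compression scheme just described can be sketched in base MATLAB as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: k (the number of components kept) and the names scoreK, Bapprox, Aapprox, retained and maxAbsError are chosen here and do not appear in the original post:

```matlab
% Data set from the article (15 observations, 3 variables)
A = [269.8 38.9 50.5; 272.4 39.5 50.0; 270.0 38.9 50.5; 272.0 39.3 50.2; 269.8 38.9 50.5; ...
     269.8 38.9 50.5; 268.2 38.6 50.2; 268.2 38.6 50.8; 267.0 38.2 51.1; 267.8 38.4 51.0; ...
     273.6 39.6 50.0; 271.2 39.1 50.4; 269.8 38.9 50.5; 270.0 38.9 50.5; 270.0 38.9 50.5];
n = size(A, 1);
Amean = mean(A);
Astd  = std(A);
B = (A - repmat(Amean, [n 1])) ./ repmat(Astd, [n 1]);

% Principal component coefficients, largest variance first
[V, D] = eig(cov(B));
[latent, order] = sort(diag(D), 'descend');
coeff = V(:, order);

k = 1;                                  % number of components to keep
scoreK  = B * coeff(:, 1:k);            % compressed data: n-by-k instead of n-by-3
Bapprox = scoreK * coeff(:, 1:k)';      % approximate standardized data
Aapprox = Bapprox .* repmat(Astd, [n 1]) + repmat(Amean, [n 1]);   % original units

retained    = sum(latent(1:k)) / sum(latent);   % fraction of variance kept
maxAbsError = max(abs(A(:) - Aapprox(:)));      % worst-case reconstruction error
```

Setting k equal to the number of variables reproduces the exact round trip shown earlier; smaller values of k trade reconstruction error for a smaller table.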
The other important thing to note about the principal components is that they are completely uncorrelated (as measured by the usual Pearson correlation), which we can test by calculating their correlation matrix:
>> corrcoef(SCORE)
ans =
 1.0000 -0.0000  0.0000
-0.0000  1.0000 -0.0000
 0.0000 -0.0000  1.0000
Discussion
PCA "squeezes" as much information (as measured by variance) as possible into the first principal components. In some cases the number of principal components needed to store the vast majority of variance is shockingly small: a tremendous feat of data manipulation. This transformation can be performed quickly on contemporary hardware and is invertible, permitting any number of useful applications.
For the most part, PCA really is as wonderful as it seems. There are a few caveats, however:
1. PCA doesn't always work well, in terms of compressing the variance. Sometimes variables just aren't related in a way which is easily exploited by PCA. This means that all, or nearly all, of the principal components will be needed to capture the multivariate variance in the data, making the use of PCA moot.
2. Variance may not be what we want condensed into a few variables. For example, if we are using PCA to reduce data for predictive model construction, then it is not necessarily the case that the first principal components yield a better model than the last principal components (though it often works out more or less that way).
3. PCA is built from components, such as the sample covariance, which are not statistically robust. This means that PCA may be thrown off by outliers and other data pathologies. How seriously this affects the result is specific to the data and application.
4. Though PCA can cram much of the variance in a data set into fewer variables, it still requires all of the variables to generate the principal components of future observations. Note that this is true regardless of how many principal components are retained for the application. PCA is not a subset selection procedure, and this may have important logistical implications.
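The last caveat can be made concrete with a short base-MATLAB sketch. The new observation xNew is hypothetical (its values are made up for illustration), as are the names k, zNew, scoreNew and nonzeroLoadings. The point is that even when only one principal component is retained, computing it for a future observation uses the stored mean, standard deviation and a coefficient for every original variable:

```matlab
% Data set from the article (15 observations, 3 variables)
A = [269.8 38.9 50.5; 272.4 39.5 50.0; 270.0 38.9 50.5; 272.0 39.3 50.2; 269.8 38.9 50.5; ...
     269.8 38.9 50.5; 268.2 38.6 50.2; 268.2 38.6 50.8; 267.0 38.2 51.1; 267.8 38.4 51.0; ...
     273.6 39.6 50.0; 271.2 39.1 50.4; 269.8 38.9 50.5; 270.0 38.9 50.5; 270.0 38.9 50.5];
n = size(A, 1);

% Quantities that must be stored from the original data
Amean = mean(A);
Astd  = std(A);
B = (A - repmat(Amean, [n 1])) ./ repmat(Astd, [n 1]);
[V, D] = eig(cov(B));
[latent, order] = sort(diag(D), 'descend');
coeff = V(:, order);

k = 1;                                  % components retained (illustrative)
xNew = [270.5 39.0 50.4];               % hypothetical future observation, all 3 variables
zNew = (xNew - Amean) ./ Astd;          % standardize with the stored mean and std
scoreNew = zNew * coeff(:, 1:k);        % 1-by-k principal component value(s)

% Every original variable carries a nonzero coefficient in the retained component
nonzeroLoadings = nnz(coeff(:, 1:k));
```

If measuring some of the original variables is expensive, a subset selection method addresses that problem; PCA on its own does not.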