Stanford Machine Learning Open Course Notes (11): Data Dimensionality Reduction


Open course address: https://class.coursera.org/ml-003/class/index

Instructor: Andrew Ng

1. Motivation 1: Data Compression

Data compression means reducing the dimensionality of high-dimensional data, which in turn reduces the amount of storage it requires. The reason for compressing is obvious: the data volume is too large. Let's look at the following example:

Here we reduce the dimensionality of two-dimensional points in the plane so that they all lie on a single straight line. In the end we only need to store each point's position along that line rather than its coordinates in the whole plane. Going from two dimensions to one clearly reduces the storage space, but the method has drawbacks as well as advantages: the data is no longer exact after dimensionality reduction, and since we can only record each point's position along the line, some of the original information is lost.

Besides reducing two-dimensional data to one dimension, we can also reduce three-dimensional data to two dimensions. Similarly to the practice above, we simply project the three-dimensional points onto a two-dimensional plane:
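As a concrete illustration of the idea (a minimal numpy sketch, not code from the course; the sample points and the direction u below are made up for the example), projecting 2-D points onto a line keeps just one number per point:

```python
import numpy as np

# Made-up 2-D points lying roughly along the line y = x.
points = np.array([[1.0, 1.1],
                   [2.0, 1.9],
                   [3.0, 3.2]])
u = np.array([1.0, 1.0]) / np.sqrt(2)   # unit vector along the line

z = points @ u                # 1-D compressed data: one coordinate per point
approx = np.outer(z, u)       # points snapped back onto the line

print(z)       # what we would store after compression
print(approx)  # approximate positions: some original information is lost
```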

2. Motivation 2: Data Visualization

Consider comparing countries. There are many factors on which to compare them, such as GDP and living-environment indices, as shown in the following table:


The data in the table is very detailed, but we cannot represent it in a single picture, so it is not intuitive. We therefore adopt the idea of data dimensionality reduction: first reduce the data to two dimensions, and then plot it:

However, you must decide what the two axes mean. For example, z1 might represent country size / total GDP, and z2 might represent per-person GDP. This choice varies from person to person and depends on what you want to analyze.
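For instance (a hypothetical sketch; the z values and country names are made up, and the 2-D reduction itself would come from the PCA method introduced below), the reduced data can be plotted directly:

```python
import matplotlib.pyplot as plt

# Hypothetical 2-D representations z of a few countries.
z = [(1.5, 0.4), (0.9, 1.2), (2.1, 0.7)]
names = ["Country A", "Country B", "Country C"]

xs, ys = zip(*z)
plt.scatter(xs, ys)
for name, x, y in zip(names, xs, ys):
    plt.annotate(name, (x, y))
plt.xlabel("z1 (e.g. overall economic size)")
plt.ylabel("z2 (e.g. per-person GDP)")
plt.show()
```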

3. Principal Component Analysis Problem Formulation

Principal Component Analysis (PCA) is a dimensionality-reduction method. By analyzing the characteristics of the data points, it reduces the data from n dimensions to k dimensions:


Note that principal component analysis is not linear regression. The difference is especially clear when reducing two-dimensional data, as shown in the figure (the blue line is the PCA projection surface): PCA measures the orthogonal distance between each sample point and the line (the segments are perpendicular to the line), while linear regression measures the vertical distance between each sample point and its predicted value (parallel to the y-axis):


The more standard explanation is: PCA looks for a surface such that, when the sample points are projected onto it, the variance of the projected points is maximized (y plays no special role; the goal is simply to find the surface that best represents the features), whereas linear regression uses the sample points to predict y.
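The distinction can be made concrete (a minimal sketch with made-up data, assuming for simplicity that the fitted line is y = x through the origin):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.2, 0.9, 2.1, 2.8])
P = np.column_stack([x, y])                 # the sample points
u = np.array([1.0, 1.0]) / np.sqrt(2)       # unit vector along the line y = x

proj = np.outer(P @ u, u)                   # orthogonal projection onto the line
orthogonal_dist = np.linalg.norm(P - proj, axis=1)  # what PCA minimizes
vertical_dist = np.abs(y - x)               # what linear regression minimizes (y_hat = x)

print(orthogonal_dist)
print(vertical_dist)
```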

4. Principal Component Analysis Algorithm

The principal component analysis algorithm first needs to pre-process the data, for example with the feature standardization (mean normalization and scaling) described earlier:
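A minimal sketch of this pre-processing step (the function name is mine, not from the course):

```python
import numpy as np

def standardize(X):
    """Mean-normalize and scale: replace each feature x_j by (x_j - mu_j) / sigma_j.

    X is an (m, n) matrix: m examples, n features.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu and sigma to apply to new data later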

The next step is to run the PCA computation that reduces the n-dimensional data to k dimensions. First compute the covariance matrix, and then use singular value decomposition (SVD) to obtain the U matrix. If you are not familiar with SVD, see:

http://zh.wikipedia.org/wiki/%E5%A5%87%E5%BC%82%E5%80%BC%E5%88%86%E8%A7%A3

If you understand SVD, then reducing from n dimensions to k dimensions simply means selecting the k most important directions out of the n. In MATLAB, the singular values produced by svd are sorted in descending order, so you only need to take the first k columns of U as the corresponding vectors. This gives the Ureduce matrix:


Then multiply the transpose of the Ureduce matrix by each of the original n-dimensional sample points to obtain the sample data represented by the k feature vectors: z = Ureduce' * x. The selected columns of U are the principal components, which is where principal component analysis gets its name.
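Putting the algorithm together (a minimal numpy sketch; the course uses MATLAB/Octave's svd, and numpy's np.linalg.svd plays the same role here):

```python
import numpy as np

def pca(X, k):
    """Reduce standardized data X (m examples x n features) to k dimensions."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m              # n x n covariance matrix
    U, S, _ = np.linalg.svd(Sigma)     # singular values in S come out in descending order
    U_reduce = U[:, :k]                # first k columns of U: the principal components
    Z = X @ U_reduce                   # row-wise z = Ureduce' * x
    return Z, U_reduce, S
```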

5. Reconstruction from Compressed Data

We already know how to compress data, but sometimes we need to restore it. The inverse operation can be used to recover the data:

Computing Xapprox = Ureduce * z yields the recovered data. Note that the original data cannot be recovered exactly; only an approximation is obtained.
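Continuing the sketch from section 4 (Z and U_reduce as returned by the pca function above):

```python
# Approximate reconstruction: per sample, x_approx = Ureduce * z.
X_approx = Z @ U_reduce.T   # only an approximation of the original X
```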

6. Choosing the Number of Principal Components

You must specify k. Because part of the information cannot be recovered after compression, we try to compress the data without losing too much of it; this is a trade-off:


After setting a threshold, you can determine k by computing the error rate. For example, setting the threshold to 0.01 means that 99% of the variance is retained. It can be shown that, if MATLAB is used for the computation, the error rate above can be expressed using the singular values in the matrix S returned by svd (see SVD):


When the threshold is 0.01:
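In code, this rule picks the smallest k that retains at least 99% of the variance (a sketch using the singular values S from the pca function above; the retained variance for a given k is sum(S[:k]) / sum(S)):

```python
import numpy as np

def choose_k(S, threshold=0.01):
    """Smallest k such that the retained variance is at least 1 - threshold."""
    retained = np.cumsum(S) / np.sum(S)          # retained variance for k = 1, 2, ..., n
    return int(np.argmax(retained >= 1.0 - threshold)) + 1   # >= 0.99 when threshold = 0.01
```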


7. Advice for Applying PCA

We already know that datasets are divided into three parts: the training set, the cross-validation set, and the test set. When applying PCA, however, you must run the dimensionality reduction on the training set only; the principal components selected there are then applied to the other sets:
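A sketch of this advice using the standardize and pca functions from the earlier sketches (X_train, X_cv, and X_test are assumed to be your data splits, and k your chosen dimension):

```python
# Learn the mapping x -> z on the training set only...
X_train_std, mu, sigma = standardize(X_train)
Z_train, U_reduce, S = pca(X_train_std, k)

# ...then reuse the same mu, sigma, and U_reduce on the other sets.
Z_cv   = ((X_cv   - mu) / sigma) @ U_reduce
Z_test = ((X_test - mu) / sigma) @ U_reduce
```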


Benefits of PCA:


Note that it is wrong to use PCA for dimensionality reduction as a way to eliminate overfitting; regularization parameters should be used to prevent overfitting instead:


Finally, before applying PCA, you should make sure that you have tried the raw data first, and use PCA only when that attempt fails to meet your expectations.

-----------------------------------------Weak split line------------------------------------------------

This article focused on data dimensionality reduction, for which the most common method is PCA. In practice, however, we can sometimes see at a glance which features are redundant and simply delete them; PCA does not work that way, so do not trust PCA blindly, and try multiple methods. In my opinion, SVD was barely touched on here, when in fact SVD should be the focus. If you studied matrix analysis before but still do not understand SVD, now is the time to work out why it matters.
