Machine Learning Dimensionality Reduction Algorithms, Part 1: PCA (Principal Component Analysis)

Introduction:

In the machine learning field, dimensionality reduction refers to using some mapping method to map data points from the original high-dimensional space into a low-dimensional space. The essence of dimensionality reduction is to learn a mapping function f: X -> Y, where X is the original representation of a data point (currently, vector representations are the most common) and Y is the low-dimensional vector representation of the data point after the mapping; usually the dimension of Y is smaller than that of X (although increasing the dimension is also possible). f may be explicit or implicit, linear or non-linear.
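As a tiny illustration of such a mapping (a sketch of ours, not from the original article; the projection matrix W below is hand-picked, whereas PCA, discussed next, chooses it in a principled way):

    import numpy as np

    # A linear mapping f(x) = W'x from 3-D down to 2-D. W is an arbitrary
    # example; PCA is one principled way to choose such a matrix.
    X = np.random.default_rng(0).normal(size=(100, 3))   # 100 points in 3-D
    W = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])                           # keep the first two coordinates
    Y = X @ W                                            # each row of Y is f(x) = W'x
    print(Y.shape)                                       # (100, 2)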

Of course, there is also a major category of methods that are essentially dimensionality reduction, called feature selection, which aims to select a subset of the original feature set to represent the data.

At present, most dimensionality reduction algorithms are designed to process data expressed as vectors.

The reduced-dimensional representation of the data is used for the following reasons:

(1) The original high-dimensional space contains redundant information and noise, which causes errors in practical applications such as image recognition and lowers accuracy. Through dimensionality reduction, we hope to reduce the errors caused by redundant information and improve the accuracy of recognition (or of other applications).

(2) You may want to use a dimensionality reduction algorithm to find the essential structural features inside the data.

(3) Dimensionality reduction can be used to accelerate subsequent computation.

(4) There are many other purposes, such as alleviating the sparsity of the data.

In many pipelines, a dimensionality reduction algorithm such as PCA becomes part of data preprocessing. In fact, some algorithms struggle to achieve good results without dimensionality reduction preprocessing.
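As an example of that preprocessing role, here is a minimal sketch using scikit-learn (an assumption of ours; the article names no library, and the choice of 16 components is arbitrary):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # PCA as a preprocessing step: compress 64 pixel features to 16
    # principal components before fitting a classifier.
    X, y = load_digits(return_X_y=True)
    model = make_pipeline(PCA(n_components=16),
                          LogisticRegression(max_iter=1000))
    print(cross_val_score(model, X, y, cv=5).mean())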


If you need to process data but do not need to fully retain its original attributes, then PCA may be an option.

 

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is the most common linear dimensionality reduction method. Its goal is to map high-dimensional data into a low-dimensional space through a linear projection, such that the variance of the data along the projected dimensions is maximized; in this way, a small number of dimensions retains as much of the character of the original data points as possible.
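A quick numerical check of the maximize-the-variance idea, on synthetic data of our own (not from the article):

    import numpy as np

    # Variance of a 2-D point cloud along two candidate projection
    # directions; PCA would pick the direction with the larger variance.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])  # stretched along the x-axis
    X = X - X.mean(axis=0)                                # center the data

    for w in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
        proj = X @ w                                      # 1-D projections w'x
        print(w, proj.var())  # the first direction shows far larger variance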

In general, if all points were mapped onto a single point, almost all information (such as the distances between points) would be lost; if instead the variance after the mapping is as large as possible, the data points remain scattered and more information is retained. It can be proved that PCA is the linear dimensionality reduction method that loses the least original information (in this sense it stays closest to the original data; note, however, that PCA makes no attempt to explore the internal structure of the data).

Let the n-dimensional vector w be the direction of one coordinate axis of the target subspace (called a projection vector). Maximizing the variance of the data after the projection gives:

    \max_{w}\ \frac{1}{m}\sum_{i=1}^{m}\bigl(w'(x_i-\bar{x})\bigr)^{2}

where m is the number of data instances, x_i is the vector representation of data instance i, and x̄ is the mean vector of all data instances. Defining W as the matrix whose column vectors are all the projection vectors, a linear-algebra transformation yields the following optimization objective function:

    \min_{W}\ -\operatorname{tr}(W'AW), \qquad \text{s.t.}\ W'W=I

The constraint W'W = I states that the features of the desired result should be orthogonal to each other, so that no redundant information exists between the dimensions.

Here tr denotes the trace of a matrix, and A is the data covariance matrix, A = \frac{1}{m}\sum_{i=1}^{m}(x_i-\bar{x})(x_i-\bar{x})'.
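For readers who want the intermediate step that the article glosses over, the algebra connecting the variance objective to the trace form runs as follows (our reconstruction):

    \frac{1}{m}\sum_{i=1}^{m}\bigl(w'(x_i-\bar{x})\bigr)^{2}
      = w'\Bigl(\frac{1}{m}\sum_{i=1}^{m}(x_i-\bar{x})(x_i-\bar{x})'\Bigr)w
      = w'Aw,
    \qquad
    \sum_{j=1}^{k} w_j' A\, w_j = \operatorname{tr}(W'AW).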

It is easy to show that the optimal W consists of the eigenvectors corresponding to the k largest eigenvalues of the data covariance matrix, arranged as column vectors. These eigenvectors form a set of orthogonal bases and retain the most information in the data.

The output of PCA is Y = W'X, which reduces X from its original dimension to k dimensions. So even if you do not follow the derivation, you can simply compute this result; just remember to subtract the mean of X first (centering).
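The whole procedure fits in a few lines. A minimal sketch, assuming NumPy (the function name pca and the random test data are ours):

    import numpy as np

    def pca(X, k):
        # Rows of X are data instances; k is the target dimension.
        x_bar = X.mean(axis=0)            # mean vector of all instances
        Xc = X - x_bar                    # center the data first
        A = Xc.T @ Xc / len(X)            # data covariance matrix A
        vals, vecs = np.linalg.eigh(A)    # eigenvalues in ascending order
        W = vecs[:, ::-1][:, :k]          # eigenvectors of the k largest eigenvalues
        return Xc @ W, W, x_bar           # Y = W'x for every centered instance

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 5))
    Y, W, x_bar = pca(X, k=2)
    print(Y.shape, np.allclose(W.T @ W, np.eye(2)))   # (200, 2) True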

Let's look at an example:

[Figure: a handwritten digit 3 reconstructed from its first k eigenvectors, for increasing k.]
When only one eigenvector is used, the basic contour of the 3 is already preserved; the more eigenvectors are used, the closer the reconstruction gets to the original data.
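In code, reconstruction from k components looks like this (a sketch with synthetic correlated data standing in for the digit images):

    import numpy as np

    # Reconstruction x_hat = W y + x_bar; the error shrinks as more
    # eigenvectors are used, mirroring the digit-3 example above.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))  # correlated features
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    _, vecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    vecs = vecs[:, ::-1]                                     # descending eigenvalue order

    for k in (1, 2, 4, 8):
        W = vecs[:, :k]
        X_hat = (Xc @ W) @ W.T + x_bar                       # project, then map back
        print(k, np.mean((X - X_hat) ** 2))                  # error decreases with k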

The goal of PCA is to retain as much of the data's internal information as possible after dimensionality reduction, measuring the importance of a direction by the variance of the data along it. However, such a projection says nothing about the distinguishability of the data, and may even project well-separated points on top of one another so that they can no longer be told apart. This is PCA's biggest problem, and it is why PCA often gives poor results when used for classification. Imagine two elongated clusters lying side by side: when projecting the data points into a one-dimensional space, PCA selects the long axis (axis 2, the direction of larger variance), making the two clusters indistinguishable, even though they would be separated well along the short axis (axis 1).
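This failure mode is easy to reproduce with synthetic data (our sketch):

    import numpy as np

    # Two clusters separated along axis 1 but with large spread along
    # axis 2: PCA keeps axis 2 and merges the clusters.
    rng = np.random.default_rng(4)
    a = np.column_stack([rng.normal(-1.0, 0.1, 200), rng.normal(0.0, 5.0, 200)])
    b = np.column_stack([rng.normal(+1.0, 0.1, 200), rng.normal(0.0, 5.0, 200)])
    data = np.vstack([a, b])
    X = data - data.mean(axis=0)

    _, vecs = np.linalg.eigh(X.T @ X / len(X))
    w = vecs[:, -1]                  # top principal direction (largest eigenvalue)
    print(w)                         # close to ±[0, 1], i.e. axis 2
    proj = X @ w
    print(proj[:200].mean(), proj[200:].mean())  # cluster means nearly coincide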

Discriminant analysis has a different goal from PCA: rather than retaining as much of the data as possible, it expects the data to be easy to distinguish after dimensionality reduction. LDA (Linear Discriminant Analysis) is another common linear dimensionality reduction method. In addition, some non-linear dimensionality reduction methods exploit the local properties of data points to separate the data better, such as LLE and Laplacian Eigenmaps. These will be introduced later.
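A side-by-side sketch of the two methods, again assuming scikit-learn and reusing the two clusters from the previous example:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # LDA uses the class labels and picks the separating direction,
    # whereas PCA picks the high-variance one.
    rng = np.random.default_rng(4)
    a = np.column_stack([rng.normal(-1.0, 0.1, 200), rng.normal(0.0, 5.0, 200)])
    b = np.column_stack([rng.normal(+1.0, 0.1, 200), rng.normal(0.0, 5.0, 200)])
    X = np.vstack([a, b])
    y = np.array([0] * 200 + [1] * 200)

    z_pca = PCA(n_components=1).fit_transform(X)
    z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
    print(abs(z_pca[:200].mean() - z_pca[200:].mean()))  # small: classes overlap
    print(abs(z_lda[:200].mean() - z_lda[200:].mean()))  # large: classes separate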
