Long time no see! The Hulu Machine Learning Questions and Answers series is back with a new installment!
You can click "Machine Learning" in the menu bar to review all previous installments of this series. Feel free to leave a comment with your thoughts and ideas; you may see them quoted in the next article.
Today's theme is
"Dimensionality Reduction"
Introduction
The universe is the sum of time and space. Time is one-dimensional, while the number of spatial dimensions remains an open question. String theory says nine; Hawking endorsed M-theory, which says ten. These theories explain that the dimensions humans cannot perceive beyond the familiar three are curled up at tiny spatial scales. Of course, we bring this up not to promote the "Three-Body" book series, nor to guide readers toward the true meaning of the universe or doubts about the nature of existence, but to introduce today's machine learning topic: dimensionality reduction.
Dimensions in machine learning data play the same role as spatial dimensions in the real world. In machine learning, data usually needs to be represented as vectors before it can be fed into a model for training. However, it is well known that processing and analyzing high-dimensional vectors consumes a great deal of system resources and can even lead to the curse of dimensionality. For example, in computer vision (CV), extracting the raw pixel features of a 100x100 RGB image yields a 30,000-dimensional vector; in natural language processing (NLP), a document-word feature matrix can likewise produce feature vectors with tens of thousands of dimensions. Therefore, dimensionality reduction, that is, representing the original high-dimensional features with low-dimensional vectors, is extremely important. Imagine that the universe really is as M-theory describes and the position of every celestial body is given by a 10-dimensional coordinate; no ordinary person could picture that spatial structure. But once we project these bodies onto a two-dimensional plane, the whole universe becomes as intuitive as a photograph of the Milky Way.
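To make the arithmetic concrete, here is a minimal NumPy sketch (the image is random data, purely for illustration and not from the original article) that flattens a 100x100 RGB image into a single feature vector:

```python
import numpy as np

# A stand-in for a 100x100 RGB image (values in [0, 255]); a real image
# would come from an image library such as Pillow or OpenCV.
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

# Flattening the raw pixels gives a 100 * 100 * 3 = 30,000-dimensional vector.
feature_vector = image.reshape(-1).astype(np.float64)
print(feature_vector.shape)  # (30000,)
```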
Common dimensionality reduction methods include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Isometric Mapping (Isomap), Locally Linear Embedding (LLE), Laplacian Eigenmaps (LE), and Locality Preserving Projections (LPP). These methods can be categorized along several lines: linear/non-linear, supervised/unsupervised, global/local. PCA, the most classical of them, has a history of more than 100 years; it is a linear, unsupervised, global dimensionality reduction algorithm. Let's revisit this enduring century-old classic today.
"PCA"
Scenario Description
In machine learning, the features we extract from raw data are sometimes high-dimensional vectors, and the high-dimensional space these vectors live in contains a lot of redundancy and noise. We hope that dimensionality reduction can uncover the intrinsic characteristics of the data, thereby improving the expressive power of the features and reducing the complexity of training. PCA (Principal Component Analysis), the most classical dimensionality reduction method, is a frequently asked interview question.
Problem
Principle and objective function of PCA
How to solve PCA
Background knowledge: linear algebra
Solutions and Analysis
PCA (Principal Component Analysis) aims to find the principal components of the data and use them to characterize the original data, thereby achieving dimensionality reduction. Take a simple example: suppose a set of data points in three-dimensional space all lie on a plane passing through the origin. If we represent the data with the three axes x, y, z of the natural coordinate system, we need three dimensions, yet the points actually lie on a two-dimensional plane. If we rotate the coordinate system so that the plane of the data coincides with the x'-y' plane, then the two dimensions x' and y' represent the original data without any loss. The data has thus been reduced in dimension, and the information carried by the x' and y' axes is exactly the principal components we are looking for.
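A minimal sketch of this toy example, using NumPy and scikit-learn's PCA (the specific data and parameter choices are illustrative assumptions, not part of the original article):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Points on a plane through the origin: random combinations of two fixed
# 3D directions, so the data is intrinsically two-dimensional.
basis = np.array([[1.0, 2.0, 0.5],
                  [0.0, 1.0, 1.0]])
coeffs = rng.normal(size=(200, 2))
X = coeffs @ basis                      # shape (200, 3)

# Two principal components capture all of the variance.
pca = PCA(n_components=2)
Z = pca.fit_transform(X)                # shape (200, 2): the x', y' coordinates
X_reconstructed = pca.inverse_transform(Z)

print(pca.explained_variance_ratio_.sum())          # ~1.0
print(np.allclose(X, X_reconstructed, atol=1e-8))   # True: nothing was lost
```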
In high-dimensional space, however, we usually cannot visualize the data distribution as we just did, which makes it much harder to figure out which axes the principal components correspond to. Let's start with the simplest case, two-dimensional data, to see how PCA works.
The left figure shows a centered dataset in two-dimensional space, and it is easy to see the approximate direction of the axis on which the principal component lies (hereafter called the principal axis), namely the axis along the green line in the right figure. The data are spread out most along the green line, which also means that the variance of the data is largest in that direction. In signal processing it is commonly assumed that the signal has large variance and the noise has small variance; their ratio is called the signal-to-noise ratio (SNR), and a larger SNR means better data quality. From this it is not hard to state the objective of PCA: maximize the projection variance, that is, find the principal axis on which the projected data has the largest variance.
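Written out, the standard formulation of this objective (the usual textbook derivation, reconstructed here rather than quoted from the original figures) is: given centered samples x_1, ..., x_n and a unit projection direction w,

```latex
\max_{w}\ \frac{1}{n}\sum_{i=1}^{n}\left(w^{\top}x_i\right)^{2}
   = w^{\top}\Sigma\,w
\quad\text{s.t.}\quad w^{\top}w = 1,
\qquad \Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^{\top}.
```

Introducing a Lagrange multiplier λ and setting the gradient with respect to w to zero gives Σw = λw, so the candidate directions are eigenvectors of the covariance matrix Σ and the attained variance is the corresponding eigenvalue λ.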
Readers familiar with linear algebra will quickly recognize that the maximal projection variance is an eigenvalue of the covariance matrix of the data. To maximize the variance we take the largest eigenvalue, and the best projection direction is the eigenvector corresponding to that largest eigenvalue. The second-best projection direction lies in the orthogonal complement of the best one and is the eigenvector corresponding to the second-largest eigenvalue, and so on. At this point, we have the PCA solution: center the data, compute its covariance matrix, perform an eigendecomposition, take the eigenvectors corresponding to the d largest eigenvalues, and project the data onto them.
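A minimal NumPy sketch of these steps (written from the description above; the function and variable names are my own, not from the original article):

```python
import numpy as np

def pca(X, d):
    """Reduce X (n_samples x n_features) to d dimensions via eigendecomposition."""
    # 1. Center the data.
    X_centered = X - X.mean(axis=0)
    # 2. Sample covariance matrix (n_features x n_features).
    cov = X_centered.T @ X_centered / X.shape[0]
    # 3. Eigendecomposition; eigh is used because cov is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Keep eigenvectors of the d largest eigenvalues (eigh returns ascending order).
    top = eigenvectors[:, np.argsort(eigenvalues)[::-1][:d]]
    # 5. Project the centered data onto the principal axes.
    return X_centered @ top

# Example: reduce random 5-dimensional data to 2 dimensions.
X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)  # (100, 2)
```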
Summary and extension
At this point we have explained the principle, objective function, and solution of PCA from the angle of maximizing the projection variance. In fact, PCA can also be analyzed from other angles (for example, minimizing the regression/reconstruction error) to obtain a different objective function, but one ultimately finds that the corresponding principle and solution are equivalent to those in this article. In addition, PCA is a linear dimensionality reduction method; classical as it is, it has limitations. We can extend PCA with a kernel mapping to obtain KPCA, or apply manifold-based dimensionality reduction methods (such as Isomap, LLE, and Laplacian Eigenmaps) to perform non-linear dimensionality reduction on complex datasets where PCA works poorly. These methods will appear in subsequent posts, so stay tuned.
Next Topic Preview
"Unsupervised learning Algorithms and evaluation"
Scenario Description
In real life we often encounter a class of problems in which we feed a machine a large number of observations and expect it, through induction and learning, to discover common features or structure in the data, or some correlation among the data's features. For example, a video site may group users according to their viewing behavior and build different recommendation strategies based on the groups, or look for a relationship between playback smoothness and user churn, and so on. The observations in this kind of problem usually carry no label information; the intrinsic structure and patterns of the data must be discovered by the algorithm itself. Such learning algorithms are called unsupervised learning, and they mainly include two kinds of methods: data clustering and variable correlation analysis. Compared with supervised learning, unsupervised learning usually has no "correct answer," so the design of the model directly affects the final output and performance, and the optimal model parameters typically need to be found through multiple iterations.
Problem description
Taking clustering algorithms as an example: assuming no external label data is available, how do we compare the pros and cons of two unsupervised learning (clustering) algorithms?
Hulu Machine Learning Questions and Answers Series | Round 6: The PCA Algorithm