PCA (Principal Component Analysis) is a commonly used method for analyzing data. Through a linear transformation, PCA converts the original data into a set of linearly independent representations along each dimension, which can be used to extract the main feature components of the data; it is often used for dimensionality reduction of high-dimensional data. There are many articles about PCA online, but most of them only describe the PCA procedure, not the principles behind it. The purpose of this article is to introduce the basic mathematical principles of PCA and to help readers understand how PCA works.

Of course, I do not intend to write this as a purely mathematical article; instead I want to describe the principle of PCA in an intuitive, accessible way, so the article will not contain rigorous mathematical derivations. I hope readers will have a better understanding of how PCA works after reading it. **Vector representation and dimensionality reduction of data**

In data mining and machine learning, data is generally represented as vectors. For example, a Taobao store's traffic and transactions over the whole of 2012 can be seen as a set of records, with one record per day, in the following format:

(date, page views, number of visitors, number of orders, number of deals, transaction amount)

Here "date" is a record label, not a metric, and data mining is mostly concerned with metrics. So if we omit the date field, we get a set of records in which each record can be represented as a five-dimensional vector, one of which looks something like this:

(500, 240, 25, 13, 2312.15)^\mathsf{T}

Note that I used the transpose here, because I customarily use column vectors to represent records (the reason will become clear later), and this article follows that convention. Sometimes I will omit the transpose symbol for convenience, but when we speak of vectors, we mean column vectors by default.

We could certainly analyze and mine this set of five-dimensional vectors as-is, but we know that the complexity of many machine learning algorithms is closely related to the dimensionality of the data, sometimes even exponentially so. Five dimensions may not matter here, but in real machine learning it is not uncommon to deal with thousands or even hundreds of thousands of dimensions, in which case the resource consumption of machine learning is unacceptable, so we have to reduce the dimensionality of the data.

Dimensionality reduction, of course, means losing information. But given the correlations present in real data, we can find ways to reduce dimensionality while minimizing the loss of information.

For example, suppose a student dataset has two columns M and F, where the M column has value 1 if the student is male and 0 if female, and the F column has value 1 if the student is female and 0 if male. If we scan all the records, we will find that for any record, when M is 1, F must be 0, and when M is 0, F must be 1. In this case, removing either M or F actually loses no information at all, because as long as one column is kept, the other can be completely reconstructed from it.

Of course, the above is an extreme situation that may not occur in reality, but similar situations are very common. Take the Taobao store data above: from experience we know that "page views" and "number of visitors" often have a strong correlation, as do "number of orders" and "number of deals". Here we use the word "correlation" informally; it can be understood intuitively as "when the store's page views are higher (or lower) on a given day, we should largely expect the number of visitors that day to also be higher (or lower)". In later sections we will give a rigorous mathematical definition of correlation.

This situation suggests that if we delete one of the page-views or visitors metrics, we should expect not to lose too much information. So we can remove one of them to reduce the complexity of the machine learning algorithm.

The above is an informal description of the idea of dimensionality reduction, which helps in understanding its motivation and feasibility, but it offers no operational guidance. For example: exactly which column should we delete to lose the least information? Or, rather than simply deleting a few columns, should we transform the raw data into fewer columns via some transformation, while minimizing the loss of information? How do we measure how much information is lost? How do we determine the concrete dimensionality reduction steps from the original data?

To answer these questions, we need a mathematical, formal treatment of the dimensionality reduction problem. PCA is a dimensionality reduction method with a rigorous mathematical basis that has been widely adopted. I am not going to describe PCA directly; instead, by analyzing the problem step by step, let us re-invent PCA ourselves. **Representation of vectors and basis transformations**

Since the data we face is abstracted as a set of vectors, it is worth studying some mathematical properties of vectors. These properties will be the theoretical basis for the subsequent derivation of PCA. **Inner product and projection**

Let's first look at a vector operation from high school: the inner product. The inner product of two vectors with the same number of dimensions is defined as:

(a_1,a_2,\cdots,a_n)^\mathsf{T}\cdot (b_1,b_2,\cdots,b_n)^\mathsf{T}=a_1b_1+a_2b_2+\cdots+a_nb_n

The inner product maps two vectors to a real number. Its computation is very easy to understand, but its meaning is not obvious. Below we analyze the geometric meaning of the inner product. Suppose A and B are two n-dimensional vectors. We know that an n-dimensional vector can be equivalently represented as a directed segment emanating from the origin in n-dimensional space. For simplicity, assume A and B are both two-dimensional vectors, so A=(x_1,y_1), B=(x_2,y_2). On the two-dimensional plane, A and B can be represented by two directed segments from the origin, as in the following figure:

Now let's drop a perpendicular from point A to the line containing B. The foot of the perpendicular is called the projection of A on B. Let the angle between A and B be a; then the vector length of the projection is |A|\cos(a), where |A|=\sqrt{x_1^2+y_1^2} is the modulus of vector A, i.e. the scalar length of the segment A.

Note that here we deliberately distinguish vector length from scalar length. Scalar length is always greater than or equal to 0, and its value is the length of the segment. Vector length may be negative: its absolute value is the segment length, and its sign depends on whether its direction is the same as or opposite to the reference direction.

So far it is hard to see what the inner product has to do with this. However, if we write the inner product in another form we are familiar with:

A\cdot B=|A||B|\cos(a)

Now things look more interesting: the inner product of A and B equals the projection length of A onto B multiplied by the modulus of B. Furthermore, if we assume the modulus of B is 1, i.e. |B|=1, this becomes:

A\cdot B=|A|\cos(a)

That is, **if the modulus of vector B is 1, then the inner product of A and B equals the vector length of the projection of A onto the line on which B lies**. This is a geometric interpretation of the inner product and the first important conclusion we obtain. This conclusion will be used over and over again in the subsequent derivation. **Basis**
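This conclusion is easy to check numerically. Here is a small NumPy sketch (the vectors are arbitrary illustrative values, not from the text):

```python
import numpy as np

# Two arbitrary 2-D vectors for illustration.
A = np.array([3.0, 2.0])
B = np.array([1.0, 1.0])

# Normalize B so that its modulus is 1.
B_unit = B / np.linalg.norm(B)

# Inner product of A with the unit vector B_unit.
inner = A @ B_unit

# |A| * cos(a): the signed length of A's projection onto B.
cos_a = (A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
projection = np.linalg.norm(A) * cos_a

print(np.isclose(inner, projection))  # True
```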

Let us continue discussing vectors in two-dimensional space. As mentioned above, a two-dimensional vector corresponds to a directed segment starting from the origin in a two-dimensional Cartesian coordinate system. For example, the following vector:

In algebraic notation, we often represent a vector by the point coordinates of the segment's endpoint; for example, the vector above can be represented as (3,2). This is the vector representation we are most familiar with.

However, we often overlook that **(3,2) by itself cannot accurately represent a vector**. Look closely: the 3 here actually means that the projection of the vector on the x-axis is 3, and its projection on the y-axis is 2. In other words, we have implicitly introduced a standard: the vectors of length 1 in the positive directions of the x-axis and y-axis. So the vector (3,2) actually means a projection of 3 on the x-axis and a projection of 2 on the y-axis. Note that projections are vector quantities, so they can be negative.

More formally, the vector (x, y) actually represents a linear combination:

x(1,0)^\mathsf{T}+y(0,1)^\mathsf{T}

It is not difficult to prove that all two-dimensional vectors can be represented as such linear combinations. Here (1,0) and (0,1) are called a basis of the two-dimensional space.

Therefore, **to describe a vector accurately, we must first determine a basis, and then give the projection values onto each basis vector; that is enough**. But we often omit the first step and default to the basis (1,0) and (0,1).

Choosing (1,0) and (0,1) as the basis is certainly convenient, because they are the unit vectors along the x-axis and y-axis respectively, which makes point coordinates correspond one-to-one with vectors on the two-dimensional plane. But in fact any two linearly independent two-dimensional vectors can form a basis; in a two-dimensional plane, "linearly independent" can be visualized as two vectors that do not lie on one straight line.

For example, (1,1) and (-1,1) can also form a basis. Generally we want the moduli of the basis vectors to be 1 because, as the meaning of the inner product shows, if the basis vectors have modulus 1, we can directly take the dot product of a vector with each basis vector to obtain its coordinates on the new basis. In fact, for any vector we can always find a vector of modulus 1 in the same direction, simply by dividing both components by the modulus. The basis above thus becomes (\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}) and (-\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}).

Now, if we want the coordinates of (3,2) on the new basis, that is, its projection vector values in the two directions, then by the geometric meaning of the inner product we just compute the inner product of (3,2) with each of the two basis vectors separately. It is not hard to get the new coordinates (\frac{5}{\sqrt{2}},-\frac{1}{\sqrt{2}}). The following diagram shows the new basis and the coordinate values of (3,2) on it:
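As a quick check of this computation, here is a short NumPy sketch using the point and basis from the example:

```python
import numpy as np

# The point (3, 2) and the two unit basis vectors from the example.
v = np.array([3.0, 2.0])
e1 = np.array([1.0, 1.0]) / np.sqrt(2)
e2 = np.array([-1.0, 1.0]) / np.sqrt(2)

# By the geometric meaning of the inner product, the coordinate on each
# basis vector is just the dot product with that basis vector.
new_coords = np.array([v @ e1, v @ e2])
print(new_coords)  # (5/sqrt(2), -1/sqrt(2)), i.e. about (3.5355, -0.7071)
```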

It is also worth noting that the basis vectors in our examples are orthogonal (i.e., their inner product is 0, or intuitively, they are perpendicular to each other), but the only requirement for a set of vectors to form a basis is linear independence; non-orthogonal bases are also possible. However, because orthogonal bases have better properties, the bases used in practice are generally orthogonal. **Matrix representation of basis transformations**

Let's look at a simple way to represent basis transformations. Take the example above: to transform (3,2) to coordinates on the new basis, we take the inner product of (3,2) with the first basis vector as the first new coordinate component, and the inner product of (3,2) with the second basis vector as the second new coordinate component. In fact, we can express this transformation concisely as a matrix multiplication:

\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 5/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}

Pretty neat! The two rows of the matrix are exactly the two basis vectors; multiplying by the original vector gives precisely its coordinates on the new basis. We can generalize slightly: if we have m two-dimensional vectors, we just arrange them column by column into a two-row, m-column matrix, and then multiply the "basis matrix" by this matrix to obtain the values of all these vectors under the new basis. For example, to transform the points (1,1), (2,2), (3,3) to the basis above, we can write:

\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \end{pmatrix} = \begin{pmatrix} 2/\sqrt{2} & 4/\sqrt{2} & 6/\sqrt{2} \\ 0 & 0 & 0 \end{pmatrix}

A basis transformation of a set of vectors is thus cleanly represented as a matrix multiplication.
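The same computation in code: one matrix product transforms all the points at once (a NumPy sketch of the example above):

```python
import numpy as np

# Rows of P are the unit basis vectors; columns of X are the data points.
P = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2)
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0]])

# Each column of Y is the corresponding point expressed on the new basis.
Y = P @ X
print(Y)  # first row 2/sqrt(2), 4/sqrt(2), 6/sqrt(2); second row all zeros
```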

**In general, if we have M n-dimensional vectors and want to transform them into a new space represented by R n-dimensional basis vectors, we first arrange the R basis vectors by rows into a matrix A, then arrange the vectors by columns into a matrix B; the product AB is the transformation result, where the m-th column of AB is the transformed result of the m-th column of B**.

The mathematical expression is:

\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_R \end{pmatrix} \begin{pmatrix} a_1 & a_2 & \cdots & a_M \end{pmatrix} = \begin{pmatrix} p_1a_1 & p_1a_2 & \cdots & p_1a_M \\ p_2a_1 & p_2a_2 & \cdots & p_2a_M \\ \vdots & \vdots & \ddots & \vdots \\ p_Ra_1 & p_Ra_2 & \cdots & p_Ra_M \end{pmatrix}

Here each p_i is a row vector representing the i-th basis vector, and each a_j is a column vector representing the j-th original data record.

It is particularly important to note that R can be less than n, and R determines the dimensionality of the transformed data. That is, we can transform n-dimensional data into a lower-dimensional space; the transformed dimensionality depends on the number of basis vectors. Hence this matrix multiplication can also represent a dimensionality-reducing transformation.

Finally, the above analysis also gives a physical interpretation of matrix multiplication: **the meaning of multiplying two matrices is to transform each column vector of the right matrix into the space represented by the row vectors of the left matrix**. More abstractly, a matrix can represent a linear transformation. Many students find the rule for multiplying matrices strange when learning linear algebra, but with this physical interpretation of matrix multiplication, its rationale becomes obvious. **Covariance matrices and optimization targets**

Above we discussed how choosing different bases gives different representations of the same set of data, and that if the number of basis vectors is less than the dimensionality of the vectors themselves, we achieve dimensionality reduction. But we have not answered the most critical question: how do we choose an optimal basis? In other words, if we have a set of n-dimensional vectors that we want to reduce to K dimensions (K less than n), how should we choose K basis vectors so as to retain the original information to the maximum extent?

A fully mathematical treatment of this problem is quite involved; here we approach it in an informal, intuitive way.

To avoid an overly abstract discussion, let us again use a concrete example. Suppose our data consists of five records, represented in matrix form:

\begin{pmatrix} 1 & 1 & 2 & 4 & 2 \\ 1 & 3 & 3 & 4 & 4 \end{pmatrix}

Each column is a data record, and each row is a field. For the convenience of subsequent processing, we first subtract the field mean from all values in each field, so that each field ends up with mean 0 (you will see the reason and the benefit of this later).

Looking at the data above, the mean of the first field is 2 and that of the second field is 3, so after the transformation:

\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}
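This centering step can be sketched in NumPy with the example data:

```python
import numpy as np

# The five records from the text: each column is a record, each row a field.
X = np.array([[1, 1, 2, 4, 2],
              [1, 3, 3, 4, 4]], dtype=float)

# Subtract each row's (field's) mean so that every field has mean 0.
X_centered = X - X.mean(axis=1, keepdims=True)
print(X_centered)  # matches the centered matrix in the text
```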

Let's look at what the five records look like in the plane Cartesian coordinate system:

Now the question is: if we must use one dimension to represent this data, and want to keep as much of the original information as possible, how do we choose?

From the discussion of basis transformations in the previous section, we know the problem is actually to select a direction in the two-dimensional plane, project all the data onto the line in that direction, and use the projected values to represent the original data. This is a concrete problem of reducing two dimensions to one.

So how do we choose this direction (or basis) so as to retain as much of the original information as possible? An intuitive view is that we want the projected values to be as spread out as possible.

Taking the figure above as an example: if we project onto the x-axis, the two leftmost points overlap, and the middle two points also overlap, so four distinct two-dimensional points leave only two distinct projected values. This is a serious loss of information. Similarly, projecting onto the y-axis makes the top two points, as well as the two points on the x-axis, overlap. So the x-axis and y-axis are evidently not the best projection choices. Intuitively, if we project onto an oblique line through the first and third quadrants, all five points can still be distinguished after projection.

Below, we express this problem mathematically. **Variance**

As we said above, we want the projected values to be as spread out as possible, and this degree of scatter can be expressed mathematically by variance. Here, the variance of a field can be regarded as the mean of the squared differences between each element and the field mean, namely:

\mathrm{Var}(a)=\frac{1}{m}\sum_{i=1}^m{(a_i-\mu)^2}

Since we have already converted the mean of each field to 0, the variance can be expressed directly by the sum of the squares of each element divided by the number of elements:

\mathrm{Var}(a)=\frac{1}{m}\sum_{i=1}^m{a_i^2}
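A quick numerical sketch, using the first (already centered) field of the example data; note the article divides by m, the population form, rather than m-1:

```python
import numpy as np

# A zero-mean field: the first row of the centered example data.
a = np.array([-1.0, -1.0, 0.0, 2.0, 0.0])
m = len(a)

# With the mean already 0, the variance is just the mean of the squares.
var_a = (a ** 2).sum() / m
print(var_a)  # 1.2
```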

So the problem above is formalized as: find a one-dimensional basis such that, when all the data is transformed to coordinates on this basis, the variance is maximized. **Covariance**

For the two-dimensional-to-one-dimensional problem above, finding the direction that maximizes the variance suffices. But for higher dimensions there is one more issue to address. Consider reducing three dimensions to two. As before, we first find one direction that maximizes the variance of the projections; this determines the first direction. Then we choose the second projection direction.

If we still simply chose the direction with the largest variance, it is clear that this direction would "almost coincide" with the first one, and such a dimension is obviously useless. So there should be additional constraints. Intuitively, to let the two fields represent as much of the original information as possible, we do not want a (linear) correlation between them, because correlation means the two fields are not completely independent, and there must be duplicated information.

Mathematically, the covariance of two fields can be used to express their correlation. Since each field already has mean 0, we have:

\mathrm{Cov}(a,b)=\frac{1}{m}\sum_{i=1}^m{a_ib_i}

As you can see, when the field means are 0, the covariance of two fields is concisely expressed as their inner product divided by the number of elements m.
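For the two zero-mean fields of the centered example data, this computation looks like:

```python
import numpy as np

# The two zero-mean fields of the centered example data.
a = np.array([-1.0, -1.0, 0.0, 2.0, 0.0])
b = np.array([-2.0, 0.0, 0.0, 1.0, 1.0])
m = len(a)

# With zero means, covariance reduces to the inner product divided by m.
cov_ab = (a * b).sum() / m
print(cov_ab)  # 0.8
```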

When the covariance is 0, the two fields are completely uncorrelated. To make the covariance 0, we can only choose the second basis vector in a direction orthogonal to the first. Therefore, the two directions finally chosen must be orthogonal.

At this point, we have obtained the optimization goal of dimensionality reduction: **reduce a set of n-dimensional vectors to K dimensions (K greater than 0 and less than n) by selecting K unit (modulus-1) orthogonal basis vectors such that, after the original data is transformed to this basis, the pairwise covariances between fields are 0, and the variances of the fields are as large as possible (under the orthogonality constraint, take the K largest variances)**. **Covariance matrix**

We have derived the optimization goal above, but this goal is not yet a direct operational guide (or algorithm), because it only says what we want, not how to do it. So we continue to study the computational solution mathematically.

We see that the goal we want to achieve is closely related to the within-field variances and the between-field covariances. We would therefore like to unify the two, and we observe that both can be expressed as inner products, and inner products are closely related to matrix multiplication. This leads us to an idea:

Suppose we have only two fields, a and b. We arrange them by rows into a matrix X:

X=\begin{pmatrix} a_1 & a_2 & \cdots & a_m \\ b_1 & b_2 & \cdots & b_m \end{pmatrix}

Then we multiply X by its transpose and scale by the coefficient 1/m:

\frac{1}{m}XX^\mathsf{T}=\begin{pmatrix} \frac{1}{m}\sum_{i=1}^m{a_i^2} & \frac{1}{m}\sum_{i=1}^m{a_ib_i} \\ \frac{1}{m}\sum_{i=1}^m{a_ib_i} & \frac{1}{m}\sum_{i=1}^m{b_i^2} \end{pmatrix}

A miracle appears: the two diagonal elements of this matrix are the variances of the two fields, while the off-diagonal elements are the covariance of a and b. The two are unified in one matrix.

According to the algorithm of matrix multiplication, this conclusion can easily be generalized to the general situation:

**Suppose we have m n-dimensional data records, arranged by columns into an n-by-m matrix X. Let C=\frac{1}{m}XX^\mathsf{T}. Then C is a symmetric matrix whose diagonal elements are the variances of the respective fields, and whose element in row i, column j (equal to the element in row j, column i) is the covariance of fields i and j**. **Diagonalization of the covariance matrix**
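This conclusion can be verified on the centered example data with a one-line matrix product:

```python
import numpy as np

# Centered data: each row is a field with mean 0, each column a record.
X = np.array([[-1, -1, 0, 2, 0],
              [-2, 0, 0, 1, 1]], dtype=float)
m = X.shape[1]

# The covariance matrix: variances on the diagonal, covariances elsewhere.
C = X @ X.T / m
print(C)  # variances 1.2 on the diagonal, covariance 0.8 off the diagonal
```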

From the above, we find that achieving our optimization goal is equivalent to diagonalizing the covariance matrix: that is, making all elements off the diagonal 0, and arranging the elements on the diagonal from largest to smallest from top to bottom. If this is not yet clear, let us look further at the relationship between the covariance matrix of the original data and that of the data after a basis transformation:

Let the covariance matrix of the original data matrix X be C, and let P be the matrix formed by a set of basis vectors arranged by rows. Let Y=PX; then Y is the data obtained by transforming X to the basis P. Denote the covariance matrix of Y by D. We derive the relationship between D and C:

\begin{array}{lll} D & = & \frac{1}{m}YY^\mathsf{T} \\ & = & \frac{1}{m}(PX)(PX)^\mathsf{T} \\ & = & \frac{1}{m}PXX^\mathsf{T}P^\mathsf{T} \\ & = & P(\frac{1}{m}XX^\mathsf{T})P^\mathsf{T} \\ & = & PCP^\mathsf{T} \end{array}
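The identity can be checked numerically on the centered example data. The particular P used here is the unit basis from earlier in the article (an illustrative choice, not yet the result of any algorithm):

```python
import numpy as np

# Centered example data and an orthonormal basis matrix P (rows are bases).
X = np.array([[-1, -1, 0, 2, 0],
              [-2, 0, 0, 1, 1]], dtype=float)
m = X.shape[1]
P = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2)

C = X @ X.T / m   # covariance matrix of the original data
Y = P @ X         # data expressed on the new basis
D = Y @ Y.T / m   # covariance matrix of the transformed data

# D computed directly equals P C P^T, as derived above.
print(np.allclose(D, P @ C @ P.T))  # True
```

Incidentally, for this particular P the off-diagonal entries of D come out 0 (D is diag(2, 0.4)), which previews the diagonalization goal.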