Machine Learning Public Course Notes (8): K-means Clustering and PCA Dimensionality Reduction


K-means algorithm

Unsupervised learning attempts to discover the underlying structure of a set of unlabeled data. Typical applications include:

    • Market segmentation
    • Social network analysis
    • Organizing computer clusters
    • Astronomical data analysis

The K-means algorithm is an unsupervised learning algorithm. Its input is a training set $\{x^{(1)},x^{(2)},\ldots, x^{(m)}\}$ (where $x^{(i)}\in \mathbb{R}^{n}$) and the number of clusters $k$ (the data are divided into $k$ classes). Its output is the $k$ cluster centers $\mu_1, \mu_2, \ldots, \mu_k$ and the cluster assignment of each data point $x^{(i)}$.

K-means algorithm Steps
    1. Randomly initialize $k$ cluster centers (cluster centroids) $\mu_1, \mu_2, \ldots, \mu_k$
    2. Cluster assignment: for each data point $x^{(i)}$, find the cluster center closest to it and assign the point to that class; $c^{(i)}=\arg\min\limits_{k}\|x^{(i)}-\mu_k\|^2$, where $c^{(i)}$ denotes the class of $x^{(i)}$
    3. Move centroids: update each cluster center $\mu_k$ to the mean of all data points currently assigned to class $k$
    4. Repeat steps 2 and 3 until convergence or until the maximum number of iterations is reached
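
Below is a minimal NumPy sketch of these steps; the function name `kmeans`, the `seed` parameter, and the convergence check on centroid movement are my own illustrative choices, not part of the course material.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X (m x n) into k groups; return (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct training points as the initial centers.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment: c[i] = argmin_j ||x_i - mu_j||^2
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (m, k)
        c = dists.argmin(axis=1)
        # Move centroids: mu_j becomes the mean of the points assigned to cluster j
        # (an empty cluster keeps its previous center).
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):   # stop once the centers no longer move
            break
        mu = new_mu
    return mu, c
```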

Figure 1 K-means Algorithm Example

Optimization target of K-means algorithm

The cost function that K-means optimizes is $$J(c^{(1)},\ldots,c^{(m)},\mu_1,\ldots,\mu_k)=\frac{1}{m}\sum\limits_{i=1}^{m}\|x^{(i)}-\mu_{c^{(i)}}\|^2$$ where $\mu_{c^{(i)}}$ denotes the center of the cluster to which the $i$-th data point $x^{(i)}$ belongs. We want to find the parameters that minimize this function, i.e. $$\min\limits_{\substack{c^{(1)},\ldots,c^{(m)} \\ \mu_1,\ldots,\mu_k}} J(c^{(1)},\ldots,c^{(m)},\mu_1,\ldots,\mu_k)$$
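
Assuming the hypothetical `kmeans` sketch above, $J$ can be computed directly from its outputs; the helper name `distortion` is illustrative only.

```python
def distortion(X, mu, c):
    """J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2, the cost that K-means minimizes."""
    return np.mean(((X - mu[c]) ** 2).sum(axis=1))
```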

Issues to be aware of
    • Random initialization: the commonly used method is to randomly select $k$ ($k < m$) data points from the training set as the initial cluster centers $\mu_1, \mu_2, \ldots, \mu_k$
    • Local optima: the quality of the clustering depends on the choice of the initial cluster centers. To avoid getting stuck in a local optimum (Figure 2), the algorithm should be run multiple times (e.g., 50 times) with different random initializations, keeping the result with the smallest $J$ (a code sketch follows Figure 3)
    • Choosing the value of $k$: use the elbow method; plot $J$ as a function of $k$ and pick the $k$ at the turning point where the rate of decrease suddenly slows. If the curve has no obvious elbow, choose $k$ according to the downstream purpose for which K-means is being run


Figure 2 Global optimal solution and local optimal solutions of the K-means algorithm

Figure 3 Choosing the value of $k$ with the elbow method: a clear elbow (left) and no obvious elbow (right)
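
A sketch of both points, reusing the hypothetical `kmeans` and `distortion` helpers above: run K-means several times for each candidate $k$, keep the run with the smallest $J$, then plot $J$ against $k$ and look for the elbow. The helper name and the placeholder data are assumptions for illustration.

```python
def best_of_n_runs(X, k, n_runs=50):
    """Run K-means n_runs times from different random initializations; keep the lowest-J run."""
    best = None
    for seed in range(n_runs):
        mu, c = kmeans(X, k, seed=seed)
        J = distortion(X, mu, c)
        if best is None or J < best[0]:
            best = (J, mu, c)
    return best

# Elbow method: compute the best J for each candidate k, then plot J against k.
X = np.random.randn(300, 2)                       # placeholder data for illustration
Js = [best_of_n_runs(X, k)[0] for k in range(1, 9)]
```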

Motivation for the PCA dimensionality reduction algorithm

Data compression: compress high-dimensional data ($n$-dimensional) into low-dimensional data ($k$-dimensional)

Data visualization: compress data into 2 or 3 dimensions so that it can be visualized easily

Formalization of PCA problem

If we need to compress two-dimensional data points into one-dimensional data points, we need to find a direction such that the error when the data points are projected onto this direction is minimized (that is, the distance from each point to the line is smallest). More generally, if we need to compress $n$-dimensional data points to $k$ dimensions, we need to find $k$ directions $u^{(1)}, u^{(2)}, \ldots, u^{(k)}$ such that the error when projecting the data points onto the subspace spanned by these directions is minimized.


Figure 4 PCA example: compress 2-dimensional data points into 1-dimensional data points by finding a new direction $u_1$ such that the projection error (the perpendicular distance in the figure, e.g., from $x^{(i)}$ to ${\widetilde x}^{(i)}$) is minimized

Note the difference between PCA and linear regression: PCA minimizes the projection error (Figure 5, right, yellow lines), whereas linear regression minimizes the error along the $y$ direction (Figure 5, left, yellow lines).

Figure 5 Difference between the optimization targets of linear regression and PCA

PCA algorithm Steps

1. Data preprocessing. Mean normalization: $\mu_j = \frac{1}{m}\sum\limits_{i=1}^{m}x_j^{(i)}$, $x_j^{(i)} := x_j^{(i)}-\mu_j$; feature scaling (optional, needed when the ranges of different features differ too much): $x_j^{(i)} := \frac{x_j^{(i)}-\mu_j}{\sigma_j}$

2. Compute the covariance matrix $$\Sigma=\frac{1}{m}\sum\limits_{i=1}^{m}x^{(i)} (x^{(i)})^T \quad \text{or, in matrix form,} \quad \Sigma = \frac{1}{m}X^T X$$ where $X$ is the $m \times n$ matrix whose rows are the training examples

3. Compute the eigenvectors of the covariance matrix $\Sigma$ via singular value decomposition: [U, S, V] = svd(Sigma)

4. Select the first $k$ columns of the matrix $U$ as the $k$ principal directions, forming the matrix $U_{reduce}$ (of size $n \times k$)

5. For each original data point $x$ ($x \in \mathbb{R}^n$), its reduced-dimension representation $z$ ($z \in \mathbb{R}^k$) is $z = U_{reduce}^T\, x$
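
The five steps above map almost line for line onto NumPy. The following sketch (the function name `pca` and its return values are my own choices) assumes the rows of `X` are the training examples.

```python
import numpy as np

def pca(X, k):
    """Reduce the m x n data matrix X to k dimensions; return (Z, U_reduce, mu, S)."""
    m = X.shape[0]
    # 1. Mean normalization (scaling each feature by its std could be added when ranges differ widely).
    mu = X.mean(axis=0)
    X_norm = X - mu
    # 2. Covariance matrix Sigma = (1/m) X^T X, an n x n matrix.
    Sigma = (X_norm.T @ X_norm) / m
    # 3. Singular value decomposition of Sigma.
    U, S, Vt = np.linalg.svd(Sigma)
    # 4. The first k columns of U are the k principal directions.
    U_reduce = U[:, :k]
    # 5. Project every example: z = U_reduce^T x (done for all rows at once).
    Z = X_norm @ U_reduce
    return Z, U_reduce, mu, S
```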

Application of PCA

Reconstructing data: for a reduced $k$-dimensional data point $z$, the approximate point recovered in $n$ dimensions is $x_{approx} = U_{reduce}\, z \approx x$
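
Continuing the hypothetical `pca` sketch above, reconstruction is a single matrix product; note that the mean subtracted during normalization has to be added back.

```python
def reconstruct(Z, U_reduce, mu):
    """x_approx = U_reduce z for each row, plus the mean subtracted during normalization."""
    return Z @ U_reduce.T + mu
```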

Selecting the value of $k$

    • Average squared projection error: $\frac{1}{m}\sum\limits_{i=1}^{m}\|x^{(i)}-x^{(i)}_{approx}\|^2$
    • Total variation: $\frac{1}{m}\sum\limits_{i=1}^{m}\|x^{(i)}\|^2$
    • Select the smallest value of $k$ such that $\frac{\frac{1}{m}\sum\limits_{i=1}^{m}\|x^{(i)}-x^{(i)}_{approx}\|^2}{\frac{1}{m}\sum\limits_{i=1}^{m}\|x^{(i)}\|^2} \leq 0.01$ (or $0.05$). Equivalently, using the matrix $S$ from the SVD, select the smallest $k$ such that $1-\frac{\sum\limits_{i=1}^{k}S_{ii}}{\sum\limits_{i=1}^{n}S_{ii}}\leq 0.01$ (or $0.05$), i.e., 99% (or 95%) of the variance is retained (see the sketch below)
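
Using the singular values `S` returned by the hypothetical `pca` sketch above (NumPy returns them as a vector, so $S_{ii}$ is `S[i]`), the smallest $k$ retaining 99% of the variance can be found as follows; the helper name is illustrative.

```python
def choose_k(S, threshold=0.01):
    """Smallest k with 1 - (sum of the first k S_ii) / (sum of all S_ii) <= threshold."""
    retained = np.cumsum(S) / np.sum(S)          # fraction of variance retained for k = 1..n
    return int(np.argmax(1.0 - retained <= threshold)) + 1
```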
Recommendations for applying PCA
    • To speed up supervised learning: (1) for labeled data, remove the labels and run PCA on the inputs to reduce their dimension; (2) train the model on the reduced-dimension data; (3) for a new data point, apply the same PCA mapping to obtain its reduced representation and feed it to the model to obtain the prediction. Note: only the training set should be used to compute the PCA mapping $x^{(i)}\rightarrow z^{(i)}$; the resulting mapping (the principal-direction matrix $U_{reduce}$ selected by PCA) is then applied to the validation and test sets
    • Do not use PCA to prevent overfitting; use regularization instead.
    • Before using PCA, try training the model on the raw data first; only if that does not work should you consider PCA, rather than reaching for PCA by default.
Reference documents

[1] Andrew Ng, Coursera Machine Learning public course, Week 8

[2] Ramble on Clustering: k-means. http://blog.pluskid.org/?p=17

[3] K-means clustering in a GIF. http://www.statsblogs.com/2014/02/18/k-means-clustering-in-a-gif/

[4] Wikipedia: Principal component analysis. https://en.wikipedia.org/wiki/Principal_component_analysis

[5] Explained Visually: Principal Component Analysis. http://setosa.io/ev/principal-component-analysis/
