Original: http://blog.csdn.net/abcjennifer/article/details/7914952

This column (machine learning) includes linear regression with one variable, linear regression with multiple variables, an Octave tutorial, logistic regression, regularization, neural networks, machine learning system design, SVMs (support vector machines), clustering, dimensionality reduction, anomaly detection, large-scale machine learning, and other chapters. Most of the content comes from Stanford's public machine learning class and related books. (https://class.coursera.org/ml/class/index)

**Ninth Lecture. Clustering**

**===============================**

**(i) What is unsupervised learning?**

(ii) The K-means clustering algorithm

(iii) The cost function of the clustering problem (distortion)

(iv) How to choose the class centers at initialization

(v) Choosing the number of clusters

=====================================

(i) What is unsupervised learning?

The previous chapters covered only supervised learning; in this chapter we discuss another machine learning approach: unsupervised learning. First, let's look at the difference between supervised and unsupervised learning.

Given a set of (input, target) data pairs z = (x, y).

Supervised learning: the most common tasks are regression & classification.

- Regression: y is a real vector. The regression problem fits a curve to the (x, y) pairs so that the cost function L below is minimized.

- Classification: y takes one of finitely many values, which can be seen as class labels. A classification problem needs labeled data to train the classifier first, so it is a supervised learning process. In classification, the cost function L(x, y) is the negative logarithm of the probability that x belongs to class y:

L(x, y) = -log f_y(x), where f_i(x) = p(y = i | x).
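This negative log-likelihood cost can be checked numerically. A minimal sketch, where the class-probability values are made up for illustration:

```python
import math

# Hypothetical class-probability estimates f_i(x) = p(y = i | x) for one sample x
probs = [0.1, 0.7, 0.2]  # classes 0, 1, 2

def nll_cost(probs, y):
    """Cost L(x, y) = -log f_y(x): small when the true class is deemed likely."""
    return -math.log(probs[y])

print(nll_cost(probs, 1))  # likely class -> small cost
print(nll_cost(probs, 2))  # unlikely class -> larger cost
```

The cost is near zero when the classifier assigns high probability to the true class and grows without bound as that probability approaches zero.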

Unsupervised learning: the purpose of unsupervised learning is to learn a function f that describes the distribution p(z) of the given data. It includes two kinds of tasks: density estimation & clustering.

- Density estimation: estimating the probability density of the data at any location.
- Clustering: aggregating the data z into several classes (as in K-means), or giving the probability that a sample belongs to each class. Because no labeled training data is needed to train the clusterer beforehand, it belongs to unsupervised learning.
- PCA and many deep learning algorithms also belong to unsupervised learning.

In short: unsupervised learning is machine learning without class labels.


=====================================

**(ii) The K-means clustering algorithm**

K-means is a clustering algorithm. First, let's see intuitively how the algorithm clusters. Given a set of data as shown in the figure, the clustering process of the K-means algorithm proceeds as follows.

The figure shows the K-means clustering process. Given a set of input data {x(1), x(2), ..., x(n)} and a pre-specified number of classes K, the algorithm is as follows:

Randomly initialize the K class centers u1, ..., uK, then iteratively update the assignments and the centers.

Here c(i) denotes the class whose center is nearest to the i-th data point, i.e., the class that point is judged to belong to; then each of the K class centers is updated to the mean of all data points assigned to that class.
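The two alternating steps described above can be sketched in Python (a toy implementation for illustration, not the course's Octave code; the sample data is made up):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(X, k, n_iter=100):
    """Plain K-means: random data points as initial centers, then
    alternate the assignment step and the center-update step."""
    centers = random.sample(X, k)
    for _ in range(n_iter):
        # assignment step: c(i) = index of the center nearest to x(i)
        labels = [min(range(k), key=lambda j: dist2(x, centers[j])) for x in X]
        # update step: each center u_j becomes the mean of the points in class j
        for j in range(k):
            members = [x for x, c in zip(X, labels) if c == j]
            if members:  # keep the old center if no point was assigned
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels

# two obvious blobs -> K-means should separate them
random.seed(0)
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
     (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers, labels = kmeans(X, 2)
```

On well-separated data like this, the two centers converge to the blob means regardless of which two points are drawn as the initialization.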

=====================================

(iii) The cost function of the clustering problem (distortion)

In supervised learning we discussed the cost function. The K-means algorithm has a similar cost function, which we sometimes call the distortion cost function.

J(c, u) is the function we want to minimize:

J(c, u) = (1/m) * Σᵢ ||x(i) − u_c(i)||²

That is, we minimize the total squared Euclidean distance between each data point and its cluster center.

Looking back at the K-means algorithm flow from the previous section: the first step fixes the class centers u and optimizes the assignments c;

the second step fixes c and optimizes the centers u.

Iterating these two steps drives down the cost function J.
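A minimal sketch of the distortion J, showing that the center-update step lowers it (the points, labels, and centers below are made-up toy values):

```python
def distortion(X, centers, labels):
    """J(c, u) = (1/m) * sum_i ||x(i) - u_c(i)||^2."""
    m = len(X)
    return sum(sum((xi - ui) ** 2 for xi, ui in zip(x, centers[c]))
               for x, c in zip(X, labels)) / m

# four points in two clusters; first centers sit at corner points,
# then they are moved to the cluster means (the K-means update step)
X = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]
before = distortion(X, [(0.0, 0.0), (4.0, 2.0)], labels)
after = distortion(X, [(0.0, 1.0), (4.0, 1.0)], labels)
print(before, after)  # prints 2.0 1.0 -- the mean-update step lowered J
```

The assignment step can only decrease J as well (each point switches only to a strictly nearer center), which is why the iteration converges.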


Note that in regression, gradient descent with too large a learning rate can make the cost function increase as the number of iterations grows. Clustering does not have this problem: each K-means step is guaranteed not to increase J, and no learning rate is involved.

=====================================

**(iv) How to choose the class centers at initialization**

In the K-means algorithm above, we mentioned choosing the class centers randomly, but sometimes this works poorly, as shown:

Fig.1. Original data

For such a set of data, if we are lucky, the initialization leads to the good clustering of Figure 2.

Fig.2. Lucky initialization

Fig.3. Unfortunate initialization

But if the initialization turns out like Figure 3, the result is bad: the final clustering's cost function will be relatively large. The proposed solution is to run many different random initializations (say 50 to 1000), cluster from each one, and finally keep the result with the smallest cost function J(c, u).
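The multiple-restart recipe above can be sketched as follows (a toy implementation; the helper names are my own):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def run_kmeans(X, k, n_iter=50):
    """One K-means run from one random initialization; returns (J, labels)."""
    centers = random.sample(X, k)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist2(x, centers[j])) for x in X]
        for j in range(k):
            members = [x for x, c in zip(X, labels) if c == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    J = sum(dist2(x, centers[c]) for x, c in zip(X, labels)) / len(X)
    return J, labels

def best_of_restarts(X, k, restarts=50):
    """Re-run K-means from fresh random initializations; keep the lowest-J run."""
    return min((run_kmeans(X, k) for _ in range(restarts)), key=lambda r: r[0])
```

Each restart only adds a constant factor to the running time, and the chance that every single initialization is unlucky shrinks rapidly with the number of restarts.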

=====================================

(v) Choosing the number of clusters

How do we choose the number of clusters? This is probably the most headache-inducing part of a clustering problem, e.g., choosing K in the K-means algorithm. This section addresses that issue.

One of the most famous methods is the elbow method: plot the cost function J against K, as follows:

If the plot looks like the figure above, we take the "elbow" position in the graph as the chosen value of K. If there is no obvious elbow point, as in the image on the right, the data distribution is probably like the one shown:

In that case we need to cluster according to our own needs. For example, for T-shirt sizes, we could cluster into three classes {L, M, S} or into five classes {XL, L, M, S, XS}; the choice depends on the specific situation.
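The elbow method can be sketched numerically: sweep K, record the best distortion J for each, and look for the sharp bend. A toy version on synthetic data with three true blobs (all names and data are illustrative):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_J(X, k, n_iter=50, restarts=10):
    """Lowest distortion J found over several random restarts for a given k."""
    best = float("inf")
    for _ in range(restarts):
        centers = random.sample(X, k)
        for _ in range(n_iter):
            labels = [min(range(k), key=lambda j: dist2(x, centers[j])) for x in X]
            for j in range(k):
                members = [x for x, c in zip(X, labels) if c == j]
                if members:
                    centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
        J = sum(dist2(x, centers[c]) for x, c in zip(X, labels)) / len(X)
        best = min(best, J)
    return best

# three well-separated blobs: J should drop sharply until K = 3, then level off
random.seed(1)
X = [(random.gauss(cx, 0.1), random.gauss(cy, 0.1))
     for cx, cy in [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)] for _ in range(10)]
for k in range(1, 7):
    print(k, round(kmeans_J(X, k), 3))
```

The elbow appears at K = 3 here: the drop from K = 2 to K = 3 is large, while further increases in K buy almost nothing.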


=====================================

Summary

This chapter described another major branch of machine learning: unsupervised learning. The clustering problem should now be quite familiar. The key points of this chapter are the elbow method for choosing the number of clusters and the random-restart method for initializing cluster centers; both are well worth applying in practice.
