Open Course address: https://class.coursera.org/ml-003/class/index
INSTRUCTOR: Andrew Ng
1. Introduction to unsupervised learning
Earlier we covered one of the two main branches of machine learning, supervised learning. Now we turn to the other branch: unsupervised learning. What is unsupervised learning, and how does it differ from supervised learning? In supervised learning, the labels of the training samples are known: in regression we fit a curve to the labeled sample values, and in classification we know which sample points are positive and which are negative. Unsupervised learning, by contrast, works on unlabeled samples: we look for hidden structure in the data without knowing anything about the samples in advance. Since there is no label to serve as an evaluation standard, unsupervised learning is generally harder than supervised learning. For more details, see:
http://en.wikipedia.org/wiki/Unsupervised_learning
http://en.wikipedia.org/wiki/Supervised_learning
Like supervised learning, unsupervised learning is widely used in daily life. The most common example is grouping people on the Internet: anyone who has used Sina Weibo or Twitter has seen recommendations of people to follow on the page. These friends are not recommended blindly; users can be grouped into clusters with interests or work experience similar to yours, and recommendations drawn from your cluster. For data mining in particular, I personally think unsupervised learning is even more useful than supervised learning. Some examples follow:
2. K-means algorithm
Clustering is the most representative task in unsupervised learning, and among clustering methods K-means is the most common algorithm. Its inputs are the value K (the desired number of clusters) and the sample points. First, an intuitive illustration of the K-means process:
From the preceding clustering process we can see that with K = 2, all the sample points are grouped into two clusters. The algorithm is as follows:
First, randomly initialize the K cluster centers u1, ..., uK, then iteratively update them. Here c(i) denotes the index of the cluster center closest to the i-th sample, i.e. the cluster that sample is assigned to; then each of the K centers is updated to the mean of all samples assigned to that cluster.
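The two alternating steps above can be sketched in plain Python. This is a minimal illustration, not code from the lecture; the function names and toy data are mine:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance ||a - b||^2 between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: alternate cluster assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # random initial centers u1..uK
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: c(i) = index of the center closest to sample i
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its assigned samples
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:                       # keep the old center if a cluster empties
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, labels
```

On two well-separated groups of points, `kmeans(points, 2)` converges to one center per group within a few iterations.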
Generally, K-means works best on samples that are easily separated in the plane, but it can also be applied in special cases such as the following:
3. Optimization objective
Even in unsupervised learning we have a cost function. For the K-means algorithm, a good clustering naturally groups similar sample points together. With c and u defined as above, the cost function to minimize is the mean squared Euclidean distance between each sample and its cluster center: J(c, u) = (1/m) * sum over i of ||x(i) - u_c(i)||^2.
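This cost (often called the distortion) is straightforward to compute; a small sketch, with hypothetical names of my own:

```python
def dist2(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distortion(points, centers, labels):
    """K-means cost J(c, u): mean squared distance from each sample
    to the center of the cluster it is assigned to."""
    m = len(points)
    return sum(dist2(p, centers[c]) for p, c in zip(points, labels)) / m
```

For example, two points at (0, 0) and (2, 0) both assigned to a single center at (1, 0) give a cost of (1 + 1) / 2 = 1.0.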
The K-means algorithm process is given again:
In the first blue box, the centers u are held fixed while the assignments c are adjusted; the second blue box is the process of adjusting u. Alternating these two steps drives the cost toward a minimum.
4. Random initialization
In the K-means algorithm above, we randomly select K points as the initial centers. However, this choice sometimes works poorly, as shown below:
As you can see, all the runs start from three initial points, but when the initial positions are poor, the clustering result is poor too. For example, the second graph on the right merges the two lower groups into one cluster while splitting the top group, which clearly contradicts our intuition. Therefore we should run the random initialization multiple times and keep the clustering result that minimizes the cost function:
5. Choosing the number of clusters
Although K-means seems simple and practical, its biggest defect is that the number of clusters, the value K, must be specified in advance. In reality we often don't know what the result should be, so specifying K beforehand is very difficult and we can only try values one by one.
One approach is to try multiple values of K and look for a sharp turning point in the cost curve. This is called the elbow method, because we look for a spot shaped like a person's elbow.
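A sketch of such a sweep over K, on hypothetical toy data (the helper names are mine):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_cost(points, k, seed=0, iters=20):
    """Run K-means once from a seeded random init; return the final cost J."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return sum(dist2(p, centers[c]) for p, c in zip(points, labels)) / len(points)

# Two well-separated groups of toy points (hypothetical data)
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]

# Sweep K (best of a few seeds per K) and look for the "elbow":
# the point after which adding clusters stops reducing the cost much.
costs = {k: min(kmeans_cost(points, k, seed=s) for s in range(5))
         for k in range(1, 5)}
```

Here the cost drops sharply from K = 1 to K = 2 and only slightly after that, so the elbow suggests K = 2.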
But what if no such point can be found? Then we can only specify the number according to our requirements. For example, when clustering the clothes below, we can choose K according to the number of sizes we want:
----------------------------------------------------------------------------------------
This lecture is an introduction to unsupervised learning; it covers only the most common clustering algorithm, K-means, so the content is light. There is still a lot of important material on clustering beyond this, such as the K-medoids algorithm and hierarchical clustering.