K-means algorithm principles and mathematical knowledge

Source: Internet
Author: User

Summary

In big data work, clustering is often used as a foundation for other analyses, because clustering a data set reveals some of its overall characteristics. Of the many clustering algorithms, K-means is one of the simplest and most practical. This article gives a detailed introduction to the principle of the K-means algorithm and the mathematical derivation behind it, and discusses some pitfalls to avoid in practical applications.

Algorithm

The K-means algorithm itself is simple, but there are many details to consider when using it in production; these are discussed later. First, the steps of the algorithm:

1. Choose K initial cluster centers.

2. Repeat:

   a. Assign each data object to its nearest cluster center, forming K clusters.

   b. Recalculate the center of each cluster.

3. Stop when the cluster centers no longer change.
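The steps above can be sketched in a few lines of pure Python. This is a minimal illustration rather than a production implementation: the function names, the `max_iter` safety limit, the fixed random seed, and the handling of empty clusters are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) for a basic K-means run."""
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2a: assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 2b: recompute each center as the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center (a simplification).
            new_centers.append(mean_point(members) if members else centers[j])
        # Step 3: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)
```

On this small data set the first three points and the last three points end up in separate clusters. Real data usually needs more care with initialization and with the choice of K.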

Discussion on selection of K and initial cluster centers

1. Choosing K

When we receive a batch of data, in most cases we do not know how many clusters it contains.

a. In some cases a deeper understanding of the business lets us identify the clusters. For example, suppose we have purchase records for a group of users and have already counted, for each user, the number of items bought on weekdays and the number bought on weekends. Plotting each user in two dimensions (items bought on weekdays against items bought on weekends) suggests three clusters: users who buy mainly on weekends, users who buy mainly on weekdays, and users who buy on both.

b. When we really do not know how many clusters the data contains, we can use an algorithm to estimate it. One strategy is to try several values of K, measure each result with an objective function f (introduced later), and choose the K with the best value of f. Note that when f is the SSE, it generally keeps decreasing as K grows, so in practice one looks for the K beyond which f stops improving much rather than for the absolute minimum. Trying many values of K also increases the time and space cost. The other strategy is to use the canopy algorithm to select both K and the initial cluster centers as input to K-means. Canopy does not itself need K or initial cluster centers as input, so it can serve as a preprocessing step that supplies the K value and the cluster centers that K-means requires.
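The article does not spell out how canopy works, so the following is only a sketch of one common formulation. It uses two distance thresholds, a loose `t1` and a tight `t2` with `t1 > t2`. The threshold values, the function name `canopy_centers`, and the choice to take candidates in list order are assumptions made for the example.

```python
import math


def canopy_centers(points, t1, t2):
    """Return a list of (center, members) canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    candidates = list(points)
    canopies = []
    while candidates:
        # Take the next remaining candidate as a canopy center
        # (a random pick is also common).
        center = candidates.pop(0)
        # Every point within the loose threshold belongs to this canopy.
        members = [p for p in points if math.dist(p, center) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a new canopy.
        candidates = [p for p in candidates if math.dist(p, center) > t2]
    return canopies


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
canopies = canopy_centers(data, t1=4.0, t2=2.0)
k = len(canopies)                                  # estimate of K
initial_centers = [center for center, _ in canopies]
print(k, initial_centers)
```

The number of canopies gives an estimate of K, and the canopy centers can be passed to K-means as initial cluster centers. The result depends heavily on the two thresholds, which have to be chosen for the data at hand.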

2. Choosing the initial cluster centers

Choosing the initial cluster centers is one of the most troublesome steps: with a poor choice, the algorithm easily converges to locally optimal cluster centers rather than globally optimal ones.
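To illustrate how much the result can depend on the initial centers, the sketch below runs K-means from several random initial selections and keeps the run with the lowest sum of squared distances from each point to its nearest center. The names `run_kmeans` and `best_of_restarts`, the number of restarts, and the seed are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def run_kmeans(points, k, rng, max_iter=100):
    """One K-means run from random initial centers; returns (sse, centers)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign each point to the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute centers as cluster means; empty clusters keep the old center.
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return sse, centers


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-means several times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    runs = (run_kmeans(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])


data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
sse, centers = best_of_restarts(data, k=3)
print(sse, centers)
```

More restarts raise the chance of a good result but multiply the running time, and even the best run is not guaranteed to be the global optimum.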

a. Once K is known, select the initial cluster centers. One strategy is to randomly select K points as the initial centers several times, compare the objective function f across the runs, and keep the initial centers that give the best f. This random approach has several shortcomings: (1) it increases the time and space cost; (2) the cluster centers found may still be locally rather than globally optimal. For example, when two randomly selected centers fall inside the same true cluster, recalculating the centers may never separate them properly, so the result is not globally optimal and the clustering is not the one we want. Despite these shortcomings, the strategy is still usable and can be applied in production.

b. When K is known, another strategy is as follows. Divide the data into two sets: the cluster center set and the original data set. First, randomly select one point from the original data set and put it into the cluster center set as the first initial center. Then repeatedly select from the original data set the point farthest from the cluster center set as the next initial center, until K centers have been chosen. In practice this tends to give better results than strategy a, but it is more sensitive to outliers, and the computation and memory needed to select the initial centers are large.

c. When K is not known, a common strategy is to use the canopy algorithm to find both K and the cluster centers.
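Strategy b above, in which each new center is the point farthest from the centers chosen so far, can be sketched as follows. "Farthest from the cluster center set" is interpreted here as having the largest distance to its nearest already-chosen center. That interpretation, the function name `farthest_point_init`, and the seed are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k, seed=0):
    """Pick K initial centers: one at random, then the farthest point each time."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Distance from a point to the center set = distance to its nearest center.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
initial_centers = farthest_point_init(data, k=3)
print(initial_centers)
```

Each selection scans all points against all chosen centers, which is where the extra computation comes from. A single extreme outlier would also be picked early, which illustrates the sensitivity to outliers mentioned above.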


Distance measures for K-means

In the K-means algorithm, each data point must be assigned to the nearest cluster center, which requires a measure of nearness. What do we measure, and how? K-means needs to compute distances, and computing distances requires numerical values, so K-means is most practical for numerical data. The most commonly used measures are Euclidean distance for data in Euclidean space, cosine similarity for document processing, and sometimes Manhattan distance. Different cases call for different measures.

Euclidean distance

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Cosine similarity

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

Vector notation

$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

Manhattan Distance:

$$d(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$$
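The three measures named in this section can be written directly from their standard definitions. The function names below are illustrative, and points are assumed to be equal-length sequences of numbers.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)  # undefined if either vector is all zeros


x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean_distance(x, y))                    # 5.0
print(manhattan_distance(x, y))                    # 7.0
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))   # 0.0
```

Note that cosine similarity is a similarity rather than a distance: larger values mean the vectors are closer in direction, so "nearest" means highest similarity.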

Mathematics behind the K-means algorithm (criteria for judging the quality of a K-means result)

The K-means algorithm partitions data into clusters, so what goal are we trying to achieve? We want the differences within a cluster to be small and the differences between clusters to be as large as possible. That is only a verbal description; it cannot be used for rigorous study or mathematical derivation. We want a formula or mathematical model that measures it. What formula captures the description above? In general, the sum of squared errors (SSE) is used as the objective function; the objective function f mentioned earlier is this SSE. First, the formula:


$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} d(c_i, x)^2$$

Notation: $K$ is the number of clusters, $C_i$ is the i-th cluster, $c_i$ is the center of $C_i$, $x$ is a data point belonging to $C_i$, and $d$ is the Euclidean distance.

To make the differences within a cluster small and the differences between clusters large (by default we assume the data lies in Euclidean space and differences are measured by Euclidean distance), K-means in effect minimizes the SSE. Two steps of the algorithm reduce the SSE: assigning each data point to the cluster with the nearest center, and recalculating each cluster center. Each step can only lower the SSE or leave it unchanged. However, this optimization strategy only finds a local optimum; to approach the global optimum, reasonable initial cluster centers are needed.
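The two SSE-reducing steps described above can be checked on a tiny example. The data, the deliberately poor starting centers and labels, and the function names are all made up for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of its points (empty clusters stay put)."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(1.0, 1.0), (4.0, 1.0)]   # deliberately poor starting centers
labels = [0, 1, 0, 1]                # deliberately poor starting assignment

sse_start = sse(points, labels, centers)
labels = assign(points, centers)
sse_after_assign = sse(points, labels, centers)
centers = update(points, labels, centers)
sse_after_update = sse(points, labels, centers)
print(sse_start, sse_after_assign, sse_after_update)
```

The three printed values are non-increasing: reassignment lowers the SSE for fixed centers, and moving the centers to the cluster means lowers it again for fixed assignments.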

One more question remains: why do we choose the mean of a cluster's points as the cluster center? Because the mean minimizes the SSE. How do we find the minimum of a function in mathematics? By taking the derivative. SSE is a function of several variables (the cluster centers), so we take the partial derivative with respect to each center, as in the following derivation.

$$\frac{\partial}{\partial c_k} SSE = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2 = \sum_{x \in C_k} 2\,(c_k - x) = 0$$

$$\Rightarrow \; m_k c_k = \sum_{x \in C_k} x \;\Rightarrow\; c_k = \frac{1}{m_k} \sum_{x \in C_k} x$$

Here $C_k$ is the k-th cluster, $c_k$ is its center, and $m_k$ is the number of points in $C_k$. The derivation is written for a single coordinate; with Euclidean distance the same argument applies to each coordinate separately.



The derivation shows why the mean is chosen as the cluster center: when the center is the mean of the cluster's points, the SSE is minimized. A later article will give a detailed introduction to the K-means implementation in Spark MLlib.
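The conclusion can also be checked numerically. The sketch below compares the SSE of one small cluster at its mean with the SSE at a few nearby candidate centers. The data and the offsets are arbitrary values chosen for illustration.

```python
def cluster_sse(points, center):
    """Sum of squared Euclidean distances from the points to one center."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in points)


cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))

# Candidate centers on a small grid around the mean (the mean itself included).
candidates = [
    (mean[0] + dx, mean[1] + dy)
    for dx in (-0.5, 0.0, 0.5)
    for dy in (-0.5, 0.0, 0.5)
]
best = min(candidates, key=lambda c: cluster_sse(cluster, c))

print(mean, cluster_sse(cluster, mean))
print(best == mean)
```

Among the candidates tried, the mean gives the lowest SSE, in line with the derivation. A grid check like this is only an illustration and does not replace the derivation itself.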



This article is from the Big Data Learning blog; please retain this source attribution: http://9269309.blog.51cto.com/9259309/1864214
