Learning the K-means algorithm and trying K-means in Spark MLlib

Last Update:2015-12-16 Source: Internet

Author: User

K-means is the classic partition-based clustering method and one of the ten classic data mining algorithms. Its basic idea is to take K points in the space as cluster centers and assign each object to the center closest to it. The centers are then updated iteratively until the clustering result stops improving. The algorithm takes the parameter K and divides the N data objects into K clusters such that objects in the same cluster are highly similar to each other, while objects in different clusters are not. Similarity is measured against a "center object" (the centroid, or center of gravity) computed as the mean of the objects in each cluster.

Algorithm Description:

Assume the sample set is to be divided into c classes. The algorithm proceeds as follows:
(1) Choose suitable initial centers for the c classes.
(2) In the k-th iteration, compute the distance from each sample to each of the c centers and assign the sample to the class whose center is nearest.
(3) Update each class center to the mean of the samples assigned to it.
(4) If none of the c centers changes after steps (2) and (3), stop; otherwise continue iterating.
The biggest advantages of the algorithm are its simplicity and speed. The keys to the algorithm are the choice of the initial centers and the distance formula.
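The four steps above can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the names KMeansSketch, Point, sqDist, closest, mean and lloyd are hypothetical, the distance is assumed to be squared Euclidean, and the initial centers of step (1) are supplied by the caller rather than chosen by the code.

```scala
// Plain-Scala sketch of steps (1)-(4); all names are hypothetical.
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step (2): index of the center nearest to p.
  def closest(p: Point, centers: Vector[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Step (3): component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.head.indices.map(d => ps.map(_(d)).sum / ps.size).toVector

  // Steps (2)-(4): repeat assignment and update until no center moves
  // or maxIter iterations have been used.
  @annotation.tailrec
  def lloyd(points: Seq[Point], centers: Vector[Point], maxIter: Int): Vector[Point] =
    if (maxIter <= 0) centers
    else {
      val groups = points.groupBy(p => closest(p, centers))
      // A center that attracted no points keeps its previous position.
      val updated = centers.indices
        .map(i => groups.get(i).map(mean).getOrElse(centers(i)))
        .toVector
      if (updated == centers) centers
      else lloyd(points, updated, maxIter - 1)
    }
}
```

Calling KMeansSketch.lloyd(points, initialCenters, 20) returns the final centers. Because the result depends on the initial centers passed in, running it from several different starting points and comparing the outcomes is a common precaution.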

Setting aside what the original data represents, suppose it has already been mapped to points in a Euclidean space, like this:

From the rough shape of the data points, we can see that they fall into about three clusters, two of which are compact while the third is looser. Our goal is to group the data so that the different clusters can be told apart. If each cluster is marked in a different color, it looks like this:

The flow of the algorithm:

First, K objects are selected from the N data objects as the initial cluster centers. Each remaining object is assigned to the cluster whose center it is most similar to, that is, closest to. The center of each new cluster is then recomputed as the mean of all objects in that cluster. This process repeats until the criterion function converges. The mean squared error is generally used as the criterion function. The resulting K clusters have the following characteristics: each cluster is as compact as possible, and the clusters are as well separated from each other as possible.
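For reference, the convergence measure described above is commonly written as a sum of squared errors. The symbols are notation assumed here, not taken from the original article: $C_j$ is the set of objects assigned to cluster $j$, $\mu_j$ is the center of that cluster, and $K$ is the number of clusters.

```latex
J = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

Dividing $J$ by N gives the mean squared error. Reassigning an object to its nearest center and replacing a center with the mean of its cluster can each only lower $J$ or leave it unchanged, which is why the iteration settles down.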

The sample code follows:

package main.asiainfo.coc.sparkMLlib

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Root on 12/15/15.
 */
object Kmeans {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local").setAppName("Cocapp")
    val sc = new SparkContext(sparkConf)
    // Load the data set
    val data = sc.textFile("/usr/local/spark-1.4.0-bin-2.5.0-cdh5.2.1/ysy.txt")
    // Assumption: each line holds space-separated numeric values
    val parsedData = data.map(s => Vectors.dense(s.split(" ").map(_.toDouble)))
    // Cluster the data into 2 classes with 20 iterations; training produces the model
    val numClusters = 2
    val numIterations = 20
    val model = KMeans.train(parsedData, numClusters, numIterations)
    // Print the center points of the model
    println("Cluster Centers:")
    for (c <- model.clusterCenters) {
      println(" " + c.toString)
    }
    // Evaluate the model using the sum of squared errors
    val cost = model.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + cost)
    // Cross-evaluation 1: return only the predicted cluster index
    val testData = data.map(s => Vectors.dense(s.split(" ").map(_.toDouble)))
    val result1 = model.predict(testData)
    result1.foreach(println)
    println("-----------------------")
    // Cross-evaluation 2: return each input line together with its predicted cluster index
    val result2 = data.map { line =>
      val lineVector = Vectors.dense(line.split(" ").map(_.toDouble))
      val prediction = model.predict(lineVector)
      line + " " + prediction
    }
    result2.foreach(println)
    sc.stop()
  }
}
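To make the evaluation steps in the program more concrete, here is a minimal plain-Scala sketch of what a within-set sum of squared errors and a nearest-center prediction compute. It is not MLlib's implementation: the names CostSketch, Point, sqDist, predict and cost are hypothetical, and squared Euclidean distance is assumed.

```scala
// Plain-Scala illustration only; not the MLlib code path.
object CostSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Cluster index for a point: the index of its nearest center.
  def predict(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Sum, over all points, of the squared distance to the nearest center.
  def cost(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum
}
```

A lower cost for the same number of clusters means the points sit closer to their centers. The cost also tends to fall as the number of clusters grows, so it is most useful for comparing models trained with the same K.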

The center point of the data model:

Use the sum of squared errors to evaluate the data model:

Cross-Evaluation 1 and 2:
