K-means algorithm study and a Spark MLlib K-means experiment

Source: Internet
Author: User

K-means is the most classic partition-based clustering method and one of the ten classic data mining algorithms. Its basic idea is to cluster around K points in the space, which serve as centers, and to assign each object to the center closest to it. The cluster centers are then updated iteratively until the result converges (K-means converges to a local optimum, which is not guaranteed to be the global one). The algorithm takes a parameter K and divides the N data objects into K clusters such that objects in the same cluster are highly similar to one another, while objects in different clusters are dissimilar. Cluster similarity is measured against a "center object" (the center of gravity, or centroid) obtained by averaging the objects in each cluster.
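The "center object" and the notion of closeness can be made concrete with a small sketch in plain Scala. It assumes points are arrays of doubles and that similarity is measured by squared Euclidean distance; the names CentroidSketch, squaredDistance and centroid are illustrative and do not come from the original article or from Spark.

```scala
object CentroidSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension (assumed metric).
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty group of points: the cluster's "center object".
  def centroid(points: Seq[Point]): Point = {
    require(points.nonEmpty, "a cluster needs at least one point")
    val dim = points.head.length
    Array.tabulate(dim)(i => points.map(_(i)).sum / points.size)
  }

  def main(args: Array[String]): Unit = {
    val cluster = Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
    val c = centroid(cluster)
    println(c.mkString("(", ", ", ")"))        // (3.0, 4.0)
    println(squaredDistance(cluster.head, c))  // 8.0
  }
}
```

The centroid is simply the per-coordinate average, so it does not have to coincide with any actual data point.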

Algorithm Description:

Assuming the sample set is to be divided into c classes, the algorithm is as follows: (1) choose suitable initial centers for the c classes; (2) in the k-th iteration (k here is the iteration index), compute the distance from every sample to each of the c centers and assign the sample to the class whose center is nearest; (3) update each class's center to the mean of the samples assigned to it; (4) if none of the c cluster centers changes after steps (2) and (3), stop; otherwise continue iterating. The algorithm's biggest advantages are simplicity and speed. Its results depend chiefly on the choice of initial centers and on the distance formula.
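Steps (1) to (4) can be sketched in plain Scala, without Spark, as follows. This is a minimal illustration under stated assumptions, not the MLlib implementation: the initial centers of step (1) are supplied by the caller, distance is squared Euclidean, a center that attracts no samples keeps its previous position, and the names LloydSketch, closest, mean and kMeans are made up for this example.

```scala
object LloydSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step (2): index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Step (3): component-wise mean of the samples assigned to one class.
  def mean(points: Seq[Point]): Point =
    Array.tabulate(points.head.length)(i => points.map(_(i)).sum / points.size)

  // Steps (2)-(4), starting from the caller-supplied initial centers of step (1).
  def kMeans(data: Seq[Point], initial: Seq[Point], maxIterations: Int): Seq[Point] = {
    var centers = initial
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      val groups = data.groupBy(p => closest(p, centers))
      // Assumption: a center with no assigned samples keeps its previous position.
      val updated = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
      // Step (4): stop as soon as no center moved.
      changed = updated.zip(centers).exists { case (u, c) => !u.sameElements(c) }
      centers = updated
      iteration += 1
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, 1.0), Array(1.0, 2.0), Array(8.0, 8.0), Array(9.0, 8.0))
    val centers = kMeans(data, initial = Seq(data(0), data(2)), maxIterations = 20)
    centers.foreach(c => println(c.mkString("(", ", ", ")")))
    // (1.0, 1.5)
    // (8.5, 8.0)
  }
}
```

The maxIterations cap is a safeguard added for the sketch; with these four points the centers stop moving after the second pass.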

Setting aside what the original data represents, suppose we have already mapped it to points in a Euclidean space, like this:

From the rough shape of the data points, we can see that they fall into roughly three clusters, two of which are compact while the third is looser. Our goal is to group the data so that the different clusters can be told apart; labeling each cluster with a different color, it looks like this:

The flow of the algorithm:

First, K objects are selected from the N data objects as the initial cluster centers. Each remaining object is assigned to the cluster whose center it is most similar to (closest to). Then the center of each new cluster is recomputed as the mean of all objects in that cluster. This process repeats until the standard measure function converges. The mean squared error (the mean of the squared distances from each object to its cluster center) is generally used as the standard measure function. The resulting K clusters have the following characteristics: each cluster is as compact as possible, and the clusters are as well separated from one another as possible.
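One common concrete form of this measure function is the sum of squared distances from every object to its nearest cluster center, which is also the quantity the sample code prints as "Within Set Sum of Squared Errors". The plain-Scala sketch below assumes squared Euclidean distance; CostSketch and sumOfSquaredErrors are illustrative names, not Spark APIs.

```scala
object CostSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum over all points of the squared distance to the nearest center.
  def sumOfSquaredErrors(data: Seq[Point], centers: Seq[Point]): Double =
    data.map(p => centers.map(c => squaredDistance(p, c)).min).sum

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, 1.0), Array(1.0, 2.0), Array(8.0, 8.0), Array(9.0, 8.0))
    val centers = Seq(Array(1.0, 1.5), Array(8.5, 8.0))
    println(sumOfSquaredErrors(data, centers)) // 1.0
  }
}
```

Dividing this sum by the number of objects gives the mean squared error; a smaller value means more compact clusters for the same K.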

Here is the sample code:

package main.asiainfo.coc.sparkMLlib

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 12/15/15.
 */
object Kmeans {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local").setAppName("Cocapp")
    val sc = new SparkContext(sparkConf)
    // Load the data set (assumes one point per line, values separated by spaces)
    val data = sc.textFile("/usr/local/spark-1.4.0-bin-2.5.0-cdh5.2.1/ysy.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(" ").map(_.toDouble)))
    // Cluster the data into 2 classes with at most 20 iterations; training produces the model
    val numClusters = 2
    val numIterations = 20
    val model = KMeans.train(parsedData, numClusters, numIterations)
    // Print the center points of the model
    println("Cluster Centers:")
    for (c <- model.clusterCenters) {
      println(c.toString)
    }
    // Evaluate the model with the within-set sum of squared errors
    val cost = model.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + cost)
    // Cross-evaluation 1: return only the predicted cluster index of each point
    // (the "test" data here is the same file the model was trained on)
    val testData = data.map(s => Vectors.dense(s.split(" ").map(_.toDouble)))
    val result1 = model.predict(testData)
    result1.foreach(println)
    println("-----------------------")
    // Cross-evaluation 2: return each input line together with its predicted cluster index
    val result2 = data.map {
      line =>
        val lineVector = Vectors.dense(line.split(" ").map(_.toDouble))
        val prediction = model.predict(lineVector)
        line + " " + prediction
    }
    result2.foreach(println)
    sc.stop()
  }
}

The center point of the data model:

Use the sum of squared errors to evaluate the data model:

Cross-Evaluation 1 and 2:
