Learning Data Mining Algorithms (I): The K-means Algorithm

Source: Internet
Author: User

I have recently started studying data mining and am sharing my study notes here. The tool I currently use is WEKA, and the next article will focus on it.


Algorithm introduction:

The K-means algorithm takes as input the number of clusters K and a dataset containing N data objects, and outputs K clusters that satisfy the minimum-variance criterion. The resulting clustering has the following property: objects within the same cluster are highly similar, while objects in different clusters have low similarity.

Algorithm assumption:

The mean squared error is assumed to be the best measure of the dispersion within a cluster.
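
The minimum-variance criterion is commonly written as a within-cluster sum of squared errors. The notation below is introduced here only for illustration and does not come from WEKA: C_k is the set of objects in cluster k, mu_k is the center of that cluster, and x_i is a data object.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller J means the objects lie closer to their own cluster centers, which corresponds to the high within-cluster similarity described in the introduction.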

Algorithm input:

The number of clusters K, and a dataset containing N data objects.

Algorithm output:

K clusters

Algorithm idea:

(A) The green points are the dataset, plotted in two-dimensional Euclidean space. The initial centers U1 and U2 are marked with a red cross and a blue cross, respectively.

(B) In the initial E step, each point is assigned to the red or the blue cluster according to which cluster center is closer. This is equivalent to classifying the points by which side of the perpendicular bisector of the two centers they fall on; the bisector is drawn as a purple line.

(C) In the following M step, each cluster center is recomputed as the mean of the points assigned to that cluster.

The E and M steps are repeated until the centers no longer change, or change only slightly.
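
The steps above can be sketched in plain Java without WEKA. This is only a minimal illustration under stated assumptions, not how WEKA implements it: the names KMeansSketch, cluster, maxIterations and tolerance are made up for this example, the initial centers are supplied by the caller, and an empty cluster simply keeps its previous center.

```java
import java.util.Arrays;

// Minimal sketch of the E step / M step loop; all names here are illustrative.
class KMeansSketch {

    /** Runs k-means and returns the final centers; assignments[i] receives the cluster index of points[i]. */
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              int[] assignments, int maxIterations, double tolerance) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // E step: assign every point to its closest center.
            for (int i = 0; i < points.length; i++) {
                assignments[i] = nearest(points[i], centers);
            }
            // M step: recompute each center as the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignments[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignments[i]][d] += points[i][d];
                }
            }
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // assumption: an empty cluster keeps its previous center
                }
                double shift = 0.0;
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    shift += (mean - centers[c][d]) * (mean - centers[c][d]);
                    centers[c][d] = mean;
                }
                maxShift = Math.max(maxShift, Math.sqrt(shift));
            }
            // Stop once no center moved by more than the tolerance.
            if (maxShift <= tolerance) {
                break;
            }
        }
        return centers;
    }

    /** Returns the index of the center with the smallest squared Euclidean distance to p. */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                dist += (p[d] - centers[c][d]) * (p[d] - centers[c][d]);
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two well-separated groups of made-up points.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8}};
        double[][] initialCenters = {{0, 0}, {10, 10}};
        int[] assignments = new int[points.length];
        double[][] centers = cluster(points, initialCenters, assignments, 100, 1e-9);
        System.out.println("centers: " + Arrays.deepToString(centers));
        System.out.println("assignments: " + Arrays.toString(assignments));
    }
}
```

The result depends on the initial centers, so in practice the algorithm is often run several times from different starting points.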


Run WEKA:

The result of running K-means on weather.nominal.arff is as follows:

The results show that K-means ran for four iterations on this dataset, starting from two initial centers. In the end, 10 instances were grouped into one cluster and 4 instances into the other.


Function call code:

import java.io.File;

import weka.clusterers.SimpleKMeans;

import weka.core.Instances;

import weka.core.converters.ArffLoader;

// Read the sample

File file = new File("F:\\Program Files (x86)\\Weka-3-7\\data\\weather.nominal.arff");

ArffLoader loader = new ArffLoader();

loader.setFile(file);

Instances ins = loader.getDataSet();

// Initialize the clustering tool and set the K value

SimpleKMeans km = new SimpleKMeans();

km.setNumClusters(2);

// Perform clustering

km.buildClusterer(ins);

// Print the result (the cluster centroids)

Instances tempIns = km.getClusterCentroids();

System.out.println("centroids: " + tempIns);

The running result is as follows:

@attribute outlook {sunny, overcast, rainy}

@attribute temperature {hot, mild, cool}

@attribute humidity {high, normal}

@attribute windy {TRUE, FALSE}

@attribute play {yes, no}

@data

sunny,mild,high,FALSE,yes

overcast,cool,normal,TRUE,yes
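
The two centroids in this output are made up of nominal values rather than numeric means, because every attribute in this dataset is nominal. To my understanding, SimpleKMeans uses the most frequent value (the mode) of each nominal attribute as the centroid value. The sketch below illustrates that idea only; NominalCentroidSketch and modeCentroid are hypothetical names, the rows are made-up examples rather than the actual cluster contents, and this is not WEKA's code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a centroid for nominal data built from per-attribute modes.
class NominalCentroidSketch {

    /** Returns, for each attribute (column), the most frequent value among the rows. */
    static String[] modeCentroid(String[][] rows) {
        int numAttributes = rows[0].length;
        String[] centroid = new String[numAttributes];
        for (int a = 0; a < numAttributes; a++) {
            // Count how often each value of attribute a occurs.
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String[] row : rows) {
                counts.merge(row[a], 1, Integer::sum);
            }
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) { // on a tie, the value seen first wins
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            centroid[a] = best;
        }
        return centroid;
    }

    public static void main(String[] args) {
        // Hypothetical members of one cluster: outlook, temperature, humidity, windy, play.
        String[][] rows = {
            {"sunny", "hot", "high", "FALSE", "no"},
            {"sunny", "mild", "high", "FALSE", "yes"},
            {"rainy", "mild", "normal", "FALSE", "yes"},
            {"sunny", "mild", "high", "TRUE", "yes"}
        };
        System.out.println(Arrays.toString(modeCentroid(rows)));
    }
}
```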


Algorithm Application:

1. Image segmentation

The figure shows the segmentation results obtained with different values of K.

2. Analyzing the similarity of products in e-commerce and grouping the products into categories

3. Segmenting a company's customers and applying a different business strategy to each segment
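
As a rough illustration of the first application, the sketch below groups grayscale pixel intensities into K levels, where K is the number of initial centers passed in. It is a simplified example under stated assumptions: the pixel values and initial centers are made up, and GraySegmentationSketch and segment are hypothetical names. A real image would first have to be read into an intensity array.

```java
import java.util.Arrays;

// Illustrative sketch: k-means on one-dimensional grayscale intensities.
class GraySegmentationSketch {

    /**
     * Clusters pixel intensities into centers.length groups and returns one label per pixel.
     * The centers array is updated in place and holds the final intensity levels on return.
     */
    static int[] segment(int[] pixels, double[] centers, int maxIterations) {
        int k = centers.length;
        int[] labels = new int[pixels.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assign each pixel to the nearest intensity center.
            for (int i = 0; i < pixels.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(pixels[i] - centers[c]) < Math.abs(pixels[i] - centers[best])) {
                        best = c;
                    }
                }
                labels[i] = best;
            }
            // Recompute each center as the mean intensity of its pixels.
            double[] sums = new double[k];
            int[] counts = new int[k];
            for (int i = 0; i < pixels.length; i++) {
                sums[labels[i]] += pixels[i];
                counts[labels[i]]++;
            }
            double[] updated = centers.clone();
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    updated[c] = sums[c] / counts[c];
                }
            }
            // Stop when the centers no longer change.
            if (Arrays.equals(updated, centers)) {
                break;
            }
            System.arraycopy(updated, 0, centers, 0, k);
        }
        return labels;
    }

    public static void main(String[] args) {
        int[] pixels = {10, 12, 11, 200, 205, 198, 90, 95}; // made-up intensities
        double[] centers = {0, 128, 255};                    // K = 3 initial levels
        int[] labels = segment(pixels, centers, 50);
        System.out.println("labels: " + Arrays.toString(labels));
        System.out.println("levels: " + Arrays.toString(centers));
    }
}
```

Passing more or fewer initial centers changes K, and with it how coarse the segmentation is.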



This is an original article. Please credit the source when reposting. Thank you.


