I have recently started exploring data mining and am sharing my study notes here. I am currently using WEKA, and the following articles will focus on it.
Algorithm introduction:
The K-means algorithm takes two inputs: the number of clusters K and a dataset of N data objects. It outputs K clusters that satisfy the minimum-variance criterion. The resulting clustering also has the following property: objects within the same cluster are highly similar, while objects in different clusters have low similarity.
Algorithm assumption:
The mean squared error is the best measure of the dispersion within a cluster.
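To make the minimum-variance criterion concrete, it is commonly written as the within-cluster sum of squared distances. The symbols below are my own notation rather than anything WEKA prints: x_i is a data object, C_k is the set of objects in cluster k, and mu_k is the center of cluster k.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

K-means looks for an assignment of objects to clusters that makes J small. Dividing J by N gives the mean squared error mentioned in the assumption.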
Algorithm input:
The number of clusters K, and a dataset containing N data objects.
Algorithm output:
K clusters
Algorithm idea:
(A) The green points show the dataset in two-dimensional Euclidean space. The initial centers U1 and U2 are marked with a red cross and a blue cross, respectively.
(B) In the initial E step, each point is assigned to the red or the blue cluster according to which cluster center is closer. This is equivalent to classifying the points by which side of the perpendicular bisector of the two centers they fall on; the bisector is drawn as a purple line.
(C) In the following M step, the center of each cluster is recomputed as the mean of the points assigned to that cluster.
The E and M steps are repeated until the centers no longer change, or change only slightly.
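The E and M steps described above can be sketched in plain Java as follows. This is only a minimal illustration for two-dimensional points, not WEKA's SimpleKMeans implementation. The class name KMeansSketch, the method names, the tolerance tol, and the sample points are all assumptions made for this example.

```java
import java.util.Arrays;

// Minimal K-means sketch for 2-D points (illustrative only; not WEKA's code).
class KMeansSketch {

    // E step helper: index of the center closest to p (squared Euclidean distance).
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centers.length; k++) {
            double dx = p[0] - centers[k][0];
            double dy = p[1] - centers[k][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) {
                bestDist = d;
                best = k;
            }
        }
        return best;
    }

    // Repeats the E and M steps until no center moves by more than tol,
    // or until maxIter iterations have run. The centers array is updated
    // in place; the returned array holds the cluster index of each point.
    static int[] cluster(double[][] points, double[][] centers, int maxIter, double tol) {
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // E step: assign every point to its nearest center.
            for (int i = 0; i < points.length; i++) {
                assign[i] = nearest(points[i], centers);
            }
            // M step: recompute each center as the mean of its points.
            double[][] sum = new double[centers.length][2];
            int[] count = new int[centers.length];
            for (int i = 0; i < points.length; i++) {
                sum[assign[i]][0] += points[i][0];
                sum[assign[i]][1] += points[i][1];
                count[assign[i]]++;
            }
            double maxShift = 0.0;
            for (int k = 0; k < centers.length; k++) {
                if (count[k] == 0) {
                    continue; // empty cluster: keep the old center
                }
                double nx = sum[k][0] / count[k];
                double ny = sum[k][1] / count[k];
                maxShift = Math.max(maxShift, Math.hypot(nx - centers[k][0], ny - centers[k][1]));
                centers[k][0] = nx;
                centers[k][1] = ny;
            }
            if (maxShift <= tol) {
                break; // centers unchanged (or nearly so): stop
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two well-separated groups of three points.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centers = {{0, 0}, {10, 10}}; // initial centers, playing the role of U1 and U2
        int[] assign = cluster(points, centers, 100, 1e-9);
        System.out.println(Arrays.toString(assign)); // prints [0, 0, 0, 1, 1, 1]
        System.out.println(Arrays.deepToString(centers));
    }
}
```

In this sketch a cluster that ends up empty simply keeps its previous center. That is one simple choice among several, and real implementations may handle this case differently.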
Run WEKA:
The result of running K-means on weather.nominal.arff is as follows:
From the results, we can see that the K-means algorithm ran for four iterations on this dataset, starting from two initial centers. In the end, 10 instances were grouped into one cluster and 4 instances into the other.
Function call code:
import java.io.File;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

// The statements below belong inside a method declared with "throws Exception".
// Read the sample
File file = new File("F:\\Program Files (x86)\\Weka-3-7\\data\\weather.nominal.arff");
ArffLoader loader = new ArffLoader();
loader.setFile(file);
Instances ins = loader.getDataSet();
// Initialize the clusterer and set the K value
SimpleKMeans km = new SimpleKMeans();
km.setNumClusters(2);
// Perform clustering
km.buildClusterer(ins);
// Print the result
Instances tempIns = km.getClusterCentroids();
System.out.println("centroids:" + tempIns);
The running result is as follows:
@attribute outlook {sunny,overcast,rainy}
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@data
sunny,mild,high,FALSE,yes
overcast,cool,normal,TRUE,yes
Algorithm applications:
1. Image segmentation
The figure shows the segmentation results obtained with different values of K.
2. Analyzing the similarity of products in e-commerce and classifying the products
3. Analyzing a company's customer categories and applying a different business strategy to each
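For the image segmentation application in item 1, here is a rough sketch of how the choice of K changes the result. It clusters grayscale pixel intensities with a one-dimensional K-means and replaces each pixel by the mean of its cluster, so the output contains at most K distinct gray levels. The class name GraySegmentationSketch, the evenly spaced initial centers, and the tiny pixel array are assumptions made for illustration. A real image would be read into the array first, and color images would need a distance over all channels.

```java
import java.util.Arrays;

// Sketch: segment a grayscale image by running 1-D K-means on pixel intensities.
class GraySegmentationSketch {

    // Returns a new array in which every pixel is replaced by the rounded
    // mean intensity of the cluster it was assigned to.
    static int[] segment(int[] pixels, int k, int maxIter) {
        // Assumption: initial centers are spread evenly over the 0..255 range.
        double[] centers = new double[k];
        for (int c = 0; c < k; c++) {
            centers[c] = 255.0 * (c + 0.5) / k;
        }
        int[] label = new int[pixels.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach each pixel to the nearest center.
            boolean changed = false;
            for (int i = 0; i < pixels.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(pixels[i] - centers[c]) < Math.abs(pixels[i] - centers[best])) {
                        best = c;
                    }
                }
                if (label[i] != best) {
                    label[i] = best;
                    changed = true;
                }
            }
            if (iter > 0 && !changed) {
                break; // assignments are stable, so the centers are too
            }
            // Update step: each center becomes the mean intensity of its pixels.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < pixels.length; i++) {
                sum[label[i]] += pixels[i];
                count[label[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) {
                    centers[c] = sum[c] / count[c];
                }
            }
        }
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = (int) Math.round(centers[label[i]]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical 8-pixel "image" with dark, mid-gray, and bright regions.
        int[] pixels = {10, 12, 11, 200, 205, 198, 100, 102};
        System.out.println(Arrays.toString(segment(pixels, 2, 50)));
        // prints [47, 47, 47, 201, 201, 201, 47, 47]
        System.out.println(Arrays.toString(segment(pixels, 3, 50)));
        // prints [11, 11, 11, 201, 201, 201, 101, 101]
    }
}
```

With K = 2 the mid-gray pixels are merged into the dark region, while K = 3 gives them their own segment. This is the kind of difference a figure comparing K values would show.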
This is an original article; please credit the source when reposting. Thank you.