1. Run K-means, canopy, and fuzzy K-means clustering on the data set P04-17.csv, paying attention to data conversion.
Replace the commas in the P04-17.csv data set with spaces and save the result as P04-17.txt, because org.apache.mahout.clustering.conversion.InputDriver expects space-separated numeric values.
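The comma-to-space replacement can be done in any editor or with a one-line shell command. As a minimal Python sketch (the function names and the file names in the comment are illustrative, not part of the original steps):

```python
import csv
import io


def csv_to_space_separated(csv_text):
    """Rewrite comma-separated rows as space-separated rows, skipping blank lines."""
    rows = csv.reader(io.StringIO(csv_text))
    # Strip whitespace around each field so "0.1, 0.2" and "0.1,0.2" convert identically.
    lines = [" ".join(field.strip() for field in row) for row in rows if row]
    return "\n".join(lines) + "\n"


def convert_file(src_path, dst_path):
    """Read a CSV file and write its space-separated equivalent."""
    with open(src_path, newline="") as src:
        converted = csv_to_space_separated(src.read())
    with open(dst_path, "w") as dst:
        dst.write(converted)


# Example call (paths are assumptions; adjust to the local layout):
# convert_file("P04-17.csv", "P04-17.txt")
```

If the CSV file has a header row, it would need to be removed first, since the converter expects numbers only.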
hadoop fs -mkdir /user/kevin/mahout6
hadoop fs -copyFromLocal /home/kevin/dataguru/P04-17.txt /user/kevin/mahout6
Data conversion
mahout org.apache.mahout.clustering.conversion.InputDriver -i /user/kevin/mahout6/P04-17.txt -o /user/kevin/mahout6/vecfile -v org.apache.mahout.math.RandomAccessSparseVector
K-means clustering
mahout kmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result1 -c /user/kevin/mahout6/clu1 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl
Canopy clustering
mahout canopy -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result2 -t1 2 -t2 1 -ow
Fuzzy K-means clustering
mahout fkmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result3 -c /user/kevin/mahout6/clu2 -m 2 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -ow -cl
2. For K-means clustering with the same initial centroids, analyze how different k values, iteration counts, and convergence thresholds affect the clustering results.
Set k to 3 so that the final result contains three clusters.
mahout kmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/resultkmeans1 -c /user/kevin/mahout6/clukmeans1 -x 20 -k 3 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl
mahout clusterdump -s /user/kevin/mahout6/resultkmeans1/clusters-1-final -p /user/kevin/mahout6/resultkmeans1/clusteredPoints -o /home/kevin/dataguru/cluster1.csv -of CSV
The result is three rows, each representing one cluster.
The clusters can also be exported as a TXT file, which is easier to read, but CSV is more convenient for subsequent programs to process.
mahout clusterdump -s /user/kevin/mahout6/resultkmeans1/clusters-1-final -p /user/kevin/mahout6/resultkmeans1/clusteredPoints -o /home/kevin/dataguru/cluster1.txt
Split cluster1.csv into k separate files, one per cluster, for plotting in R.
Use the attached program to process each cluster and generate one CSV file per cluster.
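A minimal Python sketch of such a splitting step follows. It is not the attached program. It assumes each row of the CSV dump starts with a cluster id followed by one field per point, with every point encoded as index/value pairs joined by underscores (for example `_0_0.728_1_0.604_`), and that the data is two-dimensional. Check a row of the actual file before relying on this layout. The output names k-means0.csv, k-means1.csv, ... follow the file names read by the R code.

```python
def parse_point(field, dims=2):
    """Decode one point such as "_0_0.728_1_0.604_" into [0.728, 0.604].

    The underscore-separated layout is an assumption about the dump format.
    Coordinates missing from a sparse encoding are filled with 0.0.
    """
    tokens = [t for t in field.split("_") if t]
    point = [0.0] * dims
    # Tokens alternate: index, value, index, value, ...
    for i in range(0, len(tokens) - 1, 2):
        point[int(tokens[i])] = float(tokens[i + 1])
    return point


def split_clusters(csv_text, dims=2):
    """Map each cluster id to the list of points found on its row."""
    clusters = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        cluster_id, *fields = line.split(",")
        clusters[cluster_id] = [parse_point(f, dims) for f in fields if f]
    return clusters


def write_cluster_files(clusters, prefix="k-means"):
    """Write one headerless CSV per cluster: k-means0.csv, k-means1.csv, ..."""
    for n, points in enumerate(clusters.values()):
        with open(f"{prefix}{n}.csv", "w") as out:
            for point in points:
                out.write(",".join(str(v) for v in point) + "\n")
```

The files are written without a header row, which matches reading them in R with header = FALSE.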
There are three cluster centers:
Cluster 1 contains 594 points, with its center at (0.728, 0.604)
VL-1792{n=594 c=[0.728, 0.604] r=[0.096, 0.094]}
Cluster 2 contains 690 points, with its center at (0.390, 0.622)
VL-1794{n=690 c=[0.390, 0.622] r=[0.126, 0.120]}
Cluster 3 contains 516 points, with its center at (0.593, 0.365)
VL-1799{n=516 c=[0.593, 0.365] r=[0.221, 0.119]}
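Rather than copying the centers by hand, the summary lines of the text dump can be parsed. The sketch below assumes the lines look exactly like the ones shown above (identifier, then n, c and r in braces); the function name and the returned field names are illustrative.

```python
import re

# Matches lines such as: VL-1792{n=594 c=[0.728, 0.604] r=[0.096, 0.094]}
CLUSTER_LINE = re.compile(
    r"(?P<id>[A-Za-z]+-\d+)\{n=(?P<n>\d+)\s+"
    r"c=\[(?P<c>[^\]]*)\]\s+r=\[(?P<r>[^\]]*)\]\}"
)


def _to_floats(text):
    """Turn "0.728, 0.604" into [0.728, 0.604]."""
    return [float(x) for x in text.split(",") if x.strip()]


def parse_cluster_line(line):
    """Return id, point count, center and radius of one cluster, or None if no match."""
    match = CLUSTER_LINE.search(line)
    if match is None:
        return None
    return {
        "id": match.group("id"),
        "n": int(match.group("n")),
        "center": _to_floats(match.group("c")),
        "radius": _to_floats(match.group("r")),
    }
```

Summing n over the three lines above gives 594 + 690 + 516 = 1800 points, a quick check that every point was assigned to a cluster.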
To plot the clusters in RStudio:
> c1 <- read.csv(file = "/users/kevin/k-means0.csv", sep = ",", header = FALSE)
> c2 <- read.csv(file = "/users/kevin/k-means1.csv", sep = ",", header = FALSE)
> c3 <- read.csv(file = "/users/kevin/k-means2.csv", sep = ",", header = FALSE)
> y <- rbind(c1, c2, c3)
> cols <- c(rep(1, nrow(c1)), rep(2, nrow(c2)), rep(3, nrow(c3)))
> plot(y, col = c("red", "green", "yellow")[cols])
> center <- matrix(c(0.728, 0.604, 0.390, 0.622, 0.593, 0.365), ncol = 2, byrow = TRUE)
> points(center, col = "violetred", pch = 19)
k = 3, iteration count: 20, convergence threshold: 0.1
As the figure above shows, the clustering is not very good, so it is recomputed below with adjusted parameters. Visually, the points fall into four roughly circular regions, so k is set to 4. In most real applications the data has more than two or three dimensions, so the parameters cannot be tuned by eye.
k = 4, iteration count: 20, convergence threshold: 0.1
VL-1799{n=172 c=[0.331, 0.260] r=[0.132, 0.111]}
VL-1780{n=382 c=[0.374, 0.724] r=[0.165, 0.094]}
VL-1796{n=1129 c=[0.625, 0.540] r=[0.154, 0.089]}
VL-1798{n=117 c=[0.871, 0.390] r=[0.058, 0.127]}
> c1 <- read.csv(file = "/users/kevin/k-means0.csv", sep = ",", header = FALSE)
> c2 <- read.csv(file = "/users/kevin/k-means1.csv", sep = ",", header = FALSE)
> c3 <- read.csv(file = "/users/kevin/k-means2.csv", sep = ",", header = FALSE)
> c4 <- read.csv(file = "/users/kevin/k-means3.csv", sep = ",", header = FALSE)
> y <- rbind(c1, c2, c3, c4)
> cols <- c(rep(1, nrow(c1)), rep(2, nrow(c2)), rep(3, nrow(c3)), rep(4, nrow(c4)))
> plot(y, col = c("red", "green", "yellow", "blue")[cols])
> center <- matrix(c(0.331, 0.260, 0.374, 0.724, 0.625, 0.540, 0.871, 0.390), ncol = 2, byrow = TRUE)
> points(center, col = "violetred", pch = 19)
Figure: k = 4, iteration count 20, convergence threshold 0.1
Figure: k = 4, different iteration count, convergence threshold 0.1
Figure: k = 4, convergence threshold 0.01
Figure: k = 4, convergence threshold 0.0000001
In summary:
The task was to analyze, for the same initial centroids, how different k values, iteration counts, and convergence thresholds affect the clustering results. However, the initial centroids are generated randomly, so for now there is no way to hold them fixed between runs. The k value, the number of iterations, and the convergence threshold all directly affect the final clustering result, and it takes many attempts to get a good one.
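To see the effect of the iteration limit and the convergence threshold in isolation, it helps to run a small k-means with the starting centroids passed in explicitly. The following is a plain Python sketch, not Mahout's implementation: it uses Euclidean distance, and the names kmeans, max_iter and delta are chosen here to mirror the -x and -cd options.

```python
import math


def kmeans(points, centroids, max_iter=20, delta=0.1):
    """Basic k-means starting from the given centroids.

    Stops when no centroid moves farther than `delta` in one pass,
    or after `max_iter` passes. Returns (centroids, passes_used).
    """
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach every point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= delta:
            return centroids, iteration
    return centroids, max_iter


points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
start = [(0.0, 0.0), (1.0, 0.0)]
tight = kmeans(points, start, max_iter=20, delta=0.1)   # runs until centroids settle
loose = kmeans(points, start, max_iter=20, delta=10.0)  # stops after the first pass
```

With the same starting centroids, the tight threshold reaches centers at x = 0.5 and x = 9.5 after three passes, while the loose threshold stops after one pass with the second center still near x = 6.67. Fixing the start this way is what makes such comparisons repeatable.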