Mahout Clustering Examples

Last Update:2018-07-25 Source: Internet

Author: User


1. Run k-means, canopy, and fuzzy k-means clustering on the data set P04-17.csv, paying attention to data conversion.

Replace the commas in the P04-17.csv data set with spaces to meet the input requirements of org.apache.mahout.clustering.conversion.InputDriver
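The comma-to-space replacement described above can be sketched with sed. This is only a minimal illustration: the function name csv_to_space is hypothetical, and it assumes the file has no header row and no quoted fields that contain commas.

```shell
#!/bin/sh
# Hypothetical helper: turn comma-separated values into space-separated ones,
# the whitespace-delimited form that InputDriver is fed in this walkthrough.
# Reads the named file(s), or standard input when no file is given.
csv_to_space() {
  sed 's/,/ /g' "$@"
}

# Example (file names taken from the article, run on the local machine):
#   csv_to_space P04-17.csv > P04-17.txt
```

The converted file is what gets uploaded to HDFS in the next step.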

hadoop fs -mkdir /user/kevin/mahout6

hadoop fs -copyFromLocal /home/kevin/dataguru/p04-17.txt /user/kevin/mahout6

Data conversion

mahout org.apache.mahout.clustering.conversion.InputDriver -i /user/kevin/mahout6/p04-17.txt -o /user/kevin/mahout6/vecfile -v org.apache.mahout.math.RandomAccessSparseVector

Kmeans Clustering

mahout kmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result1 -c /user/kevin/mahout6/clu1 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl

Canopy Clustering

mahout canopy -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result2 -t1 2 -t2 1 -ow

Fuzzy Kmeans Clustering

mahout fkmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result3 -c /user/kevin/mahout6/clu2 -m 2 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -ow -cl

2. For k-means clustering with the same initial centroids, analyze how different k values, iteration counts, and convergence thresholds affect the clustering results.

Set k to 3 so that the final result has three clusters

mahout kmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/resultkmeans1 -c /user/kevin/mahout6/clukmeans1 -x 20 -k 3 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl

mahout clusterdump -s /user/kevin/mahout6/resultkmeans1/clusters-1-final -p /user/kevin/mahout6/resultkmeans1/clusteredPoints -o /home/kevin/dataguru/cluster1.csv -of CSV

The result has three rows, each representing one cluster

The result can also be exported as a TXT file, which is easier to read, but CSV is more convenient for subsequent programs to process.

mahout clusterdump -s /user/kevin/mahout6/resultkmeans1/clusters-1-final -p /user/kevin/mahout6/resultkmeans1/clusteredPoints -o /home/kevin/dataguru/cluster1.txt

Split cluster1.csv into k separate files for plotting in R

Use the attached program to process each cluster and generate one CSV file per cluster

There are 3 centers:

Cluster 1 contains 594 points, with center (0.728, 0.604)

vl-1792{n=594 c=[0.728, 0.604] r=[0.096, 0.094]}

Cluster 2 contains 690 points, with center (0.390, 0.622)

vl-1794{n=690 c=[0.390, 0.622] r=[0.126, 0.120]}

Cluster 3 contains 516 points, with center (0.593, 0.365)

vl-1799{n=516 c=[0.593, 0.365] r=[0.221, 0.119]}
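The cluster summary lines shown above can also be read mechanically, for example to check the point counts or to collect the centers for plotting. The following is a rough sketch with sed and awk; the function names parse_clusters and total_points are hypothetical, and the pattern only assumes the two-dimensional `id{n=... c=[x, y] r=[...]}` layout shown in this article.

```shell
#!/bin/sh
# Hypothetical helper: extract "id,n,center_x,center_y" from cluster summary
# lines of the form  vl-1792{n=594 c=[0.728, 0.604] r=[0.096, 0.094]}
# Lines that do not match this layout are skipped.
parse_clusters() {
  sed -n 's/^\([A-Za-z]*-[0-9]*\){n=\([0-9]*\) c=\[\([^],]*\), *\([^]]*\)] r=.*$/\1,\2,\3,\4/p' "$@"
}

# Hypothetical helper: add up the n= values of all cluster summary lines.
total_points() {
  parse_clusters "$@" | awk -F, '{ sum += $2 } END { print sum + 0 }'
}

# Demo using the three k = 3 summary lines from this article.
parse_clusters <<'EOF'
vl-1792{n=594 c=[0.728, 0.604] r=[0.096, 0.094]}
vl-1794{n=690 c=[0.390, 0.622] r=[0.126, 0.120]}
vl-1799{n=516 c=[0.593, 0.365] r=[0.221, 0.119]}
EOF
```

Running total_points on the same three lines gives 594 + 690 + 516 = 1800, a quick way to confirm that every input point was assigned to a cluster.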

Plot the result in RStudio:

> # Load the per-cluster point files

> c1 <- read.csv(file = "/users/kevin/k-means0.csv", sep = ",", header = FALSE)

> c2 <- read.csv(file = "/users/kevin/k-means1.csv", sep = ",", header = FALSE)

> c3 <- read.csv(file = "/users/kevin/k-means2.csv", sep = ",", header = FALSE)

> # Combine all points and color each one by its cluster

> y <- rbind(c1, c2, c3)

> cols <- c(rep(1, nrow(c1)), rep(2, nrow(c2)), rep(3, nrow(c3)))

> plot(y, col = c("red", "green", "yellow")[cols])

> # Mark the three cluster centers

> center <- matrix(c(0.728, 0.604, 0.390, 0.622, 0.593, 0.365), ncol = 2, byrow = TRUE)

> points(center, col = "violetred", pch = 19)

Iteration count: 20, k = 3, convergence threshold: 0.1

As the figure above shows, the clustering is not very good, so it is recomputed below with adjusted parameters. Visually, the points fall into 4 roughly circular regions, so k is set to 4. In most real applications, however, the data has more than two or three dimensions, so the parameters cannot be tuned by eye.

k = 4, iteration count: 20, convergence threshold: 0.1

vl-1799{n=172 c=[0.331, 0.260] r=[0.132, 0.111]}

vl-1780{n=382 c=[0.374, 0.724] r=[0.165, 0.094]}

vl-1796{n=1129 c=[0.625, 0.540] r=[0.154, 0.089]}

vl-1798{n=117 c=[0.871, 0.390] r=[0.058, 0.127]}

> # Load the per-cluster point files

> c1 <- read.csv(file = "/users/kevin/k-means0.csv", sep = ",", header = FALSE)

> c2 <- read.csv(file = "/users/kevin/k-means1.csv", sep = ",", header = FALSE)

> c3 <- read.csv(file = "/users/kevin/k-means2.csv", sep = ",", header = FALSE)

> c4 <- read.csv(file = "/users/kevin/k-means3.csv", sep = ",", header = FALSE)

> # Combine all points and color each one by its cluster

> y <- rbind(c1, c2, c3, c4)

> cols <- c(rep(1, nrow(c1)), rep(2, nrow(c2)), rep(3, nrow(c3)), rep(4, nrow(c4)))

> plot(y, col = c("red", "green", "yellow", "blue")[cols])

> # Mark the four cluster centers

> center <- matrix(c(0.331, 0.260, 0.374, 0.724, 0.625, 0.540, 0.871, 0.390), ncol = 2, byrow = TRUE)

> points(center, col = "violetred", pch = 19)

Iteration count: 20, k = 4, convergence threshold: 0.1

Iteration count 4 K value Convergence value 0.1

k = 4, convergence threshold: 0.01

k = 4, convergence threshold: 0.0000001

Summary:

The goal was to analyze, for the same initial centroids, how different k values, iteration counts, and convergence thresholds affect the clustering results. However, the initial centroids are generated randomly, so for now there is no way to keep them fixed between runs. The k value, the number of iterations, and the convergence threshold all directly affect the final clustering result, and it takes many attempts to get a good one.
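Since finding good parameters takes many attempts, the runs can be generated in a loop instead of typed by hand. The sketch below only prints one mahout kmeans command per parameter combination. The flags and the input path are the ones used earlier in this article, while the function name, the sweep_ and clu_ output directory names, and the candidate parameter values are hypothetical choices for illustration.

```shell
#!/bin/sh
# Dry run: print (do not execute) one mahout kmeans command per combination
# of k, maximum iterations (-x), and convergence threshold (-cd).
BASE=/user/kevin/mahout6
DM=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

sweep_commands() {
  for k in 3 4; do                # candidate cluster counts (assumed values)
    for x in 10 20; do            # candidate iteration limits (assumed values)
      for cd in 0.1 0.01; do      # candidate convergence thresholds
        tag="k${k}_x${x}_cd${cd}" # hypothetical naming scheme for output dirs
        echo "mahout kmeans -i $BASE/vecfile -o $BASE/sweep_$tag -c $BASE/clu_$tag -x $x -k $k -cd $cd -dm $DM -ow -cl"
      done
    done
  done
}

sweep_commands
```

After checking the printed commands, piping them to sh on a machine with Mahout and Hadoop configured would run them one after another. Each run still starts from its own randomly chosen centroids, so differences between runs reflect both the parameters and the random start.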

This article is an English version of an article originally published in Chinese on aliyun.com.
