Mahout clustering examples


1. Run K-means, canopy, and fuzzy K-means clustering on the data set P04-17.csv, paying attention to data conversion.

Replace the commas in the P04-17.csv data set with spaces and save the result as p04-17.txt, so that it meets the input requirements of org.apache.mahout.clustering.conversion.InputDriver
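The comma-to-space replacement can be done with any text tool. The following is a minimal Python sketch of that step; the function name `commas_to_spaces` is hypothetical and not part of Mahout, and it assumes the file holds only numeric fields with no quoted commas.

```python
def commas_to_spaces(src, dst):
    """Rewrite a comma-separated file as space-separated text.

    Returns the number of data lines written. Blank lines are skipped.
    """
    written = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            # Split on commas and drop surrounding whitespace and empty fields.
            fields = [f.strip() for f in line.split(",") if f.strip()]
            if not fields:
                continue
            fout.write(" ".join(fields) + "\n")
            written += 1
    return written

# Example call (paths as used in this walkthrough):
# commas_to_spaces("/home/kevin/dataguru/P04-17.csv", "/home/kevin/dataguru/p04-17.txt")
```

The resulting file has one point per line with space-separated coordinates, which is the layout the conversion step below is given.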

hadoop fs -mkdir /user/kevin/mahout6

hadoop fs -copyFromLocal /home/kevin/dataguru/p04-17.txt /user/kevin/mahout6

Data conversion

mahout org.apache.mahout.clustering.conversion.InputDriver -i /user/kevin/mahout6/p04-17.txt -o /user/kevin/mahout6/vecfile -v org.apache.mahout.math.RandomAccessSparseVector

K-means Clustering

mahout kmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result1 -c /user/kevin/mahout6/clu1 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl

Canopy Clustering

mahout canopy -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result2 -t1 2 -t2 1 -ow

Fuzzy K-means Clustering

mahout fkmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/result3 -c /user/kevin/mahout6/clu2 -m 2 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -ow -cl

2. For K-means clustering with the same initial centroids, analyze how different K values, iteration counts, and convergence thresholds affect the clustering results.

Set K to 3 so that the final result contains 3 clusters

mahout kmeans -i /user/kevin/mahout6/vecfile -o /user/kevin/mahout6/resultkmeans1 -c /user/kevin/mahout6/clukmeans1 -x 20 -k 3 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl

mahout clusterdump -s /user/kevin/mahout6/resultkmeans1/clusters-1-final -p /user/kevin/mahout6/resultkmeans1/clusteredPoints -o /home/kevin/dataguru/cluster1.csv -of CSV

The result is three rows, each representing one cluster

The result can also be exported as a text file, which is easier to read, but CSV files are more convenient for downstream programs.

mahout clusterdump -s /user/kevin/mahout6/resultkmeans1/clusters-1-final -p /user/kevin/mahout6/resultkmeans1/clusteredPoints -o /home/kevin/dataguru/cluster1.txt

Split cluster1.csv into K separate files for plotting in R

Use the attached program to process each cluster and generate one CSV file per cluster
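The attached program is not reproduced here. As a rough stand-in, the sketch below splits a cluster dump into one CSV file per cluster. It rests on an assumption that should be checked against the real file: each row is taken to start with a cluster ID followed by points written as `{0:x,1:y}` (index:value pairs in braces). The exact layout depends on the Mahout version, and the names `split_clusters` and `parse_point` are hypothetical.

```python
import os
import re

# Assumed point syntax: {index:value,index:value,...}
POINT_RE = re.compile(r"\{([^{}]*)\}")


def parse_point(body, dim=2):
    """Turn '0:0.728,1:0.604' into [0.728, 0.604]; missing indices become 0.0."""
    values = {}
    for item in body.split(","):
        if not item.strip():
            continue
        index, value = item.split(":", 1)
        values[int(index)] = float(value)
    return [values.get(i, 0.0) for i in range(dim)]


def split_clusters(csv_path, out_dir, prefix="k-means", dim=2):
    """Write one '<prefix><row number>.csv' per input row; return the paths."""
    with open(csv_path, encoding="utf-8") as fin:
        rows = [line for line in fin if line.strip()]
    paths = []
    for number, row in enumerate(rows):
        out_path = os.path.join(out_dir, f"{prefix}{number}.csv")
        with open(out_path, "w", encoding="utf-8") as fout:
            # One output line per point, coordinates separated by commas.
            for body in POINT_RE.findall(row):
                coords = parse_point(body, dim)
                fout.write(",".join(str(c) for c in coords) + "\n")
        paths.append(out_path)
    return paths
```

With the default prefix the output files are named k-means0.csv, k-means1.csv, and so on, matching the file names read by the R commands in this walkthrough. If the real dump uses a different point syntax, only `POINT_RE` and `parse_point` need to change.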

There are 3 cluster centers:

Cluster 1 contains 594 points; its center is (0.728, 0.604)

vl-1792{n=594 c=[0.728, 0.604] r=[0.096, 0.094]}

Cluster 2 contains 690 points; its center is (0.390, 0.622)

vl-1794{n=690 c=[0.390, 0.622] r=[0.126, 0.120]}

Cluster 3 contains 516 points; its center is (0.593, 0.365)

vl-1799{n=516 c=[0.593, 0.365] r=[0.221, 0.119]}
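The summary lines above share one pattern: an ID, then `n` (point count), `c` (center), and `r`. A minimal Python sketch for reading them into numbers follows. The helper name `parse_cluster_line` is hypothetical, and treating `r` as a per-dimension radius is an assumption about Mahout's output rather than something stated in this walkthrough.

```python
import re

# Matches lines such as: vl-1792{n=594 c=[0.728, 0.604] r=[0.096, 0.094]}
CLUSTER_RE = re.compile(
    r"(?P<id>[\w-]+)\{n=(?P<n>\d+)\s+c=\[(?P<c>[^\]]*)\]\s+r=\[(?P<r>[^\]]*)\]\}"
)


def to_floats(text):
    """Convert '0.728, 0.604' into [0.728, 0.604]."""
    return [float(v) for v in text.split(",") if v.strip()]


def parse_cluster_line(line):
    """Return a dict with id, n, center, radius, or None if the line does not match."""
    match = CLUSTER_RE.search(line)
    if match is None:
        return None
    return {
        "id": match.group("id"),
        "n": int(match.group("n")),
        "center": to_floats(match.group("c")),
        "radius": to_floats(match.group("r")),  # assumed meaning: per-dimension radius
    }


lines = [
    "vl-1792{n=594 c=[0.728, 0.604] r=[0.096, 0.094]}",
    "vl-1794{n=690 c=[0.390, 0.622] r=[0.126, 0.120]}",
    "vl-1799{n=516 c=[0.593, 0.365] r=[0.221, 0.119]}",
]
clusters = [parse_cluster_line(line) for line in lines]
total_points = sum(c["n"] for c in clusters)
print(total_points)  # 594 + 690 + 516 = 1800
```

Summing `n` is a quick sanity check that every point was assigned to a cluster: the three cluster sizes add up to 1800.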

To plot the clusters in RStudio:

> c1 <- read.csv(file = "/users/kevin/k-means0.csv", sep = ",", header = FALSE)

> c2 <- read.csv(file = "/users/kevin/k-means1.csv", sep = ",", header = FALSE)

> c3 <- read.csv(file = "/users/kevin/k-means2.csv", sep = ",", header = FALSE)

> y <- rbind(c1, c2, c3)

> cols <- c(rep(1, nrow(c1)), rep(2, nrow(c2)), rep(3, nrow(c3)))

> plot(y, col = c("red", "green", "yellow")[cols])

> center <- matrix(c(0.728, 0.604, 0.390, 0.622, 0.593, 0.365), ncol = 2, byrow = TRUE)

> points(center, col = "violetred", pch = 19)

Figure: iteration count 20, K = 3, convergence threshold 0.1

As the figure above shows, the clustering is not very good, so it is recomputed below with adjusted parameters. Visually, the points fall into 4 roughly circular regions, so K is set to 4. In most real applications the data has more than two or three dimensions, so the parameters cannot be tuned by eye.

K = 4, iteration count: 20, convergence threshold: 0.1

vl-1799{n=172 c=[0.331, 0.260] r=[0.132, 0.111]}

vl-1780{n=382 c=[0.374, 0.724] r=[0.165, 0.094]}

vl-1796{n=1129 c=[0.625, 0.540] r=[0.154, 0.089]}

vl-1798{n=117 c=[0.871, 0.390] r=[0.058, 0.127]}

> c1 <- read.csv(file = "/users/kevin/k-means0.csv", sep = ",", header = FALSE)

> c2 <- read.csv(file = "/users/kevin/k-means1.csv", sep = ",", header = FALSE)

> c3 <- read.csv(file = "/users/kevin/k-means2.csv", sep = ",", header = FALSE)

> c4 <- read.csv(file = "/users/kevin/k-means3.csv", sep = ",", header = FALSE)

> y <- rbind(c1, c2, c3, c4)

> cols <- c(rep(1, nrow(c1)), rep(2, nrow(c2)), rep(3, nrow(c3)), rep(4, nrow(c4)))

> plot(y, col = c("red", "green", "yellow", "blue")[cols])

> center <- matrix(c(0.331, 0.260, 0.374, 0.724, 0.625, 0.540, 0.871, 0.390), ncol = 2, byrow = TRUE)

> points(center, col = "violetred", pch = 19)

Figure: iteration count 20, K = 4, convergence threshold 0.1

Figure: iteration count not given, K = 4, convergence threshold 0.1

Figure: iteration count not given, K = 4, convergence threshold 0.01

Figure: iteration count not given, K = 4, convergence threshold 0.0000001

Summary:

The goal was to analyze, for the same initial centroids, how different K values, iteration counts, and convergence thresholds affect the clustering results. However, the initial centroids are generated randomly, so for now there is no way to keep them fixed between runs. The K value, the number of iterations, and the convergence threshold all directly affect the final clustering result, and it takes many attempts to get a good one.
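To make the three parameters concrete, here is a minimal pure-Python K-means sketch. It is a simplified illustration, not Mahout's implementation: the function `kmeans` and its arguments are hypothetical names, the initial centers are passed in explicitly so that runs stay comparable, and convergence is tested on the squared Euclidean distance each center moves.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, max_iter=20, convergence_delta=0.1):
    """Plain K-means; returns (centers, number of iterations actually run)."""
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)  # K is the number of initial centers supplied
    iterations = 0
    while iterations < max_iter:
        iterations += 1
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_euclidean(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for old, members in zip(centers, groups):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        # Stop once no center moved farther than the threshold.
        shift = max(squared_euclidean(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_delta:
            break
    return centers, iterations


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
start = [(0.0, 0.0), (10.0, 10.0)]
strict = kmeans(points, start, max_iter=20, convergence_delta=0.001)
loose = kmeans(points, start, max_iter=20, convergence_delta=1.0)
print(strict)  # ([(0.5, 0.5), (10.5, 10.5)], 2)
print(loose)   # ([(0.5, 0.5), (10.5, 10.5)], 1)
```

On this toy data both thresholds reach the same centers, but the loose threshold stops after one pass while the strict one needs a second pass to confirm that nothing moved. On less cleanly separated data, a loose threshold or a low iteration cap can stop the run before the centers settle, and different starting centers can lead to different final clusters.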
