Run the K-means example in Mahout

Last Update:2018-12-03 Source: Internet

Author: User

First of all, the files that Mahout processes must be in SequenceFile format, so text files have to be converted to SequenceFiles first. SequenceFile is a Hadoop class that lets us write binary key-value pairs to a file. For more information, see the post written by eyjian: http://www.hadoopor.com/viewthread.php?tid=144&highlight=sequencefile
Mahout provides a command that converts the files under a specified directory into SequenceFiles. (You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.)
The usage is as follows:
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <parent dir where docs are located> --output <output directory> \
<-c <charset name of the input documents> {UTF-8 | cp1252 | ascii ...}> \
<-chunk <max size of each chunk in megabytes> 64> \
<-prefix <prefix to add to the document ID>>
For example: bin/mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
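
To make the idea of "one key-value pair per document" concrete, here is a minimal Python sketch. It is not Mahout's code and does not write the real binary SequenceFile format; the function name directory_to_pairs and the choice of key (prefix plus relative path) are assumptions made for illustration only.

```python
import os


def directory_to_pairs(input_dir, charset="UTF-8", prefix=""):
    """Return (key, value) pairs for every file under input_dir.

    Assumption for illustration: the key is the prefix followed by the
    file's path relative to input_dir, and the value is the decoded text.
    """
    pairs = []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            path = os.path.join(root, name)
            relative = os.path.relpath(path, input_dir).replace(os.sep, "/")
            # Decode with the given charset, like the charset option above.
            with open(path, encoding=charset) as handle:
                pairs.append((prefix + "/" + relative, handle.read()))
    # Sort so the result does not depend on directory traversal order.
    return sorted(pairs)
```

Calling directory_to_pairs("/some/docs", prefix="doc") would give a list of (document ID, text) tuples, which is the logical content that a SequenceFile stores in binary form.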

Example of running K-means

The k-means algorithm itself is simple; see http://baike.baidu.com/view/3447609.htm
First, k objects are randomly selected from the n data objects as the initial cluster centers. Each remaining object is assigned to the cluster whose center it is most similar to (closest to, by distance). Then the center of each new cluster is recalculated as the mean of all objects in that cluster. This process repeats until the standard measure function converges.
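
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the algorithm, not Mahout's implementation; the function names, the Euclidean distance, and the stopping rule (stop when no center moves by delta or more) are assumptions chosen for the sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, max_iterations=10, delta=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k of the n objects at random as initial centers.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Step 2: assign every object to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Step 4: stop once no center moved by delta or more.
        moved = max(squared_distance(a, b) ** 0.5 for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < delta:
            break
    return centers, clusters
```

For two well-separated groups of points, kmeans(points, 2) returns one center near the mean of each group.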

Running process: follow the instructions on the official website. (Mahout 0.3 threw an exception at run time that has not been resolved yet, but Mahout 0.4 ran normally in testing.)

First, download the dataset synthetic_control.data and import it into the distributed file system:
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/synthetic_control.data testdata

Second, run the K-means algorithm, either directly with
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
or with
$HADOOP_HOME/bin/hadoop jar /home/hadoop/mahout-distribution-0.4/mahout-examples-0.4-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
The job iterates, so it runs for a long time. Please wait patiently.

Last, view the result. To display it on the console, run: mahout vectordump --seqFile /user/hadoop/output/data/part-00000
Alternatively, run the following commands in sequence:
$HADOOP_HOME/bin/hadoop fs -lsr output
$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples (exports the results from the distributed file system)
cd $MAHOUT_HOME/examples/output
If the directory contains the following entries, the algorithm ran successfully:
canopies clusters-1 clusters-3 clusters-5 clusters-7 points
clusters-0 clusters-2 clusters-4 clusters-6 data

For a long time I did not know how to view the K-means results. For example, to view part-r-00000 in clusters-i, dump it to a local text file with this command: ./mahout seqdumper -s /user/hadoop/output/cluster-9/part-r-00000 -o /home/hadoop/out/part-0

In the dumped output, n is the number of samples in a cluster, c is the cluster center (one value per attribute), and r is the cluster radius (one value per attribute).
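
As a rough illustration of these three values, the sketch below computes them for a small list of points. It assumes that the radius of each attribute is the population standard deviation of that attribute within the cluster; that interpretation, and the function name summarize_cluster, are assumptions and not taken from the article.

```python
import math


def summarize_cluster(points):
    """Return (n, c, r) for the points of one cluster.

    n: number of points
    c: per-attribute mean (the center)
    r: per-attribute population standard deviation (assumed meaning of radius)
    """
    n = len(points)
    columns = list(zip(*points))
    c = [sum(col) / n for col in columns]
    r = [math.sqrt(sum((x - m) ** 2 for x in col) / n) for col, m in zip(columns, c)]
    return n, c, r
```

An attribute that has the same value in every point of the cluster gets a radius of 0 under this definition.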

Mahout kmeans clustering implementation:
(1) The input parameter specifies all data points to be clustered, and the clusters parameter specifies the initial cluster centers.
If the parameter k is specified, org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom reads k points directly from the specified input files through org.apache.hadoop.fs and puts them in clusters.
(2) The cluster centers of the current iteration are calculated from the original data points and the cluster centers of the previous iteration (or the initial clusters), and are written to the clusters-N directory.
This process is implemented by KMeansMapper, KMeansCombiner, KMeansReducer, and KMeansDriver under org.apache.mahout.clustering.kmeans.
KMeansMapper: when the mapper is initialized in configure, it reads the cluster centers of the previous iteration or the initial cluster centers (every mapper reads all cluster centers). The map method finds, for each input point, the cluster closest to it. The output key is the ID of that cluster, and the value is a KMeansInfo instance that holds the number of points and the sum of each component.
KMeansCombiner: locally accumulates the point counts and component sums that KMeansMapper emitted under the same cluster ID.
KMeansReducer: accumulates the point counts and component sums under the same cluster ID, calculates the cluster center of this iteration, and decides from the input delta whether the cluster has converged, that is, whether the distance between the cluster center of the previous iteration and the cluster center of this iteration is less than delta. It outputs the cluster centers and whether they have converged.
KMeansDriver: controls the iteration process until the maximum number of iterations is exceeded or all clusters have converged. After each iteration, KMeansDriver reads all clusters under the clusters-N directory. If all of them have converged, the whole kmeans clustering process has converged.
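
The division of work between mapper, combiner, and reducer can be imitated in plain Python for a single iteration. This is a sketch under assumptions, not Mahout's code: the function names are hypothetical, each inner list in splits stands in for the input of one mapper, and a (count, component sums) tuple plays the role of the KMeansInfo value.

```python
import math
from collections import defaultdict


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_point(point, centers):
    """Mapper: emit (cluster_id, (count, component_sums)) for the nearest center."""
    cluster_id = min(centers, key=lambda cid: distance(point, centers[cid]))
    return cluster_id, (1, tuple(point))


def combine(pairs):
    """Combiner: add up counts and component sums per cluster ID."""
    totals = {}
    for cid, (count, sums) in pairs:
        if cid in totals:
            old_count, old_sums = totals[cid]
            sums = tuple(a + b for a, b in zip(old_sums, sums))
            count += old_count
        totals[cid] = (count, sums)
    return list(totals.items())


def reduce_cluster(partials, old_center, delta):
    """Reducer: merge partial sums, compute the new center, test convergence."""
    count = sum(c for c, _ in partials)
    sums = [sum(col) for col in zip(*(s for _, s in partials))]
    new_center = tuple(s / count for s in sums)
    return new_center, distance(old_center, new_center) < delta


def run_iteration(splits, centers, delta):
    """One iteration; centers maps cluster ID to a center tuple."""
    combined = []
    for split in splits:  # each split simulates one mapper plus its combiner
        combined.extend(combine(map_point(p, centers) for p in split))
    grouped = defaultdict(list)
    for cid, partial in combined:
        grouped[cid].append(partial)
    # Clusters that received no points are simply absent from the result.
    return {cid: reduce_cluster(parts, centers[cid], delta) for cid, parts in grouped.items()}
```

A driver would call run_iteration repeatedly, feeding the new centers back in, until every cluster reports convergence or the iteration limit is reached.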

bin/mahout kmeans \
-i <input vectors directory> \
-c <input clusters directory> \
-o <output working directory> \
-k <optional number of initial clusters to sample from input vectors> \
-dm <DistanceMeasure> \
-x <maximum number of iterations> \
-cd <optional convergence delta. Default is 0.5> \
-ow <overwrite output directory if present> \
-cl <run input vector clustering after computing canopies> \
-xm <execution method: sequential or mapreduce>
Note: when -k is specified, any clusters in the -c directory are overwritten, and k points are randomly sampled from the input vectors as the initial cluster centers.

Parameter adjustment: Mahout kmeans clustering has two important parameters, the convergence delta and the maximum number of iterations. The smaller the delta, the stricter the convergence condition, so the number of clusters that end up converged may decrease. The maximum number of iterations can be chosen by watching the number of converged clusters after each iteration: iteration can stop once that number has almost stopped changing or only fluctuates.
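
One way to observe this in practice is to record the number of converged clusters after every iteration. The sketch below does that in plain Python; it is not Mahout's driver, and the function names and the Euclidean distance are assumptions made for illustration.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def step(points, centers):
    """One k-means iteration: assign points, then recompute centers."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        clusters[nearest].append(p)
    new_centers = []
    for old, members in zip(centers, clusters):
        if members:
            n = len(members)
            new_centers.append(tuple(sum(col) / n for col in zip(*members)))
        else:
            new_centers.append(old)  # an empty cluster keeps its old center
    return new_centers


def drive(points, centers, delta, max_iterations):
    """Iterate and record how many clusters converged in each iteration."""
    history = []
    for _ in range(max_iterations):
        new_centers = step(points, centers)
        converged = sum(distance(o, n) < delta for o, n in zip(centers, new_centers))
        history.append(converged)
        centers = new_centers
        if converged == len(centers):
            break
    return centers, history
```

If the recorded counts stop changing well before max_iterations is reached, a smaller iteration limit is probably sufficient; if they are still rising at the last iteration, the limit or the delta may need to be increased.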
