Run the K-means example in Mahout

Source: Internet
Author: User
Tags hadoop fs

First of all, files processed by Mahout must be in SequenceFile format, so text files have to be converted to SequenceFiles first. SequenceFile is a Hadoop class that lets us write binary key-value pairs to a file. For more information, see the post written by eyjian: http://www.hadoopor.com/viewthread.php?tid=144&highlight=sequencefile
Mahout provides a tool that converts the files under a specified directory into SequenceFiles. (You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.)
The usage is as follows:
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <parent dir where docs are located> --output <output directory> \
<-c <charset name of the input documents> {UTF-8|cp1252|ascii...}> \
<-chunk <max size of each chunk in megabytes> 64> \
<-prefix <prefix to add to the document ID>>
For example: bin/mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8

Example of running K-means

The k-means algorithm itself is simple; see http://baike.baidu.com/view/3447609.htm
First, k objects are randomly selected from the n data objects as the initial cluster centers. Each remaining object is assigned to the cluster whose center it is most similar to (that is, closest to by distance). Then the center of each new cluster is recomputed as the mean of all objects in that cluster. This process is repeated until the standard measure function converges.
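The steps above can be sketched in a few lines of plain Python. This is only an illustration of the algorithm, not Mahout's implementation: the function names, the use of squared Euclidean distance, and the stopping rule (stop when the centers no longer move) are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, k, max_iterations=100, seed=0):
    """Return k cluster centers for the given points (a list of tuples)."""
    rng = random.Random(seed)
    # Step 1: pick k of the n objects at random as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

For example, kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2) separates the points around the origin from the points around (10, 10).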

Running process: follow the instructions on the official website (mahout-0.3 threw an exception at run time that I have not resolved yet, but mahout-0.4 ran normally in my test).

First, download the dataset synthetic_control.data and import it into the distributed file system:
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/synthetic_control.data testdata

Second, run the K-means algorithm, either directly with
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
or with
$HADOOP_HOME/bin/hadoop jar /home/hadoop/mahout-distribution-0.4/mahout-examples-0.4-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
The job runs for a long time because it iterates, so please wait patiently.

Last, view the results. To print the result on the console:
mahout vectordump --seqFile /user/hadoop/output/data/part-00000
Alternatively, run the following commands in sequence:
$HADOOP_HOME/bin/hadoop fs -lsr output
$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples (copies the results from the distributed file system to the local disk)
cd $MAHOUT_HOME/examples/output
If the directory contains the following entries, the algorithm ran successfully:
canopies clusters-1 clusters-3 clusters-5 clusters-7 points
clusters-0 clusters-2 clusters-4 clusters-6 data

For a long time I did not know how to view the k-means results. For example, to view part-r-00000 in clusters-i, dump it to a local text file with this command:
./mahout seqdumper -s /user/hadoop/output/cluster-9/part-r-00000 -o /home/hadoop/out/part-0

In the dumped output, n is the number of samples in a cluster, c is the cluster center (one value per attribute), and r is the cluster radius (one value per attribute).
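To make the three values concrete, here is a small Python sketch that computes them for one cluster. It is not Mahout code: the function name is made up, and treating the radius as the per-attribute population standard deviation is an assumption of this example.

```python
import math


def summarize_cluster(points):
    """Return (n, c, r) for one cluster given as a list of equal-length tuples.

    n: number of samples in the cluster
    c: center, the mean of each attribute
    r: radius, ASSUMED here to be the population standard deviation
       of each attribute around the center
    """
    n = len(points)
    dims = len(points[0])
    c = [sum(p[i] for p in points) / n for i in range(dims)]
    r = [
        math.sqrt(sum((p[i] - c[i]) ** 2 for p in points) / n)
        for i in range(dims)
    ]
    return n, c, r
```

A cluster holding the two points (1, 2) and (3, 2) has n = 2 and center (2, 2); under the assumption above, the radius is 1 along the first attribute and 0 along the second.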

Mahout kmeans clustering implementation:
(1) The input parameter specifies all data points to be clustered, and the clusters parameter specifies the initial cluster centers.
If the parameter k is specified, org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom reads k points directly from the specified input file through org.apache.hadoop.fs and puts them in clusters.
(2) The cluster centers of the current iteration are calculated from the original data points and the cluster centers of the previous iteration (or the initial centers), and are written to the clusters-N directory.
This process is implemented by KMeansMapper, KMeansCombiner, KMeansReducer and KMeansDriver under org.apache.mahout.clustering.kmeans.
KMeansMapper: when the mapper is initialized, configure reads the cluster centers of the previous iteration or the initial ones (every mapper loads all cluster centers). For each input point, the map method finds the nearest cluster and emits the cluster ID as the key and a KMeansInfo instance as the value, which holds the point count and the sum of each component.
KMeansCombiner: locally accumulates the point counts and component sums that KMeansMapper emitted under the same cluster ID.
KMeansReducer: accumulates the point counts and component sums under the same cluster ID, computes the cluster center for this iteration, and decides from the input delta whether the cluster has converged, that is, whether the distance between the center of the previous iteration and the center of this iteration is < delta. It outputs each cluster center together with whether it has converged.
KMeansDriver: controls the iteration until the maximum number of iterations is exceeded or all clusters have converged. After each iteration, KMeansDriver reads all clusters under that iteration's clusters-N directory. If all of them have converged, the whole k-means clustering process has converged.
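The division of work between mapper, combiner and reducer can be mimicked in plain Python. The sketch below does not use Hadoop or Mahout classes; the function names, the (count, sums) tuple standing in for KMeansInfo, and the Euclidean distance are assumptions chosen for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_point(point, centers):
    """Mapper role: emit (nearest cluster id, (point count, component sums))."""
    cluster_id = min(centers, key=lambda cid: distance(point, centers[cid]))
    return cluster_id, (1, tuple(point))


def combine(values):
    """Combiner role: add up counts and component sums for one cluster id."""
    count = sum(v[0] for v in values)
    sums = tuple(sum(col) for col in zip(*(v[1] for v in values)))
    return count, sums


def reduce_cluster(old_center, values, delta):
    """Reducer role: compute the new center and test convergence against delta."""
    count, sums = combine(values)
    new_center = tuple(s / count for s in sums)
    converged = distance(old_center, new_center) < delta
    return new_center, converged


def run_iteration(points, centers, delta):
    """One iteration: centers is a dict mapping cluster id to center tuple."""
    grouped = {}
    for p in points:
        cluster_id, value = map_point(p, centers)
        grouped.setdefault(cluster_id, []).append(value)
    return {
        cluster_id: reduce_cluster(centers[cluster_id], values, delta)
        for cluster_id, values in grouped.items()
    }
```

Calling run_iteration once corresponds to one pass of the job; a driver would call it repeatedly, feeding the new centers back in, until every cluster reports that it has converged or the iteration limit is reached.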

bin/mahout kmeans \
-i <input vectors directory> \
-c <input clusters directory> \
-o <output working directory> \
-k <optional number of initial clusters to sample from input vectors> \
-dm <DistanceMeasure> \
-x <maximum number of iterations> \
-cd <optional convergence delta. Default is 0.5> \
-ow <overwrite output directory if present> \
-cl <run input vector clustering after computing canopies> \
-xm <execution method: sequential or mapreduce>
Note: when -k is specified, all clusters in the -c directory are overwritten, and k points are randomly sampled from the input vectors as the initial cluster centers.

Parameter adjustment: Mahout k-means clustering has two important parameters, the convergence delta and the maximum number of iterations. The smaller the delta, the stricter the convergence condition, so the number of clusters that end up converged may decrease. The maximum number of iterations can be chosen by watching the number of converged clusters after each iteration: iteration can stop once that number has almost stopped changing or only fluctuates.
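One way to see the effect of the two parameters is to record how many clusters count as converged after each iteration. The following Python sketch does this with a simple in-memory loop; it is not how Mahout reports convergence, and the function names and the Euclidean distance are assumptions of the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def step(points, centers):
    """One k-means pass: assign points to the nearest center, return new centers."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        groups[nearest].append(p)
    # An empty cluster keeps its previous center.
    return [
        tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
        for i, g in enumerate(groups)
    ]


def drive(points, centers, delta, max_iterations):
    """Iterate until every cluster moves less than delta or the limit is hit.

    Returns the final centers and, per iteration, the number of clusters
    whose center moved less than delta.
    """
    history = []
    for _ in range(max_iterations):
        new_centers = step(points, centers)
        converged = sum(
            distance(old, new) < delta for old, new in zip(centers, new_centers)
        )
        history.append(converged)
        centers = new_centers
        if converged == len(centers):
            break
    return centers, history
```

Inspecting the returned history for a few values of delta shows whether the converged count has levelled off, which is a reasonable point at which to cap the maximum number of iterations.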
