Mahout installation and configuration: running the k-means algorithm; bin/mahout help prints "MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath"


Running Mahout k-means cluster analysis on Hadoop

This is an excellent article for Mahout beginners like me. The original is at: http://yoyzhou.github.io/blog/2013/06/04/mahout-clustering-with-hadoop/

The previous article, "Mahout and Cluster Analysis," described how to use Mahout for clustering analysis, using k-means to cluster Weibo celebrities by the accounts they follow in common. Mahout has two running modes, local and Hadoop. Local mode means running in standalone mode on the user's machine, like any ordinary program, but this does not exploit Mahout's main advantage. In this article we describe how to get our Mahout clustering program to run on a Hadoop cluster (in practice the author used pseudo-distributed Hadoop rather than a real cluster).

Configuring the Mahout Runtime Environment

Mahout's run configuration can be set in $MAHOUT_HOME/bin/mahout; in fact, $MAHOUT_HOME/bin/mahout is Mahout's command-line startup script. It is similar to Hadoop's, with one difference: Hadoop provides a dedicated hadoop-env.sh file under $HADOOP_HOME/conf for configuring the related environment variables, while Mahout provides no such file in its conf directory.

MAHOUT_LOCAL and HADOOP_CONF_DIR

These two variables are the key to controlling whether Mahout runs locally or on Hadoop.

The $MAHOUT_HOME/bin/mahout script states that whenever MAHOUT_LOCAL is set to a non-empty value, or the user has not set HADOOP_CONF_DIR and HADOOP_HOME, Mahout runs in local mode. In other words, if you want Mahout to run on Hadoop, MAHOUT_LOCAL must be empty.
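The mode-selection logic can be sketched as follows. This is a simplified illustration, not the verbatim contents of bin/mahout; it is wrapped in a function here so the behavior can be exercised directly:

```shell
# Simplified sketch of the mode-selection logic in $MAHOUT_HOME/bin/mahout.
# (Illustrative only; the real script also builds the full classpath.)
mahout_mode() {
  if [ -n "$MAHOUT_LOCAL" ]; then
    # Any non-empty value forces local mode.
    echo "MAHOUT_LOCAL is set, running locally"
  elif [ -n "$HADOOP_CONF_DIR" ]; then
    # Empty MAHOUT_LOCAL plus a configured HADOOP_CONF_DIR means Hadoop mode.
    echo "MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath."
    echo "Running on hadoop ..."
  fi
}
```

Setting MAHOUT_LOCAL to any non-empty string (e.g. `true`) therefore wins over HADOOP_CONF_DIR.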

The HADOOP_CONF_DIR variable specifies the Hadoop configuration used when Mahout runs in Hadoop mode; it typically points to the conf directory under $HADOOP_HOME.

In addition, we should set the JAVA_HOME or MAHOUT_JAVA_HOME variable, and must add the Hadoop executable to PATH.

To sum up:

1. Add the JAVA_HOME variable; it can be set directly in $MAHOUT_HOME/bin/mahout, or in the user's bash profile (e.g. ~/.bashrc)

2. Set MAHOUT_HOME and add the Hadoop executable to PATH

These two steps are set up in ~/.bashrc as follows:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
#export HADOOP_HOME=/home/yoyzhou/workspace/hadoop-1.1.2
export MAHOUT_HOME=/home/yoyzhou/workspace/mahout-0.7
export PATH=$PATH:/home/yoyzhou/workspace/hadoop-1.1.2/bin:$MAHOUT_HOME/bin

After editing ~/.bashrc, restart the terminal (or run `source ~/.bashrc`) for the changes to take effect.

3. Edit $MAHOUT_HOME/bin/mahout and set HADOOP_CONF_DIR to $HADOOP_HOME/conf

HADOOP_CONF_DIR=/home/yoyzhou/workspace/hadoop-1.1.2/conf
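Before editing bin/mahout, it can be worth confirming that the directory you intend to use actually looks like a Hadoop conf directory. The helper below is a hypothetical convenience function (not part of Mahout or Hadoop) that checks for the directory and its core-site.xml:

```shell
# check_hadoop_conf: hypothetical helper (not part of Mahout/Hadoop) that
# verifies a candidate HADOOP_CONF_DIR before wiring it into bin/mahout.
check_hadoop_conf() {
  conf_dir="$1"
  if [ ! -d "$conf_dir" ]; then
    echo "missing directory: $conf_dir"
    return 1
  fi
  if [ ! -f "$conf_dir/core-site.xml" ]; then
    echo "no core-site.xml in $conf_dir"
    return 1
  fi
  echo "ok: $conf_dir"
}
```

For example, `check_hadoop_conf /home/yoyzhou/workspace/hadoop-1.1.2/conf` should print an `ok:` line on a correctly installed Hadoop 1.1.2.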

Readers should adjust the Hadoop and Mahout home directories to the paths on their own systems. Once everything is set up, restart the terminal and type `mahout` at the command line. If you see the following message, the Mahout Hadoop running mode is configured:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop ...

To run in local mode, simply add a statement in $MAHOUT_HOME/bin/mahout setting MAHOUT_LOCAL to a non-empty value.

The Mahout Command Line

Mahout provides a command-line entry point for its data-mining algorithms, along with some tools for data analysis and processing. The available commands can be listed by typing `mahout` in the terminal. Part of that output is shown below:

....
Valid program names are:
  arff.vector:: Generate Vectors from an ARFF file or directory
  baumwelch:: Baum-Welch algorithm for unsupervised HMM training
  canopy:: Canopy clustering
  cat:: Print a file or resource as the logistic regression models would see it
  cleansvd:: Cleanup and verification of SVD output
  clusterdump:: Dump cluster output to text
  ...
  fkmeans:: Fuzzy K-means clustering
  fpg:: Frequent Pattern Growth
  hmmpredict:: Generate random sequence of observations by given HMM
  itemsimilarity:: Compute the item-item-similarities for item-based collaborative filtering
  kmeans:: K-means clustering
  ...
Mahout Kmeans

In the previous article, we launched the k-means algorithm directly from a Mahout program by calling the KMeansDriver.run() method, which is useful for debugging locally. In real-world projects, however, whether running in Hadoop mode or locally, the advantage of running Mahout from the command line is that we only need to supply input data in the format the algorithm expects, and the benefits of Mahout's distributed processing can then be exploited. For example, with the k-means algorithm, we only need to prepare the input data for Mahout k-means beforehand, and then invoke `mahout kmeans [options]` at the command line.

Typing `mahout kmeans` without any parameters at the command line lists the usage of the k-means algorithm:

Usage:
 [--input <input> --output <output> --distanceMeasure <distanceMeasure>
  --clusters <clusters> --numClusters <k> --convergenceDelta <convergenceDelta>
  --maxIter <maxIter> --overwrite --clustering --method <method>
  --outlierThreshold <outlierThreshold> --help --tempDir <tempDir> --startPhase
  <startPhase> --endPhase <endPhase>]
--clusters (-c) clusters    The input centroids, as Vectors.  Must be a
                            SequenceFile of Writable, Cluster/Canopy.  If k
                            is also specified, then a random set of vectors will
                            be selected and written out to this path first

The relevant parameters have been mentioned in the previous article.

The specific steps are as follows:

1. Process the data into Mahout vectors
2. Convert the Mahout vectors into a Hadoop SequenceFile
3. Create k initial centroids [optional]
4. Copy the SequenceFile of Mahout vectors to HDFS
5. Run `mahout kmeans [options]`
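Steps 4 and 5 can be sketched as shell commands. The paths below are hypothetical, and the `run` helper only echoes each command so the sketch can be read (and dry-run) without a live Hadoop cluster; drop the `run` prefix to execute for real:

```shell
# Dry-run sketch of steps 4-5. Paths are hypothetical; 'run' echoes each
# command instead of executing it, so no Hadoop cluster is needed to try it.
run() { echo "+ $*"; }

# Step 4: copy the SequenceFile of Mahout vectors to HDFS
run hadoop fs -put vectors.seq data/vectors

# Step 5: run k-means (here letting Mahout pick k=10 random initial centroids,
# written to data/clusters, per the -c/-k behavior in the usage text above)
run mahout kmeans -i data/vectors -o output -c data/clusters \
    -k 10 -x 10 -ow -cd 0.001 -cl
```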

The following command runs k-means clustering over the Mahout vector data in the data/vectors directory using CosineDistanceMeasure, saving the output in the output directory:

mahout kmeans -i data/vectors -o output -c data/clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 -ow -cd 0.001 -cl

More detailed command-line parameters can be found on the Mahout wiki page k-means-commandline.

Summary

This article first described how to configure Mahout's Hadoop running environment, and then showed how to run cluster analysis on Hadoop using the mahout kmeans command line.
