Mahout Quick Start Tutorial

Last Update: 2015-03-07 Source: Internet

Author: User



Mahout is a powerful data mining tool: a collection of distributed machine learning algorithms that includes classification, clustering, and an implementation of distributed collaborative filtering called Taste. Mahout's biggest advantage is that it is built on Hadoop. Many algorithms that previously ran on a single machine have been converted to MapReduce, which greatly increases both the volume of data they can handle and their processing performance.

First, Mahout installation and configuration

1. Download and unzip Mahout
http://archive.apache.org/dist/mahout/
tar -zxvf mahout-distribution-0.9.tar.gz

2. Configure environment variables
# Set Mahout environment
export MAHOUT_HOME=/mnt/jediael/mahout/mahout-distribution-0.9
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$PATH
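
If these lines go into a shell profile that gets sourced more than once, PATH picks up duplicate entries. The following is a minimal sketch of one way to avoid that. The helper name path_prepend is hypothetical (it is not part of Mahout), and the install path is the one used in this tutorial.

```shell
# Hypothetical helper (not part of Mahout): prepend a directory to PATH
# only if it is not already listed there.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already present, leave PATH unchanged
    *) PATH="$1:$PATH" ;;
  esac
}

# Install location used in this tutorial; adjust to your own layout.
MAHOUT_HOME=/mnt/jediael/mahout/mahout-distribution-0.9
MAHOUT_CONF_DIR="$MAHOUT_HOME/conf"

# Prepend bin first, then conf, so the result is conf:bin:<old PATH>.
path_prepend "$MAHOUT_HOME/bin"
path_prepend "$MAHOUT_CONF_DIR"
export MAHOUT_HOME MAHOUT_CONF_DIR PATH
```

Sourcing the snippet a second time leaves PATH unchanged, because both directories are already present.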

3. Install Mahout
[[email protected] mahout-distribution-0.9]$ pwd
/mnt/jediael/mahout/mahout-distribution-0.9
[[email protected] mahout-distribution-0.9]$ mvn install

4. Verify that Mahout is installed successfully
Run the mahout command. If it lists the available algorithms, the installation succeeded:

[[email protected] mahout-distribution-0.9]$ mahout
Running on hadoop, using /mnt/jediael/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /mnt/jediael/mahout/mahout-distribution-0.9/examples/target/mahout-examples-0.9-job.jar
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  lucene2seq: : Generate Text SequenceFiles from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  parallelALS: : ALS-WR factorization of a rating matrix
  qualcluster: : Runs clustering experiments and summarizes results in a CSV
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  resplit: : Splits a set of SequenceFiles into a number of equal splits
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  streamingkmeans: : Streaming k-means clustering
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence
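
If the shell reports that the command is not found instead of printing this list, the usual cause is that $MAHOUT_HOME/bin is missing from PATH. A small sketch of a check that can be run first is shown here. The function name check_mahout is hypothetical, and it only confirms that the launcher script can be located, not that Hadoop is reachable.

```shell
# Hypothetical helper: report whether the mahout launcher is on PATH.
# It does not start Mahout or contact Hadoop.
check_mahout() {
  if command -v mahout >/dev/null 2>&1; then
    echo "mahout found at $(command -v mahout)"
  else
    echo "mahout not found on PATH; check that \$MAHOUT_HOME/bin is exported" >&2
    return 1
  fi
}
```

Calling check_mahout before running mahout itself separates a PATH problem from a build or Hadoop configuration problem.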

Second, use a simple example to verify Mahout
1. Start Hadoop
2. Download test data
http://archive.ics.uci.edu/ml/databases/synthetic_control/ (the synthetic_control.data file at this link)
The sample data is also easy to find with a Baidu search.
3. Upload test data
hadoop fs -put synthetic_control.data testdata
4. Run the k-means clustering algorithm in Mahout with the command:
mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Clustering takes about 9 minutes to complete.
5. View Clustering Results
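Run hadoop fs -ls /user/root/output to view the clustering results. The output directory sits under the HDFS home directory of the user who ran the job; in the listing below that user is jediael, so the relative path output resolves to /user/jediael/output.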

[[email protected] mahout-distribution-0.9]$ hadoop fs -ls output
Found items
-rw-r--r--   2 jediael supergroup        194 2015-03-07 15:07 /user/jediael/output/_policy
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusteredPoints
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-0
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-1
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-10-final
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-2
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-3
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-4
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-5
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-6
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-7
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:06 /user/jediael/output/clusters-8
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-9
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/data
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/random-seeds
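
In this listing each clusters-N directory holds the state after one k-means iteration, and the last one carries a -final suffix (clusters-10-final here). Because the iteration count can differ from run to run, it can help to pick that directory out of the listing rather than hard-code its name. The sketch below assumes the listing format shown above, with the path in the last column. The function name final_clusters_dir is hypothetical.

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print the
# path of any directory named clusters-<N>-final (path is the last column).
final_clusters_dir() {
  awk '$NF ~ /\/clusters-[0-9]+-final$/ { print $NF }'
}

# Example use against a live cluster (not executed here):
#   dir="$(hadoop fs -ls output | final_clusters_dir)"
```

The resulting path could then be handed to the clusterdump program from the list printed by mahout, for example mahout clusterdump -i "$dir" -o clusters.txt. Treat that invocation as an assumption to check against the program's own help output for your Mahout version.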


