Mahout Quick Start Tutorial



Mahout is a powerful data mining tool: a collection of distributed machine learning algorithms that includes classification, clustering, and a distributed implementation of collaborative filtering called Taste. Mahout's biggest advantage is that it is built on Hadoop. Many algorithms that previously ran on a single machine have been converted to MapReduce, which greatly increases both the amount of data they can handle and their processing performance.

First, install and configure Mahout

1. Download and unzip Mahout
http://archive.apache.org/dist/mahout/
tar -zxvf mahout-distribution-0.9.tar.gz

2. Configure environment variables
# Set Mahout environment
export MAHOUT_HOME=/mnt/jediael/mahout/mahout-distribution-0.9
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$PATH
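Before building, it can help to confirm that the variables from step 2 point at a real Mahout directory. The following is only a minimal sketch and not part of the original tutorial: the function name check_mahout_env is made up for illustration, and it assumes that the launcher script sits at $MAHOUT_HOME/bin/mahout, as it does in the 0.9 distribution.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Mahout): checks the variables from step 2.
check_mahout_env() {
    if [ -z "$MAHOUT_HOME" ]; then
        echo "MAHOUT_HOME is not set"
        return 1
    fi
    if [ ! -f "$MAHOUT_HOME/bin/mahout" ]; then
        echo "no launcher found at $MAHOUT_HOME/bin/mahout"
        return 1
    fi
    # The bare "mahout" command only resolves if $MAHOUT_HOME/bin is on PATH.
    case ":$PATH:" in
        *":$MAHOUT_HOME/bin:"*) ;;
        *) echo "$MAHOUT_HOME/bin is not on PATH"; return 1 ;;
    esac
    echo "Mahout environment looks consistent"
}
```

After reloading the shell profile that contains the export lines, calling check_mahout_env should print a single line saying which check failed, or the final message if all three passed.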

3. Install Mahout
$ pwd
/mnt/jediael/mahout/mahout-distribution-0.9
$ mvn install

4. Verify that Mahout is installed successfully
Run the mahout command. If it prints a list of algorithms, the installation succeeded:

$ mahout
Running on hadoop, using /mnt/jediael/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /mnt/jediael/mahout/mahout-distribution-0.9/examples/target/mahout-examples-0.9-job.jar
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  lucene2seq: : Generate Text SequenceFiles from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  parallelALS: : ALS-WR factorization of a rating matrix
  qualcluster: : Runs clustering experiments and summarizes results in a CSV
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  resplit: : Splits a set of SequenceFiles into a number of equal splits
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  streamingkmeans: : Streaming k-means clustering
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence



Second, verify Mahout with a simple example
1. Start Hadoop
2. Download test data
http://archive.ics.uci.edu/ml/databases/synthetic_control/ (use the synthetic_control.data link on that page)
The sample data is also easy to find through a search engine such as Baidu.
3. Upload test data
hadoop fs -put synthetic_control.data testdata
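A truncated or half-downloaded file will still upload without complaint, so a quick shape check on the local copy before the upload in step 3 may save a wasted run. The sketch below assumes that the data set is 600 rows of 60 whitespace-separated values, which is how the UCI synthetic control series is usually described. The function name check_control_data is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: checks that a local file has 600 rows of 60 fields each.
# The 600 x 60 shape is an assumption about synthetic_control.data.
check_control_data() {
    # $1: path to the local data file
    awk '
        NF != 60 { bad++ }    # count rows that do not have exactly 60 fields
        END {
            if (NR == 600 && bad + 0 == 0)
                print "ok"
            else
                printf "unexpected shape: %d rows, %d malformed\n", NR, bad + 0
        }
    ' "$1"
}
```

Running check_control_data synthetic_control.data prints ok when the shape matches and a short summary otherwise.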
4. Run the k-means clustering algorithm in Mahout with the command:
mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Clustering takes about 9 minutes to complete.
5. View Clustering Results
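Run hadoop fs -ls /user/root/output to view the clustering results (replace root with your own user name; in the session below the user is jediael, so the relative path output resolves to /user/jediael/output).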
$ hadoop fs -ls output
Found items
-rw-r--r--   2 jediael supergroup        194 2015-03-07 15:07 /user/jediael/output/_policy
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusteredPoints
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-0
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-1
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-10-final
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-2
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-3
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-4
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-5
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-6
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-7
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:06 /user/jediael/output/clusters-8
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-9
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/data
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/random-seeds
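In the listing, each k-means iteration has its own clusters-N directory, and the last one carries a -final suffix (clusters-10-final in this run). The number of iterations can differ between runs, so it may be convenient to pick the final directory out of the listing instead of hard-coding it. The sketch below is only an illustration: final_cluster_dir is a made-up helper name, and the commented clusterdump call (with clusters.txt as an arbitrary local output file) is one possible way to turn the result into readable text.

```shell
#!/bin/sh
# Hypothetical helper: reads "hadoop fs -ls" output on stdin and prints the
# path of any entry whose name matches clusters-<N>-final.
final_cluster_dir() {
    awk '$NF ~ /\/clusters-[0-9]+-final$/ { print $NF }'
}

# Possible use (needs a running Hadoop cluster, so it is left commented out):
# final=$(hadoop fs -ls output | final_cluster_dir)
# mahout clusterdump -i "$final" -p output/clusteredPoints -o clusters.txt
```

The helper only filters text, so it can be tried on a saved copy of the listing before running it against the cluster.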



