Mahout is a powerful data mining tool: a collection of distributed machine learning algorithms, including classification, clustering, and a distributed collaborative filtering implementation called Taste. Mahout's biggest advantage is that it is built on Hadoop. Many algorithms that previously ran on a single machine have been converted to MapReduce, which greatly increases the amount of data they can handle and improves their processing performance.
First, Mahout installation and configuration
1. Download and unzip Mahout
http://archive.apache.org/dist/mahout/
tar -zxvf mahout-distribution-0.9.tar.gz
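The download and unpack step can be scripted roughly as follows. This is only a sketch: the per-version subdirectory (0.9/) under the archive URL is an assumption about how the Apache archive is laid out, and fetch_and_unpack is a hypothetical helper name, so check the directory listing in a browser before relying on it.

```shell
#!/bin/sh
# Assumed archive layout: <base>/<version>/mahout-distribution-<version>.tar.gz
MAHOUT_VERSION=0.9
MAHOUT_TARBALL="mahout-distribution-${MAHOUT_VERSION}.tar.gz"
MAHOUT_URL="http://archive.apache.org/dist/mahout/${MAHOUT_VERSION}/${MAHOUT_TARBALL}"

# Hypothetical helper: download the tarball only if it is not already
# present in the current directory, then unpack it.
fetch_and_unpack() {
    [ -f "$MAHOUT_TARBALL" ] || wget "$MAHOUT_URL" || return 1
    tar -zxvf "$MAHOUT_TARBALL"
}

# Run manually once the URL has been confirmed:
# fetch_and_unpack
```

Keeping the version in one variable makes it easy to try a different release without editing the URL by hand.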
2. Configure Environment variables
# Set Mahout Environment
export MAHOUT_HOME=/mnt/jediael/mahout/mahout-distribution-0.9
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$PATH
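Before moving on, it can help to confirm that the variables point at a real Mahout tree. The function below is a minimal sketch; check_mahout_env is a hypothetical helper name, and it only assumes that the unpacked distribution contains bin/mahout and a conf directory, as the paths above suggest.

```shell
#!/bin/sh
# check_mahout_env DIR
# Succeeds only if DIR looks like an unpacked Mahout distribution.
check_mahout_env() {
    home=$1
    if [ -z "$home" ]; then
        echo "MAHOUT_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$home/bin/mahout" ]; then
        echo "no executable bin/mahout under $home" >&2
        return 1
    fi
    if [ ! -d "$home/conf" ]; then
        echo "no conf directory under $home" >&2
        return 1
    fi
    echo "MAHOUT_HOME looks usable: $home"
}

# Usage, after sourcing the profile that sets the variables:
# check_mahout_env "$MAHOUT_HOME"
```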
3. Install Mahout
$ pwd
/mnt/jediael/mahout/mahout-distribution-0.9
$ mvn install
4. Verify that Mahout is installed successfully
Run the mahout command. If it lists the available algorithms, the installation succeeded:
$ mahout
Running on hadoop, using /mnt/jediael/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /mnt/jediael/mahout/mahout-distribution-0.9/examples/target/mahout-examples-0.9-job.jar
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  lucene2seq: : Generate Text SequenceFiles from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of matrices
  parallelALS: : ALS-WR factorization of a rating matrix
  qualcluster: : Runs clustering experiments and summarizes results in a CSV
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  resplit: : Splits a set of SequenceFiles into a number of equal splits
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  streamingkmeans: : Streaming k-means clustering
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence
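Rather than reading the whole listing by eye, the check can be automated. The sketch below assumes each program appears at the start of a line as a name followed by two colons and a description, which is how the output above appears to be laid out; has_program is a hypothetical helper name.

```shell
#!/bin/sh
# has_program NAME
# Reads the output of the mahout command on stdin and succeeds if NAME
# is listed as a valid program. The match is anchored at the start of the
# line so that, for example, "kmeans" does not match "fkmeans".
has_program() {
    grep -q "^[[:space:]]*$1:[[:space:]]*:"
}

# Usage:
# mahout 2>&1 | has_program kmeans && echo "kmeans driver is available"
```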
Second, verify Mahout with a simple example
1. Start Hadoop
2. Download test data
Download synthetic_control.data from the links at http://archive.ics.uci.edu/ml/databases/synthetic_control/
The sample data is also easy to find by searching Baidu.
3. Upload test data
hadoop fs -put synthetic_control.data testdata
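A truncated download is a common cause of confusing job failures, so a quick shape check before uploading can save time. This sketch assumes the file holds 600 whitespace-separated rows of 60 values each, which is how the UCI synthetic control dataset is usually described; check_shape is a hypothetical helper, and the expected counts should be adjusted if your copy differs.

```shell
#!/bin/sh
# check_shape FILE ROWS COLS
# Succeeds only if FILE has exactly ROWS lines with COLS fields on each line.
check_shape() {
    awk -v rows="$2" -v cols="$3" '
        NF != cols {
            printf "line %d has %d fields, expected %d\n", NR, NF, cols
            bad = 1
        }
        END {
            if (NR != rows) {
                printf "found %d rows, expected %d\n", NR, rows
                bad = 1
            }
            exit bad + 0
        }' "$1"
}

# Usage (600 and 60 are the assumed dimensions):
# check_shape synthetic_control.data 600 60 && hadoop fs -put synthetic_control.data testdata
```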
4. Run the k-means clustering algorithm in Mahout with the command:
mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
It takes about 9 minutes to complete clustering.
5. View Clustering Results
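Run hadoop fs -ls output to view the clustering results. The relative path resolves to the output directory under your HDFS home, for example /user/root/output when running as root, or /user/jediael/output in the listing here.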
$ hadoop fs -ls output
Found 15 items
-rw-r--r--   2 jediael supergroup        194 2015-03-07 15:07 /user/jediael/output/_policy
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusteredPoints
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-0
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/clusters-1
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-10-final
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-2
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:03 /user/jediael/output/clusters-3
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-4
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:04 /user/jediael/output/clusters-5
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-6
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:05 /user/jediael/output/clusters-7
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:06 /user/jediael/output/clusters-8
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:07 /user/jediael/output/clusters-9
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/data
drwxr-xr-x   - jediael supergroup          0 2015-03-07 15:02 /user/jediael/output/random-seeds
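The directory whose name ends in -final holds the clusters from the last iteration. As a rough sketch, the function below picks that path out of the listing; final_clusters_dir is a hypothetical helper, and it assumes the path is the last field on each line, as in the output above.

```shell
#!/bin/sh
# final_clusters_dir
# Reads the output of "hadoop fs -ls output" on stdin and prints the path
# of the clusters-N-final directory, if one is present.
final_clusters_dir() {
    awk '$NF ~ /\/clusters-[0-9]+-final$/ { print $NF }'
}

# Usage, with the clusterdump program from the mahout listing. The -i, -p
# and -o options (input, points directory, local output file) are assumed
# from Mahout 0.9; confirm them with "mahout clusterdump --help".
# final=$(hadoop fs -ls output | final_clusters_dir)
# mahout clusterdump -i "$final" -p output/clusteredPoints -o clusters.txt
```

Looking the directory up this way avoids hard-coding the iteration number, which depends on how quickly the run converges.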
Mahout Quick Start Tutorial