Configuring Mahout took me a long time, mostly because of a few small issues that were easy to get wrong.
1. Download Mahout
http://mahout.apache.org
The latest version at the time of writing: mahout-distribution-0.9
2. Unzip Mahout into the directory where you want to keep it. I put it in /Users/jia/Documents/hadoop-0.20.2, that is, the Hadoop installation directory.
3. Configure the environment for Mahout
Open a terminal and open the directory that contains the profile file:
JIAS-MacBook-Pro:~ jia$ open /etc
Copy the profile file to the desktop, edit it there, and add the environment variables below.
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home
export HADOOP_HOME=$HOME/Documents/hadoop-0.20.2
export MAHOUT_HOME=$HOME/Documents/hadoop-0.20.2/mahout-distribution-0.9
export MAVEN_HOME=$HOME/Documents/apache-maven-3.2.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$MAVEN_HOME/bin:$MAHOUT_HOME/bin
export HADOOP_CONF_DIR=$HOME/Documents/hadoop-0.20.2/conf
export MAHOUT_CONF_DIR=$HOME/Documents/hadoop-0.20.2/mahout-distribution-0.9/conf
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$MAHOUT_HOME/lib:$HADOOP_CONF_DIR:$MAHOUT_CONF_DIR
(Note: use absolute paths here, e.g. $HOME/Documents/..., because /etc/profile is not guaranteed to be sourced with your home directory as the working directory.)
Then copy the edited file from the desktop back over /etc/profile; you will be asked for the administrator password.
Note:
When configuring MAHOUT_CONF_DIR, some websites say export MAHOUT_CONF_DIR=Documents/hadoop-0.20.2/mahout-distribution-0.9/src/conf.
The correct configuration for version 0.9 is export MAHOUT_CONF_DIR=Documents/hadoop-0.20.2/mahout-distribution-0.9/conf, because if you open the mahout-distribution-0.9 folder you will find there is no src directory.
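A quick way to double-check this is a string test on the variable itself. This is a sketch: the paths follow this tutorial's layout under $HOME/Documents, so adjust them for your machine.

```shell
# Sketch: set the two Mahout variables as in the profile above, then verify
# that MAHOUT_CONF_DIR ends in the top-level conf/ of the 0.9 distribution
# (and not the src/conf path that older guides mention, which no longer exists).
export MAHOUT_HOME="$HOME/Documents/hadoop-0.20.2/mahout-distribution-0.9"
export MAHOUT_CONF_DIR="$MAHOUT_HOME/conf"
case "$MAHOUT_CONF_DIR" in
  */mahout-distribution-0.9/conf) echo "MAHOUT_CONF_DIR looks right" ;;
  *) echo "MAHOUT_CONF_DIR looks wrong" ;;
esac
```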
4. Check whether Mahout is configured successfully.
4.1 Start Hadoop
JIAS-MacBook-Pro:hadoop-0.20.2 jia$ bin/start-all.sh
4.2 Run bin/mahout to list the available programs
JIAS-MacBook-Pro:mahout-distribution-0.9 jia$ bin/mahout
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  lucene2seq: : Generate Text SequenceFiles from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  parallelALS: : ALS-WR factorization of a rating matrix
  qualcluster: : Runs clustering experiments and summarizes results in a CSV
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  resplit: : Splits a set of SequenceFiles into a number of equal splits
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  streamingkmeans: : Streaming k-means clustering
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence
A note on the output above: when you see the following lines you may think something is wrong, but it is not an error.
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
MAHOUT_LOCAL controls whether Mahout runs locally. If it is set, Hadoop is not used, and the values of HADOOP_CONF_DIR and HADOOP_HOME are automatically ignored. At the beginning I struggled with this for a long time.
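If you do want Mahout to run purely locally and silence that message, MAHOUT_LOCAL can be set to any non-empty value. A small sketch (the exact value does not matter, only that the variable is non-empty):

```shell
# Setting MAHOUT_LOCAL to any non-empty string makes bin/mahout run locally;
# the launcher script then ignores HADOOP_CONF_DIR and HADOOP_HOME.
export MAHOUT_LOCAL=true
echo "MAHOUT_LOCAL=$MAHOUT_LOCAL"
# To go back to running on Hadoop, unset it again:
# unset MAHOUT_LOCAL
```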
5. Run the mahout Algorithm
5.1 Download the test data from the address below:
http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
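This is the UCI synthetic control chart dataset: 600 rows, each a time series of 60 whitespace-separated values. A quick awk check can confirm the file is well formed. This is a sketch that fabricates two sample rows so it runs anywhere; point the same awk line at the real synthetic_control.data after downloading.

```shell
# Fabricate two sample rows of 60 numbers each, then verify that every row
# has exactly 60 whitespace-separated fields, as the k-means example expects.
printf '%s\n' "$(seq 60 | tr '\n' ' ')" "$(seq 60 | tr '\n' ' ')" > /tmp/sample.data
awk 'NF != 60 { bad++ } END { print NR, "rows,", bad+0, "malformed" }' /tmp/sample.data
# prints: 2 rows, 0 malformed
```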
5.2 Create the test directory testdata and import the data into the testdata directory.
JIAS-MacBook-Pro:hadoop-0.20.2 jia$ bin/hadoop fs -mkdir testdata
5.3 Upload the test data to HDFS. Do not store the test data in a document created with Pages on the Mac; instead, create a plain file with the touch command: touch data
JIAS-MacBook-Pro:hadoop-0.20.2 jia$ bin/hadoop fs -put workspace/data testdata/
5.4 Run the k-means algorithm shipped with Mahout.
JIAS-MacBook-Pro:hadoop-0.20.2 jia$ bin/hadoop jar mahout-distribution-0.9/mahout-examples-0.9-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
5.5 View the results
JIAS-MacBook-Pro:~ jia$ cd Documents/hadoop-0.20.2/
JIAS-MacBook-Pro:hadoop-0.20.2 jia$ bin/hadoop fs -ls output/
Found 15 items
-rwxrwxrwx 1 jia staff 194 2014-08-03 14:42 /Users/jia/Documents/hadoop-0.20.2/output/_policy
drwxr-xr-x - jia staff 136 2014-08-03 14:42 /Users/jia/Documents/hadoop-0.20.2/output/clusteredPoints
drwxr-xr-x - jia staff 544 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/clusters-0
drwxr-xr-x - jia staff 204 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/clusters-1
drwxr-xr-x - jia staff 204 2014-08-03 14:42 /Users/jia/Documents/hadoop-0.20.2/output/clusters-10-final
drwxr-xr-x - jia staff 204 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/clusters-2
drwxr-xr-x - jia staff 204 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/clusters-3
drwxr-xr-x - jia staff 204 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/clusters-4
drwxr-xr-x - jia staff 204 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/clusters-5
drwxr-xr-x - jia staff 204 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/clusters-6
drwxr-xr-x - jia staff 204 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/clusters-7
drwxr-xr-x - jia staff 204 2014-08-03 14:42 /Users/jia/Documents/hadoop-0.20.2/output/clusters-8
drwxr-xr-x - jia staff 204 2014-08-03 14:42 /Users/jia/Documents/hadoop-0.20.2/output/clusters-9
drwxr-xr-x - jia staff 136 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/data
drwxr-xr-x - jia staff 136 2014-08-03 14:41 /Users/jia/Documents/hadoop-0.20.2/output/random-seeds
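The directories above hold SequenceFiles, which are not human-readable. To inspect the clusters as text, Mahout's clusterdump utility (it appears in the program list from step 4.2) can be used. This is a sketch: -i, -p, and -o are clusterdump's input, points-directory, and output options, and the output file name clusteranalyze.txt is my own choice.

```shell
# Dump the final clusters and their member points to a local text file.
# Run from the mahout-distribution-0.9 directory after the k-means job finishes.
bin/mahout clusterdump \
  -i output/clusters-10-final \
  -p output/clusteredPoints \
  -o clusteranalyze.txt
```

The resulting text file lists each cluster's center and the points assigned to it.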