Hadoop (13): Mahout
I. Introduction to Mahout
Mahout is a powerful data mining tool: a collection of distributed machine learning algorithms. It includes Taste, an implementation of distributed collaborative filtering, along with classification and clustering algorithms. Mahout's biggest advantage is that it is built on Hadoop. Many algorithms that previously ran on a single machine have been converted to MapReduce jobs, which greatly increases the volume of data they can handle and improves processing performance.
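To make "converted to MapReduce" a little more concrete, here is a rough sketch of a single k-means iteration written as a map | sort | reduce pipeline. It is not Mahout code. It works on one-dimensional points only, and the function names (kmeans_map, kmeans_reduce, kmeans_step) are made up for this illustration. The map stage tags each point with its nearest centroid, and the reduce stage averages the points that share a tag.

```shell
# One k-means iteration on 1-D points as map | sort | reduce.
# Illustration only: hypothetical helper names, not part of Mahout.

# map: for every input point, emit "<index of nearest centroid> <point>".
# $1 is a comma-separated list of current centroids, e.g. "0,9".
kmeans_map() {
  awk -v centroids="$1" '
    BEGIN { n = split(centroids, c, ",") }
    {
      best = 1
      dbest = ($1 - c[1]) * ($1 - c[1])
      for (i = 2; i <= n; i++) {
        d = ($1 - c[i]) * ($1 - c[i])
        if (d < dbest) { best = i; dbest = d }
      }
      print best, $1
    }'
}

# reduce: average all points that share a key to get the new centroid.
kmeans_reduce() {
  awk '
    { sum[$1] += $2; cnt[$1]++ }
    END { for (k in sum) printf "%d %.2f\n", k, sum[k] / cnt[k] }' | sort -n
}

# One full iteration: points on stdin, "<centroid index> <new centroid>" on stdout.
kmeans_step() {
  kmeans_map "$1" | sort -n | kmeans_reduce
}
```

For example, printf '1\n2\n3\n10\n11\n12\n' | kmeans_step "0,9" moves the two centroids to the means of their assigned points. On a cluster, the map and reduce stages run in parallel on different machines, which is where the gain in data size comes from.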
The machine learning algorithms implemented in Mahout are as follows:
| Category | Algorithm | Notes |
| Classification | Logistic Regression | |
| Classification | Bayesian | |
| Classification | SVM | |
| Classification | Perceptron | |
| Classification | Neural Network | |
| Classification | Random Forests | |
| Classification | Restricted Boltzmann Machines | |
| Clustering | Canopy Clustering | |
| Clustering | K-means Clustering | |
| Clustering | Fuzzy K-means | |
| Clustering | Expectation Maximization | EM clustering |
| Clustering | Mean Shift Clustering | |
| Clustering | Hierarchical Clustering | |
| Clustering | Dirichlet Process Clustering | |
| Clustering | Latent Dirichlet Allocation | LDA clustering |
| Clustering | Spectral Clustering | |
| Association Rule Mining | Parallel FP Growth Algorithm | |
| Regression | Locally Weighted Linear Regression | |
| Dimension Reduction | Singular Value Decomposition | |
| Dimension Reduction | Principal Components Analysis | |
| Dimension Reduction | Independent Component Analysis | |
| Dimension Reduction | Gaussian Discriminative Analysis | |
| Evolutionary Algorithms | Parallelized Watchmaker framework | |
| Recommendation/Collaborative Filtering | Non-distributed recommenders | Taste (UserCF, ItemCF, SlopeOne) |
| Recommendation/Collaborative Filtering | Distributed recommenders | ItemCF |
| Vector Similarity Calculation | RowSimilarityJob | Computes pairwise similarity between the rows of a matrix |
| Vector Similarity Calculation | VectorDistanceJob | Computes the distance between vectors |
| Non-MapReduce Algorithms | Hidden Markov Models | |
| Collections Extension | Collections | Extends the Java Collections classes |
II. Mahout installation and configuration
1. Download Mahout from http://archive.apache.org/dist/mahout/
2. Decompress the archive: tar -zxvf mahout-distribution-0.9.tar.gz
3. Configure the environment variables
3.1 Configure the Mahout environment variables:
    # set mahout environment
    export MAHOUT_HOME=/home/yujianxin/mahout/mahout-distribution-0.9
    export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
    export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$PATH
3.2 Configure the Hadoop environment variables that Mahout requires:
    # set hadoop environment
    export HADOOP_HOME=/home/yujianxin/hadoop/hadoop-1.1.2
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    export PATH=$PATH:$HADOOP_HOME/bin
    export HADOOP_HOME_WARN_SUPPRESS=not_null
4. Verify that Mahout is installed correctly: run the mahout command. If it prints a list of available algorithms, the installation succeeded.
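Before running mahout for the first time, it can help to confirm that the variables from step 3 point somewhere real. The following is a minimal sketch of such a check. The function name check_mahout_env is made up for this post, and it assumes the launcher script lives at $MAHOUT_HOME/bin/mahout, as it does in the mahout-distribution-0.9 layout.

```shell
# check_mahout_env: report whether the variables from step 3 are usable.
# Hypothetical helper, not shipped with Mahout or Hadoop.
check_mahout_env() {
  status=0
  for name in MAHOUT_HOME HADOOP_HOME HADOOP_CONF_DIR; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value"
      status=1
    else
      echo "$name = $value"
    fi
  done
  # Assumption: the launcher script is at bin/mahout under MAHOUT_HOME.
  if [ -n "${MAHOUT_HOME:-}" ] && [ ! -x "$MAHOUT_HOME/bin/mahout" ]; then
    echo "no executable found at $MAHOUT_HOME/bin/mahout"
    status=1
  fi
  return $status
}
```

Calling check_mahout_env after sourcing your profile prints one line per variable and returns a non-zero status if anything is missing, so it can be chained as check_mahout_env && mahout.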
III. Getting started with Mahout
1. Start Hadoop
2. Download the test data file synthetic_control.data from http://archive.ics.uci.edu/ml/databases/synthetic_control/
3. Upload the test data: hadoop fs -put synthetic_control.data /user/root/testdata
4. Run the k-means clustering algorithm in Mahout: mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
5. View the clustering result: hadoop fs -ls /user/root/output
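Steps 2 to 5 can be strung together in a small script. The sketch below makes several assumptions that are not part of the steps above: the helper names (has_shape, final_clusters_dir, run_kmeans_demo) are made up, the data file is expected to contain 600 rows of 60 values as the UCI description of the dataset states, the final iteration is assumed to be written to a directory named clusters-N-final, and mahout clusterdump with the -i, -p and -o options is assumed to be available in your Mahout version. Treat it as a starting point rather than a tested recipe.

```shell
# has_shape FILE ROWS COLS: succeed only if FILE has exactly ROWS lines
# and every line has exactly COLS whitespace-separated fields.
has_shape() {
  awk -v rows="$2" -v cols="$3" '
    NF != cols { bad++ }
    END { exit !(NR == rows && bad == 0) }' "$1"
}

# final_clusters_dir: read "hadoop fs -ls" output on stdin and print the
# path whose name looks like clusters-N-final (assumed naming scheme).
final_clusters_dir() {
  awk '$NF ~ /clusters-[0-9]+-final$/ { print $NF }' | tail -n 1
}

# run_kmeans_demo: steps 2 to 5 in order. Needs a running Hadoop cluster
# and the downloaded data file in the current directory.
run_kmeans_demo() {
  data=synthetic_control.data
  has_shape "$data" 600 60 || { echo "unexpected shape in $data" >&2; return 1; }
  hadoop fs -put "$data" /user/root/testdata || return 1
  mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job || return 1
  final=$(hadoop fs -ls /user/root/output | final_clusters_dir)
  [ -n "$final" ] || { echo "no clusters-N-final directory found" >&2; return 1; }
  # Assumption: clusterdump writes a readable text dump to a local file.
  mahout clusterdump -i "$final" -p /user/root/output/clusteredPoints -o clusters.txt
}
```

Nothing runs until you call run_kmeans_demo. The shape check up front is cheap and catches a truncated or HTML-wrapped download before a MapReduce job is wasted on it.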