The Apache Mahout project consists of the following five parts:
Frequent pattern mining: mines itemsets that occur frequently in the data.
Clustering: divides data, such as text documents, into locally related groups.
Classification: trains a classifier on documents that are already classified, then uses it to classify unclassified documents.
Recommendation engine (collaborative filtering): collects user behavior and discovers items that the user might like.
Frequent itemset mining: uses item sets (query logs or shopping carts) to identify items that often appear together.
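To make the idea of "items that often appear together" concrete, here is a minimal sketch in plain Java. It is a brute-force pair count, not Mahout's Parallel FP Growth implementation, and the class name PairCounter, the method frequentPairs, and the minSupport parameter are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout: counts how often two items
// appear in the same transaction and keeps the pairs that reach minSupport.
class PairCounter {

    static Map<String, Integer> frequentPairs(List<List<String>> transactions, int minSupport) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> transaction : transactions) {
            // Sort and de-duplicate so each pair has one canonical key per transaction.
            List<String> items = new ArrayList<>(new TreeSet<>(transaction));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "," + items.get(j), 1, Integer::sum);
                }
            }
        }
        // Drop pairs that occur in fewer than minSupport transactions.
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = List.of(
                List.of("bread", "milk"),
                List.of("bread", "butter", "milk"),
                List.of("butter", "jam"));
        System.out.println(frequentPairs(baskets, 2)); // prints {bread,milk=2}
    }
}
```

Counting every pair grows quadratically with basket size, which is one reason a real system would use an algorithm such as FP-growth instead of this sketch.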
Machine learning algorithms implemented in Mahout:
| Algorithm class | Algorithm name | Chinese name (translated) / notes |
| Classification algorithm | Logistic Regression | Logistic regression |
| | Bayesian | Bayesian |
| | SVM | Support vector machine |
| | Perceptron | Perceptron algorithm |
| | Neural Network | Neural network |
| | Random Forests | Random forest |
| | Restricted Boltzmann Machines | Restricted Boltzmann machine |
| Clustering algorithm | Canopy Clustering | Canopy clustering |
| | K-means Clustering | K-means algorithm |
| | Fuzzy K-means | Fuzzy k-means |
| | Expectation Maximization | EM clustering (expectation-maximization clustering) |
| | Mean Shift Clustering | Mean shift clustering |
| | Hierarchical Clustering | Hierarchical clustering |
| | Dirichlet Process Clustering | Dirichlet process clustering |
| | Latent Dirichlet Allocation | LDA clustering |
| | Spectral Clustering | Spectral clustering |
| Association rule mining | Parallel FP Growth Algorithm | Parallel FP-growth algorithm |
| Regression | Locally Weighted Linear Regression | Locally weighted linear regression |
| Dimensionality reduction | Singular Value Decomposition | Singular value decomposition |
| | Principal Components Analysis | Principal component analysis |
| | Independent Component Analysis | Independent component analysis |
| | Gaussian Discriminative Analysis | Gaussian discriminant analysis |
| Evolutionary algorithms | Parallelization of the Watchmaker framework | |
| Recommendation / collaborative filtering | Non-distributed recommenders | Taste (UserCF, ItemCF, SlopeOne) |
| | Distributed recommenders | ItemCF |
| Vector similarity computation | RowSimilarityJob | Calculates the similarity between columns |
| | VectorDistanceJob | Calculates the distance between vectors |
| Non-MapReduce algorithm | Hidden Markov Models | Hidden Markov model |
| Collections extension | Collections | Extends Java's collections classes |
Mahout can run in local mode, and it can also run its jobs as MapReduce (MR) jobs on Hadoop.
The Mahout API is divided into the following packages:
org.apache.mahout.cf.taste: Taste, the collaborative filtering APIs
org.apache.mahout.clustering: clustering algorithm APIs
org.apache.mahout.classifier: classification algorithms
org.apache.mahout.fpm: frequent pattern mining algorithms
org.apache.mahout.math: mathematical computation algorithms
org.apache.mahout.vectorizer: vector computation algorithms
1. KMeansConfigKeys interface
DISTANCE_MEASURE_KEY: the distance measure used by the k-means clustering algorithm
CLUSTER_CONVERGENCE_KEY: the convergence value of the k-means clustering algorithm
CLUSTER_PATH_KEY: the cluster path used by the k-means clustering algorithm
2. Kluster class
This class is usually called from the main function. Given a new cluster center and a distance function, it computes the new cluster and determines whether the clustering has converged.
Main functions of class Kluster:
Kluster(Vector center, int clusterId, DistanceMeasure measure)
: Constructor for the k-means clustering algorithm. It creates a new cluster that uses the input point as its center. The parameter measure is used to compare distances between points, center is the center of the new cluster, and clusterId is the ID of the new cluster.
public static String formatCluster(Kluster cluster)
: Formats the cluster for output.
public boolean computeConvergence(DistanceMeasure measure, double convergenceDelta)
: Computes whether this cluster has converged.
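The convergence check described above can be sketched without any Mahout dependency. The class below is a hypothetical, simplified stand-in (ClusterSketch, observe, and computeCentroid are names chosen for this illustration, and Euclidean distance is assumed in place of a pluggable distance measure). It treats a cluster as converged when the mean of its assigned points lies within convergenceDelta of the current center.

```java
// Hypothetical, simplified stand-in for a k-means cluster; this is not Mahout's class.
class ClusterSketch {
    private final int id;            // ID of this cluster
    private final double[] center;   // current cluster center
    private final double[] sum;      // running sum of the points observed so far
    private int count;               // number of points observed so far

    ClusterSketch(double[] center, int id) {
        this.center = center.clone();
        this.sum = new double[center.length];
        this.id = id;
    }

    int getId() {
        return id;
    }

    // Record a point that was assigned to this cluster.
    void observe(double[] point) {
        for (int i = 0; i < sum.length; i++) {
            sum[i] += point[i];
        }
        count++;
    }

    // Mean of the observed points; falls back to the current center if there are none.
    double[] computeCentroid() {
        if (count == 0) {
            return center.clone();
        }
        double[] centroid = new double[sum.length];
        for (int i = 0; i < sum.length; i++) {
            centroid[i] = sum[i] / count;
        }
        return centroid;
    }

    // Converged when the centroid is no farther than convergenceDelta from the center.
    boolean computeConvergence(double convergenceDelta) {
        return euclidean(center, computeCentroid()) <= convergenceDelta;
    }

    // Assumed distance measure: plain Euclidean distance.
    static double euclidean(double[] a, double[] b) {
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            total += diff * diff;
        }
        return Math.sqrt(total);
    }

    public static void main(String[] args) {
        ClusterSketch cluster = new ClusterSketch(new double[] {0.0, 0.0}, 0);
        cluster.observe(new double[] {1.0, 0.0});
        cluster.observe(new double[] {3.0, 0.0});
        // The centroid is (2, 0), which is a distance of 2 from the center (0, 0).
        System.out.println(cluster.computeConvergence(0.5)); // prints false
        System.out.println(cluster.computeConvergence(2.5)); // prints true
    }
}
```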
3. KMeansDriver class
This class is the entry point for running clustering. It includes functions such as buildClusters, clusterData, run, and main.
Function list:
public static void run(org.apache.hadoop.conf.Configuration conf,
org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path clustersIn,
org.apache.hadoop.fs.Path output, DistanceMeasure measure, double convergenceDelta,
int maxIterations, boolean runClustering, double clusterClassificationThreshold,
boolean runSequential)
throws IOException, InterruptedException, ClassNotFoundException
Meaning of the parameters:
conf, the Hadoop configuration
input, the directory path name of the input points
clustersIn, the directory path name of the initial and computed clusters
output, the directory path name of the output points
measure, the distance measure to use
convergenceDelta, the convergence value
maxIterations, the maximum number of iterations
runClustering, whether to cluster the points after the iterations are complete
clusterClassificationThreshold, the clustering strictness threshold; points that fall below it are not clustered
runSequential, whether to execute the sequential algorithm
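To show how the convergence value and the maximum number of iterations interact, here is a minimal in-memory sketch of a k-means loop. It is not the Mahout driver: it uses no Hadoop paths, assumes Euclidean distance, and leaves out the final clustering pass and the classification threshold. The class name KMeansLoopSketch and its run signature are hypothetical.

```java
import java.util.Arrays;

// Hypothetical in-memory k-means loop; illustrates convergenceDelta and maxIterations only.
class KMeansLoopSketch {

    static double[][] run(double[][] points, double[][] initialCenters,
                          double convergenceDelta, int maxIterations) {
        int k = initialCenters.length;
        int dim = initialCenters[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];

            // Assignment step: add each point to its nearest center.
            for (double[] point : points) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(point, centers[c]) < distance(point, centers[best])) {
                        best = c;
                    }
                }
                counts[best]++;
                for (int d = 0; d < dim; d++) {
                    sums[best][d] += point[d];
                }
            }

            // Update step: move each center to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (distance(centers[c], updated) > convergenceDelta) {
                    converged = false;
                }
                centers[c] = updated;
            }

            // Stop early once no center moved farther than convergenceDelta.
            if (converged) {
                break;
            }
        }
        return centers;
    }

    // Assumed distance measure: plain Euclidean distance.
    static double distance(double[] a, double[] b) {
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            total += diff * diff;
        }
        return Math.sqrt(total);
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {9, 9}, {10, 9}};
        double[][] initialCenters = {{0, 0}, {10, 10}};
        double[][] centers = run(points, initialCenters, 0.01, 10);
        System.out.println(Arrays.deepToString(centers)); // prints [[1.0, 1.5], [9.5, 9.0]]
    }
}
```

In this example the loop stops after the second iteration, because the centers no longer move, well before the limit of 10 iterations is reached.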