Preliminary understanding of Mahout

Source: Internet
Author: User
The Apache Mahout project consists of the following five parts:
Frequent pattern mining: Mines frequently occurring itemsets in the data.
Clustering: Divides data, such as text documents, into locally related groups.
Classification: Trains a classifier on documents that are already classified and uses it to classify unclassified documents.
Recommendation engine (collaborative filtering): Captures user behavior and discovers items that the user might like.
Frequent itemset mining: Uses sets of items (query records or shopping carts) to identify items that often appear together.
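
To make the last item concrete, here is a minimal sketch of finding items that often appear together. It is a brute-force pair count over a few made-up baskets, not the Parallel FP-Growth algorithm that Mahout implements; the class name PairCountSketch, its methods, and the sample data are all hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: counts how often each pair of items shares a basket. */
final class PairCountSketch {

    /** Returns pair -> number of baskets containing both items; keys look like "a,b" with a before b. */
    static Map<String, Integer> countPairs(List<List<String>> baskets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            // De-duplicate and sort so that each pair has one canonical key.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "," + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Keeps only the pairs that occur in at least minSupport baskets. */
    static Map<String, Integer> frequentPairs(List<List<String>> baskets, int minSupport) {
        Map<String, Integer> frequent = new TreeMap<>(countPairs(baskets));
        frequent.values().removeIf(count -> count < minSupport);
        return frequent;
    }

    public static void main(String[] args) {
        // Made-up shopping carts for the example.
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("bread", "milk"),
                Arrays.asList("bread", "butter", "milk"),
                Arrays.asList("butter", "milk"));
        System.out.println(frequentPairs(baskets, 2));
    }
}
```

Counting every pair grows quadratically with basket size, which is one reason a real system would use an algorithm such as FP-Growth instead.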

Machine learning algorithms implemented in Mahout:

Algorithm class                         Algorithm name                               Description
--------------------------------------  -------------------------------------------  --------------------------------------------------
Classification                          Logistic Regression                          Logistic regression
                                        Bayesian                                     Bayesian
                                        SVM                                          Support vector machine
                                        Perceptron                                   Perceptron algorithm
                                        Neural Network                               Neural network
                                        Random Forests                               Random forest
                                        Restricted Boltzmann Machines                Restricted Boltzmann machine
Clustering                              Canopy Clustering                            Canopy clustering
                                        K-means Clustering                           K-means algorithm
                                        Fuzzy K-means                                Fuzzy k-means
                                        Expectation Maximization                     EM clustering (expectation-maximization clustering)
                                        Mean Shift Clustering                        Mean shift clustering
                                        Hierarchical Clustering                      Hierarchical clustering
                                        Dirichlet Process Clustering                 Dirichlet process clustering
                                        Latent Dirichlet Allocation                  LDA clustering
                                        Spectral Clustering                          Spectral clustering
Association rule mining                 Parallel FP-Growth Algorithm                 Parallel FP-Growth algorithm
Regression                              Locally Weighted Linear Regression           Locally weighted linear regression
Dimensionality reduction                Singular Value Decomposition                 Singular value decomposition
                                        Principal Components Analysis                Principal component analysis
                                        Independent Component Analysis               Independent component analysis
                                        Gaussian Discriminative Analysis             Gaussian discriminant analysis
Evolutionary algorithms                 Parallelization of the Watchmaker framework
Recommendation/collaborative filtering  Non-distributed recommenders                 Taste (UserCF, ItemCF, SlopeOne)
                                        Distributed recommenders                     ItemCF
Vector similarity calculation           RowSimilarityJob                             Calculates the similarity between rows
                                        VectorDistanceJob                            Calculates the distance between vectors
Non-MapReduce algorithms                Hidden Markov Models                         Hidden Markov model
Collection extensions                   Collections                                  Extends Java's Collections classes


Mahout can run in local mode, and it can also run its jobs as MapReduce (MR) jobs on Hadoop.

The Mahout API is divided into the following sections:

org.apache.mahout.cf.taste: Taste APIs for collaborative filtering
org.apache.mahout.clustering: Clustering algorithm APIs
org.apache.mahout.classifier: Classification algorithm APIs
org.apache.mahout.fpm: Frequent pattern mining algorithms
org.apache.mahout.math: Mathematical computation algorithms
org.apache.mahout.vectorizer: Vectorization algorithms


1. KMeansConfigKeys interface

DISTANCE_MEASURE_KEY: the distance measure used by the k-means clustering algorithm
CLUSTER_CONVERGENCE_KEY: the convergence value of the k-means clustering algorithm
CLUSTER_PATH_KEY: the cluster path of the k-means clustering algorithm
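
The role of such keys can be sketched as follows: the driver stores each setting under a shared key, and the code that does the work reads it back by the same key. This sketch uses java.util.Properties as a stand-in for the Hadoop configuration object. The class name KMeansConfigSketch, the property strings, and the sample values are assumptions made for this example and are not the values Mahout uses.

```java
import java.util.Properties;

/** Illustrative stand-in: constant names mirror the keys above; the string values are assumptions. */
final class KMeansConfigSketch {
    // Hypothetical property names, chosen for this example only.
    static final String DISTANCE_MEASURE_KEY = "example.kmeans.measure";
    static final String CLUSTER_CONVERGENCE_KEY = "example.kmeans.convergence";
    static final String CLUSTER_PATH_KEY = "example.kmeans.path";

    /** Driver side: store the job settings under the shared keys. */
    static Properties configure(String measureClass, double convergenceDelta, String clusterPath) {
        Properties conf = new Properties();
        conf.setProperty(DISTANCE_MEASURE_KEY, measureClass);
        conf.setProperty(CLUSTER_CONVERGENCE_KEY, String.valueOf(convergenceDelta));
        conf.setProperty(CLUSTER_PATH_KEY, clusterPath);
        return conf;
    }

    /** Worker side: read the convergence value back by its key. */
    static double convergenceDelta(Properties conf) {
        return Double.parseDouble(conf.getProperty(CLUSTER_CONVERGENCE_KEY));
    }

    public static void main(String[] args) {
        // Sample values are placeholders.
        Properties conf = configure("EuclideanDistanceMeasure", 0.5, "clusters-0");
        System.out.println(conf.getProperty(CLUSTER_PATH_KEY));
        System.out.println(convergenceDelta(conf));
    }
}
```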

2. Kluster class
This class is usually called from the main function. It computes a new cluster from a given cluster center and distance function,
and determines whether the cluster has converged.

List of main functions of class Kluster

Kluster(Vector center, int clusterId, DistanceMeasure measure)
: Constructor for a k-means cluster. It uses the input point as the center
of the new cluster. The parameter measure is used to compare the distance between points, center
is the center of the new cluster, and clusterId is the ID of the new cluster.

public static String formatCluster(Kluster cluster)
: Formats the cluster for output

public boolean computeConvergence(DistanceMeasure measure,
double convergenceDelta)
: Computes whether this cluster has converged
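
The convergence check described above can be sketched in plain Java. The sketch assumes that a cluster has converged when its recomputed centroid lies within convergenceDelta of its current center. ClusterSketch and its methods are hypothetical names for this example, plain double arrays replace Mahout's vector type, and the distance is fixed to Euclidean instead of a pluggable distance measure.

```java
/** Illustrative k-means cluster, not Mahout's class. Distance is fixed to Euclidean here. */
final class ClusterSketch {
    private final int id;
    private final double[] center;
    private final double[] sum; // per-dimension sum of the points observed so far
    private int count;          // number of points observed so far

    ClusterSketch(double[] center, int id) {
        this.id = id;
        this.center = center.clone();
        this.sum = new double[center.length];
    }

    int getId() {
        return id;
    }

    /** Adds one point assigned to this cluster in the current iteration. */
    void observe(double[] point) {
        for (int i = 0; i < sum.length; i++) {
            sum[i] += point[i];
        }
        count++;
    }

    /** Mean of the observed points, or the current center if none were observed. */
    double[] computeCentroid() {
        double[] centroid = center.clone();
        if (count > 0) {
            for (int i = 0; i < centroid.length; i++) {
                centroid[i] = sum[i] / count;
            }
        }
        return centroid;
    }

    /** Converged when the center would move by at most convergenceDelta. */
    boolean computeConvergence(double convergenceDelta) {
        return euclidean(center, computeCentroid()) <= convergenceDelta;
    }

    static double euclidean(double[] a, double[] b) {
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            total += diff * diff;
        }
        return Math.sqrt(total);
    }

    public static void main(String[] args) {
        ClusterSketch cluster = new ClusterSketch(new double[] {0.0, 0.0}, 0);
        cluster.observe(new double[] {3.0, 4.0});
        // The centroid (3, 4) is 5 units from the center (0, 0).
        System.out.println(cluster.computeConvergence(0.5));
    }
}
```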




3. KMeansDriver class
This class is the entry point for performing clustering. It includes functions such as buildClusters, clusterData,
run, and main.


Function list:
public static void run(org.apache.hadoop.conf.Configuration conf,
    org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path clustersIn,
    org.apache.hadoop.fs.Path output, DistanceMeasure measure, double convergenceDelta,
    int maxIterations, boolean runClustering, double clusterClassificationThreshold,
    boolean runSequential)
    throws IOException, InterruptedException, ClassNotFoundException

Meaning of the parameters:
conf, the Hadoop configuration for the job
input, the directory path of the input points
clustersIn, the directory path of the initial and computed clusters
output, the directory path for the output points
measure, the distance measure to use
convergenceDelta, the convergence value
maxIterations, the maximum number of iterations
runClustering, whether to cluster the points after the iterations are complete
clusterClassificationThreshold, points that fall below this threshold will not be clustered
runSequential, whether to execute the sequential algorithm
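
How convergenceDelta and maxIterations interact can be sketched with a small sequential k-means loop. This is a self-contained illustration under stated assumptions, not Mahout's implementation: KMeansRunSketch and its methods are hypothetical names, points are plain double arrays held in memory instead of files under a path, the distance is fixed to Euclidean, and the final clustering step governed by runClustering and clusterClassificationThreshold is left out.

```java
import java.util.Arrays;

/** Illustrative sequential k-means loop, not Mahout's implementation. */
final class KMeansRunSketch {

    static double distance(double[] a, double[] b) {
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            total += diff * diff;
        }
        return Math.sqrt(total);
    }

    /** Index of the center closest to the point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int k = 1; k < centers.length; k++) {
            if (distance(point, centers[k]) < distance(point, centers[best])) {
                best = k;
            }
        }
        return best;
    }

    /** Iterates until every center moves by at most convergenceDelta, or maxIterations is reached. */
    static double[][] run(double[][] points, double[][] initialCenters,
                          double convergenceDelta, int maxIterations) {
        double[][] centers = new double[initialCenters.length][];
        for (int k = 0; k < centers.length; k++) {
            centers[k] = initialCenters[k].clone();
        }
        int dim = centers[0].length;
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assign each point to its nearest center and accumulate per-cluster sums.
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (double[] point : points) {
                int k = nearest(point, centers);
                counts[k]++;
                for (int d = 0; d < dim; d++) {
                    sums[k][d] += point[d];
                }
            }
            // Move each center to the mean of its points and check how far it moved.
            boolean converged = true;
            for (int k = 0; k < centers.length; k++) {
                if (counts[k] == 0) {
                    continue; // an empty cluster keeps its old center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[k][d] / counts[k];
                }
                if (distance(centers[k], updated) > convergenceDelta) {
                    converged = false;
                }
                centers[k] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Made-up points forming two obvious groups.
        double[][] points = {{1, 1}, {1, 2}, {9, 9}, {9, 10}};
        double[][] centers = run(points, new double[][] {{0, 0}, {10, 10}}, 0.001, 10);
        System.out.println(Arrays.deepToString(centers));
    }
}
```

A smaller convergenceDelta demands more stable centers before stopping, while maxIterations caps the work when the centers keep moving.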
