Dirichlet Clustering algorithm
The three clustering algorithms described above are all partition-based. Below we briefly introduce Dirichlet clustering (Dirichlet process clustering), an algorithm based on a probability distribution model.
We start with the principle behind clustering based on a probability distribution model (hereafter, model-based clustering). First we define a distribution model. It can be a simple shape such as a circle or a triangle, or a more complex one such as a normal or Poisson distribution. The data is then classified according to the model: as objects are added to a model, the model grows or shrinks, and each time its parameters must be recalculated and the probability that each object belongs to it re-estimated. The core of a model-based clustering algorithm is therefore the definition of the model; for a given clustering problem, the quality of the model definition directly affects the quality of the result. As a simple example, suppose the problem is to divide a set of two-dimensional points into three groups, shown in different colors in the figure. Figure A is the result of clustering with a circular model, and Figure B is the result with a triangular model. The circular model is the right choice here. The triangular model both misses points that belong to a group and includes points that do not, so it is the wrong choice.
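The assign-then-recalculate loop described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Mahout code: the "circular" model is taken to be an isotropic two-dimensional Gaussian with a fixed radius `sigma`, each point is assigned to its single most probable model, and the class and method names (`CircularModelSketch`, `density`, `membership`, `refit`) are hypothetical.

```java
import java.util.Arrays;

class CircularModelSketch {

    // Density of an isotropic ("circular") 2-D Gaussian with the given center and radius sigma.
    static double density(double[] p, double[] center, double sigma) {
        double dx = p[0] - center[0];
        double dy = p[1] - center[1];
        double var = sigma * sigma;
        return Math.exp(-(dx * dx + dy * dy) / (2 * var)) / (2 * Math.PI * var);
    }

    // Probability that point p belongs to each model: the densities normalized to sum to 1.
    static double[] membership(double[] p, double[][] centers, double sigma) {
        double[] probs = new double[centers.length];
        double total = 0.0;
        for (int k = 0; k < centers.length; k++) {
            probs[k] = density(p, centers[k], sigma);
            total += probs[k];
        }
        for (int k = 0; k < probs.length; k++) {
            probs[k] /= total;
        }
        return probs;
    }

    // One pass: assign every point to its most probable model, then recompute each center
    // as the mean of the points assigned to it. Empty models keep their previous center.
    static double[][] refit(double[][] points, double[][] centers, double sigma) {
        double[][] sums = new double[centers.length][2];
        int[] counts = new int[centers.length];
        for (double[] p : points) {
            double[] probs = membership(p, centers, sigma);
            int best = 0;
            for (int k = 1; k < probs.length; k++) {
                if (probs[k] > probs[best]) {
                    best = k;
                }
            }
            sums[best][0] += p[0];
            sums[best][1] += p[1];
            counts[best]++;
        }
        double[][] updated = new double[centers.length][];
        for (int k = 0; k < centers.length; k++) {
            updated[k] = counts[k] == 0
                ? centers[k].clone()
                : new double[] {sums[k][0] / counts[k], sums[k][1] / counts[k]};
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 1}, {1, 2}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centers = {{0, 0}, {10, 10}};
        for (int i = 0; i < 3; i++) {
            centers = refit(points, centers, 2.0);
        }
        System.out.println(Arrays.deepToString(centers));
    }
}
```

Swapping `density` for a different shape changes which points each model claims, which is why the choice of model has such a direct effect on the result.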
Figure 3 Clustering results with different models
Mahout's implementation of Dirichlet clustering works as follows. We start with a set of objects to be clustered and a distribution model; in Mahout, a ModelDistribution is used to generate the models. Initially the models are empty. The algorithm then tries to add objects to the models and iteratively computes the probability that each object belongs to each model. The following listing shows the in-memory implementation of the Dirichlet clustering algorithm.
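The step of computing the probability that an object belongs to each model can be sketched as follows. This is a simplified illustration, not Mahout's implementation. It assumes that each model's weight is the number of points it already holds plus an equal share of alpha, so that empty models can still attract points, and that this weight is multiplied by the model's density at the point. All names (`DirichletAssignmentSketch`, `mixtureWeights`, `assignmentProbabilities`, `sampleModel`) and the density values are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

class DirichletAssignmentSketch {

    // Weight of each model: (points already in the model + alpha / K) / (N + alpha).
    // With all counts at zero (empty models), every model gets weight 1 / K.
    static double[] mixtureWeights(int[] counts, double alpha) {
        int k = counts.length;
        double total = alpha;
        for (int c : counts) {
            total += c;
        }
        double[] weights = new double[k];
        for (int i = 0; i < k; i++) {
            weights[i] = (counts[i] + alpha / k) / total;
        }
        return weights;
    }

    // Probability that a point belongs to each model: weight * density, normalized to sum to 1.
    static double[] assignmentProbabilities(double[] weights, double[] densities) {
        double[] probs = new double[weights.length];
        double total = 0.0;
        for (int i = 0; i < probs.length; i++) {
            probs[i] = weights[i] * densities[i];
            total += probs[i];
        }
        if (total == 0.0) {
            return weights.clone(); // no model explains the point; fall back to the weights
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= total;
        }
        return probs;
    }

    // Draw a model index at random according to the given probabilities.
    static int sampleModel(double[] probs, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cumulative += probs[i];
            if (u < cumulative) {
                return i;
            }
        }
        return probs.length - 1; // guards against rounding error
    }

    public static void main(String[] args) {
        double alpha = 1.0;
        int[] counts = {11, 1, 0};            // points currently held by each model
        double[] densities = {0.0, 0.5, 0.5}; // hypothetical density of one point under each model
        double[] probs = assignmentProbabilities(mixtureWeights(counts, alpha), densities);
        int chosen = sampleModel(probs, new Random(42));
        System.out.println(Arrays.toString(probs) + " -> model " + chosen);
    }
}
```

In this sketch a larger alpha flattens the weights, so sparsely populated or empty models receive points more readily.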
Listing 6. Dirichlet Clustering Algorithm Example
    public static void dirichletProcessesClusterInMemory() {
        // Alpha parameter of the Dirichlet algorithm: a transition parameter that
        // lets objects move smoothly between different models
        double alphaValue = 1.0;
        // Number of cluster models
        int numModels = 3;
        // Thin and burn interval parameters, used to reduce memory usage during clustering
        int thinIntervals = 2;
        int burnIntervals = 2;
        // Maximum number of iterations
        int maxIter = 3;
        List<VectorWritable> pointVectors = SimpleDataSet.getPoints(SimpleDataSet.points);
        // Generate an empty distribution model for the initial stage;
        // NormalModelDistribution is used here
        ModelDistribution<VectorWritable> model =
            new NormalModelDistribution(new VectorWritable(new DenseVector(2)));
        // Perform clustering
        DirichletClusterer dc = new DirichletClusterer(pointVectors, model, alphaValue,
            numModels, thinIntervals, burnIntervals);
        List<Cluster[]> result = dc.cluster(maxIter);
        // Print the clustering results from the last entry in the result list
        for (Cluster cluster : result.get(result.size() - 1)) {
            System.out.println("Cluster id: " + cluster.getId() + " center: "
                + cluster.getCenter().asFormatString());
            System.out.println("       Points: " + cluster.getNumPoints());
        }
    }

Execution results

    Dirichlet Processes Clustering In Memory Result
    Cluster id: 0 center:{"class":"org.apache.mahout.math.DenseVector","vector":"{\"values\":[5.2727272727272725,5.2727272727272725],\"size\":2,\"lengthSquared\":-1.0}"}
           Points: 11
    Cluster id: 1 center:{"class":"org.apache.mahout.math.DenseVector","vector":"{\"values\":[1.0,2.0],\"size\":2,\"lengthSquared\":-1.0}"}
           Points: 1
    Cluster id: 2 center:{"class":"org.apache.mahout.math.DenseVector","vector":"{\"values\":[9.0,8.0],\"size\":2,\"lengthSquared\":-1.0}"}
           Points: 0
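The listing passes thin and burn interval parameters to reduce memory usage. One common reading of such parameters in sampling-based algorithms is that results from the first few iterations (the burn-in period) are discarded and only every n-th later iteration is kept. The sketch below shows that retention rule only; it is an assumption for illustration and has not been checked against Mahout's source, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SampleRetentionSketch {

    // Returns the zero-based iterations whose results would be kept under the assumed rule:
    // skip everything before the burn-in point, then keep only multiples of the thin interval.
    static List<Integer> retainedIterations(int maxIter, int burnInterval, int thinInterval) {
        if (thinInterval <= 0) {
            throw new IllegalArgumentException("thinInterval must be positive");
        }
        List<Integer> kept = new ArrayList<>();
        for (int iteration = 0; iteration < maxIter; iteration++) {
            if (iteration >= burnInterval && iteration % thinInterval == 0) {
                kept.add(iteration);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Ten iterations with burn = 2 and thin = 2 keep four of the ten results.
        System.out.println(retainedIterations(10, 2, 2)); // prints [2, 4, 6, 8]
    }
}
```

Under this rule, fewer intermediate results are held in memory as either interval grows, at the cost of a coarser record of how the models evolved.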
Mahout provides several probability distribution model implementations, all of which derive from ModelDistribution, as shown in Figure 4. Users can choose the model that best suits the characteristics of their data set; for details, see the official Mahout documentation.
Figure 4 The probabilistic distribution model hierarchy in Mahout
Summary of Mahout Clustering algorithm
The preceding sections described in detail the four clustering algorithms that Mahout provides. Here we briefly summarize them and compare their characteristics. Beyond these four, Mahout also provides some more complex clustering algorithms that are not covered here; for details, see the descriptions of the clustering algorithms on the Mahout wiki.
Table 1 Summary of Mahout Clustering algorithm
Algorithm | In-memory implementation | Map/Reduce implementation | Number of clusters fixed in advance | Clusters may overlap
K-means | KMeansClusterer | KMeansDriver | Y | N
Canopy | CanopyClusterer | CanopyDriver | N | N
Fuzzy K-means | FuzzyKMeansClusterer | FuzzyKMeansDriver | Y | Y
Dirichlet | DirichletClusterer | DirichletDriver | N | Y
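The last column of the table separates algorithms that place each point in exactly one cluster from those that let a point belong to several clusters at once. The sketch below contrasts the two using the standard fuzzy membership formula, in which membership is inversely related to distance and the fuzziness factor `m` must be greater than 1. It is an illustration only, not Mahout code, and the names `OverlapSketch`, `hardAssignment`, and `fuzzyMembership` are hypothetical.

```java
import java.util.Arrays;

class OverlapSketch {

    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Non-overlapping clustering: the point belongs only to the nearest center.
    static int hardAssignment(double[] p, double[][] centers) {
        int best = 0;
        for (int k = 1; k < centers.length; k++) {
            if (distance(p, centers[k]) < distance(p, centers[best])) {
                best = k;
            }
        }
        return best;
    }

    // Overlapping clustering: the point gets a degree of membership in every cluster,
    // u_k = 1 / sum_j (d_k / d_j)^(2 / (m - 1)), and the memberships sum to 1.
    static double[] fuzzyMembership(double[] p, double[][] centers, double m) {
        double[] d = new double[centers.length];
        for (int k = 0; k < centers.length; k++) {
            d[k] = distance(p, centers[k]);
            if (d[k] == 0.0) {
                // The point sits exactly on a center: full membership in that cluster.
                double[] exact = new double[centers.length];
                exact[k] = 1.0;
                return exact;
            }
        }
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[centers.length];
        for (int k = 0; k < centers.length; k++) {
            double sum = 0.0;
            for (int j = 0; j < centers.length; j++) {
                sum += Math.pow(d[k] / d[j], exponent);
            }
            u[k] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 10}};
        double[] point = {5, 5};
        System.out.println("hard: cluster " + hardAssignment(point, centers));
        System.out.println("fuzzy: " + Arrays.toString(fuzzyMembership(point, centers, 2.0)));
    }
}
```

A point midway between two centers is forced into one cluster by the hard rule but shares its membership equally under the fuzzy rule.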
Summary
Clustering algorithms are widely used in intelligent information processing systems. This article first introduced the concept of clustering and the ideas behind clustering algorithms, giving the reader an overall understanding of this important technique. Then, from the perspective of building real applications, it introduced the clustering framework of the open source software Apache Mahout, including its mathematical model, the various clustering algorithms, and their implementations on different infrastructures. Through the code examples, readers can see how to quantify their data and how to choose among the different clustering algorithms for their specific data problem.
The next article in this series will continue the in-depth look at algorithms related to recommendation engines by turning to classification. Like clustering, classification is a classic data mining problem. It is mainly used to extract models that describe important classes of data, and those models can then be used for prediction; recommendation is itself a form of prediction. Clustering and classification also often complement each other, and both help make recommendation efficient on large volumes of data. The next article will therefore describe the classification algorithms in detail, covering their principles, advantages and disadvantages, and practical scenarios, and give efficient implementations based on Apache Mahout.
Finally, thank you for your interest and support in this series.
Resources
Learn
Cluster analysis: Wikipedia's introduction to cluster analysis.
Data Mining: Concepts and Techniques (Jiawei Han): a classic data mining text that covers the problems and applications of data mining and explains the classic cluster analysis algorithms in detail.
Data Mining: Practical Machine Learning Tools and Techniques: another data mining classic, which describes the field's algorithms and their development in detail.
"Introducing Apache Mahout" (Grant Ingersoll, developerWorks, October 2009): Mahout founder Grant Ingersoll introduces the basic concepts of machine learning and demonstrates how to use Mahout to cluster documents, make recommendations, and organize content.
Apache Mahout: the Apache Mahout project home page, with all content about Mahout.
Apache Mahout algorithm summary: the Apache Mahout wiki's detailed description of the implemented algorithms.
Mahout in Action: Sean Owen's detailed introduction to the Mahout project, with extensive coverage of the clustering algorithms Mahout provides and some simple examples.
TF-IDF: Wikipedia's article on TF-IDF, including how it is calculated, its advantages and disadvantages, and its application scenarios.
Reuters data set: Reuters provides a large collection of news data sets that can be used as a data source for clustering; the "Reuters-21578" data set is used in the text clustering section of this article.
"Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching": the paper on the Canopy algorithm, published in 2000.
Dirichlet distribution: Wikipedia's introduction to the Dirichlet distribution, which underlies the Dirichlet clustering algorithm described in this article.
Building a social recommendation engine based on Apache Mahout: a developerWorks article published in 2009 on implementing a recommendation engine with Mahout, detailing the installation steps for Mahout and giving an example of a simple movie recommendation engine.
Machine learning: the Wikipedia page on machine learning, where you can learn more about the field.
Exploring the secrets of recommendation engines, Part 3: In-depth recommendation engine algorithms: clustering (4)