Exploring the secrets of recommendation engines, Part 3: A closer look at recommendation engine algorithms: clustering (IV)


Dirichlet Clustering algorithm

The three clustering algorithms described above are all partition-based. Below we briefly introduce a clustering algorithm based on a probability distribution model: Dirichlet clustering (Dirichlet Processes Clustering).

First, we briefly introduce the principle behind clustering algorithms based on a probability distribution model (hereinafter, model-based clustering algorithms). We start by defining a distribution model. It can be simple, such as a circle or a triangle, or more complex, such as a normal distribution or a Poisson distribution. We then classify the data according to the model. As objects are added to a model, the model grows or shrinks, and each time we need to recalculate the model's parameters and re-estimate the probability that each object belongs to the model.

The core of a model-based clustering algorithm is therefore the definition of the model: for a given clustering problem, the quality of the model definition directly affects the clustering result. Here is a simple example. Suppose our problem is to divide some two-dimensional points into three groups, shown in different colors in the figure. Figure A is the clustering result with a circular model, and figure B is the result with a triangular model. As can be seen, the circular model is the correct choice, while the triangular model both misses points that belong to a group and includes points that do not, so it is the wrong choice.

Figure 3 Clustering results with different models
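The "assign by probability, then recalculate the model parameters" loop described above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions, not Mahout code: each model is a "circular" (isotropic normal) model with a fixed spread sigma, each point is assigned outright to its most probable model, and only the model centers are re-estimated. The class name ModelBasedSketch and the sample points are made up for the example.

```java
import java.util.Arrays;

/** Illustrative sketch of model-based clustering with circular (isotropic normal) models. */
class ModelBasedSketch {

    /** Density of a 2-D point under a circular normal model with the given center and spread. */
    static double density(double[] p, double[] center, double sigma) {
        double dx = p[0] - center[0];
        double dy = p[1] - center[1];
        double variance = sigma * sigma;
        return Math.exp(-(dx * dx + dy * dy) / (2 * variance)) / (2 * Math.PI * variance);
    }

    /** Index of the model under which the point is most probable. */
    static int mostLikelyModel(double[] p, double[][] centers, double sigma) {
        int best = 0;
        for (int k = 1; k < centers.length; k++) {
            if (density(p, centers[k], sigma) > density(p, centers[best], sigma)) {
                best = k;
            }
        }
        return best;
    }

    /** Alternates between assigning points to models and re-estimating the model centers in place. */
    static int[] cluster(double[][] points, double[][] centers, double sigma, int iterations) {
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Step 1: estimate which model each point belongs to.
            for (int i = 0; i < points.length; i++) {
                assignment[i] = mostLikelyModel(points[i], centers, sigma);
            }
            // Step 2: recalculate each model's parameter (its center) from its points.
            for (int k = 0; k < centers.length; k++) {
                double sumX = 0;
                double sumY = 0;
                int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == k) {
                        sumX += points[i][0];
                        sumY += points[i][1];
                        n++;
                    }
                }
                if (n > 0) { // an empty model keeps its previous center
                    centers[k][0] = sumX / n;
                    centers[k][1] = sumY / n;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centers = {{0, 0}, {10, 10}}; // hypothetical starting models
        int[] assignment = cluster(points, centers, 1.0, 3);
        System.out.println("assignment: " + Arrays.toString(assignment));
        System.out.println("centers:    " + Arrays.deepToString(centers));
    }
}
```

Swapping in a different density function corresponds to choosing a different model, which is exactly the choice that separates figure A from figure B.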

The Dirichlet clustering algorithm implemented by Mahout works as follows. We start with a set of objects to be clustered and a distribution model; in Mahout, a ModelDistribution is used to generate the models. In the initial state the models are empty. We then try adding the objects to the models and, step by step, calculate the probability that each object belongs to each model. The following listing shows an in-memory implementation of the Dirichlet clustering algorithm.

Listing 6. Dirichlet Clustering Algorithm Example
    public static void DirichletProcessesClusterInMemory() {
        // Specify the alpha parameter of the Dirichlet algorithm. It is a transition
        // parameter that lets objects be distributed smoothly among the different models.
        double alphaValue = 1.0;
        // Specify the number of cluster models.
        int numModels = 3;
        // Specify the thin and burn interval parameters, which are used to reduce
        // the amount of memory used during clustering.
        int thinIntervals = 2;
        int burnIntervals = 2;
        // Specify the maximum number of iterations.
        int maxIter = 3;
        List<VectorWritable> pointVectors =
            SimpleDataSet.getPoints(SimpleDataSet.points);
        // Create the initially empty distribution model; NormalModelDistribution is used here.
        ModelDistribution<VectorWritable> model =
            new NormalModelDistribution(new VectorWritable(new DenseVector(2)));
        // Run the clustering.
        DirichletClusterer dc = new DirichletClusterer(pointVectors, model, alphaValue,
            numModels, thinIntervals, burnIntervals);
        List<Cluster[]> result = dc.cluster(maxIter);
        // Print the clustering result of the last iteration.
        for (Cluster cluster : result.get(result.size() - 1)) {
            System.out.println("Cluster id: " + cluster.getId() + " center: "
                + cluster.getCenter().asFormatString());
            System.out.println("       Points: " + cluster.getNumPoints());
        }
    }

Execution results

    Dirichlet Processes Clustering In Memory Result
    Cluster id: 0 center: {"class":"org.apache.mahout.math.DenseVector","vector":"{\"values\":[5.2727272727272725,5.2727272727272725],\"size\":2,\"lengthSquared\":-1.0}"}
           Points: 11
    Cluster id: 1 center: {"class":"org.apache.mahout.math.DenseVector","vector":"{\"values\":[1.0,2.0],\"size\":2,\"lengthSquared\":-1.0}"}
           Points: 1
    Cluster id: 2 center: {"class":"org.apache.mahout.math.DenseVector","vector":"{\"values\":[9.0,8.0],\"size\":2,\"lengthSquared\":-1.0}"}
           Points: 0
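To make the role of the alpha parameter and the "probability that each object belongs to each model" more concrete, here is a small self-contained sketch. It is not Mahout's implementation and does not use the Mahout API. It assumes a symmetric Dirichlet prior that spreads alpha evenly over the models and uses the expected mixture weights under that prior, where a real sampler would draw weights and assignments at random. The class name DirichletStepSketch and the likelihood values are hypothetical; the counts 11, 1, 0 are taken from the execution results above.

```java
import java.util.Arrays;

/** Illustrative sketch of one step of Dirichlet-style model-based clustering. */
class DirichletStepSketch {

    /**
     * Expected mixture weight of each model given how many points it currently holds.
     * Assumes a symmetric Dirichlet prior with total concentration alpha, so even an
     * empty model keeps a small non-zero weight.
     */
    static double[] mixtureWeights(int[] counts, double alpha) {
        int k = counts.length;
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double[] weights = new double[k];
        for (int i = 0; i < k; i++) {
            weights[i] = (counts[i] + alpha / k) / (total + alpha);
        }
        return weights;
    }

    /**
     * Probability that a point belongs to each model: mixture weight times the point's
     * likelihood under that model, normalized to sum to 1.
     * Assumes at least one likelihood is positive.
     */
    static double[] membership(double[] weights, double[] likelihoods) {
        double[] p = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            p[i] = weights[i] * likelihoods[i];
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        int[] counts = {11, 1, 0};               // points per model, as in the results above
        double[] weights = mixtureWeights(counts, 1.0);
        System.out.println("weights:    " + Arrays.toString(weights));

        double[] likelihoods = {0.02, 0.30, 0.25}; // hypothetical densities of one new point
        System.out.println("membership: " + Arrays.toString(membership(weights, likelihoods)));
    }
}
```

Under these assumptions a larger alpha pulls the weights toward an even split, while a smaller alpha lets heavily populated models dominate.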

Mahout provides several probability distribution model implementations, all of which inherit from ModelDistribution, as shown in Figure 4. Users can choose the model that suits the characteristics of their own data set; for details, see the official Mahout documentation.

Figure 4 The probabilistic distribution model hierarchy in Mahout

Summary of Mahout clustering algorithms

The preceding sections described in detail the four clustering algorithms that Mahout provides. Here we give a brief summary and compare the advantages and disadvantages of each algorithm. Besides these four, Mahout also provides some more complex clustering algorithms, which are not covered here; for details, see the descriptions of the clustering algorithms on the Mahout wiki.

Table 1 Summary of Mahout clustering algorithms

Algorithm     | In-memory implementation | Map/Reduce implementation | Number of clusters fixed in advance | Clusters may overlap
K-means       | KMeansClusterer          | KMeansDriver              | Y                                   | N
Canopy        | CanopyClusterer          | CanopyDriver              | N                                   | N
Fuzzy K-means | FuzzyKMeansClusterer     | FuzzyKMeansDriver         | Y                                   | Y
Dirichlet     | DirichletClusterer       | DirichletDriver           | N                                   | Y
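The "clusters may overlap" column comes down to hard versus soft membership. The following sketch contrasts the two for a single point, given its distances to the cluster centers. It is an illustration only, not Mahout code: the soft memberships use the standard fuzzy K-means membership formula with fuzziness m, and the class name MembershipSketch and the distances are made up for the example.

```java
import java.util.Arrays;

/** Illustrative contrast between hard (non-overlapping) and soft (overlapping) membership. */
class MembershipSketch {

    /** Hard assignment, as in K-means: the point belongs only to the nearest cluster. */
    static int hardAssignment(double[] distances) {
        int best = 0;
        for (int k = 1; k < distances.length; k++) {
            if (distances[k] < distances[best]) {
                best = k;
            }
        }
        return best;
    }

    /**
     * Soft assignment, as in fuzzy K-means: the point belongs to every cluster to some degree.
     * Assumes all distances are positive and the fuzziness m is greater than 1.
     */
    static double[] fuzzyMembership(double[] distances, double m) {
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int k = 0; k < distances.length; k++) {
            double sum = 0.0;
            for (int j = 0; j < distances.length; j++) {
                sum += Math.pow(distances[k] / distances[j], exponent);
            }
            u[k] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        double[] distances = {1.0, 2.0}; // hypothetical distances to two cluster centers
        System.out.println("hard: cluster " + hardAssignment(distances));
        System.out.println("soft: " + Arrays.toString(fuzzyMembership(distances, 2.0)));
    }
}
```

With distances 1.0 and 2.0 and m = 2, the point belongs only to the first cluster under hard assignment, but has memberships of about 0.8 and 0.2 under soft assignment.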


Summary

Clustering algorithms are widely used in intelligent information processing systems. This article first introduced the concept of clustering and the ideas behind clustering algorithms, giving the reader an overall understanding of this important technique. Then, from the perspective of building real applications, it introduced the clustering framework in the open source software Apache Mahout, including its mathematical model, the various clustering algorithms, and their implementations on different infrastructures. Through the code examples, readers can learn how to quantify the data for their own specific problem and how to choose among the different clustering algorithms.

The next article in this series will continue the in-depth look at recommendation engine algorithms with classification. Like clustering, classification is a classic data mining problem. It is mainly used to extract models that describe important data classes, which can then be used for prediction, and recommendation is itself a kind of prediction. Clustering and classification also often complement each other, and both help make recommendation over large volumes of data efficient. The next article will therefore cover classification algorithms in detail, including their principles, advantages and disadvantages, and practical scenarios, and will present efficient implementations of classification algorithms based on Apache Mahout.

Finally, thank you for your interest and support in this series.

Resources

Learn
  • Cluster analysis: Wikipedia's introduction to cluster analysis.

  • Data Mining: Concepts and Techniques (Jiawei Han): A classic data mining text that introduces the various problems and applications in data mining and explains the classical cluster analysis algorithms in detail.

  • Data Mining: Practical Machine Learning Tools and Techniques: Another classic data mining text, which covers the algorithms of the field and their development in detail.

  • "Introducing Apache Mahout" (Grant Ingersoll, developerWorks, October 2009): Mahout's founder, Grant Ingersoll, introduces the basic concepts of machine learning and demonstrates how to use Mahout to cluster documents, make recommendations, and organize content.

  • Apache Mahout: The Apache Mahout project home page, where you can find all content about Mahout.

  • Apache Mahout algorithm summary: The Apache Mahout wiki's detailed introduction to the implemented algorithms.

  • Mahout in Action: Sean Owen introduces the Mahout project in detail, with extensive coverage of the clustering algorithms that Mahout provides and some simple examples.

  • TF-IDF: Wikipedia's article on TF-IDF, including how it is calculated, its advantages and disadvantages, and its application scenarios.

  • Reuters data set: Reuters provides a large number of news data sets that can be used as a data source for clustering. The "Reuters-21578" data set is used in the text clustering section of this article.

  • "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching": The paper on the Canopy algorithm, published in 2000.

  • Dirichlet distribution: Wikipedia's introduction to the Dirichlet distribution, which is the basis of the Dirichlet clustering algorithm introduced in this article.

  • "Building a social recommendation engine based on Apache Mahout": A developerWorks article published in 2009 on implementing a recommendation engine with Mahout. It details the installation steps for Mahout and gives an example of a simple movie recommendation engine.

  • Machine learning: The Wikipedia page on machine learning can help you learn more about the field.

  • developerWorks Java technology zone: Hundreds of articles on all aspects of Java programming.

  • developerWorks Web development zone: Extend your web development skills with articles and tutorials dedicated to web technologies.

  • developerWorks Ajax resource center: A one-stop center for information about the Ajax programming model, including many documents, tutorials, forums, blogs, wikis, and news. Any new Ajax information can be found here.

  • developerWorks Web 2.0 resource center: A one-stop center for Web 2.0 information, including a large number of Web 2.0 technical articles, tutorials, downloads, and related technical resources. You can also quickly learn the concepts of Web 2.0 through the Web 2.0 starter section.

  • Check out the HTML5 topic for more information and trends related to HTML5.
