Mahout 0.3: open-source machine learning project

http://www.cvchina.info/2010/05/04/mahout-0-3/

Apache Mahout, an open-source machine learning project, released version 0.3 in March. The new version adds several features and is more stable and faster than the previous release. InfoQ interviewed Grant Ingersoll and Ted Dunning, two developers on the Apache Mahout project. Grant Ingersoll is also one of the project's founders.

Over the past decade, the need to extract relevant information from large volumes of raw data has grown dramatically. Demand for machine learning techniques such as clustering, collaborative filtering, and categorization has grown steadily along with it.

Grant Ingersoll introduced the Mahout project as follows:

  • Clustering groups related documents, which helps users focus on specific clusters and their content instead of wasting effort on irrelevant material.
  • Recommendation algorithms (collaborative filtering) are often used to recommend books, music, movies, and other content to users. They can also be used in multi-user collaboration applications to reduce the amount of data each user has to follow.
  • Classification (the naive Bayes classifier and other classification algorithms) can be used to categorize documents that have not been seen before. When a new document is classified, the algorithm looks up the words of the document in the model for each category, calculates the probability that the document belongs to each category, and assigns the document to the category with the highest probability. The output usually includes a numerical score indicating how confident the result is.
  • Mahout achieves scalability by building on Apache Hadoop.
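The classification steps in the list above (look up each word of the document, compute a probability per category, pick the most probable category) can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Mahout's implementation: the function names `train` and `classify`, the whitespace tokenizer, and the add-one (Laplace) smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs, alpha=1.0):
    """Count words per category from (category, text) pairs.

    alpha is the smoothing constant (an assumption of this sketch).
    """
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for category, text in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary, alpha


def classify(model, text):
    """Return (best_category, scores); scores are log-probabilities."""
    word_counts, doc_counts, vocabulary, alpha = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        # Start from the prior probability of the category.
        log_prob = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            count = word_counts[category][word]
            log_prob += math.log(
                (count + alpha) / (total_words + alpha * len(vocabulary))
            )
        scores[category] = log_prob
    best = max(scores, key=scores.get)
    return best, scores


training_data = [
    ("sports", "the team won the match"),
    ("sports", "a great goal in the game"),
    ("tech", "the new laptop has a fast processor"),
    ("tech", "software update for the phone"),
]
model = train(training_data)
category, scores = classify(model, "the team scored a goal")
print(category, scores)
```

The returned scores play the role of the numerical confidence value mentioned above: the larger the gap between the best score and the others, the more clear-cut the decision.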

Another focus of Mahout is a set of tools for representing text data in matrix form. Converting data into this form is the first step in processing it with Mahout's machine learning algorithms.
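To make the idea of a matrix representation concrete, the sketch below turns a few short documents into a term-count matrix with one row per document and one column per word. It is only an illustration in plain Python: the helper name `build_term_document_matrix` and the whitespace tokenizer are assumptions, and Mahout's own tools use their own vector formats.

```python
from collections import Counter


def build_term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc.lower().split()).items():
            row[column[word]] = count
        matrix.append(row)
    return vocabulary, matrix


vocabulary, matrix = build_term_document_matrix(
    ["apache mahout", "mahout mahout hadoop"]
)
print(vocabulary)
print(matrix)
```

Real text collections produce matrices that are mostly zeros, which is why sparse vector types are normally used instead of the dense lists shown here.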

The Mahout project was started by several members of the Apache Lucene (open-source search) community who were keenly interested in clustering, classification, and other machine learning algorithms. The community's early work was driven by the paper "Map-Reduce for Machine Learning on Multicore" by Ng et al. Since then, the community has been committed to developing machine learning algorithms and models.

Highlights of the latest Apache Mahout release include:

  • New math and collections modules based on the high-performance Colt library
  • Faster frequent pattern growth (FP-growth) algorithm using FP-bonsai pruning
  • Parallel Dirichlet clustering (a model-based clustering algorithm)
  • Parallel recommendation engine based on co-occurrence
  • Parallel conversion of text documents to vectors, using LLR-based n-gram generation
  • Parallel Lanczos SVD (singular value decomposition) solver
  • Scripts for running the algorithms, tools, and examples
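As an illustration of the co-occurrence idea behind the recommendation engine in the list above, the sketch below counts how often two items appear together in users' histories and recommends the unseen items that co-occur most with what a user already has. It is a small in-memory sketch under stated assumptions, not Mahout's parallel Hadoop job: the function names, the sample data, and the simple sum-of-counts scoring are all hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(user_items):
    """counts[a][b] = number of users whose history contains both a and b."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, user_items, counts, top_n=3):
    """Rank unseen items by total co-occurrence with the user's items."""
    seen = set(user_items[user])
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


user_items = {
    "alice": ["book_a", "book_b"],
    "bob": ["book_a", "book_b", "book_c"],
    "carol": ["book_b", "book_c", "book_d"],
    "dave": ["book_a"],
}
counts = cooccurrence_counts(user_items)
print(recommend("alice", user_items, counts))
print(recommend("dave", user_items, counts))
```

The counting step is a sum over independent user histories, which is the kind of computation that splits naturally across many machines.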

When asked about the most exciting features in this version, Ingersoll replied:

The newly added singular value decomposition is very promising. In addition, there are many new tools that help users get their content into Mahout. The most exciting thing, however, is not tangible: it is the growth of the Mahout community. The community has attracted a considerable number of contributors and users. The early stages of any open-source project are often difficult, and there are usually only one or two people doing the work. If one of them leaves, development may slow down or the entire project may collapse. But I believe Mahout has passed this test, and many very active community members are now working to turn it into a truly exciting project.

Future plans for the Mahout project include:

  • Release version 1.0 this year
  • Provide stable APIs starting with version 1.0
  • Implement online learning methods such as stochastic gradient descent (SGD)
  • Implement the support vector machine (SVM) algorithm

The SGD and SVM implementations will be well suited to document mining and other applications that involve text or similar categorical data. In particular, the SGD system is expected to support creating interaction variables on the fly.
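To show what online learning with SGD means in practice, the sketch below trains a logistic regression model one example at a time, updating the weights immediately after each example instead of making repeated passes over a stored dataset. It is a generic illustration, not the planned Mahout implementation: the function names, the feature names such as `word=goal`, the learning rate, and the toy data stream are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    """Probability that the label is 1 for a sparse feature dict."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)


def sgd_update(weights, features, label, learning_rate=0.1):
    """One online SGD step on a single (features, label) example."""
    error = label - predict(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + learning_rate * error * value
    return weights


# A toy stream of sparse examples; label 1 = sports, label 0 = tech.
stream = [
    ({"bias": 1.0, "word=goal": 1.0}, 1),
    ({"bias": 1.0, "word=laptop": 1.0}, 0),
] * 200

weights = {}
for features, label in stream:
    sgd_update(weights, features, label)

print(predict(weights, {"bias": 1.0, "word=goal": 1.0}))
print(predict(weights, {"bias": 1.0, "word=laptop": 1.0}))
```

Because the weights are stored in a dictionary keyed by feature name, only the features present in each example are touched, which suits sparse text data.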

Original English text: Mahout 0.3: Open Source Machine Learning
