Smart applications that learn from data and user input were once the preserve of research institutes and companies with dedicated budgets, but they are becoming more common. The need for machine learning techniques such as clustering, collaborative filtering, and classification has never been greater, whether for finding commonalities among large groups of people or for automatically tagging large volumes of Web content. The Apache Mahout project aims to help developers create smart applications more easily and quickly. Mahout co-founder Grant Ingersoll introduces the basic concepts of machine learning and demonstrates how to use Mahout to cluster documents, make recommendations, and organize content.
In the information age, the success of companies and individuals increasingly depends on how quickly and efficiently they can turn large amounts of data into actionable information. Whether you are processing thousands of personal e-mail messages every day or inferring user intent from massive volumes of web logs, you need tools to organize and enhance your data. Machine learning is a branch of artificial intelligence concerned with techniques that allow computers to improve their output based on previous experience. The field is closely related to data mining and often draws on a variety of techniques, including statistics, probability theory, and pattern recognition. Although machine learning is not a new field, it is unquestionably growing fast. Many large companies, including IBM®, Google, Amazon, Yahoo!, and Facebook, have implemented machine learning algorithms in their applications. Many other companies have also applied machine learning in their own applications to learn from users and past experience, and have benefited from doing so.
After a brief overview of machine learning concepts, I will describe the features, history, and goals of the Apache Mahout project. Then I'll show you how to use Mahout for some interesting machine learning tasks, working with a freely available Wikipedia data set.
Machine Learning 101
Machine learning can be used for a variety of purposes, from game playing to fraud detection to stock market analysis. It is used to build systems like those offered by Netflix and Amazon that recommend products to users based on their purchase history, or systems that find all the similar articles from a given period. It can also be used to sort Web pages automatically into categories (sports, economics, war, and so on) or to flag junk e-mail messages. Machine learning has many more applications than this article can list.
Several approaches to machine learning can be used to solve problems. I will focus on the two most commonly used ones, supervised and unsupervised learning, because they are the main ones Mahout supports.
The task of supervised learning is to learn a function from labeled training data in order to predict the value of any valid input. Common examples of supervised learning include classifying e-mail messages as spam, labeling Web pages by category, and recognizing handwriting. Many algorithms are used to create supervised learners, the most common being neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers.
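To make the idea of learning from labeled data concrete, here is a minimal sketch of a Naive Bayes spam classifier in plain Python. It is not Mahout's implementation; the function names `train` and `classify`, the four training messages, and the whitespace tokenization are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from (text, label) pairs (illustrative helper)."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of documents
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: how common is this label in the training data?
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up labeled training data: the "supervision" is the label on each message.
training_data = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(training_data)
predicted = classify("win a cash prize", model)
```

With this toy data, `predicted` comes out as `"spam"`, because the words in the new message appear far more often in the spam examples than in the others. A real classifier would need much more training data and better tokenization.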
The task of unsupervised learning is to make sense of data without any examples of what is correct or incorrect. It is most commonly used to cluster similar inputs into logical groups. It can also be used to reduce the number of dimensions in a data set so that you can focus on only the most useful attributes, or to detect trends. Common approaches to unsupervised learning include k-means, hierarchical clustering, and self-organizing maps.
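As an illustration of grouping similar inputs without labels, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not how Mahout implements clustering; the `kmeans` function, the six sample points, and the two starting centroids are assumptions chosen for the example, and `math.dist` requires Python 3.8 or later.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on numeric tuples; returns (centroids, assignments)."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Made-up 2-D data with two visible groups and hand-picked starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]
final_centroids, assignments = kmeans(points, initial)
```

Here the first three points end up in cluster 0 and the last three in cluster 1, with no labels supplied. In practice the starting centroids are usually chosen at random, and the result can depend on that choice.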
In this article, I will focus on the three specific machine learning tasks that Mahout currently implements. They also happen to be three areas that are very common in real-world applications:
Collaborative filtering
Clustering
Classification
Before studying their implementation in Mahout, I will discuss these tasks in more depth from the conceptual level.