Mahout Learning Road Map-Zhang Dan Teacher

Source: Internet
Author: User

Preface

Mahout is a distinctive member of the Hadoop family: a machine learning and data mining framework built on Hadoop's distributed computing. It is an interdisciplinary product and, in my view, one of the most competitive, hardest to learn, and most rewarding projects in the Hadoop family.

For data analysts, Mahout lowers the barrier to big data; for algorithm engineers, it provides a base algorithm library; for Hadoop developers, it provides data modeling standards; and for operations staff, it opens up the connection to Hadoop.

A mahout is an elephant trainer, and Mahout creates new wisdom on top of Hadoop!

Contents

    1. Mahout Introduction
    2. Mahout Learning Road Map
    3. My Learning Experience
    4. Use cases of Mahout
1. Mahout Introduction

Mahout is a Hadoop-based distributed framework for machine learning and data mining. It implements a number of data mining algorithms with MapReduce, which addresses the problem of mining data in parallel.

According to the book "Mahout in Action", Mahout implements three major classes of algorithms: recommendation, clustering, and classification.

The learning road map described below follows the structure of "Mahout in Action".

2. Mahout Learning Road Map

I have listed Mahout's knowledge points in the picture, and I hope it helps others understand Mahout better.

Next is my own learning experience. Nobody gets a shortcut, but it is not that hard once you settle down and put your mind to it.

3. My Learning Experience

Some time ago I spent about half a year studying Mahout specifically. At that time there was very little material on Mahout, and only a handful of resources in Chinese. Once I found a copy of "Mahout in Action", I began reading it repeatedly. Don't worry about what to do first; just read it over and over. Only after three readings did I start to feel some confidence.

I started with the recommendation algorithms, UserCF and ItemCF. I remember the first time I presented them to my group at the company. I designed a questionnaire listing 10 websites (6 IT sites, 2 personal blogs, and 2 social communities) and asked everyone to rate each one from 0 to 5, where 0 means "don't know it" and 1-5 indicates how much they like the site.

Questionnaire result format:

User1, Website1, 5
User1, Website2, 2
User1, Website3, 4
User2, Website3, 2
User3, Website3, 5
User4, Website3, 0
.....

I used this questionnaire data to try out Mahout's recommendation model. The results puzzled everyone: why were these sites recommended? So I dug into the Mahout source code to see how the algorithms are implemented, and learned about similarity matrices, distance measures, recommendation algorithms, and model evaluation. Different business requirements and different algorithm choices all affect the results. I collected all the concepts and key terms from the book (unfortunately I did not write them up as blog posts). It took 3 months, at 12 hours a day, to get through recommendation.
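To make the idea concrete, here is a minimal user-based collaborative filtering (UserCF) sketch in plain Python. It is not Mahout's implementation: the sample ratings beyond the rows shown above, the function names, and the choice of cosine similarity with two neighbours are all assumptions made for illustration. A score of 0 ("don't know") is treated as a missing rating.

```python
from math import sqrt

# Hypothetical ratings in the questionnaire's "user, website, score" format.
RAW = """\
User1, Website1, 5
User1, Website2, 2
User1, Website3, 4
User2, Website1, 4
User2, Website3, 5
User2, Website4, 3
User3, Website1, 1
User3, Website2, 5
User3, Website4, 4
User4, Website3, 0
"""


def load_ratings(text):
    """Parse 'user, item, score' lines into {user: {item: score}}.

    A score of 0 means "don't know", so it is not stored as a rating.
    """
    ratings = {}
    for line in text.strip().splitlines():
        user, item, score = (field.strip() for field in line.split(","))
        ratings.setdefault(user, {})
        if float(score) > 0:
            ratings[user][item] = float(score)
    return ratings


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts; unrated items count as 0."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, k=2):
    """Rank items the user has not rated by similarity-weighted neighbour scores."""
    target = ratings[user]
    # The k most similar other users form the neighbourhood.
    neighbours = sorted(
        ((cosine_similarity(target, other), name)
         for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    totals, weights = {}, {}
    for sim, name in neighbours:
        if sim <= 0:
            continue  # users with nothing in common contribute nothing
        for item, score in ratings[name].items():
            if item not in target:
                totals[item] = totals.get(item, 0.0) + sim * score
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: totals[item] / weights[item] for item in totals}
    return sorted(predicted.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(recommend(load_ratings(RAW), "User1"))
```

Swapping `cosine_similarity` for another measure, or changing `k`, changes the ranking, which is one reason the same data can produce surprising recommendations.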

Then I applied it to a real business problem. My task was "job recommendation", and the only data I had was user behavior: positions browsed, positions bookmarked, and positions applied for.

My first attempt applied the recommendation model directly, and the results were very poor.
There were two causes:

    • 1. Positions are time-sensitive, and each one may expire within 3 months, so the recommended results included many expired positions.
    • 2. Much of the user behavior was historical, some of it from 2-3 years earlier, so the recommendations did not meet users' expectations. I estimated that the positions a user targets move up roughly every six months, so historical behavior cannot be used directly in calculations for the current user.

The revised approach:
1. Filter the user behavior data set so that only behavior from the last six months is used.
2. Filter the result set to exclude expired positions.
3. Compute separately with several different algorithm models (as I recall, item-based recommendation with Tanimoto similarity gave the best results).
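The three steps above can be sketched as follows. This is an illustration in plain Python, not the code used in the project: the event tuples, job names, dates, and the 183-day window are hypothetical, and jobs without a known expiry date are assumed to be still open.

```python
from datetime import date, timedelta


def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient between two sets of users."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def recommend_jobs(events, expiry, user, today, window_days=183):
    """events: (user, job, date) tuples; expiry: {job: expiry date}."""
    cutoff = today - timedelta(days=window_days)
    # Step 1: keep only behavior from roughly the last six months.
    recent = [(u, j) for u, j, d in events if d >= cutoff]
    users_by_job = {}
    for u, j in recent:
        users_by_job.setdefault(j, set()).add(u)
    seen = {j for u, j in recent if u == user}
    # Step 3: score each unseen job by its summed Tanimoto similarity
    # to the jobs this user has already interacted with.
    scores = {}
    for job, users in users_by_job.items():
        if job in seen:
            continue
        score = sum(tanimoto(users, users_by_job[s]) for s in seen)
        if score > 0:
            scores[job] = score
    # Step 2: drop expired jobs (unknown expiry is treated as still open).
    live = {j: s for j, s in scores.items() if expiry.get(j, today) >= today}
    return sorted(live, key=live.get, reverse=True)


# Hypothetical data for illustration only.
TODAY = date(2014, 1, 1)
EVENTS = [
    ("alice", "job_a", date(2013, 11, 1)),
    ("alice", "job_b", date(2013, 12, 1)),
    ("bob", "job_a", date(2013, 11, 15)),
    ("bob", "job_c", date(2013, 12, 10)),
    ("bob", "job_d", date(2013, 12, 5)),
    ("bob", "job_e", date(2011, 6, 1)),   # too old, filtered in step 1
    ("carol", "job_a", date(2013, 10, 5)),
    ("carol", "job_c", date(2013, 12, 20)),
]
EXPIRY = {
    "job_a": date(2014, 2, 1),
    "job_b": date(2014, 2, 1),
    "job_c": date(2014, 3, 1),
    "job_d": date(2013, 12, 15),          # expired, filtered in step 2
}

if __name__ == "__main__":
    print(recommend_jobs(EVENTS, EXPIRY, "alice", TODAY))  # ['job_c']
```

Tanimoto similarity only looks at whether a user interacted with a job, not at a rating value, which suits browse, bookmark, and apply data where no explicit score exists.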

The recommendation results improved significantly. And that is where the story ends: although I did a lot more work, the product never went live because of a restructuring at the company. (A programmer's sorrow!)

Clustering model: I applied this to analyzing the activity of a website's users. Suppose a website has 10 million registered users, of whom 10,000 log in each day. We wanted to know the characteristics of the 9.99 million users who do not log in. I used Mahout's k-means and Canopy for clustering, assuming the 10 million users could be divided into 5 large groups. In the end we got a result and shared it with the team. That was the end of that story. (Reality is so harsh!)
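As a rough illustration of what k-means does, here is a small sketch in plain Python. It is not Mahout's distributed implementation. The two features (days since last login, total logins), the six sample users, the use of 2 clusters instead of 5, and the fixed starting centroids are all assumptions; with Mahout, Canopy clustering is often run first to pick the starting centroids.

```python
def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's k-means on tuples of numbers, starting from given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: assignments no longer change the centroids
        centroids = updated
    return centroids, clusters


# Hypothetical users: (days since last login, total logins).
USERS = [(1, 50), (2, 45), (3, 55), (300, 2), (320, 1), (310, 3)]

if __name__ == "__main__":
    centres, groups = kmeans(USERS, [(1, 50), (300, 2)])
    print(centres)  # [(2.0, 50.0), (310.0, 2.0)]
```

In real data the features usually need scaling first, since a feature with a large numeric range otherwise dominates the distance.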

Classification model: I tried using naive Bayes to classify my personal messages as spam or not. Following the usual machine learning process, I split the historical data, trained a classifier, and ran each day's data through the classifier. The whole process was automated. And the story ends again! (Accepting reality.)
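The split, train, and classify process can be sketched with a tiny multinomial naive Bayes classifier in plain Python. This is not Mahout's implementation: the sample messages, the "spam"/"ham" labels, the whitespace tokenizer, and the add-one smoothing are assumptions chosen to keep the example short.

```python
from collections import Counter
from math import log


def train(samples):
    """samples: (label, text) pairs. Returns (doc counts, word counts, vocabulary)."""
    doc_counts, word_counts = Counter(), {}
    for label, text in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = log(doc_counts[label] / total_docs)  # log prior
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = word_counts[label][word]
            score += log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical historical messages, split into training and held-out sets.
HISTORY = [
    ("spam", "win a free prize now"),
    ("spam", "free offer click now"),
    ("ham", "meeting notes for monday"),
    ("ham", "lunch on monday with the team"),
    ("spam", "free prize offer"),
    ("ham", "team meeting monday"),
]
TRAINING, HELD_OUT = HISTORY[:4], HISTORY[4:]

if __name__ == "__main__":
    model = train(TRAINING)
    for expected, text in HELD_OUT:
        print(expected, classify(model, text))
```

The held-out messages play the role of the daily data: they are never seen during training and are only passed through the trained classifier.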

In fact there is more, and I will try to get it all written up.

Mahout has a real learning threshold and requires interdisciplinary knowledge. But if you keep studying, there is no gap you cannot cross. Stay optimistic and keep working!

4. Use cases of Mahout

Cases that have been organized into articles

      • Using R to analyze Mahout's user-based collaborative filtering algorithm (UserCF)
      • RHadoop practice series, part 3: implementing a MapReduce collaborative filtering algorithm in R
      • Building Mahout projects with Maven
      • Mahout recommendation algorithm API
      • Profiling the Mahout recommendation engine from the source code
      • Mahout step-by-step program development: item-based collaborative filtering (ItemCF)
      • Mahout step-by-step program development: k-means clustering
      • Building a job recommendation engine with Mahout
