Mahout Learning Roadmap

Source: Internet
Author: User

Preface

Mahout is a distinctive member of the Hadoop family: a distributed computing framework for machine learning and data mining built on Hadoop. It is an interdisciplinary product, and among the Hadoop family projects it is the one I consider the most competitive, the hardest to master, and the most worth learning.

Mahout lowers the big data barrier for data analysts, provides a basic algorithm library for algorithm engineers, provides data modeling standards for Hadoop developers, and gives operations staff a connection to Hadoop.

Mahout is the elephant driver, creating new intelligence on top of Hadoop!

Directory

  1. Mahout Introduction
  2. Mahout Learning Roadmap
  3. My Learning Experience
  4. Use Cases of Mahout

1. Mahout Introduction

Mahout is a distributed framework for machine learning and data mining built on Hadoop. It implements a number of data mining algorithms with MapReduce, which solves the problem of mining in parallel.

According to the book "Mahout in Action", Mahout implements three types of algorithms: recommendation, clustering, and classification.

The learning roadmap described below follows the approach of the "Mahout in Action" book.

2. Mahout Learning Roadmap

I have laid out the Mahout knowledge points in a figure, in the hope of helping others understand Mahout better.

What follows is my own learning experience. There are no shortcuts for anyone, but once you settle down and focus, it is not that difficult.

3. My Learning Experience

I previously spent about half a year studying Mahout. At the time there was very little Mahout documentation, and only a handful of documents in Chinese. That lasted until I found "Mahout in Action", which I read repeatedly. Do not worry about what to do first; just read it over and over. Only after the third reading did I feel I had a reasonable grasp of it.

I started with the recommendation algorithms, UserCF and ItemCF. I remember that the first time I presented this to a group, I designed a questionnaire. I listed 10 websites (6 IT websites, 2 blogs, and 2 social communities) and asked everyone to score them from 0 to 5: 0 means the website is unknown to you, and 1 to 5 indicates how much you like it.

Questionnaire result format:

User1, website1, 5
User1, website2, 2
User1, website3, 4
User2, website3, 2
User3, website3, 5
User4, website3, 0
.....

I used this questionnaire data to run Mahout's recommendation models! The results puzzled everyone: why was a given website recommended? So I dug into the Mahout source code and studied the algorithm implementations: the similarity matrix, distance measures, recommendation algorithms, model evaluation, and so on. Different business requirements call for different algorithms, and the choice affects the results. I sorted out all the concepts and keywords in the book (unfortunately I was not writing a blog at the time). It took three months at 12 hours a day to get through recommendation.

Then I applied it to real business. My job was to build job recommendations. The only data I had was user behavior: browsing jobs, adding jobs to favorites, and applying for jobs.
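To make the questionnaire experiment concrete, here is a minimal sketch of user-based collaborative filtering (UserCF) in plain Python. It is not Mahout's implementation: the sample ratings beyond the format shown above, the function names, the choice of cosine similarity, and the neighborhood size k are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical ratings in the questionnaire's "user, website, score" format.
RAW = """\
User1, website1, 5
User1, website2, 2
User1, website3, 4
User2, website1, 4
User2, website3, 5
User2, website4, 4
User3, website2, 5
User3, website3, 1
User3, website4, 2
User4, website3, 0
"""


def load_ratings(text):
    """Parse 'user, item, score' lines into {user: {item: score}}."""
    ratings = {}
    for line in text.strip().splitlines():
        user, item, score = [part.strip() for part in line.split(",")]
        score = float(score)
        if score > 0:  # a score of 0 means "unknown", so it is not a rating
            ratings.setdefault(user, {})[item] = score
    return ratings


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, k=2):
    """Estimate scores for unrated items from the k most similar users."""
    neighbors = sorted(
        ((cosine_similarity(ratings[user], other_ratings), other)
         for other, other_ratings in ratings.items() if other != user),
        reverse=True,
    )[:k]
    weighted_sum, weight_total = {}, {}
    for sim, other in neighbors:
        if sim <= 0:
            continue
        for item, score in ratings[other].items():
            if item in ratings[user]:
                continue  # only recommend items the user has not rated
            weighted_sum[item] = weighted_sum.get(item, 0.0) + sim * score
            weight_total[item] = weight_total.get(item, 0.0) + sim
    # Similarity-weighted average, highest estimate first.
    return sorted(
        ((weighted_sum[i] / weight_total[i], i) for i in weighted_sum),
        reverse=True,
    )


ratings = load_ratings(RAW)
print(recommend(ratings, "User1"))
```

Swapping the similarity function or the neighborhood size changes the estimates, which is one reason the same questionnaire can produce recommendations that look surprising at first.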

On the first attempt I applied a recommendation model directly, and the results were very poor.
There were two causes:

  • The positions are time-sensitive and may expire within three months, so the recommendation results contained many expired positions.
  • Much of the user behavior was historical, some of it two or three years old, so the recommendation results did not meet user expectations. I estimate that a user's job level may advance roughly every six months, so historical behavior cannot be used directly to compute recommendations for the current user.

Solution:
1. Filter the user behavior dataset and compute only on behavior from the last six months.
2. Filter the result set to exclude expired positions.
3. Compute with different algorithm models (as I recall, the item-based recommender with Tanimoto similarity gave the best results).
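The three steps above can be sketched in plain Python as follows. This is an illustration under assumptions, not the code used in the project: the event log, the job names, the reference date, the 182-day window, and all function names are hypothetical. Behavior is treated as a yes/no signal (the user interacted with the job or not), and item similarity is the Tanimoto coefficient, the size of the intersection divided by the size of the union of the two jobs' user sets.

```python
from datetime import date, timedelta

# Hypothetical behavior log: (user, job, day of the browse/favorite/apply action).
EVENTS = [
    ("u1", "job_a", date(2013, 11, 1)),
    ("u1", "job_old", date(2011, 5, 1)),   # stale behavior
    ("u2", "job_a", date(2013, 10, 10)),
    ("u2", "job_b", date(2013, 12, 1)),
    ("u2", "job_c", date(2013, 12, 5)),
    ("u3", "job_a", date(2013, 9, 15)),
    ("u3", "job_b", date(2013, 12, 20)),
    ("u3", "job_old", date(2011, 6, 1)),   # stale behavior
]
EXPIRED_JOBS = {"job_c"}  # hypothetical set of expired positions


def recent_events(events, today, days=182):
    """Step 1: keep only behavior from roughly the last six months."""
    cutoff = today - timedelta(days=days)
    return [(user, job) for user, job, day in events if day >= cutoff]


def tanimoto(a, b):
    """Tanimoto coefficient of two sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend(events, user, expired):
    """Step 3 (item-based, Tanimoto) plus step 2 (drop expired jobs)."""
    users_by_job = {}
    for u, job in events:
        users_by_job.setdefault(job, set()).add(u)
    seen = {job for u, job in events if u == user}
    scores = {}
    for candidate, candidate_users in users_by_job.items():
        if candidate in seen or candidate in expired:
            continue
        # Sum the similarity between the candidate and each job the user touched.
        score = sum(tanimoto(candidate_users, users_by_job[s]) for s in seen)
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=scores.get, reverse=True)


recent = recent_events(EVENTS, today=date(2014, 1, 1))
print(recommend(recent, "u1", EXPIRED_JOBS))
```

Filtering the input before computing similarities and filtering the output afterwards are independent steps, so each can be tuned separately, for example with a shorter window for fast-moving job categories.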

The recommendation results improved greatly. That is the end of this story! Although I did more work after that, the product was never launched because of a restructuring at the company. (A programmer's sorrow!)

Clustering model. I applied this algorithm to analyzing the activity of website users. Suppose a website has millions of registered users, and only part of them have logged in. We wanted to know what characterizes the users who have not logged in! I used Mahout's k-means and canopy for clustering, assuming that these users might fall into 5 large groups. In the end we got a result and shared it with the team. That is the end of this story. (Such is the sorrow of implementation!)
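As a rough illustration of the k-means step, here is a small single-machine sketch in plain Python. It is not Mahout's MapReduce implementation. The two-dimensional user features, the sample points, and the function names are hypothetical, and the initial centers are simply the first k points, whereas in Mahout canopy clustering can be used to choose the starting centers.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iterations=50):
    """Plain k-means; initial centers are the first k points (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical features per user: (visits per month, pages per visit).
USERS = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(USERS, k=2)
print(centroids)
print(clusters)
```

The number of groups k has to be chosen up front, which is why the analysis starts from an assumed group count.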

Classification model. I tried using naive Bayes to identify spam in my personal email. Following the standard machine learning workflow, I segmented historical data into keywords and trained a classifier. The data that arrives each day is then judged by the classifier, and the whole process is automated. And the story ends again! (Accept reality.)

There are in fact some other things as well, which I plan to write up.

Mahout has a certain learning threshold and requires interdisciplinary knowledge. As long as you keep learning, there is no obstacle you cannot get past! Stay optimistic and keep working!
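The train-then-classify workflow can be sketched with a small multinomial naive Bayes classifier in plain Python. This is an assumption-laden illustration, not Mahout's classifier: the training messages, the whitespace tokenizer (standing in for real word segmentation), the add-one smoothing, and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled history: (message text, label).
TRAINING = [
    ("win free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]


def tokenize(text):
    """Whitespace tokenizer; a stand-in for real word segmentation."""
    return text.lower().split()


def train(samples):
    """Count documents per label and word occurrences per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify(model, "free prize"))
print(classify(model, "monday meeting report"))
```

In a daily automated run, the model would be trained once on the historical data and then applied to each new message as it arrives.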

4. Use Cases of Mahout

Cases that have been written up as articles

    • Parsing Mahout's user-based collaborative filtering recommendation algorithm (UserCF) with R
    • RHadoop Practice Series, Part 3: Implementing a MapReduce collaborative filtering algorithm in R
    • Building a Mahout project with Maven
    • Mahout recommendation algorithm API in detail
    • Analyzing the Mahout recommendation engine from the source code
    • Mahout step-by-step program development: item-based collaborative filtering (ItemCF)
    • Mahout step-by-step program development: clustering with k-means
    • Building a job recommendation engine with Mahout
