Mahout Learning Roadmap

Last Update: 2015-03-13 Source: Internet

Author: User

Preface

Mahout is a distinctive member of the Hadoop family: a distributed machine learning and data mining framework built on Hadoop. It is an interdisciplinary product, and of all the projects in the Hadoop family I consider it among the most competitive, the hardest to master, and the most worth learning.

Mahout lowers the big data barrier for data analysts, provides a base algorithm library for algorithm engineers, provides data modeling standards for Hadoop developers, and gives operations staff a connection to Hadoop.

Mahout is a trainer, creating new intelligence on Hadoop!

Directory

Mahout Introduction
Mahout Learning Roadmap
My Learning Experience
Use Cases of Mahout

1. Mahout Introduction

Mahout is a distributed machine learning and data mining framework built on Hadoop. It uses MapReduce to implement a number of data mining algorithms, which addresses the problem of mining data in parallel.

According to the book "Mahout in Action", Mahout implements three types of algorithms: recommendation, clustering, and classification.

The learning roadmap described below follows the "Mahout in Action" book.

2. Mahout Learning Roadmap

I have laid out the Mahout knowledge points in the figure, in the hope of helping others understand Mahout better.

Next comes my own learning experience. There are no shortcuts for anyone, but it is not that hard once you settle down and focus.

3. My Learning Experience

I previously spent about half a year studying Mahout. At the time there was very little Mahout documentation, and even less of it in Chinese. Once I found "Mahout in Action", I read it repeatedly. I did not worry about what to do first; I simply read it over and over. Only after the third reading did I have a reasonable mental grasp of it.

I started with the recommendation algorithms, UserCF and ItemCF. I remember that the first time I presented to the group, I also designed a questionnaire. I listed 10 websites (6 IT websites, 2 blogs, and 2 social communities) and had everyone rate each one from 0 to 5: 0 meant they did not know the site, and 1 to 5 indicated how much they liked it.

Questionnaire result format:

User1, website1, 5
User1, website2, 2
User1, website3, 4
User2, website3, 2
User3, website3, 5
User4, website3, 0
.....

I used this questionnaire to simulate Mahout's recommendation model! Everyone found the results strange: why was this item being recommended? So I dug into the Mahout source code and studied the algorithm implementations, learning about similarity matrices, distance measures, recommendation algorithms, model evaluation, and so on. Different business requirements call for different algorithms, and the choice affects the results. I sorted out all the concepts and keywords in the book (unfortunately I was not writing a blog at the time). It took three months at 12 hours a day to work through recommendation.
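To make the idea concrete, here is a minimal Python sketch of user-based collaborative filtering over rows in the questionnaire format above. It is not Mahout's implementation: the function names, the sample rows, and the simple distance-based similarity (1 / (1 + Euclidean distance) over co-rated items) are assumptions chosen for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical sample in the questionnaire format "user, item, score".
SAMPLE_ROWS = [
    "User1, website1, 5",
    "User1, website2, 2",
    "User1, website3, 4",
    "User2, website1, 5",
    "User2, website3, 4",
    "User3, website1, 1",
    "User3, website2, 5",
    "User4, website3, 0",
]

def load_ratings(rows):
    """Parse rows into {user: {item: score}}; a score of 0 means 'unknown' and is skipped."""
    ratings = defaultdict(dict)
    for row in rows:
        user, item, score = (field.strip() for field in row.split(","))
        if int(score) > 0:
            ratings[user][item] = int(score)
    return ratings

def similarity(a, b):
    """Simplified similarity: 1 / (1 + Euclidean distance) over co-rated items."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = sqrt(sum((a[i] - b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)

def recommend(ratings, user, top_n=3):
    """Score unseen items by the similarity-weighted average of other users' ratings."""
    mine = ratings.get(user, {})
    totals = defaultdict(float)
    weights = defaultdict(float)
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = similarity(mine, theirs)
        if sim <= 0:
            continue
        for item, score in theirs.items():
            if item not in mine:
                totals[item] += sim * score
                weights[item] += sim
    ranked = sorted(
        ((item, totals[item] / weights[item]) for item in totals),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]
```

Calling recommend(load_ratings(SAMPLE_ROWS), "User2") ranks the websites User2 has not rated. Swapping in a different similarity function changes the ranking, which is one reason results can look surprising at first.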

I then applied this to a real business problem. My work was job recommendation, and the only data I had was user behavior: browsing jobs, favoriting jobs, and applying for jobs.

On the first attempt I applied the recommendation model directly, and the results were very poor.
There were two causes:

1. Job postings are time-sensitive and may expire within three months, yet the recommendation results contained many expired postings.
2. Much of the user behavior was historical, some of it two or three years old, so the recommendations did not match what users expected. I estimate that a user's own position may advance roughly every six months, so historical behavior cannot be used directly in calculations for the user as they are today.

Solution:
1. Filter the user behavior dataset and compute only on behavior from the last six months.
2. Filter the result set to exclude expired postings.
3. Compute with several different algorithm models (as I recall, item-based recommendation with Tanimoto similarity gave the best results).
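The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the original system or Mahout's API: the (user, job, date) tuple layout, the 182-day window standing in for "six months", and all function names are assumptions. The Tanimoto coefficient itself is the standard one: the number of users two jobs share, divided by the number of users who touched either job.

```python
from collections import defaultdict
from datetime import date, timedelta

def recent_behaviors(behaviors, today, window_days=182):
    """Step 1: keep only (user, job, day) records from roughly the last six months."""
    cutoff = today - timedelta(days=window_days)
    return [b for b in behaviors if b[2] >= cutoff]

def tanimoto(users_a, users_b):
    """Tanimoto coefficient of two user sets: |A & B| / |A | B|."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0

def recommend_jobs(behaviors, expired_jobs, user, today, top_n=5):
    """Item-based scoring on recent behavior, with expired jobs filtered out."""
    users_by_job = defaultdict(set)
    for who, job, _day in recent_behaviors(behaviors, today):
        users_by_job[job].add(who)
    seen = {job for job, users in users_by_job.items() if user in users}
    scores = {}
    for candidate, users in users_by_job.items():
        # Step 2: never recommend expired jobs (or jobs the user already acted on).
        if candidate in seen or candidate in expired_jobs:
            continue
        # Step 3: sum the Tanimoto similarity to each job the user interacted with.
        score = sum(tanimoto(users, users_by_job[job]) for job in seen)
        if score > 0:
            scores[candidate] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical behavior log for illustration.
SAMPLE_BEHAVIORS = [
    ("u1", "jobA", date(2015, 2, 1)),
    ("u1", "jobB", date(2015, 2, 2)),
    ("u2", "jobA", date(2015, 1, 10)),
    ("u2", "jobB", date(2015, 1, 11)),
    ("u2", "jobC", date(2015, 1, 12)),
    ("u3", "jobA", date(2015, 3, 1)),
    ("u3", "jobD", date(2015, 3, 2)),
    ("u3", "jobE", date(2012, 6, 1)),  # too old, dropped by the time filter
]
```

Filtering before computing similarity means stale behavior never influences the scores, while the expired-job filter is applied to the candidates so an expired posting can still serve as evidence of a user's interests.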

The recommendation results improved greatly, and that is where the story ends! Although I did more work after this, the product was never launched because of a company restructuring. (A programmer's sorrow!)

Clustering model: I applied clustering to an activity analysis of website users. Suppose a website has registered users numbering in the millions, and many of them have never logged in. We very much wanted to know what characterizes those never-logged-in users! I used Mahout's k-means and canopy algorithms for the clustering, assuming the users might fall into 5 large groups. In the end we got a result and shared it with the team, and that is where the story ends. (Such is the sad lot of implementation work!)
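For readers new to k-means, the following is a minimal single-machine Python sketch of the algorithm, not Mahout's distributed implementation. The two-dimensional points and the initial centers are made-up values; in practice the features would be derived from user attributes, and a method such as canopy clustering is commonly used to pick the starting centers.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, initial_centers, max_iterations=20):
    """Alternate between assigning points to centers and moving centers to cluster means."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for point in points:
            clusters[nearest_center(point, centers)].append(point)
        new_centers = [
            # An empty cluster keeps its previous center.
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    labels = [nearest_center(point, centers) for point in points]
    return centers, labels

# Hypothetical two-feature user vectors and starting centers.
SAMPLE_POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
SAMPLE_CENTERS = [(0.0, 0.0), (10.0, 10.0)]
```

Running kmeans(SAMPLE_POINTS, SAMPLE_CENTERS) returns the final centers and one cluster label per point. The number of clusters is fixed by the number of initial centers, which is why the article's assumption of 5 groups has to be made up front.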

Classification model: I tried using naive Bayes to classify my personal email as spam or not spam. Following the usual machine learning workflow, I trained a classifier on keywords obtained by word segmentation of historical data. The data generated each day was then judged by the classifier, and the entire process was automated. And the story ended there again! (Accept reality.)
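As a rough illustration of that workflow, here is a small multinomial naive Bayes sketch in Python with add-one (Laplace) smoothing. It is not Mahout's classifier: the class name, the training messages, and the use of whitespace splitting in place of real word segmentation are all assumptions made to keep the example self-contained.

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayesSketch:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training messages
        self.vocabulary = set()

    def tokenize(self, text):
        # Whitespace splitting stands in for a real word segmenter.
        return text.lower().split()

    def train(self, label, text):
        words = self.tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log posterior, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            score = log(self.doc_counts[label] / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]  # 0 for unseen words
                score += log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training messages.
classifier = NaiveBayesSketch()
classifier.train("spam", "win free prize now")
classifier.train("spam", "free money offer")
classifier.train("ham", "meeting agenda for monday")
classifier.train("ham", "project report attached")
```

Once trained on historical mail, classifier.classify(message) can be run on each day's new messages, which mirrors the train-then-judge loop described above. Working in log space avoids underflow when messages contain many words.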

In fact, there are a few other ideas that I still want to try out.

Mahout has a real learning curve and requires interdisciplinary knowledge. But as long as you keep learning, no barrier is insurmountable! Stay optimistic and keep working!

4. Use Cases of Mahout

Cases that have been compiled into articles

Using R to analyze Mahout's user-based collaborative filtering algorithm (UserCF)
RHadoop Practice Series 3: Implementing a MapReduce collaborative filtering algorithm in R
Building a Mahout project with Maven
Mahout recommendation algorithm API in detail
Analyzing the Mahout recommendation engine from the source code
Mahout step-by-step program development: item-based collaborative filtering (ItemCF)
Mahout step-by-step program development: clustering with k-means
Building a job recommendation engine with Mahout

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
