Data mining applications in Hadoop: Mahout learning notes (Part 3)

Last Update:2015-08-29 Source: Internet

Author: User

I was fortunate enough to get a place in the Hadoop trial class offered through MOOC College.
These are my notes on the Little Elephant College Hadoop 2.X course.

Since most of my day-to-day work is data mining, I watched the Mahout videos first.

Mahout scales well and is fault tolerant (it is built on HDFS and MapReduce), and it implements most of the commonly used data mining algorithms (clustering, classification, recommendation). In data mining, however, the algorithms are only an aid; business understanding is the key. Personally, I feel that if you really want to learn the subject, it is better to take a regular machine learning course.

Most of the comparisons between techniques are omitted from these notes.

Although Mahout has a natural advantage in speed, R and Python can in fact also access Hadoop, for example through RHadoop.
And as the article [Don't talk about Hadoop, your data isn't big enough](http://geek.csdn.net/news/detail/2780) argues, there is still not much point in wrestling with Hadoop for lightweight data. Using Mahout on Hadoop presupposes a very large amount of data.

If you are reading this anywhere other than Cnblogs (the "blog park"), note that this post comes from http://www.cnblogs.com/weibaar.

Only the Cnblogs version is guaranteed to have a clean layout with code blocks and images displayed correctly. If you repost it on another site, please keep the author information and respect the copyright.

I. Overview of the course

1. General Introduction
2. Clustering algorithm
3. Classification algorithm
4. Recommendation algorithm

II. Clustering algorithm

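A typical clustering scenario is news clustering (working out which articles are related to each other). The most common algorithm is k-means.
The basic process is to specify the number of clusters, choose initial center points, assign every point to its nearest center, recompute each center as the mean of the points assigned to it, and repeat until the result is stable. The end result is a grouping, for example of products.
In Mahout the steps are as follows.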

1. Extracting features

Segment the news text into words and encode them: for example, record which words appear in Doc1 and convert that into a multidimensional 0/1 vector.
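The 0/1 encoding can be sketched in plain Python as follows. This is only an illustration, not Mahout code: the sample sentences and all function names are made up, and the tokenizer simply splits English text on non-alphanumeric characters, whereas Chinese news text would need a real word segmenter.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(docs):
    """Give every distinct word a fixed column index (sorted for stability)."""
    words = sorted({w for doc in docs for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}

def to_binary_vector(doc, vocab):
    """Return a 0/1 vector with a 1 in the column of every word in the document."""
    present = set(tokenize(doc))
    vector = [0] * len(vocab)
    for word, index in vocab.items():
        if word in present:
            vector[index] = 1
    return vector

# Hypothetical sample documents
docs = ["Hadoop runs MapReduce jobs", "Mahout runs on Hadoop"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(d, vocab) for d in docs]
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each document becomes one row whose length equals the vocabulary size, which is why most entries are zero once the vocabulary grows.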

2. Feature vectorization

The multidimensional vector from the previous step wastes space, so it needs a different representation. Mahout provides tools, based on Lucene or otherwise, to convert these features into its vector format.
In short, the goal is an ordered, space-saving feature representation, which is finally stored in SequenceFile format.
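Why the 0/1 vector wastes space, and what a space-saving form might look like, can be shown with a small sketch. It assumes a simple "dimension plus index-to-value map" layout chosen for illustration; it does not reproduce Mahout's own vector classes or the SequenceFile layout.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}, plus the dimension."""
    return len(dense), {i: v for i, v in enumerate(dense) if v != 0}

def to_dense(size, entries):
    """Rebuild the full vector from the sparse form."""
    dense = [0] * size
    for index, value in entries.items():
        dense[index] = value
    return dense

dense = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
size, entries = to_sparse(dense)
print(size, entries)  # only two of the ten positions need to be stored
```

The conversion is lossless, so nothing is given up by storing only the non-zero positions.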

3. Implement clustering with k-means

Run bin/mahout kmeans \ followed by its options, adjusted to match the vector format produced in the previous step.
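To make the k-means procedure from this section concrete, here is a minimal single-machine sketch in plain Python. It is not what bin/mahout kmeans runs internally (Mahout distributes the work over MapReduce); the sample points, the function names, and the choice to pass the initial centers in explicitly are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def kmeans(points, centers, distance=euclidean, max_iter=20):
    """Alternate between assigning points and recomputing centers."""
    centers = [list(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point
        new_assignment = [
            min(range(len(centers)), key=lambda k: distance(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the clustering is stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = mean(members)
    return centers, assignment

# Hypothetical 2-D data with two obvious groups
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centers, labels = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

Passing the distance function as a parameter keeps the loop independent of how "nearest" is defined.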

In addition, Mahout provides several ways of computing the distance between vectors in org.apache.mahout.common.distance.

So one way to tune k-means is to change the distance measure.
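The effect of swapping the distance measure can be seen in a small sketch. The three functions below are plain-Python versions of the Euclidean, Manhattan, and cosine distances; the point and the two centers are made-up values chosen so that the measures disagree.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors (ignores length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(point, centers, distance):
    """Index of the center closest to the point under the given measure."""
    return min(range(len(centers)), key=lambda k: distance(point, centers[k]))

# Hypothetical values: the point is close to center 0 in space,
# but points in exactly the same direction as center 1.
centers = [[1.0, 0.0], [10.0, 10.0]]
point = [3.0, 3.0]
for measure in (euclidean, manhattan, cosine_distance):
    print(measure.__name__, nearest(point, centers, measure))
```

Because the same point lands in a different cluster depending on the measure, the choice of measure is a real tuning knob, especially for text vectors where direction often matters more than length.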

Canopy algorithm: finding good initial points

Canopy is commonly used in conjunction with other clustering methods.
For example, it can help k-means determine its initial points.

Roughly, it picks a point at random, counts the points that lie within different distances of it, iterates this calculation, and ends up with good-quality initial points.
(By default k-means starts from random points; if canopy is specified it can find better initial points, so this is presumably another way to improve the result.)
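A minimal sketch of canopy selection is shown below. It follows the usual textbook formulation with a loose threshold t1 and a tight threshold t2, which is more specific than the description in these notes, so treat the thresholds, names, and sample data as assumptions. To keep the output reproducible it takes the first remaining point instead of a random one.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, distance=euclidean):
    """Return a list of (center, members) pairs.

    t1 is the loose threshold (membership), t2 the tight one (removal).
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # a random pick in the usual description
        # Every point within t1 joins this canopy
        members = [center] + [p for p in remaining if distance(center, p) < t1]
        # Points within t2 are too close to start a canopy of their own
        remaining = [p for p in remaining if distance(center, p) >= t2]
        canopies.append((center, members))
    return canopies

# Hypothetical 2-D data with two obvious groups
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
canopies = canopy(points, t1=4.0, t2=2.0)
initial_centers = [center for center, _ in canopies]
print(initial_centers)
```

The resulting centers, and their count, can then be handed to k-means in place of random starting points.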

III. Classification algorithm

Classification is a supervised machine learning technique: the data has already been classified, and the question is which factors let us quickly decide which category a given record belongs to.
So the application steps are to train a classification model on the training set, test and tune it, and then use it in the online product.

After that, the classification model is called again with other input parameters.

Two indicators are commonly used in model evaluation: the confusion matrix and AUC.
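Both indicators can be computed by hand for a binary classifier, as in the sketch below. The labels, scores, and the 0.5 decision threshold are made-up values, and AUC is computed from its pairwise definition (the share of positive/negative pairs in which the positive example gets the higher score), which is fine for small data but not how a library would do it at scale.

```python
def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels encoded as 1 and 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def auc(actual, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and model scores
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_matrix(actual, predicted))
print(auc(actual, scores))
```

The confusion matrix depends on the chosen threshold, while AUC looks only at how the scores rank the examples, which is why the two are usually reported together.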

IV. Recommendation algorithm

The input is: which user gave which item how many points.
Preference: the user's inclination, which can be expressed as scores in a user-item matrix.

A missing score is predicted either from the ratings that similar users gave the item (user-based, which focuses on finding similar users) or from the ratings the same user gave to similar items (item-based, which focuses on finding similar items and uses item similarity as the weight when filling in the value).

User-based recommendation gives better results.
Item-based recommendation is somewhat less effective, but it is computationally efficient and suitable for real-time recommendation systems.
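The two approaches can be compared on a toy user-item matrix. The sketch below is plain Python, not the Taste API; the users, items, and scores are invented, and cosine similarity over co-rated entries is just one possible similarity measure.

```python
import math

# Hypothetical ratings: user -> item -> score
ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 4, "item2": 2, "item3": 5, "item4": 4},
    "carol": {"item1": 1, "item2": 5, "item4": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity over the keys that both rating dicts share."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in shared))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in shared))
    return dot / (norm_a * norm_b)

def transpose(ratings):
    """Turn user -> item -> score into item -> user -> score."""
    by_item = {}
    for user, items in ratings.items():
        for item, score in items.items():
            by_item.setdefault(item, {})[user] = score
    return by_item

def predict_user_based(ratings, user, item):
    """Average other users' scores for the item, weighted by user similarity."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None

def predict_item_based(ratings, user, item):
    """Average the user's own scores, weighted by item similarity."""
    by_item = transpose(ratings)
    target = by_item.get(item, {})
    num = den = 0.0
    for other_item, score in ratings[user].items():
        if other_item == item:
            continue
        sim = cosine_similarity(target, by_item[other_item])
        num += sim * score
        den += sim
    return num / den if den else None

user_pred = predict_user_based(ratings, "alice", "item4")
item_pred = predict_item_based(ratings, "alice", "item4")
print(round(user_pred, 2), round(item_pred, 2))
```

In practice the item-to-item similarities can be precomputed offline, which is what makes the item-based variant attractive for real-time use.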

Mahout comes with Taste, a recommendation system implementation: a Java-based collaborative filtering engine that is reliable and efficient.
