Data mining applications on Hadoop with Mahout: learning notes (3)

Source: Internet
Author: User

I was fortunate enough to get a place in the Hadoop trial class at MOOC College.
These are my notes on Little Elephant College's Hadoop 2.x course.

Since most of my usual work is data mining, I watched the Mahout videos first.

Mahout scales well and is fault tolerant (it is built on HDFS and MapReduce), and it implements most of the commonly used data mining algorithms (clustering, classification, recommendation). But parameter tuning and business understanding are the key to data mining. Personally, I feel that if you really want to learn this, it is better to take a regular machine learning course.

Most of the technology comparisons are omitted from these notes ...

Mahout has a natural advantage in speed, but R and Python can in fact access Hadoop as well, for example through RHadoop.
And as the article [Don't talk about Hadoop, your data isn't big enough](http://geek.csdn.net/news/detail/2780) argues, for lightweight data there is still little need to bother with Hadoop. The premise for using Mahout on Hadoop should be a very large amount of data.

If you are not reading this on Cnblogs (Blog Park), note that this post comes from http://www.cnblogs.com/weibaar

Only the Cnblogs version is guaranteed to have a clean layout with code blocks and images displayed correctly. If you repost it on another site, please keep the author information and respect the copyright.

I. Overview of the course

1. General Introduction
2. Clustering algorithm
3. Classification algorithm
4. Recommendation algorithm

II. Clustering algorithm

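Clustering is used in scenarios such as news clustering (working out which articles are related to each other). The most common method is k-means clustering.
The basic process is to specify the number of clusters, choose initial center points, assign each point to its nearest center, recompute each center as the mean of its assigned points, and repeat until the centers stop moving. The end result is, for example, a grouping of products.
In Mahout, the steps are as follows.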

1. Extract features

Segment the news text into words and encode the words, for example by recording which words appear in Doc1, then convert each document into a 0-1 multidimensional vector.
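
As a rough illustration of the 0-1 encoding described above, here is a minimal Python sketch. It is not Mahout code: the two documents are made up, and splitting on whitespace is an assumption that stands in for a real word segmenter.

```python
# Hypothetical toy documents; whitespace splitting stands in for a real tokenizer.
docs = {
    "doc1": "stocks fall as markets react",
    "doc2": "markets rally as stocks rise",
}


def build_vocabulary(documents):
    """Return a sorted list of every distinct word across all documents."""
    words = set()
    for text in documents.values():
        words.update(text.lower().split())
    return sorted(words)


def to_binary_vector(text, vocabulary):
    """Return a 0-1 vector: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


vocabulary = build_vocabulary(docs)
vectors = {name: to_binary_vector(text, vocabulary) for name, text in docs.items()}

for name, vector in vectors.items():
    print(name, vector)
```

Each vector has one position per vocabulary word, so with a realistic vocabulary almost every entry is 0.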

2. Vectorize the features (multidimensional vectors)

The multidimensional vector from the previous step wastes space, so a different representation is needed. Mahout provides Lucene-based and other tools to convert these features into its vector format.
In short, the goal is an ordered, space-saving feature representation, which is finally stored in SequenceFile format.
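
To show why a different representation saves space, the sketch below keeps only the nonzero entries of a 0-1 vector as ordered (index, value) pairs. This is only an illustration of the idea; it is not Mahout's vector classes or the SequenceFile layout, and the function names are hypothetical.

```python
def to_sparse(dense):
    """Keep only the nonzero entries as (index, value) pairs, ordered by index."""
    return [(index, value) for index, value in enumerate(dense) if value != 0]


def to_dense(sparse, size):
    """Rebuild the full vector from its (index, value) pairs."""
    dense = [0] * size
    for index, value in sparse:
        dense[index] = value
    return dense


dense = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
sparse = to_sparse(dense)

print(sparse)
print(to_dense(sparse, len(dense)) == dense)
```

The sparse form stores three pairs instead of ten numbers, and the saving grows with the size of the vocabulary.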

3. Run the clustering with k-means

bin/mahout kmeans \ followed by the corresponding adjustable parameters

In addition, Mahout provides several distance measures between vectors in org.apache.mahout.common.distance

So one way to tune k-means is to change the distance measure used between vectors.
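
The following is a minimal, single-machine sketch of the k-means loop with a pluggable distance function, to make the tuning point concrete. It is not how Mahout implements k-means on MapReduce. The sample points, the initial centers, and the function names are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(points, centers, distance=euclidean, max_iter=20):
    """Assign each point to its nearest center, then move centers to cluster means."""
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(clusters)
```

Passing `distance=manhattan` changes only the assignment step. The center update is still the mean in this sketch, so the distance measure affects which points are grouped together rather than how a center is computed.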

Canopy algorithm: finding good initial points

It is commonly used in conjunction with other clustering methods.
For example, the Canopy algorithm can help k-means determine its initial points.

It randomly selects a point, counts the points that fall within different distance thresholds of it, then repeats the calculation on the remaining points, and finally arrives at good-quality initial points.
(k-means uses random initial points by default. If you specify canopy, it can find better initial points. This improvement should also count as a form of parameter tuning.)
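
A minimal sketch of canopy selection is given below, assuming the usual formulation with a loose threshold t1 and a tight threshold t2 (t1 > t2). The threshold values and sample points are made up, and the sketch takes the first remaining point instead of a random one so that the result is repeatable. It is not Mahout's implementation.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_canopies(points, t1, t2, distance=euclidean):
    """Return a list of (center, members) canopies; t1 is loose, t2 is tight."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # The usual description picks this point at random; taking the first
        # one keeps the sketch deterministic.
        center = remaining.pop(0)
        members = [center] + [p for p in remaining if distance(center, p) < t1]
        canopies.append((center, members))
        # Points inside the tight threshold can no longer start a new canopy.
        remaining = [p for p in remaining if distance(center, p) >= t2]
    return canopies


# Hypothetical data and thresholds.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
canopies = build_canopies(points, t1=4.0, t2=2.0)
initial_centers = [center for center, _ in canopies]
print(initial_centers)
```

The canopy centers can then be handed to k-means as its starting points, and the number of canopies also suggests a value for the number of clusters.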

III. Classification algorithm

Classification is a supervised machine learning technique: the categories are already known, and the question is which factors let us quickly decide the category a piece of data belongs to.
So the application steps are to use the training set to obtain a classification model, then test and tune it, then use it in the online product.

For the other parameters of each call, refer to the specific classification model.

Two indicators commonly used in model evaluation: the confusion matrix and AUC
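
To make the two evaluation indicators concrete, here is a small sketch that computes a binary confusion matrix and AUC by hand. The labels, scores, and the 0.5 cut-off are made-up assumptions, and AUC is computed here as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half.

```python
def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def auc(actual, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test-set labels and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_matrix(actual, predicted))
print(auc(actual, scores))
```

The confusion matrix depends on the chosen cut-off, while AUC looks only at how the scores order the examples, which is why the two are often reported together.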

IV. Recommendation algorithm

The input data is: which user gave which item how many points.
Preference: a user's inclination, which can be expressed as ratings in a user-item matrix.

A missing rating can be predicted in two ways. One uses other users' ratings of the item (user-based: the focus is on finding similar users). The other uses the same user's ratings of other items (item-based: the focus is on finding similar items, with item similarity used as the weight when filling in the rating).

User-based recommendation gives better results.
Item-based recommendation is somewhat less accurate, but it is computationally efficient and suitable for real-time recommendation systems.
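
The two approaches can be sketched on a tiny ratings table as follows. This is an illustrative sketch, not Mahout's Taste API: the users, items, and scores are invented, cosine similarity over shared ratings is one assumed choice of similarity, and all function names are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: score}
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity computed over the keys that both rating dicts share."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(u[k] ** 2 for k in shared))
    norm_v = math.sqrt(sum(v[k] ** 2 for k in shared))
    return dot / (norm_u * norm_v)


def predict_user_based(ratings, user, item):
    """Average other users' ratings of the item, weighted by user similarity."""
    num = den = 0.0
    for other, their in ratings.items():
        if other != user and item in their:
            sim = cosine(ratings[user], their)
            num += sim * their[item]
            den += abs(sim)
    return num / den if den else None


def item_columns(ratings):
    """Transpose the table to item -> {user: score}."""
    columns = {}
    for user, their in ratings.items():
        for item, score in their.items():
            columns.setdefault(item, {})[user] = score
    return columns


def predict_item_based(ratings, user, item):
    """Average the user's own ratings of other items, weighted by item similarity."""
    columns = item_columns(ratings)
    if item not in columns:
        return None
    num = den = 0.0
    for other_item, score in ratings[user].items():
        if other_item != item:
            sim = cosine(columns[item], columns[other_item])
            num += sim * score
            den += abs(sim)
    return num / den if den else None


print(predict_user_based(ratings, "alice", "D"))
print(predict_item_based(ratings, "alice", "D"))
```

In the item-based version the item-to-item similarities do not depend on the user asking, so they can be computed ahead of time, which fits the point above about real-time systems.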

Mahout comes with the Taste recommendation system implementation: a Java-based collaborative filtering library that serves as a reliable and efficient recommendation engine.
