I was fortunate enough to take the Hadoop trial class offered through MOOC College.
These are my notes on the Little Elephant College Hadoop 2.X course.
Since most of my day-to-day work is data mining, I watched the Mahout videos first.
Mahout is extensible and fault tolerant (it is built on HDFS and MapReduce) and implements most of the commonly used data mining algorithms (clustering, classification, recommendation). But data mining is only a supporting tool and business understanding is the key. Personally, I feel that if you really want to learn this, it is better to take a regular machine learning course.
Most of the more technical material is omitted from these notes ...
Mahout has a natural advantage in speed, but R and Python can also access Hadoop, for example through RHadoop.
And as the article [Don't talk about Hadoop, your data isn't big enough](http://geek.csdn.net/news/detail/2780) argues, for lightweight data there is still little need to bother with Hadoop. Using Mahout on Hadoop presupposes a very large amount of data.
If you are not reading this on Blog Park (cnblogs), note that this post comes from http://www.cnblogs.com/weibaar
Only the Blog Park version is guaranteed to have a clean layout with code blocks and pictures displayed correctly. Other sites, please keep the author information and respect copyright.
I. Overview of the course
1. General Introduction
2. Clustering algorithm
3. Classification algorithm
4. Recommendation algorithm
II. Clustering algorithm
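A typical clustering scenario is news clustering (working out which articles are about the same story). The most common algorithm is K-means clustering.
The basic process is to specify the number of clusters, pick initial center points, assign every point to its nearest center, recompute each center as the mean of its cluster, and repeat until the centers stop moving. The end result is, for example, a grouping of products.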
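The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the general K-means idea on in-memory tuples, not Mahout's MapReduce implementation; the function names and the fixed random seed are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers picked at random
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New center = mean of the cluster, coordinate by coordinate.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, assign(points, centers)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

On this toy data the two obvious groups end up in different clusters; which group gets label 0 depends on the random initial centers.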
In Mahout, the steps are:
1. Extract features
Tokenize the news text and encode the words: record which words appear in each document (Doc1 and so on) and convert each document into a 0-1 multidimensional vector.
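As a rough illustration of this step, the sketch below turns two short documents into 0-1 vectors. The whitespace tokenizer and the function names are assumptions for the example; real news text (especially Chinese) would need a proper word segmenter, and Mahout does this with its own tooling rather than with code like this.

```python
def build_vocabulary(docs):
    """Map every distinct word to a fixed column index (alphabetical order)."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def to_binary_vector(doc, vocab):
    """1 if the vocabulary word appears in the document, else 0."""
    present = set(doc.lower().split())
    return [1 if word in present else 0 for word in vocab]


docs = ["stocks fall as markets close", "markets rally as stocks rise"]
vocab = build_vocabulary(docs)
for doc in docs:
    print(to_binary_vector(doc, vocab))
```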
2. Feature vectorization
The dense multidimensional vector from the previous step wastes space, so a different representation is needed. Mahout provides Lucene-based and other tools to convert these features into vector format.
In short, the goal is an ordered, space-saving feature representation, which is finally stored in SequenceFile format.
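To see why a different representation saves space, here is a minimal sketch that stores only the non-zero positions of a vector. The dictionary layout and helper names are assumptions for illustration; this is not Mahout's vector class or the SequenceFile format itself.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def to_dense(sparse, size):
    """Rebuild the full vector from the sparse form."""
    return [sparse.get(i, 0) for i in range(size)]


dense = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
sparse = to_sparse(dense)
print(sparse)                       # two stored entries instead of ten
print(to_dense(sparse, len(dense)))
```

With a vocabulary of tens of thousands of words and documents that use only a few hundred of them, nearly all entries are zero, so the saving is large.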
3. Implement clustering with K-means
Run bin/mahout kmeans \ with the options adjusted to match the input format.
In addition, Mahout provides several distance measures between vectors in org.apache.mahout.common.distance
So one way to tune K-means is to change the distance measure.
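To make the effect of the distance measure concrete, the sketch below implements three common ones in plain Python. They correspond in spirit to the Euclidean, Manhattan and cosine measures that Mahout ships, but the function names here are assumptions for the example, not Mahout's API.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention assumed here for all-zero vectors
    return 1.0 - dot / (norm_a * norm_b)


a, b = (1, 1), (2, 2)
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
```

The two vectors point in the same direction, so their cosine distance is essentially zero even though their Euclidean and Manhattan distances are not. For text vectors, where document length varies a lot, this difference can change which documents end up in the same cluster.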
Canopy algorithm: finding good initial points
It is commonly used in combination with other clustering methods.
For example, the Canopy algorithm can help K-means determine its initial points.
It picks a point at random, counts the points that fall within different distance thresholds of it, then repeats the calculation, and finally finds good-quality initial points.
(K-means starts from random points by default; if Canopy is specified, better initial points can be found. This improvement is also one of the supporting techniques.)
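The sketch below shows the usual two-threshold form of canopy clustering, which is one common way to realize the idea described above. The thresholds t1 and t2, the function name, and taking the first remaining point instead of a random one (to keep the output repeatable) are all assumptions for the example.

```python
import math


def canopy_centers(points, t1, t2):
    """Two-threshold canopy clustering (t1 > t2).

    Points within t1 of a center join its canopy; points within t2 are
    also removed from the candidate list so they cannot become centers.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # a random pick is typical; first item keeps this repeatable
        members = [center]
        still_candidates = []
        for p in remaining:
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(center, p)))
            if d < t1:
                members.append(p)
            if d >= t2:
                still_candidates.append(p)
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for center, members in canopy_centers(points, t1=3.0, t2=2.0):
    print(center, members)
```

The canopy centers can then be handed to K-means as its starting centers, which also gives a value for the number of clusters instead of guessing it.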
III. Classification algorithm
Classification is a supervised machine learning technique: the data has already been labeled, and the question is which factors let us quickly decide which category a new piece of data belongs to.
So the workflow is to use the training set to obtain a classification model, test and tune it, and then use it in the online product.
For the remaining parameters, refer to the documentation of the classification model being called.
Two indicators commonly used in model evaluation: the confusion matrix and AUC
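Both indicators are easy to compute by hand for a binary classifier, which helps when reading evaluation output. The sketch below is a minimal plain-Python version; the function names and the toy labels and scores are made up for the example, and AUC is computed from its pairwise definition (the fraction of positive/negative pairs in which the positive example gets the higher score).

```python
def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives/negatives for 0-1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def auc(y_true, scores):
    """Probability that a random positive scores above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]                 # hard predictions at some threshold
scores = [0.9, 0.4, 0.3, 0.6, 0.8, 0.1]     # raw model scores
print(confusion_matrix(y_true, y_pred))
print(auc(y_true, scores))
```

The confusion matrix depends on the threshold used to turn scores into labels, while AUC summarizes the ranking quality across all thresholds, which is why the two are usually reported together.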
IV. Recommendation algorithm
The input is: which user gave which item how many points.
Preference: the user's inclination toward an item, which can be scored in a user-item matrix.
A missing rating is filled in either from similar users' ratings of the item (user-based, which focuses on finding similar users) or from the user's own ratings of similar items (item-based, which focuses on finding similar items and uses item similarity as the weight).
User-based recommendation gives better results and works well for users.
Item-based recommendation is less effective, but it is computationally efficient and suitable for real-time recommendation systems.
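The two approaches differ only in which axis of the user-item matrix is compared. The sketch below predicts a missing rating as a similarity-weighted average, first user-based and then item-based by transposing the same data. It is a toy in-memory illustration, not Mahout's Taste API; the names, the ratings, and the choice of cosine similarity over co-rated entries are assumptions for the example.

```python
import math

# Hypothetical ratings: user -> {item: score}
RATINGS = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 3, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine_sim(r1, r2):
    """Cosine similarity over the keys both rating dicts share."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[k] * r2[k] for k in common)
    n1 = math.sqrt(sum(r1[k] ** 2 for k in common))
    n2 = math.sqrt(sum(r2[k] ** 2 for k in common))
    return dot / (n1 * n2)


def predict(ratings, row, col):
    """Similarity-weighted average of the other rows' values in column col."""
    num = den = 0.0
    for other, values in ratings.items():
        if other == row or col not in values:
            continue
        sim = cosine_sim(ratings[row], values)
        num += sim * values[col]
        den += abs(sim)
    return num / den if den else None


def transpose(ratings):
    """Turn user -> {item: score} into item -> {user: score}."""
    flipped = {}
    for user, items in ratings.items():
        for item, score in items.items():
            flipped.setdefault(item, {})[user] = score
    return flipped


# User-based: weight other users' ratings of D by their similarity to alice.
print(predict(RATINGS, "alice", "D"))
# Item-based: weight alice's own ratings by each item's similarity to D.
print(predict(transpose(RATINGS), "D", "alice"))
```

In the item-based case the item-to-item similarities do not depend on the user being served, so they can be computed ahead of time, which is one reason that approach is attractive for real-time recommendation.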
Mahout comes with the Taste recommendation system: a Java-based collaborative filtering implementation that serves as a reliable and efficient recommendation engine.
Data mining applications in Hadoop: Mahout (learning notes, part three)