Test-running Mahout and an analysis of the K-means algorithm

Source: Internet
Author: User

Install and start a Hadoop cluster before using Mahout.

Upload the Mahout package to the Linux machine and unpack it.


The algorithms in Mahout fall into three broad categories:

clustering, collaborative filtering, and classification.

Among them:

Common clustering algorithms include canopy clustering, K-means, fuzzy K-means, hierarchical clustering, and LDA clustering.

Common classification algorithms include Bayesian classifiers, logistic regression, support vector machines, perceptrons, and neural networks.


Next, run a program from the Mahout examples jar to check that Mahout works correctly.

The practice data is the sample input used to test the K-means clustering algorithm.

Run the Mahout example program with the hadoop command (make sure the Hadoop cluster is running).

In the example code, the input path is /user/hadoop/testdata.

Upload the practice data to the corresponding testdata directory in HDFS.

The output path is hard-coded as /user/hadoop/output.

Execute command:

hadoop jar ~/mahout/mahout-examples-0.9-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

The job starts running.

Because the clustering algorithm is an iterative process (explained later),

it repeats the MapReduce job until the stopping condition is met (this can take a while).

When the run finishes, Mahout reports no exceptions.

After K-means finishes, the result files cannot be read in the usual way; opening them directly shows only unreadable binary data.

Use Mahout's seqdumper command to dump them to a file on the local Linux machine in readable form.

To view the results of the clustering analysis:

./mahout seqdumper -s /user/hadoop/output/data/part-m-0000 -o /home/hadoop/res

Then use the cat command to view the file:

cat res | more


Now, what is the K-means clustering algorithm?

A clustering algorithm groups data into categories according to the features we care about or the patterns in the data itself.

For example:

Given a jumbled sample of data, we want it sorted into categories (red beans with red beans, mung beans with mung beans, and so on).

The algorithm starts from initial centers for the n clusters (if none are set by hand, it starts from randomly chosen centers).

What does that mean? Take a look at the figure.

In the figure, the left-hand circle shows how the original data is partitioned around the random initial centers.

But clearly many data points assigned to Cluster1 lie close to Cluster2's points.

So K-means computes more suitable center points for the partition, according to a rule.

The rule is:

Compute the distance from each data point to the current centers of Cluster1 and Cluster2.

Each point is assigned to whichever center is closer (giving the partition shown in the middle circle).

The points in Cluster1 and Cluster2 are then averaged separately, and the two resulting means become the new centers of Cluster1 and Cluster2.
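The assign-then-average rule above can be sketched in a few lines of plain Python. This is only an illustration of one iteration, not Mahout's MapReduce implementation; the function names, the sample points, and the deliberately poor starting centers are all made up for the example, and Euclidean distance is assumed.

```python
import math

def assign(points, centers):
    """For each point, return the index of the nearest center (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def update(points, labels, centers):
    """Return new centers: the mean of the points assigned to each cluster."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old center
            continue
        new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centers

# Hypothetical data: two obvious groups, but badly placed starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers = [(1.0, 1.5), (2.5, 1.0)]

labels = assign(points, centers)               # which center each point is closest to
new_centers = update(points, labels, centers)  # centers after one round of averaging
```

After this single pass the point (2.0, 1.0) is still grouped with the far-away points, which is why the algorithm has to iterate.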

But this is clearly still not a reasonable partition.

So K-means keeps iterating: it computes the distance from each data point to the new centers,

assigns each point to the closer center,

and averages each cluster again to obtain new centers.

This repeats until the means of Cluster1 and Cluster2 no longer change, at which point the partition is considered ideal (or until someone stops it manually).
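The stopping condition described above can be sketched as a loop that ends when no center moves. This is a single-machine illustration under assumed names and data (kmeans, max_iter, and the sample points are hypothetical), not the code Mahout runs. Comparing centers for exact equality is a simplification; practical implementations usually stop on a small tolerance or an iteration cap instead.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Repeat assign/average until the centers stop changing (or max_iter is hit)."""
    centers = list(centers)
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centers)
        ]
        if new_centers == centers:  # nothing moved: the partition is stable
            return centers, clusters, iteration
        centers = new_centers
    return centers, clusters, max_iter

# Hypothetical data and starting centers, chosen so the first partition is poor.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centers, clusters, iterations = kmeans(points, [(1.0, 1.5), (2.5, 1.0)])
```

On this data the loop needs three passes: the first two move the centers, and the third confirms that nothing changes any more.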


The main advantage of this algorithm is that it is quick to pick up. The keys to the algorithm are the choice of initial centers and the distance formula.
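To see why the distance formula matters, the sketch below assigns the same point under two common measures, Euclidean and Manhattan distance. The helper names, the point, and the centers are made up for illustration; the only claim is that the choice of measure can change which center counts as "closer".

```python
import math

def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(point, centers, distance):
    """Index of the center closest to point under the given distance function."""
    return min(range(len(centers)), key=lambda k: distance(point, centers[k]))

point = (0.0, 0.0)
centers = [(0.0, 5.0), (3.0, 3.0)]

by_euclidean = nearest(point, centers, euclidean)  # distances 5.0 vs about 4.24
by_manhattan = nearest(point, centers, manhattan)  # distances 5.0 vs 6.0
```

The same point lands in the second cluster under Euclidean distance and in the first under Manhattan distance, so the two measures can produce different partitions from identical starting centers.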


Finally, run one more Mahout algorithm as a test.

Call the FPG algorithm (FP-Growth, which finds frequent itemsets).

The test data is e-commerce shopping cart data.

From the bin directory of Mahout, run:

./mahout fpg -i /user/hadoop/testdata/tail.txt -o /user/hadoop/output -method mapreduce -s 1000 -regex '[ ]'

The meaning of each parameter:

-i: the path of the input data

-o: the path of the output

-method: the execution method, here mapreduce

-s: the minimum support

-regex: the regular expression used to split each transaction line into items
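To make the minimum support and the splitting pattern concrete, here is a small sketch that counts itemsets by brute force. It is not FP-Growth and not Mahout's code: the function name, the max_size limit, and the four shopping carts are assumptions for illustration. It only shows what "keep itemsets that appear in at least s transactions" means.

```python
import re
from collections import Counter
from itertools import combinations

def frequent_itemsets(lines, min_support, splitter=r"[ ]", max_size=2):
    """Count itemsets of size 1..max_size; keep those in at least min_support lines."""
    counts = Counter()
    for line in lines:
        # Split the transaction with the regular expression, dropping empty tokens.
        items = sorted({tok for tok in re.split(splitter, line.strip()) if tok})
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical shopping carts, one transaction per line, items separated by spaces.
carts = [
    "milk bread",
    "milk bread eggs",
    "milk bread",
    "eggs",
]
result = frequent_itemsets(carts, min_support=3)
```

With a minimum support of 3, eggs (2 carts) is dropped, while milk, bread, and the pair bread and milk (3 carts each) are kept. Brute-force counting like this grows quickly with the number of items, which is the cost FP-Growth is designed to avoid.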


Likewise, use seqdumper to view the output of the run.
