Mahout getting started

Because we need data mining for our cloud computing work, here is a quick look at configuring Mahout. Mahout is a machine learning algorithm library built on MapReduce that runs on a Hadoop cluster.

The configuration process is as follows:

1. Download mahout-distribution-0.4.tar.gz from the Mahout site and unpack it. This is a precompiled package. If you use the source package instead, install Maven to compile it.

2. Hadoop was set up earlier, so I will not cover it here. Set the environment variables below by editing /etc/profile (sudo vi /etc/profile on Ubuntu):

export HADOOP_HOME=/home/Guang/desktop/tools/hadoop-0.20.2
export HADOOP_CONF_DIR=/home/Guang/desktop/tools/hadoop-0.20.2/conf
export MAHOUT_HOME=/home/Guang/desktop/tools/mahout-distribution-0.4
export PATH=$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH

3. Start Hadoop. Pseudo-distributed mode is used here for testing.

4. Run mahout --help to check whether Mahout is installed properly and whether it lists the available algorithms.
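
If mahout --help fails, a quick look at the variables from step 2 can narrow down the cause. The function below is only a sketch: check_mahout_env is a hypothetical helper name, not part of Mahout or Hadoop, and it checks nothing beyond whether each variable is set and points to an existing directory.

```shell
# Hypothetical helper (not part of Mahout or Hadoop): report which of the
# variables from step 2 are unset or point to a directory that does not exist.
check_mahout_env() {
    failed=0
    for name in HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            failed=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            failed=1
        else
            echo "$name ok: $value"
        fi
    done
    return $failed
}
```

After running source /etc/profile, call check_mahout_env. It returns a non-zero status if any of the three variables needs attention.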

5. Download the dataset synthetic_control.data.

6. Create the test directory testdata on HDFS and import the data into it (the directory name must be testdata, because the Mahout example job looks for this directory on HDFS):

$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put /home/test/synthetic_control.data testdata
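
Before uploading, it can help to confirm that the local download is complete and well formed. The sketch below assumes the file is whitespace-separated with one series per row. describe_dataset is a hypothetical helper name. It prints the row and column counts and fails if the rows differ in width or the file is empty.

```shell
# Hypothetical helper: print "rows=<n> cols=<m>" for a whitespace-separated
# file and return non-zero if rows differ in width or the file has no data.
describe_dataset() {
    awk '
        NF == 0    { next }          # skip blank lines
        cols == 0  { cols = NF }     # the first data row fixes the expected width
        NF != cols { bad = 1 }       # remember any row with a different width
                   { rows++ }
        END {
            printf "rows=%d cols=%d\n", rows, cols
            failed = (bad || rows == 0)
            exit failed
        }
    ' "$1"
}
```

For example, describe_dataset /home/test/synthetic_control.data. The dataset description published with the data lists 600 series of 60 values each, so a different count would suggest a truncated download.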

7. Run the k-means algorithm:

hadoop jar mahout-examples-0.4-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

The job takes several minutes to run, so be patient.

8. View the results. Run the following commands in sequence:

$HADOOP_HOME/bin/hadoop fs -lsr output
$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/result
cd $MAHOUT_HOME/result
ls

If the following result is displayed, the algorithm ran successfully and your installation works:

clusteredPoints clusters-0 clusters-1 clusters-2 ...... clusters-10 data
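
To see how many iterations a run actually produced without reading the listing by eye, the highest N among the clusters-N entries can be extracted. This is a small sketch, and last_cluster_iteration is a hypothetical helper name. It looks only at entry names in a local directory, such as the one copied out of HDFS in step 8.

```shell
# Hypothetical helper: print the highest N among the clusters-N entries
# of a local directory (prints nothing if there are none).
last_cluster_iteration() {
    ls "$1" \
        | sed -n 's/^clusters-\([0-9][0-9]*\)$/\1/p' \
        | sort -n \
        | tail -n 1
}
```

For example, last_cluster_iteration "$MAHOUT_HOME/result". The numeric sort matters: a plain alphabetical sort would place clusters-10 before clusters-2.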

The results posted online show only eight clusters-N folders. The example does not specify a K value by default (K being the number of clusters to form), and the dataset is the same, so why are the results different? I saw no errors during the run, and I could clearly see the MapReduce job for each iteration. My guess is a version difference: those posts used Mahout 0.3 and I use 0.4. If you know the actual reason, please leave a comment!

2011-8-1 answer: Each iteration computes new cluster centers from the original data points and the cluster centers of the previous iteration (or the initial clustering) and writes them to a clusters-N directory. The book Mahout in Action also says that each clusters-N directory is generated by one iteration. After reading the source code, I found that the number of iterations is set to 10.
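
The iteration described above can be illustrated with a toy version. The sketch below is not Mahout's implementation: it runs locally in awk on one-dimensional points, kmeans_1d is a hypothetical name, and stopping early once the centers no longer move is an assumption of the sketch. It shows only how each pass turns the previous centers plus the original points into a new set of centers, labelled here like the clusters-N directories.

```shell
# Toy 1-D k-means (illustration only, not Mahout's code).
# Usage: kmeans_1d FILE MAX_ITER CENTER [CENTER...]
# FILE holds one number per line; the centers after each pass are printed.
kmeans_1d() {
    file=$1
    max_iter=$2
    shift 2
    awk -v max_iter="$max_iter" -v init="$*" '
        function abs(x) { return x < 0 ? -x : x }
        BEGIN {
            k = split(init, center, " ")
            for (c = 1; c <= k; c++) center[c] = center[c] + 0   # force numeric
        }
        NF > 0 { point[++n] = $1 + 0 }
        END {
            for (iter = 1; iter <= max_iter; iter++) {
                for (c = 1; c <= k; c++) { sum[c] = 0; count[c] = 0 }
                # Assignment step: each point joins its nearest previous center.
                for (i = 1; i <= n; i++) {
                    best = 1
                    for (c = 2; c <= k; c++)
                        if (abs(point[i] - center[c]) < abs(point[i] - center[best]))
                            best = c
                    sum[best] += point[i]
                    count[best]++
                }
                # Update step: each center becomes the mean of its points.
                moved = 0
                line = "clusters-" iter ":"
                for (c = 1; c <= k; c++) {
                    updated = (count[c] > 0) ? sum[c] / count[c] : center[c]
                    if (updated != center[c]) moved = 1
                    center[c] = updated
                    line = line " " center[c]
                }
                print line
                if (!moved) break   # assumption of this sketch: stop once stable
            }
        }
    ' "$file"
}
```

With the points 1, 2, 3, 10, 11, 12 in a file and the call kmeans_1d points.txt 10 0 5, the centers settle at 2 and 11 by the second pass, and the third pass only confirms that nothing moved.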

Next, I will look at the k-means algorithm itself and at how clustering is combined with MapReduce, and then at recommendation engines built on Mahout.

References:

1. http://wenku.baidu.com/view/dbd15bd276a20029bd642d55.html (Mahout installation document)

2. http://blog.csdn.net/chjshan55/article/details/5923646

3. https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data

4. http://bbs.hadoopor.com/thread-983-1-1.html

 
