Mahout getting started

Last Update:2018-12-03 Source: Internet

Author: User



Because we need to do data mining in a cloud computing environment, here is a brief look at how to configure Mahout. Mahout is a machine learning algorithm library built on MapReduce, and it runs on a Hadoop cluster.

The configuration process is as follows:

1. Download mahout-distribution-0.4.tar.gz and unpack it. This is a precompiled package. If you use the source package instead, install Maven to compile it.

2. Hadoop has already been set up, so I will not cover it here. Set the environment variables below by editing the profile with sudo vi /etc/profile (see a reference on Ubuntu environment variables if needed).

export HADOOP_HOME=/home/Guang/desktop/tools/hadoop-0.20.2
export HADOOP_CONF_DIR=/home/Guang/desktop/tools/hadoop-0.20.2/conf
export MAHOUT_HOME=/home/Guang/desktop/tools/mahout-distribution-0.4
export PATH=$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH
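
A typo in one of these variables tends to surface only later as a confusing Hadoop or Mahout error, so it can help to check them right after reloading the profile. The following is a minimal sketch; check_env is a hypothetical helper name, not part of Hadoop or Mahout, and it only tests that each variable is set and points to an existing directory.

```shell
# check_env: hypothetical helper (not part of Hadoop or Mahout).
# Reports whether each variable is set and names an existing directory.
check_env() {
  status=0
  for name in HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value"
      status=1
    else
      echo "$name=$value"
    fi
  done
  return $status
}
```

After running source /etc/profile, calling check_env should print three NAME=path lines and return 0 if the paths exist on your machine.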
3. Start Hadoop in pseudo-distributed mode for testing.

4. Run mahout --help to check whether Mahout is installed correctly. It should list the available algorithms.

5. Download the dataset synthetic_control.data.
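
Before importing the file into HDFS, a quick local check can catch a truncated or HTML-wrapped download. The sketch below uses a hypothetical helper, data_shape, that reports the number of non-empty rows and the smallest and largest column count of a whitespace-separated file. As far as I know, the UCI description of this dataset lists 600 series of 60 values each; treat those numbers as an assumption, not something stated in this article.

```shell
# data_shape FILE: hypothetical helper.
# Prints "<rows> <min columns> <max columns>" for a whitespace-separated file,
# ignoring empty lines.
data_shape() {
  awk 'NF > 0 {
         rows++
         if (rows == 1 || NF < min) min = NF
         if (NF > max) max = NF
       }
       END { print rows + 0, min + 0, max + 0 }' "$1"
}
```

For example, data_shape /home/test/synthetic_control.data would be expected to print 600 60 60 if the assumption above holds. Differing minimum and maximum column counts point to a damaged file.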
6. Create the test directory testdata in HDFS and import the data into it (the directory must be named testdata, because Mahout automatically looks for this directory in HDFS).

$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put /home/test/synthetic_control.data testdata

7. Run the kmeans Algorithm

hadoop jar mahout-examples-0.4-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

The job takes several minutes to run, so be patient.
8. View the results. Run the following commands in sequence:

$HADOOP_HOME/bin/hadoop fs -lsr output
$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/result
cd $MAHOUT_HOME/result
ls

If the following result is displayed, the algorithm runs successfully and your installation is successful:

clusteredPoints clusters-0 clusters-1 clusters-2 ...... clusters-10 data
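
A plain ls sorts names as text, so clusters-10 appears between clusters-1 and clusters-2 and the last iteration is easy to misread. The sketch below uses a hypothetical helper, last_iteration, that extracts the numeric suffix of each clusters-N entry in the local copy of the results and prints the highest one.

```shell
# last_iteration DIR: hypothetical helper.
# Prints the highest N among the DIR/clusters-N entries, comparing numerically.
last_iteration() {
  ls "$1" | sed -n 's/^clusters-\([0-9][0-9]*\)$/\1/p' | sort -n | tail -n 1
}
```

With the listing shown above, last_iteration $MAHOUT_HOME/result would print 10.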

The results reported online show eight clusters folders. By default the value of K (the number of clusters to form) is not specified, and the dataset is the same, so why are the results different? I do not know what the problem is. I did not see any errors during the run, and I could clearly see the MapReduce job for each iteration. My guess is that it is a version difference: those reports use Mahout 0.3, while I use 0.4. If you know the actual reason, please leave a comment!

2011-8-1 answer: Each iteration computes new cluster centers from the original data points and the cluster centers of the previous iteration (or the initial clusters) and writes them to a clusters-N directory. The book Mahout in Action also says that the clusters-N directories are the directories generated by each iteration. After reading the source code, I found that the number of iterations is set to 10.
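
To make the relationship between iterations and clusters-N output concrete, here is a toy one-dimensional k-means written in shell and awk. It is only an illustration of the update rule described above, not Mahout's implementation: kmeans_1d is a hypothetical function, the data are plain numbers rather than 60-dimensional series, and it assumes that clusters-0 corresponds to the initial centers. It prints one clusters-N line per iteration instead of writing directories, and it stops early once the centers no longer change.

```shell
# kmeans_1d "POINTS" "INITIAL_CENTERS" MAX_ITER
# Toy 1-D k-means (illustration only, not Mahout code).
# Prints the centers after each iteration as "clusters-N: c1 c2 ...".
kmeans_1d() {
  points=$1
  centers=$2
  max_iter=$3
  echo "clusters-0: $centers"            # assumed: clusters-0 = initial centers
  i=1
  while [ "$i" -le "$max_iter" ]; do
    # One iteration: assign every point to its nearest center,
    # then replace each center by the mean of its assigned points.
    new_centers=$(printf '%s\n' $points | awk -v centers="$centers" '
      BEGIN { k = split(centers, c, " ") }
      {
        best = 1
        for (j = 2; j <= k; j++)
          if (($1 - c[j])^2 < ($1 - c[best])^2) best = j
        sum[best] += $1; n[best]++
      }
      END {
        for (j = 1; j <= k; j++) {
          # Keep the old center if no point was assigned to it.
          v = (n[j] > 0) ? sum[j] / n[j] : c[j]
          printf "%s%s", v, (j < k ? " " : "\n")
        }
      }')
    echo "clusters-$i: $new_centers"
    if [ "$new_centers" = "$centers" ]; then
      break                              # centers stopped moving
    fi
    centers=$new_centers
    i=$((i + 1))
  done
}
```

Calling kmeans_1d "1 2 3 10 11 12" "1 2" 10 prints clusters-0 through clusters-3, ending with the centers 2 and 11. With a cap of 10 iterations and no early stop, a run would produce clusters-0 through clusters-10, which matches the directory listing in step 8.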

Next, I will look at the k-means algorithm itself and how clustering is combined with MapReduce, and then at recommendation engines based on Mahout.

References:

1. http://wenku.baidu.com/view/dbd15bd276a20029bd642d55.html mahout installation text

2. http://blog.csdn.net/chjshan55/article/details/5923646

3. https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data

4. http://bbs.hadoopor.com/thread-983-1-1.html

This article is an English translation of an article originally published in Chinese on aliyun.com.
