Because we need to do data mining in a cloud computing environment, here is a quick look at how to configure Mahout. Mahout is a machine learning algorithm library built on MapReduce that runs on a Hadoop cluster.
The configuration process is as follows:
1. Download mahout-distribution-0.4.tar.gz from the Mahout site and extract it. This is a precompiled package. If you use the source package instead, install Maven to compile it.
2. Hadoop was set up earlier, so I will not cover it here. Set the environment variables below by editing /etc/profile with sudo vi /etc/profile:
export HADOOP_HOME=/home/Guang/desktop/tools/hadoop-0.20.2
export HADOOP_CONF_DIR=/home/Guang/desktop/tools/hadoop-0.20.2/conf
export MAHOUT_HOME=/home/Guang/desktop/tools/mahout-distribution-0.4
export PATH=$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH
3. Start Hadoop in pseudo-distributed mode for testing.
4. Run mahout --help to check whether Mahout is installed properly. It should list the available algorithms.
5. Download the dataset synthetic_control.data.
6. Create the test directory testdata in HDFS and import the data into it (the directory must be named testdata, because the Mahout example job automatically looks for this directory in HDFS):
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put /home/test/synthetic_control.data testdata
7. Run the k-means algorithm:
hadoop jar mahout-examples-0.4-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
The job takes several minutes to run, so be patient.
8. View the result. Run the following commands in sequence:
$HADOOP_HOME/bin/hadoop fs -lsr output
$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/result
cd $MAHOUT_HOME/result
ls
If the following result is displayed, the algorithm runs successfully and your installation is successful:
clusteredPoints clusters-0 clusters-1 clusters-2 ...... clusters-10 data
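Steps 2 and 6 above are the easiest to get wrong, so a small pre-flight check can save a failed run. The following is a minimal sketch, not part of Hadoop or Mahout: the function names check_mahout_env and check_dataset are hypothetical, and the expected shape of 600 rows with 60 values each is an assumption based on the UCI description of the synthetic control dataset.

```shell
#!/bin/sh
# Hypothetical helper: verify that the variables from step 2 are set
# and point to existing directories. Returns 0 only if all three are fine.
check_mahout_env() {
    rc=0
    for name in HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME; do
        # Indirect lookup of the variable called $name, portable to plain sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            rc=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            rc=1
        else
            echo "$name ok: $value"
        fi
    done
    return "$rc"
}

# Hypothetical helper: verify that a local copy of the dataset has the
# assumed shape (600 lines, 60 whitespace-separated values per line).
check_dataset() {
    awk -v rows=600 -v cols=60 '
        NF != cols { printf "line %d has %d values\n", NR, NF; bad = 1 }
        END {
            if (NR != rows) { printf "found %d lines, expected %d\n", NR, rows; bad = 1 }
            exit (bad ? 1 : 0)
        }' "$1"
}
```

After reloading /etc/profile, calling check_mahout_env and then check_dataset /home/test/synthetic_control.data before the upload in step 6 should report any unset variable or truncated download. Both functions only read local state and do not touch HDFS.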
Results posted online end up with only eight clusters-N folders. Clustering was run with the defaults and no K value specified (K is the number of clusters to form), and the dataset is the same, so why are the results different? I did not see any errors during the run, and I could clearly see the MapReduce job for each iteration. My guess is a version difference: the posts online used Mahout 0.3 and I use 0.4. If you know the real reason, please leave a comment!
Update 2011-8-1, the answer: each iteration computes new cluster centers from the original data points and the cluster centers of the previous iteration (or the initial clusters) and writes them to a clusters-N directory. The book Mahout in Action also says that the clusters-N directories are generated one per iteration. From reading the source code, the number of iterations is set to 10.
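Since each iteration writes its own clusters-N directory, the highest N in the downloaded result tells you how many iterations the job actually ran. The following is a minimal sketch with a hypothetical function name, last_cluster_iteration, and it assumes the output was copied to a local directory as in step 8.

```shell
#!/bin/sh
# Hypothetical helper: print the highest N among the clusters-N
# directories under the given local result directory, or -1 if none exist.
last_cluster_iteration() {
    max=-1
    for dir in "$1"/clusters-*; do
        # Skip the unexpanded glob and anything that is not a directory.
        [ -d "$dir" ] || continue
        # Keep only the text after the last "clusters-".
        n=${dir##*clusters-}
        # Ignore suffixes that are empty or not purely numeric.
        case $n in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$n" -gt "$max" ]; then
            max=$n
        fi
    done
    echo "$max"
}
```

For the listing shown in step 8 (clusters-0 through clusters-10), last_cluster_iteration $MAHOUT_HOME/result would print 10, which matches the iteration limit found in the source code.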
Next, I will look at the k-means algorithm itself and how clustering is combined with MapReduce, and then at the recommendation engine based on Mahout.
References:
1. http://wenku.baidu.com/view/dbd15bd276a20029bd642d55.html (Mahout installation guide)
2. http://blog.csdn.net/chjshan55/article/details/5923646
3. https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
4. http://bbs.hadoopor.com/thread-983-1-1.html