As Hadoop developers, we have to keep up with its rhythm, seize the opportunity, and grow together with Hadoop.
About the author: Zhang Dan (Conan), programmer: Java, R, PHP, JavaScript. Weibo: @Conan_Z Blog: http://blog.fens.me Email: bsspirit@gmail.com
Please cite the source when reprinting:
http://blog.fens.me/hadoop-mahout-maven-eclipse/
Objective
Hadoop-based projects, whether for MapReduce development or Mahout development, run in a complex programming environment. Environment problems in Java are a nightmare that haunts every programmer. A Java programmer not only has to write Java programs, but also has to tune Linux, install Hadoop, start Hadoop, and handle operations himself. So it is not easy for a novice to get started with Hadoop.
However, we can simplify the environment as much as possible, so that programmers can focus only on writing programs. For algorithm programmers in particular, putting energy into algorithm design is far more valuable than spending time solving environment problems.
Directory

1. Maven introduction and installation
2. Introduction to the Mahout single-machine development environment
3. Building the Mahout development environment with Maven
4. Implementing collaborative filtering (UserCF) with Mahout
5. Implementing KMeans with Mahout
6. Uploading the template project to GitHub

1. Maven Introduction and Installation
Please refer to the article: Building a Hadoop Project with Maven.
Development environment:
- Win7 64bit
- Java 1.6.0_45
- Maven 3
- Eclipse Juno Service Release 2
- Mahout 0.6
A note on Mahout release compatibility: mahout-0.5, mahout-0.6, and mahout-0.7 are based on hadoop-0.20.2x, while mahout-0.8 and mahout-0.9 are based on hadoop-1.1.x. mahout-0.7 was a major upgrade that removed several of the single-machine, in-memory algorithms, and some of its APIs are not compatible with earlier versions.
Note: this article focuses on "building a Mahout development environment with Maven", and both examples are based on the single-machine, in-memory implementation, so version 0.6 is chosen. Running Mahout on a Hadoop cluster will be described in the next article.

2. Introduction to the Mahout Single-machine Development Environment
As shown in the figure above, we can develop either on Windows or on Linux, and during development we can debug in the local environment. The standard tools are Maven and Eclipse.

3. Building the Mahout Development Environment with Maven
1. Create a standardized Java project with Maven
2. Import the project into Eclipse
3. Add the Mahout dependency by modifying pom.xml
4. Download the dependencies
1). Create a standardized Java project with Maven
~ D:\workspace\java>mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=org.conan.mymahout -DartifactId=mymahout -DpackageName=org.conan.mymahout -Dversion=1.0-SNAPSHOT -DinteractiveMode=false
Enter the project directory and run the mvn command:
~ D:\workspace\java>cd mymahout
~ D:\workspace\java\mymahout>mvn clean install
2). Import the project into Eclipse

We have created a basic Maven project; now import it into Eclipse. It is best to have the Maven plugin installed in Eclipse beforehand.
3). Add the Mahout dependency by modifying pom.xml
Here I use Mahout version 0.6 and remove the dependency on JUnit. Modify the file pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.conan.mymahout</groupId>
    <artifactId>mymahout</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>mymahout</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <mahout.version>0.6</mahout.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-core</artifactId>
            <version>${mahout.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-integration</artifactId>
            <version>${mahout.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.cassandra</groupId>
                    <artifactId>cassandra-all</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>me.prettyprint</groupId>
                    <artifactId>hector-core</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
</project>
4). Download the dependencies

~ mvn clean install

Refresh the project in Eclipse:
The project's dependencies are automatically loaded under the library path.

4. Implementing Collaborative Filtering (UserCF) with Mahout
For an in-depth analysis of the UserCF collaborative filtering algorithm in Mahout, please refer to the article: Using R to Interpret Mahout's User-based Recommendation Collaborative Filtering Algorithm (UserCF).
Implementation steps:
1. Prepare the data file: item.csv
2. Java program: UserCF.java
3. Run the program
4. Interpret the recommendation results
1). Create the data file: item.csv
~ mkdir datafile
~ vi datafile/item.csv
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
Data interpretation: each row has three columns; the first column is the user ID, the second column is the item ID, and the third column is the user's rating of the item.
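To make this format concrete, here is a small plain-Java sketch, independent of Mahout (the class and method names are my own, for illustration only), that parses lines in the user,item,rating layout and counts how many ratings each user has:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ItemCsv {

    // Count ratings per user from lines in "userID,itemID,rating" format.
    public static Map<Long, Integer> countRatings(List<String> lines) {
        Map<Long, Integer> counts = new LinkedHashMap<Long, Integer>();
        for (String line : lines) {
            String[] cols = line.split(",");
            long userId = Long.parseLong(cols[0]); // column 1: user ID
            // cols[1] is the item ID, cols[2] is the rating; only the user is counted here
            Integer c = counts.get(userId);
            counts.put(userId, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("1,101,5.0", "1,102,3.0", "2,101,2.0");
        System.out.println(countRatings(sample)); // prints {1=2, 2=1}
    }
}
```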
2). Java program: UserCF.java
The data flow and invocation process of collaborative filtering in Mahout:
The image above is excerpted from Mahout in Action.
Create a new Java class: org.conan.mymahout.recommendation.UserCF.java
package org.conan.mymahout.recommendation;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserCF {

    final static int NEIGHBORHOOD_NUM = 2;
    final static int RECOMMENDER_NUM = 3;

    public static void main(String[] args) throws IOException, TasteException {
        String file = "datafile/item.csv";
        DataModel model = new FileDataModel(new File(file));
        UserSimilarity user = new EuclideanDistanceSimilarity(model);
        NearestNUserNeighborhood neighbor = new NearestNUserNeighborhood(NEIGHBORHOOD_NUM, user, model);
        Recommender r = new GenericUserBasedRecommender(model, neighbor, user);
        LongPrimitiveIterator iter = model.getUserIDs();

        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = r.recommend(uid, RECOMMENDER_NUM);
            System.out.printf("uid:%s", uid);
            for (RecommendedItem ritem : list) {
                System.out.printf("(%s,%f)", ritem.getItemID(), ritem.getValue());
            }
            System.out.println();
        }
    }
}
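The heart of this pipeline is the similarity measure. As a conceptual aid, the idea behind EuclideanDistanceSimilarity can be sketched in plain Java. This is a simplified illustration with a class name of my own, not Mahout's actual implementation (Mahout applies additional weighting); it uses the common formulation similarity = 1 / (1 + distance), computed over the items two users have both rated:

```java
public class SimilaritySketch {

    // Euclidean-distance-based similarity between two users' ratings
    // on their co-rated items: 1 / (1 + euclidean distance).
    public static double similarity(double[] ratingsA, double[] ratingsB) {
        double sumSq = 0.0;
        for (int i = 0; i < ratingsA.length; i++) {
            double diff = ratingsA[i] - ratingsB[i];
            sumSq += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    public static void main(String[] args) {
        // Users 1 and 2 from item.csv co-rate items 101, 102, 103.
        double[] u1 = {5.0, 3.0, 2.5};
        double[] u2 = {2.0, 2.5, 5.0};
        System.out.printf("similarity(u1,u2) = %.4f%n", similarity(u1, u2));
    }
}
```

The closer two users' ratings are on the items they share, the smaller the distance and the closer the similarity gets to 1; this is what drives the neighborhood selection above.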
3). Run the program
Console output:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
uid:1 (104,4.274336)(106,4.000000)
uid:2 (105,4.055916)
uid:3 (103,3.360987)(102,2.773169)
uid:4 (102,3.000000)
uid:5
4). Interpreting the recommendation results

For user ID 1, the top two most relevant items are recommended: 104 and 106.
For user ID 2, the top two most relevant items are requested, but only 105 matches.
For user ID 3, the top two most relevant items are recommended: 103 and 102.
For user ID 4, the top two most relevant items are requested, but only 102 matches.
For user ID 5, the top two most relevant items are requested, but nothing matches.

5. Implementing KMeans with Mahout
1. Prepare the data file: randomdata.csv
2. Java program: KMeans.java
3. Run the Java program
4. Interpret the Mahout results
5. Implement the KMeans algorithm in R
6. Compare the results of Mahout and R
1). Prepare the data file: randomdata.csv
~ vi datafile/randomdata.csv
-0.883033363823402,-3.31967192630249
-2.39312626419456,3.34726861118871
2.66976353341256,1.85144276077058
-1.09922906899594,-6.06261735207489
-4.36361936997216,1.90509905380532
-0.00351835125495037,-0.610105996559153
-2.9962958796338,-3.60959839525735
-3.27529418132066,0.0230099799641799
2.17665594420569,6.77290756817957
-2.47862038335637,2.53431833167278
5.53654901906814,2.65089785582474
5.66257474538338,6.86783609641077
-0.558946883114376,1.22332819416237
5.11728525486132,3.74663871584768
1.91240516693351,2.95874731384062
-2.49747101306535,2.05006504756875
3.98781883213459,1.00780938946366
This is only a subset of the data; for the full data set, please see the source code.
Note: I generated randomdata.csv with the following R script:
x1 <- cbind(x=rnorm(400,1,3), y=rnorm(400,1,3))
x2 <- cbind(x=rnorm(300,1,0.5), y=rnorm(300,0,0.5))
x3 <- cbind(x=rnorm(300,0,0.1), y=rnorm(300,2,0.2))
x <- rbind(x1,x2,x3)
write.table(x, file="randomdata.csv", sep=",", row.names=FALSE, col.names=FALSE)
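If you would rather stay in Java, a roughly equivalent generator can be written with Random.nextGaussian(). This is a sketch; the class and method names are my own, and the cluster parameters mirror the R script above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomDataGenerator {

    // Generate n points from a 2D Gaussian with the given means and standard deviations.
    public static List<double[]> gaussianCluster(Random rnd, int n,
            double meanX, double sdX, double meanY, double sdY) {
        List<double[]> points = new ArrayList<double[]>();
        for (int i = 0; i < n; i++) {
            points.add(new double[] {
                meanX + sdX * rnd.nextGaussian(),
                meanY + sdY * rnd.nextGaussian() });
        }
        return points;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        List<double[]> all = new ArrayList<double[]>();
        all.addAll(gaussianCluster(rnd, 400, 1, 3, 1, 3));     // like x1 in the R script
        all.addAll(gaussianCluster(rnd, 300, 1, 0.5, 0, 0.5)); // like x2
        all.addAll(gaussianCluster(rnd, 300, 0, 0.1, 2, 0.2)); // like x3
        for (double[] p : all) {
            System.out.println(p[0] + "," + p[1]); // same CSV layout as randomdata.csv
        }
    }
}
```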
2). Java program: KMeans.java
The implementation flow of the KMeans algorithm in Mahout:
The image above is excerpted from Mahout in Action.
Create a new Java class: org.conan.mymahout.cluster06.KMeans.java
package org.conan.mymahout.cluster06;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansClusterer;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class KMeans {

    public static void main(String[] args) throws IOException {
        List<Vector> sampleData = MathUtil.readFileToVector("datafile/randomdata.csv");

        int k = 3;
        double threshold = 0.01;

        List<Vector> randomPoints = MathUtil.chooseRandomPoints(sampleData, k);
        for (Vector vector : randomPoints) {
            System.out.println("Init Point center: " + vector);
        }

        List<Cluster> clusters = new ArrayList<Cluster>();
        for (int i = 0; i < k; i++) {
            clusters.add(new Cluster(randomPoints.get(i), i, new EuclideanDistanceMeasure()));
        }

        List<List<Cluster>> finalClusters = KMeansClusterer.clusterPoints(sampleData, clusters,
                new EuclideanDistanceMeasure(), k, threshold);
        for (Cluster cluster : finalClusters.get(finalClusters.size() - 1)) {
            System.out.println("Cluster id:" + cluster.getId() + " center: "
                    + cluster.getCenter().asFormatString());
        }
    }
}
3). Run the Java program
Console output:
Init Point center: {0:-0.162693685149196,1:2.19951550286862}
Init Point center: {0:-0.0409782183083317,1:2.09376666042057}
Init Point center: {0:0.158401778474687,1:2.37208412905273}
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Cluster id:0 center: {0:-2.686856800552941,1:1.8939462954763795}
Cluster id:1 center: {0:0.6334255423230666,1:0.49472852972602105}
Cluster id:2 center: {0:3.334520309711998,1:3.2758355898247653}
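For readers who want to see what KMeansClusterer is doing conceptually, here is a minimal, self-contained k-means in plain Java. This is a simplified sketch, not Mahout's implementation: it takes fixed initial centers and alternates the assignment and update steps for a fixed number of iterations instead of testing a convergence threshold:

```java
public class KMeansSketch {

    // Run k-means on 2D points with the given initial centers; returns the final centers.
    public static double[][] cluster(double[][] points, double[][] initCenters, int iterations) {
        int k = initCenters.length;
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) {
            centers[i] = initCenters[i].clone();
        }
        for (int iter = 0; iter < iterations; iter++) {
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            // Assignment step: attach each point to its nearest center.
            for (double[] p : points) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = p[0] - centers[c][0], dy = p[1] - centers[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                sums[best][0] += p[0];
                sums[best][1] += p[1];
                counts[best]++;
            }
            // Update step: move each center to the mean of its assigned points.
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centers[c][0] = sums[c][0] / counts[c];
                    centers[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        double[][] init = { {0, 0}, {10, 10} };
        double[][] centers = cluster(pts, init, 10);
        for (double[] c : centers) {
            System.out.println("center: {" + c[0] + "," + c[1] + "}");
        }
        // → center: {0.0,0.5} and center: {10.0,10.5}
    }
}
```

On well-separated data like randomdata.csv, this loop converges to cluster centers close to the ones Mahout reports above, up to the choice of initial points.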