Build Mahout projects with Maven


As developers, we have to keep up with the pace, seize the opportunity, and move forward with Hadoop together.

About the author: Zhang Dan (Conan), a programmer working in Java, R, PHP, and JavaScript. Weibo: @Conan_Z; blog: http://blog.fens.me; email: bsspirit@gmail.com

When reprinting, please cite the source:
http://blog.fens.me/hadoop-mahout-maven-eclipse/

Objective

Hadoop-based projects, whether for MapReduce development or Mahout development, involve a complex programming environment. Java environment problems are a nightmare that haunts every programmer: a Java programmer must not only write Java programs, but also tune Linux, set up Hadoop, start Hadoop, and handle operations himself. So it is not easy for a novice to get started with Hadoop.

However, we can simplify the environment as much as possible so that programmers can focus on writing programs. For algorithm programmers in particular, putting energy into algorithm design is far more valuable than spending time solving environment problems.

Directory

1. Maven introduction and installation
2. Introduction to the Mahout single-machine development environment
3. Building the Mahout development environment with Maven
4. Implementing collaborative filtering (UserCF) with Mahout
5. Implementing KMeans with Mahout
6. Uploading the template project to GitHub

1. Maven Introduction and Installation

Please refer to the article: building a Hadoop project with Maven

Development environment:
Win7 64bit
Java 1.6.0_45
Maven 3
Eclipse Juno Service Release 2
Mahout 0.6

A note on Mahout runtime versions: mahout-0.5, mahout-0.6, and mahout-0.7 are based on hadoop-0.20.2x; mahout-0.8 and mahout-0.9 are based on hadoop-1.1.x. mahout-0.7 was a major upgrade that removed the single-machine in-memory implementations of several algorithms, and some of its APIs are not backward-compatible.

Note: This article focuses on "building a Mahout development environment with Maven", and the two examples are based on single-machine in-memory implementations, so version 0.6 is chosen. Running Mahout on a Hadoop cluster is described in the next article.

2. Mahout Single-machine Development Environment Introduction

As shown in the figure above, we can choose to develop on Windows or on Linux, and during development we can debug in the local environment. The standard tools are Maven and Eclipse.

3. Building the Mahout Development Environment with Maven

1. Use Maven to create a standardized Java project
2. Import the project into Eclipse
3. Add the Mahout dependency by modifying pom.xml
4. Download the dependencies

1). Create a standardized Java project with Maven

~ D:\workspace\java> mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=org.conan.mymahout -DartifactId=myMahout -DpackageName=org.conan.mymahout -Dversion=1.0-SNAPSHOT -DinteractiveMode=false

Enter the project directory and execute the mvn command:

~ D:\workspace\java> cd myMahout
~ D:\workspace\java\myMahout> mvn clean install

2). Import the project into Eclipse

We have created a basic Maven project; now import it into Eclipse. It is best to have the Maven plugin already installed in Eclipse.

3). Add the Mahout dependency, modify pom.xml

Here I use the mahout-0.6 version and remove the dependency on JUnit. Modify the file pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.conan.mymahout</groupId>
    <artifactId>myMahout</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>myMahout</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <mahout.version>0.6</mahout.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-core</artifactId>
            <version>${mahout.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-integration</artifactId>
            <version>${mahout.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.cassandra</groupId>
                    <artifactId>cassandra-all</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>me.prettyprint</groupId>
                    <artifactId>hector-core</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
</project>

4). Download the dependencies

~ mvn clean install

Refresh the project in Eclipse:

The project's dependent libraries are automatically loaded under the library path.

4. Implementing Collaborative Filtering (UserCF) with Mahout

For an in-depth analysis of Mahout's UserCF collaborative filtering algorithm, please refer to the article: Using R to interpret Mahout's user-based collaborative filtering algorithm (UserCF).

Implementation steps:

1. Prepare the data file: item.csv
2. Java program: UserCF.java
3. Run the program
4. Interpret the recommendation results

1). Create the data file: item.csv

~ mkdir datafile
~ vi datafile/item.csv

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

Data interpretation: There are three columns in each row, the first column is the user ID, the second column is the item ID, and the third column is the user's rating of the item.
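To make the record layout concrete, here is a minimal sketch of parsing one such row in Java. The `RatingLine` class is a hypothetical helper for illustration only; it is not part of Mahout, which reads this format itself via `FileDataModel`.

```java
// Hypothetical helper illustrating the item.csv record layout: user,item,rating
public class RatingLine {
    final long userID;
    final long itemID;
    final float rating;

    RatingLine(long userID, long itemID, float rating) {
        this.userID = userID;
        this.itemID = itemID;
        this.rating = rating;
    }

    // Split one CSV row like "1,101,5.0" into its three typed fields.
    static RatingLine parse(String line) {
        String[] f = line.split(",");
        return new RatingLine(Long.parseLong(f[0]), Long.parseLong(f[1]), Float.parseFloat(f[2]));
    }

    public static void main(String[] args) {
        RatingLine r = RatingLine.parse("1,101,5.0");
        System.out.println("user " + r.userID + " rated item " + r.itemID + " as " + r.rating);
    }
}
```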

2). Java program: UserCF.java

The data flow and invocation process of Mahout collaborative filtering:

The figure above is excerpted from Mahout in Action.

Create a new Java class: org.conan.mymahout.recommendation.UserCF.java

package org.conan.mymahout.recommendation;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserCF {

    final static int NEIGHBORHOOD_NUM = 2;
    final static int RECOMMENDER_NUM = 3;

    public static void main(String[] args) throws IOException, TasteException {
        String file = "datafile/item.csv";
        DataModel model = new FileDataModel(new File(file));
        UserSimilarity user = new EuclideanDistanceSimilarity(model);
        NearestNUserNeighborhood neighbor = new NearestNUserNeighborhood(NEIGHBORHOOD_NUM, user, model);
        Recommender r = new GenericUserBasedRecommender(model, neighbor, user);

        LongPrimitiveIterator iter = model.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = r.recommend(uid, RECOMMENDER_NUM);
            System.out.printf("uid:%s", uid);
            for (RecommendedItem ritem : list) {
                System.out.printf("(%s,%f)", ritem.getItemID(), ritem.getValue());
            }
            System.out.println();
        }
    }
}

3). Run the program
Console output:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
uid:1(104,4.274336)(106,4.000000)
uid:2(105,4.055916)
uid:3(103,3.360987)(102,2.773169)
uid:4(102,3.000000)
uid:5

4). Interpreting the recommendation results

For user ID 1, the top two most relevant items are recommended: 104 and 106.
For user ID 2, the top two most relevant items are requested, but there is only one: 105.
For user ID 3, the top two most relevant items are recommended: 103 and 102.
For user ID 4, the top two most relevant items are requested, but there is only one: 102.
For user ID 5, the top two most relevant items are requested, but there is no match.

5. Implementing KMeans with Mahout

1. Prepare the data file: randomdata.csv
2. Java program: KMeans.java
3. Run the Java program
4. Interpret the Mahout results
5. Implement the KMeans algorithm with the R language
6. Compare the results of Mahout and R
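The reason user ID 5 gets no recommendations can be seen from how user-based similarity works. Here is an illustrative plain-Java sketch, not Mahout's internal code: similarity between two users is computed over the items both of them rated, here using a simple 1/(1 + Euclidean distance) formula as an assumed stand-in for Mahout's `EuclideanDistanceSimilarity`.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not Mahout's implementation): similarity between two
// users based on the Euclidean distance over the items both of them rated.
public class UserSimilaritySketch {

    // 1 / (1 + distance): closer rating vectors give a similarity nearer to 1.
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sum = 0;
        int common = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {          // only items co-rated by both users count
                double diff = e.getValue() - other;
                sum += diff * diff;
                common++;
            }
        }
        if (common == 0) return 0;        // no overlap: no basis for similarity
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    public static void main(String[] args) {
        Map<Long, Double> u1 = new HashMap<Long, Double>();
        u1.put(101L, 5.0); u1.put(102L, 3.0); u1.put(103L, 2.5);
        Map<Long, Double> u5 = new HashMap<Long, Double>();
        u5.put(101L, 4.0); u5.put(102L, 3.0); u5.put(103L, 2.0);
        System.out.println("sim(u1,u5) = " + similarity(u1, u5));
    }
}
```

Once each user's nearest neighbors are known, items are scored from the neighbors' ratings; a user who already rated everything his neighbors rated, like user ID 5 here, has nothing left to recommend.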

1). Prepare the data file: Randomdata.csv

~ vi datafile/randomdata.csv

-0.883033363823402,-3.31967192630249
-2.39312626419456,3.34726861118871
2.66976353341256,1.85144276077058
-1.09922906899594,-6.06261735207489
-4.36361936997216,1.90509905380532
-0.00351835125495037,-0.610105996559153
-2.9962958796338,-3.60959839525735
-3.27529418132066,0.0230099799641799
2.17665594420569,6.77290756817957
-2.47862038335637,2.53431833167278
5.53654901906814,2.65089785582474
5.66257474538338,6.86783609641077
-0.558946883114376,1.22332819416237
5.11728525486132,3.74663871584768
1.91240516693351,2.95874731384062
-2.49747101306535,2.05006504756875
3.98781883213459,1.00780938946366

This is only a subset of the data; please see the source code for the full file.

Note: randomdata.csv was generated with the R language:

x1 <- cbind(x=rnorm(400,1,3), y=rnorm(400,1,3))
x2 <- cbind(x=rnorm(300,1,0.5), y=rnorm(300,0,0.5))
x3 <- cbind(x=rnorm(300,0,0.1), y=rnorm(300,2,0.2))
x <- rbind(x1,x2,x3)
write.table(x, file="randomdata.csv", sep=",", row.names=FALSE, col.names=FALSE)
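For readers without R, a rough Java equivalent of the data generator can be sketched with `java.util.Random.nextGaussian()`. The `RandomDataSketch` class name is an assumption for illustration; it mirrors the three Gaussian clusters of the R script but is not the original generator, so the exact numbers will differ.

```java
import java.util.Locale;
import java.util.Random;

// Rough Java equivalent of the R script above: three Gaussian clusters,
// one "x,y" CSV row per point (an illustrative sketch, not the original).
public class RandomDataSketch {

    static final Random RND = new Random();

    // n points drawn from N(meanX, sdX) x N(meanY, sdY)
    static String[] cluster(int n, double meanX, double sdX, double meanY, double sdY) {
        String[] rows = new String[n];
        for (int i = 0; i < n; i++) {
            double x = meanX + sdX * RND.nextGaussian();
            double y = meanY + sdY * RND.nextGaussian();
            rows[i] = String.format(Locale.US, "%f,%f", x, y);
        }
        return rows;
    }

    public static void main(String[] args) {
        // same shapes as the R code: 400 + 300 + 300 points
        String[] x1 = cluster(400, 1, 3, 1, 3);
        String[] x2 = cluster(300, 1, 0.5, 0, 0.5);
        String[] x3 = cluster(300, 0, 0.1, 2, 0.2);
        System.out.println((x1.length + x2.length + x3.length) + " rows generated");
    }
}
```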

2). Java program: KMeans.java

The algorithm implementation process of the KMeans method in Mahout:

The figure above is excerpted from Mahout in Action.

Create a new Java class: org.conan.mymahout.cluster06.KMeans.java

package org.conan.mymahout.cluster06;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansClusterer;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class KMeans {

    public static void main(String[] args) throws IOException {
        List<Vector> sampleData = MathUtil.readFileToVector("datafile/randomdata.csv");
        int k = 3;
        double threshold = 0.01;

        List<Vector> randomPoints = MathUtil.chooseRandomPoints(sampleData, k);
        for (Vector vector : randomPoints) {
            System.out.println("Init Point center: " + vector);
        }

        List<Cluster> clusters = new ArrayList<Cluster>();
        for (int i = 0; i < k; i++) {
            clusters.add(new Cluster(randomPoints.get(i), i, new EuclideanDistanceMeasure()));
        }

        List<List<Cluster>> finalClusters = KMeansClusterer.clusterPoints(sampleData, clusters, new EuclideanDistanceMeasure(), k, threshold);
        for (Cluster cluster : finalClusters.get(finalClusters.size() - 1)) {
            System.out.println("Cluster id: " + cluster.getId() + " center: " + cluster.getCenter().asFormatString());
        }
    }
}
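To clarify what `KMeansClusterer.clusterPoints` iterates internally, here is an illustrative sketch of a single KMeans iteration in plain Java (the `KMeansStepSketch` class is a hypothetical simplification for 2D points, not Mahout's code): assign each point to its nearest center, then move each center to the mean of its assigned points.

```java
// Illustrative sketch of one KMeans iteration over 2D points.
public class KMeansStepSketch {

    // squared Euclidean distance between point p and center c
    static double dist2(double[] p, double[] c) {
        double dx = p[0] - c[0], dy = p[1] - c[1];
        return dx * dx + dy * dy;
    }

    // One iteration: assign points to nearest centers, return recomputed centers.
    static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length;
        double[][] sums = new double[k][2];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            for (int j = 1; j < k; j++) {
                if (dist2(p, centers[j]) < dist2(p, centers[best])) best = j;
            }
            sums[best][0] += p[0];
            sums[best][1] += p[1];
            counts[best]++;
        }
        double[][] next = new double[k][2];
        for (int j = 0; j < k; j++) {
            if (counts[j] == 0) { next[j] = centers[j]; continue; } // keep empty center
            next[j][0] = sums[j][0] / counts[j];
            next[j][1] = sums[j][1] / counts[j];
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = {{0, 0}, {10, 10}};
        double[][] next = step(pts, centers);
        System.out.println(next[0][0] + "," + next[0][1] + " / " + next[1][0] + "," + next[1][1]);
    }
}
```

Repeating this step until the centers move less than the convergence threshold (0.01 above) gives the final clusters.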

3). Run the Java program
Console output:

Init Point center: {0:-0.162693685149196,1:2.19951550286862}
Init Point center: {0:-0.0409782183083317,1:2.09376666042057}
Init Point center: {0:0.158401778474687,1:2.37208412905273}
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Cluster id: 0 center: {0:-2.686856800552941,1:1.8939462954763795}
Cluster id: 1 center: {0:0.6334255423230666,1:0.49472852972602105}
Cluster id: 2 center: {0:3.334520309711998,1:3.2758355898247653}
