Mahout Algorithm Invocation Demo Platform 2.1

Source: Internet
Author: User

Software versions:

Windows 7: Tomcat 7, JDK 7, Spring 4.0.2, Struts 2.3, Hibernate 4.3, MyEclipse 10.0, EasyUI; Linux (CentOS 6.5): Hadoop 2.4, Mahout 1.0, JDK 7.

A web project invokes Mahout's algorithms and provides monitoring, so that the execution status of each task can be viewed.

This is a self-built web project; its homepage is the cluster configuration page described in section 3.1.


1. Preparation

The project can be downloaded in three parts: http://download.csdn.net/detail/fansy1990/7600427 (Part 1), http://download.csdn.net/detail/fansy1990/7600463 (Part 2), http://download.csdn.net/detail/fansy1990/7600489 (Part 3).

Hadoop is version 2.4 as provided on the official website and can be downloaded directly. Configure it (the configuration is not covered here) and start each service. Running jps should then show the following services:

    # jps
    6033 NodeManager
    5543 NameNode
    5629 DataNode
    5942 ResourceManager
    41611 Jps
    5800 SecondaryNameNode
    6412 JobHistoryServer

1.1 Hadoop packages

Create a new Java project in Eclipse and import the Hadoop packages, then test whether the project can connect to the cluster. Among the imported packages, change mapred-default.xml and yarn-default.xml as follows (node33 is the machine name of the pseudo-distributed Hadoop cluster).

mapred-default.xml:

    <property>
      <name>mapreduce.jobhistory.address</name>
      <value>node33:10020</value>
      <description>MapReduce JobHistory Server IPC host:port</description>
    </property>

yarn-default.xml:

    <property>
      <name>yarn.application.classpath</name>
      <value>
        $HADOOP_CONF_DIR,
        $HADOOP_COMMON_HOME/share/hadoop/common/*,
        $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
        $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
        $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
        $HADOOP_YARN_HOME/share/hadoop/yarn/*,
        $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
      </value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>node33</value>
    </property>

Note that the classpath entries are the corresponding paths on the cluster. A new YARNRunner file is also required; see http://blog.csdn.net/fansy1990/article/details/27526167. Test this setup first by running an MR task directly and checking whether it executes. If it does not, some setting is wrong.

1.2 Mahout packages

Import the Mahout packages. Obtain Mahout the way the official website describes: download the source with git and compile it yourself; see http://mahout.apache.org/developers/buildingmahout.html. Note the Hadoop 2 build option. Because this project uses Hadoop 2.4, the command is:

    mvn -Dhadoop2.version=2.4.1 -DskipTests clean install

After importing the packages, create a new file to test whether calling a Mahout algorithm executes properly. It should not report a JobContext/Job incompatibility error; if it does, the build is wrong (the author's compiled version can be downloaded instead).

2. Configuration

After downloading the project (the three links are listed in section 1), configure it as follows.

2.1 Hadoop-related configuration

(1) Change node33 in the files from section 1 to your own machine name;
(2) Remove the javax.servlet and javax.el folders from WebRoot/lib/mahout-*-job.jar in the project, otherwise Tomcat cannot start (this applies if you compiled the jar yourself; in the downloaded version they have already been removed);
(3) Change node33 and the port numbers in src/com/fz/util/HadoopUtils in the project to your own cluster's machine name/IP and ports;
(4) Upload all files of the project's src folder to the MapReduce folder of the cloud platform, otherwise class-not-found errors are reported (see mh2.1.jar in the lib folder).

2.2 Database-related configuration

Change the database settings in configuration/db.properties in the project (the database is not used for now).

2.3 Tomcat deployment

Tomcat deployment uses a context configuration file:

    <Context path="/mh"
             docBase="D:\workspase\hadoop_hbase\MahoutAlgorithmPlatform2.1\WebRoot"
             privileged="true"
             reloadable="false">
    </Context>

The project is deployed under the name mh.

3. Functions

The functionality covers four areas: cluster configuration, cluster algorithm monitoring, the Hadoop module, and the Mahout module. The data folder provides test data.

3.1 Cluster configuration module

Start the project and open http://localhost:8080/mh in a browser to access it. The first page shown is the cluster configuration page. Note that the values in src/com/fz/util/HadoopUtils do not have to be changed; they can be set on the cluster configuration page instead. The code that verifies whether the cluster can be reached is:

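    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;

    public int checkConnection(String fsStr, String rm) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", fsStr);
        conf.set("yarn.resourcemanager.address", rm);
        conf.set("mapreduce.framework.name", "yarn");
        // Check 1: HDFS is reachable if the root directory can be queried
        FileSystem fs = FileSystem.get(conf);
        boolean fsOnline = fs.exists(new Path("/"));
        if (!fsOnline) {
            return 1;
        }
        // Check 2: the cluster must report the RUNNING state
        JobClient jc = new JobClient(conf);
        ClusterStatus cs = jc.getClusterStatus();
        if (!"RUNNING".equals(cs.getJobTrackerStatus().toString())) {
            return 0;
        }
        // Cluster verification succeeded
        HadoopUtils.setConf(conf);
        HadoopUtils.setFs(fs);
        // Whether the cluster has been configured is later judged by
        // whether HadoopUtils.getConf() is null
        return 3;
    }
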
The check covers two aspects: (1) whether HDFS files can be accessed; (2) whether the cluster status is RUNNING. After filling in the configuration, click Verify. If validation succeeds, a success message is displayed.

3.2 Cluster algorithm monitoring module

Once verification succeeds on the cluster configuration page, the task monitoring page starts sending requests continuously (every 1.2 seconds, via Ajax) to get the execution status of the cluster's tasks. When no task is running, the request returns null directly. When the Mahout module or the Hadoop module runs MR tasks and the submission succeeds, the task information list is first initialized according to the number of MR jobs that this submission will run. Initialization finds the ID of the most recent job on the cluster and then derives the IDs of the jobs that will run next, as in the following code:

    public static void initialCurrentJobs(int nextJobNum) throws IOException {
        /*if (list != null && list.size() == 10) {
            list.clear();
        }*/
        list.clear(); // clear what is left over from the last submission
        JobStatus[] jbs = getJc().getAllJobs();
        JobID jid = findLastJob(jbs).getJobID();
        if (jid == null) {
            // the first time start the cluster, will be fixed next time
            // TODO fix the bug
            log.info("The cluster is started before and don't running any job!!!");
            return; // avoid a NullPointerException below; the list stays empty
        }
        log.info("The last job id is: {}", jid.toString());
        for (int i = 1; i <= nextJobNum; i++) {
            CurrentJobInfo cj = new CurrentJobInfo();
            // The next jobs get consecutive sequence numbers after the last one
            cj.setJobId(new JobID(jid.getJtIdentifier(), jid.getId() + i));
            list.add(cj);
        }
    }

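The ID arithmetic used during initialization can be tried without a cluster. The following is a minimal sketch that works on job ID strings only, assuming Hadoop's usual textual layout `job_<identifier>_<sequence>` with the sequence zero-padded to four digits. The names `NextJobIds` and `nextIds` are hypothetical and not part of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: derives the IDs of the next jobs from the last job ID string. */
final class NextJobIds {

    /** Returns the {@code count} job IDs that follow {@code lastJobId}. */
    static List<String> nextIds(String lastJobId, int count) {
        // Assumed layout: job_<identifier>_<sequence>, e.g. job_1403838241234_0005
        int split = lastJobId.lastIndexOf('_');
        if (!lastJobId.startsWith("job_") || split <= 3) {
            throw new IllegalArgumentException("Not a job ID: " + lastJobId);
        }
        String prefix = lastJobId.substring(0, split); // "job_<identifier>"
        int sequence = Integer.parseInt(lastJobId.substring(split + 1));
        List<String> ids = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            // Same idea as new JobID(jid.getJtIdentifier(), jid.getId() + i)
            ids.add(String.format(Locale.ROOT, "%s_%04d", prefix, sequence + i));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(nextIds("job_1403838241234_0005", 2));
    }
}
```

This approach only works while no other client submits jobs to the same cluster in between, because the predicted sequence numbers would then belong to someone else's jobs.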
Note that if the cluster has just been started for the first time and has not yet run any MR task, the obtained job ID is null and the list cannot be initialized (this will be fixed in the next version). The code that gets the currently running tasks is as follows:
    public static List<CurrentJobInfo> getCurrentJobs() throws IOException {
        for (int i = 0; i < list.size(); i++) {
            CurrentJobInfo iJob = list.get(i);
            RunningJob runningJob = findGivenJob(iJob.getJobId().toString());
            if (runningJob == null) {
                break;
            }
            if (i == list.size() - 1) { // set before the fields below are updated
                finished = runningJob.isComplete();
            }
            iJob.setJobName(runningJob.getJobName());
            iJob.setJobIdStr(runningJob.getJobStatus().getJobID().toString());
            iJob.setMapProgress(Utils.toPercent(runningJob.mapProgress(), 2));
            iJob.setRedProgress(Utils.toPercent(runningJob.reduceProgress(), 2));
            // Sometimes map and reduce have both reached 1 while this value
            // is still RUNNING; this needs to be handled
            iJob.setState(JobStatus.getJobRunState(runningJob.getJobState()));
        }
        return list;
    }

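The source of `Utils.toPercent` is not shown in the post. A plausible sketch of such a helper, assuming it turns a progress fraction between 0 and 1 into a percentage string with a fixed number of decimals, could look like this. The class name `PercentFormat` and the clamping to the range 0 to 1 are assumptions.

```java
import java.util.Locale;

/** Hypothetical stand-in for the project's Utils.toPercent helper. */
final class PercentFormat {

    /** Formats a fraction such as 0.25 as "25.00%" with the given number of decimals. */
    static String toPercent(double fraction, int digits) {
        if (digits < 0) {
            throw new IllegalArgumentException("digits must not be negative");
        }
        // Assumption: progress outside [0, 1] is clamped rather than rejected
        double clamped = Math.max(0.0, Math.min(1.0, fraction));
        return String.format(Locale.ROOT, "%." + digits + "f%%", clamped * 100.0);
    }

    public static void main(String[] args) {
        System.out.println(toPercent(0.25, 2));
    }
}
```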
Once the task information is obtained, the task monitoring page displays the execution status of each task.


3.3 Hadoop module

The Hadoop module currently contains five small functions: upload, download, read, read cluster centers, and convert text to a sequence vector file.

3.3.1 Upload and download

Upload and download both use FileSystem methods: copyFromLocalFile and copyToLocalFile respectively.

The page takes only two parameters.

3.3.2 Read and read cluster centers

Reading is done line by line, and the number of lines to read can be chosen. Reading cluster centers reads the sequence file directly. The code that reads the cluster center vectors is as follows:

    /**
     * Read the cluster center vectors.
     * @param conf Hadoop configuration
     * @param centerPathDir directory that holds the cluster center part files
     * @return one formatted cluster per line, or an error message
     * @throws IOException
     */
    public static String readCenter(Configuration conf, String centerPathDir) throws IOException {
        StringBuffer buff = new StringBuffer();
        Path input = new Path(centerPathDir, "part-*");
        // exists() does not expand globs, so check the directory itself
        if (!HadoopUtils.getFs().exists(new Path(centerPathDir))) {
            return input + " not exist, please check the input";
        }
        for (ClusterWritable cl : new SequenceFileDirValueIterable<ClusterWritable>(
                input, PathType.GLOB, conf)) {
            buff.append(cl.getValue().asFormatString(null)).append("\n");
        }
        return buff.toString();
    }

3.3.3 Text to sequence vector

This function is an MR task. After submitting it, its progress can be seen in the task monitoring module. It converts a text file into a sequence vector file, which provides the input data for clustering.

The text delimiter must be set before submitting.
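The role of the delimiter can be illustrated with the parsing step alone. The sketch below splits one text line on a user-supplied delimiter and parses the fields as numbers; in the project, the MR job presumably wraps such values in a Mahout vector and writes them to a sequence file, which is not shown here. The names `DelimitedVectorParser` and `parseLine` are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical sketch of the per-line parsing behind the text-to-vector conversion. */
final class DelimitedVectorParser {

    /** Splits a line on the literal delimiter and parses every field as a double. */
    static double[] parseLine(String line, String delimiter) {
        // Pattern.quote treats the delimiter literally, so "|" or "." also work
        String[] fields = line.trim().split(Pattern.quote(delimiter));
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseLine("14.23,1.71,2.43", ",")));
    }
}
```

A wrong delimiter leaves the whole line in a single field, so `Double.parseDouble` throws a `NumberFormatException`, which is why the delimiter has to match the input file.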

3.4 Mahout module

The Mahout module calls algorithms from the Mahout library and then monitors their execution status.

3.4.1 Clustering algorithm

The clustering algorithm currently uses only KMeans. Fill in the algorithm's parameters (the data is wine_kmeans.txt in the data folder).
The task is submitted from a separate thread, which makes it easy to monitor.
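The reason a separate thread helps is that a blocking driver call would otherwise hold up the request that started it, so nothing could poll for progress in the meantime. A minimal sketch of the pattern with the standard `ExecutorService` follows; the class `BackgroundSubmit` and its methods are hypothetical, and the sleeping task merely stands in for a real job.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: run a long job on a worker thread so the caller stays free to poll. */
final class BackgroundSubmit {

    private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "job-submitter");
        t.setDaemon(true); // do not keep the JVM alive just for this thread
        return t;
    });

    /** Hands the job to the worker thread and returns immediately. */
    static Future<Integer> submit(Callable<Integer> job) {
        return EXECUTOR.submit(job);
    }

    /** Blocks until the job finishes and returns its exit code. */
    static int await(Future<Integer> pending) {
        try {
            return pending.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("job failed", e.getCause());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a blocking call such as running an algorithm driver
        Future<Integer> pending = submit(() -> {
            Thread.sleep(300);
            return 0;
        });
        while (!pending.isDone()) {
            System.out.println("job still running, monitoring can poll here");
            Thread.sleep(100);
        }
        System.out.println("exit code: " + await(pending));
    }
}
```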

3.4.2 Classification algorithm

The classification algorithm currently uses only random forest (the data is data/galss.txt). It has two parts, build and test: build runs as an MR job, and test runs in single-machine mode. The model output path for build must be a relative path; an absolute path causes an error.

Click OK to open the task monitoring page and view the status of the submitted task.

Testing the random forest shows the parameters of the forest, the accuracy on the test data, and the confusion matrix.
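For readers unfamiliar with the two figures reported by the test step, the sketch below shows how accuracy and a confusion matrix are computed from actual and predicted class labels. It is independent of Mahout; the class `ConfusionMatrix` and the integer label encoding are assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: confusion matrix and accuracy for integer class labels. */
final class ConfusionMatrix {

    /** matrix[a][p] counts the samples of actual class a that were predicted as class p. */
    static int[][] build(int[] actual, int[] predicted, int numClasses) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays differ in length");
        }
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    /** Accuracy is the diagonal sum divided by the total number of samples. */
    static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int a = 0; a < matrix.length; a++) {
            for (int p = 0; p < matrix[a].length; p++) {
                total += matrix[a][p];
                if (a == p) {
                    correct += matrix[a][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        int[][] m = build(new int[] {0, 0, 1, 1, 2}, new int[] {0, 1, 1, 1, 2}, 3);
        System.out.println(Arrays.deepToString(m));
        System.out.println(accuracy(m));
    }
}
```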
3.4.3 Recommendation algorithm

The recommendation algorithm uses the item-based RecommenderJob. Set the parameters and submit the task.

Click OK. After the task is submitted successfully, it can be viewed on the monitoring page.
3.5 Help module

On the right side of the homepage there are three help pages, which provide help information for the different modules.

Share, grow, be happy

When reposting, please cite the blog address: http://blog.csdn.net/fansy1990



