Software versions:
Windows 7: Tomcat 7, JDK 7, Spring 4.0.2, Struts 2.3, Hibernate 4.3, MyEclipse 10.0, EasyUI; Linux (CentOS 6.5): Hadoop 2.4, Mahout 1.0, JDK 7.
This post uses a web project to call the relevant Mahout algorithms, provide monitoring, and view task execution status. The project is self-built; its home page is shown below:
1. Preparation. The project can be downloaded at http://download.csdn.net/detail/fansy1990/7600427 (part 1), http://download.csdn.net/detail/fansy1990/7600463 (part 2), and http://download.csdn.net/detail/fansy1990/7600489 (part 3). Hadoop uses version 2.4, downloaded directly from the official website. Then configure it (the configuration is not described here), start the services, and run jps; you should see the following services:
[root@node33 data]# jps
6033 NodeManager
5543 NameNode
5629 DataNode
5942 ResourceManager
41611 Jps
5800 SecondaryNameNode
6412 JobHistoryServer
1.1 Use Eclipse to create a Java project and import the Hadoop jars to test whether the cluster can be connected. After importing the jars, modify mapred-default.xml and yarn-default.xml with the following configuration (node33 is the machine name of the pseudo-distributed Hadoop cluster). mapred-default.xml:
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>node33:10020</value>
  <description>MapReduce JobHistory Server IPC host:port</description>
</property>
yarn-default.xml:
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
  </value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>node33</value>
</property>
Note that the classpath entries are the corresponding paths on the cluster, and a YARNRunner file has to be created; for details see http://blog.csdn.net/fansy1990/article/details/27526167. First, test whether the cluster can be connected (directly run an MR task and check whether it executes; a minimal smoke-test sketch is given at the end of this section). If it does not run, something is not set up correctly. 1.2 Mahout package import. Obtain the Mahout jars the way the official website describes: download the source with git and compile it, see http://mahout.apache.org/developers/buildingmahout.html (pay attention to choosing the hadoop2 profile, because 2.4 is used here:
mvn -Dhadoop2.version=2.4.1 -DskipTests clean install
). The imported packages are as follows. Create a new file for testing and check whether the called Mahout algorithm classes run normally, i.e. that no JobContext/Job incompatibility errors are reported. If such an error is reported, it indicates a compilation problem (you can download the author's compiled files instead).
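As a reference, here is a minimal connectivity smoke test of the kind section 1.1 asks for. It only checks that HDFS is reachable from the client, so a full test should still run an MR task as described above; the host name and the fs.defaultFS port (8020) are assumptions based on the node33 setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // host and port are assumptions; substitute your own cluster's values
        conf.set("fs.defaultFS", "hdfs://node33:8020");
        conf.set("yarn.resourcemanager.hostname", "node33");
        conf.set("mapreduce.framework.name", "yarn");
        FileSystem fs = FileSystem.get(conf);
        // reaching the HDFS root proves the client can talk to the NameNode
        System.out.println("HDFS online: " + fs.exists(new Path("/")));
    }
}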
2. Configuration. The project can be downloaded at http://download.csdn.net/detail/fansy1990/7600427 (part 1), http://download.csdn.net/detail/fansy1990/7600463 (part 2), and http://download.csdn.net/detail/fansy1990/7600489 (part 3). After downloading, it needs to be configured:
2.1 Hadoop-related configuration: (1) change node33, as set up in section 1 (Preparation), to your own machine name; (2) remove the javax.servlet and javax.el directories from the mahout-*-job jar under the project's WebRoot/lib (otherwise Tomcat cannot start; this only applies if you compiled Mahout yourself, as the downloadable jar already has them removed); (3) change node33 and the port number in the project's src/com/fz/util/HadoopUtils file to the machine name/IP and port of your cluster; (4) upload mh2.1.jar from the lib directory (it packages all files under the project's src directory) to the mapreduce directory of the cloud platform (otherwise a class-not-found error is reported). 2.2 Database configuration: modify the database settings in the project's configuration/db.properties file (the database is not used yet). 2.3 Tomcat deployment uses the configuration:
<Context path="/mh" docBase="D:\workspase\hadoop_hbase\MahoutAlgorithmPlatform2.1\WebRoot"
         privileged="true" reloadable="false"/>
The project is therefore deployed at the path /mh. 3. Functions. The functions include the cluster configuration module, the cluster algorithm monitoring module, the Hadoop module, and the Mahout module; the data directory provides test data. 3.1 Cluster configuration module. Start the project and open a browser at http://localhost:8080/mh to access it; the cluster configuration is shown on the home page. Note that you do not actually have to modify the src/com/fz/util/HadoopUtils file: the cluster can be configured on the cluster configuration page instead. The code that verifies whether the cluster can be connected is:
public int checkConnection(String fsStr, String rm) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", fsStr);
    conf.set("yarn.resourcemanager.address", rm);
    conf.set("mapreduce.framework.name", "yarn");
    FileSystem fs = FileSystem.get(conf);
    boolean fsOnline = fs.exists(new Path("/"));
    if (!fsOnline) {
        return 1;
    }
    JobClient jc = new JobClient(conf);
    ClusterStatus cs = jc.getClusterStatus();
    if (!"RUNNING".equals(cs.getJobTrackerStatus().toString())) {
        return 0;
    }
    // the cluster verified successfully; cache conf and fs
    HadoopUtils.setConf(conf);
    HadoopUtils.setFs(fs);
    // whether HadoopUtils.getConf() is null is later used to decide
    // whether the cluster has been configured
    return 3;
}
Two checks are used: (1) check that the HDFS root path exists; (2) check whether the cluster status is RUNNING. After the configuration is complete, click Verify; if verification succeeds, a success message is displayed:
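For illustration, a hypothetical caller might interpret the three return codes of checkConnection like this (the message strings are mine, not the project's; the ports are assumed Hadoop 2 defaults):

int code = checkConnection("hdfs://node33:8020", "node33:8032");
switch (code) {
    case 1:  System.out.println("HDFS root not reachable");                   break;
    case 0:  System.out.println("cluster status is not RUNNING");             break;
    case 3:  System.out.println("verification succeeded; conf and fs cached"); break;
    default: System.out.println("unexpected code: " + code);
}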
3.2 Cluster algorithm monitoring module. After clicking Verify in the cluster configuration, the task monitoring page continuously sends requests (via Ajax, at a 1.2 second interval) to obtain the running status of cluster tasks. When no task is running, the status request simply returns null. When the Mahout module or the Hadoop module runs an MR task and the task is submitted successfully, the task information objects are initialized according to the number of MR jobs the algorithm submits: find the ID of the last job that ran, then derive from it the IDs of the jobs about to run, as shown in the following code:
public static void initialCurrentJobs(int nextJobNum) throws IOException {
    /* if (list != null && list.size() == 10) {
        list.clear();
    } */
    list.clear();   // clear jobs left over from the previous run
    JobStatus[] jbs = getJC().getAllJobs();
    JobID jid = findLastJob(jbs).getJobID();
    if (jid == null) {
        // the cluster was just started and has not run any job yet,
        // so there is no last job ID; TODO fix this bug next time
        log.info("the cluster is started before and not running any job!!!");
    }
    log.info("the last job ID is: {}", jid.toString());
    for (int i = 1; i <= nextJobNum; i++) {
        CurrentJobInfo cj = new CurrentJobInfo();
        cj.setJobId(new JobID(jid.getJtIdentifier(), jid.getId() + i));
        list.add(cj);
    }
}
Note that if the cluster has just been started for the first time and no MR job has ever run, the last job ID is null and the initialization fails (this will be fixed in the next version). The code that obtains the currently running tasks is as follows:
public static List<CurrentJobInfo> getCurrentJobs() throws IOException {
    for (int i = 0; i < list.size(); i++) {
        CurrentJobInfo iJob = list.get(i);
        RunningJob runningJob = findGivenJob(iJob.getJobId().toString());
        if (runningJob == null) {
            break;
        }
        if (i == list.size() - 1) { // set this before filling in the fields
            finished = runningJob.isComplete();
        }
        iJob.setJobName(runningJob.getJobName());
        iJob.setJobIdStr(runningJob.getJobStatus().getJobID().toString());
        iJob.setMapProgress(Utils.toPercent(runningJob.mapProgress(), 2));
        iJob.setRedProgress(Utils.toPercent(runningJob.reduceProgress(), 2));
        iJob.setState(JobStatus.getJobRunState(runningJob.getJobState()));
        // even when both map and reduce progress reach 1, this state can
        // still be RUNNING, which needs extra handling
    }
    return list;
}
After obtaining the task information, you can monitor the running status of the task on the task monitoring page.
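Putting the two methods together, the polling flow might look roughly like this. JobInfoUtils is a hypothetical holder for the static methods and the finished flag shown above (the source does not name the class), and the getters mirror the setters used in getCurrentJobs:

// e.g. the submitted algorithm consists of 2 MR jobs
JobInfoUtils.initialCurrentJobs(2);
while (!JobInfoUtils.finished) {
    List<CurrentJobInfo> jobs = JobInfoUtils.getCurrentJobs();
    for (CurrentJobInfo j : jobs) {
        System.out.println(j.getJobIdStr() + " map=" + j.getMapProgress()
                + " reduce=" + j.getRedProgress() + " state=" + j.getState());
    }
    Thread.sleep(1200);   // mirrors the page's 1.2 second Ajax polling interval
}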
3.3 Hadoop module. The Hadoop module currently has five small functions: upload, download, read, read cluster centers, and convert text into sequence vector files. 3.3.1 Upload and download use the FileSystem methods copyFromLocalFile and copyToLocalFile; the page takes only two parameters:
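Under the hood, a minimal sketch of the two operations, assuming the FileSystem instance cached by HadoopUtils in section 3.1 (the paths are placeholders, not the project's own):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    // upload: copy a local file into HDFS
    public static void upload(String local, String remote) throws Exception {
        FileSystem fs = HadoopUtils.getFs();
        fs.copyFromLocalFile(new Path(local), new Path(remote));
    }
    // download: copy an HDFS file back to local disk (false = keep the source)
    public static void download(String remote, String local) throws Exception {
        FileSystem fs = HadoopUtils.getFs();
        fs.copyToLocalFile(false, new Path(remote), new Path(local));
    }
}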
3.3.2 Reading returns the data line by line, and the number of lines to read can be chosen; reading the cluster centers reads the sequence file directly. The code that reads the cluster center vectors is as follows:
/**
 * Read the cluster center vectors
 * @param conf
 * @param centerPathDir
 * @return
 * @throws IOException
 */
public static String readCenter(Configuration conf, String centerPathDir) throws IOException {
    StringBuffer buff = new StringBuffer();
    Path input = new Path(centerPathDir, "part-*");
    if (!HadoopUtils.getFs().exists(input)) {
        return input + " not exist, please check the input";
    }
    for (ClusterWritable cl : new SequenceFileDirValueIterable<ClusterWritable>(
            input, PathType.GLOB, conf)) {
        buff.append(cl.getValue().asFormatString(null)).append("\n");
    }
    return buff.toString();
}
3.3.3 Converting text to sequence vectors is an MR task; after the task is submitted, it can be watched in the task monitoring module. Its main function is to convert a text file into sequence vectors, providing the input data for clustering. The text delimiter needs to be set on the page.
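The project's own converter is an MR job; for clarity, here is only a single-process sketch of the same transformation, assuming purely numeric columns. It writes the SequenceFile<LongWritable, VectorWritable> format that Mahout's clustering jobs consume:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class Text2VectorWriter {
    // Convert delimiter-separated text lines into sequence vectors.
    public static void convert(Configuration conf, String localTxt,
                               String hdfsSeq, String delimiter) throws Exception {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(hdfsSeq)),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(VectorWritable.class));
             BufferedReader reader = new BufferedReader(new FileReader(localTxt))) {
            String line;
            long key = 0;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(delimiter);
                double[] values = new double[cols.length];
                for (int i = 0; i < cols.length; i++) {
                    values[i] = Double.parseDouble(cols[i]);
                }
                writer.append(new LongWritable(key++),
                        new VectorWritable(new DenseVector(values)));
            }
        }
    }
}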
Monitoring Information:
3.4 Mahout module. The Mahout algorithm module calls the algorithms in the Mahout algorithm library and monitors their running status. 3.4.1 Clustering algorithm. The clustering algorithm uses KMeans for the moment; set the clustering-related parameters (the data is wine_kmeans.txt in data):
Here the task is submitted in a separate thread, which makes monitoring easier; a sketch follows.
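A sketch of such a threaded submission, assuming the driver is invoked through its command-line Tool interface (the argument values in the comment are illustrative, not the project's actual ones):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;

public class KMeansSubmitter {
    // Run the driver in its own thread so the page (and its Ajax polling)
    // stays responsive while the MR jobs execute.
    // example args (illustrative): -i <vectors> -c <clusters> -o <output>
    //                              -k 3 -x 10 -cd 0.5 -cl
    public static void submit(final String[] args) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    ToolRunner.run(HadoopUtils.getConf(), new KMeansDriver(), args);
                } catch (Exception e) {
                    e.printStackTrace(); // a real handler would surface this to the UI
                }
            }
        }).start();
    }
}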
3.4.2 Classification algorithm. The classification algorithm uses random forests for the time being (the data is data/glass.txt). It is divided into two parts: build and test. Build uses the MR algorithm, while test uses standalone mode. Use a relative path for the output model path; an error is reported when an absolute path is used!
Click OK to go to the task monitoring page and view the task submission status:
Testing the random forest shows the parameters of the forest as well as the accuracy and the confusion matrix on the test data. A sketch of both calls is given below.
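For reference, a sketch of the two phases using Mahout's command-line random forest tools; the option values and paths are illustrative, and the .info dataset descriptor is assumed to have been generated beforehand (Mahout's Describe tool does this):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.df.mapreduce.BuildForest;
import org.apache.mahout.classifier.df.mapreduce.TestForest;

public class RandomForestRunner {
    public static void buildAndTest(Configuration conf) throws Exception {
        // build: an MR job; note the relative paths, as the post advises
        ToolRunner.run(conf, new BuildForest(), new String[] {
                "-d", "glass.txt",   // training data
                "-ds", "glass.info", // dataset descriptor
                "-t", "10",          // number of trees
                "-o", "rf_model"     // output path for the model
        });
        // test: standalone mode (no -mr flag); -a prints accuracy and confusion matrix
        ToolRunner.run(conf, new TestForest(), new String[] {
                "-i", "glass.txt",
                "-ds", "glass.info",
                "-m", "rf_model",
                "-a",
                "-o", "rf_predictions"
        });
    }
}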
3.4.3 Recommendation algorithm. The recommendation algorithm uses the item-based RecommenderJob: set the parameters and submit the task:
Click OK; after the task is submitted successfully, the monitoring information can be viewed:
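A sketch of how such a submission might look in code, using the Tool interface of org.apache.mahout.cf.taste.hadoop.item.RecommenderJob (the paths and option values are placeholders for what the page collects):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RecommendSubmitter {
    public static void main(String[] args) throws Exception {
        ToolRunner.run(HadoopUtils.getConf(), new RecommenderJob(), new String[] {
                "--input", "/user/root/rec/ratings.csv",  // userID,itemID,pref lines
                "--output", "/user/root/rec/output",
                "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
                "--numRecommendations", "10"
        });
    }
}

RecommenderJob itself submits a chain of several MR jobs, which is why the monitoring module in section 3.2 initializes one task information object per expected job.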
3.5 The help module is displayed on the right of the home page. You can view three help pages to obtain help information for different modules.
Share, grow, and be happy
Please indicate the blog address when reprinting: http://blog.csdn.net/fansy1990