Spark MLlib algorithm invocation and display platform and its implementation process


1. Software versions: IDE: IntelliJ IDEA 14; Java: 1.7; Scala: 2.10.6; Tomcat: 7; CDH: 5.8.0; Spark: 1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0; Hadoop: hadoop2.6.0-cdh5.8.0 (using the virtual machine provided by CDH)
2. Project download and deployment: Scala-encapsulated Spark algorithm project: https://github.com/fansy1990/Spark_MLlib_Algorithm_1.6.0.git; Spark algorithm invocation project: https://github.com/fansy1990/spark_mllib_1.6.0_.git; deployment (mainly for the spark_mllib_1.6.0 project):
1) Configure the corresponding user name, password, and database parameters in db.properties; 2) for the first Tomcat start, set the hibernate.hbm2ddl.auto value in hibernate.cfg.xml to create, and change it to update from the second start onward; 3) open the Cluster Parameters page and click Initialize to initialize the cluster parameters; if they do not match the current cluster, make the corresponding changes. For now the cluster parameters are configured through a configuration file; to switch to database configuration, modify the dbOrFile parameter in utils.properties, that is, for the moment only the utils.properties file needs to be modified; 4) copy the algorithm jar generated by the Spark_MLlib_Algorithm_1.6.0 project to the spark.jar path configured in 3); 5) copy the cluster's yarn-site.xml to the spark.files path configured in 3); 6) copy spark-assembly-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar to the spark.yarn.jar location configured in 3);
3. Implementation principles: 3.1 Scala-encapsulated Spark algorithm project: 3.1.1 Project directory 1. The project directory is as follows:
The data directory holds all the test data, with a separate subdirectory for each algorithm category; there are 5 categories: classification and regression / clustering / collaborative filtering / dimensionality reduction / frequent itemset mining. main/scala contains all the source code packages for the encapsulated Spark algorithms; test/scala contains the test code for each package;
2. The project is built with Maven; the corresponding dependencies are loaded directly according to the pom file;
3. The project needs to be packaged with Maven and the resulting jar placed in a fixed directory on HDFS in the CDH virtual machine, so that the Spark algorithm invocation project can call it easily (as described below); 3.1.2 Implementation of a single algorithm (encapsulation/testing), taking logistic regression as an example 1. For logistic regression, the encapsulated code is as follows: Code Listing 3-1 Logistic regression algorithm encapsulation (Scala)
package com.fz.classification

import com.fz.util.Utils
import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Logistic regression encapsulation algorithm
 * Labels used in Logistic Regression should be {0, 1, ..., k-1} for k classes multi-label classification problem
 * Input parameters:
 *   testOrNot: whether this is a test; normally set to false
 *   input: input data;
 *   minPartitions: minimum number of input data partitions
 *   output: output path
 *   targetIndex: subscript of the target column, starting from 1
 *   splitter: data delimiter;
 *   method: logistic regression algorithm to use: "SGD" or "LBFGS"
 *   hasIntercept: whether there is an intercept
 *   numClasses: number of target column categories;
 * Created by fanzhe on 2016/12/19.
 */
object LogisticRegression {

  def main(args: Array[String]) {
    if (args.length != 9) {
      println("Usage: com.fz.classification.LogisticRegression testOrNot input minPartitions output targetIndex " +
        "splitter method hasIntercept numClasses")
      System.exit(-1)
    }
    val testOrNot = args(0).toBoolean // whether this is a test; the SparkContext is obtained differently, true means test
    val input = args(1)
    val minPartitions = args(2).toInt
    val output = args(3)
    val targetIndex = args(4).toInt // starting from 1, not from 0; take note
    val splitter = args(5)
    val method = args(6) // should be "SGD" or "LBFGS"
    val hasIntercept = args(7).toBoolean
    val numClasses = args(8).toInt

    val sc = Utils.getSparkContext(testOrNot, "Logistic Create Model")

    // construct data
    // Load and parse the data
    val training = Utils.getLabeledPointData(sc, input, minPartitions, splitter, targetIndex).cache()

    // Run training algorithm to build the model
    val model = method match {
      case "SGD" => new LogisticRegressionWithSGD()
        .setIntercept(hasIntercept)
        .run(training)
      case "LBFGS" => new LogisticRegressionWithLBFGS()
        .setNumClasses(numClasses)
        .setIntercept(hasIntercept)
        .run(training)
      case _ => throw new RuntimeException("no method")
    }
    // save model
    model.save(sc, output)
    sc.stop()
  }
}
The code above explains each parameter, including its meaning and expected values. In the main function, each parameter is first read and assigned to a variable, and then the SparkContext is obtained. The most important part is the call to Spark's own LogisticRegressionWithSGD or LogisticRegressionWithLBFGS class to build the logistic regression model; finally, the model's save method is called to persist the model to HDFS. Essentially, all of the algorithm encapsulations follow this pattern: a thin wrapper layer on top of the native Spark MLlib algorithm.
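For illustration only, here is a minimal sketch of how a model persisted this way could later be loaded back from HDFS and used for prediction; the object name, model path, and feature values are hypothetical and not part of the project:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

object LoadLogisticModelExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Load Logistic Model"))
    // Hypothetical model path; in the project this is the "output" parameter of the wrapper
    val model = LogisticRegressionModel.load(sc, "hdfs:///tmp/logistic/model")
    // Score a single, made-up feature vector
    println("predicted class: " + model.predict(Vectors.dense(1.0, 0.5, 2.0)))
    sc.stop()
  }
}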
2. Testing
Testing is done mainly with JUnit; the logistic regression test code is as follows: Code Listing 3-2 Logistic regression algorithm encapsulation test (Scala)
package com.fz.classification

import java.io.File
import com.fz.util.Utils
import org.junit.{Assert, Test}
import Assert._

/**
 * Test Logistics Regression algorithm
 * Created by fanzhe on 2016/12/19.
 */
@Test
class LogisticRegressionTest {

  @Test
  def testMain1() = {
    // testOrNot input minPartitions output targetIndex splitter method hasIntercept numClasses
    val args = Array(
      "true",
      "./src/data/classification_regression/logistic.dat",
      "2",
      "./target/logistic/tmp1",
      "1",
      " ",
      "SGD",
      "true",
      "2" // this parameter is useless for SGD
    )
    // delete output directory
    Utils.deleteOutput(args(3))
    LogisticRegression.main(args)
    assertTrue(Utils.fileContainsClassName(args(3) + "/metadata/part-00000",
      "org.apache.spark.mllib.classification.LogisticRegressionModel"))
  }

  @Test
  def testMain2() = {
    // testOrNot input minPartitions output targetIndex splitter method hasIntercept numClasses
    val args = Array(
      "true",
      "./src/data/classification_regression/logistic.dat",
      "2",
      "./target/logistic/tmp2",
      "1",
      " ",
      "LBFGS",
      "true",
      "2"
    )
    // delete output directory
    Utils.deleteOutput(args(3))
    LogisticRegression.main(args)
    assertTrue(Utils.fileContainsClassName(args(3) + "/metadata/part-00000",
      "org.apache.spark.mllib.classification.LogisticRegressionModel"))
  }
}
Each test method first builds the algorithm parameters, then calls the main method, and finally checks whether the output contains the model's information. More test methods with different algorithm parameters or data can of course be added (readers can add them themselves).
3.2 Spark algorithm invocation project: 3.2.1 Interface Introduction

1. Home page: The system home page introduces the algorithms implemented by the system. The main functions of the system are: 1) cluster parameter maintenance: mainly the configuration of the parameters of the underlying Hadoop cluster; every time the configuration is saved, it not only updates the corresponding database records but also refreshes the Hadoop configuration used to access the cluster; 2) monitoring: monitoring the status of the Spark tasks running under the YARN resource manager; 3) file upload and preview: file upload copies local test data to HDFS to make testing from the pages easier, while preview views the data already on HDFS; 4) classification and regression / collaborative filtering / clustering / dimensionality reduction / association rules: under each category are the modeling pages that invoke the individual algorithms.
2. Cluster parameters page: Click Initialize and the parameters are persisted to the backend database; users can then modify them to match their own cluster, and every modification also refreshes the Hadoop configuration used to access the cluster (a minimal sketch of this idea follows).
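Conceptually, "refreshing the Hadoop configuration used for access" amounts to writing the stored cluster parameters into a Hadoop Configuration object. The sketch below only illustrates that idea, assuming the parameters arrive as a simple map; the helper name is made up, the property keys are standard Hadoop/YARN keys, and the default values are placeholders rather than the project's actual parameter names:

import org.apache.hadoop.conf.Configuration

object ClusterConfig {
  // Hypothetical helper: build a Configuration from cluster parameters kept in the database or a properties file
  def buildConfiguration(params: Map[String, String]): Configuration = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", params.getOrElse("fs.defaultFS", "hdfs://quickstart.cloudera:8020"))
    conf.set("yarn.resourcemanager.address", params.getOrElse("yarn.resourcemanager.address", "quickstart.cloudera:8032"))
    conf
  }
}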
3. Monitoring: The monitoring page shows the running status of the Spark tasks submitted by users; if a task fails, the exception message is displayed (only part of the message is captured; further adjustment is needed to decide which important information to obtain and display directly).
4. File upload: File upload has two functions: 1) you can specify a local directory and an HDFS directory and upload data from the local machine to HDFS; 2) you can directly select the data shipped with an algorithm and initialize it, which uploads the data under the project's src/main/data to a fixed HDFS directory. Data uploaded either way can be used in the later algorithm modeling. Note that the HDFS path being written to must be writable by the user who started Tomcat (a minimal upload sketch follows).
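The upload itself boils down to copying a local file into HDFS. Below is a minimal sketch with the Hadoop FileSystem API; the helper name and the idea of passing paths straight from the upload form are assumptions, and the Configuration is expected to carry the cluster's fs.defaultFS:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsUpload {
  // Hypothetical helper: copy a local file to an HDFS directory; both paths would come from the upload form
  def upload(localFile: String, hdfsDir: String, conf: Configuration): Unit = {
    val fs = FileSystem.get(conf)
    try {
      fs.copyFromLocalFile(new Path(localFile), new Path(hdfsDir))
    } finally {
      fs.close()
    }
  }
}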
5. File view: The file viewing function can only display text-encoded (plain text) files; you can enter the number of lines to read from the file.
6. Logistic regression algorithm: On the logistic regression page, enter the algorithm parameters and click Submit. If the task is submitted successfully, the ID of the submitted task is displayed; if submission fails (that is, no task ID is obtained), a corresponding message is shown as well. After the task is submitted, its status can also be observed on the monitoring page, where refreshing retrieves the latest task status;
7. The other algorithm pages are similar to the logistic regression page. 3.2.2 Architecture: The system architecture diagram (algorithm invocation and monitoring) is as follows:
The process is described as follows: 1. The front-end page sets the parameters, including the algorithm data and the algorithm parameters, and then submits the task; 2. After a task is submitted, CloudAction receives it and starts a thread that launches the job on Hadoop; the thread's return value is either the task (application) ID or null if the submission fails; 3. Monitoring the submission status: after CloudAction starts the thread, the main thread blocks and waits for the Hadoop task thread to return a value, and depending on that value the front end is told whether the task submission succeeded or failed (a sketch of this pattern follows the list); 4. At the same time as 3, the JobInfo status in the corresponding database table is updated through DBService; 5. On the monitor.html page, the Refresh button fetches the Hadoop task status in time (the corresponding service is described below), updates the related data in the database, and returns all task information to the front end;
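Steps 2 and 3 describe a "worker thread submits, main thread waits for the application ID" pattern. The sketch below only illustrates that pattern; the submitSparkJob body is a placeholder, and none of these names come from the project's actual CloudAction code:

import java.util.concurrent.{Callable, Executors, TimeUnit}

object SubmitAndWait {
  // Placeholder for the real submission code: should return the YARN application id, or null on failure
  def submitSparkJob(mainClass: String, args: Array[String]): String = null

  // The main thread blocks (with a timeout) until the worker thread reports the application id
  def submitWithTimeout(mainClass: String, args: Array[String]): Option[String] = {
    val executor = Executors.newSingleThreadExecutor()
    val future = executor.submit(new Callable[String] {
      override def call(): String = submitSparkJob(mainClass, args)
    })
    try {
      Option(future.get(60, TimeUnit.SECONDS)) // Some(appId) means success, None means failure
    } catch {
      case _: Exception => None // timeout or submission error is reported back to the page as a failure
    } finally {
      executor.shutdownNow()
    }
  }
}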
3.2.3 Partial Implementation Details 1. Spark task submission: refer to the Spark-based ALS online recommendation system;
2. Monitoring: querying the task status list in real time. The process is described as follows:
1) Fetch the latest records from JobInfo; 2) find the rows whose isFinished field is false; 3) for the rows found in 2), query YARN for their real-time status, update the data from 1), and store it back in the database; 4) return JSON data paginated according to the row and page fields;
The code is as follows: Code Listing 3-3 Updating the monitoring task list
public void getJobInfo() {
    Map<String, Object> jsonMap = new HashMap<String, Object>();
    // 1. get the latest `records` rows from the JobInfo table
    List<Object> jobInfos = dbService.getLastNRows("JobInfo", "jobId", true, records);
    // 2, 3. update their status against YARN
    List<Object> list = null;
    try {
        list = HUtils.updateJobInfo(jobInfos);
        if (list != null && list.size() > 0) {
            dbService.updateTableData(list);
        }
    } catch (Exception e) {
        e.printStackTrace();
        log.warn("Update task status exception!");
        jsonMap.put("total", 0);
        jsonMap.put("rows", null);
        Utils.write2PrintWriter(JSON.toJSONString(jsonMap));
        return;
    }
    // 4. paginate and return JSON
    jsonMap.put("total", list.size());
    jsonMap.put("rows", Utils.getSubList(list, page, rows));
    Utils.write2PrintWriter(JSON.toJSONString(jsonMap));
}
The first step fetches the most recent records through DBService, and the second step updates their status; see the HUtils.updateJobInfo implementation: Code Listing 3-4 Getting the latest status of a task
public static List<Object> updateJobInfo(List<Object> jobInfos) throws YarnException, IOException {
    List<Object> list = new ArrayList<>();
    JobInfo jobInfo;
    for (Object o : jobInfos) {
        jobInfo = (JobInfo) o;
        if (!jobInfo.isFinished()) { // if not finished, check its current status
            ApplicationReport appReport = null;
            try {
                appReport = getClient().getApplicationReport(SparkUtils.getAppId(jobInfo.getJobId()));
            } catch (YarnException | IOException e) {
                e.printStackTrace();
                throw e;
            }
            /*
             * YarnApplicationState ordinals:
             * NEW 0, NEW_SAVING 1, SUBMITTED 2, ACCEPTED 3,
             * RUNNING 4, FINISHED 5, FAILED 6, KILLED 7
             */
            switch (appReport.getYarnApplicationState().ordinal()) {
                case 0: // NEW
                case 1: // NEW_SAVING
                case 2: // SUBMITTED
                case 3: // ACCEPTED: all mapped to the ACCEPTED state
                    jobInfo.setRunState(JobState.ACCETPED);
                    break;
                case 4:
                    jobInfo.setRunState(JobState.RUNNING);
                    break;
                case 5:
                    // FinalApplicationStatus ordinals: UNDEFINED 0, SUCCEEDED 1, FAILED 2, KILLED 3
                    switch (appReport.getFinalApplicationStatus().ordinal()) {
                        case 1:
                            jobInfo.setRunState(JobState.SUCCESSED);
                            SparkUtils.cleanupStagingDir(jobInfo.getJobId());
                            jobInfo.setFinished(true);
                            break;
                        case 2:
                            jobInfo.setRunState(JobState.FAILED);
                            SparkUtils.cleanupStagingDir(jobInfo.getJobId());
                            jobInfo.setErrorInfo(appReport.getDiagnostics().substring(0, Utils.exceptionMessageLength));
                            jobInfo.setFinished(true);
                            break;
                        case 3:
                            jobInfo.setRunState(JobState.KILLED);
                            SparkUtils.cleanupStagingDir(jobInfo.getJobId());
                            jobInfo.setFinished(true);
                            break;
                        default:
                            log.warn("App: " + jobInfo.getJobId() + " get task status exception! "
                                    + "appReport.getFinalApplicationStatus(): " + appReport.getFinalApplicationStatus().name()
                                    + ", ordinal: " + appReport.getFinalApplicationStatus().ordinal());
                    }
                    break;
                case 6:
                    jobInfo.setRunState(JobState.FAILED);
                    SparkUtils.cleanupStagingDir(jobInfo.getJobId());
                    jobInfo.setErrorInfo(appReport.getDiagnostics().substring(0, Utils.exceptionMessageLength));
                    jobInfo.setFinished(true);
                    break;
                case 7:
                    jobInfo.setRunState(JobState.KILLED);
                    SparkUtils.cleanupStagingDir(jobInfo.getJobId());
                    jobInfo.setFinished(true);
                    break;
                default:
                    log.warn("App: " + jobInfo.getJobId() + " get task status exception! "
                            + "appReport.getYarnApplicationState(): " + appReport.getYarnApplicationState().name()
                            + ", ordinal: " + appReport.getYarnApplicationState().ordinal());
            }
            jobInfo.setModifiedTime(new Date());
        }
        list.add(jobInfo); // add the updated or original jobInfo to the list
    }
    return list;
}
The work here is driven by the task status stored in the database: only tasks that have not yet finished are queried for their latest status, the original task status is updated, and the updated or unchanged tasks are added to a list that is returned to Code Listing 3-3. There, DBService.updateTableData is called to persist the data, and finally the list is cut with subList and the requested page of data is returned to the front end (a small pagination sketch follows).
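On the pagination point, the subList cut can be written defensively so the indices never run past the end of the list. The helper below is a small standalone sketch of that idea, not the project's Utils.getSubList:

object Paging {
  // Return the rows of a 1-based page, clamping the indices to the size of the list
  def getPage[T](all: List[T], page: Int, rows: Int): List[T] = {
    val from = math.max(0, (page - 1) * rows)
    val to = math.min(all.size, from + rows)
    if (from >= to) Nil else all.slice(from, to)
  }
}

For example, getPage(jobList, 2, 10) would return rows 11 to 20 when they exist, and an empty list otherwise.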
4. Follow-up development of the Spark algorithm invocation project: admittedly, this version of the project is not fully developed. If you want to continue developing it, what is the general process?
1) Write the algorithm's corresponding thread under src/main/java; 2) write the corresponding page under webapp; 3) write the corresponding JS under webapp/js; 4) modify webapp/preprocess/upload.jsp, add a data upload record, and add the corresponding data under main/data; 5) start the project, upload the data on the page, then select the algorithm, set the parameters, and submit the task; after submission, the algorithm's running state can be seen on the monitoring page (a skeleton for step 1 is sketched after this list);
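As a rough illustration of step 1), the algorithm thread mainly has to assemble the argument array in the exact order expected by the packaged algorithm's main method and hand it to the submission code. The skeleton below is hypothetical; the real project writes such threads in Java under src/main/java, and the Submitter.submit call mentioned in the comment is only a placeholder:

// Hypothetical skeleton of an algorithm-submission thread
class LogisticRegressionThread(input: String, output: String, targetIndex: String,
                               splitter: String, method: String, hasIntercept: String,
                               numClasses: String) extends Thread {
  override def run(): Unit = {
    // Argument order must match the wrapper's main method:
    // testOrNot input minPartitions output targetIndex splitter method hasIntercept numClasses
    val args = Array("false", input, "2", output, targetIndex, splitter, method, hasIntercept, numClasses)
    // Placeholder: hand the main class and arguments over to the project's submission code,
    // e.g. something like Submitter.submit("com.fz.classification.LogisticRegression", args)
  }
}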

Project status (call the Scala-encapsulated project Project 1 and the Spark algorithm invocation project Project 2): Project 1: basic encapsulation of the data mining algorithms in Spark MLlib, including clustering, classification, regression, collaborative filtering, dimensionality reduction, and frequent itemset mining (this last one still has a problem); Project 2: currently only the pages and invocation code for the classification and regression algorithms are done;
So, if you want to develop on top of this version, you can refer to the process above and try writing the ALS algorithm call first.
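As a starting point for that exercise, an ALS wrapper could follow the same pattern as Code Listing 3-1: parse the parameters, obtain a SparkContext, train, and save the model. The sketch below is only illustrative; the parameter list, the class name, and the assumption that each input line is userId, itemId, rating separated by the splitter are mine, not the project's final code:

package com.fz.recommendation

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Illustrative ALS wrapper following the same pattern as the logistic regression wrapper
object ALSTrain {
  def main(args: Array[String]): Unit = {
    if (args.length != 6) {
      println("Usage: com.fz.recommendation.ALSTrain input output splitter rank numIterations lambda")
      System.exit(-1)
    }
    val Array(input, output, splitter, rank, numIterations, lambda) = args
    val sc = new SparkContext(new SparkConf().setAppName("ALS Create Model"))
    // Each input line is assumed to be: userId<splitter>itemId<splitter>rating
    val ratings = sc.textFile(input).map { line =>
      val fields = line.split(splitter)
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }
    val model = ALS.train(ratings, rank.toInt, numIterations.toInt, lambda.toDouble)
    model.save(sc, output)
    sc.stop()
  }
}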
5. Summary 1. Many pages of the Spark algorithm invocation project are not yet completed; this is mostly repetitive work with no real difficulty to overcome; 2. For each algorithm, the original plan was to show a data description, algorithm description, and parameter description on its invocation page; these have not been added yet, but the information is available in the Scala algorithm encapsulation project; 3. The process of invoking Spark algorithms with Spark on YARN and managing the tasks through YARN is basically reflected in the Spark algorithm invocation project; nothing fancier can be squeezed out of it, so if you want to study this area, the project is a good reference; 4. The earlier plan for the classification part was to add comparison and analysis between algorithms, plus some charts and other visualizations, so that it looks more impressive, but at present only one step is missing, namely a Scala-encapsulated algorithm for classification model evaluation (a minimal sketch follows this summary); 5. Scheduled task flows and the like could be added to the project, which would make it somewhat like Oozie, except that Oozie has no direct drag-and-drop interface or flow task monitoring and management; with those it would look more like commercial software (Kettle); 6. My knowledge of the SSH framework is relatively weak, so SSH is applied only in a simple way here (for example, for pagination I use subList directly, which is probably not appropriate); 7. My front-end skills are also relatively weak, so the interface style and the information displayed on individual pages are not exactly pleasing to the eye; 8. The code is free, just enjoy!
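For point 4, a classification evaluation wrapper would mostly revolve around MLlib's evaluation classes. The sketch below shows only the core computation, assuming the model has already been saved by the logistic regression wrapper and the labeled test data has already been loaded; the object and method names are illustrative:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Illustrative evaluation step for a saved classification model
object EvaluateModel {
  def evaluate(sc: SparkContext, modelPath: String, test: RDD[LabeledPoint]): Unit = {
    val model = LogisticRegressionModel.load(sc, modelPath)
    val predictionAndLabels = test.map(p => (model.predict(p.features), p.label))
    val metrics = new MulticlassMetrics(predictionAndLabels)
    println("precision = " + metrics.precision + ", recall = " + metrics.recall)
  }
}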


Share, grow, be happy


Down-to-earth, focus


When reprinting, please cite the blog address: http://blog.csdn.net/fansy1990

