Spark Model Example: Two Ways to Implement a Random Forest Model (MLlib and ML)

Source: Internet
Author: User

The official example referenced by this article:
http://blog.csdn.net/dahunbi/article/details/72821915
The official example has a drawback: the training data is loaded directly from a file without any preprocessing, which is a bit of a shortcut.

    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

In practice, our Spark jobs are all built on top of a Hadoop stack, and the tables are stored on HDFS, so the normal way to extract them is with HiveQL, by invoking a HiveContext.
As mentioned in the previous article, there are two machine learning libraries: ML and MLlib. First the ML example, which uses a Pipeline:

    import java.io.{ObjectInputStream, ObjectOutputStream}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
    import org.apache.spark.ml.util.MLWritable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    import hc.implicits._ // bring in the HiveContext implicits

    // Sample table: the first column is the label (0 or 1); the other columns may be
    // the name, the phone number, and the actual feature columns used for training.
    val data = hc.sql(s"""select * from database1.traindata_userprofile""".stripMargin)

    // Extract the schema, i.e. the table's column names; drop(2) removes the first
    // two columns, keeping only the feature columns.
    val schema = data.schema.map(f => s"${f.name}").drop(2)

    // VectorAssembler in ML is a transformer. It requires that the input columns are
    // not strings, and it merges multiple columns into a single vector column, e.g.
    // age, income and other fields into one "userFea" vector column, which makes the
    // later training steps easier.
    val assembler = new VectorAssembler().setInputCols(schema.toArray).setOutputCol("userFea")
    val userProfile = assembler.transform(data.na.fill(-1e9)).select("label", "userFea")
    val dataTrain = userProfile.na.fill(-1e9) // fetch the training sample

    val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(userProfile)
    val featureIndexer = new VectorIndexer().setInputCol("userFea").setOutputCol("indexedFeatures").setMaxCategories(4).fit(userProfile)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = userProfile.randomSplit(Array(0.7, 0.3))

    // Train a RandomForest model.
    val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
    rf.setMaxBins(32)            // maxBins value lost in the original; 32 is a placeholder
      .setMaxDepth(6)
      .setNumTrees(100)          // numTrees value lost in the original; 100 is a placeholder
      .setMinInstancesPerNode(4)
      .setImpurity("gini")

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

    val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

    // Train the model. This also runs the indexers.
    val model = pipeline.fit(trainingData)
    println("training finished!!!!")

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "indexedLabel", "indexedFeatures").show(5)

    val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test Error = " + (1.0 - accuracy))
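The fitted pipeline can also be persisted and reloaded later for scoring. A minimal sketch, assuming Spark 2.0+ (where ML pipeline persistence is available); the save path is hypothetical, and `model` and `testData` refer to the values defined in the example above:

```scala
import org.apache.spark.ml.PipelineModel

// Persist the fitted pipeline (the path is hypothetical).
model.write.overwrite().save("/tmp/rf_pipeline_model")

// Reload it later and score new rows; the loaded PipelineModel replays the same
// indexing and assembling stages that were fit at training time.
val reloaded = PipelineModel.load("/tmp/rf_pipeline_model")
val scored = reloaded.transform(testData)
scored.select("predictedLabel").show(5)
```

Because the indexers are saved as pipeline stages, new data does not need to be re-indexed by hand before prediction.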
Now the MLlib example, based on RDDs. Note the step that converts from an ML vector to an MLlib vector:
    import java.io.{ObjectInputStream, ObjectOutputStream}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.hive.HiveContext
    // import org.apache.spark.ml.linalg.Vector

    var modelRF: RandomForestModel = null
    val hc = new HiveContext(sc)
    import hc.implicits._

    // The ad user profile has been built. Sample table: the first column is the
    // label (0 or 1); the other columns may be the name, the phone number, and the
    // actual feature columns used for training.
    val data = hc.sql(s"""select * from database1.traindata_userprofile""".stripMargin)

    // Extract the schema, i.e. the table's column names; drop(1) removes the leading
    // non-feature column, keeping only the feature columns.
    val schema = data.schema.map(f => s"${f.name}").drop(1)

    // VectorAssembler in ML is a transformer. It requires that the input columns are
    // not strings, and it merges multiple columns into a single vector column, e.g.
    // age, income and other fields into one "userFea" vector column for later training.
    val assembler = new VectorAssembler().setInputCols(schema.toArray).setOutputCol("userFea")
    val data2 = data.na.fill(-1e9)
    val userProfile = assembler.transform(data2).select("label", "userFea")

    // Key point: a vector column built by the ML VectorAssembler must be converted
    // from the ML Vector type to the MLlib Vector type before the MLlib classifiers
    // can use it (the two vector types are a real pitfall; be careful).
    val userProfile2 = MLUtils.convertVectorColumnsFromML(userProfile, "userFea")

    // Fetch the training sample.
    val rddData: RDD[LabeledPoint] = userProfile2.rdd.map { x =>
      val label = x.getAs[Double]("label")
      val userFea = x.getAs[Vector]("userFea")
      LabeledPoint(label, userFea)
    }

    // The training data is ready; the RF parameters are as follows.
    val impurity = "gini"
    val featureSubsetStrategy = "auto" // let the algorithm choose
    val categoricalFeaturesInfo = Map[Int, Int]()
    val iteration = 50 // value lost in the original (and unused below); 50 is a placeholder
    val maxDepth = 9
    val numClasses = 2
    val maxBins = 32   // maxBins value lost in the original; 32 is a placeholder
    val numTrees = 100 // numTrees value lost in the original; 100 is a placeholder

    modelRF = RandomForest.trainClassifier(rddData, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    println("training finished!!!!")

    // Evaluate the model on test instances and compute the test error.
    val labelAndPreds = userProfile2.rdd.map { x =>
      val label = x.getAs[Double]("label")
      val userFea = x.getAs[Vector]("userFea")
      val prediction = modelRF.predict(userFea)
      (label, prediction)
    }
    labelAndPreds.take(10).foreach(println) // take(n) argument lost in the original; 10 is a placeholder
    modelRF.save(sc, "/home/user/victorhuang/rfcmodel_mllib")
    sc.stop()
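The MLlib model saved above can be read back with the matching `load` method on `RandomForestModel`. A minimal sketch; it assumes the same `sc` and the same save path as in the example above, and the feature values passed to `predict` are purely illustrative:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

// Reload the model persisted by modelRF.save(sc, path) above.
val sameModel = RandomForestModel.load(sc, "/home/user/victorhuang/rfcmodel_mllib")

// Score a single MLlib vector. The dimensions and values here are illustrative;
// a real vector must match the training feature layout produced by VectorAssembler.
val pred = sameModel.predict(Vectors.dense(0.0, 1.0, -1e9))
println(s"prediction = $pred")
```

Save/load works as a pair: a model saved by the MLlib `save` must be reloaded with the MLlib `load`, not with the ML pipeline persistence API.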
