Official Spark examples: two ways to implement random forest models (ML/MLlib)

Source: Internet
Author: User

As of Spark 2.0 there are two machine-learning libraries, mllib and ml, and many algorithms are implemented in both. For random forests, for example, there are:

org.apache.spark.mllib.tree.RandomForest

and

org.apache.spark.ml.classification.RandomForestClassificationModel

The two libraries are used quite differently: mllib is the RDD-based API, while ml is the DataFrame-based API built around ML Pipelines.
See http://spark.apache.org/docs/latest/ml-guide.html
The official examples therefore differ considerably as well. The source code and comments for both are given below.

MLlib model implementation

// scalastyle:off println
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
// $example off$

object RandomForestClassificationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RandomForestClassificationExample")
    val sc = new SparkContext(conf)
    // $example on$
    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a RandomForest model.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 3 // Use more in practice.
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val impurity = "gini"
    val maxDepth = 4
    val maxBins = 32

    val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification forest model:\n" + model.toDebugString)

    // Save and load model
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
    // $example off$
  }
}
// scalastyle:on println
ML model implementation
// scalastyle:off println
package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// $example off$
import org.apache.spark.sql.SparkSession

object RandomForestClassifierExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("RandomForestClassifierExample")
      .getOrCreate()

    // $example on$
    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)
    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a RandomForest model.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(10)

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain indexers and forest in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test Error = " + (1.0 - accuracy))

    val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
    println("Learned classification forest model:\n" + rfModel.toDebugString)
    // $example off$

    spark.stop()
  }
}
// scalastyle:on println
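In both versions the reported test error is simply the fraction of misclassified test points: the mllib code counts label/prediction mismatches directly, while the ml code computes it as 1 − accuracy via MulticlassClassificationEvaluator. A plain-Scala sketch of that arithmetic (TestErrorSketch and testError are illustrative names, not Spark API):

```scala
// Test error = misclassified count / total count over (label, prediction) pairs,
// mirroring labelAndPreds.filter(r => r._1 != r._2).count.toDouble / total.
object TestErrorSketch {
  def testError(labelAndPreds: Seq[(Double, Double)]): Double = {
    require(labelAndPreds.nonEmpty, "need at least one prediction")
    val wrong = labelAndPreds.count { case (label, pred) => label != pred }
    wrong.toDouble / labelAndPreds.size
  }
}
```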

TIPS:
Want to see all of the sample code referenced at http://spark.apache.org/docs? One way is to browse the Spark repository on GitHub; another is to look inside your Spark installation directory, where all the example source lives under spark/examples/src/main/scala/.
For example, the Scala implementations of the ML algorithms are in:
spark/examples/src/main/scala/org/apache/spark/examples/ml
and the MLlib algorithms in:
spark/examples/src/main/scala/org/apache/spark/examples/mllib
