In Spark 2.0 there are two libraries of machine-learning algorithm implementations, spark.mllib and spark.ml. For random forests, for example, there are:
org.apache.spark.mllib.tree.RandomForest
and
org.apache.spark.ml.classification.RandomForestClassificationModel
The two libraries are used differently: spark.mllib is the RDD-based API, while spark.ml is the DataFrame-based API built around ML Pipelines.
Refer to http://spark.apache.org/docs/latest/ml-guide.html
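The practical difference is visible as soon as data is loaded. A minimal sketch (assuming a SparkSession named spark is already available; the data file is the one used in the full examples below):

import org.apache.spark.mllib.util.MLUtils

// spark.mllib, RDD-based API: returns an RDD[LabeledPoint]
val rddData = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mllib/sample_libsvm_data.txt")

// spark.ml, DataFrame-based API: returns a DataFrame with "label" and "features" columns
val dfData = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")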
The official examples are therefore quite different as well. Both are given below, with source code and comments.
MLlib model implementation
// scalastyle:off println
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
// $example off$

object RandomForestClassificationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RandomForestClassificationExample")
    val sc = new SparkContext(conf)
    // $example on$
    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a RandomForest model.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 3 // Use more in practice.
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val impurity = "gini"
    val maxDepth = 4
    val maxBins = 32

    val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification forest model:\n" + model.toDebugString)

    // Save and load model
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
    // $example off$
  }
}
// scalastyle:on println
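The example above counts misclassifications by hand. The RDD-based API also ships an evaluation helper; a minimal sketch (reusing the labelAndPreds RDD from the example, and swapping each pair into the (prediction, label) order that MulticlassMetrics expects):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Build metrics over the (prediction, label) pairs
val metrics = new MulticlassMetrics(labelAndPreds.map { case (label, pred) => (pred, label) })
println("Accuracy = " + metrics.accuracy)
println("Confusion matrix:\n" + metrics.confusionMatrix)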
ML model implementation
// scalastyle:off println
package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// $example off$
import org.apache.spark.sql.SparkSession

object RandomForestClassifierExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("RandomForestClassifierExample")
      .getOrCreate()

    // $example on$
    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)
    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a RandomForest model.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(10)

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain indexers and forest in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test Error = " + (1.0 - accuracy))

    val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
    println("Learned classification forest model:\n" + rfModel.toDebugString)
    // $example off$

    spark.stop()
  }
}
// scalastyle:on println
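Because the whole workflow is packaged as a single Pipeline estimator, hyperparameter tuning can be layered on top of it without touching the individual stages. A minimal sketch (reusing pipeline, rf, evaluator, and trainingData from the example; the grid values are arbitrary choices for illustration):

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Every combination of forest size and depth is trained and scored
// with 3-fold cross-validation.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(10, 50))
  .addGrid(rf.maxDepth, Array(4, 8))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// cvModel wraps the best PipelineModel found during the search
val cvModel = cv.fit(trainingData)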
TIPS:
Want to see all of the sample code behind http://spark.apache.org/docs? One way is to browse the GitHub repository; another is to go into the Spark installation directory, where all of the example source code lives under spark/examples/src/main/scala/.
For example, the Scala implementations of the ML algorithms are in:
spark/examples/src/main/scala/org/apache/spark/examples/ml
and the Scala implementations of the MLlib algorithms are in:
spark/examples/src/main/scala/org/apache/spark/examples/mllib