Yesterday I used Spark MLlib's naive Bayes to do handwritten digit recognition and got an accuracy of about 0.83. Today I used RandomForest to train the model and tuned its parameters.
First, some of the parameters used to train a RandomForest classifier:
- numTrees: the number of trees in the random forest. Increasing this value reduces the variance of the predictions and can improve test accuracy, but training time grows roughly linearly with it.
- maxDepth: the maximum depth of each tree in the forest. Increasing this value gives a more expressive and powerful model, but makes training more time-consuming and the model easier to overfit.
In this training session I simply adjusted these two parameters repeatedly to improve the prediction accuracy. Start by setting initial values for the parameters:
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 32
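The data passed to the trainer below is assumed to be Kaggle's train.csv parsed into an RDD[LabeledPoint]; the post doesn't show that step, so here is a minimal sketch (the file path and header handling are my assumptions):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// train.csv: first column is the digit label, the remaining 784 columns are pixel values
val data = sc.textFile("train.csv")
  .filter(!_.startsWith("label"))  // drop the CSV header row
  .map { line =>
    val values = line.split(',').map(_.toDouble)
    LabeledPoint(values.head, Vectors.dense(values.tail))
  }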
For the first run, I set the number of trees to 3 and the depth of each tree to 4. Now start training the model:
import org.apache.spark.mllib.tree.RandomForest

val randomForestModel = RandomForest.trainClassifier(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
I compute the accuracy on the training data the same way I evaluated the naive Bayes model:
val nbTotalCorrect = data.map { point =>
  if (randomForestModel.predict(point.features) == point.label) 1 else 0
}.sum
val numData = data.count()
println(numData) // 42000
val nbAccuracy = nbTotalCorrect / numData
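The same number can also be obtained with MLlib's built-in MulticlassMetrics helper; a small sketch, not in the original post:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// pair each prediction with its true label, then let MulticlassMetrics aggregate
val predictionAndLabels = data.map { point =>
  (randomForestModel.predict(point.features), point.label)
}
val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.precision) // overall accuracy; newer Spark versions call this metrics.accuracy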
Below are the training accuracies after each adjustment of the two parameters:
// numTrees=3,  maxDepth=4,  accuracy: 0.5507619047619048
// numTrees=4,  maxDepth=5,  accuracy: 0.7023095238095238
// numTrees=5,  maxDepth=6,  accuracy: 0.693595238095238
// numTrees=6,  maxDepth=7,  accuracy: 0.8426428571428571
// numTrees=7,  maxDepth=8,  accuracy: 0.879452380952381
// numTrees=8,  maxDepth=9,  accuracy: 0.9105714285714286
// numTrees=9,  maxDepth=10, accuracy: 0.9446428571428571
// numTrees=10, maxDepth=11, accuracy: 0.9611428571428572
// numTrees=11, maxDepth=12, accuracy: 0.9765952380952381
// numTrees=12, maxDepth=13, accuracy: 0.9859523809523809
// numTrees=13, maxDepth=14, accuracy: 0.9928333333333333
// numTrees=14, maxDepth=15, accuracy: 0.9955
// numTrees=15, maxDepth=16, accuracy: 0.9972857142857143
// numTrees=16, maxDepth=17, accuracy: 0.9979285714285714
// numTrees=17, maxDepth=18, accuracy: 0.9983809523809524
// numTrees=18, maxDepth=19, accuracy: 0.9989285714285714
// numTrees=19, maxDepth=20, accuracy: 0.9989523809523809
// numTrees=20, maxDepth=21, accuracy: 0.999
// numTrees=21, maxDepth=22, accuracy: 0.9994761904761905
// numTrees=22, maxDepth=23, accuracy: 0.9994761904761905
// numTrees=23, maxDepth=24, accuracy: 0.9997619047619047
// numTrees=24, maxDepth=25, accuracy: 0.9997857142857143
// numTrees=25, maxDepth=26, accuracy: 0.9998333333333334
// numTrees=29, maxDepth=30, accuracy: 0.9999523809523809
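The post doesn't show the sweep itself; a hedged sketch of a loop that would produce the table above (the numTrees = maxDepth - 1 pairing simply mirrors the values listed):

for (n <- 3 to 29) {
  val model = RandomForest.trainClassifier(data, numClasses, categoricalFeaturesInfo,
    n, featureSubsetStrategy, impurity, n + 1, maxBins)
  // training-set accuracy for this (numTrees, maxDepth) pair
  val correct = data.map { point =>
    if (model.predict(point.features) == point.label) 1 else 0
  }.sum
  println(s"numTrees=$n, maxDepth=${n + 1}, accuracy: ${correct / numData}")
}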
It can be seen that from around numTrees=11, maxDepth=12 the training accuracy starts to converge toward 0.999. This is much higher than the accuracy (0.826) obtained with naive Bayes last time. Now I make predictions on the test data, using numTrees=29, maxDepth=30:
val predictions = randomForestModel.predict(features).map { p => p.toInt }
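The post doesn't show where the features RDD comes from or how the submission file is produced; a self-contained sketch under my assumptions (header detection, ImageId numbering, and the output path are all guesses):

import org.apache.spark.mllib.linalg.Vectors

// test.csv has no label column; its header row starts with "pixel0"
val features = sc.textFile("test.csv")
  .filter(!_.startsWith("pixel"))
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

val predictions = randomForestModel.predict(features).map(_.toInt)

// Kaggle Digit Recognizer submissions use the format "ImageId,Label" with 1-based ids
val submission = predictions.zipWithIndex().map { case (label, id) =>
  s"${id + 1},$label"
}
submission.saveAsTextFile("submission")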
Uploading these predictions to Kaggle gave an accuracy of 0.95929. After four more rounds of parameter tuning, the best accuracy I reached was 0.96586, with numTrees=55, maxDepth=30. When I changed the parameters to numTrees=70, maxDepth=30, the accuracy dropped to 0.96271, so it seems the model has started to overfit. Still, going from yesterday's 0.83 to 0.96 is very exciting. I will keep trying other methods for handwritten digit recognition; who knows when it will reach 1.