Handwritten digit recognition using Spark MLlib's RandomForest on the Kaggle handwritten digit dataset

Tags: spark, mllib

Yesterday I used Spark MLlib's naive Bayes for handwritten digit recognition and got an accuracy of about 0.83. Today I trained the model with RandomForest and tuned its parameters.

First, some of the parameters RandomForest uses to train a classifier:

    • numTrees: the number of trees in the random forest. Increasing it reduces the variance of the predictions and can improve test accuracy, but training time grows linearly with it.
    • maxDepth: the maximum depth of each tree in the forest. Increasing it gives a more expressive, more powerful model, but training takes longer and the model becomes more prone to overfitting.

In this training session I simply adjusted the two parameters above, over and over, to improve prediction accuracy. Start by setting initial values for the parameters:

    val numClasses = 10
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 3
    val featureSubsetStrategy = "auto"
    val impurity = "gini"
    val maxDepth = 4
    val maxBins = 32
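The training RDD data used below is never defined in the post; here is a minimal sketch of parsing it from Kaggle's train.csv, with the file path and the spark-shell context sc as my assumptions:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val raw = sc.textFile("train.csv")   // path assumed
    val header = raw.first()
    val data = raw
      .filter(_ != header)               // drop the CSV header row
      .map { line =>
        val values = line.split(',').map(_.toDouble)
        // first column is the digit label, the remaining 784 are pixel values
        LabeledPoint(values.head, Vectors.dense(values.tail))
      }
      .cache()                           // reused across many training runs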

For the first run, I set the number of trees to 3 and the depth of each tree to 4. The following trains the model:

    import org.apache.spark.mllib.tree.RandomForest

    val randomForestModel = RandomForest.trainClassifier(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

I compute the accuracy on the training data the same way I evaluated naive Bayes:

    val nbTotalCorrect = data.map { point =>
      if (randomForestModel.predict(point.features) == point.label) 1 else 0
    }.sum
    val numData = data.count()
    println(numData) // 42000
    val nbAccuracy = nbTotalCorrect / numData
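Note that this accuracy is measured on the same data the forest was trained on, so it is optimistic; that is also why it can approach 1.0 below while the Kaggle score stays lower. For a more honest estimate, a sketch using a held-out split (the 80/20 ratio and the seed are my assumptions, not from the post):

    // Hold out 20% of the training data for validation.
    val Array(trainSet, validSet) = data.randomSplit(Array(0.8, 0.2), seed = 42)
    val heldOutModel = RandomForest.trainClassifier(trainSet, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    val validCorrect = validSet.map { point =>
      if (heldOutModel.predict(point.features) == point.label) 1 else 0
    }.sum
    println(validCorrect / validSet.count()) // held-out accuracy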

Here is the training accuracy after each adjustment of the two parameters:

    // numTrees=3,  maxDepth=4,  accuracy: 0.5507619047619048
    // numTrees=4,  maxDepth=5,  accuracy: 0.7023095238095238
    // numTrees=5,  maxDepth=6,  accuracy: 0.693595238095238
    // numTrees=6,  maxDepth=7,  accuracy: 0.8426428571428571
    // numTrees=7,  maxDepth=8,  accuracy: 0.879452380952381
    // numTrees=8,  maxDepth=9,  accuracy: 0.9105714285714286
    // numTrees=9,  maxDepth=10, accuracy: 0.9446428571428571
    // numTrees=10, maxDepth=11, accuracy: 0.9611428571428572
    // numTrees=11, maxDepth=12, accuracy: 0.9765952380952381
    // numTrees=12, maxDepth=13, accuracy: 0.9859523809523809
    // numTrees=13, maxDepth=14, accuracy: 0.9928333333333333
    // numTrees=14, maxDepth=15, accuracy: 0.9955
    // numTrees=15, maxDepth=16, accuracy: 0.9972857142857143
    // numTrees=16, maxDepth=17, accuracy: 0.9979285714285714
    // numTrees=17, maxDepth=18, accuracy: 0.9983809523809524
    // numTrees=18, maxDepth=19, accuracy: 0.9989285714285714
    // numTrees=19, maxDepth=20, accuracy: 0.9989523809523809
    // numTrees=20, maxDepth=21, accuracy: 0.999
    // numTrees=21, maxDepth=22, accuracy: 0.9994761904761905
    // numTrees=22, maxDepth=23, accuracy: 0.9994761904761905
    // numTrees=23, maxDepth=24, accuracy: 0.9997619047619047
    // numTrees=24, maxDepth=25, accuracy: 0.9997857142857143
    // numTrees=25, maxDepth=26, accuracy: 0.9998333333333334
    // numTrees=29, maxDepth=30, accuracy: 0.9999523809523809
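Every run in this sweep uses maxDepth = numTrees + 1, so rather than editing the two values by hand each time, the sweep can be scripted; a sketch, reusing the setup values above:

    // Train one forest per (numTrees, maxDepth) pair and print its training accuracy.
    for (n <- 3 to 29; d = n + 1) {
      val m = RandomForest.trainClassifier(data, numClasses, categoricalFeaturesInfo, n, featureSubsetStrategy, impurity, d, maxBins)
      val correct = data.map { p =>
        if (m.predict(p.features) == p.label) 1 else 0
      }.sum
      println(s"numTrees=$n, maxDepth=$d, accuracy=${correct / data.count()}")
    }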

It can be seen that the training accuracy starts to flatten out around numTrees=11, maxDepth=12 and converges toward 0.999. This is much higher than the 0.826 training accuracy from yesterday's naive Bayes model. Now I make predictions on the test data, using numTrees=29, maxDepth=30:

val predictions = randomForestModel.predict(features).map { p => p.toInt }
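Here features is the RDD of test-set pixel vectors. The post doesn't show how it is built or how the results are written out, so here is a self-contained sketch of the whole step, from parsing test.csv (the same 784 pixel columns as train.csv, but no label column) to writing the submission file; the ImageId,Label header follows the competition's sample submission, and the file paths are my assumptions:

    import java.io.PrintWriter
    import org.apache.spark.mllib.linalg.Vectors

    val rawTest = sc.textFile("test.csv")
    val testHeader = rawTest.first()
    val features = rawTest
      .filter(_ != testHeader)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    val predictions = randomForestModel.predict(features).map(_.toInt)

    // ImageId is 1-based and follows the test-file row order; zipWithIndex
    // preserves that order because only map and filter were applied upstream.
    val rows = predictions.zipWithIndex().map { case (label, i) => s"${i + 1},$label" }

    // The test set is small enough to collect on the driver and write locally.
    val out = new PrintWriter("submission.csv")
    out.println("ImageId,Label")
    rows.collect().foreach(r => out.println(r))
    out.close()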

I uploaded the predictions to Kaggle and scored 0.95929. After four more rounds of parameter tuning, my best score was 0.96586, with numTrees=55, maxDepth=30. When I changed the parameters to numTrees=70, maxDepth=30, the score dropped to 0.96271, so it seems the model has started to overfit. Still, raising the accuracy from yesterday's 0.83 to 0.96 is very exciting. I will keep trying other methods for handwritten digit recognition; who knows when I will reach 1.0.
