Handwritten digit recognition on the Kaggle digit dataset using the naive Bayes model in Spark MLlib

Source: Internet
Author: User
Tags spark mllib

Yesterday I downloaded a handwritten digit recognition dataset from Kaggle, intending to train a digit recognition model with some of the methods I have been learning recently. The data comes from 28x28-pixel grayscale images of handwritten digits. In each training row, the first element is the digit that was written (the label), and the remaining 784 elements are the grayscale values of the image's pixels, each in the range [0, 255]. The test rows do not have that first element; they contain only the 784 grayscale values. I'm going to train the model using the naive Bayes algorithm provided in Spark MLlib.
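To make the row layout concrete, here is a minimal plain-Scala sketch that splits one header-less training row into its label and pixel values. The helper name `parseTrainingRow` and the synthetic sample row are assumptions for illustration only; they are not part of the original code or the real dataset.

```scala
// Hypothetical helper: splits one header-less training row into
// (label, pixels), assuming 1 label column followed by 784 pixel columns.
def parseTrainingRow(line: String): (Int, Array[Double]) = {
  val fields = line.split(",")
  require(fields.length == 785, s"expected 785 fields, got ${fields.length}")
  val label = fields.head.toInt            // first column: the digit
  val pixels = fields.tail.map(_.toDouble) // remaining columns: grayscale values
  (label, pixels)
}

// Synthetic row for illustration: label 7 followed by 784 zero-valued pixels.
val sampleRow = "7," + Seq.fill(784)("0").mkString(",")
val (sampleLabel, samplePixels) = parseTrainingRow(sampleRow)
println(s"label=$sampleLabel, pixel count=${samplePixels.length}")
```

The `require` check is a cheap way to catch a header row or a truncated line before it reaches the model.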

Let's start by setting some parameters for the Spark context:

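import org.apache.spark.{SparkConf, SparkContext}

// Configure a local Spark application that uses all available cores
val conf = new SparkConf()
  .setAppName("DigitRecgonizer")
  .setMaster("local[*]")
  .set("spark.driver.memory", "10G")
val sc = new SparkContext(conf)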

With the Spark context created, the next step is to read the training data. I removed the header row from the original training file, keeping only the data rows; the training data is stored in CSV format:

val rawData = sc.textFile("file://path/train-noheader.csv")

Since the data is in CSV format, split each row on "," to turn it into an array:

val records = rawData.map(line => line.split(","))

These arrays are then converted into the data type that naive Bayes accepts, LabeledPoint. It takes two parameters: the first is the label (here, the digit that was written) and the second is the features (the feature vector, here the 784 grayscale values):

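    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val records = rawData.map(line => line.split(","))
    val data = records.map { r =>
      val label = r(0).toInt                                  // first column: the digit
      val features = r.slice(1, r.size).map(p => p.toDouble)  // remaining 784 columns: pixels
      LabeledPoint(label, Vectors.dense(features))
    }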

Now that the data is ready, you can start training the model. In MLlib, a single call to the train method completes the training:
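The training call itself does not appear in the text as it stands. A minimal sketch of what it plausibly looks like, using MLlib's `NaiveBayes.train`: the wrapper name `trainDigitModel` is hypothetical, and the smoothing value 1.0 is an assumption (it matches MLlib's default, but the original setting is not known).

```scala
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical wrapper around MLlib's multinomial naive Bayes trainer.
// The second argument is the additive (Laplace) smoothing parameter;
// 1.0 is assumed here because the original value is not shown.
def trainDigitModel(data: RDD[LabeledPoint]): NaiveBayesModel =
  NaiveBayes.train(data, 1.0)
```

With the `data` RDD built above, `val nbModel = trainDigitModel(data)` would produce the `nbModel` used in the rest of this article. Multinomial naive Bayes in MLlib requires non-negative feature values, which the [0, 255] pixel range satisfies.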


Now that a model has been trained, let's see how accurate it is on the training data set. I feed the features of the training data to the model, compare the predictions with the true labels, and count the number of correct predictions to assess the model's accuracy. Note that this measures accuracy on the same data the model was trained on, so it is a rough sanity check rather than a true cross-validation:

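    // Count the training rows whose predicted digit matches the true label
    val nbTotalCorrect = data.map { point =>
      if (nbModel.predict(point.features) == point.label) 1 else 0
    }.sum
    val numData = data.count()
    val nbAccuracy = nbTotalCorrect / numData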

Running this code gives an accuracy of 0.8261190476190476.
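For an estimate on data the model has not seen, one common alternative is a hold-out split. The sketch below is not what was done above; it is a variation under stated assumptions: the helper name `holdOutAccuracy`, the 80/20 split, and the seed are all illustrative choices.

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: trains on a random ~80% of the labeled data and
// returns the fraction of correct predictions on the remaining ~20%.
// The split ratio and seed are assumptions, not values from the article.
def holdOutAccuracy(data: RDD[LabeledPoint], seed: Long = 42L): Double = {
  val Array(trainSet, testSet) = data.randomSplit(Array(0.8, 0.2), seed)
  testSet.cache() // reused by both the filter and the final count
  val model = NaiveBayes.train(trainSet)
  val correct = testSet.filter(p => model.predict(p.features) == p.label).count()
  correct.toDouble / testSet.count()
}
```

Calling `holdOutAccuracy(data)` on the labeled training RDD would give a number that is more directly comparable to a leaderboard score than the training-set accuracy.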

The next step is to classify the test data. First, read it in:

val unlabeledData = sc.textFile("file://path/test-noheader.csv")

It is then preprocessed in the same way as before:

val unlabeledRecords = unlabeledData.map(line => line.split(","))
val features = unlabeledRecords.map { r =>
  val f = r.map(p => p.toDouble)
  Vectors.dense(f)
}

Note that the test data has no label, so every value in a row is a feature.

Now predict the digits for the test data and save the results to a file:

    val predictions = nbModel.predict(features).map(p => p.toInt)
    // repartition(1) collects the output into a single part file
    predictions.repartition(1).saveAsTextFile("file://path/digitRec.txt")

After all the work was done, I uploaded the results to Kaggle and found that the accuracy was about 0.83, close to the figure I had measured earlier on the training data set.
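A note on the upload step: `saveAsTextFile` writes bare predictions, one per line, so some post-processing is needed before uploading. The sketch below assumes the submission format is a CSV with an `ImageId,Label` header and 1-based image ids in test-file order; that format is an assumption that should be checked against the competition's sample submission. `toSubmissionLines` is a hypothetical helper, not part of the original code.

```scala
// Hypothetical helper: turns predictions (in test-file order) into
// submission lines, assuming an "ImageId,Label" header and 1-based ids.
def toSubmissionLines(predictions: Seq[Int]): Seq[String] =
  "ImageId,Label" +: predictions.zipWithIndex.map { case (digit, i) =>
    s"${i + 1},$digit"
  }

// Illustrative input only; real predictions would come from the model.
toSubmissionLines(Seq(2, 0, 9)).foreach(println)
```

On the RDD itself, `zipWithIndex()` offers the same pairing of each prediction with its position, so the formatting could also be done before saving.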

That's all for today. In the future I may try other ways of training the model to see how they perform.
