Yesterday I downloaded a dataset for handwritten digit recognition from Kaggle, and wanted to train a model using some methods I've recently been learning. The data comes from 28x28-pixel grayscale images of handwritten digits: in the training data, the first element of each row is the actual digit, and the remaining 784 elements are the grayscale values of the image's pixels, in the range [0, 255]. The test data lacks that leading label and contains only the 784 grayscale values. Here I'm going to train the model using the naive Bayes algorithm provided by Spark MLlib.
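To make the row layout concrete, here is a small Python sketch (just an illustration, not the Spark code used below) that parses one training row into a label plus a 784-element feature list; the sample row is made up:

```python
def parse_train_row(line):
    """Split one CSV training row into (label, features).

    The first field is the digit label; the remaining 784 fields
    are pixel grayscale values in [0, 255].
    """
    fields = line.split(",")
    label = int(fields[0])
    features = [float(x) for x in fields[1:]]
    return label, features

# A made-up row: label 7 followed by 784 pixel values (all zero here).
row = ",".join(["7"] + ["0"] * 784)
label, features = parse_train_row(row)
print(label, len(features))  # 7 784
```

A test row would be parsed the same way, except there is no leading label field.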
Let's start by setting some parameters for the Spark context:
val conf = new SparkConf()
  .setAppName("DigitRecgonizer")
  .setMaster("local[*]")
  .set("spark.driver.memory", "10G")
val sc = new SparkContext(conf)
Now that the Spark context has been created, it's time to read the training data. I removed the header from the original training file, keeping only the data rows; the training data is saved in CSV format:
val rawData = sc.textFile("file://path/train-noheader.csv")
Since the data is in CSV format, split each row on "," to convert it into an array:
val records = rawData.map(line => line.split(","))
Next, the rows are converted into the data type that naive Bayes accepts, LabeledPoint. This type takes two parameters: the first is the label (here, the actual handwritten digit), and the second is the features (the feature vector; here, the 784 grayscale values):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data = records.map { r =>
  val label = r(0).toInt
  val features = r.slice(1, r.size).map(p => p.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
Now that the data is ready, we can train the model. In MLlib, a single method call completes the training:
import org.apache.spark.mllib.classification.NaiveBayes

val nbModel = NaiveBayes.train(data)
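MLlib's naive Bayes treats the pixel values as feature counts (multinomial naive Bayes). As a rough illustration of what training computes, here is a tiny pure-Python sketch, using made-up two-"pixel" data instead of 784 pixels, that estimates log class priors and Laplace-smoothed log feature probabilities:

```python
import math
from collections import defaultdict

def train_multinomial_nb(data, num_features, smoothing=1.0):
    """data: list of (label, features) pairs; returns (log_priors, log_probs)."""
    class_counts = defaultdict(int)                       # rows per class
    feature_sums = defaultdict(lambda: [0.0] * num_features)
    for label, feats in data:
        class_counts[label] += 1
        for i, v in enumerate(feats):
            feature_sums[label][i] += v
    n = len(data)
    log_priors, log_probs = {}, {}
    for c, count in class_counts.items():
        log_priors[c] = math.log(count / n)
        total = sum(feature_sums[c]) + smoothing * num_features
        log_probs[c] = [math.log((s + smoothing) / total)
                        for s in feature_sums[c]]
    return log_priors, log_probs

def predict(log_priors, log_probs, feats):
    """Pick the class maximizing log P(c) + sum_i x_i * log P(i | c)."""
    return max(log_priors, key=lambda c: log_priors[c]
               + sum(x * lp for x, lp in zip(feats, log_probs[c])))

# Made-up toy data: class 0 rows have high first-"pixel" values,
# class 1 rows have high second-"pixel" values.
toy = [(0, [9.0, 1.0]), (0, [8.0, 2.0]), (1, [1.0, 9.0]), (1, [2.0, 8.0])]
priors, probs = train_multinomial_nb(toy, num_features=2)
print(predict(priors, probs, [10.0, 0.0]))  # 0
```

The `smoothing` parameter corresponds to the Laplace smoothing (lambda) that MLlib's NaiveBayes also supports; this sketch is only meant to show the shape of the computation, not to reproduce MLlib's implementation.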
Now that a model has been trained, let's see how accurate it is on the training data set. I feed the training set's features into the model, compare the predicted results with the true labels, and count the number of correct predictions to estimate the model's accuracy. (Since this evaluates on the training data itself, it is only a rough sanity check rather than a true cross-validation.)
val nbTotalCorrect = data.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
val numData = data.count()
val nbAccuracy = nbTotalCorrect / numData
After running this code, I get an accuracy of 0.8261190476190476.
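The accuracy computation above is simply (number of correct predictions) / (total number of rows); a minimal Python sketch with made-up predictions:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Made-up example: 4 of the 5 predictions match the labels.
print(accuracy([1, 2, 3, 4, 9], [1, 2, 3, 4, 5]))  # 0.8
```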
Now to recognize the test data; first, read it in:
val unlabeledData = sc.textFile("file://path/test-noheader.csv")
It is then preprocessed in the same way as before:
val unlabeledRecords = unlabeledData.map(line => line.split(","))
val features = unlabeledRecords.map { r =>
  val f = r.map(p => p.toDouble)
  Vectors.dense(f)
}
Note that the test data has no label, so all of its values are features.
Now start identifying the test data and saving the results as a file:
val predictions = nbModel.predict(features).map(p => p.toInt)
predictions.repartition(1).saveAsTextFile("file://path/digitRec.txt")
After all the work was done, I uploaded the results to Kaggle and found that the accuracy was about 0.83, similar to my earlier estimate on the training data set.
That's all for today; I may try other ways to train the model in the future and see how they perform.