Yesterday I downloaded a dataset for handwritten digit recognition from Kaggle, and wanted to train a model using some methods I've recently been learning. The data comes from 28x28-pixel grayscale images of handwritten digits: in the training data, the first element of each row is the actual digit, and the remaining 784 elements are the grayscale values of the image's pixels, in the range [0, 255]. The test data lacks that leading label and contains only the 784 grayscale values. Here I'm going to train the model using the naive Bayes algorithm provided by Spark MLlib.
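To make the row layout concrete, here is a small Python sketch (just an illustration, not the Spark code used below) that parses one training row into a label plus a 784-element feature list; the sample row is made up:

```python
def parse_train_row(line):
    """Split one CSV training row into (label, features).

    The first field is the digit label; the remaining 784 fields
    are pixel grayscale values in [0, 255].
    """
    fields = line.split(",")
    label = int(fields[0])
    features = [float(x) for x in fields[1:]]
    return label, features

# A made-up row: label 7 followed by 784 pixel values (all zero here).
row = ",".join(["7"] + ["0"] * 784)
label, features = parse_train_row(row)
print(label, len(features))  # 7 784
```

A test row would be parsed the same way, except there is no leading label field.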
Let's start by setting some parameters for the Spark context:
val conf = new SparkConf()
  .setAppName("DigitRecgonizer")
  .setMaster("local[*]")
  .set("spark.driver.memory", "10G")
val sc = new SparkContext(conf)
Now that the Spark context has been created, it's time to read the training data. I removed the header from the original training file, keeping only the data rows; the training data is saved in CSV format:
val rawData = sc.textFile("file://path/train-noheader.csv")
Since the data is in CSV format, split each row on "," to convert it into an array:
val records = rawData.map(line => line.split(","))
Next, the rows are converted into the data type that naive Bayes accepts, LabeledPoint. This type takes two parameters: the first is the label (here, the actual handwritten digit), and the second is the features (the feature vector; here, the 784 grayscale values):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data = records.map { r =>
  val label = r(0).toInt
  val features = r.slice(1, r.size).map(p => p.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
Now that the data is ready, we can train the model. In MLlib, a single method call completes the training:
import org.apache.spark.mllib.classification.NaiveBayes

val nbModel = NaiveBayes.train(data)
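MLlib's naive Bayes treats the pixel values as feature counts (multinomial naive Bayes). As a rough illustration of what training computes, here is a tiny pure-Python sketch, using made-up two-"pixel" data instead of 784 pixels, that estimates log class priors and Laplace-smoothed log feature probabilities:

```python
import math
from collections import defaultdict

def train_multinomial_nb(data, num_features, smoothing=1.0):
    """data: list of (label, features) pairs; returns (log_priors, log_probs)."""
    class_counts = defaultdict(int)                       # rows per class
    feature_sums = defaultdict(lambda: [0.0] * num_features)
    for label, feats in data:
        class_counts[label] += 1
        for i, v in enumerate(feats):
            feature_sums[label][i] += v
    n = len(data)
    log_priors, log_probs = {}, {}
    for c, count in class_counts.items():
        log_priors[c] = math.log(count / n)
        total = sum(feature_sums[c]) + smoothing * num_features
        log_probs[c] = [math.log((s + smoothing) / total)
                        for s in feature_sums[c]]
    return log_priors, log_probs

def predict(log_priors, log_probs, feats):
    """Pick the class maximizing log P(c) + sum_i x_i * log P(i | c)."""
    return max(log_priors, key=lambda c: log_priors[c]
               + sum(x * lp for x, lp in zip(feats, log_probs[c])))

# Made-up toy data: class 0 rows have high first-"pixel" values,
# class 1 rows have high second-"pixel" values.
toy = [(0, [9.0, 1.0]), (0, [8.0, 2.0]), (1, [1.0, 9.0]), (1, [2.0, 8.0])]
priors, probs = train_multinomial_nb(toy, num_features=2)
print(predict(priors, probs, [10.0, 0.0]))  # 0
```

The `smoothing` parameter corresponds to the Laplace smoothing (lambda) that MLlib's NaiveBayes also supports; this sketch is only meant to show the shape of the computation, not to reproduce MLlib's implementation.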
Now that a model has been trained, let's see how accurate it is on the training data set. I feed the training set's features into the model, compare the predicted results with the true labels, and count the number of correct predictions to estimate the model's accuracy. (Since this evaluates on the training data itself, it is only a rough sanity check rather than a true cross-validation.)
val nbTotalCorrect = data.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
val numData = data.count()
val nbAccuracy = nbTotalCorrect / numData
After running this code, I get an accuracy of 0.8261190476190476.
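The accuracy computation above is simply (number of correct predictions) / (total number of rows); a minimal Python sketch with made-up predictions:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Made-up example: 4 of the 5 predictions match the labels.
print(accuracy([1, 2, 3, 4, 9], [1, 2, 3, 4, 5]))  # 0.8
```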
Now to recognize the test data; first, read it in:
val unlabeledData = sc.textFile("file://path/test-noheader.csv")
It is then preprocessed in the same way as before:
val unlabeledRecords = unlabeledData.map(line => line.split(","))
val features = unlabeledRecords.map { r =>
  val f = r.map(p => p.toDouble)
  Vectors.dense(f)
}
Note that the test data has no label, so all of its values are features.
Now start identifying the test data and saving the results as a file:
val predictions = nbModel.predict(features).map(p => p.toInt)
predictions.repartition(1).saveAsTextFile("file://path/digitRec.txt")
After all the work was done, I uploaded the results to Kaggle and found that the accuracy was about 0.83, similar to my earlier estimate on the training data set.
That's all for today; I may try other ways to train the model in the future and see how they perform.