Classify handwritten digits using the famous MNIST data
This competition was the first in a series of tutorial competitions designed to introduce people to machine learning.
The goal in this competition is to take an image of a handwritten single digit and determine what that digit is. As the competition progresses, we will release tutorials which explain different machine learning algorithms and help you to get started.
The data for this competition were taken from the MNIST dataset. The MNIST ("Modified National Institute of Standards and Technology") dataset is a classic within the machine learning community that has been extensively studied. More detail about the dataset, including machine learning algorithms that have been tried on it and their levels of success, can be found at http://yann.lecun.com/exdb/mnist/index.html.
Competition link: http://www.kaggle.com/c/digit-recognizer
Handwritten digit recognition
Data description: http://www.kaggle.com/c/digit-recognizer/data
Each picture is 28 pixels high and 28 pixels wide, and each pixel is represented by a number between 0 and 255, so each picture is represented by 28x28 = 784 numbers. The training data contains a label column followed by 784 columns of pixel values. The test data has no label column. Objective: train a model on the training data and use it to predict the label of each row in the test data.
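The layout described above can be checked on a made-up row. The sketch below assumes the row-major convention given on the competition's data page, where column pixelK sits at row K // 28 and column K % 28; the array contents are synthetic and not taken from train.csv.

```python
import numpy as np

SIDE = 28  # images are 28 x 28 pixels

# Synthetic stand-in for one row of pixel values (pixel0 .. pixel783).
flat = np.zeros(SIDE * SIDE, dtype=np.uint8)
flat[5 * SIDE + 3] = 255  # light up the pixel at row 5, column 3

# Row-major reshape: pixelK lands at (K // 28, K % 28).
img = flat.reshape(SIDE, SIDE)

k = 5 * SIDE + 3
row, col = divmod(k, SIDE)
print(img.shape, row, col, img[row, col])
```

The same `reshape` call is what turns a row of the training table back into a picture later in this post.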
The following restores the pictures from the pixel values, using an IPython notebook:
In [1]:
pwd
C:\Users\zhaohf\Desktop
In [5]:
cd ../../../workspace/kaggle/DigitRecognizer/Data/
C:\workspace\kaggle\DigitRecognizer\Data
In [6]:
ls
 Volume in drive C is OS
 Volume Serial Number is 6C93-0DF3

 Directory of C:\workspace\kaggle\DigitRecognizer\Data

2015/01/15  16:04    <DIR>          .
2015/01/15  16:04    <DIR>          ..
2014/12/28  15:06           240,909 rf_benchmark.csv
2015/01/15  16:04        51,118,294 test.csv
2014/12/28  15:06        51,118,296 test.csv.bak
2014/12/28  15:06        76,775,041 train.csv
               4 File(s)    179,252,540 bytes
               2 Dir(s)  105,536,135,168 bytes free
In [7]:
import pandas as pd
df = pd.read_csv('train.csv', header=0).head()  # keep only the first 5 rows
In [8]:
df
Out[8]:
|   | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows x 785 columns
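Calling `.head()` after `read_csv` still parses the whole 76 MB file before discarding most of it. A possible alternative, sketched here on a small in-memory CSV (the sample data is made up and has only three pixel columns), is the `nrows` parameter of `pandas.read_csv`, which stops after the requested number of data rows.

```python
import io

import pandas as pd

# Made-up CSV with the same kind of header as train.csv, but only 3 pixel columns.
csv_text = "label,pixel0,pixel1,pixel2\n" + "\n".join(
    f"{i % 10},0,{i},255" for i in range(100)
)

# nrows=5 parses just the first five data rows instead of the whole file.
df_head = pd.read_csv(io.StringIO(csv_text), header=0, nrows=5)
print(df_head.shape)
```

With the real file, the `io.StringIO(...)` argument would simply be replaced by the path `'train.csv'`.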
In [9]:
df['label']
Out[9]:
0    0
Name: label, dtype: int64
In [14]:
df = df.loc[:, 'pixel0':]  # drop the label column (.loc replaces the removed .ix indexer)
In [15]:
df
Out[15]:
|   | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows x 784 columns
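Because the cell above overwrites `df`, the labels are no longer available afterwards. One way to keep both, shown on a tiny made-up frame with four pixel columns instead of 784, is to store them in separate variables; `labels`, `pixels` and `images` are names chosen for this illustration only.

```python
import pandas as pd

# Made-up frame: 2 x 2 "images" instead of 28 x 28.
df_small = pd.DataFrame(
    {
        "label": [1, 0, 4],
        "pixel0": [0, 10, 0],
        "pixel1": [255, 0, 0],
        "pixel2": [0, 0, 7],
        "pixel3": [0, 20, 0],
    }
)

labels = df_small["label"]               # target column
pixels = df_small.drop(columns="label")  # everything except the label

# One 2 x 2 array per row; with the real data the shape would be (n, 28, 28).
images = pixels.to_numpy().reshape(-1, 2, 2)
print(labels.tolist(), pixels.shape, images.shape)
```

Keeping `labels` around makes it possible to title each plotted digit with its true class.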
In [21]:
%matplotlib inline
import matplotlib.pyplot as plt
for i in range(df.shape[0]):
    img = df.iloc[i].values.reshape((28, 28))  # 784 values back to a 28x28 image
    plt.subplot(2, 5, i + 1)
    plt.imshow(img)
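By default `imshow` applies a colour map, so the digits appear tinted. A variation that might be easier to read, assuming a grayscale colour map and hidden axes are wanted, is sketched below with random arrays standing in for the five images. The `Agg` backend line is only there so the script also runs without a display; in a notebook it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Random 28 x 28 arrays standing in for the five digit images.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28))

fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for i, ax in enumerate(axes):
    # Fix the value range so 0 is always black and 255 always white.
    ax.imshow(images[i], cmap="gray", vmin=0, vmax=255)
    ax.set_title(f"row {i}")
    ax.axis("off")
```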
The following uses a random forest for training and prediction:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from numpy import savetxt, loadtxt

# Load the training data; the first column is the label, the rest are pixels.
train = loadtxt('../data/train.csv', delimiter=',', skiprows=1)
X_train = np.array([x[1:] for x in train])
print(X_train.shape)
y_train = np.array([x[0] for x in train])
print(y_train.shape)

# The test data has pixel columns only.
X_test = loadtxt('../data/test.csv', delimiter=',', skiprows=1)
print(X_test.shape)

print('training...')
rf = RandomForestClassifier(n_estimators=100)
rf_model = rf.fit(X_train, y_train)

print('predicting...')
# ImageId starts at 1, so shift the enumerate index by one.
pred = [[index + 1, x] for index, x in enumerate(rf_model.predict(X_test))]

savetxt('../submissions/myrf_benchmark.csv', pred, delimiter=',',
        fmt='%d,%d', header='ImageId,Label', comments='')
print('done.')
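The script above goes straight from training to a submission file, so the first accuracy figure only arrives from the leaderboard. A local estimate can be had by holding out part of the labelled data. The sketch below uses scikit-learn's small bundled `load_digits` set (8x8 images, not the Kaggle files) purely so that it runs anywhere; the 80/20 split and `random_state=0` are arbitrary choices for the illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small bundled dataset used as a stand-in for train.csv.
X, y = load_digits(return_X_y=True)

# Hold out 20% of the labelled rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_fit, y_fit)

val_accuracy = accuracy_score(y_val, rf.predict(X_val))
print(f"validation accuracy: {val_accuracy:.3f}")
```

With the Kaggle data, `X` and `y` would be the pixel and label arrays loaded from train.csv, and the number printed here gives a rough idea of what to expect from the submission.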
First Submission Results:
Kaggle competition problem: Digit Recognizer