Getting Started with Kaggle: Using Scikit-learn to Solve the Digit Recognition Problem



@author: Wepon

@blog: http://blog.csdn.net/u012162613


1. A brief introduction to Scikit-learn

Scikit-learn is an open-source machine learning toolkit based on NumPy, SciPy, and Matplotlib, written in Python. It mainly covers classification, regression, and clustering, with implementations of KNN, SVM, logistic regression, naive Bayes, random forest, K-means, and many other algorithms, and nearly every algorithm comes with example code and documentation on the official site. For machine learning developers it is a convenient and powerful tool that saves a great deal of development time.


Scikit-learn official user guide: http://scikit-learn.org/stable/user_guide.html



In my previous article, "Big Data Competition Platform: Kaggle", the second part recorded the entire process of solving the Digit Recognition competition on Kaggle, for which I implemented the KNN algorithm by hand. Although writing KNN yourself does not take much time, if we want to try many other, more complex algorithms, implementing each one ourselves would be a waste of time. This is where Scikit-learn comes into play: we can simply call its algorithm packages.

Of course, if you have just started learning, it is still worth understanding the algorithms behind the packages you call; if you have time, fully implementing an algorithm yourself will give you a much deeper mastery of it.

OK, back to the topic. Let's move on to the second part.

2. Using Scikit-learn to solve Digit Recognition

I find myself very fond of practicing classification algorithms on the Digit Recognition problem, because it is simple enough. If you are not familiar with the problem, take a quick look at it on Kaggle: Digit Recognizer. There is also a description in my previous article, "Big Data Competition Platform: Kaggle". Below I use the KNN (k-nearest neighbors), SVM (support vector machine), and NB (naive Bayes) packages in Scikit-learn to solve the problem. There are two key steps: 1. process the data; 2. call the algorithms.


(1) Processing the data

This part is the same as the data processing in the second part of my previous article, "Big Data Competition Platform: Kaggle", so I will not repeat it here. Below I simply list the functions and what they do; the full code is given in the final section of this article.


def loadTrainData():
    # get the training samples from train.csv: trainData, trainLabel

def loadTestData():
    # get the test samples from test.csv: testData

def toInt(array):
def nomalizing(array):
    # these two functions are called inside loadTrainData() and loadTestData()
    # toInt() converts a string array to integers; nomalizing() normalizes the integers

def loadTestResult():
    # load the reference labels of the test samples, for comparison later

def saveResult(result, csvName):
    # save result as a csv file named csvName


In the data processing part, we obtain the features of the training samples, the labels of the training samples, and the features of the test samples from the train.csv and test.csv files; in the program these are called trainData, trainLabel, and testData.
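As an aside, the element-by-element loops in toInt() and nomalizing() are slow on a 42000 x 784 array. Here is a minimal sketch of a faster loader, assuming pandas is installed and the same train.csv/test.csv files; this is an alternative I am adding for illustration, not the author's original code:

import pandas as pd

def loadDataFast():
    # pandas parses the numeric columns directly, no string-conversion loop needed
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    trainLabel = train['label'].values                                 # shape (42000,)
    trainData = (train.drop('label', axis=1).values > 0).astype(int)   # binarize pixels to 0/1
    testData = (test.values > 0).astype(int)
    return trainData, trainLabel, testData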


(2) Calling the algorithms in Scikit-learn

KNN algorithm

# call scikit's KNN algorithm package
from sklearn.neighbors import KNeighborsClassifier

def knnClassify(trainData, trainLabel, testData):
    knnClf = KNeighborsClassifier()  # default: k = 5; define it yourself: KNeighborsClassifier(n_neighbors=10)
    knnClf.fit(trainData, ravel(trainLabel))
    testLabel = knnClf.predict(testData)
    saveResult(testLabel, 'sklearn_knn_Result.csv')
    return testLabel

The KNN algorithm package lets you set the parameter k yourself; the default is k=5, as the comment above describes.

For more detailed usage, see the official documentation: http://scikit-learn.org/stable/modules/neighbors.html
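If you would rather choose k empirically than guess, here is a minimal sketch using cross_val_score; note that sklearn.model_selection is the modern module name, which postdates the original 2014 code:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pickK(trainData, trainLabel, candidates=(1, 3, 5, 10)):
    # score each candidate k with 3-fold cross-validation on the training set
    for k in candidates:
        clf = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(clf, trainData, trainLabel, cv=3)
        print(k, scores.mean())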




SVM algorithm

# call scikit's SVM algorithm package
from sklearn import svm

def svcClassify(trainData, trainLabel, testData):
    svcClf = svm.SVC(C=5.0)  # default: C=1.0, kernel='rbf'. You can try kernel: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
    svcClf.fit(trainData, ravel(trainLabel))
    testLabel = svcClf.predict(testData)
    saveResult(testLabel, 'sklearn_SVC_C=5.0_Result.csv')
    return testLabel

SVC() has quite a few parameters. The kernel function defaults to 'rbf' (radial basis function), and C defaults to 1.0.

For more detailed usage, see the official documentation: http://scikit-learn.org/stable/modules/svm.html
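The results section below also reports a run with a linear kernel. Here is a minimal sketch of that variant, reusing the same helper functions; this is my reconstruction, since the author's exact linear-kernel code is not shown in the post:

def svcLinearClassify(trainData, trainLabel, testData):
    svcClf = svm.SVC(kernel='linear')  # linear kernel instead of the default 'rbf'
    svcClf.fit(trainData, ravel(trainLabel))
    testLabel = svcClf.predict(testData)
    saveResult(testLabel, 'sklearn_SVC_linear_Result.csv')
    return testLabel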



Naive Bayes algorithm

# call scikit's naive Bayes algorithm package: GaussianNB and MultinomialNB
from sklearn.naive_bayes import GaussianNB

# NB for Gaussian-distributed data
def GaussianNBClassify(trainData, trainLabel, testData):
    nbClf = GaussianNB()
    nbClf.fit(trainData, ravel(trainLabel))
    testLabel = nbClf.predict(testData)
    saveResult(testLabel, 'sklearn_GaussianNB_Result.csv')
    return testLabel

from sklearn.naive_bayes import MultinomialNB

# NB for multinomially distributed data
def MultinomialNBClassify(trainData, trainLabel, testData):
    nbClf = MultinomialNB(alpha=0.1)  # default alpha=1.0; alpha=1 is called Laplace smoothing, alpha<1 Lidstone smoothing
    nbClf.fit(trainData, ravel(trainLabel))
    testLabel = nbClf.predict(testData)
    saveResult(testLabel, 'sklearn_MultinomialNB_alpha=0.1_Result.csv')
    return testLabel

Above I tried two naive Bayes variants: Gaussian and multinomial. The multinomial version has a parameter alpha that you can set yourself.

For more detailed usage, see the official documentation: http://scikit-learn.org/stable/modules/naive_bayes.html
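To see how the smoothing parameter matters, here is a minimal sketch that scores a few alpha values on a held-out split; the candidate values are arbitrary and this is my illustration, not part of the original post:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def compareAlpha(trainData, trainLabel, alphas=(0.01, 0.1, 1.0)):
    # hold out 20% of the training data as a validation set
    Xtr, Xval, ytr, yval = train_test_split(trainData, trainLabel, test_size=0.2, random_state=0)
    for a in alphas:
        clf = MultinomialNB(alpha=a).fit(Xtr, ytr)
        print(a, clf.score(Xval, yval))  # mean accuracy on the validation set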



Summary of Usage:
The first step: determine which classifier to use; at this step you can set various parameters. For example:

svcClf = svm.SVC(C=5.0)

The second step: feed the classifier the training data by calling the fit method. For example:

svcClf.fit(trainData, ravel(trainLabel))

A note on fit(X, y): X corresponds to trainData; it is array-like with shape = [n_samples, n_features], the set of feature vectors of the training samples, with one row per training sample and one column per feature. y corresponds to trainLabel; it is array-like with shape = [n_samples], i.e. y must be a flat one-dimensional vector, which is why the numpy.ravel() function is used above.
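Here is a tiny check of what ravel() does, assuming the labels come back from toInt() as a 1 x 42000 matrix, as the comments in the full code indicate:

import numpy as np

label = np.zeros((1, 5))        # stand-in for a 1 x n label matrix
print(label.shape)              # (1, 5): two-dimensional
print(np.ravel(label).shape)    # (5,): the flat shape that fit() expects for y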
The third step: use the classifier to predict the test samples by calling the predict method. For example:

testLabel = svcClf.predict(testData)
The fourth step: save the results. How you do this depends on what the problem asks for; since this article takes Digit Recognition as the example, we have:

saveResult(testLabel, 'sklearn_SVC_C=5.0_Result.csv')
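One caveat: Kaggle's Digit Recognizer submissions expect an ImageId,Label header, which the plain saveResult above does not write. Here is a minimal variant that produces that format; the column names are what I recall the competition requiring, so verify them against the competition page:

import csv

def saveKaggleResult(result, csvName):
    with open(csvName, 'w', newline='') as myFile:    # under Python 2, use open(csvName, 'wb') instead
        myWriter = csv.writer(myFile)
        myWriter.writerow(['ImageId', 'Label'])       # header row
        for i, label in enumerate(result):
            myWriter.writerow([i + 1, int(label)])    # ImageId is 1-based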



(3) Make a submission

The above is basically the whole development process. Now let's look at how each algorithm performs by making a submission on Kaggle.

KNN algorithm: accuracy 95.871%



Naive Bayes, alpha=1.0: accuracy 81.043%


SVM, linear kernel: accuracy 93.943%



3. Project files

CSDN download: Kaggle Getting Started: Using Scikit-learn to Solve Digit Recognition
GitHub: https://github.com/wepe/kaggle-solution

Finally, here is the full code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Created on Tue Dec 21:59:00 2014
@author: wepon
@blog: http://blog.csdn.net/u012162613
"""
from numpy import *
import csv

def toInt(array):
    # convert an array of strings to an integer matrix
    array = mat(array)
    m, n = shape(array)
    newArray = zeros((m, n))
    for i in xrange(m):
        for j in xrange(n):
            newArray[i, j] = int(array[i, j])
    return newArray

def nomalizing(array):
    # normalize: any non-zero pixel becomes 1
    m, n = shape(array)
    for i in xrange(m):
        for j in xrange(n):
            if array[i, j] != 0:
                array[i, j] = 1
    return array

def loadTrainData():
    l = []
    with open('train.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)  # 42001*785
    l.remove(l[0])  # drop the header row
    l = array(l)
    label = l[:, 0]
    data = l[:, 1:]
    return nomalizing(toInt(data)), toInt(label)  # label 1*42000, data 42000*784
    # return trainData, trainLabel

def loadTestData():
    l = []
    with open('test.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)  # 28001*784
    l.remove(l[0])  # drop the header row
    data = array(l)
    return nomalizing(toInt(data))  # data 28000*784
    # return testData

def loadTestResult():
    l = []
    with open('knn_benchmark.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)  # 28001*2
    l.remove(l[0])  # drop the header row
    label = array(l)
    return toInt(label[:, 1])  # label 28000*1

# result is the list of predicted labels
# csvName is the name of the csv file in which to save the results
def saveResult(result, csvName):
    with open(csvName, 'wb') as myFile:
        myWriter = csv.writer(myFile)
        for i in result:
            tmp = []
            tmp.append(i)
            myWriter.writerow(tmp)

# call scikit's KNN algorithm package
from sklearn.neighbors import KNeighborsClassifier

def knnClassify(trainData, trainLabel, testData):
    knnClf = KNeighborsClassifier()  # default: k = 5; define it yourself: KNeighborsClassifier(n_neighbors=10)
    knnClf.fit(trainData, ravel(trainLabel))
    testLabel = knnClf.predict(testData)
    saveResult(testLabel, 'sklearn_knn_Result.csv')
    return testLabel

# call scikit's SVM algorithm package
from sklearn import svm

def svcClassify(trainData, trainLabel, testData):
    svcClf = svm.SVC(C=5.0)  # default: C=1.0, kernel='rbf'. You can try kernel: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
    svcClf.fit(trainData, ravel(trainLabel))
    testLabel = svcClf.predict(testData)
    saveResult(testLabel, 'sklearn_SVC_C=5.0_Result.csv')
    return testLabel

# call scikit's naive Bayes algorithm package: GaussianNB and MultinomialNB
from sklearn.naive_bayes import GaussianNB

# NB for Gaussian-distributed data
def GaussianNBClassify(trainData, trainLabel, testData):
    nbClf = GaussianNB()
    nbClf.fit(trainData, ravel(trainLabel))
    testLabel = nbClf.predict(testData)
    saveResult(testLabel, 'sklearn_GaussianNB_Result.csv')
    return testLabel

from sklearn.naive_bayes import MultinomialNB

# NB for multinomially distributed data
def MultinomialNBClassify(trainData, trainLabel, testData):
    nbClf = MultinomialNB(alpha=0.1)  # default alpha=1.0; alpha=1 is called Laplace smoothing, alpha<1 Lidstone smoothing
    nbClf.fit(trainData, ravel(trainLabel))
    testLabel = nbClf.predict(testData)
    saveResult(testLabel, 'sklearn_MultinomialNB_alpha=0.1_Result.csv')
    return testLabel

def digitRecognition():
    trainData, trainLabel = loadTrainData()
    testData = loadTestData()
    # use different algorithms
    result1 = knnClassify(trainData, trainLabel, testData)
    result2 = svcClassify(trainData, trainLabel, testData)
    result3 = GaussianNBClassify(trainData, trainLabel, testData)
    result4 = MultinomialNBClassify(trainData, trainLabel, testData)

    # compare the results with the given knn_benchmark, taking result1 as an example
    resultGiven = loadTestResult()
    m, n = shape(testData)
    different = 0  # number of labels in result1 that differ from the benchmark, initialized to 0
    for i in xrange(m):
        if result1[i] != resultGiven[0, i]:
            different += 1
    print different
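To run everything end to end you can add an entry point (my addition; the original script defines digitRecognition() but never shows how it is invoked):

if __name__ == '__main__':
    digitRecognition()

Also note that the code above is Python 2 (xrange, the print statement, opening the csv in 'wb' mode); under Python 3 you would need range, print(...), and open(csvName, 'w', newline='').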



Copyright notice: this is an original blog post; it may not be reproduced without the author's consent.
