Get started with Kaggle -- use scikit-learn to solve the DigitRecognition problem
@ Author: wepon
@ Blog: http://blog.csdn.net/u012162613
1. Introduction to scikit-learn
Scikit-learn is an open-source machine learning toolkit built on NumPy, SciPy, and Matplotlib. It is written in Python and covers classification, regression, and clustering algorithms such as kNN, SVM, logistic regression, naive Bayes, random forest, and k-means, among many others. It is a convenient and powerful tool for machine learning developers and saves a lot of development time.
Scikit-learn official guide: http://scikit-learn.org/stable/user_guide.html
In the previous article, "Big Data competition platform -- Kaggle getting started", I introduced Kaggle in two parts. In the second part I recorded the whole process of solving DigitRecognition, one of the competition projects on Kaggle, and at that time I used a kNN algorithm I had written myself. Writing your own kNN does not take much time, but when you want to try more (and more complex) algorithms, implementing each one yourself quickly becomes a waste of time. This is where scikit-learn comes in: we can call its algorithm packages directly. Of course, beginners should call these packages only after understanding the algorithms, and if you have the time, implementing an algorithm from scratch will give you a much deeper understanding of it.
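To get a feel for how little code a scikit-learn call takes, here is a minimal sketch of the usual fit/predict pattern, using scikit-learn's built-in 8x8 digits toy dataset purely for illustration (this is not the Kaggle data used below):
# Minimal sketch of the scikit-learn estimator interface on a toy dataset
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
digits = load_digits()                     # small 8x8 digit images bundled with scikit-learn
X, y = digits.data, digits.target          # features (n_samples, 64), labels (n_samples,)
clf = KNeighborsClassifier(n_neighbors=5)  # choose a classifier and its parameters
clf.fit(X[:1000], y[:1000])                # train on the first 1000 samples
print(clf.predict(X[1000:1010]))           # predict labels for a few held-out samples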
OK, enough chatter. The second part is below.
2. Use scikit-learn to solve DigitRecognition
I find that I like using DigitRecognition to practice classification algorithms, because the problem is simple enough. If you are not familiar with DigitRecognition, take a look at it first: Kaggle DigitRecognition; it is also described in my previous article, "Big Data competition platform -- Kaggle getting started". Below I use three algorithm packages from scikit-learn to solve this problem: kNN (k-nearest neighbors), SVM (support vector machine), and NB (naive Bayes). There are two key steps: 1. process the data; 2. call an algorithm.
(1) Data processing
The data processing is the same as in the second part of the previous article, "Big Data competition platform -- Kaggle getting started", so it is not repeated here. Below is a simple list of the functions and what they do; the full code is given at the end of this article.
def loadTrainData():
#This function reads the training samples from train.csv: trainData, trainLabel
def loadTestData():
#This function reads the test samples from test.csv: testData
def toInt(array):
def nomalizing(array):
#These two functions are called inside loadTrainData() and loadTestData()
#toInt() converts the string array to integers; nomalizing() normalizes the integer array (non-zero entries are set to 1)
def loadTestResult():
#This function loads the reference labels of the test samples, for comparison later
def saveResult(result, csvName):
#This function saves the result as a csv file named csvName
In the "processing data" part, we obtained the training sample feature, the training sample label, and the test sample feature from the train.csv and test.csv files. In the program, we use trainData, trainLabel, and testData.
(2) Call the kNN algorithm in scikit-learn
#Call scikit-learn's kNN algorithm package
from sklearn.neighbors import KNeighborsClassifier
def knnClassify(trainData, trainLabel, testData):
    knnClf = KNeighborsClassifier()  # default k=5; set it yourself with KNeighborsClassifier(n_neighbors=10)
    knnClf.fit(trainData, ravel(trainLabel))
    testLabel = knnClf.predict(testData)
    saveResult(testLabel, 'sklearn_knn_Result.csv')
    return testLabel
The kNN package lets you set the parameter k yourself; the default is k=5 (see the comment above).
For more detailed usage, see the official documentation: http://scikit-learn.org/stable/modules/neighbors.html
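If you want to pick k rather than rely on the default, one possibility (a sketch, not part of the original workflow) is cross-validation on the training set; cross_val_score lives in sklearn.model_selection in recent versions (sklearn.cross_validation in very old ones):
# Hedged sketch: choosing k by 3-fold cross-validation on the training set
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from numpy import ravel
for k in [1, 3, 5, 10]:
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, trainData, ravel(trainLabel), cv=3)
    print(k, scores.mean())    # mean validation accuracy for this k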
SVM algorithm
#Call scikit-learn's SVM algorithm package
from sklearn import svm
def svcClassify(trainData, trainLabel, testData):
    svcClf = svm.SVC(C=5.0)  # defaults: C=1.0, kernel='rbf'; you can try kernel='linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
    svcClf.fit(trainData, ravel(trainLabel))
    testLabel = svcClf.predict(testData)
    saveResult(testLabel, 'sklearn_SVC_C=5.0_Result.csv')
    return testLabel
SVC() has many parameters. The kernel defaults to 'rbf' (the radial basis function kernel) and C defaults to 1.0.
For more detailed usage, see the official documentation: http://scikit-learn.org/stable/modules/svm.html
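For example, the linear-kernel run reported in the results below can be reproduced with the same pattern, only the constructor changes (a hedged sketch; the filename here is illustrative, not from the original code):
# Hedged sketch: the same workflow with a linear kernel
from sklearn import svm
from numpy import ravel
def svcLinearClassify(trainData, trainLabel, testData):
    svcClf = svm.SVC(kernel='linear')    # default C=1.0
    svcClf.fit(trainData, ravel(trainLabel))
    testLabel = svcClf.predict(testData)
    saveResult(testLabel, 'sklearn_SVC_linear_Result.csv')   # illustrative filename
    return testLabel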
Naive Bayes algorithm
#Call scikit-learn's naive Bayes algorithm packages: GaussianNB and MultinomialNB
from sklearn.naive_bayes import GaussianNB  # naive Bayes for Gaussian-distributed data
def GaussianNBClassify(trainData, trainLabel, testData):
    nbClf = GaussianNB()
    nbClf.fit(trainData, ravel(trainLabel))
    testLabel = nbClf.predict(testData)
    saveResult(testLabel, 'sklearn_GaussianNB_Result.csv')
    return testLabel

from sklearn.naive_bayes import MultinomialNB  # naive Bayes for multinomially distributed data
def MultinomialNBClassify(trainData, trainLabel, testData):
    nbClf = MultinomialNB(alpha=0.1)  # default alpha=1.0; alpha=1 is Laplace smoothing, alpha<1 is Lidstone smoothing
    nbClf.fit(trainData, ravel(trainLabel))
    testLabel = nbClf.predict(testData)
    saveResult(testLabel, 'sklearn_MultinomialNB_alpha=0.1_Result.csv')
    return testLabel
Above I tried two naive Bayes variants: Gaussian and multinomial. The multinomial version has an alpha parameter that you can set yourself. For more detailed usage, see the official documentation: http://scikit-learn.org/stable/modules/naive_bayes.html
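To see how alpha affects the multinomial model before submitting, a hedged sketch (not part of the original code) is to hold out part of the training set and compare accuracies; train_test_split is in sklearn.model_selection in recent versions:
# Hedged sketch: comparing a few alpha values on a held-out split of the training data
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from numpy import ravel
Xtr, Xval, ytr, yval = train_test_split(trainData, ravel(trainLabel), test_size=0.2)
for alpha in [1.0, 0.5, 0.1]:
    nbClf = MultinomialNB(alpha=alpha)
    nbClf.fit(Xtr, ytr)
    print(alpha, nbClf.score(Xval, yval))   # mean accuracy on the held-out 20%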
Summary of usage:
Step 1: decide which classifier to use; this is where you set its parameters, for example:
svcClf = svm.SVC(C=5.0)
Step 2: give the classifier its training data by calling the fit method, for example:
svcClf.fit(trainData, ravel(trainLabel))
For fit(X, y): X corresponds to trainData, array-like with shape = [n_samples, n_features]; it is the feature matrix of the training samples, one row per sample and one column per feature. y corresponds to trainLabel, array-like with shape = [n_samples]; y must be a flat one-dimensional vector, which is why numpy.ravel() is used above (see the small example after these steps).
Step 3: use the classifier to predict the test samples by calling the predict method, for example:
testLabel = svcClf.predict(testData)
Step 4: save the results. How you do this depends on the problem at hand; since this article uses DigitRecognition as its example, we have:
saveResult(testLabel, 'sklearn_SVC_C=5.0_Result.csv')
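The ravel() point in step 2 is easy to see with a tiny illustrative example:
# Why ravel() is needed: fit() expects y with shape (n_samples,), not (1, n_samples)
from numpy import array, ravel
trainLabel = array([[1, 0, 4]])     # loaded labels come back as a 1 x n_samples row
print(trainLabel.shape)             # (1, 3)
print(ravel(trainLabel).shape)      # (3,) -- the flat vector that fit() wants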
(3) Make a submission
The above is basically the whole development process. Now let's look at how each algorithm performs by making a submission on Kaggle.
The kNN algorithm: accuracy 95.871%
Naive Bayes, alpha = 1.0: accuracy 81.043%
SVM, linear kernel: accuracy 93.943%
3. Project files
CSDN download: Getting started with Kaggle -- solve DigitRecognition using scikit-learn
Github: https://github.com/wepe/Kaggle-Solution
Here is the full code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Created on Tue Dec 16 21:59:00 2014
@author: wepon
@blog: http://blog.csdn.net/u012162613
"""
from numpy import *
import csv

def toInt(array):
    array = mat(array)
    m, n = shape(array)
    newArray = zeros((m, n))
    for i in xrange(m):
        for j in xrange(n):
            newArray[i, j] = int(array[i, j])
    return newArray

def nomalizing(array):
    m, n = shape(array)
    for i in xrange(m):
        for j in xrange(n):
            if array[i, j] != 0:
                array[i, j] = 1
    return array
def loadTrainData():
    l = []
    with open('train.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)  # 42001 rows * 785 columns (the first row is the header)
    l.remove(l[0])
    l = array(l)
    label = l[:, 0]
    data = l[:, 1:]
    return nomalizing(toInt(data)), toInt(label)  # label 1*42000, data 42000*784
    #return trainData, trainLabel
def loadTestData():
    l = []
    with open('test.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)  # 28001 rows * 784 columns (the first row is the header)
    l.remove(l[0])
    data = array(l)
    return nomalizing(toInt(data))  # data 28000*784
    #return testData
def loadTestResult():
    l = []
    with open('knn_benchmark.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)  # 28001 rows * 2 columns
    l.remove(l[0])
    label = array(l)
    return toInt(label[:, 1])  # label 28000*1

#result is the list of predicted labels
#csvName is the name of the csv file in which the results are stored
def saveResult(result, csvName):
    with open(csvName, 'wb') as myFile:
        myWriter = csv.writer(myFile)
        for i in result:
            tmp = []
            tmp.append(i)
            myWriter.writerow(tmp)
#Call scikit-learn's kNN algorithm package
from sklearn.neighbors import KNeighborsClassifier
def knnClassify(trainData, trainLabel, testData):
    knnClf = KNeighborsClassifier()  # default k=5; set it yourself with KNeighborsClassifier(n_neighbors=10)
    knnClf.fit(trainData, ravel(trainLabel))
    testLabel = knnClf.predict(testData)
    saveResult(testLabel, 'sklearn_knn_Result.csv')
    return testLabel
#Call scikit-learn's SVM algorithm package
from sklearn import svm
def svcClassify(trainData, trainLabel, testData):
    svcClf = svm.SVC(C=5.0)  # defaults: C=1.0, kernel='rbf'; you can try kernel='linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
    svcClf.fit(trainData, ravel(trainLabel))
    testLabel = svcClf.predict(testData)
    saveResult(testLabel, 'sklearn_SVC_C=5.0_Result.csv')
    return testLabel
#Call scikit-learn's naive Bayes algorithm packages: GaussianNB and MultinomialNB
from sklearn.naive_bayes import GaussianNB  # naive Bayes for Gaussian-distributed data
def GaussianNBClassify(trainData, trainLabel, testData):
    nbClf = GaussianNB()
    nbClf.fit(trainData, ravel(trainLabel))
    testLabel = nbClf.predict(testData)
    saveResult(testLabel, 'sklearn_GaussianNB_Result.csv')
    return testLabel

from sklearn.naive_bayes import MultinomialNB  # naive Bayes for multinomially distributed data
def MultinomialNBClassify(trainData, trainLabel, testData):
    nbClf = MultinomialNB(alpha=0.1)  # default alpha=1.0; alpha=1 is Laplace smoothing, alpha<1 is Lidstone smoothing
    nbClf.fit(trainData, ravel(trainLabel))
    testLabel = nbClf.predict(testData)
    saveResult(testLabel, 'sklearn_MultinomialNB_alpha=0.1_Result.csv')
    return testLabel
def digitRecognition():
    trainData, trainLabel = loadTrainData()
    testData = loadTestData()
    #Run the different algorithms
    result1 = knnClassify(trainData, trainLabel, testData)
    result2 = svcClassify(trainData, trainLabel, testData)
    result3 = GaussianNBClassify(trainData, trainLabel, testData)
    result4 = MultinomialNBClassify(trainData, trainLabel, testData)

    #Compare the result with the given knn_benchmark, taking result1 as an example
    resultGiven = loadTestResult()
    m, n = shape(testData)
    different = 0  # number of labels in result1 that differ from the benchmark, initialized to 0
    for i in xrange(m):
        if result1[i] != resultGiven[0, i]:
            different += 1
    print different