Big Data Competition Platform -- Kaggle: An Introductory Article
This article is for readers who are new to Kaggle and want to get familiar with it and complete a competition project independently; readers who have already competed on Kaggle need not spend time on it. The article is in two parts: the first briefly introduces Kaggle, and the second walks through the whole process of solving a competition project. If there are any mistakes, please point them out!
1. Kaggle Introduction
Kaggle is a competition platform for data analysis; website: https://www.kaggle.com/
Companies or researchers publish data, problem descriptions, and target metrics on Kaggle and solicit solutions from data scientists in the form of contests, similar to KDD Cup (the international knowledge discovery and data mining competition). Contestants download the data from Kaggle, analyze it, apply machine learning, data mining, and other knowledge to build an algorithmic model, produce results for the problem, and finally submit them. If the submitted results meet the target requirements and rank first among the contestants, they win a generous prize. For more information, see: Big Data crowdsourcing platform.
Below I introduce Kaggle with pictures and text.
Go to the Kaggle website:
These are the prize competitions currently in full swing. The ones whose icon is shaped like the Champions League trophy are "Featured" competitions, which convene data science experts to participate. The gray ones below, with a reagent-bottle icon, are "Research" competitions, with smaller prizes. Both categories are real competitions and naturally not easy, so as a beginner you should first do the practice competitions:
The competition on the left is a "101" competition and the one on the right is a "Playground" competition; both are practice competitions for beginners. The best way to get started with Kaggle is to independently complete competitions at the 101 and Playground levels. The second part of this article uses the 101 competition "Digit Recognizer" for the walkthrough.
click on "Digit recognition" to enter the game title:
This is a practice competition for recognizing the digits 0~9.
"Competition Details" describes the competition and explains the problem contestants need to solve.
"Get The Data" is a download, the contestants use this data to train their own model, the results, the data is generally given in CSV format:
Among them, train.csv is the training set and test.csv is the test set. Because this is a practice competition, two benchmark solutions are also provided: knn_benchmark.R and rf_benchmark.R. The former is a KNN algorithm program written in R, the latter a random forest program written in R; their results are knn_benchmark.csv and rf_benchmark.csv respectively. For the CSV file format, my earlier article goes into detail: using Python's csv module.
Once you have your results, the next step is to submit them via "Make a Submission":
The required file is in CSV format. If you save your results as result.csv, click "Click or drop submission here", select the result.csv file to upload, and the system will score the accuracy of your submission and then rank it.
In addition to "Competition Details", "Get the Data", and "Make a Submission", the sidebar also has "Home", "Information", "Forum", and so on, which provide further information about the competition, including the leaderboard, rules, and tutorials.
"The above is the first part, for the time being written so much, there is a supplement later more"
2. The Whole Process of Solving a Competition Project
(1) Knowledge preparation
First of all, solving the problem above requires some foundation in ML algorithms, plus a programming language and the corresponding third-party libraries to implement them. Commonly used options are: Python with NumPy, SciPy, scikit-learn (which implements many ML algorithms that can be used directly), and Theano (a deep learning package); the R language; and Weka. If deep learning algorithms are used, CUDA and Caffe are also options. In short, it does not matter which programming language, platform, or third-party library you use: Kaggle only needs the results you submit online; how you implement the algorithm offline is entirely up to you.
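As an aside (this is not the approach used in the rest of this article), here is a minimal sketch of what "using scikit-learn directly" could look like for this problem. It assumes train.csv and test.csv are in the working directory, and additionally uses pandas, which is not mentioned above, purely for convenient CSV loading:

# a minimal scikit-learn sketch, not the method used in this article
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv('train.csv')                 # 42000 rows: label + 784 pixel columns
test = pd.read_csv('test.csv')                   # 28000 rows: 784 pixel columns

X_train = (train.iloc[:, 1:].values > 0).astype(np.uint8)   # binarize pixels to 0/1
y_train = train.iloc[:, 0].values
X_test = (test.values > 0).astype(np.uint8)

clf = KNeighborsClassifier(n_neighbors=5)        # k=5, same k as used later in this article
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)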
OK, now let's walk through the process, using "Digit Recognizer" as the example. I have written two articles on the digit recognition problem before, implementing it with the KNN algorithm and with logistic regression respectively, both with complete code. If you are interested, see: KNN algorithm for digit recognition, Logistic regression for digit recognition.
(2) The Digit Recognizer problem-solving process
Below I will use the KNN algorithm to solve this Digit Recognizer practice problem on Kaggle. As mentioned above, I have implemented the KNN algorithm before, so here I directly reuse the core algorithm code from that article. That core code is the main implementation of the KNN algorithm and I will not repeat it; my emphasis here is on processing the data.
The following work is based on Python and NumPy (the code is written in Python 2).
Download the following three CSV files from "Get the Data":
train.csv is the training set, of size 42001*785. The first row is a header, so the actual sample data is 42000*785. The first column of each row is the label of that sample; take the first column separately to get a 42000*1 label vector trainLabel, and the rest is the 42000*784 feature matrix trainData. So from train.csv we can get two matrices: trainLabel and trainData.
Here is the code; for how to read data from a CSV file, see: using the csv module.
# imports used by all of the functions in this article
import csv
import operator
from numpy import *

def loadTrainData():
    l = []
    with open('train.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)          # 42001*785 rows including the header
    l.remove(l[0])                  # drop the header row
    l = array(l)
    label = l[:, 0]                 # first column: labels
    data = l[:, 1:]                 # remaining columns: pixel features
    return nomalizing(toInt(data)), toInt(label)
There are two more functions to explain. The toInt() function converts strings to integers: data read from a CSV file is of string type, such as '253', and the subsequent operations need integers, so we convert it with int('253') = 253. The toInt() function is as follows:
def toInt(array):
    array = mat(array)
    m, n = shape(array)
    newArray = zeros((m, n))
    for i in xrange(m):
        for j in xrange(n):
            newArray[i, j] = int(array[i, j])
    return newArray
The nomalizing() function normalizes the data. The pixel values in train.csv are 0~255; to simplify the computation we can convert each image to a binary image, so that all non-zero values, i.e. 1~255, are normalized to 1. The nomalizing() function is as follows:
def nomalizing(array):
    m, n = shape(array)
    for i in xrange(m):
        for j in xrange(n):
            if array[i, j] != 0:
                array[i, j] = 1
    return array
The data in test.csv is of size 28001*784; the first row is a header, so the actual test data is 28000*784. Unlike train.csv, there are no labels: the 28000*784 matrix is 28,000 test samples, and our job is to find the right label for each of them. So from test.csv we get the test sample set testData. The code is as follows:
def loadTestData():
    l = []
    with open('test.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)          # 28001*784 rows including the header
    l.remove(l[0])                  # drop the header row
    data = array(l)
    return nomalizing(toInt(data))
- Analyzing knn_benchmark.csv
As already mentioned, because Digit Recognizer is a practice competition, this file is the official reference result. You could ignore this file, but in order to compare my own training results against it, I also read knn_benchmark.csv in. The data in this file is 28001*2; the first row is a header and can be removed. The first column is the image number 1~28000, and the second column is the digit corresponding to that image. From knn_benchmark.csv we get the 28000*1 reference result matrix testResult. Code:
def loadTestResult():
    l = []
    with open('knn_benchmark.csv') as file:
        lines = csv.reader(file)
        for line in lines:
            l.append(line)          # 28001*2 rows including the header
    l.remove(l[0])                  # drop the header row
    label = array(l)
    return toInt(label[:, 1])       # second column: the reference digits
At this point the data analysis and processing is complete; the matrices we have obtained are trainData, trainLabel, testData, and testResult (the reference labels, called testLabel in the code below).
Next we use the KNN algorithm to classify. The core code:
def classify(inX, dataSet, labels, k):
    inX = mat(inX)
    dataSet = mat(dataSet)
    labels = mat(labels)
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet     # difference to every training sample
    sqDiffMat = array(diffMat) ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                      # Euclidean distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                                  # vote among the k nearest neighbours
        voteILabel = labels[0, sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
For details of this function, refer to: KNN algorithm for digit recognition.
A brief explanation: inX is a single input sample, i.e. one feature vector; dataSet is the training set, corresponding to trainData above; labels corresponds to trainLabel; k is the k chosen for the KNN algorithm, generally a number between 0~20. The function returns the label of inX, i.e. the digit of the image inX. For the 28,000 samples in the test set, the function is called 28,000 times.
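For example, a single call could look like the following minimal sketch (assuming trainData, trainLabel, and testData have been loaded by the functions above):

# classify the first test sample with k=5; returns its predicted digit
predictedLabel = classify(testData[0], trainData, trainLabel, 5)
print(predictedLabel)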
Kaggle requires submissions in CSV format. Above we obtained the labels of the 28,000 test samples; they must be saved to a CSV file before they can be submitted. About CSV, see: using Python's csv module. Code:
def saveResult(result):
    with open('result.csv', 'wb') as myFile:
        myWriter = csv.writer(myFile)
        for i in result:
            tmp = []
            tmp.append(i)
            myWriter.writerow(tmp)
- Putting the functions together
The functions above do all the work that needs to be done; now we write a function that combines them to solve the Digit Recognizer problem. We write a handwritingClassTest() function; running it produces the result file result.csv.
def handwritingClassTest():
    trainData, trainLabel = loadTrainData()
    testData = loadTestData()
    testLabel = loadTestResult()
    m, n = shape(testData)
    errorCount = 0
    resultList = []
    for i in range(m):
        classifierResult = classify(testData[i], trainData, trainLabel, 5)
        resultList.append(classifierResult)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, testLabel[0, i])
        if (classifierResult != testLabel[0, i]):
            errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount / float(m))
    saveResult(resultList)
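A minimal way to run the whole pipeline, assuming all of the functions above are saved in one script:

# minimal entry point (assumes the functions above are defined in this script)
if __name__ == '__main__':
    handwritingClassTest()          # writes result.csv to the working directory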
Running this function produces the result.csv file:
The values 2, 0, 9, 9, 3, 7, 0, 3, ... are the digits corresponding to each image. Compare them with the reference results in knn_benchmark.csv:
Of the 28,000 samples, 1,004 differ from knn_benchmark.csv, an error rate of about 3.5%. This result is not great, because I did not train on all the training samples; that would take too much time, so I used only half of the training set. The corresponding code is:
classifierResult = classify(testData[i], trainData[0:20000], trainLabel[0:20000], 5)
Even training on half the samples, the program ran for nearly 70 minutes (on a personal PC).
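As a side note, the mismatch count against knn_benchmark.csv can also be checked after the fact directly from the two CSV files. A minimal sketch, assuming result.csv contains one predicted label per row with no header, and knn_benchmark.csv keeps its original format (a header row, then image number and digit per row):

# after-the-fact comparison sketch, under the assumptions stated above
import csv

with open('result.csv') as f:
    mine = [row[0] for row in csv.reader(f)]                    # one predicted label per row
with open('knn_benchmark.csv') as f:
    reference = [row[1] for row in list(csv.reader(f))[1:]]     # drop header, keep digit column

errors = sum(1 for a, b in zip(mine, reference) if int(float(a)) != int(float(b)))
print("%d of %d labels differ (%.2f%%)" % (errors, len(reference), 100.0 * errors / len(reference)))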
Organize result.csv into the same format as knn_benchmark.csv, that is, add the header row and add the image numbers as the first column, then make a submission; the resulting accuracy is 96.5%:
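A minimal sketch of that reformatting step. The header names 'ImageId' and 'Label' below are an assumption on my part; copy whatever header row knn_benchmark.csv actually uses:

# reformat result.csv into a submission file with header and image numbers
import csv

with open('result.csv') as f:
    labels = [row[0] for row in csv.reader(f)]       # one predicted label per row

with open('submission.csv', 'wb') as f:              # Python 2: 'wb' avoids blank lines
    writer = csv.writer(f)
    writer.writerow(['ImageId', 'Label'])            # header row (assumed column names)
    for i, label in enumerate(labels):
        writer.writerow([i + 1, int(float(label))])  # image numbers start at 1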
Download Project code: GitHub Address
"Finish"