Kaggle competition: Otto Group Product Classification, a simple solution that beats half of the participating teams (kaggle-otto)
Introduction
Otto Group Product Classification provides three data files. train.csv contains more than 60,000 training samples, each with an id, 93 feature values feat_1 ~ feat_93, and a category label target: Class_1 ~ Class_9. test.csv contains more than 140,000 test samples with only ids and the 93 features. The contestant's task is to classify these test samples.
Format of the submitted result
The format given on the official website:

```
id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.2,0.3,0.3,0.0,0.0,0.1,0.1,0.0
```
In other words, you do not need to give the exact category; you give the probability that the sample belongs to each category. This is very important. I did not notice it at first and wasted a lot of energy, so it is best to read the information on the official website carefully during the competition.
Rating criteria
The score formula is the multiclass logloss (see the evaluation page on the official website):

logloss = -(1/N) * sum_i sum_j ( y_ij * log(p_ij) )

A few words on this formula: i indexes the samples and j the categories. p_ij is the predicted probability that sample i belongs to category j. If sample i really belongs to category j, then y_ij = 1; otherwise y_ij = 0.
If you classify every test sample correctly and all the corresponding p_ij are 1, then every log(p_ij) is 0 and the final logloss is also 0.
If the first sample belongs to the first category but you give it probability p_ij = 0.1, then logloss accumulates a term log(0.1). This term is negative, and the smaller p_ij is, the more negative it becomes; if p_ij = 0, the term is negative infinity. That leads to an absurd situation: one wrong prediction and the whole logloss is infinite. To avoid this, the official rule clips the probabilities:

p_ij = max( min(p_ij, 1 - 10^-15), 10^-15 )

That is, each probability is bounded away from 0 (and from 1) by 10^-15.
Solving the problem
Finally, the main topic. Below is a naive solution; nothing fancy is needed. Simply using sklearn and numpy is already enough to beat half of the participating teams.
The code is on my GitHub at wepe/Kaggle-Solution, and it can be divided into several parts:
Data preprocessing
Strictly speaking, I did not do much data preprocessing work, only a couple of functions that load the data:
loadTrainSet() and loadTestSet(), which load the training set and the test set respectively. The code is below. Zero-mean centering and normalization (standardization) are implemented in it, although standardizing makes little difference here.
```python
import csv
import random
import numpy as np

#load train set
def loadTrainSet():
    traindata = []
    trainlabel = []
    table = {"Class_1":1,"Class_2":2,"Class_3":3,"Class_4":4,"Class_5":5,
             "Class_6":6,"Class_7":7,"Class_8":8,"Class_9":9}
    with open("train.csv") as f:
        rows = csv.reader(f)
        rows.next()   # skip the header row
        for row in rows:
            l = []
            for i in range(1,94):
                l.append(int(row[i]))
            traindata.append(l)
            trainlabel.append(table.get(row[-1]))
    traindata = np.array(traindata,dtype="float")
    trainlabel = np.array(trainlabel,dtype="int")
    #Standardize (zero-mean, normalization)
    mean = traindata.mean(axis=0)
    std = traindata.std(axis=0)
    traindata = (traindata - mean)/std
    #shuffle the data
    randomIndex = [i for i in xrange(len(trainlabel))]
    random.shuffle(randomIndex)
    traindata = traindata[randomIndex]
    trainlabel = trainlabel[randomIndex]
    return traindata,trainlabel

#load test set
def loadTestSet():
    testdata = []
    with open("test.csv") as f:
        rows = csv.reader(f)
        rows.next()   # skip the header row
        for row in rows:
            l = []
            for i in range(1,94):
                l.append(int(row[i]))
            testdata.append(l)
    testdata = np.array(testdata,dtype="float")
    #Standardize (zero-mean, normalization)
    mean = testdata.mean(axis=0)
    std = testdata.std(axis=0)
    testdata = (testdata - mean)/std
    return testdata
```
The code is quite long, but it is actually just some simple reading and processing. (If you don't want to write such tedious code, try the data analysis package pandas, which can do the same work in a few lines.)
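For example, a pandas version might look like the sketch below. This is not taken from the repository; the column names (`id`, `feat_*`, `target`) follow the competition CSVs, and the standardization mirrors loadTrainSet():

```python
import pandas as pd

def load_train_with_pandas(path="train.csv"):
    df = pd.read_csv(path)
    # map "Class_1".."Class_9" to the integers 1..9
    labels = df["target"].str.replace("Class_", "").astype(int).values
    feats = df.drop(["id", "target"], axis=1).values.astype(float)
    # zero-mean, unit-variance standardization, as in loadTrainSet()
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
    return feats, labels
```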
Model Evaluation
Model evaluation is usually done by holding out part of the training data as a validation set, and normally the validation accuracy would be the metric. However, since this competition officially provides an evaluation formula (described above), we write an evaluation function based on that formula:
```python
#Evaluation function
#Refer to: https://www.kaggle.com/c/otto-group-product-classification-challenge/details/evaluation
def evaluation(label,pred_label):
    num = len(label)
    logloss = 0.0
    for i in range(num):
        p = max(min(pred_label[i][label[i]-1], 1-10**(-15)), 10**(-15))
        logloss += np.log(p)
    logloss = -1*logloss/num
    return logloss
```
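As a sanity check (this is an aside, not part of the solution), sklearn ships the same multiclass logloss as `sklearn.metrics.log_loss`; the hand computation below matches it on a tiny made-up example:

```python
import numpy as np
from sklearn.metrics import log_loss

# two samples, three classes; the true classes are 1 and 2 (1-based)
label = [1, 2]
pred = np.array([[0.8, 0.1, 0.1],
                 [0.3, 0.6, 0.1]])
# labels= tells sklearn which class each probability column refers to
score = log_loss(label, pred, labels=[1, 2, 3])
# hand computation: -(log p(true class of sample 1) + log p(...2)) / 2
expected = -(np.log(0.8) + np.log(0.6)) / 2
print(score, expected)
```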
The three functions above, loadTrainSet(), loadTestSet(), and evaluation(), are all in the code file preprocess.py. A saveResult() function is also defined in that file; it generates a submission.csv file from the predicted test labels test_label.
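saveResult() itself is not reproduced in this post; a minimal sketch of what such a function could look like (the real implementation is in the repository) is:

```python
import csv

def saveResult(test_label, filename="submission.csv"):
    # test_label: array-like of shape (n_samples, 9), one row of
    # class probabilities per test sample
    with open(filename, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["id"] + ["Class_%d" % j for j in range(1, 10)])
        for i, probs in enumerate(test_label):
            # ids in test.csv start at 1
            writer.writerow([i + 1] + list(probs))
```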
Classification Algorithm
Directly call the random forest from sklearn and tune the parameters: after adjusting the number of decision trees a few times, the logloss reaches about 0.5. This score already beats half of the participating teams.
```python
from sklearn.ensemble import RandomForestClassifier

import preprocess

def rf(train_data,train_label,val_data,val_label,test_data,name="RandomForest_submission.csv"):
    print "Start training Random forest..."
    rfClf = RandomForestClassifier(n_estimators=400,n_jobs=-1)
    rfClf.fit(train_data,train_label)
    #evaluate on validation set
    val_pred_label = rfClf.predict_proba(val_data)
    logloss = preprocess.evaluation(val_label,val_pred_label)
    print "logloss of validation set:",logloss
    print "Start classify test set..."
    test_label = rfClf.predict_proba(test_data)
    preprocess.saveResult(test_label,filename = name)
```
Here val_data and val_label are held out from the data loaded from train.csv, about 6,000 samples.
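Since loadTrainSet() already shuffles the data, the holdout split can be a plain slice. A sketch (the helper name and val_size are mine, not from the repository):

```python
import numpy as np

def holdout_split(data, label, val_size=6000):
    # data was already shuffled in loadTrainSet(), so slicing off
    # the tail gives a random validation set
    train_data, val_data = data[:-val_size], data[-val_size:]
    train_label, val_label = label[:-val_size], label[-val_size:]
    return train_data, train_label, val_data, val_label
```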
This part of the code is in the file RandomForest.py.
If you are interested, get the code from GitHub. Have fun!
================
By wepon