Kaggle competition: Otto Group Product Classification, a simple solution that beats half of the participating teams (kaggle-otto)
Introduction
Otto Group Product Classification provides three data files. train.csv contains more than 60,000 training samples, each with an id, 93 feature values feat_1 ~ feat_93, and a category label target: Class_1 ~ Class_9. test.csv contains more than 140,000 test samples with only ids and the 93 features. The contestant's task is to classify these test samples.
Format of the submitted result
The format given on the official website:

```
id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.2,0.3,0.3,0.0,0.0,0.1,0.1,0.0
```
In other words, you do not need to give the exact category; you give the probability that the sample belongs to each category. This is very important. I did not notice it at first and wasted a lot of energy, so it is best to read the information on the official website carefully during the competition.
Rating criteria
The score formula is the multiclass logloss (see the evaluation page on the official website):

logloss = -(1/N) * sum_i sum_j ( y_ij * log(p_ij) )

A few words on this formula: i indexes the samples and j the categories. p_ij is the predicted probability that sample i belongs to category j. If sample i really belongs to category j, then y_ij = 1; otherwise y_ij = 0.
If you classify every test sample correctly and all the corresponding p_ij are 1, then every log(p_ij) is 0 and the final logloss is also 0.
If the first sample belongs to the first category but you give it probability p_ij = 0.1, then logloss accumulates a term log(0.1). This term is negative, and the smaller p_ij is, the more negative it becomes; if p_ij = 0, the term is negative infinity. That leads to an absurd situation: one wrong prediction and the whole logloss is infinite. To avoid this, the official rule clips the probabilities:

p_ij = max( min(p_ij, 1 - 10^-15), 10^-15 )

That is, each probability is bounded away from 0 (and from 1) by 10^-15.
Solving the problem
Finally, the main topic. Below is a naive solution; nothing fancy is needed. Simply using sklearn and numpy is already enough to beat half of the participating teams.
The code is on my GitHub at wepe/Kaggle-Solution, and it can be divided into several parts:
Data preprocessing
Strictly speaking, I did not do much data preprocessing work, only a couple of functions that load the data:
loadTrainSet() and loadTestSet(), which load the training set and the test set respectively. The code is below. Zero-mean centering and normalization (standardization) are implemented in it, although standardizing makes little difference here.
```python
import csv
import random
import numpy as np

#load train set
def loadTrainSet():
    traindata = []
    trainlabel = []
    table = {"Class_1":1,"Class_2":2,"Class_3":3,"Class_4":4,"Class_5":5,
             "Class_6":6,"Class_7":7,"Class_8":8,"Class_9":9}
    with open("train.csv") as f:
        rows = csv.reader(f)
        rows.next()   # skip the header row
        for row in rows:
            l = []
            for i in range(1,94):
                l.append(int(row[i]))
            traindata.append(l)
            trainlabel.append(table.get(row[-1]))
    traindata = np.array(traindata,dtype="float")
    trainlabel = np.array(trainlabel,dtype="int")
    #Standardize (zero-mean, normalization)
    mean = traindata.mean(axis=0)
    std = traindata.std(axis=0)
    traindata = (traindata - mean)/std
    #shuffle the data
    randomIndex = [i for i in xrange(len(trainlabel))]
    random.shuffle(randomIndex)
    traindata = traindata[randomIndex]
    trainlabel = trainlabel[randomIndex]
    return traindata,trainlabel

#load test set
def loadTestSet():
    testdata = []
    with open("test.csv") as f:
        rows = csv.reader(f)
        rows.next()   # skip the header row
        for row in rows:
            l = []
            for i in range(1,94):
                l.append(int(row[i]))
            testdata.append(l)
    testdata = np.array(testdata,dtype="float")
    #Standardize (zero-mean, normalization)
    mean = testdata.mean(axis=0)
    std = testdata.std(axis=0)
    testdata = (testdata - mean)/std
    return testdata
```
The code is quite long, but it is actually just some simple reading and processing. (If you don't want to write such tedious code, try the data analysis package pandas, which can do the same work in a few lines.)
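For example, a pandas version might look like the sketch below. This is not taken from the repository; the column names (`id`, `feat_*`, `target`) follow the competition CSVs, and the standardization mirrors loadTrainSet():

```python
import pandas as pd

def load_train_with_pandas(path="train.csv"):
    df = pd.read_csv(path)
    # map "Class_1".."Class_9" to the integers 1..9
    labels = df["target"].str.replace("Class_", "").astype(int).values
    feats = df.drop(["id", "target"], axis=1).values.astype(float)
    # zero-mean, unit-variance standardization, as in loadTrainSet()
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
    return feats, labels
```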
Model Evaluation
Model evaluation is usually done by holding out part of the training data as a validation set, and normally the validation accuracy would be the metric. However, since this competition officially provides an evaluation formula (described above), we write an evaluation function based on that formula:
```python
#Evaluation function
#Refer to: https://www.kaggle.com/c/otto-group-product-classification-challenge/details/evaluation
def evaluation(label,pred_label):
    num = len(label)
    logloss = 0.0
    for i in range(num):
        p = max(min(pred_label[i][label[i]-1], 1-10**(-15)), 10**(-15))
        logloss += np.log(p)
    logloss = -1*logloss/num
    return logloss
```
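As a sanity check (this is an aside, not part of the solution), sklearn ships the same multiclass logloss as `sklearn.metrics.log_loss`; the hand computation below matches it on a tiny made-up example:

```python
import numpy as np
from sklearn.metrics import log_loss

# two samples, three classes; the true classes are 1 and 2 (1-based)
label = [1, 2]
pred = np.array([[0.8, 0.1, 0.1],
                 [0.3, 0.6, 0.1]])
# labels= tells sklearn which class each probability column refers to
score = log_loss(label, pred, labels=[1, 2, 3])
# hand computation: -(log p(true class of sample 1) + log p(...2)) / 2
expected = -(np.log(0.8) + np.log(0.6)) / 2
print(score, expected)
```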
The three functions above, loadTrainSet(), loadTestSet(), and evaluation(), are all in the code file preprocess.py. A saveResult() function is also defined in that file; it generates a submission.csv file from the predicted test labels test_label.
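saveResult() itself is not reproduced in this post; a minimal sketch of what such a function could look like (the real implementation is in the repository) is:

```python
import csv

def saveResult(test_label, filename="submission.csv"):
    # test_label: array-like of shape (n_samples, 9), one row of
    # class probabilities per test sample
    with open(filename, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["id"] + ["Class_%d" % j for j in range(1, 10)])
        for i, probs in enumerate(test_label):
            # ids in test.csv start at 1
            writer.writerow([i + 1] + list(probs))
```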
Classification Algorithm
Directly call the random forest from sklearn and tune the parameters: after adjusting the number of decision trees a few times, the logloss reaches about 0.5. This score already beats half of the participating teams.
```python
from sklearn.ensemble import RandomForestClassifier

import preprocess

def rf(train_data,train_label,val_data,val_label,test_data,name="RandomForest_submission.csv"):
    print "Start training Random forest..."
    rfClf = RandomForestClassifier(n_estimators=400,n_jobs=-1)
    rfClf.fit(train_data,train_label)
    #evaluate on validation set
    val_pred_label = rfClf.predict_proba(val_data)
    logloss = preprocess.evaluation(val_label,val_pred_label)
    print "logloss of validation set:",logloss
    print "Start classify test set..."
    test_label = rfClf.predict_proba(test_data)
    preprocess.saveResult(test_label,filename = name)
```
Here val_data and val_label are held out from the data loaded from train.csv, about 6,000 samples.
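Since loadTrainSet() already shuffles the data, the holdout split can be a plain slice. A sketch (the helper name and val_size are mine, not from the repository):

```python
import numpy as np

def holdout_split(data, label, val_size=6000):
    # data was already shuffled in loadTrainSet(), so slicing off
    # the tail gives a random validation set
    train_data, val_data = data[:-val_size], data[-val_size:]
    train_label, val_label = label[:-val_size], label[-val_size:]
    return train_data, train_label, val_data, val_label
```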
This part of the code is in the file RandomForest.py.
If you are interested, get the code from GitHub. Have fun!
================
By wepon