Kaggle competition - Otto Group Product Classification: a simple solution that beats half of the participating teams

Introduction

Otto Group Product Classification provides three data files: train.csv, test.csv, and a sample submission. train.csv contains more than 60,000 samples, each with an id, 93 feature values feat_1 ~ feat_93, and a category label target: Class_1 ~ Class_9. test.csv contains more than 140,000 test samples with only ids and the 93 features. The contestant's task is to classify these test samples.

Format of the submitted result

Format provided on the official website:

id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.2,0.3,0.3,0.0,0.0,0.1,0.1,0.0

That is to say, you do not need to give the exact category; you give the probability of the sample belonging to each category. This is very important. I did not pay attention to it at the beginning and wasted a lot of energy, so it is best to carefully read the information provided on the official website during a competition.

Rating criteria

For the scoring formula, see the official evaluation page: https://www.kaggle.com/c/otto-group-product-classification-challenge/details/evaluation

Let me explain this formula a little. i indexes the samples and j indexes the categories. p_ij denotes the predicted probability that sample i belongs to category j. If sample i really belongs to category j, then y_ij equals 1; otherwise it is 0.
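For reference, the formula on that page is the standard multiclass logloss, where N is the number of test samples and M = 9 is the number of categories:

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$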

If you classify all the test samples correctly, assigning probability p_ij = 1 to every true class, then each log(p_ij) is 0 and the final logloss is also 0.

Suppose the first sample belongs to the first category, but you give it probability p_11 = 0.1; then logloss accumulates a term log(0.1). This term is negative, and the smaller p_ij is, the more negative it becomes. If p_ij = 0, the term is negative infinity. That leads to an unreasonable situation: a single wrong prediction makes the whole logloss infinite. To avoid this, the official fix is as follows:

That is to say, each predicted probability is clipped so that it is no smaller than 10^-15 (and no larger than 1 - 10^-15).
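In numpy this clipping is a one-liner; a minimal sketch (the example probabilities are just for illustration):

import numpy as np

pred = np.array([0.0, 0.2, 0.8])          # example predicted probabilities
pred = np.clip(pred, 1e-15, 1 - 1e-15)    # 0.0 becomes 1e-15, so log() stays finite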

Start solving problems

Finally, let's get to the topic. Next I will describe a naive solution. There is nothing fancy about it: with nothing more than sklearn and numpy, you can already beat half of the participating teams.

The code is on my GitHub at wepe/Kaggle-Solution. It can be divided into several parts:

Data preprocessing

Strictly speaking, I have not done much data preprocessing work. There are only two data-loading functions, loadTrainSet() and loadTestSet(), which load the training set and the test set. The code is as follows; zero-mean normalization (standardization) is implemented in it, although standardizing makes little difference here.

import csv
import random
import numpy as np

#load train set
def loadTrainSet():
    traindata = []
    trainlabel = []
    table = {"Class_1":1,"Class_2":2,"Class_3":3,"Class_4":4,"Class_5":5,
             "Class_6":6,"Class_7":7,"Class_8":8,"Class_9":9}
    with open("train.csv") as f:
        rows = csv.reader(f)
        next(rows)                       #skip the header line
        for row in rows:
            l = []
            for i in range(1,94):
                l.append(int(row[i]))
            traindata.append(l)
            trainlabel.append(table.get(row[-1]))
    traindata = np.array(traindata,dtype="float")
    trainlabel = np.array(trainlabel,dtype="int")
    #Standardize (zero-mean, normalization)
    mean = traindata.mean(axis=0)
    std = traindata.std(axis=0)
    traindata = (traindata - mean)/std
    #shuffle the data
    randomIndex = list(range(len(trainlabel)))
    random.shuffle(randomIndex)
    traindata = traindata[randomIndex]
    trainlabel = trainlabel[randomIndex]
    return traindata,trainlabel

#load test set
def loadTestSet():
    testdata = []
    with open("test.csv") as f:
        rows = csv.reader(f)
        next(rows)                       #skip the header line
        for row in rows:
            l = []
            for i in range(1,94):
                l.append(int(row[i]))
            testdata.append(l)
    testdata = np.array(testdata,dtype="float")
    #Standardize (zero-mean, normalization)
    mean = testdata.mean(axis=0)
    std = testdata.std(axis=0)
    testdata = (testdata - mean)/std
    return testdata

The code is quite long, but it is really just simple reading and preprocessing work. (If you don't want to write such tedious code, try the data analysis package pandas, which can do the same job in a few lines, as sketched below.)
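For instance, a minimal pandas sketch of the same loading step (the file layout matches the description above; extracting the class number from strings like "Class_7" via the last character is my own shortcut):

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

traindata = train.iloc[:, 1:94].values.astype("float")    # feat_1 ~ feat_93
trainlabel = train["target"].str[-1].astype(int).values   # "Class_7" -> 7
testdata = test.iloc[:, 1:94].values.astype("float")

# same zero-mean, unit-variance standardization as above
traindata = (traindata - traindata.mean(axis=0)) / traindata.std(axis=0)
testdata = (testdata - testdata.mean(axis=0)) / testdata.std(axis=0)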

Model Evaluation

For model evaluation, part of the training data is usually split off as a validation set, and the validation accuracy is used as the metric. However, since this competition officially provides an evaluation formula (as described above), we write an evaluation function based on that formula instead:

#Evaluation function
#Refer to: https://www.kaggle.com/c/otto-group-product-classification-challenge/details/evaluation
def evaluation(label,pred_label):
    num = len(label)
    logloss = 0.0
    for i in range(num):
        #probability assigned to the true class, clipped to [1e-15, 1-1e-15]
        p = max(min(pred_label[i][label[i]-1],1-10**(-15)),10**(-15))
        logloss += np.log(p)
    logloss = -1*logloss/num
    return logloss

The above three functions, loadTrainSet(), loadTestSet(), and evaluation(), are all stored in the code file preprocess.py. A saveResult() function is also defined in that file; it generates a submission.csv file from the predicted test_label.
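saveResult() is not shown in this post; a minimal sketch of what it might look like, assuming test_label is an array of shape (n_samples, 9) and the test ids simply run from 1 upward:

import csv

def saveResult(test_label, filename="submission.csv"):
    #write the header, then one row of 9 class probabilities per test sample
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id"] + ["Class_%d" % j for j in range(1, 10)])
        for i, probs in enumerate(test_label):
            writer.writerow([i + 1] + list(probs))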

Classification Algorithm

Directly call the random forest in sklearn and tune its parameters: adjusting the number of decision trees a few times brings the logloss down to about 0.5. This score already beats half of the participating teams.

from sklearn.ensemble import RandomForestClassifier
import preprocess

def rf(train_data,train_label,val_data,val_label,test_data,name="RandomForest_submission.csv"):
    print("Start training Random forest...")
    rfClf = RandomForestClassifier(n_estimators=400,n_jobs=-1)
    rfClf.fit(train_data,train_label)
    #evaluate on validation set
    val_pred_label = rfClf.predict_proba(val_data)
    logloss = preprocess.evaluation(val_label,val_pred_label)
    print("logloss of validation set:",logloss)
    print("Start classify test set...")
    test_label = rfClf.predict_proba(test_data)
    preprocess.saveResult(test_label,filename = name)
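How many trees are enough is found by trial; a small sketch of the kind of sweep I mean (the candidate values are illustrative only):

from sklearn.ensemble import RandomForestClassifier
import preprocess

def tune_trees(train_data, train_label, val_data, val_label):
    #try a few forest sizes and watch the validation logloss
    for n in [100, 200, 400, 800]:
        clf = RandomForestClassifier(n_estimators=n, n_jobs=-1)
        clf.fit(train_data, train_label)
        logloss = preprocess.evaluation(val_label, clf.predict_proba(val_data))
        print("n_estimators=%d, validation logloss=%.4f" % (n, logloss))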

In this example, val_data is split off from train.csv and accounts for about 6,000 samples.

This part of the code is stored in the file RandomForest.py.
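Putting the pieces together, a minimal driver might look like this (the 6,000-sample holdout matches the figure mentioned above; loadTrainSet() already shuffles the data, so a plain slice works):

import preprocess
from RandomForest import rf

traindata, trainlabel = preprocess.loadTrainSet()
testdata = preprocess.loadTestSet()

#hold out roughly 6,000 shuffled samples for validation
val_data, val_label = traindata[:6000], trainlabel[:6000]
train_data, train_label = traindata[6000:], trainlabel[6000:]

rf(train_data, train_label, val_data, val_label, testdata)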

If you are interested, head to GitHub to get the code. Have fun!

================

By wepon
