Notes on a failed Kaggle competition (3): where it went wrong, greedy feature selection, cross-validation, blending

Source: Internet
Author: User
Tags: shuffle, xgboost


The competition is now over, and the final results can be seen here: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard

Public leaderboard results: (screenshot not reproduced)

Private leaderboard results: (screenshot not reproduced)

First, comparing the private and public results, a few things stand out:

1) Almost everyone overfit; either that, or the private half of the test data is simply messier than the public half.

2) Of the top 10 on the private leaderboard, 5 were not even in the top few hundred on the public leaderboard, and 4 were ranked somewhere between 1000 and 2000. Using a sound method matters far more than blindly chasing public leaderboard rank!

3) I moved from 2323rd on the public leaderboard to 1063rd on the private one, a gain of 1260 places. As someone competing in this kind of contest for the first time, and juggling plenty of other work, finishing there out of 5,236 teams and 5,831 players is something I am reasonably satisfied with; my experience simply wasn't enough yet, and a lot of effort went to waste.

4) Back to the key question: what counts as "a correct method"? That is the failure I want to examine:

1. Choose the right model: because I didn't understand the data, I simply tried the following models directly:

models = [
    RandomForestClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    RandomForestClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    GradientBoostingClassifier(learning_rate=0.1, n_estimators=101, subsample=0.6, max_depth=8, random_state=SEED)]
What I actually want to say here is that these models are very slow! At first I couldn't be bothered to set up XGBoost, and that choice wasted a lot of time; in the end I used XGBoost to get my final result anyway. So when you don't yet understand the data, it is important to pick a model that is both fast and generalizes well, and XGBoost is the first choice.
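To make that concrete, here is a minimal sketch of the kind of fast XGBoost baseline meant here; it is not the setup actually used in the competition. It assumes the xgboost package with its sklearn wrapper is installed, uses the Santander column names (ID, TARGET), and the parameter values are illustrative guesses. It uses the same old-style sklearn cross_validation API as the script at the end of the post.

import pandas as pd
import xgboost as xgb
from sklearn import cross_validation

# quick XGBoost baseline: fast to train and usually a strong reference score
train = pd.read_csv("train.csv")
X = train.drop(["ID", "TARGET"], axis=1).values   # assumed column names
y = train["TARGET"].values

clf = xgb.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.1,
                        subsample=0.8, colsample_bytree=0.8)
scores = cross_validation.cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("XGBoost baseline CV AUC: %0.4f" % scores.mean())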

2. Jumping straight into all kinds of complex models without even a baseline: yes, that was me, because it was my first time and I really lacked experience. Complex models overfit easily, so the further you go the deeper you sink, and they also eat far more time; it wasted a good chunk of my youth, and I only realized it when it was already too late. On top of that, my final result actually came from a very simple model. So start with a simple model and use it as the reference for every model you build afterwards. What is a simple model? The original dataset (or a lightly processed one: duplicates removed, missing values filled, normalized, and so on) plus a logistic regression, a simple SVM, or XGBoost; something like the sketch below.
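As a sketch of what "simple baseline first" might look like in practice (not the author's actual baseline): light cleaning, one logistic regression, one cross-validated AUC to compare everything else against. The column names (ID, TARGET) follow the Santander data, and the old sklearn cross_validation API matches the rest of the post.

import pandas as pd
from sklearn import preprocessing, cross_validation
from sklearn.linear_model import LogisticRegression

# lightly processed data: fill missing values, standardize, nothing clever
train = pd.read_csv("train.csv")
X = train.drop(["ID", "TARGET"], axis=1).fillna(0).values
y = train["TARGET"].values
X = preprocessing.StandardScaler().fit_transform(X)

baseline = LogisticRegression(class_weight='balanced')
scores = cross_validation.cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print("baseline CV AUC: %0.4f" % scores.mean())   # reference score for later models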

3. Trust the results of cross-validation: don't just split the data into two fixed parts. With cross-validation you will notice that some folds score very well, with AUC around 0.85, while others are much worse and don't even reach 0.82; a single split hides that spread (see the sketch below).
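A small sketch of that point, using the same old sklearn API as the rest of the post: score the same model fold by fold with StratifiedKFold and look at the spread, not just the mean. The helper name per_fold_auc is mine, not from the original code, and the commented-out usage assumes trainX and trainY as loaded in the full script.

import numpy as np
from sklearn import cross_validation, metrics
from sklearn.ensemble import GradientBoostingClassifier

def per_fold_auc(X, y, clf, n_folds=5, seed=1126):
    # one AUC per fold, so the fold-to-fold variance stays visible
    kfcv = cross_validation.StratifiedKFold(y, n_folds=n_folds, shuffle=True, random_state=seed)
    aucs = []
    for trainI, cvI in kfcv:
        clf.fit(X[trainI], y[trainI])
        proba = clf.predict_proba(X[cvI])[:, 1]
        aucs.append(metrics.roc_auc_score(y[cvI], proba))
    return np.array(aucs)

# aucs = per_fold_auc(trainX, trainY, GradientBoostingClassifier(random_state=1126))
# print(aucs)                       # e.g. some folds near 0.85, others below 0.82
# print(aucs.mean(), aucs.std())    # report the mean, but watch the spread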

4. On the problem of noise: I never found a good way to handle it, so it is not surprising that the final result wasn't great.

5. On features that are mostly zeros: normalize them, this really is necessary! Otherwise you will find that feature engineering works very poorly, because 0+k=k, 0*k=0, and 0^2=0, so features engineered from zero-heavy columns carry no information for those rows. I won't go into the details of how to normalize here; a toy example follows.
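A toy illustration (made-up numbers, not competition data) of why zero-heavy columns hurt feature engineering: any product or power of a zero stays zero, so the engineered feature carries no signal for those rows, while after standardization the zeros move away from zero and start contributing.

import numpy as np
from sklearn import preprocessing

x = np.array([[0.0], [0.0], [0.0], [5.0], [12.0]])        # toy zero-heavy feature
print(x * x)                                               # rows that were 0 stay 0
x_std = preprocessing.StandardScaler().fit_transform(x)    # zero mean, unit variance
print(x_std * x_std)                                       # every row now contributes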

6. There are also smaller details, such as feature screening: if your final model is a GBDT, then screen features with a GBDT, because features that an LR finds useful may not be useful to the GBDT. There are many details like this that you only really appreciate in practice. Another example is whether feature processing should be done on train+test or on train alone: in theory only on train, because the test set is supposed to be unknown, but in this kind of competition you do have the test features, and using them still works... (a sketch of the train-only version follows). I'll stop there; more practice is what really matters, and even with research keeping me busy there is still time for one competition a semester.
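A sketch of the "fit preprocessing on train only" point, assuming trainX and testX are the raw feature matrices loaded as in the full script below; the helper name is mine, and StandardScaler stands in for whatever preprocessing statistic is being learned.

import numpy as np
from sklearn import preprocessing

def scale_train_only(trainX, testX):
    # statistics (mean, std) are learned from the training rows only
    scaler = preprocessing.StandardScaler().fit(trainX)
    return scaler.transform(trainX), scaler.transform(testX)

# The leaky alternative fits on train+test, e.g.
#   preprocessing.StandardScaler().fit(np.vstack([trainX, testX]))
# It peeks at the test feature distribution; with a fixed, visible test set it can
# still score well in a competition, but it is not a method that transfers elsewhere.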

7. Enough talk; here is some code. It covers the three key points of greedy feature selection, cross-validation, and blending, and the script below is complete:

#!usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, metrics
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib

SEED = 1126
NFOLD = 5

def saveFile(submitId, testSubmit, fileName="submit.csv"):
    # write a Kaggle submission file: one row per test id
    content = "Id,target"
    for i in range(submitId.shape[0]):
        content += "\n" + str(submitId[i]) + "," + str(testSubmit[i])
    file = open(fileName, "w")
    file.write(content)
    file.close()

def crossValidationScore(data, label, clf, nFold=5, scoreType="accuracy"):
    if scoreType == "accuracy":
        scores = cross_validation.cross_val_score(clf, data, label, cv=nFold)
        #print("mean accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
        return scores.mean()
    elif scoreType == "AUC":
        meanAuc = 0.0
        kfcv = StratifiedKFold(y=label, n_folds=nFold, shuffle=True, random_state=SEED)
        for j, (trainI, cvI) in enumerate(kfcv):
            print "fold", j, "^" * 20
            xTrain = data[trainI]
            xCv = data[cvI]
            yTrain = label[trainI]
            yCv = label[cvI]
            clf.fit(xTrain, yTrain)
            probas = clf.predict_proba(xCv)
            aucScore = metrics.roc_auc_score(yCv, probas[:, 1])
            #print "AUC (fold %d/%d): %0.4f" % (j+1, nFold, aucScore)
            meanAuc += aucScore
        #print "mean AUC: %0.4f" % (meanAuc / nFold)
        return meanAuc / nFold

def greedyFeatureAdd(clf, data, label, scoreType="accuracy", goodFeatures=[], maxFeaNum=100, eps=0.00005):
    # greedy forward selection: keep adding the single feature that improves the CV score the most
    scoreHistorys = []
    while len(scoreHistorys) <= 2 or scoreHistorys[-1] > scoreHistorys[-2] + eps:
        if len(goodFeatures) == maxFeaNum:
            break
        scores = []
        for testFeaInd in range(data.shape[1]):
            if testFeaInd not in goodFeatures:
                #tempFeaInds = goodFeatures.append(testFeaInd)
                tempFeaInds = goodFeatures + [testFeaInd]
                tempData = data[:, tempFeaInds]
                score = crossValidationScore(tempData, label, clf, NFOLD, scoreType)
                scores.append((score, testFeaInd))
                print "feature: " + str(testFeaInd) + " ==> mean " + scoreType + ": %0.4f" % score
        goodFeatures.append(sorted(scores)[-1][1])   # only add the feature with the biggest gain in score
        scoreHistorys.append(sorted(scores)[-1][0])  # only keep the biggest gain score
        #print scoreHistorys
        print "current features: %s" % sorted(goodFeatures)
    if len(goodFeatures) < maxFeaNum:
        goodFeatures.pop(-1)  # remove the last added feature (it no longer improved the score)
    #goodFeatures = sorted(goodFeatures)  # don't sort: keep features in gain order
    print "selected %d features: %s" % (len(goodFeatures), goodFeatures)
    return goodFeatures  # a feature index list

trainD = pd.read_csv("train.csv")
trainY = np.array(trainD.iloc[:, -1])
trainX = np.array(trainD.iloc[:, 1:-1])  # drop ID and TARGET
testD = pd.read_csv("test.csv")
submitId = np.array(testD.iloc[:, 0])    # ID column
testX = np.array(testD.iloc[:, 1:])      # drop ID

# !!! Better to use an RFC or GBC as the clf here, because the final prediction model
# !!! is one of those two: we should select features that work for RFC/GBC, not for LR.
clf = LogisticRegression(class_weight='balanced', penalty='l2', n_jobs=-1)
selectedFeaInds = greedyFeatureAdd(clf, trainX, trainY, scoreType="AUC", goodFeatures=[], maxFeaNum=150)
joblib.dump(selectedFeaInds, 'modelPersistence/selectedFeaInds.pkl')
#selectedFeaInds = joblib.load('modelPersistence/selectedFeaInds.pkl')
trainX = trainX[:, selectedFeaInds]
testX = testX[:, selectedFeaInds]
print trainX.shape
trainN = len(trainY)

print "Creating train and test sets for blending..."
# !!! Always use a seed for randomized procedures
models = [
    RandomForestClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    RandomForestClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    GradientBoostingClassifier(learning_rate=0.1, n_estimators=101, subsample=0.6, max_depth=8, random_state=SEED)]

# StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains
# approximately the same percentage of samples of each target class as the complete set.
#kfcv = KFold(n=trainN, n_folds=NFOLD, shuffle=True, random_state=SEED)
kfcv = StratifiedKFold(y=trainY, n_folds=NFOLD, shuffle=True, random_state=SEED)
dataset_trainBlend = np.zeros((trainN, len(models)))
dataset_testBlend = np.zeros((len(testX), len(models)))
meanAuc = 0.0
for i, model in enumerate(models):
    print "model", i, "==" * 20
    dataset_testBlend_j = np.zeros((len(testX), NFOLD))
    for j, (trainI, testI) in enumerate(kfcv):
        print "fold", j, "^" * 20
        xTrain = trainX[trainI]
        xCv = trainX[testI]
        yTrain = trainY[trainI]
        yCv = trainY[testI]
        model.fit(xTrain, yTrain)
        yPred = model.predict_proba(xCv)[:, 1]
        dataset_trainBlend[testI, i] = yPred
        dataset_testBlend_j[:, j] = model.predict_proba(testX)[:, 1]
    dataset_testBlend[:, i] = dataset_testBlend_j.mean(1)
    aucScore = metrics.roc_auc_score(trainY, dataset_trainBlend[:, i])
    print "model %d, CV mean AUC: %0.9f" % (i, aucScore)
    meanAuc += aucScore
print "all models, CV mean AUC: %0.9f" % (meanAuc / len(models))
'''
0.7786
0.7814
0.7230
0.7239
0.8199
mean AUC: 0.7654
'''

print "Blending models..."
# !!! If we want to predict real values instead of probabilities, use RidgeCV
model = LogisticRegression(n_jobs=-1)
C = np.linspace(0.001, 1.0, 1000)
trainAucList = []
for c in C:
    model.C = c
    model.fit(dataset_trainBlend, trainY)
    trainProba = model.predict_proba(dataset_trainBlend)[:, 1]
    trainAuc = metrics.roc_auc_score(trainY, trainProba)
    trainAucList.append((trainAuc, c))
sortedTrainAucList = sorted(trainAucList)
for trainAuc, c in sortedTrainAucList:
    print "C=%f ==> trainAuc=%f" % (c, trainAuc)
'''
C       ==> trainAuc
0.0001  ==> 0.126...
0.001   ==> 0.807188
0.01    ==> 0.815833
0.03    ==> 0.820674
0.04    ==> 0.821295
0.05    ==> 0.821439 ***
0.06    ==> 0.821129
0.07    ==> 0.820521
0.08    ==> 0.820067
0.1     ==> 0.819036
0.3     ==> 0.813210
1.0     ==> 0.809002
10.0    ==> 0.807334
'''
model.C = sortedTrainAucList[-1][1]  # 0.05
model.fit(dataset_trainBlend, trainY)
trainProba = model.predict_proba(dataset_trainBlend)[:, 1]
print "train AUC: %f" % metrics.roc_auc_score(trainY, trainProba)  # 0.821439
print "model.coef_:", model.coef_

print "Predict and saving results..."
submitProba = model.predict_proba(dataset_testBlend)[:, 1]
df = pd.DataFrame(submitProba)
print df.describe()
saveFile(submitId, submitProba, fileName="1submit.csv")  # 0.815536
# !!! Blending made the result worse than GBC alone (0.8199):
# !!! blending models isn't a good idea when one model is obviously better than the others
'''
count    75818.000000
mean         0.039187
std          0.033691
min          0.024876
25%          0.028400
50%          0.029650
75%          0.034284
max          0.806586
'''

print "MinMaxScaler predictions to [0,1]..."
mms = preprocessing.MinMaxScaler(feature_range=(0, 1))
submitProba = mms.fit_transform(submitProba)
df = pd.DataFrame(submitProba)
print df.describe()
saveFile(submitId, submitProba, fileName="1submitScale.csv")  # 0.815536
'''
count    75818.000000
mean         0.018307
std          0.043099
min          0.000000
25%          0.004509
50%          0.006107
75%          0.012035
max          1.000000
'''


There is actually a lot more that could be said, but this article stops here; after all, a sermon that runs on too long just bores people. More to come when I take part in other competitions.


http://blog.kaggle.com/2016/02/22/profiling-top-kagglers-leustagos-current-7-highest-1/

This coincides with what the top Kaggler interviewed there (Leustagos) says:

What does your iteration cycle look like?
    1. Understand the dataset. At least enough to build a consistent validation set.
    2. Build a consistent validation set and test its relationship with the leaderboard score.
    3. Build a very simple model.
    4. Look for approaches used in similar competitions in the past.
    5. Start feature engineering, step by step, to create a strong model.
    6. Think about ensembling, be it by creating alternate versions of the feature set or by using different modeling techniques (xgb, RF, linear regression, neural nets, factorization machines, etc.).

