Which Classifier Should I Choose?
This is one of the most important questions to ask when approaching a machine learning problem. I find it easier to just test them all at once. Here are your favorite scikit-learn algorithms applied to the leaf data.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Silence sklearn deprecation warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs): pass
import warnings
warnings.warn = warn

from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import StratifiedShuffleSplit  # moved to sklearn.model_selection in newer versions

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
Data Preparation

In [2]:
# Swiss Army knife function to organize the data
def encode(train, test):
    le = LabelEncoder().fit(train.species)
    labels = le.transform(train.species)  # encode species strings
    classes = list(le.classes_)           # save column names for submission
    test_ids = test.id                    # save test ids for submission

    train = train.drop(['species', 'id'], axis=1)
    test = test.drop(['id'], axis=1)

    return train, labels, test, test_ids, classes

train, labels, test, test_ids, classes = encode(train, test)
train.head(1)
Out[2]:

| | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | margin10 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 |
| 0 | 0.007812 | 0.023438 | 0.023438 | 0.003906 | 0.011719 | 0.009766 | 0.027344 | 0.0 | 0.001953 | 0.033203 | ... | 0.007812 | 0.0 | 0.00293 | 0.00293 | 0.035156 | 0.0 | 0.0 | 0.004883 | 0.0 | 0.025391 |

1 rows × 192 columns

Stratified Train/Test Split
Stratification is necessary for this dataset because there is a relatively large number of classes (99 classes for 990 samples). It ensures that all classes are represented in both the train and test indices.

In [3]:
sss = StratifiedShuffleSplit(labels, 10, test_size=0.2, random_state=23)

for train_index, test_index in sss:
    X_train, X_test = train.values[train_index], train.values[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
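To see concretely what stratification buys, here is a minimal pure-Python sketch of the idea (a toy stand-in for `StratifiedShuffleSplit`, using made-up labels rather than the actual leaf data): each class contributes the same fraction of its samples to the test set, so no class can vanish from either partition.

```python
from collections import Counter
import random

def stratified_split(labels, test_size=0.2, seed=23):
    """Group indices by class, then sample the same fraction from each class
    for the test set, so every class appears in both partitions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)

    test_idx = []
    for lab, idxs in by_class.items():
        idxs = idxs[:]
        rng.shuffle(idxs)
        n_test = max(1, int(round(test_size * len(idxs))))
        test_idx.extend(idxs[:n_test])

    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, test_idx

# Toy labels: 3 classes with 10 samples each (mimicking 99 classes x 10 samples)
labels = [c for c in "abc" for _ in range(10)]
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))  # every class: 8 samples
print(Counter(labels[i] for i in test_idx))   # every class: 2 samples
```

A plain random 20% split on such small per-class counts could easily leave a class with zero test samples, which would make its log loss term undefined in practice.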
Sklearn Classifier Showdown
Simply looping through out-of-the-box classifiers and printing the results. Obviously, these would perform much better after tuning their hyperparameters, but this gives you a decent ballpark idea.

In [4]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis()]

# Logging for visual comparison
log_cols = ["Classifier", "Accuracy", "Log Loss"]
log = pd.DataFrame(columns=log_cols)

for clf in classifiers:
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__

    print("="*30)
    print(name)

    print('****Results****')
    train_predictions = clf.predict(X_test)
    acc = accuracy_score(y_test, train_predictions)
    print("Accuracy: {:.4%}".format(acc))

    train_predictions = clf.predict_proba(X_test)
    ll = log_loss(y_test, train_predictions)
    print("Log Loss: {}".format(ll))

    log_entry = pd.DataFrame([[name, acc*100, ll]], columns=log_cols)
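Log loss deserves a quick illustration, since it drives the comparison as much as accuracy does: it heavily penalizes confident but wrong probability estimates, which is why a classifier with decent accuracy can still post a terrible log loss. Below is a hand-rolled sketch of the metric (a simplified stand-in for sklearn's `log_loss`, evaluated on made-up predictions, not the leaf data):

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class.
    y_true: list of integer class indices; probs: list of probability rows."""
    total = 0.0
    for yi, row in zip(y_true, probs):
        p = min(max(row[yi], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

# Two hypothetical 3-class predictions: one confident and right,
# one confident and wrong -- the wrong one dominates the average.
y_true = [0, 2]
probs = [[0.9, 0.05, 0.05],
         [0.8, 0.1, 0.1]]
print(multiclass_log_loss(y_true, probs))  # ~1.204, mostly from -log(0.1)
```

This is why the loop above calls `predict_proba` rather than reusing the hard `predict` labels: log loss is a function of the full probability distribution, not just the argmax.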