The previous article used the XGBoost CLI for binary classification; this time we do multi-class classification.
The dataset is the UCI dermatology set.
It has 34 attributes and 6 class labels. Apart from family history, which is a nominal value (0 or 1), and age, which is linear, the attributes take values from 0 to 3.
Attribute information (complete attribute documentation):

Clinical attributes (take values 0, 1, 2, 3, unless otherwise indicated):
1: erythema
2: scaling
3: definite borders
4: itching
5: koebner phenomenon
6: polygonal papules
7: follicular papules
8: oral mucosal involvement
9: knee and elbow involvement
10: scalp involvement
11: family history (0 or 1)
34: age (linear)

Histopathological attributes (take values 0, 1, 2, 3):
12: melanin incontinence
13: eosinophils in the infiltrate
14: PNL infiltrate
15: fibrosis of the papillary dermis
16: exocytosis
17: acanthosis
18: hyperkeratosis
19: parakeratosis
20: clubbing of the rete ridges
21: elongation of the rete ridges
22: thinning of the suprapapillary epidermis
23: spongiform pustule
24: munro microabcess
25: focal hypergranulosis
26: disappearance of the granular layer
27: vacuolisation and damage of basal layer
28: spongiosis
29: saw-tooth appearance of retes
30: follicular horn plug
31: perifollicular parakeratosis
32: inflammatory mononuclear infiltrate
33: band-like infiltrate
The raw data looks like this:
1,1,2,3,2,2,0,3,0,0,0,2,0,0,0,2,2,1,2,0,0,0,0,0,3,0,3,0,3,1,0,2,3,50,3
3,2,1,2,0,0,0,0,1,2,0,0,0,1,0,0,2,0,3,2,2,2,1,2,0,2,0,0,0,0,0,1,0,50,1
3,2,0,2,0,0,0,0,0,0,0,0,1,2,0,2,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,10,2
2,3,3,3,3,0,0,0,3,3,0,0,0,0,0,0,3,2,2,3,3,3,1,3,0,0,0,0,0,0,0,1,0,34,1
2,2,1,0,0,0,0,0,1,0,1,0,0,2,0,0,2,1,2,2,1,2,0,1,0,0,0,0,0,0,0,0,0,?,1
2,1,0,0,2,0,0,0,0,0,0,0,0,0,0,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,?,4
2,2,1,2,0,0,0,0,0,0,0,0,0,2,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,?,2
2,1,2,3,2,3,0,2,0,0,1,1,0,0,0,2,1,1,2,0,0,0,0,0,1,0,2,0,2,0,0,0,3,?,3
The last column is the class label; a question mark indicates that the age is unknown.
This time we train with a Python script instead of the XGBoost CLI, which is more convenient:
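Both quirks of the file, the '?' ages and the 1-based labels, can be normalized while loading. A minimal sketch using the converters argument of np.loadtxt, with an io.StringIO stand-in for dermatology.data (column indices are 0-based, so column 33 is age and column 34 is the label):

```python
import io
import numpy as np

# two sample rows from the dataset: one with a known age, one with '?'
sample = io.StringIO(
    "1,1,2,3,2,2,0,3,0,0,0,2,0,0,0,2,2,1,2,0,0,0,0,0,3,0,3,0,3,1,0,2,3,50,3\n"
    "2,1,0,0,2,0,0,0,0,0,0,0,0,0,0,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,?,4\n"
)

data = np.loadtxt(
    sample, delimiter=',',
    converters={
        # age column: 1 if unknown ('?'), else 0
        # (the token may arrive as str or bytes depending on NumPy version)
        33: lambda x: int(x in ('?', b'?')),
        # label column: shift 1..6 down to 0..5, as XGBoost requires
        34: lambda x: int(x) - 1,
    },
)

print(data[:, 33])  # [0. 1.]
print(data[:, 34])  # [2. 3.]
```

The same converters appear in the training script below; this just isolates what they do.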
train.py:
#!/usr/bin/python
from __future__ import division

import numpy as np
import xgboost as xgb

# labels need to be 0 to num_class - 1
data = np.loadtxt('./dermatology.data', delimiter=',',
                  converters={33: lambda x: int(x == '?'),
                              34: lambda x: int(x) - 1})
sz = data.shape

# 70/30 train/test split
train = data[:int(sz[0] * 0.7), :]
test = data[int(sz[0] * 0.7):, :]

train_X = train[:, :33]
train_Y = train[:, 34]

test_X = test[:, :33]
test_Y = test[:, 34]

xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = 6

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 5
bst = xgb.train(param, xg_train, num_round, watchlist)

# get prediction
pred = bst.predict(xg_test)
error_rate = np.sum(pred != test_Y) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))

# do the same thing again, but output probabilities
param['objective'] = 'multi:softprob'
bst = xgb.train(param, xg_train, num_round, watchlist)

# Note: this convention has been changed since xgboost-unity
# get prediction; this is a 1D array, needs reshape to (ndata, nclass)
pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], 6)
pred_label = np.argmax(pred_prob, axis=1)
error_rate = np.sum(pred_label != test_Y) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))
The code is also very simple. Worth noting are the handling of the '?' entries in the age column, the shift of the labels so that they start from 0, and the two training objectives:

multi:softprob

and

multi:softmax
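The difference between the two: multi:softmax returns the predicted class index directly, while multi:softprob returns one probability per (row, class) pair, flattened to a 1D array in older XGBoost versions. Reshaping to (ndata, nclass) and taking the argmax of each row recovers the same labels softmax would give. A small NumPy-only sketch with a hypothetical flattened output for two test rows:

```python
import numpy as np

nclass = 6
# hypothetical flattened multi:softprob output for 2 rows (6 probs each)
flat = np.array([0.05, 0.10, 0.60, 0.10, 0.10, 0.05,
                 0.70, 0.05, 0.05, 0.10, 0.05, 0.05])

prob = flat.reshape(-1, nclass)   # shape (ndata, nclass)
labels = np.argmax(prob, axis=1)  # per-row most probable class

print(labels)  # [2 0]
```

Use softprob when you need the full probability distribution (e.g. for thresholding or ranking); softmax is enough when only the predicted label matters.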