The sklearn module provides a ready-made decision tree implementation, so there is no need to reinvent the wheel (building one yourself would be rather complicated anyway):
Here are the notes:
Introduction to the sklearn.tree parameters and recommendations for using them
Official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=None, min_density=None, compute_importances=None, max_leaf_nodes=None)
The most important parameters (a short construction example follows this list):
criterion: specifies the criterion used to choose the best splitting attribute; there are two options: 'gini' and 'entropy'.
max_depth: limits the maximum depth of the decision tree; useful for preventing overfitting.
min_samples_leaf: limits the minimum number of samples a leaf node must contain; useful for preventing the data fragmentation problem mentioned above.
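As a minimal sketch (the parameter values here are purely illustrative, not recommendations from the original tutorial), these three parameters might be combined like this:

from sklearn.tree import DecisionTreeClassifier

# Illustrative settings: entropy as the split criterion, a shallow tree,
# and at least 5 samples per leaf to limit overfitting and fragmentation.
clf = DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=5)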
Some important attributes and methods in the module (illustrated in the sketch after this list) are:
n_classes_: the number of classes in the decision tree.
classes_: returns all class labels of the decision tree.
feature_importances_: the importance of each feature; the larger the value, the more important the feature.
fit(X, y, sample_mask=None, X_argsorted=None, check_input=True, sample_weight=None): feeds the dataset X and the label set y into the classifier for training. One parameter worth noting is sample_weight: an array as long as the number of samples, carrying the weight of each sample.
get_params(deep=True): gets the parameters of the decision tree.
set_params(**params): sets the parameters of the decision tree.
predict(X): feeds samples X to the classifier and returns its predictions; multiple samples can be passed at once.
transform(X, threshold=None): returns only the more important features of X, which is effectively a way of trimming the data.
score(X, y, sample_weight=None): returns the mean accuracy of the classifier on the test set X, y.
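A minimal sketch of these attributes and methods, using the bundled iris dataset purely for illustration (the dataset and parameter values are assumptions, not part of the original tutorial):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

clf = DecisionTreeClassifier(criterion='entropy')
# sample_weight: one weight per sample, the same length as X; here every sample weighs 1.0.
clf.fit(X, y, sample_weight=np.ones(len(X)))

print(clf.n_classes_)                 # number of classes (3 for iris)
print(clf.classes_)                   # the class labels: [0 1 2]
print(clf.feature_importances_)       # larger value = more important feature
print(clf.predict(X[:5]))             # predictions for several samples at once
print(clf.score(X, y))                # mean accuracy on (X, y)
print(clf.get_params()['criterion'])  # read back a parameter
clf.set_params(max_depth=3)           # adjust a parameter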
Usage recommendations
1. When the data has many features, make sure there is enough data to support the algorithm; otherwise it is easy to overfit.
2. PCA is one way to avoid overfitting on high-dimensional data.
3. Start exploring from a smaller tree and print it out using the export method.
4. Use the max_depth parameter to grow the tree gradually and test the model at each depth to find the best one (see the sketch after this list).
5. Use the min_samples_split and min_samples_leaf parameters to control the number of samples in leaf nodes and prevent overfitting.
6. Balance the classes in the training data so that no single class dominates.
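A minimal sketch of recommendations 4 and 5, assuming the bundled iris dataset and sklearn.model_selection (which replaced sklearn.cross_validation in newer scikit-learn releases); the depth range and min_samples_leaf value are only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Grow the tree one level at a time and watch where the test score stops improving.
for depth in range(1, 8):
    clf = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=3)
    clf.fit(X_train, y_train)
    print('depth=%d  train=%.3f  test=%.3f'
          % (depth, clf.score(X_train, y_train), clf.score(X_test, y_test)))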
Now for the hands-on part:
# -*- coding: utf-8 -*-
from sklearn import tree
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split  # in newer scikit-learn this lives in sklearn.model_selection
import numpy as np

# Read the data.
data = []
labels = []
# Write the data into the lists according to the format of the text file.
with open(r'C:\Users\cchen\Desktop\sample.txt', 'r') as f:
    for line in f:
        linelist = line.split(' ')
        data.append([float(el) for el in linelist[:-1]])
        labels.append(linelist[-1].strip())
# print data
# [[1.5, 50.0], [1.5, 60.0], [1.6, 40.0], [1.6, 60.0], [1.7, 60.0], [1.7, 80.0], [1.8, 60.0], [1.8, 90.0], [1.9, 70.0], [1.9, 80.0]]
x = np.array(data)
labels = np.array(labels)
# print labels
# ['thin' 'fat' 'thin' 'fat' 'thin' 'fat' 'thin' 'fat' 'thin' 'fat']
y = np.zeros(labels.shape)
# print y
# [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
# print labels == 'fat'
# [False True False True False True]
# This substitution trick is neat: a boolean mask is used to assign values to the array,
# which would otherwise take a loop.
y[labels == 'fat'] = 1
# print y
# [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
# Split the data into training and test sets, with 20% used for testing. A manual slice would
# also work, but this looks more professional.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Train the decision tree, using information entropy as the splitting criterion.
clf = tree.DecisionTreeClassifier(criterion='entropy')
# print clf
# DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
#     max_features=None, max_leaf_nodes=None,
#     min_impurity_split=1e-07, min_samples_leaf=1,
#     min_samples_split=2, min_weight_fraction_leaf=0.0,
#     presort=False, random_state=None, splitter='best')
clf.fit(x_train, y_train)
# Write the decision tree to a file.
with open(r'C:\Users\cchen\Desktop\tree.dot', 'w+') as f:
    f = tree.export_graphviz(clf, out_file=f)
# digraph Tree {
# node [shape=box];
# 0 [label="x[1] <= 70.0\nentropy = 0.9544\nsamples = 8\nvalue = [3, 5]"];
# 1 [label="x[0] <= 1.65\nentropy = 0.971\nsamples = 5\nvalue = [3, 2]"];
# 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"];
# 2 [label="x[1] <= 55.0\nentropy = 0.9183\nsamples = 3\nvalue = [1, 2]"];
# 1 -> 2;
# 3 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"];
# 2 -> 3;
# 4 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 2]"];
# 2 -> 4;
# 5 [label="entropy = 0.0\nsamples = 2\nvalue = [2, 0]"];
# 1 -> 5;
# 6 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"];
# 0 -> 6 [labeldistance=2.5, labelangle=-45, headlabel="False"];
# }
# The coefficients reflect the influence of each feature.
# print clf.feature_importances_
# [0.3608012 0.6391988]  -- the second feature (weight) has the larger influence
# Test the results.
anwser = clf.predict(x_train)
# print x_train
print anwser
# [1. 0. 1. 0. 1. 0. 1. 0.]
print y_train
# [1. 0. 1. 0. 1. 0. 1. 0.]
print np.mean(anwser == y_train)
# 1.0 -- very accurate, but then again this is the training data
# Now let's look at the test data.
anwser = clf.predict(x_test)
print anwser
# [0. 0.]
print y_test
# [0. 0.]
print np.mean(anwser == y_test)
# 1.0 -- also very accurate
# The following note is from the tutorial; I did not change it.
# Precision and recall:
# Precision: the proportion of samples predicted as a class that really belong to it.
# Recall: the proportion of samples of a class that are correctly predicted.
# Predicted: array([0., 1., 0., 1., 0., 1., 0., 1., 0., 0.])
# Actual:    array([0., 1., 0., 1., 0., 1., 0., 1., 0., 1.])
# Precision for thin is 0.83: the classifier predicted 6 samples as thin, 5 of which are correct, so 5/6 = 0.83.
# Recall for thin is 1.00: there are 5 thin samples in the dataset and the classifier found all of them
# (even though one fat was also classified as thin), so 5/5 = 1.
# Precision for fat is 1.00; no need to dwell on it.
# Recall for fat is 0.80: there are 5 fat samples in the dataset but the classifier only found 4
# (one fat was classified as thin), so 4/5 = 0.80.
# In this example the goal is either to make sure that the fat found are really fat (precision),
# or to find as many fat as possible (recall).
precision, recall, thresholds = precision_recall_curve(y_train, clf.predict(x_train))
print precision, recall, thresholds
# [1. 1.] [1. 0.] [1.]
anwser = clf.predict_proba(x)[:, 1]
print classification_report(y, anwser, target_names=['thin', 'fat'])
#              precision    recall  f1-score   support
#        thin       1.00      1.00      1.00         5
#         fat       1.00      1.00      1.00         5
# avg / total       1.00      1.00      1.00        10
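One small caveat: the classification_report above is computed on the full dataset, which includes the samples the tree was trained on, so the perfect scores are not surprising. A sketch of the same check on the held-out 20% (reusing the variables from the script above) might look like this:

# Evaluate on the held-out test split rather than on the training data.
# target_names is omitted because the small test split may not contain both classes.
test_pred = clf.predict(x_test)
print(classification_report(y_test, test_pred))
print(np.mean(test_pred == y_test))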