Decision Tree

1. Decision trees and random forests are supervised learning methods in machine learning and are used mainly for classification problems.

Decision tree algorithms include ID3, C4.5, and CART. Algorithms built on decision trees include bagging, random forest, GBDT, and others.

A decision tree is a tree-structured decision-making model. The sample data is split on known conditions, also called features, until a tree is built whose leaf nodes give the final decision. New data can then be classified by following the tree. A random forest is an ensemble algorithm that makes its decision from multiple decision trees.

2. Case:

Figure 1 is a simple decision tree that predicts whether a loan applicant is able to repay a loan. The applicant has three main attributes: whether they own property, whether they are married, and their average monthly income. Each internal node represents a condition on one attribute, and each leaf node indicates whether the applicant is able to repay.

The attributes described here are the features used by the algorithm; in a data table they correspond to the fields.

The outcome here is either "can repay" or "cannot repay", which makes this a classification problem.
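
To make the structure of such a tree concrete, here is a minimal sketch of the loan example written as nested conditions. The order of the tests below the root and the income threshold are assumptions made for illustration only; they are not taken from Figure 1, and `can_repay` is a hypothetical helper.

```python
def can_repay(owns_property: bool, married: bool, monthly_income: float,
              income_threshold: float = 4000.0) -> bool:
    """Walk a small hand-written decision tree for the loan example.

    The test order after the root and income_threshold are illustrative
    assumptions, not values from the original figure.
    """
    if owns_property:      # internal node: owns property?
        return True        # leaf: able to repay
    if married:            # internal node: married?
        return True        # leaf: able to repay
    # internal node: is the average monthly income high enough?
    return monthly_income >= income_threshold


print(can_repay(owns_property=False, married=False, monthly_income=3000.0))
```

Each `if` corresponds to an internal node and each `return` to a leaf, which is all a trained decision tree is at prediction time.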

3. Decision tree Feature Selection:

As we can see, the first node uses "owning a property" as its condition, which is a feature. Why is "owning a property" chosen as the first condition, and what is the basis for that choice? The Gini index is described below.

The Gini index is one measure of the purity of a data set. Its formula is:

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{c} p_i^2$$

For two classes, the behavior of the formula can be checked directly. Let the two class proportions be x and y:

$$x + y = 1$$

$$f = 1 - (x^2 + y^2)$$

$$f(x) = 1 - x^2 - (1 - x)^2 = -2x^2 + 2x$$

The curve is a downward-opening parabola with its maximum of 0.5 at x = 0.5.

As you can see, the closer x is to 0 or 1, the smaller the index and the higher the purity of the data, because one class then makes up most of the data. The index is largest in the middle, where the two classes are mixed evenly, which matches our intuition.

In the Gini index formula, c is the number of classes in data set D, and p_i is the proportion of samples that belong to class i. As the formula shows, the more mixed the classes in the data set, the higher the Gini index. When D contains only one class, the Gini index takes its minimum value of 0.
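
As a quick check of these properties, the following sketch computes the Gini index of a list of class labels as one minus the sum of squared class proportions. The function name `gini_index` and the label values are illustrative assumptions.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a list of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


pure = ["repay"] * 4                              # only one class
mixed = ["repay", "repay", "default", "default"]  # two classes, half each

print(gini_index(pure))   # 0.0
print(gini_index(mixed))  # 0.5
```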

If the selected attribute is A, the Gini index of data set D after the split is:

$$\mathrm{Gini}(D, A) = \sum_{j=1}^{k} \frac{|D_j|}{|D|}\,\mathrm{Gini}(D_j)$$

where k is the number of parts into which D is divided: splitting on A produces k subsets D_1, ..., D_k.

For feature selection, choose the feature with the smallest post-split Gini index. Alternatively, the Gini index gain can be used as the selection criterion, where Gini(D) is the Gini index of D before the split:

$$\Delta\mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}(D, A)$$

When selecting a feature in a decision tree, choose the feature with the largest Gini index gain as the condition for splitting the node.
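
The selection rule can be sketched in a few lines. The toy loan records below are invented for illustration, and the helper names (`gini_after_split`, `gini_gain`) are hypothetical; the computation itself is the weighted post-split Gini index and the gain relative to the unsplit data set.

```python
from collections import Counter, defaultdict


def gini_index(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_after_split(rows, labels, feature):
    """Weighted Gini index of the subsets D_1..D_k produced by one feature."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[feature]].append(label)
    total = len(labels)
    return sum(len(s) / total * gini_index(s) for s in subsets.values())


def gini_gain(rows, labels, feature):
    return gini_index(labels) - gini_after_split(rows, labels, feature)


# Hypothetical toy data, not taken from the original example.
rows = [
    {"owns_property": True, "married": False},
    {"owns_property": True, "married": True},
    {"owns_property": False, "married": True},
    {"owns_property": False, "married": False},
]
labels = ["repay", "repay", "default", "default"]

# Pick the feature with the largest Gini index gain.
best = max(rows[0], key=lambda feature: gini_gain(rows, labels, feature))
print(best)
```

On this toy data, splitting on `owns_property` separates the classes perfectly, while `married` leaves both subsets evenly mixed, so the first feature is the one chosen for the root.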

As an alternative to the Gini index, information entropy can be used. Entropy is a concept familiar from physics: the more disordered the system, the greater the entropy, so it is not explained at length here.

Assume the sample data set D contains a mix of c classes. When building a decision tree, a feature is selected as a node of the tree based on the given sample data set. The information entropy of the data set, where p_i is the proportion of samples in class i, is:

$$H(D) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

Information gain is analyzed in the same way as the Gini index gain, and its formula is roughly similar. With D_1, ..., D_k the subsets produced by splitting D on attribute A:

$$\mathrm{Gain}(D, A) = H(D) - \sum_{j=1}^{k} \frac{|D_j|}{|D|}\,H(D_j)$$
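
For comparison with the Gini index, here is a small sketch of entropy and information gain using the standard Shannon definitions with base-2 logarithms. The function names and the sample values are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Information entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of D minus the weighted entropy of the subsets induced by one feature."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


labels = ["repay", "repay", "default", "default"]
owns_property = [True, True, False, False]  # separates the classes perfectly
married = [False, True, True, False]        # tells us nothing about the class

print(information_gain(owns_property, labels))
print(information_gain(married, labels))
```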

4. Pruning:

Pruning is divided into pre-pruning and post-pruning.

Pre-pruning stops the growth of the decision tree early. Post-pruning usually gives better results, but part of the computation spent growing the full tree is wasted on branches that are later removed.
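
The two approaches can be tried out with scikit-learn, as a sketch under some assumptions: the iris data set is used only as a stand-in, the depth and leaf limits are arbitrary, and cost-complexity pruning (`ccp_alpha`) is scikit-learn's post-pruning method, which may differ from the one the text has in mind.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No pruning: the tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: growth is stopped early by depth and leaf-size limits.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

# Post-pruning: the tree is grown first, then weak subtrees are removed
# by minimal cost-complexity pruning (the alpha value is an arbitrary choice).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

for name, tree in [("full", full_tree), ("pre", pre_pruned), ("post", post_pruned)]:
    print(name, "leaves:", tree.get_n_leaves(), "depth:", tree.get_depth())
```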

Random Forest

1. Random Forest principle:

The random forest is a classification algorithm proposed by Leo Breiman (2001). It uses bootstrap resampling to draw n samples at random, with replacement, from the original training set N, and trains a decision tree on the resulting new training set. Repeating these steps produces m decision trees that together form the random forest, and new data is classified according to the votes of the individual trees. In essence it is an improvement on the decision tree algorithm that merges multiple decision trees, where each tree is built from an independently drawn sample.

A single tree may have weak classification ability, but once a large number of trees have been generated at random, the most probable class for a test sample can be chosen from the statistics of the individual trees' results.

The approximate process for a random forest is as follows:

1) Select n samples from the sample set by random sampling with replacement;

2) Randomly select K features from all features and use them to build a decision tree on the selected samples (usually CART, but other or mixed algorithms are possible);

3) Repeat the two steps above m times to generate m decision trees, which form the random forest;

4) For new data, let every tree make its decision, then take a final vote to determine the class.
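
The four steps above can be sketched directly. This is a minimal illustration, not a production implementation: the synthetic data from `make_classification`, the values of n, k, and m, and the use of scikit-learn's `DecisionTreeClassifier` as the CART-style base learner are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

n, k, m = 200, 3, 25  # samples per tree, features per tree, number of trees
forest = []
for _ in range(m):                                                # step 3: repeat m times
    rows = rng.integers(0, len(X_train), size=n)                  # step 1: bootstrap sample
    cols = rng.choice(X_train.shape[1], size=k, replace=False)    # step 2: K random features
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[rows][:, cols], y_train[rows])
    forest.append((tree, cols))

# Step 4: every tree votes; the majority class wins (m is odd, so no ties).
votes = np.array([tree.predict(X_test[:, cols]) for tree, cols in forest])
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y_test).mean()
print(accuracy)
```

Note that this sketch picks the K features once per tree, as the steps describe; library implementations such as scikit-learn instead draw a fresh random subset of features at every split.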

2. Random Forest features:

Random forests have many advantages:

1) Each tree uses only part of the samples and part of the features, which avoids overfitting to some extent;

2) Each tree selects its samples and features at random, which gives the model good resistance to noise and stable performance;

3) It can handle very high-dimensional data without prior feature selection;

4) It is well suited to parallel computing;

5) The implementation is relatively simple.

Disadvantages:

1) Parameter tuning is relatively complex;

2) Model training and prediction are slow.

3. Use:

Random forest algorithms are implemented in most data processing software and can be called directly; you only need to specify the required parameters.

A number of parameters must be set before training a random forest model. The PAI platform implementation has the following:

o Algorithm type: (optional) The available algorithms are ID3, CART, and C4.5. The default is a hybrid that distributes the three algorithms evenly across the forest.

o Number of trees: the number of trees in the forest, range (0, 1000)

o Number of random attributes: (optional) the number of random features considered when a single tree selects the optimal feature during growth. Four options are available: log n, n/3, sqrt n, and n, where n is the total number of attributes

o Maximum tree depth: (optional) the maximum depth of a single tree, range [1, ∞); -1 means full growth.

o Minimum number of leaf node records: (optional) the minimum number of records in a leaf node. The smallest allowed value is 2

o Minimum record percentage of leaf nodes: (optional) the minimum proportion of the parent node's records that a leaf node must hold, range [0, 100]; -1 means no limit. The default is -1

o Maximum number of records per tree: (optional) the number of random records fed to a single tree in the forest. Range (1000, 1000000]

4. Model Evaluation:

After the model is built, it needs to be evaluated to judge its quality. A model is typically built on a training set and evaluated on a test set. Accuracy, recall, false alarm rate, and precision are commonly used as evaluation metrics for classification. These metrics are calculated from the confusion matrix.

The confusion matrix is used to evaluate the accuracy of a supervised learning model. Each column of the matrix represents the instances of a predicted class, and each row represents the instances of an actual class. Take binary classification as an example, as shown in the following table:

where:

P (Positive): the number of positive samples.

N (Negative): the number of negative samples.

TP (True Positive): the number of positive samples correctly predicted as positive.

FP (False Positive): the number of negative samples incorrectly predicted as positive.

FN (False Negative): the number of positive samples incorrectly predicted as negative.

TN (True Negative): the number of negative samples correctly predicted as negative.

Several metrics for evaluating a classification model can be derived from the confusion matrix.

Accuracy is the probability that positive and negative samples are classified correctly:

$$\text{accuracy} = \frac{TP + TN}{P + N}$$

Recall is the probability that a positive sample is identified as positive:

$$\text{recall} = \frac{TP}{P} = \frac{TP}{TP + FN}$$

The false alarm rate is the probability that a negative sample is mistakenly classified as positive:

$$\text{false alarm rate} = \frac{FP}{N} = \frac{FP}{FP + TN}$$

Precision is the proportion of samples classified as positive that really are positive:

$$\text{precision} = \frac{TP}{TP + FP}$$
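
As a worked example, the following sketch counts TP, FP, FN, and TN from a pair of label lists and derives the four metrics using their standard definitions. The label lists are made up for illustration, and `confusion_counts` and `evaluate` are hypothetical helper names.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary classification result."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(tp, fp, fn, tn):
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": tp / (tp + fn),
        "false_alarm_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


# Made-up labels: 4 positive and 6 negative samples.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
metrics = evaluate(*counts)
print(counts)
print(metrics)
```

Here one positive sample is missed and one negative sample raises a false alarm, so accuracy is 8/10 while recall and precision are both 3/4.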

Evaluation methods include the holdout method, random subsampling, cross-validation, and the bootstrap method.

The holdout method is the most basic way to evaluate a classification model. The labeled original data set is divided into two parts, a training set and a test set; the training set is used to train the classification model, and the test set is used to evaluate its performance. However, this method is not suitable for small samples, and the model may depend heavily on how the training and test sets are composed.

Random subsampling repeats the holdout method several times to improve the evaluation of the classifier. It is likewise unsuitable when the training data is insufficient, and some data may never be used in a training set.

Cross-validation divides the data into k parts of equal size. Each time, one part is selected as the test set and the remaining k-1 parts form the training set; this is repeated k times, so that each part is used as the test set exactly once and in the training set k-1 times. The advantage is that as much data as possible is used for training, the training and test data in each run are independent of each other, and the test sets together cover the entire data set. The disadvantage is that the classification model is run k times, so the computational overhead is large.
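
The k-fold partitioning just described can be sketched without any library. `k_fold_indices` is a hypothetical helper; it splits index positions only, without shuffling, and leaves model training to the caller.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every index appears in exactly one test fold and in the training set of the other k-1 folds, which is the coverage property the paragraph describes.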

The bootstrap method samples the training set with replacement: a record that has been selected for the training set is put back into the original data set, so it has a chance of being drawn again. It works well when the number of samples is small.
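
A bootstrap split can be sketched as follows. Using the records that were never drawn (the "out-of-bag" records) as the test set is a common convention and an assumption here, since the text does not say how the test set is formed; `bootstrap_split` is a hypothetical helper.

```python
import random


def bootstrap_split(n_samples, seed=0):
    """Draw n_samples indices with replacement; unused indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n_samples) for _ in range(n_samples)]  # with replacement
    chosen = set(train)
    test = [i for i in range(n_samples) if i not in chosen]       # never drawn
    return train, test


train_idx, test_idx = bootstrap_split(1000)
distinct_fraction = len(set(train_idx)) / 1000
print(distinct_fraction)  # in expectation about 1 - 1/e, roughly 0.632
```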

Other similar algorithms

1.Bagging

The bagging algorithm is similar to a random forest, except that each tree uses all features rather than only a subset of them. The process is as follows:

1) Randomly select n samples from the sample set;

2) Using all attributes, build a classifier (CART, SVM, ...) on those n samples;

3) Repeat the two steps above m times to obtain m classifiers (CART, SVM, ...);

4) Run the data through the m classifiers and take a final vote to decide which class it belongs to.
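
The procedure above maps closely onto scikit-learn's `BaggingClassifier`, shown here as a sketch. The synthetic data, the choice of a decision tree as the base classifier, and m = 25 are assumptions. One difference to be aware of: when the base classifier can output class probabilities, scikit-learn averages them rather than counting hard votes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # base classifier, passed positionally
    n_estimators=25,           # m classifiers
    max_samples=1.0,           # each one is trained on n = len(X_train) drawn samples
    max_features=1.0,          # every classifier sees all attributes
    bootstrap=True,            # samples are drawn with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
accuracy = bagging.score(X_test, y_test)
print(accuracy)
```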

2.GBDT

GBDT (Gradient Boosting Decision Tree) is an iterative decision tree algorithm composed of multiple decision trees, whose conclusions are summed to produce the final result. It is considered an algorithm with strong generalization ability.

GBDT is a widely used algorithm that can be used for both classification and regression. GBDT also goes by other names, such as MART (Multiple Additive Regression Tree), GBRT (Gradient Boosting Regression Tree), and TreeNet.

Unlike classification trees such as C4.5, the trees in GBDT are regression trees. In a regression tree every node (not only the leaf nodes) carries a predicted value, for example the average age of all people who belong to that node. When branching, every feature is searched to find the best split, but the criterion is no longer information gain; it is minimizing the mean squared error, for example the sum of (each person's age - predicted age)^2 divided by N. The more people are predicted wrongly, and the further off the predictions are, the larger the mean squared error, so minimizing it finds the most reliable split. Branching continues until the ages at each leaf node are all the same or a preset stopping condition is reached (such as a maximum number of leaves). If the ages at a final leaf node are not all the same, the average age of everyone at that node is used as the leaf's predicted age.

Gradient boosting means making a joint decision by iterating over multiple trees. The core of GBDT is that each tree learns the residual left by the sum of all previous trees' conclusions, where the residual is the amount that, added to the predicted value, gives the true value. For example, A's true age is 18, but the first tree predicts 12, a difference of 6, so the residual is 6. In the second tree we therefore set A's age to 6 and learn on that. If the second tree places A in a leaf with value 6, the sum of the two trees' conclusions is A's true age. If the second tree concludes 5, A still has a residual of 1, so in the third tree A's age becomes 1 and learning continues.
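
The residual-learning loop can be sketched with one-split regression trees (stumps) on a single feature. This is a simplified illustration under assumptions: real GBDT implementations use deeper trees, a learning rate, and general loss gradients, and the toy ages and helper names below are invented.

```python
def fit_stump(xs, targets):
    """Regression stump: choose the threshold that minimizes the squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # needs at least two distinct x values
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_gbdt(xs, ys, n_trees=3):
    trees, residuals = [], list(ys)
    for _ in range(n_trees):
        tree = fit_stump(xs, residuals)  # each tree learns the current residuals
        trees.append(tree)
        residuals = [r - tree(x) for x, r in zip(xs, residuals)]
    return trees


def predict(trees, x):
    return sum(tree(x) for tree in trees)  # the trees' conclusions are summed


def squared_error(trees, xs, ys):
    return sum((predict(trees, x) - y) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4]        # a single made-up feature
ages = [14, 16, 24, 26]  # made-up target ages
trees = fit_gbdt(xs, ages, n_trees=3)
print(squared_error(trees[:1], xs, ages), squared_error(trees, xs, ages))
```

The first stump predicts the group averages 15 and 25, leaving residuals of -1 and +1; the later stumps are fitted to those residuals, so the summed prediction moves closer to the true ages.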

For the specific implementation of GBDT, please consult other references.

Sample code and Parameter tuning

Data set:

Our data set comes from a well-known data mining competition website and records the survival of passengers on the Titanic. It can be downloaded here: https://www.kaggle.com/c/titanic/data

The picture above was downloaded from the official website. In general, each row has about 11 fields, including the passenger's age, name, sex, ticket class, and so on, and finally whether the passenger died or survived the accident.

Without further explanation, let's read the data.

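```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Read the Kaggle training file; Age is parsed as a float column.
train = pd.read_csv("E:/train.csv", dtype={"Age": np.float64})
train.head(10)
```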

With a little analysis, we can pick out the variables related to a passenger's survival: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked. Generally speaking, the passenger's name and ticket number should have very little influence on survival.

`len(train)  # Out: 891`

We have 891 records in total, close to 900. We use 600 as training data and the remaining 291 as test data, and tune the random forest parameters repeatedly to find the model that predicts most accurately on the test data.

Before the experiment itself, let's look at some of the parameters to be aware of when using a random forest model.

In sklearn, the random forest constructor is:

`RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)`

- Parameter analysis

A. max_features:

The maximum number of features the random forest allows a single decision tree to consider when looking for a split. sklearn provides several options for this value. Here are a few of them:

auto/None: None simply uses all features, so each tree can take advantage of all of them without restriction. Note that in RandomForestClassifier, "auto" is equivalent to "sqrt"; only None uses every feature.

sqrt: Each subtree can use the square root of the total number of features. For example, if the total number of variables (features) is 100, each subtree can take only 10 of them. "log2" is another option of the same kind.

0.2: Each subtree can use 20% of the variables (features). The "0.X" format can be used to try any percentage of the features.
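
To see what these options mean in numbers, here is a small sketch that converts a max_features setting into a feature count. The helper `features_considered` is hypothetical; it follows scikit-learn's conversion rule as I understand it (truncate to an integer, never below 1), so treat the exact rounding as an assumption.

```python
import math


def features_considered(max_features, n_features):
    """Number of features examined for a given max_features setting."""
    if max_features is None:
        return n_features                              # all features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # fraction of the features
    return int(max_features)                           # explicit count


for option in (None, "sqrt", "log2", 0.2):
    print(option, features_considered(option, 100))
```

With 100 features this gives 100, 10, 6, and 20 respectively, matching the examples in the list above.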

How does max_features affect performance and speed?

Increasing max_features generally improves model performance, because more options are considered at each node. This is not guaranteed, however, since it also reduces the diversity of the individual trees, which is the distinctive strength of random forests. What is certain is that increasing max_features slows the algorithm down. You therefore need to strike the right balance when choosing max_features.

B. n_estimators:

The number of subtrees to build before taking the majority vote or the average of their predictions. More subtrees give the model better performance but make the code slower. Choose as high a value as your processor can handle, since this makes predictions better and more stable.

C. min_samples_leaf:

If you have built a decision tree before, you will appreciate the importance of the minimum leaf size. A leaf is an end node of the decision tree. Smaller leaves make the model more prone to capturing noise in the training data. In general, I prefer to set the minimum leaf size to more than 50. In your own case, try a range of leaf sizes to find the best one.

Here we tune the three parameters mentioned above. For parameter A, our data has only seven or eight fields in total, so we leave max_features at its default and only need to tune the remaining two.

sklearn's random forest implementation requires input values to be integers or floating-point numbers, so we preprocess the data to convert strings into integers or floats:

```python
def harmonize_data(titanic):
    # Fill in missing data and convert string columns to integer codes.

    # Missing ages are replaced with the median of all ages.
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

    # Sex: male -> 0, female -> 1
    titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

    # Embarked: missing values become "S", then S -> 0, C -> 1, Q -> 2
    titanic["Embarked"] = titanic["Embarked"].fillna("S")
    titanic["Embarked"] = titanic["Embarked"].map({"S": 0, "C": 1, "Q": 2})

    # Missing fares are replaced with the median fare.
    titanic["Fare"] = titanic["Fare"].fillna(titanic["Fare"].median())
    return titanic


train_data = harmonize_data(train)
```

The code above cleans the raw data, fills in the missing values, and converts string data to integers.

Next we split the data into training and test sets. Of the 891 records in total, we use 600 as the training set and the remaining 291 as the test set.

```python
# Fields that have an impact on the survival result
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Stores each parameter combination with its accuracy as a triple (a, b, c)
results = []

# Candidate values for the minimum leaf size.
# NOTE: the upper bound is garbled in the source; 50 is an assumed value.
sample_leaf_options = list(range(1, 50, 3))

# Candidate values for the number of decision trees.
# NOTE: the upper bound is garbled in the source; 100 is an assumed value.
n_estimators_options = list(range(1, 100, 5))

# The last 291 records are the test set.
ground_truth = train_data["Survived"][600:]

for leaf_size in sample_leaf_options:
    for n_estimators_size in n_estimators_options:
        # NOTE: the random_state value is garbled in the source; 50 is assumed.
        alg = RandomForestClassifier(min_samples_leaf=leaf_size,
                                     n_estimators=n_estimators_size,
                                     random_state=50)
        alg.fit(train_data[predictors][:600], train_data["Survived"][:600])
        predict = alg.predict(train_data[predictors][600:])

        # Compare real and predicted results to get the accuracy
        accuracy = (ground_truth == predict).mean()
        # Record (min_samples_leaf, n_estimators, accuracy on the test set)
        results.append((leaf_size, n_estimators_size, accuracy))
        print(accuracy)

# Print the triple with the highest accuracy
print(max(results, key=lambda x: x[2]))
```

In general, the performance of a random forest does not fluctuate much with its parameters. Compared with a neural network, a random forest can achieve good results even with the default parameters. In our case, rough tuning reaches a prediction accuracy of 84% on the test set, which I think is a better result than might be expected.

The complete code is attached below (the running time is relatively long):

```python
__author__ = 'Administrator'

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("E:/train.csv", dtype={"Age": np.float64})


def harmonize_data(titanic):
    # Fill in missing data and convert string columns to integer codes.
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})
    titanic["Embarked"] = titanic["Embarked"].fillna("S")
    titanic["Embarked"] = titanic["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    titanic["Fare"] = titanic["Fare"].fillna(titanic["Fare"].median())
    return titanic


train_data = harmonize_data(train)

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
results = []

# NOTE: the upper bounds of both ranges and the random_state value are
# garbled in the source; 50, 100, and 50 are assumed values.
sample_leaf_options = list(range(1, 50, 3))
n_estimators_options = list(range(1, 100, 5))

# The first 600 records are the training set, the last 291 the test set.
ground_truth = train_data["Survived"][600:]

for leaf_size in sample_leaf_options:
    for n_estimators_size in n_estimators_options:
        alg = RandomForestClassifier(min_samples_leaf=leaf_size,
                                     n_estimators=n_estimators_size,
                                     random_state=50)
        alg.fit(train_data[predictors][:600], train_data["Survived"][:600])
        predict = alg.predict(train_data[predictors][600:])

        # Compare real and predicted results to get the accuracy
        accuracy = (ground_truth == predict).mean()
        # Record (min_samples_leaf, n_estimators, accuracy on the test set)
        results.append((leaf_size, n_estimators_size, accuracy))
        print(accuracy)

# Print the triple with the highest accuracy
print(max(results, key=lambda x: x[2]))
```

Open question: when there are many features, each decision tree in a random forest is built from a randomly selected subset of them. So once the final model is trained, how do you judge which features are useless and keep only the useful ones?
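
One common way to approach this question, offered as a sketch rather than the author's answer, is to inspect the `feature_importances_` attribute that scikit-learn's `RandomForestClassifier` exposes after fitting. The synthetic data below is an assumption: with `shuffle=False`, `make_classification` places the three informative features in the first three columns, and the remaining five are noise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: columns 0-2 are informative, columns 3-7 are pure noise.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; higher values mean the feature contributed more to the splits.
ranking = sorted(enumerate(forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for index, importance in ranking:
    print(f"feature {index}: {importance:.3f}")
```

Features that end up near the bottom of the ranking are candidates for removal, although that choice is best confirmed by retraining without them and comparing accuracy on the test set.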

Random forest (principle/sample implementation/parameter tuning)