1. Is a tree-based model better than a linear model?
Why use a tree model if I can use logistic regression for classification problems and linear regression for regression problems? Many of us have this question. In fact, you can use either; it depends on the type of problem you are trying to solve. A few key factors will help you decide which algorithm to use:
- If the relationship between the dependent and independent variables is well approximated by a linear model, then linear regression will outperform the tree-based model.
- If there is a highly nonlinear and complex relationship between the dependent and independent variables, a tree model will outperform the classical regression method (see the sketch after this list).
- If you need to build a model that is easy to explain to people, decision tree models are always better than linear models. Decision tree models are even easier to interpret than linear regression!
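As a quick illustration of the second point, here is a minimal sketch that fits a linear regression and a decision tree to the same nonlinear relationship and compares their R^2 scores on held-out data. The synthetic sine-shaped dataset and the scikit-learn models are illustrative assumptions, not something prescribed by the text.

```python
# Minimal sketch: linear model vs. decision tree on a nonlinear relationship.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))                     # single feature
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=500)   # nonlinear target plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)

# On this highly nonlinear relationship the tree should score much higher.
print("Linear regression R^2:", linear.score(X_test, y_test))
print("Decision tree R^2:    ", tree.score(X_test, y_test))
```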
2. What are the key parameters of tree modeling? How can I avoid over-fitting a decision tree?
Overfitting is one of the main challenges in decision tree modeling. If no limits are set, a tree can reach 100% accuracy on the training set, because in the worst case it will end up with one leaf per observation. Therefore, when modeling a decision tree, it is critical to prevent overfitting, which can be done in two ways:
- Set constraints on tree size
- Tree pruning
Let's briefly discuss both approaches.
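Before going through each approach, here is a minimal sketch of the overfitting claim above, assuming scikit-learn and a toy dataset (both illustrative choices rather than anything from the text): with no limits set, a fully grown tree typically reaches 100% accuracy on the training set while scoring noticeably lower on held-out data.

```python
# Minimal sketch: an unconstrained decision tree memorizes the training set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no size limits

print("Training accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("Test accuracy:    ", tree.score(X_test, y_test))    # noticeably lower
print("Number of leaves: ", tree.get_n_leaves())
```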
Set constraints on tree size
This can be done by using the various parameters that define a tree. The key parameters are listed below; a sketch mapping them to scikit-learn's parameter names follows the list.
- 1. Minimum samples for a node split (min_samples_split)
- Defines the minimum number of samples (or observations) required in a node for it to be considered for splitting.
- Used to control overfitting. Higher values prevent the model from learning relations that may be highly specific to the particular sample selected for the tree.
- Values that are too high can lead to underfitting, so this parameter should be tuned using CV (cross-validation).
- 2. Minimum samples for a terminal node, or leaf (min_samples_leaf)
- Defines the minimum number of samples (or observations) required in a terminal node or leaf.
- Used to control overfitting, similar to min_samples_split.
- In general, lower values should be chosen for imbalanced class problems, because the regions in which the minority class is in the majority will be very small.
- 3. Maximum depth of the tree (max_depth)
- The maximum vertical depth of the tree.
- Used to control overfitting, since a greater depth allows the model to learn relations that are very specific to a particular sample.
- Should be tuned using CV.
- 4. Maximum number of terminal nodes (max_leaf_nodes)
- The maximum number of terminal nodes, or leaves, in the tree.
- Can be defined instead of max_depth. Since binary trees are created, a depth of n produces at most 2^n leaves.
- 5. Maximum features to consider for a split (max_features)
- The number of features to consider when searching for the best split; these should be randomly selected.
- As a rule of thumb, the square root of the total number of features works well, but values up to 30-40% of the total number of features are worth checking.
- Higher values can lead to overfitting.
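As mentioned above, here is a minimal sketch of how these five constraints map onto scikit-learn's DecisionTreeClassifier parameters. The specific values and the dataset are illustrative assumptions; in practice they should be tuned with cross-validation.

```python
# Minimal sketch: size constraints expressed as scikit-learn parameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

clf = DecisionTreeClassifier(
    min_samples_split=20,   # 1. minimum samples required to split a node
    min_samples_leaf=10,    # 2. minimum samples required at a terminal node (leaf)
    max_depth=5,            # 3. maximum (vertical) depth of the tree
    max_leaf_nodes=32,      # 4. maximum number of terminal nodes (2^5 = 32)
    max_features="sqrt",    # 5. features considered per split (sqrt of total)
    random_state=0,
)

# Cross-validation (CV) is what the text refers to for tuning these values.
print("Mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```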
Tree pruning
Pruning can further improve the performance of a tree. It removes branches that use features of low importance, reducing the complexity of the tree and thereby improving its predictive power by reducing overfitting.
Pruning can start at the root or at the leaves. The simplest method starts at the leaves and replaces each node with its most popular class, keeping the change only if accuracy is not reduced. This is known as reduced error pruning. More sophisticated methods, such as cost complexity pruning, use a learning parameter (α) to weigh whether a node can be removed based on the size of its subtree; this is also known as weakest link pruning.
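Here is a minimal sketch of cost complexity (weakest link) pruning as exposed by scikit-learn: cost_complexity_pruning_path returns the effective values of α at which subtrees are collapsed, and larger ccp_alpha values give smaller, more heavily pruned trees. The dataset and the use of a hold-out validation split are illustrative assumptions.

```python
# Minimal sketch: cost complexity (weakest link) pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Effective alphas along the pruning path, from no pruning to a single node.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(f"alpha={alpha:.5f}  leaves={tree.get_n_leaves()}  "
          f"val accuracy={tree.score(X_val, y_val):.3f}")
```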
Advantages of CART
- Easy to understand, interpret, and visualize.
- Decision trees implicitly perform variable screening or feature selection.
- They can handle both numerical and categorical data, and can also handle multi-output problems.
- Decision trees require relatively little data preparation from the user.
- Nonlinear relationships between parameters do not affect tree performance.
Disadvantages of CART
- Decision tree learners can create overly complex trees that do not generalize well to new data; this is known as overfitting.
- Decision trees can be unstable, because small changes in the data can result in a completely different tree. This is called variance, and it can be reduced by methods such as bagging and boosting (see the sketch after this list).
- Greedy algorithms cannot guarantee returning the globally optimal decision tree. This can be mitigated by training multiple trees in which the features and samples are randomly sampled with replacement.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting a decision tree.
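As referenced in the list above, here is a minimal sketch of reducing a single tree's variance by bagging, i.e. training many trees on bootstrap samples (random sampling with replacement) and aggregating their votes. scikit-learn's BaggingClassifier uses a decision tree as its default base estimator; the dataset and settings are illustrative assumptions.

```python
# Minimal sketch: bagging many trees to reduce the variance of a single tree.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(n_estimators=100, random_state=0)  # default base estimator is a decision tree

print("Single tree CV accuracy: ", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```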
3. What criteria are used to determine the correct final tree size?
A common approach is to divide the available data into two sets: a training set, used to form the learned hypothesis, and a separate validation set, used to evaluate the accuracy of this hypothesis and, in particular, to evaluate the impact of pruning it.
The motivation is this: even though the learner may be misled by random and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations. The validation set can therefore be expected to provide a safety check against overfitting.
Of course, the validation set must be large enough to itself provide a statistically significant sample of the instances. A common heuristic is to withhold one third of the available examples for the validation set and use the other two thirds for training.
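A minimal sketch of this heuristic, assuming scikit-learn and a toy dataset: hold out one third of the examples as a validation set and keep the tree size (expressed here via max_leaf_nodes, an illustrative choice) that gives the best validation accuracy.

```python
# Minimal sketch: choosing the final tree size with a 2/3 train, 1/3 validation split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

best_size, best_acc = None, 0.0
for n_leaves in (2, 4, 8, 16, 32, 64):           # candidate tree sizes (illustrative)
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    acc = tree.fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_size, best_acc = n_leaves, acc

print("Best max_leaf_nodes:", best_size, " validation accuracy:", best_acc)
```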
4. How do we use validation sets to prevent overfitting?
An approach known as reduced error pruning (Quinlan 1987) considers each decision node in the tree as a candidate for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples associated with that node.
A node is removed only if the resulting pruned tree performs no worse than the original over the validation set. Nodes are pruned iteratively, always choosing the node whose removal most increases the accuracy of the decision tree over the validation set. Pruning continues until further pruning is harmful (that is, until it decreases the accuracy of the tree over the validation set).
(Figure: the effect of reduced error pruning in decision tree learning. Accuracy over the test set increases as nodes are pruned from the tree. Here, the validation set used for pruning is distinct from both the training and test sets; accuracy over the validation set is not shown.)
In this setting, the available data is divided into three subsets: the training examples, the validation examples used for pruning the tree, and a set of test examples used to provide an unbiased estimate of accuracy over future unseen examples.
Using a separate set of data to guide pruning is an effective approach when a large amount of data is available. A common heuristic is to use 60% of the data for the training set, 20% for the validation set, and 20% for the test set. The main drawback of this approach is that when data is limited, withholding part of it for the validation set further reduces the number of examples available for training.
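A minimal sketch of the 60/20/20 heuristic, assuming scikit-learn, a toy dataset, and cost complexity pruning as the pruning mechanism (all illustrative choices): the validation set guides the choice of pruning strength, and the untouched test set gives the unbiased estimate of accuracy on unseen examples.

```python
# Minimal sketch: 60% train / 20% validation / 20% test.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Carve off 20% as the test set, then take 25% of the remaining 80% (i.e. 20%
# of the total) as the validation set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Use the validation set to pick the pruning strength...
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
    .fit(X_train, y_train)
    .score(X_val, y_val),
)

# ...and the test set for an unbiased estimate on unseen examples.
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
print("Chosen ccp_alpha:", best_alpha)
print("Test accuracy:   ", final_tree.score(X_test, y_test))
```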