Brief introduction
Tree-based learning algorithms are considered among the best and most commonly used supervised learning methods. Tree-based methods give prediction models high accuracy, stability, and ease of interpretation. Unlike linear models, they map nonlinear relationships well, and they can be applied to either kind of problem at hand (classification or regression). Decision trees, random forests, and gradient boosting are widely used in data science problems, so it is important for every analyst to learn these algorithms and use them in modeling. The purpose of this paper is to introduce the basic theory of decision tree learning and present the ID3 algorithm.
What is a decision tree?
In short, a decision tree is a tree in which each branch node represents a choice among several alternatives and each leaf node represents a decision. It is a supervised learning algorithm, used mainly for classification problems, and it works with both categorical and continuous input and output variables. It is one of the most widely used and practical methods of inductive inference (inductive inference is the process of drawing general conclusions from specific examples). A decision tree learns from the given training examples and predicts the outcome for unseen data.
The graphical representation of a decision tree can be as follows:
There are several algorithms for building decision trees, some of which are:
CART (Classification and Regression Trees): uses the Gini index as the splitting measure.
ID3: uses entropy and information gain as the splitting measure.
In this paper, we will focus on the ID3 algorithm.
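To make the ID3 measures concrete, here is a minimal sketch (not the full ID3 algorithm) of how entropy and information gain can be computed for a candidate split; the toy data and the attribute name are invented for the example.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Entropy of the parent node minus the weighted entropy of the
    # child nodes produced by splitting on the given attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted_child_entropy = sum(
        (len(subset) / len(labels)) * entropy(subset)
        for subset in groups.values())
    return entropy(labels) - weighted_child_entropy

# Invented toy data: whether tennis is played, given the outlook.
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "overcast"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0 for this toy data

ID3 evaluates the information gain of every available attribute at each node and splits on the attribute with the highest gain.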
Important terms related to decision trees
Basic terms:
Root node: it represents the entire population or sample and is further divided into two or more homogeneous sets.
Splitting: This is the process of dividing a node into two or more child nodes.
Decision node: when a child node is divided into more children, it is called a decision node.
Leaf / terminal node: an undivided node is called a leaf or terminal node.
Pruning: removing the children of a decision node is called pruning. You can think of it as the opposite of splitting.
Branch / subtree: a subsection of the whole tree is called a branch or subtree.
Parent and child node: a node that is divided into child nodes is called the parent of those child nodes, and the child nodes are its children.
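These terms map directly onto a very simple data structure. The sketch below is only an illustration of the vocabulary (the field names are our own, not those of any particular library):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    attribute: Optional[str] = None   # attribute tested at a decision node
    prediction: Optional[str] = None  # class stored at a leaf / terminal node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self):
        # A node that is not split any further is a leaf (terminal) node.
        return not self.children

# The root node covers the entire sample; splitting it on an attribute
# creates child nodes, and each child together with its descendants
# forms a branch (subtree) of the root, its parent.
root = Node(attribute="outlook",
            children=[Node(prediction="no"), Node(prediction="yes")])
print(root.is_leaf(), root.children[0].is_leaf())  # False True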
Classification vs regression
A regression tree is used when the dependent variable is continuous; a classification tree is used when the dependent variable is categorical.
In the case of a regression tree, the value assigned to a terminal node is the mean of the training observations that fall in that region. Therefore, if an unseen observation falls in that region, we predict the mean value.
In the case of a classification tree, the value (class) assigned to a terminal node is the mode of the training observations in that region, i.e., the most frequent class. Therefore, if an unseen observation falls in that region, we predict the mode value.
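As a small illustration of these two prediction rules (the values below are invented):

from statistics import mean, mode

# Training observations that fell into one terminal-node region.
region_values = [200.0, 220.0, 180.0, 210.0]   # continuous target
region_classes = ["yes", "yes", "no", "yes"]   # categorical target

print(mean(region_values))    # regression tree: predict the mean, 202.5
print(mode(region_classes))   # classification tree: predict the mode, 'yes'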
Both trees divide the predictor space (the independent variables) into distinct, non-overlapping regions. For simplicity, you can think of these regions as boxes.
Both trees follow a top-down greedy approach known as recursive binary splitting. It is called "top-down" because it begins at the top of the tree, where all observations belong to a single region, and then successively splits the predictor space into two new branches further down the tree. It is called "greedy" because at each step it looks only for the best available split at that point, without considering whether a different split might lead to a better tree in the future. The splitting process continues until a user-defined stopping criterion is reached; for example, we can tell the algorithm to stop once the number of observations per node falls below 50.
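The following sketch shows what recursive binary splitting can look like for a regression tree, using the sum of squared errors as the split criterion and the 50-observation stopping rule mentioned above; it is a simplified illustration, not a production implementation.

import random
from statistics import mean

MIN_SAMPLES = 50  # user-defined stopping criterion

def sse(targets):
    # Sum of squared errors around the mean of a region.
    m = mean(targets)
    return sum((t - m) ** 2 for t in targets)

def best_split(rows, targets):
    # Greedy step: pick the (feature, threshold) pair that minimizes
    # the combined SSE of the two regions it creates.
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def grow(rows, targets):
    # Top-down recursion: stop once the node has fewer than MIN_SAMPLES
    # observations or is already homogeneous; otherwise take the best
    # available split and recurse on the two resulting regions.
    if len(rows) < MIN_SAMPLES or sse(targets) == 0:
        return {"prediction": mean(targets)}  # leaf: mean of the region
    split = best_split(rows, targets)
    if split is None:
        return {"prediction": mean(targets)}
    _, feature, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] > threshold]
    return {"feature": feature, "threshold": threshold,
            "left": grow([r for r, _ in left], [t for _, t in left]),
            "right": grow([r for r, _ in right], [t for _, t in right])}

# Invented demo data with a single feature and a step-shaped target.
random.seed(0)
rows = [[random.uniform(0, 10)] for _ in range(200)]
targets = [5.0 if r[0] < 4 else 12.0 for r in rows]
tree = grow(rows, targets)
print(tree["feature"], round(tree["threshold"], 2))  # splits close to x = 4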
In both cases, the splitting process keeps growing the tree until the stopping criterion is reached. However, a fully grown tree is likely to overfit the data, resulting in poor accuracy on unseen data. This is where "pruning" comes in. Pruning is one of the techniques used to address overfitting.
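In practice, these growth constraints and pruning are usually set through a library rather than implemented by hand. As an example only (note that scikit-learn implements CART-style trees rather than ID3, and the parameter values below are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: constrain growth so that splitting stops early.
constrained = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                     random_state=0).fit(X_train, y_train)

# Post-pruning: grow the tree, then apply cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.02,
                                random_state=0).fit(X_train, y_train)

print(constrained.score(X_test, y_test), pruned.score(X_test, y_test))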
Advantages
Easy to understand: the output of a decision tree is easy to understand, even for people from a non-analytical background. No statistical knowledge is required to read and interpret it, and its graphical representation is very intuitive.
Useful in data exploration: a decision tree is one of the fastest ways to identify the most significant variables and the relationships between two or more variables. With the help of decision trees, we can create new variables (features) that have better power to predict the target variable (see, for example, the article on tricks to enhance the power of a regression model). They can also be used in the data exploration phase: for example, when working on a problem with information available on hundreds of variables, a decision tree helps identify the most significant ones.
Requires less data cleaning: it requires less data cleaning than some other modeling techniques, and it is, to a fair degree, not influenced by outliers and missing values.
The data type is not a constraint: it can handle numeric and categorical variables.
Nonparametric methods: decision trees are considered nonparametric methods. This means that the decision tree has no assumptions about spatial distribution and classifier structure.
Disadvantages
Overfitting: overfitting is one of the most practical difficulties with decision tree models. This problem is addressed by setting constraints on model parameters and by pruning.
Not suitable for continuous variables: when working with continuous numerical variables, the decision tree loses information when it discretizes the variables into categories.
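A small invented example of this information loss: once a continuous variable is binned into categories, values inside a bin become indistinguishable to the tree.

ages = [21, 22, 29, 30, 38, 39]

def bin_age(age):
    # Discretize the continuous variable into coarse categories.
    return "20-29" if age < 30 else "30-39"

print([bin_age(a) for a in ages])
# ['20-29', '20-29', '20-29', '30-39', '30-39', '30-39']
# 29 and 30 fall into different bins although they differ by one year,
# while 21 and 29 become indistinguishable: the exact values are lost.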