**Learning is a step-by-step process, so let's first understand what a decision tree is. As the name suggests, a decision tree is a tree structure used to make decisions, that is, to reach a judgment about something. How does it reach that judgment, and why? These are questions worth thinking about.**

**See the following two simple examples:**

**Example 1**

Imagine a girl's mother wants to introduce a boyfriend to her daughter. The daughter decides whether or not she wants to meet him based on her own criteria, so the following conversation takes place:

Daughter: How old is he?

Mother: 26.

Daughter: Is he handsome?

Mother: Very handsome.

Daughter: Does he have a high income?

Mother: Not very high; moderate.

Daughter: Is he a civil servant?

Mother: Yes, he works at the tax bureau.

Daughter: Well, I'll meet him.

This girl's decision-making process is a typical classification-tree decision. It is equivalent to dividing men into two classes, meet and don't meet, by age, looks, income, and whether they are a civil servant. Suppose the girl's requirement for a man is: a civil servant under 30 years old, with medium or better looks and a medium or higher income. Then Figure 1 represents the girl's decision logic.
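As a minimal sketch, the girl's decision logic can be written as nested conditionals, which is exactly what a decision tree encodes. The attribute and category names below (such as `"handsome"` and `"medium"`) are my own labels for the rules described above, not part of the original example:

```python
def meet(age, looks, income, civil_servant):
    """Return True ('meet') or False ('don't meet') following the rule above:
    a civil servant under 30 with medium-or-better looks and a
    medium-or-above income. Category names are illustrative labels."""
    if age >= 30:                              # too old: don't meet
        return False
    if looks not in ("medium", "handsome"):    # looks below medium: don't meet
        return False
    if income not in ("medium", "high"):       # low income: don't meet
        return False
    return civil_servant                       # finally, must be a civil servant

# The suitor from the conversation: 26, handsome, moderate income, civil servant.
print(meet(26, "handsome", "medium", True))   # True: she agrees to meet him
```

Each `if` corresponds to one level of the tree: a test on an attribute, with branches for its outcomes.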

Figure 1

**Example 2**

This example is from the book *Machine Learning* by Tom M. Mitchell:

Mr. Wang wants to use next week's weather forecast to work out on which days people will play tennis. He understands that whether people decide to play depends mainly on the weather. The outlook can be sunny, overcast, or rain; the temperature is expressed in degrees Fahrenheit; the relative humidity is expressed as a percentage; and the wind is either present or absent. From this we can construct the following decision tree (which classifies a day by its weather and decides whether to play tennis that day):

Figure 2

The preceding decision tree corresponds to the following expressions:

(Outlook = Sunny ∧ Humidity ≤ 70) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
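The expression can be checked directly in code. The sketch below (the function and value names are my own) returns True exactly when the disjunction above is satisfied:

```python
def play_tennis(outlook, humidity, wind):
    """Evaluate the decision-tree expression:
    (Outlook = Sunny AND Humidity <= 70) OR (Outlook = Overcast)
    OR (Outlook = Rain AND Wind = Weak)."""
    return ((outlook == "sunny" and humidity <= 70)
            or outlook == "overcast"
            or (outlook == "rain" and wind == "weak"))

print(play_tennis("sunny", 65, "strong"))  # True: sunny with humidity <= 70
print(play_tennis("rain", 80, "strong"))   # False: rainy with a strong wind
```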

Having seen these two examples, we can define a decision tree: a flowchart-like tree structure in which

- each non-leaf node denotes a test on an attribute;

- each branch represents an outcome of the test;

- each leaf node holds a class label.

Now suppose we already have a decision tree in hand. Two questions then arise:

1: How do we use a decision tree to classify?

Using it is very simple. Given a tuple, start from the root of the decision tree and follow the path whose tests the tuple satisfies, node by node, down to a leaf; the class prediction stored at that leaf is the answer we want. Hence it is easy to convert a decision tree into classification rules (each root-to-leaf path yields one rule for the class it ends in).
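A minimal sketch of this root-to-leaf traversal, using a nested-dict tree of my own construction for the tennis example (the node structure and value names are assumptions, not the book's representation):

```python
# A decision tree as nested dicts: an internal node maps an attribute name
# to a dict of {attribute value: subtree}; a leaf is simply a class label.
tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"wind": {"weak": "yes", "strong": "no"}},
    }
}

def classify(tree, tuple_):
    """Walk from the root to a leaf, following the branch that matches
    the tuple's value for each tested attribute."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))               # attribute tested at this node
        tree = tree[attribute][tuple_[attribute]]  # follow the matching branch
    return tree                                    # a leaf: the class prediction

print(classify(tree, {"outlook": "rain", "wind": "weak"}))  # yes
```

The while loop is the whole "use" of the tree: each iteration applies one test and descends one level.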

2: Why are decision tree classifiers so popular?

1: Decision tree construction requires no domain knowledge and no parameter setting, so it is well suited to exploratory knowledge discovery;

2: decision trees can handle high-dimensional data;

3: the form of a decision tree is intuitive and easy to understand;

4: decision tree classifiers generally achieve good accuracy.

With the above simple preparations, I believe you can't wait to construct a decision tree. So the question is, how can we construct this beautiful tree?

Obviously, we can guess something:

1. Since it is a tree, constructing it must start from the root; the other nodes are meaningless without a root. Therefore, decision tree construction will be a top-down process (with the root at the top);

2. As we extend the tree downward, selecting an attribute at each step is a recursive process, and we will need a corresponding measure for making that selection.

Next, we introduce two algorithms that solve this problem: ID3 and C4.5. Since C4.5 is an optimized version of ID3, let's first look at how ID3 works.

Open Wikipedia:

The ID3 algorithm (Iterative Dichotomiser 3) is an algorithm for building decision trees invented by Ross Quinlan.

The algorithm is based on Occam's razor: the smaller the decision tree, the better (other things being equal, a simpler theory is preferable). However, the algorithm does not always generate the smallest possible tree; it is a heuristic. ID3 makes Occam's razor concrete through the notion of information entropy.

The ID3 algorithm can be summarized as follows:

- take every unused attribute and compute the entropy of the samples with respect to it;
- select the attribute for which this entropy is smallest (equivalently, whose information gain is largest);
- generate the node that tests this attribute.

Seeing this, you should already have some feel for ID3. Here is a summary of the ID3 idea, as found online:

- **Construct the decision tree by a top-down greedy search** through the space of possible decision trees (this method is the basis of both the ID3 and C4.5 algorithms);
- start from the question "**which attribute should be tested at the root of the tree?**";
- use a statistical test to determine how well each instance attribute, taken alone, classifies the training samples, and **test at the root the attribute with the best classification ability** (how do we define or judge the best classification ability? That is exactly the information gain, or gain ratio, described below);
- generate a branch for each possible value of the root attribute, and **sort the training samples into the appropriate branches** (that is, the branch corresponding to each sample's value of that attribute);
- repeat this process, using the training samples associated with each branch node to select the best attribute to test at that point.
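The steps above can be sketched as a short recursive procedure. This is a simplified illustration under my own naming and data representation (rows as dicts, labels as a parallel list), not Quinlan's original implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def id3(rows, labels, attributes):
    """Top-down greedy construction: pick the attribute with the highest
    information gain, split the samples on it, and recurse on each branch."""
    if len(set(labels)) == 1:          # all samples in one class: make a leaf
        return labels[0]
    if not attributes:                 # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        # Entropy before the split minus the weighted entropy after it.
        split = {}
        for row, label in zip(rows, labels):
            split.setdefault(row[attr], []).append(label)
        remainder = sum(len(part) / len(labels) * entropy(part)
                        for part in split.values())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)   # greedy choice: largest information gain
    node = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in {row[best] for row in rows}:
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        node[best][value] = id3(list(sub_rows), list(sub_labels), rest)
    return node

# Tiny illustration: 'outlook' perfectly separates the labels, so it is chosen.
rows = [{"outlook": "sunny", "wind": "weak"},
        {"outlook": "sunny", "wind": "strong"},
        {"outlook": "rain", "wind": "weak"},
        {"outlook": "rain", "wind": "strong"}]
labels = ["yes", "yes", "no", "no"]
print(id3(rows, labels, ["outlook", "wind"]))
```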

Now that we know how ID3 constructs a decision tree, let's discuss the parts that are still unclear.

A: During the construction of a decision tree, how do we determine which attribute is best, that is, which attribute should we use to extend the tree?

Obviously, this is a competition among attributes, and a competition needs a judging criterion; the professional term for it is an attribute selection measure. There are three common attribute selection measures: information gain (ID3), gain ratio (C4.5), and the Gini index (CART). ID3 uses information gain as its attribute selection measure, so this article describes only information gain. Information gain builds on entropy, which Shannon proposed in his information theory while studying the value or "information content" of messages. Let node N represent (hold) the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness, or "impurity", in those partitions. This approach minimizes the expected number of tests needed to classify a given tuple and helps guarantee that a simple tree is found (remember Occam's razor?).
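As a rough sketch (the helper names are my own), the three measures mentioned above can be written side by side. Each one scores a candidate split: `parent` is the list of class labels before the split and `parts` the label lists of the resulting partitions:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (CART)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, parts):
    """Entropy of the parent minus the weighted entropy of the parts (ID3)."""
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in parts)

def gain_ratio(parent, parts):
    """Information gain normalized by the split's own entropy (C4.5)."""
    n = len(parent)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in parts)
    return information_gain(parent, parts) / split_info

# A perfect binary split of four samples: gain is one full bit.
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

C4.5's normalization exists because raw information gain favors attributes with many values; dividing by the split's own entropy corrects that bias.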

(Because the book's treatment is detailed enough and not obscure, its content is quoted here once again.)

The following is a supplement:

**In information gain, the measure is how much information a feature brings to the classification system: the more information it brings, the more important the feature. For a given feature, the amount of information in the system differs between having it and not having it, and the difference between the two is the amount of information the feature brings to the system. This amount of information is exactly the entropy.**

**Therefore, after computing the Gain() value of every attribute, we pick the attribute with the largest one!**

Now that we understand the concepts of entropy and information gain and how to compute them, let's work through an example by hand.

Eg:

Problem description: based on the information in the preceding table, generate a decision tree that predicts whether a customer will buy a computer.

Steps:

1. Calculate Info (buy_computer );

buy_computer is a discrete attribute (all attributes in this example are discrete; continuous attributes can first be discretized).

The table shows that class "no" corresponds to 5 tuples and class "yes" to 9 tuples.

Therefore:

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits

2. Compute the expected information requirement of each of the other attributes.

Age entropy:

Age information gain:

Clearly, age has the highest information gain among the attributes, so it is selected as the splitting attribute.
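The table itself is not reproduced above, but the stated class counts (9 "yes", 5 "no") match the classic buy_computer example, in which splitting on age gives partitions with (2 yes, 3 no), (4 yes, 0 no), and (3 yes, 2 no). Under that assumption, here is a quick numerical check of the entropy and gain:

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a two-class sample with pos/neg counts."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

# Whole data set: 9 'yes' and 5 'no' tuples, as stated above.
info_d = entropy(9, 5)                      # ~0.940 bits

# Assumed age partitions: (2 yes, 3 no), (4 yes, 0 no), (3 yes, 2 no).
parts = [(2, 3), (4, 0), (3, 2)]
info_age = sum((p + n) / 14 * entropy(p, n) for p, n in parts)  # ~0.694 bits
gain_age = info_d - info_age                # ~0.247 bits

print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))
```

The other attributes' gains, computed the same way over their own partitions, come out smaller, which is why age wins.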

Get:

What we need to do next is recurse; that is, the problem splits into three subproblems, one per age branch, each solved in the same way.