Decision Trees: Predicting Contact Lens Types (ID3 algorithm, C4.5 algorithm, CART algorithm, Gini index, pruning, random forest)


1. Contents

1. Introduction of the problem

2. An example

3. Basic concepts

4. ID3

5. C4.5

6. CART

7. Random forest

2. Introduction of the problem and an example

What algorithm should we design so that a computer can automatically classify a loan applicant's application information and decide whether the loan can be granted?

A girl's mother wants to introduce a potential boyfriend to her, and they have the following conversation:

Daughter: How old is he?

Mother: 26.

Daughter: Is he handsome?

Mother: Very handsome.

Daughter: Is his income high?

Mother: Not very high; medium.

Daughter: Is he a civil servant?

Mother: Yes, he works in the Inland Revenue Department.

Daughter: OK, I'll meet him.

Decision making process:

This girl's decision-making process is a typical classification-tree decision: it amounts to dividing men into two classes (meet or do not meet) based on age, appearance, income, and whether he is a civil servant.

3. Definition:

A decision tree is a tree structure that describes the classification (meet or do not meet) of sample instances (the men).

A decision tree consists of nodes and directed edges. At the top is the root node, where all samples start together; at that node the samples are split into child nodes. Each child node then uses a new feature to split further, until a leaf node is reached. A leaf node contains samples of only a single class (meet or do not meet) and needs no further splitting.

Two types of nodes: internal nodes and leaf nodes.

An internal node represents a feature or attribute, and a leaf node represents a class.

4. Entropy

Feature Selection

First of all, which criterion (attribute, feature) should we choose as the primary condition (the root node) to split the samples (the men) and decide whether to meet? This is the feature selection problem.

The mother wants her daughter to have a clear attitude and decide whether to meet him or not, so that she can give the man a definite answer.

Mothers need to get as much information as possible to reduce uncertainty.

How is information measured? --Entropy

The more information the mother provides, the clearer the daughter's attitude becomes, and the less uncertainty there is about whether to meet. So the amount of information corresponds to a reduction of uncertainty, and we use entropy to measure uncertainty.

Entropy definition: if an event has $K$ possible outcomes, and the probability of the $i$-th outcome is $p_i$ ($i = 1, \dots, K$), then the amount of information we get on observing the outcome of this event, i.e. the entropy, is

$$H = -\sum_{i=1}^{K} p_i \log_2 p_i$$

The greater the entropy, the greater the uncertainty of the random variable (meet or do not meet).
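For intuition, a quick worked example (not in the original figures): a fair coin toss, with two equally likely outcomes, has the maximum uncertainty for a two-outcome event, while a certain event has no uncertainty at all:

$$H\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}, \qquad H(1,0) = 0$$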

5. Conditional entropy (local: entropy given that some condition holds)

The conditional entropy $H(Y|X)$ denotes the uncertainty of the random variable $Y$ given the random variable $X$:

$$H(Y|X) = \sum_{x} P(X=x)\, H(Y \mid X=x)$$

For example, given that the man's age is known, it measures the remaining uncertainty about whether the daughter will meet him.

When the probabilities in the entropy and conditional entropy are estimated from data, the corresponding quantities are called the empirical entropy and the empirical conditional entropy. If a probability is 0, we define $0\log 0 = 0$.

6. Information gain

Information gain measures how much the information from feature $X$ (e.g. age) reduces the uncertainty about class $Y$ (meet or do not meet).

The information gain $g(D,A)$ of feature $A$ with respect to training dataset $D$ is defined as the difference between the empirical entropy $H(D)$ of $D$ and the empirical conditional entropy $H(D|A)$ of $D$ given $A$:

$$g(D,A) = H(D) - H(D|A)$$

The difference between the entropy $H(Y)$ and the conditional entropy $H(Y|X)$ is called the mutual information; $g(D,A)$ is exactly the mutual information between the class and feature $A$ on the training set.

A larger information gain means a larger reduction in uncertainty, so the mother should ask her daughter about the conditions that yield the largest information gain.

7. Feature selection by the information gain criterion

For dataset D, compute the information gain of each feature, compare them, and select the feature with the largest information gain.

8. Loan Application Sample Data Sheet (example)

According to the loan application sample data sheet, there are 15 sample records (sample size 15). The class label is whether a loan is granted: 9 records are granted and 6 are not. There are 4 features: age, has a job, owns a house, and credit rating. Each feature takes several values; for example, age takes the 3 values youth, middle-aged, and old.

By the definition of entropy, the empirical entropy of the dataset D is

$$H(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} \approx 0.971$$

Next, compute the information gain of each feature on dataset D. Denote the 4 features age, has a job, owns a house, and credit rating by A1, A2, A3, and A4 respectively.

The age feature takes the values youth, middle-aged, and old:

Youth: 2 records granted a loan, 3 not granted, 5 records in total

Middle-aged: 3 records granted a loan, 2 not granted, 5 records in total

Old: 4 records granted a loan, 1 not granted, 5 records in total

The conditional entropy formula is

$$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$

where D1, D2, D3 are the subsets of samples whose age is youth, middle-aged, and old, respectively.

The conditional entropy given age (A1) is therefore

$$H(D|A_1) = \frac{5}{15}H(D_1) + \frac{5}{15}H(D_2) + \frac{5}{15}H(D_3) \approx \frac{1}{3}(0.971 + 0.971 + 0.722) \approx 0.888$$

and the information gain of age is

$$g(D,A_1) = H(D) - H(D|A_1) \approx 0.971 - 0.888 = 0.083$$

The information gain of has a job (A2), owns a house (A3), and credit rating (A4) is computed in the same way. For example, for owns a house: the 6 applicants who own a house are all granted a loan, while of the 9 who do not, 3 are granted and 6 are not, so

$$g(D,A_3) = 0.971 - \left(\frac{6}{15}\cdot 0 + \frac{9}{15}\left(-\frac{3}{9}\log_2\frac{3}{9} - \frac{6}{9}\log_2\frac{6}{9}\right)\right) \approx 0.971 - 0.551 = 0.420$$

Finally, comparing the information gain of each feature, A3 (owns a house) has the largest value, so A3 is chosen as the optimal feature.

Returning to the first example: age is the feature with the largest information gain, so age is chosen as the first condition for deciding whether to meet.
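As a cross-check, here is a short Python sketch (not part of the original article) that recomputes the empirical entropy and information gain from the class counts summarized above:

```python
from math import log2

def entropy(counts):
    """Empirical entropy from a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(total_counts, subset_counts):
    """Information gain g(D, A) = H(D) - sum_i |D_i|/|D| * H(D_i)."""
    total = sum(total_counts)
    cond = sum(sum(sub) / total * entropy(sub) for sub in subset_counts)
    return entropy(total_counts) - cond

# Loan data: 9 granted, 6 refused; age splits D into youth / middle-aged / old.
print(entropy([9, 6]))                               # ~0.971
print(info_gain([9, 6], [[2, 3], [3, 2], [4, 1]]))   # g(D, A1) ~0.083
print(info_gain([9, 6], [[6, 0], [3, 6]]))           # g(D, A3) ~0.420 (owns a house)
```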

9. ID3 algorithm

The core of the ID3 algorithm is to apply the information gain criterion to select a feature at each node and to build the decision tree recursively. Concretely: starting from the root node, compute the information gain of every candidate feature at that node, select the feature with the largest information gain as the node's feature, create a child node for each of its values, and then apply the same procedure recursively to each child node.

The recursion stops when the information gain of every remaining feature is very small or there are no features left to choose from. The result is the decision tree.

Continuing the example above: the feature A3 (owns a house) has the largest information gain, so it is selected as the root-node feature. It divides the training set into two subsets, D1 (A3 = yes) and D2 (A3 = no). Since D1 contains sample points of a single class only, the loan can clearly be granted for D1, so it becomes a leaf node labeled "yes".

For D2, a new feature must be selected from A1 (age), A2 (has a job), and A4 (credit rating). Compute the information gain of each feature on D2:

The feature with the largest information gain is A2 (has a job), so it is selected as the node feature. A2 has 2 values. The child node corresponding to "yes" (has a job) contains 3 samples that all belong to the same class, so it is a leaf node labeled "yes"; the child node corresponding to "no" (no job) contains 6 samples that also belong to a single class, so it is a leaf node labeled "no".

In other words, of the 15 applicants, after splitting on home ownership, the 6 who own a house can all be granted a loan. The remaining 9 are screened further on whether they have a job: the 3 with a job are granted a loan, and the 6 without a job are not.

The resulting decision tree uses only two features (two internal nodes): owning a house as the primary condition, then having a job, to decide whether a loan can be granted.

The ID3 algorithm only generates the tree and does not prune it, so the resulting tree tends to overfit: it splits too finely and takes too many conditions into account.

10. C4.5 algorithm

Shortcomings of ID3:

1. When selecting attributes by information gain, it prefers attributes with many values (many branches).

2. It cannot handle continuous attributes.

Information gain ratio definition: the information gain ratio of feature A with respect to training dataset D is defined as the ratio of its information gain $g(D,A)$ to the entropy $H_A(D)$ of D with respect to the values of feature A:

$$g_R(D,A) = \frac{g(D,A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$

where n is the number of values taken by feature A (for example, if A is age, then n = 3).
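Building on the entropy helpers in the sketch above, the gain ratio can be computed like this (again an illustrative sketch, not the article's code):

```python
def gain_ratio(total_counts, subset_counts):
    """Information gain ratio g_R(D, A) = g(D, A) / H_A(D).

    H_A(D) is the entropy of D with respect to the values of feature A,
    i.e. the entropy of the subset sizes, not of the class labels.
    """
    split_info = entropy([sum(sub) for sub in subset_counts])  # H_A(D)
    return info_gain(total_counts, subset_counts) / split_info

# Age splits the 15 loan records into three groups of 5:
print(gain_ratio([9, 6], [[2, 3], [3, 2], [4, 1]]))  # ~0.083 / 1.585 ~0.052
```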

Improvements in the C4.5 algorithm

C4.5 is one of the top ten data mining algorithms. It is an improvement of ID3, with several changes relative to ID3:

(1) It selects attributes using the information gain ratio.

(2) It prunes the tree during construction.

(3) It can also handle continuous (non-discrete) attributes.

(4) It can handle incomplete data.

11. CART algorithm

Classification and regression trees (CART) share the same core idea as ID3 and C4.5. The main difference is that CART performs a binary split at every node, so each node has at most two child nodes and the final result is a binary tree.

Partitioning method

Pruning

Name | Body temperature | Surface cover | Viviparous | Lays eggs | Can fly | Aquatic | Has legs | Hibernates | Class label
Human | Warm-blooded | Hair | Yes | No | No | No | Yes | No | Mammal
Python | Cold-blooded | Scales | No | Yes | No | No | No | Yes | Reptile
Salmon | Cold-blooded | Scales | No | Yes | No | Yes | No | No | Fish
Whale | Warm-blooded | Hair | Yes | No | No | Yes | No | No | Mammal
Frog | Cold-blooded | None | No | Yes | No | Sometimes | Yes | Yes | Amphibian
Komodo dragon | Cold-blooded | Scales | No | Yes | No | No | Yes | No | Reptile
Bat | Warm-blooded | Hair | Yes | No | Yes | No | Yes | No | Mammal
Cat | Warm-blooded | Fur | Yes | No | No | No | Yes | No | Mammal
Leopard shark | Cold-blooded | Scales | Yes | No | No | Yes | No | No | Fish
Turtle | Cold-blooded | Scales | No | Yes | No | Sometimes | Yes | No | Reptile
Porcupine | Warm-blooded | Bristles | Yes | No | No | No | Yes | Yes | Mammal
Eel | Cold-blooded | Scales | No | Yes | No | Yes | No | No | Fish
Newt | Cold-blooded | None | No | Yes | No | Sometimes | Yes | Yes | Amphibian

The table above has 8 attributes, each taking several discrete values. At every node of the decision tree we can split on any value of any of these attributes. For example, at the root we could split by:

1) surface cover: hair vs. non-hair

2) surface cover: scales vs. non-scales

3) body temperature: warm-blooded vs. not warm-blooded

Which of these binary splits is best? Generally we use the Gini index as the splitting criterion: the more mixed the classes contained in a set, the larger its Gini index (similar in spirit to entropy).

12. Gini index

In a classification problem with K classes, if the probability that a sample point belongs to class k is $p_k$, the Gini index is defined as

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

The warm-blooded group contains 5 mammals and 2 birds; the cold-blooded group contains 3 reptiles, 3 fish, and 2 amphibians (these counts refer to the full 15-animal table, which also included two birds).

The warm-blooded group contains 5 mammals and 2 birds, so:

$$\mathrm{Gini}_{\text{warm}} = 1 - \left(\tfrac{5}{7}\right)^2 - \left(\tfrac{2}{7}\right)^2 \approx 0.408$$

The cold-blooded group contains 3 reptiles, 3 fish, and 2 amphibians, so:

$$\mathrm{Gini}_{\text{cold}} = 1 - \left(\tfrac{3}{8}\right)^2 - \left(\tfrac{3}{8}\right)^2 - \left(\tfrac{2}{8}\right)^2 \approx 0.656$$

Gini index of a set: for a sample set D whose classes are $C_k$,

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2$$

If the sample set D is split into D1 and D2 according to whether feature A takes a certain value a, then the Gini index of D under the condition of feature A (the quantity compared between candidate splits) is defined as

$$\mathrm{Gini}(D,A) = \frac{|D_1|}{|D|}\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\mathrm{Gini}(D_2)$$

For the split "body temperature: warm-blooded vs. not warm-blooded" we get

$$\mathrm{Gini}(D,\text{temperature}) = \frac{7}{15}\times 0.408 + \frac{8}{15}\times 0.656 \approx 0.540$$

The Gini index of a set D measures its uncertainty: the larger the Gini value, the more uncertain the class of a sample drawn from the set, similar to entropy. We always want to gain information and reduce uncertainty, so the best split is the one that makes the weighted Gini index of the resulting partition smallest.
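A minimal Python sketch of the Gini computations above (illustrative, not from the article):

```python
def gini(counts):
    """Gini index 1 - sum(p_k^2) from a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(subset_counts):
    """Weighted Gini index of a split; the smaller, the better."""
    total = sum(sum(sub) for sub in subset_counts)
    return sum(sum(sub) / total * gini(sub) for sub in subset_counts)

# Split on body temperature: warm-blooded (5 mammals, 2 birds)
# vs. cold-blooded (3 reptiles, 3 fish, 2 amphibians).
print(gini([5, 2]))                      # ~0.408
print(gini([3, 3, 2]))                   # ~0.656
print(gini_split([[5, 2], [3, 3, 2]]))   # ~0.540
```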

13. Pruning

When a CART tree is split too finely, it fits the noise in the data too well (overfitting). We address this with pruning, which comes in two flavors: pre-pruning and post-pruning.

Pre-pruning decides, during construction of the tree, which nodes need not be split, and simply does not split them.

Post-pruning builds the complete decision tree first and then examines which subtrees can be cut off.

The CART pruning algorithm cuts some subtrees from the bottom of the "fully grown" decision tree, making the tree smaller (the model simpler) so that it predicts unknown data more accurately.

The CART pruning algorithm consists of two steps: first, starting from the bottom of the tree T0 produced by the generation algorithm, keep pruning until only the root of T0 remains, which yields a sequence of subtrees; then test this subtree sequence on an independent validation set by cross-validation and select the optimal subtree.

For each non-leaf node t in the CART tree, define the surface error rate gain $\alpha$ (the rate at which the error increases when the subtree at t is pruned; the smaller, the better):

$$\alpha = \frac{R(t) - R(T_t)}{|N_{T_t}| - 1}$$

$|N_{T_t}|$ is the number of leaf nodes contained in the subtree $T_t$ rooted at t.

$R(t)$ is the error cost of node t if the node is pruned (i.e. turned into a leaf): $R(t) = r(t)\,p(t)$, where

$r(t)$ is the error rate of node t;

$p(t)$ is the proportion of all data that reaches node t.

$R(T_t)$ is the error cost of the subtree $T_t$ if the node is not pruned; it equals the sum of the error costs of all leaf nodes of $T_t$.

Consider a non-leaf node T4 in the tree:

Given that the whole dataset contains 60 samples, the error cost of node T4 (if it were pruned into a leaf) is $R(T_4) = r(T_4)\,p(T_4)$, with $r(T_4)$ and $p(T_4)$ read off the node's sample counts.

Note: the class of a leaf node is defined as the class of the majority of the samples it covers; the majority count is treated as correct and the minority count as the error.

The error cost of the subtree is the sum of the error costs of its leaf nodes.

The subtree rooted at T4 has 3 leaf nodes, so finally

$$\alpha(T_4) = \frac{R(T_4) - R(T_{T_4})}{3 - 1}$$

The non-leaf node with the smallest α value is found and its children are removed, that is, the node becomes a leaf node; this is one pruning step.
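A small sketch of the α formula above; the concrete counts below are hypothetical, since the original figure with node T4's numbers is not available:

```python
def surface_error_rate_gain(node_errors, node_samples, leaf_errors, total):
    """alpha = (R(t) - R(T_t)) / (|N_Tt| - 1).

    node_errors / node_samples gives r(t); node_samples / total gives p(t);
    leaf_errors lists the misclassified counts at the subtree's leaves.
    """
    r_t = node_errors / node_samples            # error rate of node t as a leaf
    p_t = node_samples / total                  # fraction of all data reaching t
    R_t = r_t * p_t                             # error cost if t is pruned
    R_Tt = sum(e / total for e in leaf_errors)  # error cost of the unpruned subtree
    return (R_t - R_Tt) / (len(leaf_errors) - 1)

# Hypothetical node: 16 samples reach it, 7 are misclassified if it becomes a leaf;
# its 3 leaves misclassify 2, 0 and 3 samples; 60 samples overall.
print(surface_error_rate_gain(7, 16, [2, 0, 3], 60))  # (7/60 - 5/60) / 2 = 1/60
```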

14. Random Forest

A random forest builds many decision trees, forming a "forest" of trees, and makes decisions by letting the trees vote. This method effectively improves classification accuracy on new samples.

Steps of a random forest:

First, the sample data is sampled with replacement (bootstrap sampling) to obtain multiple sample sets. Specifically, from the original N training samples, N samples are drawn at random with replacement (so duplicates are possible).

Then, m features are drawn at random from the candidate features as the candidates for the split at the current node, and the one that best partitions the training samples is chosen from among them. Each bootstrap sample set is used as the training set of one decision tree. Each individual tree is grown with the CART algorithm and is not pruned once the sample set and candidate features are fixed.

Finally, once the required number of decision trees has been built, the random forest lets the trees vote on their outputs and takes the class with the most votes as its decision.

The random forest method samples both the training examples and the features, which keeps the individual trees largely independent of one another and makes the vote more accurate.

The randomness of a random forest shows up in two places: the training sample of each tree is random, and the split attributes considered at each node of a tree are also chosen at random. With these 2 random factors, the random forest is much less prone to overfitting even though the individual trees are not pruned.

A random forest has two main tunable parameters: the number of trees in the forest (usually chosen fairly large) and the number m of attributes sampled at each split.
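As a practical sketch (assuming scikit-learn is available; the dataset and parameter values here are illustrative and not from the article), the two parameters map onto n_estimators and max_features:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data; the article's loan/lens data would work the same way
# once encoded numerically.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees; max_features = size m of the random
# feature subset considered at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # accuracy on held-out data
print(forest.feature_importances_)    # which attributes matter most
```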

Advantages of random forests:

(1) More accurate classification results

(2) Ability to handle high-dimensional attributes without having to do feature selection

(3) Maintains high accuracy even when a large part of the data is missing

(4) Fast learning process

(5) Reports which attributes are important after training is complete

(6) Easy to parallelize

(7) Can detect interactions between features during training

15. Code: implementing the ID3 algorithm

1. Prepare training data
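The original code listing is not reproduced here; the following is a minimal sketch of a training set in the list-of-lists format used by a hand-rolled ID3. The rows and feature names are illustrative, loosely mirroring the loan example rather than reproducing its 15 records:

```python
def create_dataset():
    """Toy training data: each row is [owns_house, has_job, class_label]."""
    dataset = [
        ["yes", "no",  "lend"],
        ["yes", "yes", "lend"],
        ["no",  "yes", "lend"],
        ["no",  "no",  "refuse"],
        ["no",  "no",  "refuse"],
    ]
    feature_names = ["owns_house", "has_job"]
    return dataset, feature_names
```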

2. Calculate Information gain

The calculation is sketched below:
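A sketch of the entropy and feature-selection helpers (structured like the usual hand-rolled ID3 implementation; not the article's original listing):

```python
from collections import Counter
from math import log2

def shannon_entropy(dataset):
    """Empirical entropy H(D) of the class labels (last column)."""
    label_counts = Counter(row[-1] for row in dataset)
    total = len(dataset)
    return -sum(c / total * log2(c / total) for c in label_counts.values())

def split_dataset(dataset, axis, value):
    """Rows whose feature `axis` equals `value`, with that feature removed."""
    return [row[:axis] + row[axis + 1:] for row in dataset if row[axis] == value]

def choose_best_feature(dataset):
    """Index of the feature with the largest information gain."""
    base_entropy = shannon_entropy(dataset)
    best_gain, best_feature = 0.0, -1
    for axis in range(len(dataset[0]) - 1):
        values = set(row[axis] for row in dataset)
        new_entropy = sum(
            len(sub) / len(dataset) * shannon_entropy(sub)
            for sub in (split_dataset(dataset, axis, v) for v in values)
        )
        gain = base_entropy - new_entropy
        if gain > best_gain:
            best_gain, best_feature = gain, axis
    return best_feature
```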

3. Recursively construct the decision tree

When all the features have been used up, a majority vote determines the class of the leaf node: the leaf node is assigned the class that the largest number of its samples belong to.

Create Tree

To run the test:
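A sketch of the recursive tree construction plus a quick test, assuming the helpers defined above; the tree is returned as a nested dict keyed by feature names:

```python
from collections import Counter

def majority_count(class_list):
    """Majority-vote class label for a leaf with mixed classes."""
    return Counter(class_list).most_common(1)[0][0]

def create_tree(dataset, feature_names):
    """Recursively build an ID3 decision tree as a nested dict."""
    class_list = [row[-1] for row in dataset]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]                 # all samples in one class
    if len(dataset[0]) == 1:
        return majority_count(class_list)    # no features left: majority vote
    best = choose_best_feature(dataset)
    best_name = feature_names[best]
    tree = {best_name: {}}
    remaining = feature_names[:best] + feature_names[best + 1:]
    for value in set(row[best] for row in dataset):
        tree[best_name][value] = create_tree(split_dataset(dataset, best, value), remaining)
    return tree

dataset, feature_names = create_dataset()
tree = create_tree(dataset, feature_names)
print(tree)   # e.g. {'owns_house': {'yes': 'lend', 'no': {'has_job': ...}}}
```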

4. View the resulting decision tree

5. Test data
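A sketch of a classifier that walks the nested-dict tree for a new sample, continuing from the test above:

```python
def classify(tree, feature_names, sample):
    """Walk the nested-dict tree to predict the class of one sample."""
    feature_name = next(iter(tree))                  # root feature of this subtree
    subtree = tree[feature_name][sample[feature_names.index(feature_name)]]
    if isinstance(subtree, dict):
        return classify(subtree, feature_names, sample)
    return subtree                                   # reached a leaf: class label

print(classify(tree, feature_names, ["no", "yes"]))  # expected: 'lend'
```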

6. Storing the decision tree

Constructing a decision tree is time-consuming. To save computation time, it is best to reuse an already constructed decision tree whenever a classification is performed. To do this, use the Python pickle module to serialize the tree object; a serialized object can be saved to disk and read back when needed.

To run the test:
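A sketch of storing and reloading the tree with pickle (the file name is illustrative):

```python
import pickle

def store_tree(tree, filename):
    """Serialize the decision tree to disk."""
    with open(filename, "wb") as f:
        pickle.dump(tree, f)

def grab_tree(filename):
    """Load a previously stored decision tree."""
    with open(filename, "rb") as f:
        return pickle.load(f)

store_tree(tree, "classifier_storage.pkl")
print(grab_tree("classifier_storage.pkl"))
```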

7. Example: predicting contact lens types using decision trees
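A sketch of the contact-lens example. It assumes a tab-separated lenses.txt file in the format used by Machine Learning in Action (features age, prescript, astigmatic, tearRate); the data file itself is not included here:

```python
# Assumes a tab-separated lenses.txt with one sample per line:
# age  prescript  astigmatic  tearRate  lens_type
with open("lenses.txt") as f:
    lenses = [line.strip().split("\t") for line in f]
lenses_features = ["age", "prescript", "astigmatic", "tearRate"]

lenses_tree = create_tree(lenses, lenses_features)
print(lenses_tree)
```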

Summary

This article walked through decision trees step by step: entropy, conditional entropy, and information gain for feature selection (ID3), the information gain ratio (C4.5), binary splits with the Gini index and cost-complexity pruning (CART), random forests built from many unpruned CART trees, and finally a small ID3 implementation in Python applied to the contact-lens example.
