Decision Tree Classification

A decision tree is a tree structure used for classification. Each internal node represents a test on an attribute, each edge represents an outcome of that test, and each leaf node represents a class (or a class distribution). The topmost node is the root. Decision trees come in two types: classification trees, which handle discrete target variables, and regression trees, which handle continuous target variables.

A decision tree is built top-down and recursively, and the result is either a binary or a multi-way tree. The input is a set of training records with class labels. In a binary tree, each internal (non-leaf) node typically holds a logical test of the form A = B, where A is an attribute and B is one of its values, and the edges are the two outcomes of that test. In a multi-way tree (as in ID3), each internal node holds an attribute and there is one edge per value of that attribute: as many values as the attribute has, so many edges. The leaf nodes all carry class labels.
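To make the structure concrete, here is a minimal Python sketch of such a multi-way tree. The Node class and the outlook/humidity attributes are hypothetical illustrations (a toy weather example), not taken from any particular library:

```python
class Node:
    """One node of a multi-way decision tree."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute  # attribute tested at this internal node (None for a leaf)
        self.children = {}          # maps each attribute value (edge) to a child Node
        self.label = label          # class label if this is a leaf

    def is_leaf(self):
        return self.attribute is None

# Root tests "outlook"; each value leads to a subtree or directly to a leaf.
root = Node(attribute="outlook")
root.children["overcast"] = Node(label="play")
root.children["rainy"] = Node(label="don't play")
sunny = Node(attribute="humidity")  # a binary test would be phrased as humidity = high
sunny.children["high"] = Node(label="don't play")
sunny.children["normal"] = Node(label="play")
root.children["sunny"] = sunny
```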

Using a decision tree for classification involves two steps:

Step 1: use the training set to build and refine a decision tree, i.e., to establish a decision tree model. This is essentially machine learning: acquiring knowledge from data.

Step 2: use the generated decision tree to classify new input. For each input record, test the record's attribute values starting from the root node until a leaf node is reached; that leaf gives the record's class.
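Step 2 is then a simple walk from the root to a leaf. A minimal sketch, assuming the hypothetical Node class and the root tree from the example above:

```python
def classify(node, record):
    """Walk from the root to a leaf, testing the record's attribute values."""
    while not node.is_leaf():
        value = record[node.attribute]  # test the attribute stored at this node
        node = node.children[value]     # follow the edge matching the test result
    return node.label

record = {"outlook": "sunny", "humidity": "high"}
print(classify(root, record))           # -> "don't play"
```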

The key problem is building the decision tree. This process generally has two phases:

(1) Tree building: as the tree-building algorithm shows, this is a recursive process that produces a tree (a minimal sketch follows this list).

(2) Tree pruning: pruning aims to remove the distortions that noise in the training set introduces into the tree.
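As one illustration of phase (1), below is a minimal ID3-style recursion, assuming the hypothetical Node class sketched earlier. It selects splits by information gain and stops at pure or attribute-exhausted nodes; real implementations add pruning, handle unseen attribute values, and so on:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Reduction in entropy from splitting the records on attribute attr.
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:            # pure node: make a leaf
        return Node(label=labels[0])
    if not attributes:                   # no attributes left: majority-class leaf
        return Node(label=Counter(labels).most_common(1)[0][0])
    attr = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = Node(attribute=attr)
    for value in set(row[attr] for row in rows):   # one edge per observed value
        sub = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        node.children[value] = build_tree([r for r, _ in sub],
                                          [l for _, l in sub],
                                          [a for a in attributes if a != attr])
    return node
```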

Evaluation of the decision tree method

Advantages

Compared with other classification algorithms, decision trees have the following advantages:

(1) Fast: the computational cost is relatively low, and the tree is easy to convert into classification rules: following any path from the root down to a leaf, the split conditions along the way uniquely determine a classification predicate (see the sketch after this list).

(2) High accuracy: the mined classification rules are highly accurate and easy to understand, and the decision tree makes clear which fields matter most.
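Advantage (1), the path-to-rule correspondence, can be shown directly: each root-to-leaf path becomes one IF-THEN rule. A sketch, again assuming the hypothetical Node class and root tree from earlier:

```python
def extract_rules(node, conditions=()):
    # Each root-to-leaf path yields exactly one classification rule.
    if node.is_leaf():
        test = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        yield f"IF {test} THEN class = {node.label}"
    else:
        for value, child in node.children.items():
            yield from extract_rules(child, conditions + ((node.attribute, value),))

for rule in extract_rules(root):
    print(rule)
# IF outlook = overcast THEN class = play
# IF outlook = rainy THEN class = don't play
# IF outlook = sunny AND humidity = high THEN class = don't play
# IF outlook = sunny AND humidity = normal THEN class = play
```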

Disadvantages

Disadvantages of decision trees:

(1) Lack of scalability: because tree construction uses depth-first search, the algorithm is constrained by available memory and has difficulty processing large training sets. For example, the data sets in the Irvine (UCI) machine learning repository were at most on the order of 2,000 KB of records, whereas modern data warehouses often hold gigabytes of data; applying the earlier methods directly is clearly infeasible.

(2) The various refinements for handling large data sets or continuous attributes (discretization and sampling; a sketch of simple discretization follows this list) not only add overhead to the classification algorithm but can also reduce classification accuracy. Prediction on continuous fields is difficult, errors can grow rapidly when there are many classes, and time-ordered data requires preprocessing.
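Equal-width binning is the simplest form of the discretization mentioned in (2). A minimal sketch (the bin count here is chosen arbitrarily; real systems pick it via heuristics or entropy-based methods):

```python
def discretize(values, n_bins=4):
    # Map each continuous value to an equal-width bin index in [0, n_bins - 1].
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

temperatures = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
print(discretize(temperatures))         # [0, 0, 0, 0, 1, 1, 1, 2, 3, 3, 3, 3]
```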

Moreover, decision tree algorithms used for classification mining often take no account of noise, so the "perfect" generated tree is perfect only in theory. In practice, much real-world data does not arrive as we would wish: some fields may have missing values; the data may be inaccurate, containing noise or outright errors; or required fields may be absent, leaving the data incomplete.

In addition, decision tree techniques have further shortcomings: when there are many classes, errors multiply; accurate prediction on continuous fields is hard; and most algorithms split on only one attribute at a time.

When noise is present, fitting the training data exactly leads to overfitting: a tree that fits the training data perfectly may nonetheless predict poorly. Pruning is a technique for overcoming noise; it also simplifies the tree and makes it easier to understand (a concrete pruning example follows). Decision tree techniques can also suffer from subtree replication and fragmentation.
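In practice, pruning is often controlled by a single strength parameter. As one concrete illustration, the sketch below uses scikit-learn's cost-complexity pruning (ccp_alpha), which is one pruning technique among several, not necessarily the one this article's sources had in mind:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ccp_alpha = 0 grows the full tree; larger values prune more aggressively,
# trading exact training fit for a simpler, often better-generalizing tree.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_train, y_train)

print("unpruned:", full.get_n_leaves(), "leaves, test acc", full.score(X_test, y_test))
print("pruned:  ", pruned.get_n_leaves(), "leaves, test acc", pruned.score(X_test, y_test))
```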
