Decision Tree Classification

A decision tree is a tree structure used for classification. Each internal node represents a test on an attribute, each edge represents a test outcome, and each leaf node represents a class or a class distribution. The topmost node is the root node. Decision trees come in two types: classification trees and regression trees. A classification tree is built for discrete target variables, while a regression tree is built for continuous target variables.
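
The following minimal sketch illustrates this distinction using scikit-learn; the library choice and the toy data are illustrative assumptions, not from the original text:

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    # Toy data: two numeric features per sample (illustrative only).
    X = [[0, 0], [1, 1], [0, 1], [1, 0]]

    # Classification tree: the target is a discrete class label.
    clf = DecisionTreeClassifier().fit(X, ["no", "yes", "yes", "no"])
    print(clf.predict([[1, 1]]))  # -> ['yes']

    # Regression tree: the target is a continuous value.
    reg = DecisionTreeRegressor().fit(X, [0.1, 0.9, 0.8, 0.2])
    print(reg.predict([[1, 1]]))  # -> [0.9]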

Constructing a decision tree is a top-down, recursive divide-and-conquer process. The result is a binary or multi-way tree, and the input is a set of training records with class labels. An internal (non-leaf) node of a binary tree is usually expressed as a logical test, for example of the form (A = b), where A is an attribute and b is one of its values; the edges of the tree are the branch outcomes of that test. An internal node of a multi-way tree (as in ID3) is an attribute, and the edges are that attribute's possible values: one edge per value. The leaf nodes of the tree are all class labels.
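
As a rough illustration of this node structure, here is a minimal Python sketch; the class and field names are assumptions made for illustration, not taken from the text:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class Node:
        # Internal node: tests one attribute; each edge is one of its values.
        attribute: Optional[str] = None
        # Edges: attribute value -> child subtree (multi-way, as in ID3).
        children: Dict[object, "Node"] = field(default_factory=dict)
        # Leaf node: carries a class label instead of a test.
        label: Optional[str] = None

        def is_leaf(self) -> bool:
            return self.label is not None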

Using decision trees for classification involves two steps:

Step 1: use the training set to build and refine a decision tree, establishing the decision tree model. This is in effect a machine learning process of acquiring knowledge from data.

Step 2: use the generated decision tree to classify new input data. For an input record, test its attribute values in sequence starting from the root node until a leaf node is reached; that leaf identifies the record's class.
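
A minimal sketch of step 2, assuming the Node structure sketched earlier: classification simply follows the record's attribute values from the root down to a leaf.

    def classify(node, record):
        # record: dict mapping attribute name -> attribute value.
        while not node.is_leaf():
            node = node.children[record[node.attribute]]
        return node.label

    # Hypothetical usage: classify(root, {"outlook": "sunny", "humidity": "high"})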

The crux of the problem is building the decision tree. This process is generally divided into two phases:

(1) Tree building: as the decision tree building algorithm shows, this is a recursive process that finally yields a tree (a sketch of the recursion follows this list).

(2) Tree pruning: pruning aims to reduce the fluctuations introduced by noise in the training set.
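
As a rough sketch of phase (1), the following implements a classic ID3-style recursion over the Node structure sketched above; the entropy/information-gain criterion is the standard ID3 choice, not something this article prescribes:

    import math
    from collections import Counter

    def entropy(rows, target):
        counts = Counter(r[target] for r in rows)
        total = len(rows)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def build(rows, attributes, target):
        labels = {r[target] for r in rows}
        if len(labels) == 1 or not attributes:
            # Leaf: pure node, or no attributes left -> majority class.
            majority = Counter(r[target] for r in rows).most_common(1)[0][0]
            return Node(label=majority)

        def gain(a):
            # Information gain of splitting on attribute a.
            remainder = sum(
                n / len(rows) * entropy([r for r in rows if r[a] == v], target)
                for v, n in Counter(r[a] for r in rows).items()
            )
            return entropy(rows, target) - remainder

        best = max(attributes, key=gain)
        node = Node(attribute=best)
        for v in {r[best] for r in rows}:  # one edge per attribute value
            subset = [r for r in rows if r[best] == v]
            node.children[v] = build(subset, [a for a in attributes if a != best], target)
        return node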

Evaluation of the decision tree method

Strengths

Compared with other classification algorithms, decision trees have the following advantages:

(1) Speed: the computational cost is relatively low, and the tree is easy to convert into classification rules. Simply walking from the root of the tree down to a leaf, the split conditions along the path uniquely determine a classification predicate (see the rule-extraction sketch after this list).

(2) Accuracy: the mined classification rules are highly accurate and easy to understand, and the decision tree clearly shows which fields are the most important.
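
As an illustration of how easily a tree converts to rules, the following sketch prints a fitted scikit-learn tree as nested split conditions; the dataset and depth are illustrative choices:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
    # Each root-to-leaf path prints as a chain of split conditions, i.e. a rule.
    print(export_text(tree, feature_names=list(iris.feature_names)))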

Disadvantages

Decision trees have the following disadvantages:

(1) Lack of scalability: because a depth-first search is performed, the algorithm is limited by available memory and has difficulty processing large training sets. For example, in the Irvine machine learning repository, the largest allowed datasets are only about 2,000 KB. Modern data warehouses often store gigabytes of massive data, so the earlier approach is clearly infeasible.

(2) To handle large datasets or continuous fields, the algorithms must be improved (by discretization, sampling, and so on), which adds extra overhead to the classification algorithm and reduces classification accuracy. Predicting continuous fields accurately is harder than predicting discrete ones; when there are too many classes, errors can accumulate quickly; and time-ordered data requires a great deal of preprocessing.
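
As an example of the discretization workaround mentioned in (2), the sketch below bins a continuous field into a few ordinal categories using scikit-learn's KBinsDiscretizer; the tool and data are illustrative assumptions:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    # A continuous field (e.g. age), reshaped to a single column.
    ages = np.array([[22.0], [35.0], [47.0], [58.0], [63.0], [71.0]])
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
    print(binner.fit_transform(ages).ravel())  # continuous values -> 3 bins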

Moreover, the decision tree algorithms used in classification mining generally do not take noise into account, so the generated decision tree is perfect only in theory. In practice, large amounts of real-world data do not behave as one would wish: some fields may have missing values; the data may be inaccurate, containing noise or errors; or the data may be incomplete because necessary values are absent.

In addition, decision tree techniques have other shortcomings. As noted above, when there are many classes the number of errors can grow quickly, and it is harder to make accurate predictions for continuous fields than for discrete ones. Generally, an algorithm also splits on only a single attribute at a time.

In the presence of noise, fully fitting the training data leads to overfitting: a tree that fits the training data completely does not deliver good predictive performance. Pruning is a technique for overcoming noise; at the same time, it simplifies the tree and makes it easier to understand. In addition, decision tree techniques can also suffer from subtree replication and fragmentation problems.
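
As one concrete pruning technique (a choice of ours; the article does not prescribe one), scikit-learn's cost-complexity pruning trades a perfect fit on the training data for a smaller, more general tree:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    full = DecisionTreeClassifier(random_state=0).fit(X, y)
    # Larger ccp_alpha prunes harder: simpler tree, less overfitting to noise.
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
    print(full.tree_.node_count, "->", pruned.tree_.node_count)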
