Data Mining Series (6): Decision Tree Classification Algorithm


Starting from this article, I will introduce classification problems, mainly covering the decision tree algorithm, naive Bayes, support vector machines, BP neural networks, lazy learning algorithms, random forests and adaptive boosting (AdaBoost), and classification model selection and result evaluation. There will be seven articles in total; your attention and exchanges are welcome.

This article first introduces some basic knowledge about the classification problem, then describes the principle and implementation of the decision tree algorithm, and finally uses the decision tree algorithm to build a survival prediction application for Titanic passengers.

I. Basic Introduction to Classification

As the saying "birds of a feather flock together" suggests, classification problems have existed in our lives since ancient times. Classification is an important branch of data mining and is widely applied in many areas, such as medical diagnosis, spam email filtering, junk message interception, and customer analysis. Classification problems can be divided into two categories:

Classification: classification refers to data whose categories are discrete. For example, judging from a person's handwriting whether the writer is male or female: there are only two categories, and they form the discrete set {male, female}.

Prediction: prediction refers to data whose values are continuous. For example, predicting tomorrow's humidity at 8 o'clock: humidity changes continuously, and the humidity at 8 o'clock is a specific value that does not belong to any finite set. Prediction is also called regression analysis and is widely used in the financial field.

Although discrete and continuous data are processed differently, they can be converted into each other. For example, we can compare a feature value against a threshold: if the value is greater than 0.5 the person is judged male, and otherwise female, which turns a continuous value into a discrete judgment; conversely, continuous humidity values can be segmented into discrete ranges.
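
As a toy illustration of both conversions (the handwriting feature, the 0.5 threshold, and the humidity cut points are all made-up values for the example), consider this Python sketch:

def classify_by_threshold(value):
    # Continuous -> discrete: a handwriting feature score above 0.5
    # is judged male, otherwise female.
    return "male" if value > 0.5 else "female"

def discretize_humidity(humidity):
    # Continuous -> discrete: segment a humidity percentage into bins.
    if humidity < 40:
        return "dry"
    elif humidity < 70:
        return "comfortable"
    return "humid"

print(classify_by_threshold(0.62))   # male
print(discretize_humidity(55.0))     # comfortable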

Data classification is carried out in two steps:

1. Build the model: train a classifier on the training data set.
2. Use the constructed classifier model to classify the test data.

A good classifier has good generalization ability; that is, it achieves high accuracy not only on the training data set but also on the test data set. If a classifier performs well on the training data but poorly on the test data, it has overfitted: it has merely memorized the training data and failed to capture the characteristics of the whole data space.
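
As a minimal sketch of the two-step workflow and the overfitting check (scikit-learn and the iris data set are assumptions chosen for illustration, not necessarily what this series uses):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: construct the model by training a classifier on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Step 2: classify the test data. A large gap between the two accuracies
# is a sign that the classifier has overfitted the training data.
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))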

II. Decision Tree Classification

The decision tree algorithm implements classification through a tree-shaped branch structure. The following figure is an example of a decision tree, in which each internal node represents a test on an attribute, each branch of a node corresponds to a test outcome, and each leaf node represents a class label.

The figure above is a decision tree that predicts whether a person will buy a computer. With this tree we can classify new records, starting from the root node (age): if the person is middle-aged, we conclude directly that they will buy a computer; if a youth, we further check whether they are a student; if a senior, we further check their credit rating, continuing until a leaf node determines the record's class.
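
Written out as code, classifying a record is just a walk from the root to a leaf. In this sketch the leaf outcomes for the student and credit-rating branches are assumptions based on the classic "buys computer" example, not values stated above:

def buys_computer(age, is_student, credit_rating):
    if age == "middle-aged":
        return "yes"                    # middle-aged -> buys directly
    elif age == "youth":                # youth -> check whether a student
        return "yes" if is_student else "no"
    else:                               # senior -> check credit rating
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("youth", True, "fair"))   # yes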

The advantage of the decision tree algorithm is that it can produce rules that people can understand directly, a property that Bayesian and neural network algorithms do not have. Its accuracy is also relatively high, and no background knowledge about the data is required to classify with it, making it a very effective algorithm. The decision tree algorithm has many variants, including ID3, C4.5, C5.0, and CART, but their foundations are similar. Let's look at the basic idea of the decision tree algorithm:

Algorithm: GenerateDecisionTree(D, attributeList), which generates a decision tree from the training data records D.

Input: data records D, a training data set containing class labels; attribute list attributeList, the set of candidate attributes used for the test at an internal node; attribute selection method AttributeSelectionMethod(), which selects the attribute that best classifies the data.

Output: a decision tree.

Procedure:
    (1) Construct a node N.
    (2) If all records in D have the same class label, say class C: mark N as a leaf node with class C and return N.
    (3) If the attribute list is empty: mark N as a leaf node with the majority class in D and return N.
    (4) Call AttributeSelectionMethod(D, attributeList) to select the best splitting criterion splitCriterion.
    (5) Label node N with splitCriterion.
    (6) If the splitting attribute is discrete and multiway splits are allowed: remove the splitting attribute from the attribute list, attributeList -= splitAttribute.
    (7) For each value j of the splitting attribute: let Dj be the set of records in D satisfying value j. If Dj is empty, create a leaf node F marked with the majority class in D and attach F under N; otherwise, recursively call GenerateDecisionTree(Dj, attributeList) to obtain a subtree node Nj and attach Nj under N.
    (8) Return N.
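
The following is a compact Python sketch of the procedure above. Since the attribute selection method is only described later, information gain is used here as a stand-in for AttributeSelectionMethod(); the data representation, a list of (record, label) pairs, is likewise an assumption made for this sketch:

import math
from collections import Counter

def majority_class(D):
    # The most common class label among the (record, label) pairs in D.
    return Counter(label for _, label in D).most_common(1)[0][0]

def entropy(D):
    counts = Counter(label for _, label in D)
    n = len(D)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(D, attr):
    # Information gain of splitting D on attr (a placeholder criterion).
    subsets = {}
    for record, label in D:
        subsets.setdefault(record[attr], []).append((record, label))
    remainder = sum(len(s) / len(D) * entropy(s) for s in subsets.values())
    return entropy(D) - remainder

def generate_decision_tree(D, attribute_list):
    labels = {label for _, label in D}
    if len(labels) == 1:                  # step (2): all records one class
        return labels.pop()
    if not attribute_list:                # step (3): no attributes left
        return majority_class(D)
    best = max(attribute_list, key=lambda a: info_gain(D, a))  # step (4)
    node = {"attribute": best, "branches": {}}                 # step (5)
    remaining = [a for a in attribute_list if a != best]       # step (6)
    for value in {record[best] for record, _ in D}:            # step (7)
        Dj = [(r, l) for r, l in D if r[best] == value]
        # Values are taken from D itself, so Dj is never empty here; with
        # a full attribute domain, an empty Dj would receive the majority
        # class of D, as in the pseudocode.
        node["branches"][value] = generate_decision_tree(Dj, remaining)
    return node                                                # step (8)

# Tiny usage example with made-up records:
D = [({"age": "youth", "student": "no"}, "no"),
     ({"age": "youth", "student": "yes"}, "yes"),
     ({"age": "middle-aged", "student": "no"}, "yes")]
print(generate_decision_tree(D, ["age", "student"]))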

Steps (1), (2), and (3) of the algorithm are obvious. The best-attribute selection function in step (4) will be described later; for now it is enough to know that it finds a criterion such that the subtrees produced at the judgment node are as pure as possible, where pure means containing only one class label. Step (5) sets the test expression of node N according to the splitting criterion. In step (6), when constructing a multiway decision tree, a discrete attribute is used only once in node N and its subtrees and is then deleted from the list of available attributes. For example, in the figure above, once the attribute selection function determines that age is the best splitting attribute, age, which has three values, each corresponding to one branch, is no longer used afterward. The time complexity of the algorithm is O(k * |D| * log(|D|)), where k is the number of attributes and |D| is the number of records in data set D.
