C4.5 algorithm

Last Update:2015-07-10 Source: Internet

Author: User


"Scope of Application"

For classification problems, the C4.5 algorithm can be used whenever the boundaries between classes can be expressed as a tree decomposition or as a set of discriminant rules.

"Properties"

Supervised learning

"Basic Ideas"

Given a dataset in which every instance is described by a set of attributes and belongs to exactly one category, the C4.5 algorithm learns a mapping from attribute values to categories. This mapping can then be used to classify new, unseen instances.
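As a minimal sketch of what such a mapping looks like once learned, a decision tree can be represented as nested dictionaries and applied to a new instance by following one branch per tested attribute. The tree below is written by hand purely for illustration (it is not the output of C4.5), and the names `toy_tree`, `classify`, `outlook`, and `windy` are assumptions introduced here.

```python
# Hand-written tree for illustration only; it was not learned from data.
# Internal node: {"attribute": name, "branches": {value: subtree}}.
# Leaf: a plain category label (a string).
toy_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {"yes": "stay in", "no": "play"},
        },
        "rainy": "stay in",
        "overcast": "play",
    },
}


def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(tree, dict):
        value = instance[tree["attribute"]]
        tree = tree["branches"][value]
    return tree


print(classify(toy_tree, {"outlook": "sunny", "windy": "no"}))
```

An attribute value with no matching branch raises a `KeyError` in this sketch; a real classifier would need a fallback such as a majority class.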

"Algorithm principle"

Input: an attribute-valued dataset D

Tree = {}
if D is "pure" or other stopping criteria are met then
    terminate
end if
for all attributes a ∈ D do
    compute information-theoretic criteria if we split on a
end for
aBest = best attribute according to the criteria computed above
Tree = create a decision node that tests aBest in the root
Dv = induced sub-datasets from D based on aBest
for all Dv do
    Treev = C4.5(Dv)
    attach Treev to the corresponding branch of Tree
end for
return Tree
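The pseudocode above can be sketched in Python roughly as follows. This is a simplified illustration, not the reference C4.5 implementation: it assumes categorical attributes only, uses gain ratio as the splitting criterion, and omits continuous attributes, missing values, and pruning. The function names (`entropy`, `partition`, `gain_ratio`, `build_tree`) and the toy weather rows are assumptions introduced for this example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def partition(rows, attr):
    """Group rows into sub-datasets, one per observed value of `attr`."""
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row)
    return parts


def gain_ratio(rows, attr, target):
    """Gain(attr) divided by the entropy of attr's own value distribution."""
    n = len(rows)
    remainder = sum(
        len(part) / n * entropy([r[target] for r in part])
        for part in partition(rows, attr).values()
    )
    gain = entropy([r[target] for r in rows]) - remainder
    split_entropy = entropy([r[attr] for r in rows])
    # An attribute with a single value cannot split the data.
    return gain / split_entropy if split_entropy > 0 else 0.0


def build_tree(rows, attrs, target):
    """Recursively grow a tree; leaves are category labels."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: the subset is pure, or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority
    a_best = max(attrs, key=lambda a: gain_ratio(rows, a, target))
    # Extra stopping criterion: no attribute improves on the current node.
    if gain_ratio(rows, a_best, target) <= 0:
        return majority
    remaining = [a for a in attrs if a != a_best]
    return {
        "attribute": a_best,
        "branches": {
            value: build_tree(subset, remaining, target)
            for value, subset in partition(rows, a_best).items()
        },
    }


# Toy data, invented for illustration.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

On this toy data the root tests `outlook`, and only the mixed `sunny` subset is split further, on `windy`.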

"Algorithm Elaboration"

The root node represents the given dataset. Starting at the root, each node tests a specific attribute, dividing that node's dataset into smaller subsets, each of which is represented by a subtree. The process repeats until a subset becomes "pure", that is, all instances in the subset belong to the same category, at which point that branch of the tree stops growing.

"Algorithm Essentials"

1. Information-Theoretic Criteria

The C4.5 algorithm uses information-theoretic criteria such as gain (Gain) and gain ratio (Gain Ratio) to select the attribute on which to split. Gain measures the reduction in entropy of the class distribution achieved by a split: the greater the gain, the better the attribute separates the classes. Its disadvantage is a bias toward attributes with many possible outcomes (values). Gain ratio corrects for this bias, so it is the default criterion in the C4.5 algorithm.
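The bias can be seen in a small experiment, sketched here using the formulas given in this section. A hypothetical `record_id` attribute that is unique per instance achieves the maximum possible gain even though it is useless for classifying new instances, while gain ratio prefers a genuinely informative attribute. The data, the attribute names `record_id` and `humidity`, and the helper `gain_and_ratio` are all invented for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_and_ratio(attr_values, labels):
    """Return (gain, gain ratio) for splitting `labels` by `attr_values`."""
    n = len(labels)
    groups = {}
    for value, label in zip(attr_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_entropy = entropy(attr_values)
    ratio = gain / split_entropy if split_entropy > 0 else 0.0
    return gain, ratio


# Invented data: eight instances, two categories.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
record_id = list(range(8))  # unique per instance, so every subset is pure
humidity = ["low", "low", "high", "high", "low", "high", "low", "low"]

gain_id, ratio_id = gain_and_ratio(record_id, labels)
gain_h, ratio_h = gain_and_ratio(humidity, labels)

print(f"record_id: gain={gain_id:.3f} ratio={ratio_id:.3f}")
print(f"humidity:  gain={gain_h:.3f} ratio={ratio_h:.3f}")
```

Here `record_id` has a gain of 1.0 against roughly 0.55 for `humidity`, so plain gain would pick the identifier. Dividing by the entropy of the attribute's own values (3 bits for eight distinct IDs) drops its gain ratio to about 0.33, below the roughly 0.57 of `humidity`.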

$$\mathrm{GainRatio}(a) = \frac{\mathrm{Gain}(a)}{\mathrm{Entropy}(a)}$$

where

$$\mathrm{Gain}(a) = \mathrm{Entropy}(\text{Category in } D) - \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(\text{Category in } D_v)$$

$$\mathrm{Entropy} = -\sum p \log_2 p$$

Here D is the entire dataset and Dv is the subset of D whose instances all share the same value v of attribute a; their categories may differ.

The entropy Entropy(a) of attribute a depends only on the probability distribution of its values, regardless of category.

The gain Gain(a) of attribute a, by contrast, depends on the categories.
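As a worked example under assumed numbers (not taken from the source): suppose D contains 6 instances, 3 in each of two categories, and attribute a takes three values that split D into three subsets of 2 instances each. One subset is mixed (one instance of each category) and the other two are pure. The criteria then evaluate to:

```latex
\begin{align*}
\mathrm{Entropy}(\text{Category in } D)
  &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1 \\
\mathrm{Gain}(a)
  &= 1 - \left(\tfrac{2}{6}\cdot 1 + \tfrac{2}{6}\cdot 0 + \tfrac{2}{6}\cdot 0\right)
   = \tfrac{2}{3} \approx 0.667 \\
\mathrm{Entropy}(a)
  &= -3\cdot\tfrac{2}{6}\log_2\tfrac{2}{6} = \log_2 3 \approx 1.585 \\
\mathrm{GainRatio}(a)
  &= \frac{0.667}{1.585} \approx 0.421
\end{align*}
```

Note that Entropy(a) is computed from the subset sizes alone (2, 2, 2), while Gain(a) requires the category counts inside each subset.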

"Code Implementation"

http://www.rulequest.com/Personal/

The content of this article was compiled with reference to "The Top Ten Algorithms in Data Mining" (Tsinghua University Press).

This article is from the "Lucas" blog; please retain this source attribution: http://4292565.blog.51cto.com/4282565/1672788


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
