C4.5 algorithm

Source: Internet
Author: User

"Scope of Application"

For classification problems, the C4.5 algorithm can be used whenever the boundaries between classes in the target problem can be described by a tree-like decomposition or by discriminant rules.

"Properties"

Supervised learning

"Basic Ideas"

Given a dataset in which every instance is described by a set of attributes and belongs to exactly one category, the C4.5 algorithm learns a mapping from attribute values to categories. That mapping can then be used to classify new, unseen instances.

"Algorithm Principle"

Input: an attribute-valued dataset D

    Tree = {}
    if D is "pure" or other stopping criteria are met then
        terminate
    end if
    for all attributes a ∈ D do
        compute the information-theoretic criteria if we split on a
    end for
    a_best = best attribute according to the criteria computed above
    Tree = create a decision node that tests a_best in the root
    D_v = induced sub-datasets from D based on a_best
    for all D_v do
        Tree_v = C4.5(D_v)
        attach Tree_v to the corresponding branch of Tree
    end for
    return Tree
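The recursion above can be sketched in Python. This is a minimal illustration under stated assumptions, not Quinlan's implementation: it handles categorical attributes only, uses gain ratio as the sole criterion, and leaves out the pruning, continuous-attribute and missing-value handling that the real C4.5 provides. The names `build_tree`, `partition` and `gain_ratio`, the dict-based tree layout, and the toy weather data are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits, written as the sum of p * log2(1 / p)."""
    total = len(values)
    return sum(n / total * log2(total / n) for n in Counter(values).values())


def partition(rows, attr):
    """Induced sub-datasets: rows grouped by their value of `attr`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr], []).append(row)
    return subsets


def gain_ratio(rows, attr, target):
    """Gain of splitting on `attr`, divided by the entropy of `attr` itself."""
    split_entropy = entropy([row[attr] for row in rows])
    if split_entropy == 0:  # only one value present, so nothing to split
        return 0.0
    remainder = sum(
        len(subset) / len(rows) * entropy([row[target] for row in subset])
        for subset in partition(rows, attr).values()
    )
    return (entropy([row[target] for row in rows]) - remainder) / split_entropy


def build_tree(rows, attrs, target):
    """Return a class label (leaf) or a node {"attr": ..., "branches": ...}."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: the subset is pure, or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority
    scores = {a: gain_ratio(rows, a, target) for a in attrs}
    best = max(scores, key=scores.get)
    if scores[best] == 0:  # no attribute reduces the class entropy
        return majority
    remaining = [a for a in attrs if a != best]
    return {
        "attr": best,
        "branches": {
            value: build_tree(subset, remaining, target)
            for value, subset in partition(rows, best).items()
        },
    }


# Hypothetical toy dataset; "play" is the category to predict.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

On this toy data the root tests `windy`, because its gain ratio (about 0.43) is higher than that of `outlook` (about 0.38), even though `outlook` has the larger raw gain.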

"Algorithm Elaboration"

The root node represents the given dataset. Starting at the root, each node tests one specific attribute and divides that node's dataset into smaller subsets, each represented by a subtree. The process repeats until a subset becomes "pure", that is, all instances in the subset belong to the same category, at which point that branch of the tree stops growing.
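Once a tree has stopped growing, classifying a new instance repeats the same attribute tests from the root down to a leaf. The sketch below assumes a hypothetical nested-dict tree layout (internal nodes carry `"attr"` and `"branches"`, leaves are plain category labels) and a hand-written example tree; neither comes from the original C4.5 code.

```python
def classify(tree, instance, default=None):
    """Follow the attribute tests from the root until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        value = instance.get(node["attr"])
        if value not in node["branches"]:
            # Attribute value never seen at this node: fall back to `default`.
            return default
        node = node["branches"][value]
    return node


# Hand-written example tree (hypothetical): test "windy" first, then "outlook".
example_tree = {
    "attr": "windy",
    "branches": {
        "yes": "no",
        "no": {
            "attr": "outlook",
            "branches": {"sunny": "no", "overcast": "yes", "rainy": "yes"},
        },
    },
}

print(classify(example_tree, {"outlook": "rainy", "windy": "no"}))
print(classify(example_tree, {"outlook": "sunny", "windy": "yes"}))
```

Returning a default for an unseen attribute value is only one possible design choice, made here to keep the sketch short.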

"Algorithm Essentials"

1. Information-theoretic criteria

The C4.5 algorithm uses information-theoretic criteria such as gain (Gain) and gain ratio (Gain Ratio) to choose the attribute on which to split a node into subtrees. Gain measures the reduction in the entropy of the class distribution achieved by the split: the larger the gain, the better the attribute separates the categories. Its drawback is a bias toward attributes that have many distinct values (outcomes). Gain ratio corrects this bias, so gain ratio is the default information-theoretic criterion in C4.5.

$$\mathrm{GainRatio}(a) = \frac{\mathrm{Gain}(a)}{\mathrm{Entropy}(a)}$$

where

$$\mathrm{Gain}(a) = \mathrm{Entropy}(\text{Category in } D) - \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(\text{Category in } D_v)$$

$$\mathrm{Entropy} = -\sum p \log_2 p$$

D is the entire dataset, and D_v is the subset of D whose instances all take the same value v of attribute a; their categories may differ.

Entropy(a), the entropy of attribute a, depends only on the probability distribution of a's values (the proportions |D_v|/|D|) and not on the categories.

Gain(a), the gain of attribute a, does depend on the categories.
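To make the difference between the two criteria concrete, here is a small Python sketch of the formulas above. The function names and the four-row toy dataset are hypothetical; the `id` attribute is deliberately unique per instance to show the bias of plain gain toward many-valued attributes.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits, written as the sum of p * log2(1 / p)."""
    total = len(values)
    return sum(n / total * log2(total / n) for n in Counter(values).values())


def gain(rows, attr, target):
    """Entropy reduction of `target` obtained by splitting `rows` on `attr`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


def gain_ratio(rows, attr, target):
    """Gain divided by the entropy of the attribute's own value distribution."""
    attr_entropy = entropy([row[attr] for row in rows])
    if attr_entropy == 0:  # attribute has a single value: no useful split
        return 0.0
    return gain(rows, attr, target) / attr_entropy


# Hypothetical toy data: "id" is unique per row, "windy" has two values.
rows = [
    {"id": 1, "windy": "yes", "play": "no"},
    {"id": 2, "windy": "yes", "play": "no"},
    {"id": 3, "windy": "no", "play": "yes"},
    {"id": 4, "windy": "no", "play": "yes"},
]
for attr in ("id", "windy"):
    print(attr, gain(rows, attr, "play"), gain_ratio(rows, attr, "play"))
```

Both attributes reach the maximum gain of 1.0 bit here, so gain alone cannot tell them apart. Because `id` has four equally likely values, its own entropy is 2 bits and its gain ratio drops to 0.5, while `windy` keeps a gain ratio of 1.0.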

"Code Implementation"

http://www.rulequest.com/Personal/

The content of this article was compiled with reference to "The Top Ten Algorithms in Data Mining" (Tsinghua University Press).

This article is from the "Lucas" blog; please keep this source link: http://4292565.blog.51cto.com/4282565/1672788
