"Scope of Application"
C4.5 addresses classification problems. It is applicable whenever the boundaries between classes in the target problem can be expressed by a tree decomposition or by discriminant rules.
"Properties"
Supervised learning
"Basic Ideas"
Given a dataset in which every instance is described by the same set of attributes and belongs to exactly one category, C4.5 learns a mapping from attribute values to categories. That mapping can then be used to classify new, unseen instances.
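To make the idea of a learned mapping concrete, here is a minimal sketch of how a finished tree could classify a new instance. The nested-dict tree layout, the attribute names, and the `classify` helper are all assumptions made for illustration; they are not taken from any C4.5 release.

```python
# A hand-built decision tree as nested dicts (hypothetical layout, for illustration only).
# Internal node: {"attribute": name, "branches": {value: subtree}}
# Leaf: a category label (a plain string).
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "attribute": "windy",
            "branches": {"true": "no", "false": "yes"},
        },
    },
}


def classify(node, instance):
    """Walk from the root to a leaf, following the branch that matches each tested attribute."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


new_instance = {"outlook": "sunny", "humidity": "normal", "windy": "false"}
print(classify(tree, new_instance))
```

This sketch raises a KeyError for an attribute value that has no branch; handling unseen or missing values is left out on purpose.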
"Algorithm principle"
Input: an attribute-valued dataset D
Tree = {}
if D is "pure" or other stopping criteria are met then
    terminate
end if
for all attributes a ∈ D do
    compute information-theoretic criteria if we split on a
end for
a_best = best attribute according to the criteria computed above
Tree = create a decision node that tests a_best in the root
D_v = induced sub-datasets from D based on a_best
for all D_v do
    Tree_v = C4.5(D_v)
    attach Tree_v to the corresponding branch of Tree
end for
return Tree
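The recursion above can be sketched in Python roughly as follows. This is a simplified illustration, not the reference implementation. It assumes categorical attributes only, uses gain ratio as the single selection criterion, and returns the majority category at a leaf. The names `build_tree`, `split`, `gain_ratio`, and the toy dataset are hypothetical. Continuous attributes, missing values, and pruning are omitted.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def split(rows, attr):
    """Group rows into sub-datasets D_v, one per observed value v of attr."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr], []).append(row)
    return subsets


def gain_ratio(rows, attr, target):
    """Gain(a) / Entropy(a); defined as 0.0 when attr takes only one value."""
    total = len(rows)
    remainder = sum(
        len(subset) / total * entropy([r[target] for r in subset])
        for subset in split(rows, attr).values()
    )
    gain = entropy([r[target] for r in rows]) - remainder
    split_entropy = entropy([r[attr] for r in rows])
    return gain / split_entropy if split_entropy > 0 else 0.0


def build_tree(rows, attributes, target):
    """Recursively grow a tree; leaves are category labels, nodes are dicts."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: the node is pure, or no attributes remain to test.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: gain_ratio(rows, a, target))
    # Extra stopping criterion (an assumption of this sketch): no useful split.
    if gain_ratio(rows, best, target) <= 0:
        return majority
    remaining = [a for a in attributes if a != best]
    return {
        "attribute": best,
        "branches": {
            value: build_tree(subset, remaining, target)
            for value, subset in split(rows, best).items()
        },
    }


# Toy dataset (made up for this example).
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
print(build_tree(rows, ["outlook", "windy"], "play"))
```

Each recursive call receives only the sub-dataset for one branch, so an attribute already tested on the path from the root is removed from the candidate list before recursing.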
"Algorithm Elaboration"
The root node represents the given dataset. Starting at the root, each node tests one specific attribute and divides that node's dataset into smaller subsets, each of which is represented by a subtree. The process repeats until a subset becomes "pure", that is, all instances in the subset belong to the same category, at which point that branch of the tree stops growing.
"Algorithm Essentials"
1. Information-Theoretic Criteria
C4.5 uses information-theoretic criteria such as gain (Gain) and gain ratio (Gain Ratio) to select the attribute on which to split a node. Gain measures the reduction in the entropy of the class distribution achieved by a split: the greater the gain, the better the attribute separates the classes. Its disadvantage is a bias toward attributes with many outcomes (distinct values). Gain ratio corrects this bias, so it is the default criterion in C4.5.
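$$\mathrm{GainRatio}(a) = \frac{\mathrm{Gain}(a)}{\mathrm{Entropy}(a)}$$
where
$$\mathrm{Gain}(a) = \mathrm{Entropy}(\text{Category in } D) - \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(\text{Category in } D_v)$$
$$\mathrm{Entropy} = -\sum p \log_2 p$$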
Here D is the entire dataset and D_v is the subset of D whose instances share the same value v of attribute a; their categories may differ.
Entropy(a), the entropy of attribute a, depends only on the probability distribution of the attribute's values, regardless of category.
Gain(a), the gain of attribute a, depends on the category.
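The following sketch works through these formulas on made-up data to show the bias that gain ratio is meant to correct. The 16-row dataset and the names `row_id` and `humidity` are hypothetical: `row_id` is unique per instance (many outcomes, no predictive value on new data), while `humidity` is a binary attribute that predicts the category with two exceptions.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy = -sum(p * log2(p)) over the distribution of the given values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain(attr_values, labels):
    """Entropy(Category in D) minus the weighted entropy of each subset D_v."""
    total = len(labels)
    groups = {}
    for value, label in zip(attr_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(attr_values, labels):
    """Gain(a) / Entropy(a), where Entropy(a) ignores the category entirely."""
    split_entropy = entropy(attr_values)
    return gain(attr_values, labels) / split_entropy if split_entropy > 0 else 0.0


# Made-up data: 8 "yes" followed by 8 "no".
labels = ["yes"] * 8 + ["no"] * 8
row_id = list(range(16))                                        # one outcome per instance
humidity = ["normal"] * 7 + ["high"] + ["high"] * 7 + ["normal"]  # two exceptions

for name, values in [("row_id", row_id), ("humidity", humidity)]:
    print(name, round(gain(values, labels), 3), round(gain_ratio(values, labels), 3))
```

Plain gain ranks `row_id` first because every one-instance subset is trivially pure, whereas gain ratio divides by the large Entropy(a) of a 16-valued attribute and ranks `humidity` first.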
"Code Implementation"
http://www.rulequest.com/Personal/
The content of this article is compiled from "The Top Ten Algorithms in Data Mining" (Tsinghua University Press).
This article is from the "Lucas" blog; please retain this source link when reposting: http://4292565.blog.51cto.com/4282565/1672788