C4.5 Introduction
C4.5 is a family of algorithms used for classification problems in machine learning and data mining. It is a supervised learning method: given a dataset in which each tuple is described by a set of attribute values and belongs to exactly one of several mutually exclusive classes, C4.5 learns a mapping from attribute values to classes. This mapping can then be used to classify new entities whose class is unknown.
Because the ID3 algorithm has some problems in practical applications, Quinlan proposed the C4.5 algorithm. Strictly speaking, C4.5 is an improved version of ID3.
The C4.5 algorithm inherits the advantages of the ID3 algorithm and improves on it in the following ways:
1) It uses the information gain ratio to select attributes, which overcomes the bias of plain information gain toward attributes with many distinct values;
2) It prunes the tree to limit overfitting (in Quinlan's C4.5, error-based pruning applied once the tree has been grown);
3) It can handle continuous attributes by discretizing them;
4) It can handle incomplete data, that is, records with missing attribute values.
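Point 3) can be made concrete with a small example. The Python below is a minimal sketch, not Quinlan's implementation: the function name `best_threshold` and the sample values are hypothetical, and candidate thresholds are taken as midpoints between adjacent distinct values, which is a common simplification. The idea is to sort the continuous values, try each boundary as a binary split, and keep the one with the highest information gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`.

    Returns (None, 0.0) when no split improves on the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, total):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / total
        gain = base - remainder
        if gain > best[1]:
            best = ((low + high) / 2, gain)  # midpoint: a simplifying assumption
    return best


# Hypothetical humidity readings and class labels, for illustration only.
humidity = [65, 70, 75, 80, 85, 90]
play = ["yes", "yes", "yes", "no", "no", "no"]
threshold, gain = best_threshold(humidity, play)
print(threshold, gain)
```

Once a threshold is chosen, the continuous attribute behaves like a two-valued attribute (at most the threshold, or above it) and can compete with the nominal attributes when a node is split.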
The C4.5 algorithm has the following advantages: the resulting classification rules are easy to understand, and the accuracy is high. Its disadvantage is that, while the tree is being constructed, the dataset has to be scanned and sorted repeatedly, which makes the algorithm inefficient. In addition, C4.5 is only suitable for datasets that can reside in memory; when the training set is too large to fit in memory, the program cannot run.
C4.5 classifier
We illustrate how the C4.5 algorithm calculates the information gain and chooses the decision node using a typical, frequently quoted training dataset D.
Four attributes determine whether the activity goes ahead or is canceled. The training set has 4 attributes, namely the attribute set A = {outlook, temperature, humidity, windy}, and 2 class labels, namely the class label set C = {yes, no}, meaning suitable and not suitable for outdoor sports respectively. This is therefore a binary classification problem.
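The calculation can be sketched in Python. This is a minimal sketch under one loud assumption: that D is the classic 14-row weather dataset with nominal attribute values, typed in below from that widely quoted table. The helper names are hypothetical. The code computes the information gain and the gain ratio of each attribute and picks the attribute with the highest gain ratio as the decision node. It omits C4.5's extra heuristic of only considering attributes whose gain is at least average.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Assumed contents of D: (outlook, temperature, humidity, windy, class).
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())


def gain_and_ratio(rows, attr_index):
    """Return (information gain, gain ratio) for the attribute at attr_index."""
    labels = [row[-1] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    total = len(rows)
    # Expected entropy of the class after splitting on the attribute.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy([row[attr_index] for row in rows])
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


scores = {name: gain_and_ratio(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(scores, key=lambda name: scores[name][1])
for name, (gain, ratio) in scores.items():
    print(f"{name:12s} gain={gain:.3f} gain_ratio={ratio:.3f}")
print("decision node:", best)
```

Under this assumption about D, outlook has both the highest information gain (about 0.247) and the highest gain ratio, so it becomes the root decision node. Humidity comes a close second on gain ratio because its split information is only 1 bit.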
Advantages, disadvantages, and algorithm flow of C4.5
The advantage of the C4.5 algorithm is that the classification rules it produces are easy to understand and its accuracy is high.
The disadvantage of the C4.5 algorithm is that, while the tree is being constructed, the dataset has to be scanned and sorted repeatedly, which makes the algorithm inefficient.
C4.5 Algorithm Flow:
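One way to sketch the flow in code is as a recursive procedure: stop when a node is pure or no attributes remain, otherwise split on the attribute with the highest gain ratio and recurse on each subset. The Python below is a simplified sketch only. It assumes nominal attributes, and it leaves out pruning, continuous attributes, and missing values. All names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting (rows, labels) on the nominal attribute attr."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a dict describing a decision node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority  # pure node, or nothing left to split on
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) <= 0:
        return majority  # no attribute is informative here
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        children[value] = build_tree(
            [r for r, _ in pairs], [l for _, l in pairs], remaining
        )
    return {"attribute": best, "children": children}


def classify(tree, row):
    """Walk the tree; raises KeyError for attribute values unseen in training."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree


# Hypothetical toy data, for illustration only.
rows = [
    {"humidity": "high", "windy": "false"},
    {"humidity": "high", "windy": "true"},
    {"humidity": "normal", "windy": "false"},
    {"humidity": "normal", "windy": "true"},
]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels, ["humidity", "windy"])
print(tree)
print(classify(tree, {"humidity": "normal", "windy": "true"}))
```

Representing a decision node as a plain dict keeps the sketch short. A fuller implementation would add the pruning step and a strategy for attribute values that were not seen during training.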
Demo sample
Algorithm test:
Https://github.com/zongtui/zongtui-Algorithm-test
Come with me. Data Mining (--c4.5)