Come with me. Data Mining (--c4.5)

Source: Internet
Author: User
Tags: id3

C4.5 Introduction

C4.5 is a family of algorithms for classification problems in machine learning and data mining. It is a supervised learning method: given a dataset in which each tuple is described by a set of attribute values and belongs to exactly one of several mutually exclusive classes, C4.5 learns a mapping from attribute values to classes. That mapping can then be used to classify new entities whose class is unknown.

Because the ID3 algorithm has some problems in practical applications, Quinlan proposed the C4.5 algorithm. Strictly speaking, C4.5 is an improved version of ID3.

The C4.5 algorithm inherits the advantages of the ID3 algorithm and improves on it in the following ways:

1) It selects attributes by information gain ratio, which overcomes the bias of plain information gain toward attributes with many distinct values;

2) It prunes in the process of tree construction;

3) It can discretize continuous attributes;

4) It can process incomplete data.
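Improvement 1) can be illustrated with a small calculation. The following is a minimal sketch, not C4.5's actual code: the function names and the toy data are hypothetical, and it only handles categorical attributes. It compares a unique row identifier (many values, no predictive use) with a two-valued attribute.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in a list."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_and_ratio(values, labels):
    """Return (information gain, gain ratio) for one categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Hypothetical toy data, chosen only to show the effect.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
row_id = list(range(8))                             # unique per row
useful = ["a", "a", "b", "b", "a", "b", "a", "a"]   # two values

gain_id, ratio_id = gain_and_ratio(row_id, labels)
gain_useful, ratio_useful = gain_and_ratio(useful, labels)
print(f"row_id: gain={gain_id:.3f} ratio={ratio_id:.3f}")
print(f"useful: gain={gain_useful:.3f} ratio={ratio_useful:.3f}")
```

On this toy data the row identifier has the larger information gain (1.0 versus about 0.55), so ID3 would pick it, but the smaller gain ratio (about 0.33 versus about 0.57), because its split information is large.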

The C4.5 algorithm has the following advantages: the classification rules it produces are easy to understand, and its accuracy is high. Its disadvantage is that the dataset must be scanned and sorted repeatedly while the tree is being built, which makes the algorithm inefficient. In addition, C4.5 is only suitable for datasets that fit in memory; when the training set is too large to fit in memory, the program cannot run.
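One source of the repeated sorting is the handling of continuous attributes: to find a cut point, the values of the attribute are sorted and each boundary between adjacent distinct values is tried. The sketch below is a simplified illustration under assumptions: it scores candidates by information gain and places the cut at the midpoint between neighbors, and the function name and the readings are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`."""
    # This sort is repeated for every continuous attribute at every node.
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, total):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut point between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = (base
                - len(left) / total * entropy(left)
                - len(right) / total * entropy(right))
        if gain > best[1]:
            best = ((low + high) / 2, gain)  # midpoint: an assumption of this sketch
    return best


# Hypothetical numeric readings, not taken from the article's dataset.
humidity = [85, 70, 95, 65, 80, 90, 75]
play = ["no", "yes", "no", "yes", "yes", "no", "yes"]
threshold, gain = best_threshold(humidity, play)
print(threshold, round(gain, 3))
```

Because the sort happens inside the function, a tree with many nodes and many continuous attributes pays for it again and again, which is the inefficiency described above.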

C4.5 classifier

We illustrate how the C4.5 algorithm calculates information gain and chooses decision nodes using a typical, frequently cited training dataset D.

Four attributes determine whether the activity goes ahead or is canceled. The training set has 4 attributes, namely the attribute set A = {outlook, temperature, humidity, windy}, and 2 class labels, namely the class label set C = {yes, no}, meaning suitable and not suitable for outdoor sports respectively. This is a binary classification problem.
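The calculation for the root node can be sketched as follows. The rows of D are not reproduced in this text, so the sketch assumes D is the widely reproduced 14-row weather dataset, whose attribute names and class labels match those given above; if the article's table differs, the numbers will differ too. The helper names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy"]
# Assumed data: the classic 14-row weather dataset (last column is the class).
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in a list."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_and_ratio(values, labels):
    """Return (information gain, gain ratio) for one categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain, (gain / split_info if split_info > 0 else 0.0)


def column(index):
    """Extract one column of DATA as a list."""
    return [row[index] for row in DATA]


labels = column(len(ATTRS))
scores = {name: gain_and_ratio(column(i), labels) for i, name in enumerate(ATTRS)}
for name, (gain, ratio) in scores.items():
    print(f"{name:12s} gain={gain:.3f} gain_ratio={ratio:.3f}")

# The decision node is the attribute with the highest gain ratio.
root = max(scores, key=lambda name: scores[name][1])
print("root:", root)
```

Under this assumption, outlook has both the highest information gain and the highest gain ratio, so it becomes the root decision node.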

Advantages and disadvantages of C4.5 and algorithm flow

The advantage of the C4.5 algorithm is that its classification rules are easy to understand and its accuracy is high.

The disadvantage of the C4.5 algorithm is that the dataset must be scanned and sorted repeatedly while the tree is being built, which makes the algorithm inefficient.

C4.5 Algorithm Flow:
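The flow is not spelled out in this text, so what follows is only a rough sketch of the usual recursive procedure, under stated assumptions: it covers categorical attributes only, omits pruning, continuous attributes, and missing values, and all function names and the six-row dataset are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in a list."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(rows, attr):
    """Gain ratio of splitting `rows` (dicts with a "class" key) on `attr`."""
    total = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row["class"])
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy([row["class"] for row in rows]) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [row["class"] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Step 1: stop if the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Step 2: choose the attribute with the highest gain ratio.
    best = max(attrs, key=lambda a: gain_ratio(rows, a))
    if gain_ratio(rows, best) <= 0:
        return majority
    # Step 3: split on its values and recurse on each subset.
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining)
    return {best: branches}


def classify(tree, row, default=None):
    """Walk the tree for one row; return `default` for an unseen value."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        if row[attr] not in branches:
            return default
        tree = branches[row[attr]]
    return tree


# Hypothetical six-row dataset, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "false", "class": "no"},
    {"outlook": "sunny", "windy": "true", "class": "no"},
    {"outlook": "overcast", "windy": "false", "class": "yes"},
    {"outlook": "overcast", "windy": "true", "class": "yes"},
    {"outlook": "rainy", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "true", "class": "no"},
]
tree = build_tree(rows, ["outlook", "windy"])
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "true"}))
```

A nested dictionary keeps the sketch short; a fuller implementation would typically use node objects so that pruning statistics can be stored at each node.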

Demo sample

Algorithm test:

https://github.com/zongtui/zongtui-Algorithm-test
