Reprint: ID3 algorithm

Source: Internet
Author: User
Tags: id3

ID3 algorithm

The ID3 algorithm is a classification prediction algorithm presented by J. Ross Quinlan in 1975. The core of the algorithm is "information entropy".

Information entropy is a measure of the amount of information contained in a set of data, based on the probabilities of its values. The more orderly a group of data is, the lower its information entropy. In the extreme case where only one value occurs and every other value has probability 0, the entropy equals 0: only that one value can possibly occur, so the outcome is already determined and the data gives us no new information. Conversely, the more chaotic a set of data is, the higher its entropy; the entropy is greatest when the data is evenly distributed over its values, because then we know the least about which value will occur. Suppose the group of data is made up of {d1, d2, ..., dn}, whose total is sum; then the formula for the information entropy is:

E = -Σ (di/sum) · log2(di/sum), summed over i = 1 ... n
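
As a minimal sketch of this formula (the function name entropy and the toy label lists below are illustrative choices of mine, not part of the original article), the calculation can be written in Python as follows:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # log2(total / c) is the same as -log2(c / total)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(entropy(["yes"] * 5))    # 0.0 -- only one value occurs, outcome fully determined
print(entropy(["yes", "no"]))  # 1.0 -- evenly split, maximum disorder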

A classification prediction algorithm belongs to supervised learning: starting from the training data, it processes the reference attributes level by level according to how strongly the target attribute depends on each of them. This level-by-level processing is embodied in the creation of a decision tree; the purpose is to generate a discriminant tree and derive rules from it, so that future data can be classified. Take the following data as an example:
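
The original table did not survive the reprint. What follows is an assumed reconstruction of the 14 records (the classic buys_computer training set), chosen because it is consistent with every count and gain value quoted below; each tuple holds age, income, student, credit_rating and the target buys_computer:

# Assumed reconstruction of the 14-record training set (not the author's original table).
# Columns: age, income, student, credit_rating, buys_computer (target).
records = [
    ("youth",       "high",   "no",  "fair",      "no"),
    ("youth",       "high",   "no",  "excellent", "no"),
    ("middle_aged", "high",   "no",  "fair",      "yes"),
    ("senior",      "medium", "no",  "fair",      "yes"),
    ("senior",      "low",    "yes", "fair",      "yes"),
    ("senior",      "low",    "yes", "excellent", "no"),
    ("middle_aged", "low",    "yes", "excellent", "yes"),
    ("youth",       "medium", "no",  "fair",      "no"),
    ("youth",       "low",    "yes", "fair",      "yes"),
    ("senior",      "medium", "yes", "fair",      "yes"),
    ("youth",       "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no",  "excellent", "yes"),
    ("middle_aged", "high",   "yes", "fair",      "yes"),
    ("senior",      "medium", "no",  "excellent", "no"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]  # reference attributes
TARGET = 4  # column index of the target attribute (buys_computer)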


There are 14 records in total. The target attribute is whether a computer is bought, with two possible values, yes and no. There are 4 reference attributes, namely age, income, student and credit_rating. The attribute age has 3 possible values: youth, middle_aged and senior. The attribute income has 3 possible values: high, medium and low. The attribute student has 2 possible values: no and yes. The attribute credit_rating has 2 possible values: fair and excellent. We first compute the information entropy of the target attribute:

E(target) = -(5/14)·log2(5/14) - (9/14)·log2(9/14) ≈ 0.940

The 5 in the formula refers to the 5 no records, the 9 to the 9 yes records, and 14 is the total number of records. Next, for each reference attribute, we compute the information entropy of the target attribute under each of that attribute's values. Take the attribute age as an example: it has 3 possible values, youth, middle_aged and senior. Consider youth first. youth occurs 5 times in total, 3 times with no and 2 times with yes, so its information entropy is:

E(youth) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) ≈ 0.971
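
Using the entropy helper and the reconstructed records from the sketches above, both of these values can be checked directly:

all_labels = [r[TARGET] for r in records]
print(round(entropy(all_labels), 3))    # 0.94  -- entropy of the target attribute

youth_labels = [r[TARGET] for r in records if r[0] == "youth"]
print(round(entropy(youth_labels), 3))  # 0.971 -- entropy within age = youth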


Similarly, the information entropies for middle_aged and senior are 0 and 0.971, respectively. The information entropy for the entire attribute age is their weighted average:

E(age) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 ≈ 0.694
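
Continuing the sketch, this weighted average can be computed for any reference attribute (the helper name attribute_entropy is my own):

def attribute_entropy(records, attr_index, target_index):
    """Weighted average entropy of the target after splitting on one attribute."""
    total = len(records)
    result = 0.0
    for value in set(r[attr_index] for r in records):
        subset = [r[target_index] for r in records if r[attr_index] == value]
        result += (len(subset) / total) * entropy(subset)
    return result

print(round(attribute_entropy(records, 0, TARGET), 3))  # 0.694 for age (column 0)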

Next we introduce the concept of information gain, denoted Gain(A) for an attribute A. It refers to the effective reduction of information entropy: the larger the gain, the more entropy the target attribute loses once the reference attribute is known, and the closer to the top of the decision tree that attribute should be placed:

Gain(age) = E(target) - E(age) ≈ 0.940 - 0.694 = 0.246

Similarly we obtain Gain(income) = 0.029, Gain(student) = 0.151 and Gain(credit_rating) = 0.048. The maximum is Gain(age), so we first divide the data into 3 categories according to the reference attribute age, as follows:
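
Continuing the sketch, the gains of all four reference attributes can be computed and compared (the last digit differs slightly from the article in two places, because the article rounds intermediate results):

def information_gain(records, attr_index, target_index):
    """Reduction in target entropy achieved by splitting on the given attribute."""
    before = entropy([r[target_index] for r in records])
    return before - attribute_entropy(records, attr_index, target_index)

for i, name in enumerate(ATTRIBUTES):
    print(name, round(information_gain(records, i, TARGET), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# age has the largest gain, so it becomes the top-level split of the decision tree.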

Then the classification proceeds recursively in the same way within each category (a recursive sketch follows the list below). The conditions for terminating the recursion are:

1. A class contains only one value of the target attribute; for example, when age is middle_aged here, the target attribute is yes for every record.

2. Within a class, the proportion of one target value reaches a given threshold; for example, when age is youth here, 60% of the records are no. In practice, of course, the threshold used is far greater than 60%.
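
Putting the pieces together, a rough recursive sketch of the tree-building step might look as follows (the threshold value 0.9 and the nested-dictionary tree representation are illustrative assumptions of mine, not part of the original article):

def id3(records, attr_indices, target_index, threshold=0.9):
    """Recursively build a decision tree; returns a class label (leaf) or a nested dict."""
    labels = [r[target_index] for r in records]
    majority, count = Counter(labels).most_common(1)[0]
    # Termination conditions 1 and 2: the node is pure enough, or no attributes are left.
    if count / len(labels) >= threshold or not attr_indices:
        return majority  # leaf node
    # Otherwise split on the attribute with the largest information gain.
    best = max(attr_indices, key=lambda i: information_gain(records, i, target_index))
    remaining = [i for i in attr_indices if i != best]
    tree = {}
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        tree[(ATTRIBUTES[best], value)] = id3(subset, remaining, target_index, threshold)
    return tree

print(id3(records, [0, 1, 2, 3], TARGET))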

The ID3 algorithm has many variants, but the basic idea is the same. However, it may need to traverse the data set many times, which makes it less efficient than, for example, naïve Bayesian classification.
