Entropy and information gain in decision tree algorithms

Source: Internet
Author: User

What is a decision tree? Why use a decision tree?

   

A decision tree is a binary tree or a multiway tree. Its job is to split a large data set into smaller and smaller subsets. Decision tree logic shows up in daily life all the time, from small tasks such as classifying users to large ones such as decision support. It is used far more widely than people realize.

As for why decision trees are used, my personal view is that it is because the algorithm is simple: the code can be implemented mostly with if-else statements. The algorithm has nevertheless evolved over time, from ID3 to C4.5 to C5.0.
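To illustrate the if-else point, here is a minimal sketch of a hand-written tree in Python. The function name, the features, and the thresholds (80 kg, 50 years) are made up for illustration and do not come from any real data set.

```python
def classify_customer(age: int, weight_kg: float) -> str:
    """A tiny hand-written decision tree; each if/else is one split."""
    if weight_kg >= 80:        # root split on weight (hypothetical threshold)
        if age < 50:           # second-level split on age (hypothetical threshold)
            return "likely buyer"
        return "possible buyer"
    return "unlikely buyer"


print(classify_customer(age=35, weight_kg=92))  # likely buyer
```

A tree-building algorithm such as ID3 produces this kind of nested structure automatically instead of having someone write the conditions by hand.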

It has two main steps: 1. build the tree, 2. prune the tree.

Building the tree is a question of how to split your data: on which features? Demographic data, for example, can be split on age, height, weight, education and so on. Choosing which indicator to split on is the foundation of tree building. Comparing how good these indicators are is therefore the key point, and that is where entropy and information gain, the subject of this article, come in.


Entropy

 

Entropy is closely tied to information gain: entropy measures how mixed the target classes are in a set of data, and information gain is the drop in entropy obtained by splitting on an indicator. Take demographic data again: who the target population is determines how much each indicator matters. For example, if the target group is people who want to buy your diet pills, then weight is especially important among all the indicators. That much we can tell off the top of our heads, but when there are too many indicators, entropy is used to decide which ones are important. It effectively turns the indicators into numbers that we can compare.
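For reference, the quantities can be written down as follows. This uses the standard Shannon entropy that ID3 is based on; the symbols are my own notation, not the original author's: S is a set of records, p_k is the fraction of S in target class k, A is an indicator, and S_v is the subset of S where A takes the value v.

```latex
H(S) = -\sum_{k} p_k \log_2 p_k
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

A small worked example with made-up numbers: out of 10 people, 5 buy the diet pills. Splitting on weight gives a heavy group of 5 (4 buyers) and a light group of 5 (1 buyer).

```latex
H(S) = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1
\\
H(S_{\text{heavy}}) = H(S_{\text{light}}) = -\left(0.8 \log_2 0.8 + 0.2 \log_2 0.2\right) \approx 0.722
\\
\mathrm{Gain}(S, \text{weight}) = 1 - \left(\tfrac{5}{10} \cdot 0.722 + \tfrac{5}{10} \cdot 0.722\right) \approx 0.278
```

The larger the gain, the better the indicator separates buyers from non-buyers.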

In modern society, the amount of data in our hands is very large. In some state-owned institutions, for example, most data sets have more than 20 fields (the indicators mentioned above). Whether a given field is one we need depends on our analytical goal. The data in each column is either a categorical variable or a continuous variable. A decision tree works on categorical variables, but continuous variables can also be handled once they are converted into discrete ones.
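One simple way to do that conversion is to bin a continuous value into labeled ranges. The sketch below is only an illustration: the function name `discretize`, the cut points (30 and 60), and the labels are all assumptions, and in practice the cut points would be chosen from the data.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a continuous value to a category label.

    cut_points must be sorted ascending, and there must be exactly
    one more label than there are cut points.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly len(cut_points) + 1 labels")
    # bisect_right counts how many cut points are <= value,
    # which is the index of the bin the value falls into.
    return labels[bisect_right(cut_points, value)]


ages = [22, 30, 47, 65]
print([discretize(a, [30, 60], ["young", "middle", "senior"]) for a in ages])
# ['young', 'middle', 'middle', 'senior']
```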

Once you understand entropy, you can see which variable to choose at each split point. This step completes the tree-building process.
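As a rough sketch of how a split variable might be chosen, the Python below computes Shannon entropy for the target column and picks the feature whose split lowers it the most, which is the ID3-style criterion. The helper names and the eight-row toy data set are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target before the split minus the weighted entropy after."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_feature(rows, features, target):
    """Return the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))


# Hypothetical toy data: does a person buy the diet pills?
rows = [
    {"weight": "heavy", "education": "college", "buys": "yes"},
    {"weight": "heavy", "education": "college", "buys": "no"},
    {"weight": "heavy", "education": "school", "buys": "yes"},
    {"weight": "heavy", "education": "school", "buys": "yes"},
    {"weight": "light", "education": "college", "buys": "yes"},
    {"weight": "light", "education": "college", "buys": "no"},
    {"weight": "light", "education": "school", "buys": "no"},
    {"weight": "light", "education": "school", "buys": "no"},
]

print(best_feature(rows, ["weight", "education"], "buys"))  # weight
```

In this toy data, education splits buyers evenly and so gains nothing, while weight separates them partially, so weight is chosen for the first split. Building the full tree means repeating the same choice on each resulting subset.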


Pruning the tree

  

When a tree grows particularly lush, it is time to cut it back. Gardeners call this pruning, and the decision tree algorithm uses the same term.

Pruning is not complicated. You take the tree that entropy produced and hold it to a uniform set of requirements: do you want a binary tree or a multiway tree? How many levels do you want? How much data must the smallest branch contain? These are all factors to consider. Once pruning reaches its final stage, start working with the business unit to verify the algorithm. To return to the example above, check whether the users you identified actually buy your product when you contact them.
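The questions about depth and minimum branch size can be expressed as a stopping rule that is checked before each split. This is one possible reading (limiting growth up front, sometimes called pre-pruning) rather than a definitive description of the author's procedure; the function name and the default limits are assumptions.

```python
def should_stop(labels, depth, max_depth=3, min_samples=5):
    """Decide whether a node should become a leaf instead of being split.

    labels      -- target values of the records that reached this node
    depth       -- how many splits lie above this node
    max_depth   -- "how many levels do you want?" (hypothetical default)
    min_samples -- smallest amount of data a branch may hold (hypothetical default)
    """
    if depth >= max_depth:
        return True                 # the tree is already deep enough
    if len(labels) < min_samples:
        return True                 # too little data to justify a split
    if len(set(labels)) <= 1:
        return True                 # node is pure, nothing left to separate
    return False


print(should_stop(["yes", "no"] * 10, depth=1))  # False
```

Tightening `max_depth` or raising `min_samples` gives a smaller tree that is easier to explain to the business unit, at the cost of coarser predictions.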

Then, based on the feedback from the business unit, you can act on it directly (mail out products) or work backward (adjust product features for precise marketing).


Summary

  

Once the steps above are done, split the original data in a 4:3:3 ratio and feed it into the corresponding model for classification and validation. I believe that by continually improving our models and algorithms during data analysis, we are bound to reach our goal!
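A 4:3:3 split might look like the sketch below. The text does not say what the three parts are used for, so treating them as training, validation, and hold-out sets is an assumption, as are the function name and the fixed random seed.

```python
import random


def split_433(rows, seed=0):
    """Shuffle the rows and split them 40% / 30% / 30%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n = len(shuffled)
    first_cut = n * 4 // 10
    second_cut = n * 7 // 10
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


# Assumed roles for the three parts: training, validation, hold-out.
train, valid, holdout = split_433(range(100))
print(len(train), len(valid), len(holdout))  # 40 30 30
```

Shuffling before cutting matters when the original data is sorted, for example by date or by customer ID, because otherwise the three parts would not be comparable.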

The above is my own understanding of the key parts of the decision tree algorithm. It surely has shortcomings, so I hope readers will point out any mistakes so that we can learn from each other and improve together!

This article is from the Data Mining and Visualization blog; please be sure to keep this source: http://bingyang.blog.51cto.com/533655/1859824
