Programming Collective Intelligence: Decision Tree Modeling (I)

This article introduces a very popular classification algorithm: the decision tree.

Running example: predicting which users of a website are willing to pay for certain premium membership features.

1. Importing the data

Create a new file named treepredict.py and enter the data below. The columns are: referring website, location, whether the user read the FAQ, page views, and the service type the user chose.

my_data=[['slashdot','USA','yes',18,'None'],
         ['google','France','yes',23,'Premium'],
         ['digg','USA','yes',24,'Basic'],
         ['kiwitobes','France','yes',23,'Basic'],
         ['google','UK','no',21,'Premium'],
         ['(direct)','New Zealand','no',12,'None'],
         ['(direct)','UK','no',21,'Basic'],
         ['google','USA','no',24,'Premium'],
         ['slashdot','France','yes',19,'None'],
         ['digg','USA','no',18,'None'],
         ['google','UK','no',18,'None'],
         ['kiwitobes','UK','no',19,'None'],
         ['digg','New Zealand','yes',12,'Basic'],
         ['slashdot','UK','no',21,'None'],
         ['google','UK','yes',18,'Basic'],
         ['kiwitobes','France','yes',19,'Basic']]

2. Introducing decision trees

A decision tree is a simple machine learning method. It is a very intuitive way to classify observed data. After training, a decision tree reads like a series of nested if-then statements arranged in the shape of a tree.

Once we have a decision tree, making a decision becomes very simple: we just follow the path down the tree, answering each question, until we reach a leaf node.

Create a new class named decisionnode, which represents every node in the tree:

class decisionnode:
    def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
        self.col=col
        self.value=value
        self.results=results
        self.tb=tb
        self.fb=fb


Each node has five instance variables, all set during initialization:

1. col is the index of the column to be tested by the node's condition.

2. value is the value the column must match for the result to be true.

3. tb and fb are themselves decisionnodes; they are the child subtrees that the tree follows when the outcome of the test is true or false, respectively.

4. results is a dictionary of outcomes for the current branch. It is None for all nodes except leaf nodes.
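To make the structure concrete, here is a hand-built, purely illustrative one-question tree (the counts in the leaves are made up, not derived from the example data):

```python
class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col
        self.value = value
        self.results = results
        self.tb = tb
        self.fb = fb

# Hand-built one-question tree: "did the user come from google?"
leaf_true = decisionnode(results={'Premium': 3})  # leaf: results is set
leaf_false = decisionnode(results={'None': 5})
root = decisionnode(col=0, value='google', tb=leaf_true, fb=leaf_false)

print(root.col, root.value)  # 0 google
print(root.tb.results)       # {'Premium': 3}
```

Note that only the leaves carry a results dictionary; on the root, results stays None.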

3. Training the tree

We use an algorithm called CART (Classification and Regression Trees) to construct the decision tree. The algorithm first creates a root node, then evaluates all the observed variables in the table to choose the one that best splits the data.

The divideset function splits the list of rows into two datasets based on the values in one column. It takes as parameters the list of rows, a number indicating the column's position in the table, and a reference value to split on. It returns two lists: the first contains the rows that satisfy the condition, the second the rows that do not.

def divideset(rows,column,value):
    # Choose a comparison function depending on whether the value is numeric
    split_function=None
    if isinstance(value,int) or isinstance(value,float):
        split_function=lambda row:row[column]>=value
    else:
        split_function=lambda row:row[column]==value
    # Divide the rows into two sets and return them
    set1=[row for row in rows if split_function(row)]
    set2=[row for row in rows if not split_function(row)]
    return (set1,set2)

Pay attention to the role of the lambda expressions in the code above; if you are unfamiliar with them, see: http://www.cnblogs.com/itdyb/p/5014052.html

 

>>> import treepredict
>>> treepredict.divideset(treepredict.my_data,2,'yes')
([['slashdot', 'USA', 'yes', 18, 'None'], ['google', 'France', 'yes', 23, 'Premium'], ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'], ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic'], ['google', 'UK', 'yes', 18, 'Basic'], ['kiwitobes', 'France', 'yes', 19, 'Basic']], [['google', 'UK', 'no', 21, 'Premium'], ['(direct)', 'New Zealand', 'no', 12, 'None'], ['(direct)', 'UK', 'no', 21, 'Basic'], ['google', 'USA', 'no', 24, 'Premium'], ['digg', 'USA', 'no', 18, 'None'], ['google', 'UK', 'no', 18, 'None'], ['kiwitobes', 'UK', 'no', 19, 'None'], ['slashdot', 'UK', 'no', 21, 'None']])

4. Choosing the best split

First, we need a way to count the occurrences of each outcome in a dataset. The code is as follows:

def uniquecounts(rows):
    results={}
    for row in rows:
        # The result is in the last column
        r=row[len(row)-1]
        if r not in results: results[r]=0
        results[r]+=1
    return results

The function above finds the distinct possible outcomes and returns a dictionary containing the number of occurrences of each.

Next we will examine two measures of how mixed a set is: Gini impurity and entropy.

If you are not familiar with Gini impurity and entropy, see Data Mining: Concepts and Techniques, or look them up online (the derivations are a bit tedious, so I won't write them all out here).

1. Gini impurity: the expected error rate if one of the outcomes from the set is randomly applied to one of the items in the set. The calculation function is as follows:

def giniimpurity(rows):
    total=len(rows)
    counts=uniquecounts(rows)
    imp=0
    for k1 in counts:
        p1=float(counts[k1])/total
        for k2 in counts:
            if k1==k2: continue
            p2=float(counts[k2])/total
            imp+=p1*p2
    return imp

This function computes the probability of each outcome by dividing its count by the total number of rows, then sums the products of each pair of these probabilities. The result is the overall probability that a row of data would be randomly assigned to the wrong outcome. The smaller the value, the better.
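As a quick sanity check on the formula (with made-up toy counts, not the example data), a set with a 3:1 split between two outcomes should give 2 * (3/4) * (1/4) = 0.375:

```python
# Worked check of Gini impurity on a hypothetical 3:1 outcome split.
counts = {'Basic': 3, 'None': 1}
total = sum(counts.values())

imp = 0.0
for k1 in counts:
    p1 = counts[k1] / total
    for k2 in counts:
        if k1 == k2:
            continue
        # Probability of drawing an item with outcome k1
        # but randomly assigning it the label k2
        imp += p1 * (counts[k2] / total)

print(imp)  # 0.375
```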

2. Entropy: measures the degree of disorder in the set, which is basically how mixed the set is. The function is as follows:

def entropy(rows):
    from math import log
    log2=lambda x:log(x)/log(2)
    results=uniquecounts(rows)
    # Sum -p(x)*log2(p(x)) over all outcomes
    ent=0.0
    for r in results.keys():
        p=float(results[r])/len(rows)
        ent=ent-p*log2(p)
    return ent

The formula for entropy is H(X) = E[I(x_i)] = E[log2(1/p(x_i))] = -Σ p(x_i) log2 p(x_i), for i = 1, 2, ..., n.
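To make the formula concrete, here is a small sketch (entropy_of is a hypothetical helper working directly on probabilities, not part of treepredict.py) showing that a 50/50 split yields exactly one bit of entropy, while a 3:1 split yields less and a pure set yields zero:

```python
from math import log

def entropy_of(probs):
    # H = -sum(p * log2(p)) over the outcome probabilities,
    # skipping zero-probability outcomes
    return -sum(p * log(p, 2) for p in probs if p > 0)

print(entropy_of([0.5, 0.5]))    # 1.0  (maximally mixed)
print(entropy_of([0.75, 0.25]))  # about 0.811 (less mixed)
print(entropy_of([1.0]))         # 0.0  (pure set)
```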

>>> treepredict.giniimpurity(treepredict.my_data)
0.6328125
>>> treepredict.entropy(treepredict.my_data)
1.5052408149441479

5. Building the tree recursively

Information gain: the difference between the current entropy and the weighted average of the entropies of the two new groups.
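A minimal, self-contained sketch of this computation on made-up toy rows (entropy and divideset are restated compactly here so the snippet runs on its own): a split that separates the two classes perfectly recovers all of the starting entropy, so the gain is one full bit.

```python
from math import log

def uniquecounts(rows):
    # Count how often each outcome (last column) occurs
    results = {}
    for row in rows:
        results[row[-1]] = results.get(row[-1], 0) + 1
    return results

def entropy(rows):
    ent = 0.0
    for count in uniquecounts(rows).values():
        p = float(count) / len(rows)
        ent -= p * log(p, 2)
    return ent

def divideset(rows, column, value):
    if isinstance(value, (int, float)):
        split = lambda row: row[column] >= value
    else:
        split = lambda row: row[column] == value
    return ([r for r in rows if split(r)],
            [r for r in rows if not split(r)])

# Toy data: column 0 perfectly predicts the outcome.
rows = [['yes', 'Premium'], ['yes', 'Premium'],
        ['no', 'None'], ['no', 'None']]
set1, set2 = divideset(rows, 0, 'yes')

# gain = current entropy - weighted entropy of the two subsets
p = float(len(set1)) / len(rows)
gain = entropy(rows) - p * entropy(set1) - (1 - p) * entropy(set2)
print(gain)  # 1.0
```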

def buildtree(rows,scoref=entropy):
    if len(rows)==0: return decisionnode()
    current_score=scoref(rows)

    # Set up some variables to track the best criteria
    best_gain=0.0
    best_criteria=None
    best_sets=None

    column_count=len(rows[0])-1
    for col in range(0,column_count):
        # Generate the list of different values in this column
        column_values={}
        for row in rows:
            column_values[row[col]]=1
        # Now try dividing the rows up for each value in this column
        for value in column_values.keys():
            (set1,set2)=divideset(rows,col,value)

            # Information gain
            p=float(len(set1))/len(rows)
            gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
            if gain>best_gain and len(set1)>0 and len(set2)>0:
                best_gain=gain
                best_criteria=(col,value)
                best_sets=(set1,set2)

    # Create the sub branches
    if best_gain>0:
        trueBranch=buildtree(best_sets[0],scoref)
        falseBranch=buildtree(best_sets[1],scoref)
        return decisionnode(col=best_criteria[0],value=best_criteria[1],
                            tb=trueBranch,fb=falseBranch)
    else:
        return decisionnode(results=uniquecounts(rows))
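Once a tree exists, classifying a new observation simply means walking down it, as described earlier. The classify function below is a plausible reconstruction, not the book's code, and the tree here is hand-built and hypothetical:

```python
class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col
        self.value = value
        self.results = results
        self.tb = tb
        self.fb = fb

def classify(observation, tree):
    # A leaf carries the result dictionary; otherwise test the
    # observation against this node's criterion and recurse.
    if tree.results is not None:
        return tree.results
    v = observation[tree.col]
    if isinstance(v, (int, float)):
        branch = tree.tb if v >= tree.value else tree.fb
    else:
        branch = tree.tb if v == tree.value else tree.fb
    return classify(observation, branch)

# Hand-built tree: first test page views >= 21, then referrer == 'google'.
tree = decisionnode(col=3, value=21,
                    tb=decisionnode(col=0, value='google',
                                    tb=decisionnode(results={'Premium': 3}),
                                    fb=decisionnode(results={'Basic': 2})),
                    fb=decisionnode(results={'None': 3}))

print(classify(['google', 'UK', 'no', 24], tree))  # {'Premium': 3}
print(classify(['digg', 'USA', 'yes', 12], tree))  # {'None': 3}
```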

For displaying, predicting with, and pruning decision trees, see Programming Collective Intelligence.
