Programming Collective Intelligence: Decision Tree Modeling (I)
This article introduces a very popular classification algorithm: the decision tree.
Example: predicting whether a user of a website is willing to pay for certain advanced membership features.
I. Import the data
Create a new file named treepredict.py and enter the following data. The columns are: source (referring) website, location, whether the user has read the FAQ, number of pages viewed, and the service type chosen (the result).
my_data=[['slashdot','USA','yes',18,'None'],
         ['google','France','yes',23,'Premium'],
         ['digg','USA','yes',24,'Basic'],
         ['kiwitobes','France','yes',23,'Basic'],
         ['google','UK','no',21,'Premium'],
         ['(direct)','New Zealand','no',12,'None'],
         ['(direct)','UK','no',21,'Basic'],
         ['google','USA','no',24,'Premium'],
         ['slashdot','France','yes',19,'None'],
         ['digg','USA','no',18,'None'],
         ['google','UK','no',18,'None'],
         ['kiwitobes','UK','no',19,'None'],
         ['digg','New Zealand','yes',12,'Basic'],
         ['slashdot','UK','no',21,'None'],
         ['google','UK','yes',18,'Basic'],
         ['kiwitobes','France','yes',19,'Basic']]
II. Introduction to decision trees
A decision tree is a simple machine learning method. It is a very intuitive way of classifying observed data: after training, the tree looks like a series of if-then statements arranged into a tree shape.
Once we have a decision tree, the decision-making process becomes very simple: we just follow the path down the tree, answering the question at each node, until we reach a leaf node.
Create a new class named decisionnode, which represents each node in the tree:
class decisionnode:
    def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
        self.col=col
        self.value=value
        self.results=results
        self.tb=tb
        self.fb=fb
Each node has five instance variables, all of which are set during initialization:
1. col is the index of the column holding the criterion to be tested.
2. value is the value that the column must match for the result to be true.
3. tb and fb are also decisionnode instances; they are the child nodes (subtrees) that are followed when the result is true or false, respectively.
4. results stores a dictionary of results for this branch. It is None for every node except leaf nodes.
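To make the structure concrete, here is a minimal, hand-built sketch of a two-leaf tree using the class above. The column index, split value, and leaf counts here are made up purely for illustration:

# Hypothetical root node that tests whether column 3 (page views) >= 20.
# Its two children are leaf nodes, so only their results dictionaries are set.
leaf_true=decisionnode(results={'Premium':3})
leaf_false=decisionnode(results={'None':4,'Basic':2})
root=decisionnode(col=3,value=20,tb=leaf_true,fb=leaf_false)

In the real tree these nodes will be created automatically by the training algorithm described next.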
III. Training the tree
We use an algorithm called CART (Classification and Regression Trees) to construct the decision tree. The algorithm first creates a root node, then evaluates all the observed variables in the table and chooses the one that best splits the data.
The divideset function splits a list of rows into two sets based on the data in a specified column. It accepts the list of rows, a number indicating the index of the column, and a reference value to split on. It returns two lists: the first contains the rows that satisfy the condition, and the second contains the rows that do not.
def divideset(rows,column,value):
    # Make a function that tells us if a row belongs in the first group (true)
    # or the second group (false)
    split_function=None
    if isinstance(value,int) or isinstance(value,float):
        split_function=lambda row:row[column]>=value
    else:
        split_function=lambda row:row[column]==value
    # Divide the rows into two sets and return them
    set1=[row for row in rows if split_function(row)]
    set2=[row for row in rows if not split_function(row)]
    return (set1,set2)
Note the role of the lambda expressions in the code above; if you are not familiar with them, see: http://www.cnblogs.com/itdyb/p/5014052.html
>>> import treepredict
>>> treepredict.divideset(treepredict.my_data,2,'yes')
([['slashdot', 'USA', 'yes', 18, 'None'], ['google', 'France', 'yes', 23, 'Premium'], ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'], ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic'], ['google', 'UK', 'yes', 18, 'Basic'], ['kiwitobes', 'France', 'yes', 19, 'Basic']],
 [['google', 'UK', 'no', 21, 'Premium'], ['(direct)', 'New Zealand', 'no', 12, 'None'], ['(direct)', 'UK', 'no', 21, 'Basic'], ['google', 'USA', 'no', 24, 'Premium'], ['digg', 'USA', 'no', 18, 'None'], ['google', 'UK', 'no', 18, 'None'], ['kiwitobes', 'UK', 'no', 19, 'None'], ['slashdot', 'UK', 'no', 21, 'None']])
IV. Selecting the best splitting scheme
First, we need a way to count the occurrences of each possible result in a dataset. The code is as follows:
def uniquecounts(rows):
    results={}
    for row in rows:
        # The result is in the last column of each row
        r=row[len(row)-1]
        if r not in results: results[r]=0
        results[r]+=1
    return results
The function above finds all the different possible results and returns a dictionary containing the number of occurrences of each one.
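For example, running it on my_data counts the service types in the last column (the dictionary ordering may differ in your interpreter):

>>> treepredict.uniquecounts(treepredict.my_data)
{'None': 7, 'Premium': 3, 'Basic': 6}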
Next we will look at two ways of measuring how mixed a set is: Gini impurity and entropy.
If you are not familiar with Gini impurity and entropy, see Data Mining: Concepts and Techniques or look them up online; the key formulas are also sketched briefly below.
1. Gini impurity: the expected error rate if one of the results from the set were randomly applied to one of the items in the set. The calculation function is as follows:
def giniimpurity(rows):
    total=len(rows)
    counts=uniquecounts(rows)
    imp=0
    for k1 in counts:
        p1=float(counts[k1])/total
        for k2 in counts:
            if k1==k2: continue
            p2=float(counts[k2])/total
            imp+=p1*p2
    return imp
This function calculates the probability of each possible result by dividing the number of times it occurs by the total number of rows in the set. It then sums the products of every pair of these probabilities, which gives the overall probability that a row would be randomly assigned to the wrong result. The smaller the value, the better.
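In formula form, if p_i is the probability of result i, the quantity computed above is I_G = Σ_{i≠j} p_i·p_j = 1 - Σ_i p_i². For my_data the result counts are 7 'None', 6 'Basic', and 3 'Premium' out of 16 rows, so I_G = 1 - ((7/16)² + (6/16)² + (3/16)²) = 0.6328125, which matches the interpreter output shown further below.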
2. Entropy: measures the degree of disorder in the set, which here is essentially how mixed the set is. The function is as follows:
def entropy(rows):
    from math import log
    log2=lambda x:log(x)/log(2)
    results=uniquecounts(rows)
    ent=0.0
    for r in results.keys():
        p=float(results[r])/len(rows)
        ent=ent-p*log2(p)
    return ent
The formula for entropy is H(X) = E[I(x_i)] = E[log2(1/p(x_i))] = -Σ p(x_i)·log2 p(x_i), summed over i = 1, 2, ..., n.
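Plugging in the probabilities for my_data gives H = -(7/16)·log2(7/16) - (6/16)·log2(6/16) - (3/16)·log2(3/16) ≈ 1.5052, in agreement with the interpreter output below.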
>>> treepredict.giniimpurity(treepredict.my_data)
0.6328125
>>> treepredict.entropy(treepredict.my_data)
1.5052408149441479
V. Building the tree recursively
Information gain: the difference between the current entropy and the weighted-average entropy of the two new groups. In the code below it is computed as gain = current_score - p*scoref(set1) - (1-p)*scoref(set2), where p is the fraction of rows placed in set1. The buildtree function tries every column and value, keeps the split with the highest gain, and recurses on the two resulting sets:
def buildtree(rows,scoref=entropy):
    if len(rows)==0: return decisionnode()
    current_score=scoref(rows)

    # Set up some variables to track the best criteria
    best_gain=0.0
    best_criteria=None
    best_sets=None

    column_count=len(rows[0])-1
    for col in range(0,column_count):
        # Generate the list of different values in this column
        column_values={}
        for row in rows:
            column_values[row[col]]=1
        # Now try dividing the rows up for each value in this column
        for value in column_values.keys():
            (set1,set2)=divideset(rows,col,value)

            # Information gain
            p=float(len(set1))/len(rows)
            gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
            if gain>best_gain and len(set1)>0 and len(set2)>0:
                best_gain=gain
                best_criteria=(col,value)
                best_sets=(set1,set2)
    # Create the sub branches
    if best_gain>0:
        trueBranch=buildtree(best_sets[0])
        falseBranch=buildtree(best_sets[1])
        return decisionnode(col=best_criteria[0],value=best_criteria[1],
                            tb=trueBranch,fb=falseBranch)
    else:
        return decisionnode(results=uniquecounts(rows))
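As a quick check, a tree can now be trained on the sample data and the root split inspected through the decisionnode attributes; the exact column and value chosen depend on the data above:

>>> tree=treepredict.buildtree(treepredict.my_data)
>>> tree.col, tree.value    # column index and value chosen for the root split

Walking tree.tb and tree.fb in the same way descends into the true and false branches until a node with a non-None results dictionary (a leaf) is reached.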
For displaying decision trees, making predictions with them, and pruning, see Programming Collective Intelligence.