Python Decision Tree


1. Introduction to Decision Trees

See: http://www.cnblogs.com/lufangtao/archive/2013/05/30/3103588.html

2. Pseudo-code for the Implementation

"Read training data" "find the possible values of each attribute" "recursively call the function for creating a decision tree" "para: node, remaining example, the number of remaining attributes "if" is 0 "return most_of_result else if" the remaining samples belong to the same category (yes/no) "return yes/no else: "Calculate the entropy increase of each remaining attribute" and find the corresponding attribute with the maximum entropy increase, that is, the optimal classification attribute. "" classify by the optimal classification attribute. For each branch, recursively call and create a function to obtain the entire decision tree"

3. Python Data Structure Design

1. Dataset: training_data, used to store the two-dimensional training data.

A two-dimensional list. To extract one column of a two-dimensional list, you can use zip(*dataset)[num] (Python 2), as in the example below.
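For example, with a made-up two-row dataset (note that in Python 3, zip returns an iterator, so it must be wrapped in list() first):

dataset = [['1', 'sunny', 'high',   'yes'],
           ['2', 'rain',  'normal', 'no']]
columns = list(zip(*dataset))   # transpose: rows become columns
print(columns[1])               # ('sunny', 'rain')
print(columns[-1])              # ('yes', 'no') -- the result column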

2. Attribute set: attri_name_set, used to store the attribute names.

A one-dimensional list.

3. Possible attribute values: attri, which stores the possible values of each attribute.

A dict of sets: each key of the dict is an attribute name and its value is a set, which guarantees there are no duplicate values. A new set is created for each attribute with attri[i] = set(), as sketched below.
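A short sketch of how this map is built (the attribute names here are hypothetical):

attri = {}
attri_name_set = ['id', 'outlook', 'humidity', 'result']   # header row
for name in attri_name_set:
    attri[name] = set()            # new empty set per attribute
attri['outlook'].add('sunny')
attri['outlook'].add('rain')
attri['outlook'].add('sunny')      # duplicate, absorbed by the set
print(attri['outlook'])            # {'sunny', 'rain'} (order may vary)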

4. Decision tree node: Dtree_node.

Each node stores in attriname the attribute tested at that node (or the final yes/no label at a leaf), and a sub_node dict that maps each value of the attribute to a child node.
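Below is a minimal sketch of this node type and of how a finished tree could be walked to classify one sample. The classify helper is hypothetical and not part of the original program; it assumes a sample given as a dict mapping attribute names to values.

class Dtree_node(object):
    def __init__(self):
        self.attriname = None   # attribute to test, or 'yes'/'no' at a leaf
        self.sub_node = {}      # attribute value -> child Dtree_node

def classify(node, sample):
    # Descend until a node with no children (a leaf), then return its label
    while node.sub_node:
        node = node.sub_node[sample[node.attriname]]
    return node.attriname

# Hand-built one-level tree: outlook == 'sunny' -> yes
root = Dtree_node()
root.attriname = 'outlook'
leaf = Dtree_node()
leaf.attriname = 'yes'
root.sub_node['sunny'] = leaf
print(classify(root, {'outlook': 'sunny'}))   # yes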

4. Code

# -*- coding: utf-8 -*-
from __future__ import division
import math

__author__ = 'jiayin'
# A hands-on decision tree (ID3): reads the training data from test.txt

# Global variables
training_data = []    # dataset (two-dimensional list)
attri = {}            # attribute set (dict + set)
attri_name_set = []   # attribute names (one-dimensional list)


class Dtree_node(object):
    def __init__(self):
        self.attriname = None   # attribute tested here, or 'yes'/'no' at a leaf
        self.sub_node = {}      # sub-nodes, keyed by attribute value (dict)

root = Dtree_node()


# Read the input data
def get_input():
    # attri is a dict: the key is the attribute name (str), the value is the
    # set of values that attribute can take. The first column is usually a
    # number (id) and the last column is the decision result, only yes/no.
    global attri
    global attri_name_set
    file_read = open("test.txt")
    line = file_read.readline().split()
    attri_name_set = line[:]
    for i in line:
        attri[i] = set()
    line = file_read.readline().split()
    # Read the data and collect the possible values of each attribute
    while line:
        training_data.append(line)
        for i in range(1, len(line) - 1):
            attri[attri_name_set[i]].add(line[i])
        line = file_read.readline().split()


# Get most_of_result: the majority label of a result column
def getmost(dataset_result):
    p = 0
    n = 0
    for i in dataset_result:
        if i == 'yes':
            p += 1
        else:
            n += 1
    return 'yes' if p > n else 'no'


# Calculate the entropy of a result column
def cal_entropy(dataset_result):
    num_yes = 0
    num_no = 0
    for i in dataset_result:
        if i == 'yes':
            num_yes += 1
        else:
            num_no += 1
    if num_no == 0 or num_yes == 0:
        return 0
    total_num = num_no + num_yes
    per_yes = num_yes / total_num
    per_no = num_no / total_num
    return -per_yes * math.log(per_yes, 2) - per_no * math.log(per_no, 2)


# Calculate the entropy increase (information gain) of one attribute
# parameters: dataset, attribute name, initial entropy
def cal_incr_entr_attri(data_set, attriname, init_entropy):
    global attri
    global attri_name_set
    incr_entr = init_entropy
    attri_index = attri_name_set.index(attriname)
    # For each value of this attribute, take the matching subset, compute its
    # entropy, and subtract its weighted share to obtain the gain
    for i in attri[attriname]:
        new_data = filter(lambda x: x[attri_index] == i, data_set)
        if len(new_data) == 0:
            continue
        num = cal_entropy(zip(*new_data)[-1])
        incr_entr -= len(new_data) / len(data_set) * num
    return incr_entr


# Determine whether every remaining sample has the given result
def if_all_label(dataset_result, result):
    for i in dataset_result:
        if i != result:
            return False
    return True


# Create a decision tree
# parameters: root_node, data_set (remaining dataset), attri_set (remaining attributes)
def create_Dtree(root_node, data_set, attri_set):
    global attri
    global attri_name_set
    '''
    # If the current dataset is empty, most_of_result of the previous layer
    # should be returned; that case is handled at the parent node below
    if len(data_set) == 0:
        return None
    '''
    # If the remaining attribute set is empty, return most_of_result
    if len(attri_set) == 0:
        # zip(*data_set)[-1] extracts the last column (the results)
        root_node.attriname = getmost(zip(*data_set)[-1])
        return None
    # If all remaining samples share one result, this node is a leaf
    elif if_all_label(zip(*data_set)[-1], 'yes'):
        root_node.attriname = 'yes'
        return None
    elif if_all_label(zip(*data_set)[-1], 'no'):
        root_node.attriname = 'no'
        return None
    init_entropy = cal_entropy(zip(*data_set)[-1])   # the initial entropy
    # Find the attribute with the maximum gain: the optimal classification attribute
    max_entropy = 0
    best_attri = attri_set[0]
    for i in attri_set:
        entropy = cal_incr_entr_attri(data_set, i, init_entropy)
        if entropy > max_entropy:
            max_entropy = entropy
            best_attri = i
    new_attri = attri_set[:]
    new_attri.remove(best_attri)
    root_node.attriname = best_attri
    attri_index = attri_name_set.index(best_attri)
    # Split on the optimal attribute and recurse into each branch
    for attri_value in attri[best_attri]:
        new_data = filter(lambda x: x[attri_index] == attri_value, data_set)
        root_node.sub_node[attri_value] = Dtree_node()
        # If this branch receives no data, use most_of_result of the parent node
        if len(new_data) == 0:
            root_node.sub_node[attri_value].attriname = getmost(zip(*data_set)[-1])
        else:
            create_Dtree(root_node.sub_node[attri_value], new_data, new_attri)


# Print the decision tree
def print_Dtree(Root_node, layer):
    print Root_node.attriname
    if len(Root_node.sub_node) > 0:
        for sub in Root_node.sub_node.keys():
            for i in range(layer):
                print "|     ",
            print "|----%10s---" % sub,
            print_Dtree(Root_node.sub_node[sub], layer + 1)


def main():
    global root
    global attri_name_set
    get_input()                                    # read the input
    attri_set = attri_name_set[1:-1]               # the attributes used for classification
    create_Dtree(root, training_data, attri_set)   # build the decision tree
    print_Dtree(root, 0)                           # print the decision tree

main()
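The program expects a whitespace-separated test.txt whose first line holds the attribute names, whose first column is a sample number, and whose last column is the yes/no result. A hypothetical input in that format (abbreviated play-tennis data, not supplied with the original post):

id outlook  humidity result
1  sunny    high     no
2  sunny    normal   yes
3  overcast high     yes
4  rain     high     yes
5  rain     normal   no

Running the script on such a file builds the tree from training_data and prints it with print_Dtree, one indented branch per attribute value.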

 
