Machine Learning in Practice: Decision Trees, or Helping the Glasses Guy Buy Glasses

Source: Internet
Author: User

A decision tree is an extremely easy-to-understand algorithm. Once the model is built, it is just a series of nested if... else... statements (or nested switch statements).
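For intuition, here is a hand-written sketch of what such a model looks like once flattened into conditionals. The attribute names anticipate the contact-lens example later in this article; the rules shown are a simplified, hypothetical subset of the real tree:

def recommend_lens(tear_rate, astigmatic):
    # Hypothetical hand-written tree: each branch is just an if/else test.
    if tear_rate == 'reduced':
        return 'no lenses'
    else:                          # tear_rate == 'normal'
        if astigmatic == 'yes':
            return 'hard'
        else:
            return 'soft'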

Advantages: low computational complexity, easily interpretable output, insensitivity to missing values, and the ability to handle irrelevant features;

Disadvantages: prone to overfitting the training data;

Applicable data types: numeric and nominal.


Python implementation of a decision tree:

(1) First, implement a few utility functions: entropy calculation, a helper that splits the dataset, and majority-class selection;

(1) Entropy calculation: entropy measures the degree of disorder in a set; the more disordered the set, the greater the entropy:

from math import log

def entropy(dataset):
    # Shannon entropy of the class labels; the label is the last column of each row.
    log2 = lambda x: log(x) / log(2)
    results = {}
    for row in dataset:
        r = row[-1]
        results[r] = results.get(r, 0) + 1
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(dataset)
        ent -= p * log2(p)
    return ent
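A quick sanity check on two hypothetical toy sets (only the last column, the class label, matters): a pure set has entropy 0, while an evenly split two-class set has entropy 1 bit:

pure  = [['a', 'yes'], ['b', 'yes']]   # one class only
mixed = [['a', 'yes'], ['b', 'no']]    # two classes, evenly split
print(entropy(pure))    # 0.0
print(entropy(mixed))   # 1.0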

(2) Extract a subset of the dataset by attribute and value:

def fetch_subdataset(dataset, k, v):
    # Rows whose column k equals v, with column k removed from each row.
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]
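For example, on two hypothetical rows, asking for column 0 equal to 'young' keeps the first row and strips that column:

rows = [['young', 'myope', 'soft'],
        ['pre',   'myope', 'hard']]
print(fetch_subdataset(rows, 0, 'young'))   # [['myope', 'soft']]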

This function is a single short line: it selects from dataset the rows whose column k has value v, and removes column k from each selected row. Python is simple and elegant.

(3) Compute the majority class. When all decision attributes have been consumed but the data still cannot be uniquely classified, we fall back to majority voting to choose the final class:

def get_max_feature(class_list):
    # Count each class label and return the most frequent one.
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=lambda d: d[1], reverse=True)
    return sorted_class_count[0][0]
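For example, with a hypothetical label list:

print(get_max_feature(['soft', 'hard', 'soft']))   # 'soft'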

(2) A function that selects the optimal way to partition the data:

Which column's values should we split the set on to obtain the maximum information gain?
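For reference, the quantity being maximized is the standard ID3 information gain,

    Gain(D, a) = H(D) - sum over v of (|D_v| / |D|) * H(D_v)

where H is the entropy defined above and D_v is the subset of D whose attribute a takes value v. Since H(D) is the same for every candidate attribute, the function below equivalently minimizes the weighted entropy sum.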

def choose_decision_feature(dataset):
    # Try every feature column; keep the split with the lowest weighted entropy,
    # i.e. the split with the highest information gain.
    best_ent, best_feature = 100000000, -1
    for i in range(len(dataset[0]) - 1):        # the last column is the class label
        feat_values = set(e[i] for e in dataset)
        weighted_ent = 0.0
        for f in feat_values:
            sub_data = fetch_subdataset(dataset, i, f)
            weighted_ent += entropy(sub_data) * len(sub_data) / len(dataset)
        if weighted_ent < best_ent:
            best_ent, best_feature = weighted_ent, i
    return best_feature


(3) Recursive construction of the decision tree:

def build_decision_tree(dataset, datalabel):
    # The tree is a nested dict: {feature_label: {feature_value: subtree_or_class}}.
    cla = [c[-1] for c in dataset]
    if cla.count(cla[0]) == len(cla):
        return cla[0]                        # all rows share one class: leaf node
    if len(dataset[0]) == 1:
        return get_max_feature(cla)          # no features left: majority vote
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    unique_feat_value = set(d[feature] for d in dataset)
    for value in unique_feat_value:
        sub_label = datalabel[:]             # copy so sibling branches don't share labels
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), sub_label)
    return decision_tree

(4) Using the decision tree to classify new data:

def classify(decision_tree, feat_labels, test_vec):
    # Walk down the nested dict until a leaf (a plain class label) is reached.
    label = list(decision_tree.keys())[0]    # the feature tested at this node
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    for key in next_dict:
        if test_vec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                c_label = classify(next_dict[key], feat_labels, test_vec)
            else:
                c_label = next_dict[key]
    return c_label
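Assuming lense_tree is the tree built by test() further down, classifying a single patient looks like this (for the input below, the tree shown later yields 'soft'):

labels = ['age', 'prescript', 'astigmatic', 'tearRate']
# a young, myopic, non-astigmatic patient with a normal tear rate
print(classify(lense_tree, labels, ['young', 'myope', 'no', 'normal']))   # 'soft'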

(5) Decision Tree persistence

(1) Storage

import pickle

def store_decision_tree(tree, filename):
    # pickle requires the file to be opened in binary mode.
    with open(filename, 'wb') as f:
        pickle.dump(tree, f)

(2) Reading

def load_decision_tree(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)
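A round trip, again assuming lense_tree from test() below ('lense_tree.pkl' is an arbitrary filename):

store_decision_tree(lense_tree, 'lense_tree.pkl')
assert load_decision_tree('lense_tree.pkl') == lense_tree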

(6) Finally, it is time to return to the topic and help the glasses guy buy glasses.

The following contact-lens dataset comes from the UCI machine-learning repository. It records the observed eye conditions of many patients along with the contact-lens type the doctor recommended; the lens types are hard, soft, and "no lenses" (contact lenses not suitable).

The data is as follows:

(Columns: age, prescript, astigmatic, tearRate, recommended lens; in lenses.txt the fields are tab-separated with no header row.)

young       myope  no   reduced  no lenses
young       myope  no   normal   soft
young       myope  yes  reduced  no lenses
young       myope  yes  normal   hard
young       hyper  no   reduced  no lenses
young       hyper  no   normal   soft
young       hyper  yes  reduced  no lenses
young       hyper  yes  normal   hard
pre         myope  no   reduced  no lenses
pre         myope  no   normal   soft
pre         myope  yes  reduced  no lenses
pre         myope  yes  normal   hard
pre         hyper  no   reduced  no lenses
pre         hyper  no   normal   soft
pre         hyper  yes  reduced  no lenses
pre         hyper  yes  normal   no lenses
presbyopic  myope  no   reduced  no lenses
presbyopic  myope  no   normal   no lenses
presbyopic  myope  yes  reduced  no lenses
presbyopic  myope  yes  normal   hard
presbyopic  hyper  no   reduced  no lenses
presbyopic  hyper  no   normal   soft
presbyopic  hyper  yes  reduced  no lenses
presbyopic  hyper  yes  normal   no lenses


The test procedure is as follows:

def test():
    # lenses.txt holds the tab-separated rows shown above.
    with open('lenses.txt') as f:
        lense_data = [line.strip().split('\t') for line in f]
    lense_label = ['age', 'prescript', 'astigmatic', 'tearRate']
    lense_tree = build_decision_tree(lense_data, lense_label)
    return lense_tree

The test result is the following nested dict (pretty-printed here for readability; dict key order may differ):

{'tearRate': {'reduced': 'no lenses',
              'normal': {'astigmatic': {'no': {'age': {'young': 'soft',
                                                       'pre': 'soft',
                                                       'presbyopic': {'prescript': {'myope': 'no lenses',
                                                                                    'hyper': 'soft'}}}},
                                        'yes': {'prescript': {'myope': 'hard',
                                                              'hyper': {'age': {'young': 'hard',
                                                                                'pre': 'no lenses',
                                                                                'presbyopic': 'no lenses'}}}}}}}}

Read from the root: a reduced tear rate always means no lenses; with a normal tear rate, non-astigmatic patients mostly get soft lenses, and astigmatic myopes get hard lenses.


The glasses guy can finally buy the right glasses...


All the code is put together below:

from math import log
import pickle


def entropy(dataset):
    # Shannon entropy of the class labels; the label is the last column of each row.
    log2 = lambda x: log(x) / log(2)
    results = {}
    for row in dataset:
        r = row[-1]
        results[r] = results.get(r, 0) + 1
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(dataset)
        ent -= p * log2(p)
    return ent


def fetch_subdataset(dataset, k, v):
    # Rows whose column k equals v, with column k removed from each row.
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]


def get_max_feature(class_list):
    # Count each class label and return the most frequent one.
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=lambda d: d[1], reverse=True)
    return sorted_class_count[0][0]


def choose_decision_feature(dataset):
    # Keep the split with the lowest weighted entropy (highest information gain).
    best_ent, best_feature = 100000000, -1
    for i in range(len(dataset[0]) - 1):
        feat_values = set(e[i] for e in dataset)
        weighted_ent = 0.0
        for f in feat_values:
            sub_data = fetch_subdataset(dataset, i, f)
            weighted_ent += entropy(sub_data) * len(sub_data) / len(dataset)
        if weighted_ent < best_ent:
            best_ent, best_feature = weighted_ent, i
    return best_feature


def build_decision_tree(dataset, datalabel):
    # The tree is a nested dict: {feature_label: {feature_value: subtree_or_class}}.
    cla = [c[-1] for c in dataset]
    if cla.count(cla[0]) == len(cla):
        return cla[0]
    if len(dataset[0]) == 1:
        return get_max_feature(cla)
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    unique_feat_value = set(d[feature] for d in dataset)
    for value in unique_feat_value:
        sub_label = datalabel[:]
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), sub_label)
    return decision_tree


def store_decision_tree(tree, filename):
    with open(filename, 'wb') as f:
        pickle.dump(tree, f)


def load_decision_tree(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)


def classify(decision_tree, feat_labels, test_vec):
    # Walk down the nested dict until a leaf (a plain class label) is reached.
    label = list(decision_tree.keys())[0]
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    for key in next_dict:
        if test_vec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                c_label = classify(next_dict[key], feat_labels, test_vec)
            else:
                c_label = next_dict[key]
    return c_label


def test():
    with open('lenses.txt') as f:
        lense_data = [line.strip().split('\t') for line in f]
    lense_label = ['age', 'prescript', 'astigmatic', 'tearRate']
    return build_decision_tree(lense_data, lense_label)


if __name__ == "__main__":
    tree = test()
    print(tree)

