The decision tree is an extremely easy-to-understand algorithm: once the model is built, it is just a series of nested if...else statements or nested switches.
Advantages: low computational complexity, output that is easy for humans to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features;
Disadvantages: prone to overfitting (the tree can match the training data too closely);
Applicable data types: numeric and nominal.
Python implementation of the decision tree:
(1) Implement several utility functions: computing entropy, splitting the dataset, and selecting the majority class;
(1) Entropy calculation: entropy measures the degree of disorder in a set; the more disordered the set, the greater its entropy. Formally, if class i appears with probability p_i, the entropy is H = -Σ p_i · log2(p_i):
def entropy(dataset):
    # Shannon entropy of the class labels, which sit in the last column.
    from math import log
    log2 = lambda x: log(x) / log(2)
    results = {}
    for row in dataset:
        r = row[len(row) - 1]
        results[r] = results.get(r, 0) + 1
    ent = 0.0
    for r in results.keys():
        p = float(results[r]) / len(dataset)
        ent = ent - p * log2(p)
    return ent
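A quick sanity check, using a made-up toy dataset whose last column is the class label:

toy = [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(entropy(toy))  # -(2/3)*log2(2/3) - (1/3)*log2(1/3) ≈ 0.9183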
(2) Extract a sub-dataset by attribute and value:
def fetch_subdataset(dataset, k, v):
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]
This function is a single short line: it takes from dataset the subset of rows whose column k equals v, and removes column k from each row of the resulting subset. Python is simple and elegant.
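For example, with a few made-up rows:

rows = [['young', 'myope', 'no lenses'],
        ['young', 'hyper', 'soft'],
        ['pre', 'myope', 'no lenses']]
print(fetch_subdataset(rows, 0, 'young'))
# [['myope', 'no lenses'], ['hyper', 'soft']]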
(3) Find the majority class. When all decision attributes have been consumed during tree construction and the remaining data still cannot be uniquely classified, majority voting selects the final classification:
def get_max_feature(class_list):
    # Count each class label and return the most frequent one.
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=lambda d: d[1], reverse=True)
    return sorted_class_count[0][0]
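For example:

print(get_max_feature(['soft', 'hard', 'soft', 'no lenses', 'soft']))  # 'soft'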
(2) A function for selecting the optimal data partition:
Select the optimal partition of the set: which column's values should be used to split the set so that the information gain is maximized? Since the entropy of the parent set is the same for every candidate split, minimizing the weighted entropy of the subsets is equivalent to maximizing the information gain, which is what the code below does:
def choose_decision_feature(dataset):
    # The column whose split minimizes the weighted subset entropy
    # is the one with the maximum information gain.
    ent, feature = float('inf'), -1
    for i in range(len(dataset[0]) - 1):
        feat_list = [e[i] for e in dataset]
        unq_feat_list = set(feat_list)
        ent_t = 0.0
        for f in unq_feat_list:
            sub_data = fetch_subdataset(dataset, i, f)
            ent_t += entropy(sub_data) * len(sub_data) / len(dataset)
        if ent_t < ent:
            ent, feature = ent_t, i
    return feature
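On a made-up toy dataset where column 0 separates the classes perfectly, it should return 0:

toy = [[1, 'a', 'yes'],
       [1, 'b', 'yes'],
       [0, 'a', 'no'],
       [0, 'b', 'no']]
print(choose_decision_feature(toy))  # 0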
(3) Recursive decision tree construction. Recursion stops when every sample in a branch shares the same class, or when the attributes are exhausted (falling back to majority voting):
def build_decision_tree(dataset, datalabel):
    cla = [c[-1] for c in dataset]
    if len(cla) == cla.count(cla[0]):
        return cla[0]  # all samples share one class: leaf node
    if len(dataset[0]) == 1:
        return get_max_feature(cla)  # attributes exhausted: majority vote
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    feat_value = [d[feature] for d in dataset]
    unique_feat_value = set(feat_value)
    for value in unique_feat_value:
        sub_label = datalabel[:]
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), sub_label)
    return decision_tree
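On the toy dataset above, the tree bottoms out after a single split. Note that build_decision_tree deletes entries from its label list, so pass a copy if you want to reuse the labels (the label names here are made up):

labels = ['first', 'second']
print(build_decision_tree(toy, labels[:]))
# {'first': {0: 'no', 1: 'yes'}}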
(4) Using the decision tree to classify new samples:
def classify(decision_tree, feat_labels, testVec):
    # Walk down the tree, following the branch that matches the sample's
    # value for the feature tested at each node.
    label = list(decision_tree.keys())[0]
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    c_label = None
    for key in next_dict.keys():
        if testVec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                c_label = classify(next_dict[key], feat_labels, testVec)
            else:
                c_label = next_dict[key]
    return c_label
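Continuing the toy example:

tree = build_decision_tree(toy, ['first', 'second'])
print(classify(tree, ['first', 'second'], [1, 'b']))  # 'yes'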
(5) Decision tree persistence. Building the tree is the expensive step, so it is worth serializing the result with pickle and reloading it later:
(1) Storage:
def store_decision_tree(tree, filename):
    import pickle
    with open(filename, 'wb') as f:  # pickle requires binary mode
        pickle.dump(tree, f)
(2) Loading:
def load_decision_tree(filename):
    import pickle
    with open(filename, 'rb') as f:
        return pickle.load(f)
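A round trip looks like this ('toy_tree.pkl' is an arbitrary filename for the example):

store_decision_tree({'first': {0: 'no', 1: 'yes'}}, 'toy_tree.pkl')
print(load_decision_tree('toy_tree.pkl'))  # {'first': {0: 'no', 1: 'yes'}}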
(6) Finally, back to the main topic: fitting our bespectacled friend with the right lenses.
The following contact lens dataset comes from the UCI repository. It records the observed eye conditions of many patients together with the lens type recommended by doctors; the possible lens types are hard, soft, and no lenses (i.e., unsuitable for contact lenses).
The data is as follows (columns: age, prescript, astigmatic, tearRate, recommended lens; tab-separated in lenses.txt):
young       myope   no    reduced   no lenses
young       myope   no    normal    soft
young       myope   yes   reduced   no lenses
young       myope   yes   normal    hard
young       hyper   no    reduced   no lenses
young       hyper   no    normal    soft
young       hyper   yes   reduced   no lenses
young       hyper   yes   normal    hard
pre         myope   no    reduced   no lenses
pre         myope   no    normal    soft
pre         myope   yes   reduced   no lenses
pre         myope   yes   normal    hard
pre         hyper   no    reduced   no lenses
pre         hyper   no    normal    soft
pre         hyper   yes   reduced   no lenses
pre         hyper   yes   normal    no lenses
presbyopic  myope   no    reduced   no lenses
presbyopic  myope   no    normal    no lenses
presbyopic  myope   yes   reduced   no lenses
presbyopic  myope   yes   normal    hard
presbyopic  hyper   no    reduced   no lenses
presbyopic  hyper   no    normal    soft
presbyopic  hyper   yes   reduced   no lenses
presbyopic  hyper   yes   normal    no lenses
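If you need to create lenses.txt yourself, a minimal sketch like the following writes the table above in the tab-separated form the test code expects (the abbreviated row list is just for illustration):

rows = [
    ['young', 'myope', 'no', 'reduced', 'no lenses'],
    ['young', 'myope', 'no', 'normal', 'soft'],
    # ... the remaining 22 rows from the table above ...
]
with open('lenses.txt', 'w') as f:
    for row in rows:
        f.write('\t'.join(row) + '\n')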
The test procedure is as follows:
def test():
    with open('lenses.txt') as f:
        lense_data = [inst.strip().split('\t') for inst in f.readlines()]
    lense_label = ['age', 'prescript', 'astigmatic', 'tearRate']
    lense_tree = build_decision_tree(lense_data, lense_label)
    return lense_tree
The test produces a tree of the following shape (dictionary key order may vary from run to run):
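{'tearRate': {'reduced': 'no lenses',
              'normal': {'astigmatic': {'no': {'age': {'young': 'soft',
                                                       'pre': 'soft',
                                                       'presbyopic': {'prescript': {'myope': 'no lenses',
                                                                                    'hyper': 'soft'}}}},
                                        'yes': {'prescript': {'myope': 'hard',
                                                              'hyper': {'age': {'young': 'hard',
                                                                                'pre': 'no lenses',
                                                                                'presbyopic': 'no lenses'}}}}}}}}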
Our bespectacled friend can finally buy the right lenses...
The complete code is collected below:
def entropy(dataset):
    # Shannon entropy of the class labels, which sit in the last column.
    from math import log
    log2 = lambda x: log(x) / log(2)
    results = {}
    for row in dataset:
        r = row[len(row) - 1]
        results[r] = results.get(r, 0) + 1
    ent = 0.0
    for r in results.keys():
        p = float(results[r]) / len(dataset)
        ent = ent - p * log2(p)
    return ent


def fetch_subdataset(dataset, k, v):
    # Rows whose column k equals v, with column k removed.
    return [d[:k] + d[k+1:] for d in dataset if d[k] == v]


def get_max_feature(class_list):
    # Count each class label and return the most frequent one.
    class_count = {}
    for cla in class_list:
        class_count[cla] = class_count.get(cla, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=lambda d: d[1], reverse=True)
    return sorted_class_count[0][0]


def choose_decision_feature(dataset):
    # The column whose split minimizes the weighted subset entropy
    # is the one with the maximum information gain.
    ent, feature = float('inf'), -1
    for i in range(len(dataset[0]) - 1):
        feat_list = [e[i] for e in dataset]
        unq_feat_list = set(feat_list)
        ent_t = 0.0
        for f in unq_feat_list:
            sub_data = fetch_subdataset(dataset, i, f)
            ent_t += entropy(sub_data) * len(sub_data) / len(dataset)
        if ent_t < ent:
            ent, feature = ent_t, i
    return feature


def build_decision_tree(dataset, datalabel):
    cla = [c[-1] for c in dataset]
    if len(cla) == cla.count(cla[0]):
        return cla[0]  # all samples share one class: leaf node
    if len(dataset[0]) == 1:
        return get_max_feature(cla)  # attributes exhausted: majority vote
    feature = choose_decision_feature(dataset)
    feature_label = datalabel[feature]
    decision_tree = {feature_label: {}}
    del datalabel[feature]
    feat_value = [d[feature] for d in dataset]
    unique_feat_value = set(feat_value)
    for value in unique_feat_value:
        sub_label = datalabel[:]
        decision_tree[feature_label][value] = build_decision_tree(
            fetch_subdataset(dataset, feature, value), sub_label)
    return decision_tree


def store_decision_tree(tree, filename):
    import pickle
    with open(filename, 'wb') as f:  # pickle requires binary mode
        pickle.dump(tree, f)


def load_decision_tree(filename):
    import pickle
    with open(filename, 'rb') as f:
        return pickle.load(f)


def classify(decision_tree, feat_labels, testVec):
    # Walk down the tree, following the branch that matches the sample's
    # value for the feature tested at each node.
    label = list(decision_tree.keys())[0]
    next_dict = decision_tree[label]
    feat_index = feat_labels.index(label)
    c_label = None
    for key in next_dict.keys():
        if testVec[feat_index] == key:
            if isinstance(next_dict[key], dict):
                c_label = classify(next_dict[key], feat_labels, testVec)
            else:
                c_label = next_dict[key]
    return c_label


def test():
    with open('lenses.txt') as f:
        lense_data = [inst.strip().split('\t') for inst in f.readlines()]
    lense_label = ['age', 'prescript', 'astigmatic', 'tearRate']
    lense_tree = build_decision_tree(lense_data, lense_label)
    return lense_tree


if __name__ == "__main__":
    tree = test()
    print(tree)