Study on decision tree algorithm of machine learning practice


My original blog is at http://blog.csdn.net/qq_37608890; this article was written by the author on December 06, 2017 (http://blog.csdn.net/qq_37608890/article/details/78731169).

This article summarizes some of my notes from recently studying machine learning books and online articles; the details are as follows. If anything is inaccurate, corrections and advice are very welcome.

I. Decision Tree (Decision Trees) Overview

1. Decision Tree Concept

A decision tree is a tree structure (either a binary tree or a non-binary tree). Each non-leaf node represents a test on a feature attribute, each branch represents an output of that attribute over some range of values, and each leaf node holds a class. Making a decision with a decision tree means starting from the root node, testing the corresponding feature attribute of the item to be classified, selecting the output branch according to its value, and repeating until a leaf node is reached; the class stored at that leaf node is the decision result.

2. Working Principle

The first problem to solve when constructing a decision tree is which feature of the current data set is decisive for classifying the data. To find that decisive feature, we must evaluate every feature. After this test, the original data is divided into several subsets, which are distributed across all branches of the first decision point. If the data under a branch all belongs to the same class, that branch has been classified correctly and does not need to be split further; otherwise the subset must be divided again, using the same method that was used to divide the original data set, until all data of the same class ends up in one subset. The pseudocode for a createBranch() function that constructs a decision tree is as follows:

    Check whether every item in the data set belongs to the same class:
        If so, return the class label
        Else
            Find the best feature to split the data set
            Split the data set
            Create a branch node
            For each split subset
                Call createBranch() and add the result to the branch node
            Return the branch node

Once we have constructed a decision tree model, classifying with it is straightforward: starting from the root node, test the corresponding feature of the instance and assign the instance to one of the child nodes (that is, select the appropriate branch) according to the test result. The branch may lead to a leaf node or to another internal node; in the latter case the new test condition is applied recursively until a leaf node is reached. The class stored at that leaf node is the final classification result. A small example follows.

In layman's terms, the idea behind decision tree classification is similar to matchmaking. Imagine a mother wants to introduce a boyfriend to her daughter, and the following dialogue takes place:

Daughter: How old is he?
Mother: 26.
Daughter: Is he handsome?
Mother: Very handsome.
Daughter: Does he have a high income?
Mother: Not very high, medium.
Daughter: Is he a civil servant?
Mother: Yes, he works in the Inland Revenue Department.
Daughter: OK, I'll go meet him.


This girl's decision-making process is a typical classification-tree decision: through age, looks, income, and whether the man is a civil servant, men are divided into two categories, meet or do not meet. Suppose the girl's requirements are: under 30 years old, average-looking or better, and either high income, or medium income and a civil servant. Her decision logic can then be represented as follows:

The figure fully expresses the girl's decision logic for whether to go on the date: the green nodes indicate decision conditions, the orange nodes represent decision results, and the arrows indicate the decision paths in different situations; the red arrows show the girl's decision process in the example above.
This picture can basically be regarded as a decision tree. We say "basically" because the decision conditions in the figure are not quantified (income is only high/medium/low, and so on), so it is not a strict decision tree; if all the conditions were quantified, it would become a real decision tree.
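To connect this with the code later in the article, here is my own illustration (not from the original post) of how the girl's decision logic could be written in the nested-dictionary form that createTree() produces below; the feature names and values are invented for the example:

    # Illustrative only: a hand-built tree in the nested-dict format used later,
    # where each key is a feature, each sub-key a feature value, and each leaf a decision.
    dating_tree = {
        'age': {
            'over 30': 'do not meet',
            'under 30': {
                'looks': {
                    'below average': 'do not meet',
                    'average or better': {
                        'income': {
                            'high': 'meet',
                            'low': 'do not meet',
                            'medium': {'civil servant': {'yes': 'meet', 'no': 'do not meet'}},
                        }
                    },
                }
            },
        }
    }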

3. Characteristics of Decision Trees

    • Advantages: low computational complexity, output that is easy to understand, insensitive to missing intermediate values, and the ability to handle irrelevant features.

    • Disadvantage: overfitting may occur.

    • Applicable data types: numeric and nominal.

4. General Process

(1) Collect data: You can use any method.

(2) Prepare the data: the tree-construction algorithm only applies to nominal data, so numeric data must be discretized first (see the small sketch after this list).

(3) Analyze data: You can use any method; after the tree is constructed, check whether the resulting graph matches expectations.

(4) Training algorithm: Construct the data structure of the tree.

(5) Test algorithm: Use the trained tree to calculate the error rate.

(6) Use algorithm: This step applies to any supervised learning algorithm; using a decision tree helps us better understand the intrinsic meaning of the data.
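As a small aside (my own sketch, not part of the original article), the discretization mentioned in step (2) can be as simple as mapping numeric ranges to nominal labels before the data reaches the tree-building code:

    def discretize(value, cutoff):
        """Map a numeric value to a nominal label (minimal two-bin illustration)."""
        return 'high' if value >= cutoff else 'low'

    # e.g. turn a numeric income column into a nominal one before building the tree
    incomes = [3200, 5800, 12000]
    print([discretize(x, 8000) for x in incomes])   # prints ['low', 'low', 'high']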

II. Decision Tree Scenarios

Suppose there is a game called "15 questions". The rules are simple: one player thinks of something, and the other participants ask questions, at most 15 of them, which may only be answered with yes or no. The questioner gradually narrows down the range of possible things from the answers and finally arrives at the answer. A decision tree works like this game: the user enters a series of data and the tree gives the answer.

Consider an imaginary email classification system. It first checks the sender's domain name: if the address is myemployer.com, the message is placed in "email to read when bored". Otherwise, it checks whether the message contains the word hockey; if it does, the message is classified as "a friend's email that needs to be handled promptly", otherwise it is classified as "spam that does not need to be read".
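The imagined mail filter above is just a sequence of nested tests; as a rough illustration (my own code, with made-up category strings), it could be hard-coded like this before we get to learning such trees automatically:

    def classify_email(sender_domain, body):
        """Hand-written version of the mail-sorting decision tree described above."""
        if sender_domain == 'myemployer.com':
            return 'email to read when bored'
        elif 'hockey' in body:
            return "friend's email, handle promptly"
        else:
            return 'spam, no need to read'

    print(classify_email('example.com', 'hockey game on Friday?'))   # friend's email, handle promptly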


A very important task of a decision tree is to understand the knowledge contained in the data (a significant difference from the k-nearest neighbor algorithm, which cannot reveal the intrinsic meaning of the data). A decision tree can therefore be applied to an unfamiliar data set and extract a series of rules from it; this process of the machine creating rules from the data set is the machine learning process.

III. Decision Tree Project Case 1: Classifying Marine Animals as Fish or Not Fish

1. Project Situation

The data in the table below describes 5 marine animals; the features are whether the animal can survive without coming to the surface and whether it has flippers. The animals are divided into two classes: fish and not fish. To decide how to split on the given features, we first need a way to quantify the data before we can judge.
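The table itself is not reproduced here, but it can be reconstructed from the data set defined in createDataSet() below (assuming 1 = yes and 0 = no):

    No.   Can survive without surfacing?   Has flippers?   Fish?
    1     yes                              yes             yes
    2     yes                              yes             yes
    3     yes                              no               no
    4     no                               yes              no
    5     no                               yes              no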


We first write a createDataSet() function to build the input data and a calcShannonEnt() function that computes the Shannon entropy of a given data set. (Shannon entropy is defined as H = -Σ p(x_i) · log2 p(x_i), where p(x_i) is the proportion of records belonging to class x_i.)

   

    from math import log

    def createDataSet():
        """Create the sample marine-animal data set and its feature labels."""
        dataSet = [[1, 1, 'yes'],
                   [1, 1, 'yes'],
                   [1, 0, 'no'],
                   [0, 1, 'no'],
                   [0, 1, 'no']]
        labels = ['no surfacing', 'flippers']   # discrete-valued features
        return dataSet, labels

    # Information gain
    # Calculate the Shannon entropy of a given data set
    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet:                 # count the occurrences of each class label
            currentLabel = featVec[-1]
            if currentLabel not in labelCounts:
                labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key]) / numEntries
            shannonEnt -= prob * log(prob, 2)   # log base 2
        return shannonEnt

Execution

    myDat, labels = createDataSet()
    myDat

Get

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

Execution

    calcShannonEnt(myDat)

Get

0.9709505944546686

The higher the entropy, the more mixed the data is. We can add more classes to the data set and observe how the entropy changes.
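As a quick check of that number (my own arithmetic): with 2 'yes' and 3 'no' records, H = -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971. Following the book's experiment, relabeling one record as a third class should push the entropy up:

    myDat[0][-1] = 'maybe'        # introduce a third class label
    calcShannonEnt(myDat)         # rises to roughly 1.37
    myDat[0][-1] = 'yes'          # restore the original label before continuing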

Next, splitDataSet() splits the data set on a given feature: it collects the rows whose value in the specified feature column equals value and returns them, with that column removed, as a child data set.

    def splitDataSet(dataSet, index, value):
        """Split dataSet on the feature in column `index`.

        Args:
            dataSet -- the data set to split
            index   -- the column of the feature to split on
            value   -- the feature value the returned rows must have
        Returns:
            the rows whose column `index` equals `value`, with that column removed
        """
        retDataSet = []
        for featVec in dataSet:
            # keep only the rows whose value in column `index` equals `value`
            if featVec[index] == value:
                # featVec[:index] takes the columns before `index`
                reducedFeatVec = featVec[:index]
                # note: list.extend(seq) appends the elements of seq one by one,
                # whereas list.append(obj) would add the whole list as a single element
                reducedFeatVec.extend(featVec[index + 1:])   # skip column `index` itself
                retDataSet.append(reducedFeatVec)
        return retDataSet
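A quick sanity check on the marine-animal data (my own example calls): splitting on feature 0 ('no surfacing') separates the rows as follows.

    splitDataSet(myDat, 0, 1)     # [[1, 'yes'], [1, 'yes'], [0, 'no']]
    splitDataSet(myDat, 0, 0)     # [[1, 'no'], [1, 'no']]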

Choose the best way to split the data set:

    def chooseBestFeatureToSplit(dataSet):
        """Return the index of the feature whose split gives the largest information gain."""
        numFeatures = len(dataSet[0]) - 1          # the last column is the class label
        baseEntropy = calcShannonEnt(dataSet)      # entropy of the un-split data set
        bestInfoGain, bestFeature = 0.0, -1        # best information gain and feature index so far
        for i in range(numFeatures):               # iterate over all features
            # collect all values of this feature and reduce them to the unique ones
            featList = [example[i] for example in dataSet]
            uniqueVals = set(featList)
            newEntropy = 0.0
            # split once per unique value and sum the entropies of the subsets,
            # weighted by the fraction of rows that fall into each subset
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)
            # information gain = the reduction in entropy (disorder) achieved by this split
            infoGain = baseEntropy - newEntropy
            print('infoGain=', infoGain, 'bestFeature=', i, baseEntropy, newEntropy)
            if infoGain > bestInfoGain:
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature
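On the marine-animal data this picks the first feature (my own example run; the numbers follow from the entropy formula above):

    chooseBestFeatureToSplit(myDat)
    # feature 0 ('no surfacing') wins with an information gain of about 0.42,
    # versus about 0.17 for feature 1 ('flippers'), so the call returns 0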

Training algorithm: construct the tree's data structure

Functions to create a tree

    def createTree(dataSet, labels):
        classList = [example[-1] for example in dataSet]
        # First stopping condition: all class labels are identical, so return that label.
        # count() returns how many times classList[0] occurs in the list.
        if classList.count(classList[0]) == len(classList):
            return classList[0]
        # Second stopping condition: all features have been used up but the classes are
        # still mixed, so return the majority class (see majorityCnt() below).
        if len(dataSet[0]) == 1:
            return majorityCnt(classList)
        # Choose the best feature to split on and look up the name of its label
        bestFeat = chooseBestFeatureToSplit(dataSet)
        bestFeatLabel = labels[bestFeat]
        # Initialize the tree for this node
        myTree = {bestFeatLabel: {}}
        # Note: labels is a mutable object and is passed by reference, so this del()
        # also removes the element from the caller's list; reusing that list later can
        # trigger a "'no surfacing' is not in list" error.
        del(labels[bestFeat])
        # Remove the chosen column and build one branch per value it takes
        featValues = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featValues)
        for value in uniqueVals:
            subLabels = labels[:]   # the remaining labels
            # recursively call createTree() on each partition of the data set
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
            # print('myTree', value, myTree)
        return myTree
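createTree() calls a helper majorityCnt() that the article does not list; a minimal version consistent with how it is used here (the name follows the book's convention) might look like this:

    import operator

    def majorityCnt(classList):
        """Return the class label that occurs most often in classList."""
        classCount = {}
        for vote in classList:
            classCount[vote] = classCount.get(vote, 0) + 1
        sortedClassCount = sorted(classCount.items(),
                                  key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

With this helper in place, createTree(myDat, labels) on the marine-animal data should produce the nested dictionary {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}.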

Test algorithm: Use decision tree to perform classification

    def classify(inputTree, featLabels, testVec):
        """Classify an input vector by walking the decision tree.

        Args:
            inputTree  -- the decision tree model (nested dict)
            featLabels -- the feature names, in the same order as the data columns
            testVec    -- the feature values of the instance to classify
        Returns:
            classLabel -- the predicted class label
        """
        # the key of the root node is the name of the feature tested first
        firstStr = list(inputTree.keys())[0]
        secondDict = inputTree[firstStr]
        # find that feature's position in featLabels so we know which element of
        # testVec to compare against the branches
        featIndex = featLabels.index(firstStr)
        key = testVec[featIndex]
        valueOfFeat = secondDict[key]
        print('+++', firstStr, 'xxx', secondDict, '---', key, '>>>', valueOfFeat)
        # if the branch leads to another dict we have reached an internal node and recurse;
        # otherwise it is a leaf, i.e. the class label
        if isinstance(valueOfFeat, dict):
            classLabel = classify(valueOfFeat, featLabels, testVec)
        else:
            classLabel = valueOfFeat
        return classLabel
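A quick usage check (my own example calls, following the book's session style). Because createTree() mutates the labels list, a copy is passed in so the original names are still available for classification:

    myDat, labels = createDataSet()
    myTree = createTree(myDat, labels[:])     # pass a copy so `labels` stays intact
    classify(myTree, labels, [1, 0])          # -> 'no'  (no flippers)
    classify(myTree, labels, [1, 1])          # -> 'yes'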


IV. Project Case 2: Using Decision Trees to Predict Contact Lens Types

Project Overview

Contact lens types include hard, soft, and no lenses (not suitable for wearing contact lenses). We want to use a decision tree to predict the type of contact lenses a patient needs.
Development process

(1) Collect data: the provided text file.
(2) Parse data: parse the tab-delimited data rows.
(3) Analyze data: quickly check the data, make sure it was parsed correctly, and draw the final tree diagram with the createPlot() function.
(4) Training algorithm: use the createTree() function.
(5) Test algorithm: write a test function to verify that the decision tree can correctly classify a given data instance.
(6) Use algorithm: store the tree's data structure so that it does not have to be rebuilt the next time it is used (see the pickle sketch after this list).
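The article does not show the code for step (6); a minimal sketch using Python's pickle module (the names storeTree/grabTree follow the book's convention and are my assumption here):

    import pickle

    def storeTree(inputTree, filename):
        """Serialize a decision tree to disk so it does not have to be rebuilt."""
        with open(filename, 'wb') as fw:
            pickle.dump(inputTree, fw)

    def grabTree(filename):
        """Load a previously stored decision tree."""
        with open(filename, 'rb') as fr:
            return pickle.load(fr)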

Collect data: Provided text file

The text file data format is as follows:

    young        myope   no   reduced   no lenses
    pre          myope   no   reduced   no lenses
    presbyopic   myope   no   ...

Parsing data: resolve the tab-delimited data rows

    fr = open('lenses.txt')   # assumes the provided data file is named lenses.txt
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']

Analyze data: quickly check the data, make sure it was parsed correctly, and draw the final tree diagram using the createPlot() function.

    treePlotter.createPlot(lensesTree)


Training algorithm: use the createTree() function

    >>> lensesTree = trees.createTree(lenses, lensesLabels)
    >>> lensesTree


Get

    {'tearRate': {'reduced': 'no lenses',
                  'normal': {'astigmatic': {'yes': {'prescript': {'hyper': {'age': {'pre': 'no lenses',
                                                                                    'presbyopic': 'no lenses',
                                                                                    'young': 'hard'}},
                                                                  'myope': 'hard'}},
                                            'no': {'age': {'pre': 'soft',
                                                           'presbyopic': {'prescript': {'hyper': 'soft',
                                                                                        'myope': 'no lenses'}},
                                                           'young': 'soft'}}}}}}
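As a small follow-up (my own example, not in the article, assuming the functions above live in the trees module used in this session), the trained tree can classify a new patient; a copy of the label list is passed to createTree() so the original names remain available for classify():

    >>> lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    >>> lensesTree = trees.createTree(lenses, lensesLabels[:])
    >>> trees.classify(lensesTree, lensesLabels, ['young', 'myope', 'yes', 'normal'])
    'hard'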

V. Summary

In fact, a decision tree is much like a flowchart with terminating blocks, where the terminating blocks are the classification results. When processing the data, we first measure how inconsistent the data in the collection is, that is, we calculate its Shannon entropy; we then find the optimal way to split the data, and repeat until all data of the same type ends up in the same subset. When building the tree we typically use recursion to turn the data set into a decision tree. In most cases we do not construct a new data structure; instead, the tree's node information is stored in nested Python dictionaries. At each step the feature with the largest information gain is chosen as the decision block, and in this way the decision tree is generated.


The Matplotlib annotation feature lets us turn the stored tree structure into an easy-to-understand diagram. The contact lens example shows that a decision tree may split the data set excessively, causing it to overfit the data. This problem can be addressed by pruning the decision tree, merging adjacent leaf nodes whose split does not provide any information gain.


Regarding construction algorithms, this article only uses the ID3 algorithm; there are also the C4.5 and CART algorithms. The complete decision-tree workflow consists of three parts:


1. Feature selection;

2. Decision tree generation;

3. Pruning.


Unlike ID3, the other two algorithms include a pruning step, which also addresses the overfitting problem seen in the contact lens example.

This is the author's first pass at organizing the decision tree material; C4.5 and CART will be summarized in later posts when the opportunity arises. Where there are shortcomings, corrections and guidance from colleagues are welcome.
