Python machine learning: decision trees



A decision tree (DT) is a supervised learning method used for classification and regression.

Advantages: low computational complexity, output that is easy for humans to understand, insensitivity to missing values, and the ability to handle irrelevant features.
Disadvantage: it is prone to overfitting.
Applicable data types: numeric and nominal. Source code download: https://www.manning.com/books/machine-learning-in-action

A useful reference for running the demo: http://blog.csdn.net/dream_angel_z/article/details/45965463

Key algorithm (pseudocode for createBranch)

Check whether every item in the dataset belongs to the same class:
    If so, return the class label;
    Else
        Find the best feature for splitting the dataset
        Split the dataset
        Create a branch node
        For each subset produced by the split
            Call createBranch recursively and add the result to the branch node
        Return the branch node

Corresponding code:

def createTree(dataSet, labels):
    # classList collects the class label, which is the last element of each example in dataSet
    classList = [example[-1] for example in dataSet]
    # if every label in the list is the same, this node is pure and is returned as a leaf;
    # otherwise we keep splitting recursively
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # index of the best feature to split on
    bestFeatLabel = labels[bestFeat]  # its name: do we get 'flippers' or 'no surfacing'?
    myTree = {bestFeatLabel: {}}  # create the subtree rooted at the best feature
    del(labels[bestFeat])  # delete the chosen feature from the label list
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)  # set gives the distinct values, i.e. how many branches to create
    for value in uniqueVals:
        subLabels = labels[:]  # copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
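createTree leans on three helpers plus the entropy function. The book's trees.py defines them; below is a minimal sketch consistent with how they are used above (treat it as an illustration of the technique, not the book's exact code):

from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels (the last element of each example)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for count in labelCounts.values():
        prob = count / float(len(dataSet))
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    # keep the examples whose feature `axis` equals `value`, with that feature cut out
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            retDataSet.append(featVec[:axis] + featVec[axis + 1:])
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    # return the index of the feature whose split yields the largest information gain
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(example[i] for example in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy after the split
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain, bestFeature = infoGain, i
    return bestFeature

def majorityCnt(classList):
    # majority vote over class labels, used when the features are exhausted
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return max(classCount, key=classCount.get)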

 

Information gain is measured after a dataset is split. The guiding principle of splitting is to make disordered data more ordered. We can understand the principle through the analogy of cutting a pie:

Entropy describes the degree of disorder of the information; it corresponds to the density of the pie. If the pie has uniform density and is cut vertically into slices,

the weight of each slice is g = total G * that slice's share of the circle. Likewise, if the entropy is unchanged by the split, the small h of each part equals the total H times that part's proportion, and the parts sum back to the whole: sum of h[i] = H.

But what we need is the opposite: we want the entropies after the split to be unequal. In the pie picture, the green part might be grass-jelly filling, the yellow apple filling, and the blue purple-yam filling, each with a different density!

We need to cut it the right way: classify, so that the cuts approach the boundaries between the different fillings. Then each piece's small h is minimized and the total H approaches its minimum while the total area stays the same; this is the optimization problem being solved.
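A minimal numeric sketch of that idea, using the fish labels from the dataset that appears later in the post (the perfect split below is assumed purely for illustration):

from math import log

def entropy(labels):
    # Shannon entropy of a list of class labels
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / len(labels)) * log(c / len(labels), 2) for c in counts.values())

labels = ['yes', 'yes', 'no', 'no', 'no']
print(entropy(labels))  # ~0.971: a mixed "pie", high disorder

# a cut that separates the fillings perfectly drives the weighted sum to its minimum:
left, right = ['yes', 'yes'], ['no', 'no', 'no']
print((2 / 5) * entropy(left) + (3 / 5) * entropy(right))  # 0.0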


Debugging process

calcShannonEnt on <class 'list'>: [[1, 'no'], [1, 'no']] returns 0. Why 0? Because there is only one class, so prob must be 1 and log(1, 2) = 0; weighted by this subset's share 0.4, the contribution is still 0.
log(prob, 2): log(1, 2) = 0 because 2 ^ 0 = 1; and since prob <= 1, log(prob, 2) <= 0, hence the minus sign in the entropy sum.
On <class 'list'>: [[1, 'yes'], [1, 'yes'], [0, 'no']] the entropy is about 0.918; weighted by this subset's share 0.6, it contributes about 0.551.
Row 25, for featVec in dataSet: builds the frequency table from which prob is computed.

chooseBestFeatureToSplit()
baseEntropy = calcShannonEnt(dataSet) = 0.9709505944546686 for <class 'list'>: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

# Each sub-item above has only two entries, a feature value and the 'yes'/'no' result, because splitDataSet has already removed the feature being tested; checking whether all sub-items belong to the same class then only needs to look at that result.
newEntropy = sum of prob * calcShannonEnt(subDataSet) = 0.5509775004326937: after the split, each subset's entropy is weighted by its probability and summed, to be compared against the original overall entropy.

# The purer the data, the smaller the Shannon entropy, approaching 0; the more mixed the data, the larger it gets.
# calcShannonEnt(dataSet) looks only at featVec[-1], the result label of each example (see the sketch of the function above).

infoGain = baseEntropy - newEntropy = 0.9709505944546686 - 0.5509775004326937 = 0.4199730940219749
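These debug numbers can be reproduced with the sketches above (a quick check, splitting on feature 0):

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(calcShannonEnt(dataSet))  # 0.9709505944546686

sub1 = splitDataSet(dataSet, 0, 1)  # [[1, 'yes'], [1, 'yes'], [0, 'no']] -> entropy ~0.918
sub0 = splitDataSet(dataSet, 0, 0)  # [[1, 'no'], [1, 'no']] -> entropy 0
newEntropy = 0.6 * calcShannonEnt(sub1) + 0.4 * calcShannonEnt(sub0)
print(newEntropy)  # 0.5509775004326937
print(calcShannonEnt(dataSet) - newEntropy)  # infoGain = 0.4199730940219749
print(chooseBestFeatureToSplit(dataSet))  # 0: splitting on 'no surfacing' wins here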

 

Summary:

At the beginning I couldn't understand the code, or what it was trying to do! Classification: our goal is to take a pile of data, classify it, and attach labels to it.
Just as in k-nearest neighbors, classify([0, 0], group, labels, 3) means: place the new data point [0, 0] into one of the categories in labels, using the k = 3 nearest-neighbor algorithm over group. The group corresponds to the labels!
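For reference, here is a minimal sketch of what such a kNN call does (the name knn_classify and the toy group/labels are assumptions for illustration, not the book's exact code):

import numpy as np

def knn_classify(inX, group, labels, k):
    # vote among the labels of the k training points nearest to inX
    dists = np.sqrt(((group - np.array(inX)) ** 2).sum(axis=1))  # Euclidean distances
    votes = {}
    for i in dists.argsort()[:k]:  # indices of the k closest points
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], group, labels, 3))  # 'B': [0, 0] joins the B group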

Later we can see:

The data dataSet gives the values of each dimension, and the last element is the result tag: whether or not it is a fish.

Therefore, we partition along each dimension and keep the result tags in a two-dimensional array so the categories can be compared.
The test should be: input the values of the first n dimensions as a vector, and the output is yes or no!
At the beginning I was dizzy; only after straightening out my ideas did the code become easy to read!
Once you understand the goal and the initial data, you can see that classList is the result tag: the result labels of the dataset to be classified.
labels holds the feature names (strings), one per dimension of the original dataset.
bestFeatLabel is the name of the best feature to split on, whether that is the first dimension or the second.
featValues is the array of values in the bestFeat dimension; the groups under this dimension are used for the next round of classification.
uniqueVals uses set() to find the distinct values, i.e. how many branches to create.
For example:
dataSet = [[1, 1, 'yes'], [0, 1, 'yes'], [1, 0, 'no'], [1, 0, 'no'], [0, 0, 'no']]
labels = ['no surfacing', 'flippers']
createTree returns {'flippers': {0: 'no', 1: 'yes'}}, directly omitting the 'no surfacing' dimension: flippers alone already decides the class.
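Putting it together, a hedged sketch: createTree reproduces the example above, and a small classify helper (modeled on the book's classify routine, reconstructed here rather than copied) answers the "first n dimensions in, yes or no out" test:

def classify(inputTree, featLabels, testVec):
    # walk down the tree, testing the feature named at each internal node
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    valueOfFeat = secondDict[testVec[featIndex]]
    if isinstance(valueOfFeat, dict):
        return classify(valueOfFeat, featLabels, testVec)
    return valueOfFeat

dataSet = [[1, 1, 'yes'], [0, 1, 'yes'], [1, 0, 'no'], [1, 0, 'no'], [0, 0, 'no']]
labels = ['no surfacing', 'flippers']
myTree = createTree(dataSet, labels[:])  # pass a copy: createTree mutates labels
print(myTree)  # {'flippers': {0: 'no', 1: 'yes'}}
print(classify(myTree, labels, [1, 0]))  # 'no': no flippers, so not a fish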

 

Finally, let's talk about the decision tree as another article puts it:

A decision tree speeds things up! Split first on the feature that is 'maximally optimal' at separating out one tag; only the remaining tags need further splitting! When a branch is already decided (say, negative), return the leaf node's answer directly: the corresponding other dimensions do not need to be examined at all!

In theory, without any decision tree algorithm, every query would scan all data dimensions once and look up the final tag answer: complexity of (number of dimensions) * (number of data points)! That is answer-matching from memory, suited to an expert system: poor at predicting unexpected situations, but when the data volume is large it is fast and can still feel intelligent, because it replays past experience! So is it rigid and dead? No, it is not dead! The possible inputs are endless, but the decision tree is dynamic: it learns, and the tree changes! At least it is dynamic! When the data is incomplete, the tree may be incomplete; when one judgment suffices, one judgment is used and no more are needed, and dimensions can be added!

Please leave a message! All advice is welcome!
