Sklearn Library Example: Decision Tree Classification and Decision-Making


An introduction to the decision tree algorithm in scikit-learn: http://scikit-learn.org/stable/modules/tree.html

1. Decision tree: a non-parametric supervised learning method used mainly for classification and regression. The goal of the algorithm is to create a model that predicts the target variable by learning simple decision rules inferred from the data features. As the sketch below illustrates, a decision tree approximates a sine curve with a series of if-then-else decision rules.
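A minimal sketch of this sine-curve approximation, along the lines of the scikit-learn regression example (the noise level and the max_depth values are illustrative choices, not from the original article):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Build a noisy sine curve as the training set
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 0.5 * (rng.rand(16) - 0.5)  # add noise to every 5th point

# A shallow tree gives a coarse step-wise fit; a deeper tree follows
# the curve (and the noise) more closely
reg_shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
reg_deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
plt.scatter(X, y, c="darkorange", label="data")
plt.plot(X_test, reg_shallow.predict(X_test), c="cornflowerblue", label="max_depth=2")
plt.plot(X_test, reg_deep.predict(X_test), c="yellowgreen", label="max_depth=5")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.show()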

  

Advantages of decision trees:

  • Simple to understand and interpret; the principles are clear, and trees can be visualized.

  • Requires little data preparation. Other methods often require data normalization, creating dummy variables, and removing blank values. (Note: this module does not support missing values.)

  • The cost of using a tree for prediction is logarithmic in the number of training data points.

  • Able to handle both numerical and categorical data.

  • Able to handle multi-output problems.

  • Uses a white-box model (the internal structure can be observed directly). If a given situation is observable in the model, the result can be explained with boolean logic. With a black-box model (e.g., an artificial neural network), by contrast, the results may be hard to interpret.

  • The model can be validated with statistical tests, which makes it possible to assess its reliability.

  • Performs well even when the model's assumptions are somewhat violated by the true process that generated the data.

 

Disadvantages of decision trees:

  • Decision-tree learners can build overly complex rules, that is, overfit the training data. To avoid this, pruning, setting a minimum number of samples per leaf node, or setting a maximum tree depth is sometimes necessary (the sketch after this list makes these controls concrete).

  • Decision trees can be unstable: small variations in the data may produce a completely different tree. This problem can be mitigated by averaging over an ensemble of trees.

  • Learning an optimal decision tree is NP-complete, so practical decision-tree learning algorithms rely on heuristics, such as greedy algorithms that make the locally optimal split at each node. Such algorithms cannot guarantee a globally optimal tree; training multiple trees on randomly selected features and samples mitigates this.

  • Some concepts are hard to learn because decision trees cannot express them easily, for example XOR, parity, or multiplexer problems.

  • Decision trees produce biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting.
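A minimal sketch of these remedies (my own illustration; the parameter values are arbitrary examples, not from the original article), comparing an unconstrained tree, a regularized tree, and a random forest ensemble on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Unconstrained tree: grows until every leaf is pure, so it can overfit
full_tree = DecisionTreeClassifier(random_state=0)

# Regularized tree: cap the depth and require a minimum number of samples per leaf
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)

# Ensemble: average many trees trained on random subsets of samples and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("full tree", full_tree),
                    ("pruned tree", pruned_tree),
                    ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print("%-13s mean accuracy: %.3f" % (name, scores.mean()))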

 

2. Classification

DecisionTreeClassifier can perform multi-class classification. It takes two arrays as input: an array X of size [n_samples, n_features] holding the training samples, and an array Y of size [n_samples] holding the class labels of the training samples.

from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

clf.predict([[2., 2.]])
clf.predict_proba([[2., 2.]])
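With this toy training set, the point [2., 2.] falls on the class-1 side of the learned split, so predict returns array([1]) and predict_proba returns array([[ 0., 1.]]).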

Here we use the iris dataset:

from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# export the tree in Graphviz format using the export_graphviz exporter
with open("iris.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)

# predict the class of samples
clf.predict(iris.data[:1, :])
# the probability of each class
clf.predict_proba(iris.data[:1, :])
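Since an unconstrained tree fits the training data exactly, predicting the first training sample returns array([0]) with class probabilities array([[ 1., 0., 0.]]).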

Install Graphviz and add it to the PATH environment variable, then use the dot tool to create a PDF file: dot -Tpdf iris.dot -o iris.pdf

For instructions on installing Graphviz, see: http://blog.csdn.net/lanchunhui/article/details/49472949

After running, the folder will contain the two files iris.dot and iris.pdf. Open iris.pdf to view the rendered tree.

You can also install the pydotplus package (pip install pydotplus) and generate the PDF directly from Python:

import pydotplus

dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris.pdf")

Note: an error may occur when you run this code; I was unable to resolve it for a long time. See http://stackoverflow.com/questions/31209016/python-pydot-and-decisiontree/36456995#36456995
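If pydotplus keeps failing, one possible workaround (my suggestion, not from the original article) is the graphviz Python package (pip install graphviz), which drives the same dot toolchain:

import graphviz
from sklearn.datasets import load_iris
from sklearn import tree

# refit the classifier so the snippet is self-contained
iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")  # writes iris.pdf (the Graphviz binaries must be on the PATH)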

The following is the demo code from the scikit-learn official website:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis("tight")

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    cmap=plt.cm.Paired)

    plt.axis("tight")

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend()
plt.show()

Running the code produces the figure "Decision surface of a decision tree using paired features": a 2x3 grid with one decision surface per feature pair.
