Sklearn Library Example: Decision Tree Classification and Decision-Making


An introduction to the decision tree algorithm in scikit-learn: http://scikit-learn.org/stable/modules/tree.html

1. Decision tree: a non-parametric supervised learning method used mainly for classification and regression. The goal of the algorithm is to create a model that predicts the target variable by learning simple decision rules inferred from the data features. As the sketch below illustrates, a decision tree approximates a sine curve with a series of if-then-else decision rules.
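A minimal sketch of this sine-curve approximation, along the lines of the scikit-learn regression example (the noise level and the max_depth values are illustrative choices, not from the original article):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Build a noisy sine curve as the training set
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 0.5 * (rng.rand(16) - 0.5)  # add noise to every 5th point

# A shallow tree gives a coarse step-wise fit; a deeper tree follows
# the curve (and the noise) more closely
reg_shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
reg_deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
plt.scatter(X, y, c="darkorange", label="data")
plt.plot(X_test, reg_shallow.predict(X_test), c="cornflowerblue", label="max_depth=2")
plt.plot(X_test, reg_deep.predict(X_test), c="yellowgreen", label="max_depth=5")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.show()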

  

Advantages of decision trees:

  • Simple to understand and interpret; the principles are clear, and trees can be visualized.

  • Requires little data preparation. Other methods often require data normalization, creating dummy variables, and removing blank values. (Note: this module does not support missing values.)

  • The cost of using a tree for prediction is logarithmic in the number of training data points.

  • Able to handle both numerical and categorical data.

  • Able to handle multi-output problems.

  • Uses a white-box model (the internal structure can be observed directly). If a given situation is observable in the model, the result can be explained with boolean logic. With a black-box model (e.g., an artificial neural network), by contrast, the results may be hard to interpret.

  • The model can be validated with statistical tests, which makes it possible to assess its reliability.

  • Performs well even when the model's assumptions are somewhat violated by the true process that generated the data.

 

Disadvantages of decision trees:

  • Decision-tree learners can build overly complex rules, that is, overfit the training data. To avoid this, pruning, setting a minimum number of samples per leaf node, or setting a maximum tree depth is sometimes necessary (the sketch after this list makes these controls concrete).

  • Decision trees can be unstable: small variations in the data may produce a completely different tree. This problem can be mitigated by averaging over an ensemble of trees.

  • Learning an optimal decision tree is NP-complete, so practical decision-tree learning algorithms rely on heuristics, such as greedy algorithms that make the locally optimal split at each node. Such algorithms cannot guarantee a globally optimal tree; training multiple trees on randomly selected features and samples mitigates this.

  • Some concepts are hard to learn because decision trees cannot express them easily, for example XOR, parity, or multiplexer problems.

  • Decision trees produce biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting.
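A minimal sketch of these remedies (my own illustration; the parameter values are arbitrary examples, not from the original article), comparing an unconstrained tree, a regularized tree, and a random forest ensemble on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Unconstrained tree: grows until every leaf is pure, so it can overfit
full_tree = DecisionTreeClassifier(random_state=0)

# Regularized tree: cap the depth and require a minimum number of samples per leaf
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)

# Ensemble: average many trees trained on random subsets of samples and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("full tree", full_tree),
                    ("pruned tree", pruned_tree),
                    ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print("%-13s mean accuracy: %.3f" % (name, scores.mean()))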

 

2. Classification

DecisionTreeClassifier can perform multi-class classification. It takes two arrays as input: an array X of size [n_samples, n_features] holding the training samples, and an array Y of size [n_samples] holding the class labels of the training samples.

from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

clf.predict([[2., 2.]])
clf.predict_proba([[2., 2.]])
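With this toy training set, the point [2., 2.] falls on the class-1 side of the learned split, so predict returns array([1]) and predict_proba returns array([[ 0., 1.]]).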

Here we use the iris dataset:

from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# export the tree in Graphviz format using the export_graphviz exporter
with open("iris.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)

# predict the class of samples
clf.predict(iris.data[:1, :])
# the probability of each class
clf.predict_proba(iris.data[:1, :])
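Since an unconstrained tree fits the training data exactly, predicting the first training sample returns array([0]) with class probabilities array([[ 1., 0., 0.]]).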

Install Graphviz and add it to the PATH environment variable, then use the dot tool to create a PDF file: dot -Tpdf iris.dot -o iris.pdf

For instructions on installing Graphviz, see: http://blog.csdn.net/lanchunhui/article/details/49472949

After running, the folder will contain the two files iris.dot and iris.pdf. Open iris.pdf to view the rendered tree.

You can also install the pydotplus package (pip install pydotplus) and generate the PDF directly from Python:

import pydotplus

dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris.pdf")

Note: an error may occur when you run this code; I was unable to resolve it for a long time. See http://stackoverflow.com/questions/31209016/python-pydot-and-decisiontree/36456995#36456995
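If pydotplus keeps failing, one possible workaround (my suggestion, not from the original article) is the graphviz Python package (pip install graphviz), which drives the same dot toolchain:

import graphviz
from sklearn.datasets import load_iris
from sklearn import tree

# refit the classifier so the snippet is self-contained
iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")  # writes iris.pdf (the Graphviz binaries must be on the PATH)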

The following is the demo code from the scikit-learn official website:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis("tight")

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    cmap=plt.cm.Paired)

    plt.axis("tight")

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend()
plt.show()

Running the code produces the figure "Decision surface of a decision tree using paired features": a 2x3 grid with one decision surface per feature pair.
