sklearn Library Example: Decision Tree Classification and Decision-Making
sklearn's introduction to the decision tree algorithm: http://scikit-learn.org/stable/modules/tree.html
1. Decision Tree: a non-parametric supervised learning method used mainly for classification and regression. The goal of the algorithm is to create a model that predicts the target variable by learning simple decision rules inferred from the data features. For example, a decision tree can use a series of if-then-else decision rules to approximate a sine curve, as in the sketch below.
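A minimal sketch of that sine-curve example, assuming scikit-learn and matplotlib are installed (the data and the max_depth value are illustrative choices of mine, not from the article):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# noise-free samples of a sine curve on [0, 5)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()

# the tree approximates sin(x) with a piecewise-constant function
# built from if-then-else splits; deeper trees follow the curve more closely
reg = DecisionTreeRegressor(max_depth=4).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
plt.plot(X, y, "b.", label="data")
plt.plot(X_test, reg.predict(X_test), "r-", label="max_depth=4")
plt.legend()
plt.show()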
Advantages of decision trees:
Easy to understand and interpret; the principle is clear, and trees can be visualized.
Requires little data preparation. Other methods often require data normalization, creating dummy variables, and removing blank values. (Note: this module does not support missing values.)
The cost of predicting with a trained tree is logarithmic in the number of training data points.
Can handle both numerical and categorical data.
Can handle multi-output problems.
Uses a white-box model whose internal structure can be observed directly, so a given prediction can be explained with boolean logic (see the sketch after this list). By contrast, the results of a black-box model such as an artificial neural network can be hard to interpret.
The model can be validated with statistical tests, which makes it possible to assess its reliability.
Performs reasonably well even when the true process that generated the data violates the model's assumptions.
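A small illustration of the white-box point above: the fitted estimator exposes its structure through the tree_ attribute, so every split can be read off directly. The toy data below are my own, not the article's.

from sklearn import tree

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
clf = tree.DecisionTreeClassifier().fit(X, y)

t = clf.tree_
for node in range(t.node_count):
    if t.children_left[node] == t.children_right[node]:  # both -1: a leaf
        print(node, "leaf with class counts", t.value[node])
    else:
        print(node, "if feature[%d] <= %.2f go left, else right"
              % (t.feature[node], t.threshold[node]))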
Disadvantages of decision trees:
An overly complex tree, i.e. one that overfits, may be created. To avoid this, pruning, setting a minimum number of samples per leaf node, or setting a maximum tree depth is sometimes necessary (a parameter sketch follows this list).
Decision trees can be unstable: small changes in the data may produce a completely different tree. This can be mitigated by averaging over an ensemble of trees.
Learning an optimal decision tree is NP-complete, so practical learning algorithms rely on heuristics, for example greedy algorithms that make the locally optimal split at each node. Such algorithms cannot guarantee a globally optimal tree; training multiple trees on randomly selected features and samples mitigates this.
Some concepts are hard to learn because decision trees cannot express them easily, for example XOR, parity, or multiplexer problems.
If some classes dominate, the learned tree will be biased toward them, so it is recommended to balance the dataset before fitting.
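A hedged sketch of the mitigations above: capping the tree's depth and leaf size limits overfitting (first point), and class_weight="balanced" counteracts dominant classes (last point). The parameter values here are illustrative, not recommendations.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,              # cap tree depth to limit overfitting
    min_samples_leaf=10,      # require at least 10 samples in each leaf
    class_weight="balanced",  # reweight classes inversely to their frequency
)

Training many trees on random subsets of features and samples, for example with sklearn.ensemble.RandomForestClassifier, is one realization of the averaging idea in the second and third points.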
2. Classification
DecisionTreeClassifier implements multi-class classification. It takes two arrays as input: an array X of shape [n_samples, n_features] holding the training samples, and an array Y of shape [n_samples] holding the class labels of the training samples.
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

clf.predict([[2., 2.]])        # predicted class: array([1])
clf.predict_proba([[2., 2.]])  # per-class probability (fraction of the leaf): array([[0., 1.]])
Here we use the iris Dataset:
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# export the tree in Graphviz format using the export_graphviz exporter
with open("iris.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)

# predict the class of samples
clf.predict(iris.data[:1, :])
# the probability of each class
clf.predict_proba(iris.data[:1, :])
Install Graphviz and add it to your environment variables, then use dot to create a PDF file: dot -Tpdf iris.dot -o iris.pdf
For instructions on installing Graphviz, see: http://blog.csdn.net/lanchunhui/article/details/49472949
After running the command, iris.dot and iris.pdf will appear in the working folder; open iris.pdf to view the rendered tree.
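If you only want a quick textual view of the learned rules without installing Graphviz at all, newer scikit-learn releases (0.21 and later; an assumption on my part, since the article likely used an older version) provide export_text:

from sklearn.tree import export_text
print(export_text(clf, feature_names=list(iris.feature_names)))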
You can also install the pydotplus package (pip install pydotplus) and generate the PDF directly from Python:
import pydotplus

dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris.pdf")
Note: this code raised an error when I ran it, and I could not resolve it for a long time. See http://stackoverflow.com/questions/31209016/python-pydot-and-decisiontree/36456995#36456995
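One possible workaround (my assumption, not the fix from the article or the linked answer): use the graphviz Python package instead of pydotplus. It requires pip install graphviz plus the Graphviz binaries on your PATH.

import graphviz

dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")  # writes iris.pdf next to the script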
The following is the demo code from the sklearn official website:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis("tight")

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color,
                    label=iris.target_names[i], cmap=plt.cm.Paired)

    plt.axis("tight")

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend()
plt.show()
Running the code produces a 2x3 grid of plots, one decision surface for each pair of iris features.