http://blog.csdn.net/lsldd/article/details/41223147
From this chapter on, we begin to study the algorithms themselves.
First up is a classical and effective classification algorithm: the decision tree classification algorithm.
1. Decision Tree Algorithm
A decision tree uses a tree structure to classify samples according to their attributes. It is the most intuitive classification algorithm, and it can also be used for regression. However, some special logical classification problems are difficult for it; the typical example is XOR logic, which decision trees are not good at handling.
The construction of a decision tree is not unique; unfortunately, constructing the optimal decision tree is an NP-hard problem. So how to build a good decision tree is the focus of research.
J. Ross Quinlan introduced the concept of information entropy into decision tree construction in 1975; this is the famous ID3 algorithm. The later C4.5, C5.0, and CART algorithms are all improvements on this approach.
Entropy measures the degree of "disorder" or "chaos" in a system. The concept can be confusing when you first meet it. For a quick introduction to splitting attributes by information entropy gain, see this article: http://blog.csdn.net/alvine008/article/details/37760639
If it is still unclear, take a look at the example below.
Suppose I want to build a decision tree that automatically picks good apples. For simplicity, I only ask it to learn from the following 4 samples:
Sample    Red    Big    Good apple
0         1      1      1
1         1      0      1
2         0      1      0
3         0      0      0
Each sample has 2 attributes: A0 indicates whether the apple is red, and A1 indicates whether the apple is big.
Then the information entropy of these samples before any split is S = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1 (logarithm base 2).
An information entropy of 1 indicates that the current state is in the most chaotic and disordered condition.
This example has only 2 attributes, so naturally there are only 2 possible decision trees: one that splits on A0 (red) first, and one that splits on A1 (size) first.
It is fairly obvious that the tree that splits on A0 (red) first is better than the one that splits on A1 (size) first.
Of course, this is only an intuitive judgment. For a quantitative assessment, we need to calculate the information entropy gain of each way of splitting.
If A0 is chosen for the first split, the information entropy of each sub-node is calculated as follows:
The leaf node containing samples 0 and 1 has 2 positive examples and 0 negative examples. Its information entropy is E1 = -(2/2 * log(2/2) + 0/2 * log(0/2)) = 0 (taking 0 * log 0 = 0 by convention).
The leaf node containing samples 2 and 3 has 0 positive examples and 2 negative examples. Its information entropy is E2 = -(0/2 * log(0/2) + 2/2 * log(2/2)) = 0.
Therefore, the information entropy after splitting on A0 is the weighted sum of the sub-node entropies: E = E1*2/4 + E2*2/4 = 0.
The information entropy gain of splitting on A0 is G(S, A0) = S - E = 1 - 0 = 1.
In fact, a leaf node of a decision tree means that all of its samples already belong to the same class, so its information entropy must be 0.
Similarly, if A1 is selected first, the information entropy of each sub-node is calculated as follows:
The sub-node containing samples 0 and 2 has 1 positive example and 1 negative example. Its information entropy is E1 = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1.
The sub-node containing samples 1 and 3 has 1 positive example and 1 negative example. Its information entropy is E2 = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1.
Therefore, the information entropy after splitting on A1 is the weighted sum of the sub-node entropies: E = E1*2/4 + E2*2/4 = 1. In other words, the split achieves nothing at all!
The information entropy gain of splitting on A1 is G(S, A1) = S - E = 1 - 1 = 0.
Therefore, before each split, we only need to calculate the information entropy gain of every candidate split and choose the one with the maximum gain.
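The calculation above can be reproduced with a few lines of Python. The following is only a minimal sketch; the helper names entropy and info_gain are illustrative and not from any library:

import math

def entropy(labels):
    # Information entropy of a list of class labels (0 * log 0 is treated as 0).
    total = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        result -= p * math.log2(p)
    return result

def info_gain(samples, labels, attr_index):
    # Entropy before the split minus the weighted entropy of the sub-nodes.
    before = entropy(labels)
    after = 0.0
    for v in set(s[attr_index] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[attr_index] == v]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

# The four apple samples: A0 = red?, A1 = big?, label = good apple?
X = [(1, 1), (1, 0), (0, 1), (0, 0)]
y = [1, 1, 0, 0]

print(entropy(y))          # 1.0  (the most disordered state)
print(info_gain(X, y, 0))  # 1.0  (splitting on A0 removes all uncertainty)
print(info_gain(X, y, 1))  # 0.0  (splitting on A1 achieves nothing)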
2. Data set
To facilitate interpretation and understanding, we use the following extremely simple test data set:
1.5 50 thin
1.5 60 fat
1.6 40 thin
1.6 60 fat
1.7 60 thin
1.7 80 fat
1.8 60 thin
1.8 90 fat
1.9 70 thin
1.9 80 fat
This data set has 10 samples; each sample has 2 attributes, height and weight, plus a class label of "fat" or "thin". The data is stored in 1.txt.
Our task is to train a decision tree classifier that, given a height and a weight, can tell whether the person is fat or thin.
(The data was made up by the author and follows some subjective logic; please disregard how reasonable it is.)
Branching on binary yes/no logic is quite natural for a decision tree. But in this data set, height and weight are continuous values; how are they handled?
Although this is a bit more troublesome, it is not a problem: we only need to find the intermediate points that divide these continuous values into different intervals, which turns the question back into binary logic.
The task of this decision tree is thus to find some critical values of height and weight, split the samples in two by comparing them against these thresholds, and build the tree from top to bottom.
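As a rough illustration of this idea (a sketch of the principle only, not scikit-learn's actual internal code), the candidate thresholds for a continuous attribute can be taken as the midpoints between its adjacent distinct values:

# Candidate thresholds are the midpoints between adjacent distinct values.
heights = [1.5, 1.5, 1.6, 1.6, 1.7, 1.7, 1.8, 1.8, 1.9, 1.9]

distinct = sorted(set(heights))
thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
print(thresholds)  # [1.55, 1.65, 1.75, 1.85]

# Each threshold t turns the attribute into the binary question "height <= t?",
# and the tree keeps the threshold with the largest information gain.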
Using Python's machine learning library scikit-learn, the implementation is quite simple and elegant.
3. Python implementation
The Python code is implemented as follows:
# -*- coding: utf-8 -*-
import numpy as np
import scipy as sp
from sklearn import tree
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

''' read in the data '''
data = []
labels = []
with open("data\\1.txt") as ifile:
    for line in ifile:
        tokens = line.strip().split(' ')
        data.append([float(tk) for tk in tokens[:-1]])
        labels.append(tokens[-1])
x = np.array(data)
labels = np.array(labels)
y = np.zeros(labels.shape)

''' convert the labels to 0/1 '''
y[labels == 'fat'] = 1

''' split into training data and test data '''
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

''' train a decision tree using information entropy as the splitting criterion '''
clf = tree.DecisionTreeClassifier(criterion='entropy')
print(clf)
clf.fit(x_train, y_train)

''' write the decision tree structure to a file '''
with open("tree.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)

''' the coefficients reflect the influence of each feature: the larger the value,
    the bigger the role that feature plays in the classification '''
print(clf.feature_importances_)

''' print the results on the training data '''
answer = clf.predict(x_train)
print(x_train)
print(answer)
print(y_train)
print(np.mean(answer == y_train))

''' precision and recall '''
precision, recall, thresholds = precision_recall_curve(y_train, clf.predict(x_train))
# for this fully grown tree every leaf is pure, so these "probabilities" are exactly 0 or 1
answer = clf.predict_proba(x)[:, 1]
print(classification_report(y, answer, target_names=['thin', 'fat']))
The output looks similar to the following:
[ 0.2488562  0.7511438]
array([[  1.6,  60. ],
       [  1.7,  60. ],
       [  1.9,  80. ],
       [  1.5,  50. ],
       [  1.6,  40. ],
       [  1.7,  80. ],
       [  1.8,  90. ],
       [  1.5,  60. ]])
array([ 1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.])
array([ 1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.])
1.0
             precision    recall  f1-score   support

       thin       0.83      1.00      0.91         5
        fat       1.00      0.80      0.89         5

avg / total       0.92      0.90      0.90        10
array([ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.])
array([ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.])
As you can see, when the training data itself is used for testing, the accuracy is 100%. But when all of the data is tested at the end, 1 test sample is misclassified.
This shows that the decision tree in this example has absorbed the rules of the training set very well, but its ability to predict new samples is slightly weaker.
Here are 3 points to note; they will come up again later in machine learning.
1. Splitting the data into training and test sets.
This is done so that cross-validation is possible; cross-validation is needed to fully test the stability of the classifier.
The 0.2 in the code means that 20% of the data is randomly held out for testing, while the remaining 80% is used to train the decision tree.
In other words, 8 of the 10 samples are randomly chosen for training. The data set in this article is small; the point is to show that, because the training samples are drawn at random, the decision tree you get is different each time you build it.
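To test that stability a bit more systematically than by re-running the script, a sketch like the following could be used. It reuses x, y and the tree module from the listing above; note that cross_val_score lives in sklearn.model_selection in newer scikit-learn versions:

# A rough sketch of k-fold cross-validation to check the classifier's stability.
# With only 10 samples the fold scores will fluctuate a lot, which is exactly
# the instability being discussed.
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

cv_clf = tree.DecisionTreeClassifier(criterion='entropy')
scores = cross_val_score(cv_clf, x, y, cv=5)
print(scores)         # accuracy on each of the 5 folds
print(scores.mean())  # average accuracy across folds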
2. Features influence the result to different degrees.
Different features of a sample can carry very different weights in the classification, so it is important to look at how much each feature contributed once training is done.
In this example, the importance of height is 0.25 and that of weight is 0.75; you can see that weight matters far more than height. For judging fat versus thin, that is quite logical.
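For readability, the importances can be paired with the column names (a small sketch that reuses clf from the listing above; "height" and "weight" are simply labels for the two columns of 1.txt):

# Pair each importance value with a readable feature name.
for name, importance in zip(['height', 'weight'], clf.feature_importances_):
    print(name, importance)
# height 0.24885...
# weight 0.75114...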
3. Precision and recall.
These 2 values are important criteria for judging how correct a classification is. For example, at the end of the code all 10 samples are fed into the classifier for testing:
Test results: array([ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.])
Real results: array([ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.])
The precision for "thin" is 0.83, because the classifier labeled 6 samples as thin and 5 of them really are thin, so the precision is 5/6 = 0.83.
The recall for "thin" is 1.00, because there are 5 thin samples in the data set and the classifier found all of them (although it also mistook one fat for a thin!), so the recall is 5/5 = 1.
The precision for "fat" is 1.00. No need to dwell on that one.
The recall for "fat" is 0.80, because there are 5 fat samples in the data set and the classifier only found 4 of them (one fat was classified as thin!), so the recall is 4/5 = 0.80.
Often, especially when the data is hard to classify, precision and recall are in tension with each other, and you may need to find the best trade-off for your own needs.
In this example, for instance, is your goal to make sure that everyone you flag as fat really is fat (precision), or to find as many fat people as possible (recall)?
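If you want to double-check these numbers, a sketch like the following reproduces them with scikit-learn's precision_score and recall_score; the two arrays are copied from the test/real results shown above:

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0., 1., 0., 1., 0., 1., 0., 1., 0., 1.])  # real results
y_pred = np.array([0., 1., 0., 1., 0., 1., 0., 1., 0., 0.])  # test results

# Treating "thin" as class 0 and "fat" as class 1:
print(precision_score(y_true, y_pred, pos_label=0))  # thin precision: 5/6 = 0.83
print(recall_score(y_true, y_pred, pos_label=0))     # thin recall:    5/5 = 1.00
print(precision_score(y_true, y_pred, pos_label=1))  # fat precision:  4/4 = 1.00
print(recall_score(y_true, y_pred, pos_label=1))     # fat recall:     4/5 = 0.80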
The code also writes the structure of the decision tree to tree.dot. Opening this file makes it easy to draw the decision tree and to see more of its classification details.
The tree.dot for this article is as follows:
digraph Tree {
0 [label="X[1] <= 55.0000\nentropy = 0.954434002925\nsamples = 8", shape="box"] ;
1 [label="entropy = 0.0000\nsamples = 2\nvalue = [ 2.  0.]", shape="box"] ;
0 -> 1 ;
2 [label="X[1] <= 70.0000\nentropy = 0.650022421648\nsamples = 6", shape="box"] ;
0 -> 2 ;
3 [label="X[0] <= 1.6500\nentropy = 0.918295834054\nsamples = 3", shape="box"] ;
2 -> 3 ;
4 [label="entropy = 0.0000\nsamples = 2\nvalue = [ 0.  2.]", shape="box"] ;
3 -> 4 ;
5 [label="entropy = 0.0000\nsamples = 1\nvalue = [ 1.  0.]", shape="box"] ;
3 -> 5 ;
6 [label="entropy = 0.0000\nsamples = 3\nvalue = [ 0.  3.]", shape="box"] ;
2 -> 6 ;
}
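To render tree.dot as an image, you can run Graphviz on it, for example from Python (a sketch that assumes the Graphviz "dot" executable is installed and on your PATH):

# Render tree.dot into a PNG with Graphviz; other formats such as -Tpdf or -Tsvg work the same way.
import subprocess
subprocess.check_call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png'])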
Based on this information, the decision tree looks like this: the root node first checks whether weight <= 55; if so, the person is thin. Otherwise it checks whether weight <= 70; if not, the person is fat; if so, a final check on whether height <= 1.65 decides fat (yes) or thin (no).