Decision Tree in Accord.net
Decision Tree Introduction
A decision tree is a kind of machine learning algorithm that can classify and predict data sets. Please read my other blog post (http://www.cnblogs.com/twocold/p/5424517.html) for details.
Accord.net
Accord.NET (http://accord-framework.net/) is an open-source library that implements machine learning algorithms for the .NET environment. It also includes computer vision, image processing, data analysis and many other algorithms, and is mostly written in C#, which makes it very friendly to .NET programmers. The code is hosted on GitHub and is still maintained (https://github.com/accord-net/framework). No details are given here; interested readers can go to the official website or GitHub to download the documentation and code for an in-depth look. Only the implementation and usage of the decision tree part are briefly described here.
Decision tree structure
The decision tree, as the name implies, is a tree structure. As one of the most basic data structures, trees are known for their flexibility. So how does Accord.NET implement this structure? See the class diagrams below.
First, look at the most important structure in the tree: the node class, whose class diagram is as follows.

A brief introduction to its main properties and methods:

| Property | Meaning |
|----------|---------|
| IsLeaf | whether the node is a leaf |
| IsRoot | whether the node is the root node |
| Output | the class value stored at the node (set when it is a leaf) |
| Value | the value of the parent's split feature that leads to this node |
| Branches | the collection of child nodes |
The tree structure itself has the following members:

| Property / Method | Meaning |
|-------------------|---------|
| Root | the root node |
| Attributes | identifying information for each feature (continuous, discrete, range) |
| InputCount | number of features |
| OutputClasses | number of output classes |
| Compute() | computes the class of a given sample |
| Load(), Save() | writes the decision tree to a file, or reads it back |
| ToAssembly() | stores the tree to a DLL assembly |
Other dependent classes are not introduced here; they are explained more clearly in Accord's official documentation.
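To make the node structure concrete, here is a minimal sketch (not code taken from the library itself) of how a trained tree could be walked recursively, assuming only the DecisionNode members listed in the tables above:

```csharp
using System;
using Accord.MachineLearning.DecisionTrees;

static class TreePrinter
{
    // Recursively print a tree: inner nodes list the branch values,
    // leaves print the class stored in Output.
    public static void Print(DecisionNode node, int depth = 0)
    {
        string indent = new string(' ', depth * 2);
        if (node.IsLeaf)
        {
            Console.WriteLine(indent + "=> class " + node.Output);
            return;
        }
        foreach (DecisionNode child in node.Branches)
        {
            Console.WriteLine(indent + "value = " + child.Value);
            Print(child, depth + 1);
        }
    }
}
```

Calling `TreePrinter.Print(tree.Root)` after training would dump the learned rules as an indented list.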
The classes I mainly want to talk about are ID3Learning and C45Learning. These are Accord.NET's implementations of two decision tree learning (training) algorithms: the ID3 algorithm and the C4.5 algorithm (ID3 is short for Iterative Dichotomiser 3; the C in C4.5 is short for classifier, making it the 4.5th-generation classifier). The differences between the two are described later.
Decision Tree Learning Algorithms
Below, the classic play-tennis example is used to introduce the learning process of the ID3 algorithm. Understanding the following code may require a basic grasp of how decision trees are learned; you can refer to the blog post linked at the beginning for the basic concepts.
Mitchell's Tennis Example:

| Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| D1  | Sunny    | Hot  | High   | Weak   | No  |
| D2  | Sunny    | Hot  | High   | Strong | No  |
| D3  | Overcast | Hot  | High   | Weak   | Yes |
| D4  | Rain     | Mild | High   | Weak   | Yes |
| D5  | Rain     | Cool | Normal | Weak   | Yes |
| D6  | Rain     | Cool | Normal | Strong | No  |
| D7  | Overcast | Cool | Normal | Strong | Yes |
| D8  | Sunny    | Mild | High   | Weak   | No  |
| D9  | Sunny    | Cool | Normal | Weak   | Yes |
| D10 | Rain     | Mild | Normal | Weak   | Yes |
| D11 | Sunny    | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High   | Strong | Yes |
| D13 | Overcast | Hot  | Normal | Weak   | Yes |
| D14 | Rain     | Mild | High   | Strong | No  |
First of all, in order to construct the decision tree, we need to simplify the data above: storing and comparing strings consumes a lot of memory and reduces efficiency. Since all features here are discrete, we can simply represent them as integers, as long as the correspondence between each number and its string is saved. Accord.NET implements this with a codebook, which is not described in detail here. Some properties of the tree also need to be initialized, such as the number of features (InputCount) and the number of classes (OutputClasses), as well as the number of possible values for each feature. The sample data can then be built from the codebook-translated values above.
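As a rough illustration of the codebook idea (the complete, working usage appears in the full example at the end of this post), the codebook learns a per-column mapping between strings and integers and can translate in both directions:

```csharp
using Accord.Statistics.Filters;

// "data" is assumed to be the DataTable holding the tennis samples,
// as built in the full example below.
Codification codebook = new Codification(data,
    "Outlook", "Temperature", "Humidity", "Wind", "PlayTennis");

int code = codebook.Translate("Outlook", "Sunny");   // string -> integer code
string name = codebook.Translate("Outlook", code);   // integer code -> string
```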
The following is pseudo-code for the recursive method in the ID3 algorithm, roughly explaining its implementation logic. (Note: this code omits many details, so it cannot run; it only sketches the logic.)
```csharp
/// <summary>
/// Recursive method of decision tree learning
/// </summary>
/// <param name="root">current recursion node</param>
/// <param name="input">input sample features</param>
/// <param name="output">classes corresponding to the samples</param>
/// <param name="height">depth of the current node</param>
private void split(DecisionNode root, int[][] input, int[] output, int height)
{
    // Stopping condition 1: all values in output[] are equal, i.e. every
    // remaining sample belongs to the same class. The recursion ends, the node
    // is marked as a leaf, and its Output is set to that class value.
    double entropy = Statistics.Tools.Entropy(output, outputClasses);
    if (entropy == 0)
    {
        if (output.Length > 0)
            root.Output = output[0];
        return;
    }

    // Stopping condition 2: every feature on the current path has already been
    // used once, i.e. the remaining samples have identical values on all
    // features and cannot be divided further. The recursion ends, the node is
    // marked as a leaf, and its Output is set to the most frequent class value.
    // This variable holds the number of features not yet used:
    int candidateCount = attributeUsageCount.Count(x => x < 1);
    if (candidateCount == 0)
    {
        root.Output = Statistics.Tools.Mode(output);
        return;
    }

    // If splitting must continue, first look for the optimal split feature.
    // Store the information gain of every remaining feature:
    double[] scores = new double[candidateCount];

    // Compute in parallel the information gain of splitting on each feature:
    Parallel.For(0, scores.Length, i =>
    {
        scores[i] = computeGainRatio(input, output, candidates[i],
            entropy, out partitions[i], out outputSubs[i]);
    });

    // Select the feature with the largest information gain:
    int maxGainIndex; scores.Max(out maxGainIndex);

    // Split the current data set by the values of that feature and recurse
    // into each child node:
    DecisionNode[] children = new DecisionNode[maxGainPartition.Length];
    for (int i = 0; i < children.Length; i++)
    {
        int[][] inputSubset = input.Submatrix(maxGainPartition[i]);
        split(children[i], inputSubset, outputSubs[i], height + 1); // recurse into each child
    }
    root.Branches.AddRange(children);
}
```
This code is only meant to aid understanding; for the actual implementation details, please download the Accord source code and read it. I believe you will gain a lot from it.
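For intuition about the entropy test at the top of the recursion, here is a tiny standalone computation (plain C#, no Accord calls) of the entropy of the 14 tennis samples above, which contain 9 "Yes" and 5 "No" labels:

```csharp
using System;

class EntropyDemo
{
    static void Main()
    {
        // H = -sum over classes of p_i * log2(p_i), using class proportions.
        double pYes = 9.0 / 14, pNo = 5.0 / 14;
        double entropy = -pYes * Math.Log(pYes, 2) - pNo * Math.Log(pNo, 2);
        Console.WriteLine(entropy); // ≈ 0.940; it would be 0 if all labels agreed
    }
}
```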
The implementation of C4.5 is basically the same as that of the ID3 algorithm, with a few differences:

1) When selecting the optimal split feature, the ID3 algorithm uses the information gain, while C4.5 uses the gain ratio.

2) C4.5 supports continuous features: before recursing, it uses binary splitting to compute the n-1 candidate split points, and then treats these split points as if they were discrete values, after which the process is the same as in ID3. Because of this, a continuous feature can be used multiple times along a single path, whereas a discrete feature is used only once (see the sketch after this list).

3) C4.5 supports handling missing values, but unfortunately Accord does not include this feature.
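Here is a minimal sketch of how the C4.5 trainer is swapped in for ID3, assuming the same Accord API as the full ID3 example below; the continuous "Temperature" feature (measured in degrees rather than as a category) and the toy samples are hypothetical:

```csharp
using Accord.MachineLearning.DecisionTrees;
using Accord.MachineLearning.DecisionTrees.Learning;

class C45Sketch
{
    static void Main()
    {
        // "Outlook" stays discrete (3 symbols); "Temperature" is continuous,
        // which ID3Learning cannot handle but C45Learning can.
        DecisionVariable[] attributes =
        {
            new DecisionVariable("Outlook", 3),
            new DecisionVariable("Temperature", DecisionVariableKind.Continuous)
        };
        DecisionTree tree = new DecisionTree(attributes, 2);

        // Hypothetical toy samples: { outlook code, temperature in degrees }.
        double[][] inputs =
        {
            new double[] { 0, 30.5 },
            new double[] { 0, 18.0 },
            new double[] { 1, 22.3 },
            new double[] { 2, 15.8 }
        };
        int[] outputs = { 0, 1, 1, 0 };

        // Same training pattern as the ID3 example below, but C45Learning
        // chooses thresholds for the continuous feature via binary splits.
        C45Learning c45 = new C45Learning(tree);
        c45.Run(inputs, outputs);

        int predicted = tree.Compute(new double[] { 1, 20.0 });
    }
}
```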
Accord.NET also provides a simple pruning algorithm; interested readers can study it on their own.
Following the example above, here is a code sample for constructing and training a decision tree in Accord.NET.
```csharp
// The training data is stored in a DataTable:
DataTable data = new DataTable("Mitchell's Tennis Example");
data.Columns.Add("Day");
data.Columns.Add("Outlook");
data.Columns.Add("Temperature");
data.Columns.Add("Humidity");
data.Columns.Add("Wind");
data.Columns.Add("PlayTennis");

data.Rows.Add("D1", "Sunny", "Hot", "High", "Weak", "No");
data.Rows.Add("D2", "Sunny", "Hot", "High", "Strong", "No");
data.Rows.Add("D3", "Overcast", "Hot", "High", "Weak", "Yes");
data.Rows.Add("D4", "Rain", "Mild", "High", "Weak", "Yes");
data.Rows.Add("D5", "Rain", "Cool", "Normal", "Weak", "Yes");
data.Rows.Add("D6", "Rain", "Cool", "Normal", "Strong", "No");
data.Rows.Add("D7", "Overcast", "Cool", "Normal", "Strong", "Yes");
data.Rows.Add("D8", "Sunny", "Mild", "High", "Weak", "No");
data.Rows.Add("D9", "Sunny", "Cool", "Normal", "Weak", "Yes");
data.Rows.Add("D10", "Rain", "Mild", "Normal", "Weak", "Yes");
data.Rows.Add("D11", "Sunny", "Mild", "Normal", "Strong", "Yes");
data.Rows.Add("D12", "Overcast", "Mild", "High", "Strong", "Yes");
data.Rows.Add("D13", "Overcast", "Hot", "Normal", "Weak", "Yes");
data.Rows.Add("D14", "Rain", "Mild", "High", "Strong", "No");

// Create a codebook object used to "translate" the strings in data
// into integers:
Codification codebook = new Codification(data,
    "Outlook", "Temperature", "Humidity", "Wind", "PlayTennis");

// Convert the feature columns and the class column of data into arrays:
DataTable symbols = codebook.Apply(data);
int[][] inputs = symbols.ToArray<int>("Outlook", "Temperature", "Humidity", "Wind");
int[] outputs = symbols.ToArray<int>("PlayTennis");

// Gather information about each feature, e.g. how many values it can take:
DecisionVariable[] attributes = DecisionVariable.FromCodebook(codebook,
    "Outlook", "Temperature", "Humidity", "Wind");
int classCount = 2; // two possible outputs: playing tennis or not

// Initialize the tree structure from these parameters:
DecisionTree tree = new DecisionTree(attributes, classCount);

// Create the ID3 training method:
ID3Learning id3learning = new ID3Learning(tree);

// Train the decision tree:
id3learning.Run(inputs, outputs);

// The finished tree can now be used to predict a sample; use the codebook
// to "translate" the answer back into a string:
string answer = codebook.Translate("PlayTennis",
    tree.Compute(codebook.Translate("Sunny", "Hot", "High", "Strong")));
```
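For this query (a sunny, hot, humid day with strong wind, the same attribute values as training day D2), the learned tree should answer "No".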
This concludes the small example of using a decision tree.
Tags: machine learning, .NET