Implementation of decision tree algorithms in Data Mining -- Bash

Source: Internet
Author: User

 
Decision trees are among the most widely used classification algorithms in data mining.
As the name implies, a decision tree is a tree of decisions: each branch represents one decision step.
 
Each decision step tests only one data attribute. The tree is then built recursively and greedily, splitting until the stopping condition is met (that is, until a clear decision result can be obtained).
 
Building a decision tree requires some prior data for training (records whose results are already known). By analyzing the training data we measure how much each attribute tells us about the result; this is quantified with information gain, which in turn is defined in terms of entropy. You can also refer to a separate article on information gain and entropy.
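For reference, the standard definitions used throughout this article are: for a data set D whose records fall into classes C_k,

H(D) = - Σ_k (|C_k| / |D|) * log2(|C_k| / |D|)

and for an attribute A that splits D into subsets D_v (one per value v of A), the conditional entropy and the information gain are

H(D | A) = Σ_v (|D_v| / |D|) * H(D_v)
gain(D, A) = H(D) - H(D | A)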
 
The following describes the key concepts in decision tree implementation using examples:
 
Suppose we have the following data:
 
Age job house credit class
1 0 0 1 0
1 0 0 2 0
1 1 0 2 1
1 1 1 1 1
1 0 0 1 0
2 0 0 1 0
2 0 0 2 0
2 1 1 2 1
2 0 1 3 1
2 0 1 3 1
3 0 1 3 1
3 0 1 2 1
3 1 0 2 1
3 1 0 3 1
3 0 0 1 0
(1)
First, we need to find the attribute whose values best separate the different class values. By calculation, the house attribute turns out to separate the classes best. The measure used here is information gain, computed as follows: calculate the entropy of the whole data set, then go through every attribute except class, compute the entropy of the data after splitting on that attribute, and pick the attribute with the smallest resulting entropy (here, house). The information gain is the entropy of the whole data set minus the entropy of the data after it is split by the house attribute.
 
If this gain meets the threshold (> 0.1), we decide that the data should be split on that attribute at this node. In other words, this attribute (house) becomes one decision step of the tree.
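To make this concrete, here is the calculation for the sample data above. 9 of the 15 records have class 1 and 6 have class 0, so the entropy of the whole data set is

H(D) = -(9/15)*log2(9/15) - (6/15)*log2(6/15) ≈ 0.971

The 6 records with house = 1 all have class 1 (entropy 0), while the 9 records with house = 0 contain 3 records of class 1 and 6 of class 0 (entropy ≈ 0.918), so

H(D | house) = (6/15)*0 + (9/15)*0.918 ≈ 0.551
gain(D, house) = 0.971 - 0.551 ≈ 0.420

which is above the 0.1 threshold, and house gives a smaller conditional entropy than any other attribute, so house is chosen for the first split.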
 
(2)
Then, on each subset produced by the split on house, the remaining attributes (all attributes except house) are processed in exactly the same way as in step (1), until the information gain is too small to justify a further split.
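Continuing the example: the subset with house = 1 contains only class 1, so no further split can add information and that branch ends immediately. The subset with house = 0 has 9 records (3 of class 1, 6 of class 0) and entropy of about 0.918; splitting it on job separates the two classes perfectly (conditional entropy 0), so the gain is about 0.918 > 0.1 and job becomes the second decision step. After that split both resulting subsets are pure, so they end as leaf nodes.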
 
In this way we obtain a tree that describes how the data is divided by attribute values. It can then be used as the basis for deciding the class of records whose class field is unknown.
 
 
II. Decision tree code implementation:
 
The calculation code is shown below. It assumes that the data above is saved in a file named descision.dat, and bash 4.0 or later is required to run it (the script uses associative arrays).
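The script reads the file as comma-separated values with the header on the first line (it splits on "," throughout), so descision.dat is expected to look roughly like this. The column names follow the table above; house and job are the names that actually appear in the script's output, while the exact spelling of the other two headers does not affect the result:

age,job,house,credit,class
1,0,0,1,0
1,0,0,2,0
1,1,0,2,1
1,1,1,1,1
1,0,0,1,0
2,0,0,1,0
2,0,0,2,0
2,1,1,2,1
2,0,1,3,1
2,0,1,3,1
3,0,1,3,1
3,0,1,2,1
3,1,0,2,1
3,1,0,3,1
3,0,0,1,0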
 
Bash code
#!/home/admin/bin/bash_bin/bash_4

input=$1;

if [ -z "$input" ]; then
    echo "please input the training file";
    exit 1;
fi

# pre-calculate the log2 values for the later calculations
declare -a log2;
logi=0;
records=$(cat $input | wc -l);
for i in `awk -v n=$records 'BEGIN{for(i=1;i<n;i++) print log(i)/log(2);}'`
do
    ((logi+=1));
    log2[$logi]=$i;
done


# function for calculating the entropy of the given class distribution
function getEntropy {
    local input=`echo $1`;
    if [[ $input == *" "* ]]; then
        local current_entropy=0;
        local sum=0;
        local i;
        for i in $input
        do
            ((sum+=$i));
            current_entropy=$(awk -v n=$i -v l=${log2[$i]} -v o=$current_entropy 'BEGIN{print n*l+o;}');
        done
        current_entropy=$(awk -v n=$current_entropy -v b=$sum -v l=${log2[$sum]} 'BEGIN{print n/b*-1+l;}')
        eval $2=$current_entropy;
    else
        eval $2=0;
    fi
}


### the header line of the input data
declare -A header_info;
header=$(head -1 $input);
headers=(${header//,/ })
length=${#headers[@]};
for((i=0;i<length;i++))
do
    attr=${headers[$i]};
    header_info[$attr]=$i;
done


### the data content of the input data (everything after the header line)
data=${input}_dat;
sed -n '2,$p' $input > $data


# use an array to store the information of the decision tree
# the node structure is {child, sibling, parent, attr, attr_value, leaf, class}
# the root is a virtual node with no used attribute
# only a leaf node has the class flag, i.e. "leaf, class" is only meaningful there
declare -a descision_tree;

# the root node, with no child/sibling or anything else
descision_tree[0]="0:0:0:N:0:0";


# the recursive algorithm is simulated with an explicit stack,
# so we need a trace_stack to record the call-level information
declare -a trace_stack;

# push the root node onto the stack
trace_stack[0]=0;
stack_deep=1;

# build the tree until the trace_stack is empty
while [ $stack_deep -ne 0 ]
do
    ((stack_deep-=1));
    current_node_index=${trace_stack[$stack_deep]};
    current_node=${descision_tree[$current_node_index]};
    current_node_struct=(${current_node//:/ });

    # select the current data set:
    # get the attrs already used on the path to this node and their values
    attrs=${current_node_struct[3]};
    attrv=${current_node_struct[4]};

    declare -a grepstra=();

    if [ $attrs != "N" ]; then
        attr=(${attrs//,/ });
        attrvs=(${attrv//,/ });
        attrc=${#attr[@]};
        for((i=0;i<attrc;i++))
        do
            a=${attr[$i]};
            index=${header_info[$a]};
            grepstra[$index]=${attrvs[$i]};
        done
    fi

    for((i=0;i<length;i++))
    do
        if [ -z "${grepstra[$i]}" ]; then
            grepstra[$i]=".*";
        fi
    done
    grepstrt=${grepstra[*]};
    grepstr=${grepstrt// /,};
    grep "$grepstr" $data > current_node_data

    # calculate the entropy before splitting the records
    entropy=0;
    input=`cat current_node_data | cut -d "," -f 5 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`
    getEntropy "$input" entropy;

    # calculate the entropy for each of the remaining attrs
    # and select the minimal one
    min_attr_entropy=1;
    min_attr_name="";
    min_attr_index=0;
    for((i=0;i<length-1;i++))
    do
        # only consider the attrs that have not been used yet
        if [[ "$attrs" != *"${headers[$i]}"* ]]; then
            # calculate the entropy for the current attr
            ### get the distinct values of headers[$i]
            j=$((i+1));
            cut -d "," -f $j,$length current_node_data > tmp_attr_ds
            dist_values=`cut -d, -f 1 tmp_attr_ds | sort | uniq -c | sed 's/^ \+//g' | sed 's/ /,/g'`;
            totle=0;
            totle_entropy_attr=0;
            for k in $dist_values
            do
                info=(${k//,/ });
                ((totle+=${info[0]}));
                cur_class_input=`grep "^${info[1]}," tmp_attr_ds | cut -d "," -f 2 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`
                cur_attr_value_entropy=0;
                getEntropy "$cur_class_input" cur_attr_value_entropy;
                totle_entropy_attr=$(awk -v c=${info[0]} -v e=$cur_attr_value_entropy -v o=$totle_entropy_attr 'BEGIN{print c*e+o;}');
            done
            attr_entropy=$(awk -v e=$totle_entropy_attr -v c=$totle 'BEGIN{print e/c;}');
            if [ $(echo "$attr_entropy < $min_attr_entropy" | bc) = 1 ]; then
                min_attr_entropy=$attr_entropy;
                min_attr_name="${headers[$i]}";
                min_attr_index=$j;
            fi
        fi
    done

    # calculate the gain between the original entropy of the current node
    # and the entropy after splitting by the attribute with the minimal entropy
    gain=$(awk -v b=$entropy -v a=$min_attr_entropy 'BEGIN{print b-a;}');

    # when the gain is larger than 0.1, make this attribute a branch:
    # add the child nodes to the current node and push their indexes onto the trace_stack;
    # otherwise make the current node a leaf node, record the class flag,
    # and push nothing onto the trace_stack
    if [ $(echo "$gain > 0.1" | bc) = 1 ]; then
        ### get the attribute values
        attr_values_str=`cut -d, -f $min_attr_index current_node_data | sort | uniq`;
        attr_values=($attr_values_str);

        ### generate the child nodes, add them to the tree and push their indexes onto the trace_stack
        tree_store_length=${#descision_tree[@]};
        current_node_struct[0]=$tree_store_length;
        parent_node_index=$current_node_index;

        attr_value_c=${#attr_values[@]};
        for((i=0;i<attr_value_c;i++))
        do
            tree_store_length=${#descision_tree[@]};
            slibling=0;
            if [ $i -lt $((attr_value_c-1)) ]; then
                slibling=$((tree_store_length+1));
            fi

            new_attr="";
            new_attr_value="";
            if [ $attrs != "N" ]; then
                new_attr="$attrs,$min_attr_name";
                new_attr_value="$attrv,${attr_values[$i]}";
            else
                new_attr="$min_attr_name";
                new_attr_value="${attr_values[$i]}";
            fi
            new_node="0:$slibling:$parent_node_index:$new_attr:$new_attr_value:0:0";
            descision_tree[$tree_store_length]="$new_node";
            trace_stack[$stack_deep]=$tree_store_length;
            ((stack_deep+=1));
        done
        current_node_update=${current_node_struct[*]};
        descision_tree[$current_node_index]=${current_node_update// /:};
    else    # the current node is a leaf node
        current_node_struct[5]=1;
        current_node_struct[6]=`cut -d, -f $length current_node_data | sort | uniq -c | sort -n -r | head -1 | sed 's/^ \+[^ ]* //g'`;
        current_node_update=${current_node_struct[*]};
        descision_tree[$current_node_index]=${current_node_update// /:};
    fi

    # output the decision tree after every step of splitting or leaf generation
    echo ${descision_tree[@]};
done
 
Run the Code:
 
Bash code
./descision.sh descision.dat
The execution result is:
 
Output
1:0:0:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:0:0
1:0:0:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:1:1
1:0:0:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:0:0
1:0:0:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:1:1
1:0:0:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:1:0 0:0:1:house,job:0,1:1:1
The output shows how the decision tree structure is generated step by step and how the tree changes during the process.
 
The code stores the entire decision tree in a one-dimensional array, and the nodes are printed in order of their array subscripts.
 
The last line of the output represents the final decision tree; drawn out as a tree rather than a flat array, it is much easier to read.
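Decoding that last line node by node (index 0 is the root, and the child/sibling fields give the links) yields the following structure. The drawing below is a plain-text rendering of that line, not output produced by the script:

root (index 0)
 |-- house = 0 (index 1)
 |     |-- job = 0 (index 3): leaf, class = 0
 |     `-- job = 1 (index 4): leaf, class = 1
 `-- house = 1 (index 2): leaf, class = 1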
 
Note:
One detail of the output above can be misleading: the root node is always stored at the first position of the array, i.e. at index 0. So when the child or sibling field of some other node is 0, it does not point back to the root node; it simply means "none", i.e. that node has no child or no sibling.
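For example, in the final output line the node at index 2 reads 0:0:0:house:1:1:1. Read with the node layout {child, sibling, parent, attr, attr_value, leaf, class}, this means: no child (0), no sibling (0), parent is the root (index 0), the node was reached via house = 1, it is a leaf (1), and its class is 1.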
 
Classification Rules represented by the decision tree:
According to the decision tree output, the mining rules are as follows:
The first test is on the house attribute. When house is 1, we move to the node at index 2; that node is a leaf, and the predicted class is 1.
When house is 0, the decision is made by the job attribute: when job is 0 we move to the node at index 3 and the predicted class is 0; when job is 1 we move to the node at index 4 and the predicted class is 1.
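As a small illustration, the mined rules could be applied to a new record like this. This helper is not part of the original script; the classify function and the sample records are made up for the example, and the record format "age,job,house,credit" matches the training data without the class column:

Bash code
#!/bin/bash
# apply the mined rules to one record of the form "age,job,house,credit"
classify() {
    local job=$(echo "$1" | cut -d, -f2);
    local house=$(echo "$1" | cut -d, -f3);
    if [ "$house" = "1" ]; then
        echo 1;     # house = 1           -> class 1 (node index 2)
    elif [ "$job" = "1" ]; then
        echo 1;     # house = 0, job = 1  -> class 1 (node index 4)
    else
        echo 0;     # house = 0, job = 0  -> class 0 (node index 3)
    fi
}

classify "2,0,1,2"    # prints 1
classify "1,0,0,1"    # prints 0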


Author: pingpang
