Implementation of decision tree algorithms in Data Mining -- Bash

Source: Internet
Author: User

 
Decision trees are among the most widely used classification algorithms in data mining.
As the name implies, a decision tree is a tree of decisions: each branch represents one decision step.
 
Each decision step tests only one data attribute. The tree is then built recursively and greedily, splitting until the stopping condition is met (that is, until a clear decision result can be obtained).
 
Building a decision tree requires some prior data for training (records whose results are already known). By analyzing the training data we measure how much each attribute tells us about the result; this is quantified with information gain, which in turn is defined in terms of entropy. You can also refer to a separate article on information gain and entropy.
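For reference, the standard definitions used throughout this article are: for a data set D whose records fall into classes C_k,

H(D) = - Σ_k (|C_k| / |D|) * log2(|C_k| / |D|)

and for an attribute A that splits D into subsets D_v (one per value v of A), the conditional entropy and the information gain are

H(D | A) = Σ_v (|D_v| / |D|) * H(D_v)
gain(D, A) = H(D) - H(D | A)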
 
The following describes the key concepts in decision tree implementation using examples:
 
Suppose we have the following data:
 
Age job house credit class
1 0 0 1 0
1 0 0 2 0
1 1 0 2 1
1 1 1 1 1
1 0 0 1 0
2 0 0 1 0
2 0 0 2 0
2 1 1 2 1
2 0 1 3 1
2 0 1 3 1
3 0 1 3 1
3 0 1 2 1
3 1 0 2 1
3 1 0 3 1
3 0 0 1 0
(1)
First, we need to find the attribute whose values best separate the different class values. By calculation, the house attribute turns out to separate the classes best. The measure used here is information gain, computed as follows: calculate the entropy of the whole data set, then go through every attribute except class, compute the entropy of the data after splitting on that attribute, and pick the attribute with the smallest resulting entropy (here, house). The information gain is the entropy of the whole data set minus the entropy of the data after it is split by the house attribute.
 
If this gain meets the threshold (> 0.1), we decide that the data should be split on that attribute at this node. In other words, this attribute (house) becomes one decision step of the tree.
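To make this concrete, here is the calculation for the sample data above. 9 of the 15 records have class 1 and 6 have class 0, so the entropy of the whole data set is

H(D) = -(9/15)*log2(9/15) - (6/15)*log2(6/15) ≈ 0.971

The 6 records with house = 1 all have class 1 (entropy 0), while the 9 records with house = 0 contain 3 records of class 1 and 6 of class 0 (entropy ≈ 0.918), so

H(D | house) = (6/15)*0 + (9/15)*0.918 ≈ 0.551
gain(D, house) = 0.971 - 0.551 ≈ 0.420

which is above the 0.1 threshold, and house gives a smaller conditional entropy than any other attribute, so house is chosen for the first split.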
 
(2)
Then, on each subset produced by the split on house, the remaining attributes (all attributes except house) are processed in exactly the same way as in step (1), until the information gain is too small to justify a further split.
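Continuing the example: the subset with house = 1 contains only class 1, so no further split can add information and that branch ends immediately. The subset with house = 0 has 9 records (3 of class 1, 6 of class 0) and entropy of about 0.918; splitting it on job separates the two classes perfectly (conditional entropy 0), so the gain is about 0.918 > 0.1 and job becomes the second decision step. After that split both resulting subsets are pure, so they end as leaf nodes.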
 
In this way we obtain a tree that describes how the data is divided by attribute values. It can then be used as the basis for deciding the class of records whose class field is unknown.
 
 
II. Decision tree code implementation:
 
The calculation code is shown below. It assumes that the data above is saved in a file named descision.dat, and bash 4.0 or later is required to run it (the script uses associative arrays).
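The script reads the file as comma-separated values with the header on the first line (it splits on "," throughout), so descision.dat is expected to look roughly like this. The column names follow the table above; house and job are the names that actually appear in the script's output, while the exact spelling of the other two headers does not affect the result:

age,job,house,credit,class
1,0,0,1,0
1,0,0,2,0
1,1,0,2,1
1,1,1,1,1
1,0,0,1,0
2,0,0,1,0
2,0,0,2,0
2,1,1,2,1
2,0,1,3,1
2,0,1,3,1
3,0,1,3,1
3,0,1,2,1
3,1,0,2,1
3,1,0,3,1
3,0,0,1,0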
 
Bash code
#!/home/admin/bin/bash_bin/bash_4

input=$1;

if [ -z "$input" ]; then
    echo "please input the training file";
    exit 1;
fi

# pre-calculate the log2 values for the later calculations
declare -a log2;
logi=0;
records=$(cat $input | wc -l);
for i in `awk -v n=$records 'BEGIN{for(i=1;i<n;i++) print log(i)/log(2);}'`
do
    ((logi+=1));
    log2[$logi]=$i;
done


# function for calculating the entropy of the given class distribution
function getEntropy {
    local input=`echo $1`;
    if [[ $input == *" "* ]]; then
        local current_entropy=0;
        local sum=0;
        local i;
        for i in $input
        do
            ((sum+=$i));
            current_entropy=$(awk -v n=$i -v l=${log2[$i]} -v o=$current_entropy 'BEGIN{print n*l+o;}');
        done
        current_entropy=$(awk -v n=$current_entropy -v b=$sum -v l=${log2[$sum]} 'BEGIN{print n/b*-1+l;}')
        eval $2=$current_entropy;
    else
        eval $2=0;
    fi
}


### the header line of the input data
declare -A header_info;
header=$(head -1 $input);
headers=(${header//,/ })
length=${#headers[@]};
for((i=0;i<length;i++))
do
    attr=${headers[$i]};
    header_info[$attr]=$i;
done


### the data content of the input data (everything after the header line)
data=${input}_dat;
sed -n '2,$p' $input > $data


# use an array to store the information of the decision tree
# the node structure is {child, sibling, parent, attr, attr_value, leaf, class}
# the root is a virtual node with no used attribute
# only a leaf node has the class flag, i.e. "leaf, class" is only meaningful there
declare -a descision_tree;

# the root node, with no child/sibling or anything else
descision_tree[0]="0:0:0:N:0:0";


# the recursive algorithm is simulated with an explicit stack,
# so we need a trace_stack to record the call-level information
declare -a trace_stack;

# push the root node onto the stack
trace_stack[0]=0;
stack_deep=1;

# build the tree until the trace_stack is empty
while [ $stack_deep -ne 0 ]
do
    ((stack_deep-=1));
    current_node_index=${trace_stack[$stack_deep]};
    current_node=${descision_tree[$current_node_index]};
    current_node_struct=(${current_node//:/ });

    # select the current data set:
    # get the attrs already used on the path to this node and their values
    attrs=${current_node_struct[3]};
    attrv=${current_node_struct[4]};

    declare -a grepstra=();

    if [ $attrs != "N" ]; then
        attr=(${attrs//,/ });
        attrvs=(${attrv//,/ });
        attrc=${#attr[@]};
        for((i=0;i<attrc;i++))
        do
            a=${attr[$i]};
            index=${header_info[$a]};
            grepstra[$index]=${attrvs[$i]};
        done
    fi

    for((i=0;i<length;i++))
    do
        if [ -z "${grepstra[$i]}" ]; then
            grepstra[$i]=".*";
        fi
    done
    grepstrt=${grepstra[*]};
    grepstr=${grepstrt// /,};
    grep "$grepstr" $data > current_node_data

    # calculate the entropy before splitting the records
    entropy=0;
    input=`cat current_node_data | cut -d "," -f 5 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`
    getEntropy "$input" entropy;

    # calculate the entropy for each of the remaining attrs
    # and select the minimal one
    min_attr_entropy=1;
    min_attr_name="";
    min_attr_index=0;
    for((i=0;i<length-1;i++))
    do
        # only consider the attrs that have not been used yet
        if [[ "$attrs" != *"${headers[$i]}"* ]]; then
            # calculate the entropy for the current attr
            ### get the distinct values of headers[$i]
            j=$((i+1));
            cut -d "," -f $j,$length current_node_data > tmp_attr_ds
            dist_values=`cut -d, -f 1 tmp_attr_ds | sort | uniq -c | sed 's/^ \+//g' | sed 's/ /,/g'`;
            totle=0;
            totle_entropy_attr=0;
            for k in $dist_values
            do
                info=(${k//,/ });
                ((totle+=${info[0]}));
                cur_class_input=`grep "^${info[1]}," tmp_attr_ds | cut -d "," -f 2 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`
                cur_attr_value_entropy=0;
                getEntropy "$cur_class_input" cur_attr_value_entropy;
                totle_entropy_attr=$(awk -v c=${info[0]} -v e=$cur_attr_value_entropy -v o=$totle_entropy_attr 'BEGIN{print c*e+o;}');
            done
            attr_entropy=$(awk -v e=$totle_entropy_attr -v c=$totle 'BEGIN{print e/c;}');
            if [ $(echo "$attr_entropy < $min_attr_entropy" | bc) = 1 ]; then
                min_attr_entropy=$attr_entropy;
                min_attr_name="${headers[$i]}";
                min_attr_index=$j;
            fi
        fi
    done

    # calculate the gain between the original entropy of the current node
    # and the entropy after splitting by the attribute with the minimal entropy
    gain=$(awk -v b=$entropy -v a=$min_attr_entropy 'BEGIN{print b-a;}');

    # when the gain is larger than 0.1, make this attribute a branch:
    # add the child nodes to the current node and push their indexes onto the trace_stack;
    # otherwise make the current node a leaf node, record the class flag,
    # and push nothing onto the trace_stack
    if [ $(echo "$gain > 0.1" | bc) = 1 ]; then
        ### get the attribute values
        attr_values_str=`cut -d, -f $min_attr_index current_node_data | sort | uniq`;
        attr_values=($attr_values_str);

        ### generate the child nodes, add them to the tree and push their indexes onto the trace_stack
        tree_store_length=${#descision_tree[@]};
        current_node_struct[0]=$tree_store_length;
        parent_node_index=$current_node_index;

        attr_value_c=${#attr_values[@]};
        for((i=0;i<attr_value_c;i++))
        do
            tree_store_length=${#descision_tree[@]};
            slibling=0;
            if [ $i -lt $((attr_value_c-1)) ]; then
                slibling=$((tree_store_length+1));
            fi

            new_attr="";
            new_attr_value="";
            if [ $attrs != "N" ]; then
                new_attr="$attrs,$min_attr_name";
                new_attr_value="$attrv,${attr_values[$i]}";
            else
                new_attr="$min_attr_name";
                new_attr_value="${attr_values[$i]}";
            fi
            new_node="0:$slibling:$parent_node_index:$new_attr:$new_attr_value:0:0";
            descision_tree[$tree_store_length]="$new_node";
            trace_stack[$stack_deep]=$tree_store_length;
            ((stack_deep+=1));
        done
        current_node_update=${current_node_struct[*]};
        descision_tree[$current_node_index]=${current_node_update// /:};
    else    # the current node is a leaf node
        current_node_struct[5]=1;
        current_node_struct[6]=`cut -d, -f $length current_node_data | sort | uniq -c | sort -n -r | head -1 | sed 's/^ \+[^ ]* //g'`;
        current_node_update=${current_node_struct[*]};
        descision_tree[$current_node_index]=${current_node_update// /:};
    fi

    # output the decision tree after every step of splitting or leaf generation
    echo ${descision_tree[@]};
done
 
Run the Code:
 
Bash code
./descision.sh descision.dat
The execution result is:
 
Output
1:0:0:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:0:0
1:0:0:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:1:1
1:0:0:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:0:0
1:0:0:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:1:1
1:0:0:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:1:0 0:0:1:house,job:0,1:1:1
The output shows how the decision tree structure is generated step by step and how the tree changes during the process.
 
The code stores the entire decision tree in a one-dimensional array, and the nodes are printed in order of their array subscripts.
 
The last line of the output represents the final decision tree; drawn out as a tree rather than a flat array, it is much easier to read.
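Decoding that last line node by node (index 0 is the root, and the child/sibling fields give the links) yields the following structure. The drawing below is a plain-text rendering of that line, not output produced by the script:

root (index 0)
 |-- house = 0 (index 1)
 |     |-- job = 0 (index 3): leaf, class = 0
 |     `-- job = 1 (index 4): leaf, class = 1
 `-- house = 1 (index 2): leaf, class = 1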
 
Note:
One detail of the output above can be misleading: the root node is always stored at the first position of the array, i.e. at index 0. So when the child or sibling field of some other node is 0, it does not point back to the root node; it simply means "none", i.e. that node has no child or no sibling.
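For example, in the final output line the node at index 2 reads 0:0:0:house:1:1:1. Read with the node layout {child, sibling, parent, attr, attr_value, leaf, class}, this means: no child (0), no sibling (0), parent is the root (index 0), the node was reached via house = 1, it is a leaf (1), and its class is 1.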
 
Classification Rules represented by the decision tree:
According to the decision tree output, the mining rules are as follows:
The first test is on the house attribute. When house is 1, we move to the node at index 2; that node is a leaf, and the predicted class is 1.
When house is 0, the decision is made by the job attribute: when job is 0 we move to the node at index 3 and the predicted class is 0; when job is 1 we move to the node at index 4 and the predicted class is 1.
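As a small illustration, the mined rules could be applied to a new record like this. This helper is not part of the original script; the classify function and the sample records are made up for the example, and the record format "age,job,house,credit" matches the training data without the class column:

Bash code
#!/bin/bash
# apply the mined rules to one record of the form "age,job,house,credit"
classify() {
    local job=$(echo "$1" | cut -d, -f2);
    local house=$(echo "$1" | cut -d, -f3);
    if [ "$house" = "1" ]; then
        echo 1;     # house = 1           -> class 1 (node index 2)
    elif [ "$job" = "1" ]; then
        echo 1;     # house = 0, job = 1  -> class 1 (node index 4)
    else
        echo 0;     # house = 0, job = 0  -> class 0 (node index 3)
    fi
}

classify "2,0,1,2"    # prints 1
classify "1,0,0,1"    # prints 0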


Author: pingpang
