Example Random Forest Algorithm Implementation in Python



Decision trees suffer from high variance, which makes them fragile on specific training data. The bagging (bootstrap aggregating) algorithm builds a composite model from samples of the training data, which effectively reduces the variance of decision trees; however, the resulting trees end up highly correlated with one another, which is not ideal.

The random forest algorithm is an extension of bagging. In addition to building each tree from a sample of the training data, random forest also constrains the data features that can be used to build each tree, forcing the generated decision trees to be different from one another. This improves the performance of the algorithm.

This tutorial describes how to implement the random forest algorithm in Python. After working through it, you will know:

  • The difference between bagged decision trees and the random forest algorithm;
  • How to construct bagged decision trees with more variance;
  • How to apply the random forest algorithm to a predictive modeling problem.

Algorithm Description

This section briefly introduces the random forest algorithm itself and the Sonar dataset used in this tutorial's experiment.

Random Forest Algorithm

Each step in constructing a decision tree involves a greedy selection of the best split point in the dataset.

This mechanism makes unpruned decision trees prone to high variance. Building a composite of trees, each from a different sample of the training dataset (a different view of the problem), and combining their predictions can stabilize the output and reduce this variance. The method is called bootstrap aggregating, or bagging for short. The limitation of bagging is that the same greedy algorithm is used to create every tree, so the split points chosen by each tree are likely to be the same or very similar, which ultimately makes the trees converge (tree-to-tree correlation). In turn, this produces similar predictions and undercuts the variance reduction we originally wanted.
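
To make the bootstrap idea concrete, here is a minimal sketch of drawing several samples with replacement from a small toy dataset. The helper name and the toy rows are our own illustration, not part of the tutorial's code; the tutorial's own subsample() helper appears in the complete listing later.

from random import randrange, seed

def bootstrap_sample(dataset, ratio=1.0):
    # Draw rows with replacement; the same row may appear more than once.
    sample = list()
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        sample.append(dataset[randrange(len(dataset))])
    return sample

seed(1)
toy_rows = [[i] for i in range(10)]               # ten toy rows
samples = [bootstrap_sample(toy_rows) for _ in range(3)]
for s in samples:
    print(sorted(r[0] for r in s))                # duplicates show the replacement

Each tree in a bagged ensemble would be trained on one such sample, and their predictions combined.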

We can force the generated decision trees to be different by limiting the features the greedy algorithm is allowed to evaluate at each split point while building the tree. This is the random forest algorithm.

Like bagging, the random forest algorithm draws bootstrap samples from the training set and trains a tree on each. The difference is that at each split point, instead of considering every attribute when splitting the data, only a fixed, randomly chosen subset of attributes may be considered.

The number of attributes considered at each split point is limited to the square root of the number of input features, as follows:

num_features_for_split = sqrt(total_input_features)

This small change makes the generated decision trees different (decorrelated), which increases the diversity of the predictions. Combining diverse predictions tends to perform better than a single decision tree or bagging alone.
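
To make the rule concrete, here is a small sketch of the calculation (our own illustration, matching the n_features computation used in the complete listing later) for a dataset with 60 input features:

from math import sqrt

total_input_features = 60                         # e.g. the 60 sonar features used later
num_features_for_split = int(sqrt(total_input_features))
print(num_features_for_split)                     # 7, since sqrt(60) is about 7.75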

Sonar Dataset

We will use the Sonar dataset as the input data for this tutorial. It describes the returns from sonar signals bouncing off different surfaces: 60 input variables give the strength of the returns at different angles. This is a binary classification problem that requires the model to distinguish rocks from metal cylinders, with a total of 208 observations.

The dataset is easy to understand: each input variable is continuous and ranges from 0 to 1, which makes the data easy to work with. The output variable is a string, 'M' for metal (mine) and 'R' for rock, which must be converted to the integers 1 and 0 respectively.

The Zero Rule algorithm, which always predicts the class with the most observations in the dataset ('M', i.e. metal), achieves about 53% accuracy and serves as a baseline.
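
To make that baseline concrete, here is a minimal sketch of a Zero Rule classifier. The zero_rule_predict() helper and the label counts are our own illustration (chosen to reproduce the roughly 53% majority quoted above), not code from the tutorial.

def zero_rule_predict(train_labels, test_rows):
    # Always predict the most frequent class observed in the training labels.
    majority = max(set(train_labels), key=train_labels.count)
    return [majority for _ in test_rows]

# Hypothetical label list with a slight 'M' majority, as described above.
labels = ['M'] * 111 + ['R'] * 97                 # 208 labels in total
predictions = zero_rule_predict(labels, labels)
correct = sum(1 for p, a in zip(predictions, labels) if p == a)
print('%.1f%%' % (100.0 * correct / len(labels)))  # about 53.4%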

For more information about this dataset, see the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)

Download the dataset for free, name it sonar.all-data.csv, and place it in your working directory.

Tutorial

This tutorial is divided into two steps.

1. Calculating splits.

2. Sonar dataset case study.

These steps provide the foundation you need to implement the random forest algorithm and apply it to your own predictive modeling problems.

1. Calculating Splits

In a decision tree, a split point is defined by a specific attribute and attribute value. The attribute and value chosen are those that result in the lowest cost.

For classification problems, the cost function is usually the Gini index, which measures the purity of the groups of data created by the split point. For a binary classification problem such as this one, a Gini index of 0 indicates perfect purity, meaning the split separates the class values cleanly into two groups.
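
As a small worked example (the toy rows are our own; the gini_index() body matches the one in the complete listing further below), a perfectly pure split scores 0.0, while a fully mixed split scores the worst value for this formulation:

def gini_index(groups, class_values):
    # Same (unweighted) formulation as in the complete listing below.
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini

# Each toy row is [feature, class]; the class values are 0 and 1.
perfect = [[[1, 0], [2, 0]], [[3, 1], [4, 1]]]    # each group is pure
worst   = [[[1, 0], [2, 1]], [[3, 0], [4, 1]]]    # each group is half-and-half
print(gini_index(perfect, [0, 1]))                # 0.0
print(gini_index(worst, [0, 1]))                  # 1.0 (2 groups x 2 classes x 0.25)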

To find the best split point in a decision tree, the cost of every value of every input variable in the training dataset needs to be evaluated.

In bagging and random forest, this procedure is carried out on a sample of the training dataset drawn with replacement. The random forest samples the input data by rows as well as by columns; for the rows, sampling with replacement means that the same row may be selected and added to the sample more than once.

Rather than enumerating all values of all input attributes in search of the lowest-cost split point, we can optimize this process by creating a sample of the input attributes to consider.

This sample of input attributes can be chosen randomly and without replacement, meaning that each input attribute is considered at most once when searching for the lowest-cost split point.
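
Here is a minimal sketch of that feature-sampling step in isolation. The helper name is our own; the loop mirrors the one inside get_split() shown next.

from random import randrange, seed

def sample_features_without_replacement(n_total_features, n_features):
    # Each feature index is drawn at most once.
    features = list()
    while len(features) < n_features:
        index = randrange(n_total_features)
        if index not in features:
            features.append(index)
    return features

seed(1)
print(sample_features_without_replacement(60, 7))  # seven distinct indices in 0..59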

The get_split() function below implements this process. It takes a dataset and a number of input features to evaluate as arguments, where the dataset may be a sample of the actual training set. The helper function test_split() splits the dataset on a candidate split point, and gini_index() evaluates the cost of a split point from the groups of rows it creates.

We can see that a list of features is created by randomly selecting feature indices. By enumerating this feature list, each value of each selected feature in the dataset is then evaluated as a candidate split point.

# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    features = list()
    # randomly select n_features distinct feature indices (the class column is excluded)
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    # evaluate every value of every selected feature as a candidate split point
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index':b_index, 'value':b_value, 'groups':b_groups}

Now that we know how to modify the decision tree algorithm for use in the random forest algorithm, we can apply it to a real dataset.

2. Sonar Dataset Case Study

In this section, we apply the random forest algorithm to the Sonar dataset. The example assumes that a CSV copy of the dataset is in the current working directory under the file name sonar.all-data.csv.

The dataset is first loaded, its string values are converted to numbers, and the output column is converted from strings to the integer values 0 and 1. This is done with the helper functions load_csv(), str_column_to_float(), and str_column_to_int() respectively.

We will use k-fold cross-validation to estimate how well the learned model performs on unseen data. This means we build and evaluate k models and estimate performance as the mean model error. Classification accuracy is used to evaluate each model. These behaviors are provided by the helper functions cross_validation_split(), accuracy_metric(), and evaluate_algorithm() respectively.

The bagging portion is handled by the classification and regression tree (CART) helpers. The helper function test_split() splits the dataset into groups; gini_index() evaluates a split point; the modified get_split() function discussed above finds split points; to_terminal(), split(), and build_tree() create a single decision tree; predict() makes a prediction with a decision tree; subsample() creates a subsample of the training set; and bagging_predict() makes a prediction with a list of decision trees.
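
For clarity, the voting step performed by bagging_predict() can be illustrated in isolation (the toy predictions below are our own, not output from the tutorial's code):

def majority_vote(predictions):
    # The most common prediction among the trees wins, as in bagging_predict().
    return max(set(predictions), key=predictions.count)

tree_predictions = [0, 1, 1, 0, 1]        # hypothetical outputs of five trees for one row
print(majority_vote(tree_predictions))    # 1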

The new function, random_forest(), first creates a list of decision trees from subsamples of the training set and then uses them to make predictions.

As we said at the beginning, the key difference between random forest and bagged decision trees is the one small change in how trees are constructed, which is embodied in the get_split() function.

The complete code is as follows:

# Random Forest Algorithm on Sonar Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini

# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    features = list()
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index':b_index, 'value':b_value, 'groups':b_groups}

# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, n_features, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left, n_features)
        split(node['left'], max_depth, min_size, n_features, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right, n_features)
        split(node['right'], max_depth, min_size, n_features, depth+1)

# Build a decision tree
def build_tree(train, max_depth, min_size, n_features):
    root = get_split(train, n_features)
    split(root, max_depth, min_size, n_features, 1)
    return root

# Make a prediction with a decision tree
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio):
    sample = list()
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

# Make a prediction with a list of bagged trees
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)

# Random Forest Algorithm
def random_forest(train, test, max_depth, min_size, sample_size, n_trees, n_features):
    trees = list()
    for i in range(n_trees):
        sample = subsample(train, sample_size)
        tree = build_tree(sample, max_depth, min_size, n_features)
        trees.append(tree)
    predictions = [bagging_predict(trees, row) for row in test]
    return predictions

# Test the random forest algorithm
seed(1)
# load and prepare data
filename = 'sonar.all-data.csv'
dataset = load_csv(filename)
# convert string attributes to floats
for i in range(0, len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
max_depth = 10
min_size = 1
sample_size = 1.0
n_features = int(sqrt(len(dataset[0])-1))
for n_trees in [1, 5, 10]:
    scores = evaluate_algorithm(dataset, random_forest, n_folds, max_depth, min_size, sample_size, n_trees, n_features)
    print('Trees: %d' % n_trees)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

The parameter values used in the evaluation section at the bottom of the listing are as follows.

A k value of 5 is used for cross-validation, giving each fold 208/5 = 41.6, or just over 40, sonar return records to evaluate on each iteration.

The maximum depth of each tree is set to 10, and the minimum number of training rows at each node to 1. Each training sample is created with the same size as the original dataset, which is a sensible default for the random forest algorithm.

The number of features considered at each split point is set to the square root of the total number of features, that is, sqrt(60) = 7.74, truncated to 7.

Three different numbers of trees are evaluated for comparison, showing the increasing skill of the algorithm as more trees are added.

Finally, running the example prints the scores for each fold and the mean accuracy for each configuration, as follows:

Trees: 1
Scores: [68.29268292682927, 75.60975609756098, 70.73170731707317, 63.41463414634146, 65.85365853658537]
Mean Accuracy: 68.780%

Trees: 5
Scores: [68.29268292682927, 68.29268292682927, 78.04878048780488, 65.85365853658537, 68.29268292682927]
Mean Accuracy: 69.756%

Trees: 10
Scores: [68.29268292682927, 78.04878048780488, 75.60975609756098, 70.73170731707317, 70.73170731707317]
Mean Accuracy: 72.683%

Extensions

This section lists some extensions of this tutorial that you may be interested in exploring.

  • Algorithm tuning. The configuration used in this tutorial has not been optimized; experiment with larger numbers of trees, different numbers of features, or even different tree configurations to improve the results (a minimal tuning sketch follows this list).
  • More problems. The method is equally suited to other classification problems, and it can even be adapted to regression with a new cost function and a new way of combining the predictions of the trees.
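
As a starting point for the tuning idea in the first bullet, here is a minimal, hypothetical parameter sweep. It reuses evaluate_algorithm() and random_forest() from the complete listing above, assumes dataset is already loaded and prepared, and uses arbitrary value ranges.

# Hypothetical parameter sweep (assumes the functions and `dataset` from the listing above).
n_folds, max_depth, min_size, sample_size = 5, 10, 1, 1.0
for n_features in [3, 7, 15]:
    for n_trees in [10, 20]:
        scores = evaluate_algorithm(dataset, random_forest, n_folds, max_depth,
                                    min_size, sample_size, n_trees, n_features)
        print('features=%d trees=%d mean accuracy=%.3f%%' %
              (n_features, n_trees, sum(scores) / float(len(scores))))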

Summary

Through this tutorial, you learned how the random forest algorithm is implemented, in particular:

  • The difference between random forest and bagged decision trees;
  • How to update the construction of decision trees to build a random forest;
  • How to apply the random forest algorithm to a real-world predictive modeling problem.

That is all for this article. I hope it is helpful for your learning.
