Python decision tree and random forest algorithm examples


This article introduces Python implementations of the decision tree and random forest algorithms, shared for your reference. The details are as follows:

Decision trees and random forests are both common classification algorithms, and their judgment logic resembles human thinking: when facing a combination of multiple conditions, people can likewise draw a decision tree to aid decision making. This article briefly introduces the decision tree and random forest algorithms and their implementations, then uses both algorithms to detect FTP brute-force cracking and POP3 brute-force cracking. For the complete code, see:

https://github.com/traviszeng/MLWithWebSecurity

Decision Tree Algorithm

A decision tree represents a mapping between object attributes and attribute values. Each internal node in the decision tree represents a test on an attribute, each forking path represents a possible attribute value, and each leaf node corresponds to the object value represented by the path from the root node to that leaf. In data mining, we often use decision trees for data classification and prediction.
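As a toy illustration (not from the original article), the mapping a decision tree encodes can be read as nested if/else tests on attributes; the thresholds below are hypothetical, chosen only to mimic the kind of rules a tree learns on iris-like data:

```python
# A hand-written decision "tree" classifying a flower from two
# hypothetical measurements; a real tree learns such thresholds from data.
def classify(petal_length, petal_width):
    # Internal node: test one attribute against a threshold
    if petal_length < 2.5:
        return "setosa"          # leaf node
    # Forking path: a second attribute test on the other branch
    if petal_width < 1.8:
        return "versicolor"      # leaf node
    return "virginica"           # leaf node

print(classify(1.4, 0.2))  # follows the left branch -> "setosa"
print(classify(5.1, 2.3))  # takes two right branches -> "virginica"
```

Each call traces exactly one root-to-leaf path, which is why decision trees are easy to interpret.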

Helloworld of decision tree

In this section, we use a decision tree to classify and predict the iris dataset. We will use graphviz, via sklearn's tree module, to export the decision tree and store it in PDF format. The code is as follows:

```python
# The helloworld of the decision tree: use a decision tree to classify the iris dataset
from sklearn.datasets import load_iris
from sklearn import tree
import pydotplus

# Load the iris dataset
iris = load_iris()

# Initialize the DecisionTreeClassifier
clf = tree.DecisionTreeClassifier()

# Fit the data
clf = clf.fit(iris.data, iris.target)

# Visualize the decision tree in PDF format
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris.pdf")
```

The visualized decision tree obtained from the iris dataset is saved to iris.pdf.

Through this small example, we can initially feel the process and features of decision trees. Compared with other classification algorithms, the results produced by decision trees are more intuitive and more in line with human thinking.
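If graphviz or pydotplus is unavailable, the same learned rules can also be inspected as plain text with sklearn's `tree.export_text` helper (a small sketch, not part of the original article):

```python
from sklearn.datasets import load_iris
from sklearn import tree

# Fit the same classifier as above
iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

# Print the learned decision rules as indented if/else text
print(tree.export_text(clf, feature_names=list(iris.feature_names)))
```

The output lists one threshold test per line, mirroring the nested if/else structure of the tree.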

Use a decision tree to detect POP3 brute-force cracking

Here we use the POP3-related data in the KDD99 dataset and the decision tree algorithm to learn to identify records related to POP3 brute-force cracking. You can search for background on the KDD99 dataset. The decision tree source code is as follows:

```python
# Use the decision tree algorithm to detect POP3 brute-force cracking
from sklearn.model_selection import cross_val_score
from sklearn import tree
import pydotplus

# Load the KDD99 dataset
def load_kdd99(filename):
    X = []
    with open(filename) as f:
        for line in f:
            line = line.strip('\n')
            line = line.split(',')
            X.append(line)
    return X

# Build the training dataset
def get_guess_passwdandNormal(x):
    v = []
    features = []
    targets = []
    # Find the POP3-protocol records labeled guess_passwd or normal
    for x1 in x:
        if (x1[41] in ['guess_passwd.', 'normal.']) and (x1[2] == 'pop_3'):
            if x1[41] == 'guess_passwd.':
                targets.append(1)
            else:
                targets.append(0)
            # Select the network features related to POP3 password cracking
            # and the TCP content features as the sample features.
            # NOTE: the exact feature index ranges were lost in the source text.
            x1 = [x1[0]] + x1[] + x1[]
            v.append(x1)
    for x1 in v:
        v1 = []
        for x2 in x1:
            v1.append(float(x2))
        features.append(v1)
    return features, targets

if __name__ == '__main__':
    v = load_kdd99("../../data/kddcup99/corrected")
    x, y = get_guess_passwdandNormal(v)
    clf = tree.DecisionTreeClassifier()
    print(cross_val_score(clf, x, y, n_jobs=-1, cv=10))
    clf = clf.fit(x, y)
    dot_data = tree.export_graphviz(clf, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_pdf("POP3Detector.pdf")
```

The generated decision tree, saved to POP3Detector.pdf, identifies whether a record corresponds to POP3 brute-force cracking.

Random forest Algorithm

A random forest is a classifier that trains multiple decision trees and predicts with all of them: its output class is the mode of the classes output by the individual trees. The decision trees in a random forest are not correlated with one another. When a new input sample arrives, each decision tree in the forest classifies it, and the class chosen most often becomes the forest's prediction. In general, the decision performance of a random forest is superior to that of a single decision tree.
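The majority-vote step described above can be sketched in a few lines; the per-tree predictions here are hypothetical stand-ins for what the individual trees of a forest would output:

```python
from collections import Counter

# A random forest's final class is the mode of the classes
# predicted by its individual trees (majority vote).
def forest_predict(tree_predictions):
    # most_common(1) returns [(winning_class, vote_count)]
    return Counter(tree_predictions).most_common(1)[0][0]

# Five hypothetical trees vote on one sample; class 1 wins 3 votes to 2
print(forest_predict([1, 0, 1, 1, 0]))  # -> 1
```

Because errors made by uncorrelated trees tend to cancel out in the vote, the ensemble is usually more accurate than any single tree.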

Helloworld of random forest

Next, we use randomly generated data to intuitively compare the accuracy of a decision tree and a random forest:

```python
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Randomly generate a 10-feature, 100-center classification dataset
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

# Decision tree
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print("decision tree accuracy:", scores.mean())

# Random forest
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print("random forest accuracy:", scores.mean())
```

Finally, we can see that the accuracy of the decision tree is slightly inferior to that of the random forest.

Use the random forest algorithm to detect FTP brute-force cracking

Next, we use the ADFA-LD dataset and the random forest algorithm to build a random forest classifier for FTP data. The ADFA-LD dataset records system call sequences, and the number of calls in each trace file differs. You can search for more details on the dataset.

The detailed source code is as follows:

```python
# -*- coding: utf-8 -*-
# Use the random forest algorithm to detect FTP brute-force cracking
import os
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Read one system call trace file as a single line
def load_one_flle(filename):
    with open(filename) as f:
        line = f.readline()
        line = line.strip('\n')
    return line

# Load the normal training traces (label 0)
def load_adfa_training_files(rootdir):
    x = []
    y = []
    filelist = os.listdir(rootdir)
    for i in range(0, len(filelist)):
        path = os.path.join(rootdir, filelist[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y

# Recursively collect all file paths under path
def dirlist(path, allfile):
    filelist = os.listdir(path)
    for filename in filelist:
        filepath = path + filename
        if os.path.isdir(filepath):
            dirlist(filepath + '/', allfile)
        else:
            allfile.append(filepath)
    return allfile

# Load the Hydra FTP attack traces (label 1)
def load_adfa_hydra_ftp_files(rootdir):
    x = []
    y = []
    allfile = dirlist(rootdir, [])
    for file in allfile:
        # The regular expression matches the abnormal FTP trace files
        if re.match(r"../../data/ADFA-LD/Attack_Data_Master/Hydra_FTP_\d+/UAD-Hydra-FTP*", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y

if __name__ == '__main__':
    x1, y1 = load_adfa_training_files("../../data/ADFA-LD/Training_Data_Master/")
    x2, y2 = load_adfa_hydra_ftp_files("../../data/ADFA-LD/Attack_Data_Master/")
    x = x1 + x2
    y = y1 + y2

    # Turn each variable-length call trace into a fixed-length count vector
    vectorizer = CountVectorizer(min_df=1)
    x = vectorizer.fit_transform(x)
    x = x.toarray()

    clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
    clf = clf.fit(x, y)
    score = cross_val_score(clf, x, y, n_jobs=-1, cv=10)
    print(score)
    print('average accuracy:', np.mean(score))
```

Finally, we can obtain a random forest classifier with an accuracy of about 98.4%.
