The Random Forest Algorithm Implemented in Python, with a Summary

This example describes a random forest algorithm implemented in Python, shared here for your reference. The details are as follows:

Random forest is a frequently used classification and prediction algorithm in data mining. It uses classification or regression decision trees as the base classifier. Some basic points of the algorithm:

* If the dataset contains m samples, m samples are drawn with replacement (bootstrap sampling) to build each tree;
* For each tree, k features are randomly sampled to form a feature subset; common rules for choosing k are the square root or the natural logarithm of the total number of features;
* Each tree is grown fully, without pruning;
* The prediction for each sample is obtained by a majority vote over the trees (for regression, by averaging the trees' leaf-node predictions); a minimal sketch of these steps is given right after this list.
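As a minimal, illustrative sketch of these steps (this is not the article's own code; it only assumes the Titanic train.csv file and its Survived column, which are introduced below):

```python
import math
import pandas as pd

# Illustrative only: bootstrap sampling, feature-subset sizing, and majority voting.
data = pd.read_csv('train.csv', index_col=0)        # the Titanic training data used later in this post
k = len(data.columns) - 1                           # number of candidate features
n_fea_sqrt = int(math.sqrt(k))                      # square-root rule
n_fea_log = max(1, int(math.log(k)))                # natural-logarithm rule

bootstrap = data.sample(n=len(data), replace=True)  # draw m rows with replacement
features = pd.Series(data.columns.drop('Survived')).sample(n=n_fea_sqrt)  # random feature subset

votes = [1, 0, 1, 1, 0]                             # hypothetical per-tree predictions for one sample
prediction = max(set(votes), key=votes.count)       # majority vote -> 1
```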

The documentation of the well-known Python machine learning package scikit-learn gives a more detailed introduction to this algorithm: http://scikit-learn.org/stable/modules/ensemble.html#random-forests
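For comparison, here is a minimal sketch of the scikit-learn equivalent, which exposes the same knobs directly (the parameter values are illustrative, and X_train/y_train/X_test are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features='sqrt',  # size of the random feature subset tried at each split
    min_samples_leaf=1,   # grow deep, essentially unpruned trees
    random_state=0,
)
# clf.fit(X_train, y_train)        # X_train, y_train: prepared features and labels
# y_pred = clf.predict(X_test)
```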

For personal study and testing, a model is built and evaluated on the classic "Kaggle 101" Titanic passenger dataset. The competition page and the related datasets can be downloaded here: https://www.kaggle.com/c/titanic

The sinking of the Titanic is a very famous shipwreck in history. Working on it, I suddenly felt that I was no longer dealing with cold data, but using data mining methods to study a concrete historical problem. The main goal of the model is to predict whether a passenger survived based on a series of characteristics of each passenger, such as gender, age, cabin class, and port of embarkation; this is a typical binary classification prediction problem. The dataset field names and a few sample records are as follows:

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |

It is worth noting that SibSp stands for siblings/spouses, i.e. the number of siblings and spouses travelling with a passenger, while Parch stands for parents/children, i.e. the number of parents and children travelling with them.

The entire data processing and modeling procedure is given below, based on Ubuntu + Python 3.4 (the Anaconda scientific computing distribution, which already bundles a series of frequently used packages such as pandas, numpy, and sklearn, is strongly recommended here).

Being too lazy to switch input methods while writing, I kept the main comments in English, with Chinese comments only as a supplement :-)

# -*- coding: utf-8 -*-
"""
@author: kim
"""
from model import *   # load the base classifier code (tree_grow, model_prediction; listed below)
import pandas as pd
import math

# ETL: apply the same procedure to the training set and the test set
training = pd.read_csv('train.csv', index_col=0)
test = pd.read_csv('test.csv', index_col=0)

# convert gender to 0/1 (female = 0, male = 1)
SexCode = pd.DataFrame([0, 1], index=['female', 'male'], columns=['Sexcode'])
training = training.join(SexCode, how='left', on='Sex')
# drop a few variables that do not take part in modelling:
# name, ticket number, port of embarkation, cabin number, and the original Sex column
training = training.drop(['Name', 'Ticket', 'Embarked', 'Cabin', 'Sex'], axis=1)
test = test.join(SexCode, how='left', on='Sex')
test = test.drop(['Name', 'Ticket', 'Embarked', 'Cabin', 'Sex'], axis=1)
print('ETL is done!')

# model fitting
# ==================== parameter adjustment ====================
min_leaf = 1
min_dec_gini = 0.0001
n_trees = 5
n_fea = int(math.sqrt(len(training.columns) - 1))
# ==============================================================
'''
best score: 0.83
min_leaf = 30
min_dec_gini = 0.001
n_trees = 20
'''

# ensemble by random forest
FOREST = {}
tmp = list(training.columns)
tmp.pop(tmp.index('Survived'))
feaList = pd.Series(tmp)
for t in range(n_trees):
    feasample = feaList.sample(n=n_fea, replace=False)          # select a random feature subset
    fea = feasample.tolist()
    fea.append('Survived')
    subset = training.sample(n=len(training), replace=True)     # bootstrap the dataset with replacement
    subset = subset[fea]
    # print(str(t) + ' classifier built on features: ' + str(fea))
    FOREST[t] = tree_grow(subset, 'Survived', min_leaf, min_dec_gini)   # save the tree

# model prediction
# ==============================================================
currentdata = training
output = 'submission_rf_20151116_30_0.001_20'
# ==============================================================
prediction = {}
for r in currentdata.index:   # one row (passenger) at a time
    prediction_vote = {1: 0, 0: 0}
    row = currentdata[currentdata.index == r]
    for n in range(n_trees):
        tree_dict = FOREST[n]   # one tree
        p = model_prediction(tree_dict, row)
        prediction_vote[p] += 1
    vote = pd.Series(prediction_vote)
    prediction[r] = list(vote.sort_values(ascending=False).index)[0]   # the vote result

result = pd.Series(prediction, name='Survived_p')
# result.to_csv(output)
t = training.join(result, how='left')
accuracy = round(len(t[t['Survived'] == t['Survived_p']]) / len(t), 5)
print(accuracy)

The above is the random forest code. As mentioned before, a random forest is a combination of a series of decision trees. Each time a decision tree splits, the Gini coefficient is used to measure the impurity of the current node; the feature and split point whose partition of the dataset minimizes the Gini value (that is, most reduces the impurity of the target variable) are selected as the best split feature and split point. A small worked example of this criterion follows, and the decision tree code comes after it:
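As a tiny worked example of this criterion (the labels are made up purely for illustration, mirroring the gini_cal and best_split_col logic in the code below):

```python
# Worked example with made-up 0/1 labels: Gini impurity and its decrease after a split.
def gini(labels):
    p1 = sum(labels) / len(labels)
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

parent = [1, 1, 1, 0, 0, 0, 0, 0]           # 3 positives, 5 negatives -> Gini = 0.46875
left, right = [1, 1, 1, 0], [0, 0, 0, 0]    # one candidate split of the parent node
weighted = (gini(left) * len(left) + gini(right) * len(right)) / len(parent)   # 0.1875
decrease = gini(parent) - weighted          # 0.28125, the quantity compared against min_dec_gini
```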

# -*- coding: utf-8 -*-
"""
@author: kim
"""
import pandas as pd
import numpy as np
import math

def tree_grow(dataframe, target, min_leaf, min_dec_gini):
    tree = {}   # start a new (sub)tree
    is_not_leaf = (len(dataframe) > min_leaf)
    if is_not_leaf:
        fea, sp, gd = best_split_col(dataframe, target)
        if gd > min_dec_gini:
            tree['fea'] = fea
            tree['val'] = sp
            l, r = dataSplit(dataframe, fea, sp)
            l = l.drop(fea, axis=1)   # 1116 modified: drop the used feature in each branch
            r = r.drop(fea, axis=1)
            tree['left'] = tree_grow(l, target, min_leaf, min_dec_gini)
            tree['right'] = tree_grow(r, target, min_leaf, min_dec_gini)
        else:   # not enough impurity decrease: return a leaf
            return leaf(dataframe[target])
    else:
        return leaf(dataframe[target])
    return tree

def leaf(class_label):
    # return the majority class of the node
    tmp = {}
    for i in class_label:
        if i in tmp:
            tmp[i] += 1
        else:
            tmp[i] = 1
    s = pd.Series(tmp).sort_values(ascending=False)
    return s.index[0]

def gini_cal(class_label):
    # Gini impurity of a set of 0/1 labels
    p_1 = sum(class_label) / len(class_label)
    p_0 = 1 - p_1
    gini = 1 - (pow(p_0, 2) + pow(p_1, 2))
    return gini

def dataSplit(dataframe, split_fea, split_val):
    left_node = dataframe[dataframe[split_fea] <= split_val]
    right_node = dataframe[dataframe[split_fea] > split_val]
    return left_node, right_node

def best_split_col(dataframe, target_name):
    best_fea = ''   # modified 1116
    best_split_point = 0
    col_list = list(dataframe.columns)
    col_list.remove(target_name)
    gini_0 = gini_cal(dataframe[target_name])
    n = len(dataframe)
    gini_dec = -99999999
    for col in col_list:
        node = dataframe[[col, target_name]]
        unique = node.groupby(col).count().index
        for split_point in unique:   # try each unique value as a split point
            left_node, right_node = dataSplit(node, col, split_point)
            if len(left_node) > 0 and len(right_node) > 0:
                gini_col = gini_cal(left_node[target_name]) * (len(left_node) / n) \
                    + gini_cal(right_node[target_name]) * (len(right_node) / n)
                if (gini_0 - gini_col) > gini_dec:
                    gini_dec = gini_0 - gini_col   # decrease of impurity
                    best_fea = col
                    best_split_point = split_point
        # print(col, split_point, gini_0 - gini_col)
    return best_fea, best_split_point, gini_dec

def model_prediction(model, row):   # row is a one-row DataFrame
    fea = model['fea']
    val = model['val']
    left = model['left']
    right = model['right']
    if row[fea].tolist()[0] <= val:   # compare the row's value with the split point
        branch = left
    else:
        branch = right
    if 'dict' in str(type(branch)):   # internal node: keep descending
        prediction = model_prediction(branch, row)
    else:                             # leaf: the class label itself
        prediction = branch
    return prediction
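Assuming the listing above is saved as model.py (which is what the training script's `from model import *` suggests), the functions can be exercised on a toy DataFrame like this; the values below are invented purely for illustration:

```python
import pandas as pd
from model import tree_grow, model_prediction

# Toy data: one numeric feature plus the binary target.
toy = pd.DataFrame({'Age':      [22, 38, 26, 35, 54, 2, 27, 14],
                    'Survived': [0,  1,  1,  1,  0,  1, 0,  1]})

tree = tree_grow(toy, 'Survived', min_leaf=1, min_dec_gini=0.0001)
row = toy[toy.index == 0]              # model_prediction expects a one-row DataFrame
print(tree)                            # nested dict: {'fea': ..., 'val': ..., 'left': ..., 'right': ...}
print(model_prediction(tree, row))     # predicted class label for that row
```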

In fact, the above code has plenty of room for improvement in efficiency: even though the dataset is not very large, choosing larger input parameters, such as generating 100 trees, already takes quite a while. The prediction results were submitted to Kaggle for evaluation, and the accuracy on the test set turned out not to be very high, a little lower than that of the corresponding model in sklearn (0.77512) :-( To improve accuracy, there are two major directions: constructing new features and tuning the parameters of the existing model; a sketch of the latter is given below.
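For the parameter-tuning direction, one possible route is a cross-validated grid search; the sketch below uses scikit-learn's RandomForestClassifier rather than the hand-rolled forest above, and the grid values are illustrative only:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [20, 50, 100],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 10, 30],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
# search.fit(X, y)          # X, y: the prepared Titanic features and the 'Survived' labels
# print(search.best_params_, search.best_score_)
```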

This is only an example; you are welcome to suggest improvements to my modeling ideas and to the way the algorithm is implemented.
