The Random Forest Algorithm Implemented in Python, with a Summary

This example describes a random forest algorithm implemented in Python, shared here for your reference. The details are as follows:

Random forest is a frequently used classification and prediction algorithm in data mining. It uses classification or regression decision trees as the base classifier. Some basic points of the algorithm:

* If the dataset contains m samples, m samples are drawn with replacement (bootstrap sampling) to build each tree;
* For each tree, k features are randomly sampled to form a feature subset; common rules for choosing k are the square root or the natural logarithm of the total number of features;
* Each tree is grown fully, without pruning;
* The prediction for each sample is obtained by a majority vote over the trees (for regression, by averaging the trees' leaf-node predictions); a minimal sketch of these steps is given right after this list.
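As a minimal, illustrative sketch of these steps (this is not the article's own code; it only assumes the Titanic train.csv file and its Survived column, which are introduced below):

```python
import math
import pandas as pd

# Illustrative only: bootstrap sampling, feature-subset sizing, and majority voting.
data = pd.read_csv('train.csv', index_col=0)        # the Titanic training data used later in this post
k = len(data.columns) - 1                           # number of candidate features
n_fea_sqrt = int(math.sqrt(k))                      # square-root rule
n_fea_log = max(1, int(math.log(k)))                # natural-logarithm rule

bootstrap = data.sample(n=len(data), replace=True)  # draw m rows with replacement
features = pd.Series(data.columns.drop('Survived')).sample(n=n_fea_sqrt)  # random feature subset

votes = [1, 0, 1, 1, 0]                             # hypothetical per-tree predictions for one sample
prediction = max(set(votes), key=votes.count)       # majority vote -> 1
```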

The documentation of the well-known Python machine learning package scikit-learn gives a more detailed introduction to this algorithm: http://scikit-learn.org/stable/modules/ensemble.html#random-forests
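For comparison, here is a minimal sketch of the scikit-learn equivalent, which exposes the same knobs directly (the parameter values are illustrative, and X_train/y_train/X_test are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features='sqrt',  # size of the random feature subset tried at each split
    min_samples_leaf=1,   # grow deep, essentially unpruned trees
    random_state=0,
)
# clf.fit(X_train, y_train)        # X_train, y_train: prepared features and labels
# y_pred = clf.predict(X_test)
```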

For personal study and testing, a model is built and evaluated on the classic "Kaggle 101" Titanic passenger dataset. The competition page and the related datasets can be downloaded here: https://www.kaggle.com/c/titanic

The sinking of the Titanic is a very famous shipwreck in history. Working on it, I suddenly felt that I was no longer dealing with cold data, but using data mining methods to study a concrete historical problem. The main goal of the model is to predict whether a passenger survived based on a series of characteristics of each passenger, such as gender, age, cabin class, and port of embarkation; this is a typical binary classification prediction problem. The dataset field names and a few sample records are as follows:

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |

It is worth noting that SibSp stands for siblings/spouses, i.e. the number of siblings and spouses travelling with a passenger, while Parch stands for parents/children, i.e. the number of parents and children travelling with them.

The entire data processing and modeling procedure is given below, based on Ubuntu + Python 3.4 (the Anaconda scientific computing distribution, which already bundles a series of frequently used packages such as pandas, numpy, and sklearn, is strongly recommended here).

Being too lazy to switch input methods while writing, I kept the main comments in English, with Chinese comments only as a supplement :-)

# -*- coding: utf-8 -*-
"""
@author: kim
"""
from model import *   # load the base classifier code (tree_grow, model_prediction; listed below)
import pandas as pd
import math

# ETL: apply the same procedure to the training set and the test set
training = pd.read_csv('train.csv', index_col=0)
test = pd.read_csv('test.csv', index_col=0)

# convert gender to 0/1 (female = 0, male = 1)
SexCode = pd.DataFrame([0, 1], index=['female', 'male'], columns=['Sexcode'])
training = training.join(SexCode, how='left', on='Sex')
# drop a few variables that do not take part in modelling:
# name, ticket number, port of embarkation, cabin number, and the original Sex column
training = training.drop(['Name', 'Ticket', 'Embarked', 'Cabin', 'Sex'], axis=1)
test = test.join(SexCode, how='left', on='Sex')
test = test.drop(['Name', 'Ticket', 'Embarked', 'Cabin', 'Sex'], axis=1)
print('ETL is done!')

# model fitting
# ==================== parameter adjustment ====================
min_leaf = 1
min_dec_gini = 0.0001
n_trees = 5
n_fea = int(math.sqrt(len(training.columns) - 1))
# ==============================================================
'''
best score: 0.83
min_leaf = 30
min_dec_gini = 0.001
n_trees = 20
'''

# ensemble by random forest
FOREST = {}
tmp = list(training.columns)
tmp.pop(tmp.index('Survived'))
feaList = pd.Series(tmp)
for t in range(n_trees):
    feasample = feaList.sample(n=n_fea, replace=False)          # select a random feature subset
    fea = feasample.tolist()
    fea.append('Survived')
    subset = training.sample(n=len(training), replace=True)     # bootstrap the dataset with replacement
    subset = subset[fea]
    # print(str(t) + ' classifier built on features: ' + str(fea))
    FOREST[t] = tree_grow(subset, 'Survived', min_leaf, min_dec_gini)   # save the tree

# model prediction
# ==============================================================
currentdata = training
output = 'submission_rf_20151116_30_0.001_20'
# ==============================================================
prediction = {}
for r in currentdata.index:   # one row (passenger) at a time
    prediction_vote = {1: 0, 0: 0}
    row = currentdata[currentdata.index == r]
    for n in range(n_trees):
        tree_dict = FOREST[n]   # one tree
        p = model_prediction(tree_dict, row)
        prediction_vote[p] += 1
    vote = pd.Series(prediction_vote)
    prediction[r] = list(vote.sort_values(ascending=False).index)[0]   # the vote result

result = pd.Series(prediction, name='Survived_p')
# result.to_csv(output)
t = training.join(result, how='left')
accuracy = round(len(t[t['Survived'] == t['Survived_p']]) / len(t), 5)
print(accuracy)

The above is the random forest code. As mentioned before, a random forest is a combination of a series of decision trees. Each time a decision tree splits, the Gini coefficient is used to measure the impurity of the current node; the feature and split point whose partition of the dataset minimizes the Gini value (that is, most reduces the impurity of the target variable) are selected as the best split feature and split point. A small worked example of this criterion follows, and the decision tree code comes after it:
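As a tiny worked example of this criterion (the labels are made up purely for illustration, mirroring the gini_cal and best_split_col logic in the code below):

```python
# Worked example with made-up 0/1 labels: Gini impurity and its decrease after a split.
def gini(labels):
    p1 = sum(labels) / len(labels)
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

parent = [1, 1, 1, 0, 0, 0, 0, 0]           # 3 positives, 5 negatives -> Gini = 0.46875
left, right = [1, 1, 1, 0], [0, 0, 0, 0]    # one candidate split of the parent node
weighted = (gini(left) * len(left) + gini(right) * len(right)) / len(parent)   # 0.1875
decrease = gini(parent) - weighted          # 0.28125, the quantity compared against min_dec_gini
```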

# -*- coding: utf-8 -*-
"""
@author: kim
"""
import pandas as pd
import numpy as np
import math

def tree_grow(dataframe, target, min_leaf, min_dec_gini):
    tree = {}   # start a new (sub)tree
    is_not_leaf = (len(dataframe) > min_leaf)
    if is_not_leaf:
        fea, sp, gd = best_split_col(dataframe, target)
        if gd > min_dec_gini:
            tree['fea'] = fea
            tree['val'] = sp
            l, r = dataSplit(dataframe, fea, sp)
            l = l.drop(fea, axis=1)   # 1116 modified: drop the used feature in each branch
            r = r.drop(fea, axis=1)
            tree['left'] = tree_grow(l, target, min_leaf, min_dec_gini)
            tree['right'] = tree_grow(r, target, min_leaf, min_dec_gini)
        else:   # not enough impurity decrease: return a leaf
            return leaf(dataframe[target])
    else:
        return leaf(dataframe[target])
    return tree

def leaf(class_label):
    # return the majority class of the node
    tmp = {}
    for i in class_label:
        if i in tmp:
            tmp[i] += 1
        else:
            tmp[i] = 1
    s = pd.Series(tmp).sort_values(ascending=False)
    return s.index[0]

def gini_cal(class_label):
    # Gini impurity of a set of 0/1 labels
    p_1 = sum(class_label) / len(class_label)
    p_0 = 1 - p_1
    gini = 1 - (pow(p_0, 2) + pow(p_1, 2))
    return gini

def dataSplit(dataframe, split_fea, split_val):
    left_node = dataframe[dataframe[split_fea] <= split_val]
    right_node = dataframe[dataframe[split_fea] > split_val]
    return left_node, right_node

def best_split_col(dataframe, target_name):
    best_fea = ''   # modified 1116
    best_split_point = 0
    col_list = list(dataframe.columns)
    col_list.remove(target_name)
    gini_0 = gini_cal(dataframe[target_name])
    n = len(dataframe)
    gini_dec = -99999999
    for col in col_list:
        node = dataframe[[col, target_name]]
        unique = node.groupby(col).count().index
        for split_point in unique:   # try each unique value as a split point
            left_node, right_node = dataSplit(node, col, split_point)
            if len(left_node) > 0 and len(right_node) > 0:
                gini_col = gini_cal(left_node[target_name]) * (len(left_node) / n) \
                    + gini_cal(right_node[target_name]) * (len(right_node) / n)
                if (gini_0 - gini_col) > gini_dec:
                    gini_dec = gini_0 - gini_col   # decrease of impurity
                    best_fea = col
                    best_split_point = split_point
        # print(col, split_point, gini_0 - gini_col)
    return best_fea, best_split_point, gini_dec

def model_prediction(model, row):   # row is a one-row DataFrame
    fea = model['fea']
    val = model['val']
    left = model['left']
    right = model['right']
    if row[fea].tolist()[0] <= val:   # compare the row's value with the split point
        branch = left
    else:
        branch = right
    if 'dict' in str(type(branch)):   # internal node: keep descending
        prediction = model_prediction(branch, row)
    else:                             # leaf: the class label itself
        prediction = branch
    return prediction
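Assuming the listing above is saved as model.py (which is what the training script's `from model import *` suggests), the functions can be exercised on a toy DataFrame like this; the values below are invented purely for illustration:

```python
import pandas as pd
from model import tree_grow, model_prediction

# Toy data: one numeric feature plus the binary target.
toy = pd.DataFrame({'Age':      [22, 38, 26, 35, 54, 2, 27, 14],
                    'Survived': [0,  1,  1,  1,  0,  1, 0,  1]})

tree = tree_grow(toy, 'Survived', min_leaf=1, min_dec_gini=0.0001)
row = toy[toy.index == 0]              # model_prediction expects a one-row DataFrame
print(tree)                            # nested dict: {'fea': ..., 'val': ..., 'left': ..., 'right': ...}
print(model_prediction(tree, row))     # predicted class label for that row
```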

In fact, the above code has plenty of room for improvement in efficiency: even though the dataset is not very large, choosing larger input parameters, such as generating 100 trees, already takes quite a while. The prediction results were submitted to Kaggle for evaluation, and the accuracy on the test set turned out not to be very high, a little lower than that of the corresponding model in sklearn (0.77512) :-( To improve accuracy, there are two major directions: constructing new features and tuning the parameters of the existing model; a sketch of the latter is given below.
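For the parameter-tuning direction, one possible route is a cross-validated grid search; the sketch below uses scikit-learn's RandomForestClassifier rather than the hand-rolled forest above, and the grid values are illustrative only:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [20, 50, 100],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 10, 30],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
# search.fit(X, y)          # X, y: the prepared Titanic features and the 'Survived' labels
# print(search.best_params_, search.best_score_)
```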

This is only an example; you are welcome to suggest improvements to my modeling ideas and to the way the algorithm is implemented.
