Random forest algorithm implemented in Python (with summary)
This post walks through a random forest algorithm implemented in Python, shared here for your reference. The details are as follows.
Random forest is a frequently used classification and prediction algorithm in data mining. It uses classification or regression decision trees as the base classifier. Some basic points of the algorithm:
* If the sample size of the dataset is m, each tree is trained on m samples drawn with replacement (bootstrap sampling);
* K features are randomly sampled (without replacement) to form the feature subset for each tree; common choices for K are the square root or the logarithm of the total number of features;
* Each tree is grown fully, without pruning;
* The prediction for each sample is obtained by a majority vote over the trees (in regression, by averaging the leaf-node values of the trees).
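The two sampling steps and the final vote can be sketched in a few lines. This is a minimal illustration with made-up data (the array sizes and the five hypothetical tree votes are mine), not the code built later in the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 8 samples, 4 features (illustrative values only).
X = rng.normal(size=(8, 4))
y = np.array([0, 1, 0, 1, 1, 0, 1, 1])

m, n_features = X.shape
k = int(np.sqrt(n_features))  # feature-subset size: the sqrt rule from the text

# One bagging round: draw m rows with replacement, then k features without.
row_idx = rng.integers(0, m, size=m)                      # bootstrap sample
fea_idx = rng.choice(n_features, size=k, replace=False)   # random feature subset
subset = X[np.ix_(row_idx, fea_idx)]
print(subset.shape)  # (8, 2): same sample size, fewer features

# Majority vote over the per-tree predictions for one test sample.
tree_votes = [1, 0, 1, 1, 0]  # hypothetical outputs of 5 trees
prediction = max(set(tree_votes), key=tree_votes.count)
print(prediction)  # 1
```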
The documentation of the well-known Python machine learning package scikit-learn has a more detailed introduction to this algorithm: http://scikit-learn.org/stable/modules/ensemble.html#random-forests
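For comparison with the hand-rolled version built later in this post, here is a minimal sketch of how the library version is invoked through scikit-learn's `RandomForestClassifier`. The synthetic dataset and all parameter values below are illustrative choices of mine, not taken from the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset such as Titanic.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=20,       # number of trees in the ensemble
    max_features='sqrt',   # sqrt rule for the per-tree feature subset
    min_samples_leaf=5,    # analogous to a hand-rolled min_leaf parameter
    random_state=0,
)
clf.fit(X_tr, y_tr)
score = clf.score(X_te, y_te)  # mean accuracy on the held-out split
```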
For personal study and testing purposes, a model is built and evaluated on the classic Kaggle 101 Titanic passenger dataset. The competition page and datasets can be found at: https://www.kaggle.com/c/titanic
The sinking of the Titanic is a very famous shipwreck in history. Working with it, I suddenly felt I was not dealing with cold data but using data mining methods to study a concrete historical problem. The goal of the model is to predict whether a passenger survived based on a series of characteristics, such as gender, age, cabin class, and port of embarkation; this is a typical binary classification problem. The dataset's field names and a few sample rows are as follows:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
It is worth noting that SibSp is short for siblings/spouses, i.e. the number of siblings and spouses travelling with the passenger, and Parch is short for parents/children, the number of parents and children travelling with the passenger.
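As an aside on feature construction, these two fields are often combined into a single family-size feature. A small pandas sketch with made-up rows (the `FamilySize` name is my own choice and is not a column of the dataset):

```python
import pandas as pd

# Tiny illustrative frame with the two fields (not the real train.csv).
df = pd.DataFrame({'SibSp': [1, 1, 0], 'Parch': [0, 0, 2]})

# Total family members aboard, the passenger included.
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
print(df['FamilySize'].tolist())  # [2, 2, 3]
```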
The entire data processing and modelling workflow is given below, based on Ubuntu + Python 3.4 (the Anaconda scientific computing distribution already integrates a series of frequently used packages such as pandas, numpy, and sklearn, and is strongly recommended here).
(I was too lazy to switch input methods while writing, so the code comments are mainly in English, supplemented by Chinese :-))
```python
# -*- coding: utf-8 -*-
"""
@author: kim
"""
from model import *  # load the base-classifier code (tree_grow, model_prediction)

# ETL: apply the same procedure to the training set and the test set
training = pd.read_csv('train.csv', index_col=0)
test = pd.read_csv('test.csv', index_col=0)
SexCode = pd.DataFrame([1, 0], index=['female', 'male'], columns=['Sexcode'])  # encode gender as 0/1
training = training.join(SexCode, how='left', on='Sex')
# drop the variables that do not take part in modelling: name, ticket number,
# port of embarkation, cabin number, and the raw Sex column
training = training.drop(['Name', 'Ticket', 'Embarked', 'Cabin', 'Sex'], axis=1)
test = test.join(SexCode, how='left', on='Sex')
test = test.drop(['Name', 'Ticket', 'Embarked', 'Cabin', 'Sex'], axis=1)
print('ETL is done!')

# model fitting
# ==================== parameter adjustment ====================
min_leaf = 1
min_dec_gini = 0.0001
n_trees = 5
n_fea = int(math.sqrt(len(training.columns) - 1))
# ==============================================================
'''
best score: 0.83
min_leaf = 30
min_dec_gini = 0.001
n_trees = 20
'''

# ensemble by random forest
FOREST = {}
tmp = list(training.columns)
tmp.pop(tmp.index('Survived'))
feaList = pd.Series(tmp)
for t in range(n_trees):
    feasample = feaList.sample(n=n_fea, replace=False)  # select the feature subset
    fea = feasample.tolist()
    fea.append('Survived')
    subset = training.sample(n=len(training), replace=True)  # bootstrap: sample with replacement
    subset = subset[fea]
    # print(str(t) + ' classifier built on features: ' + str(fea))
    FOREST[t] = tree_grow(subset, 'Survived', min_leaf, min_dec_gini)  # save the tree

# model prediction
# ====================
currentdata = training
output = 'submission_rf_20151116_30_0.001_20'
# ====================
prediction = {}
for r in currentdata.index:  # one row at a time
    prediction_vote = {1: 0, 0: 0}
    row = currentdata[currentdata.index == r]
    for n in range(n_trees):
        tree_dict = FOREST[n]  # a tree
        p = model_prediction(tree_dict, row)
        prediction_vote[p] += 1
    vote = pd.Series(prediction_vote)
    prediction[r] = vote.sort_values(ascending=False).index[0]  # the vote result

result = pd.Series(prediction, name='Survived_p')
# result.to_csv(output)
t = training.join(result, how='left')
accuracy = round(len(t[t['Survived'] == t['Survived_p']]) / len(t), 5)
print(accuracy)
```
The above is the random forest code. As mentioned earlier, a random forest is an ensemble of decision trees. At every split of a decision tree, the Gini coefficient is used to measure the impurity of the current node; the feature and split point whose division of the dataset minimizes the weighted Gini of the resulting subsets (that is, most reduces the impurity of the output variable) are selected as the best split feature and split point. The code is as follows:
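A tiny worked example of this criterion, assuming binary 0/1 labels like the Titanic target: the Gini impurity of a parent node and the impurity decrease for one candidate split (the label values here are made up for illustration):

```python
# Gini impurity of a list of 0/1 labels.
def gini(labels):
    p1 = sum(labels) / len(labels)
    return 1 - (p1 ** 2 + (1 - p1) ** 2)

parent = [0, 0, 1, 1, 1, 1]         # impurity 1 - (1/3)^2 - (2/3)^2 = 4/9
left, right = [0, 0], [1, 1, 1, 1]  # a perfect split: both children are pure

# Weighted child impurity, then the decrease relative to the parent.
weighted = (gini(left) * len(left) / len(parent)
            + gini(right) * len(right) / len(parent))
decrease = gini(parent) - weighted
print(round(gini(parent), 4))  # 0.4444
print(round(decrease, 4))      # 0.4444: the split removes all impurity
```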
```python
# -*- coding: utf-8 -*-
"""
@author: kim
"""
import pandas as pd
import numpy as np
#import sklearn as sk
import math

def tree_grow(dataframe, target, min_leaf, min_dec_gini):
    tree = {}  # start a new (sub)tree
    is_not_leaf = (len(dataframe) > min_leaf)
    if is_not_leaf:
        fea, sp, gd = best_split_col(dataframe, target)
        if gd > min_dec_gini:
            tree['fea'] = fea
            tree['val'] = sp
            l, r = dataSplit(dataframe, fea, sp)  # 1116 modified
            l = l.drop(fea, axis=1)
            r = r.drop(fea, axis=1)
            tree['left'] = tree_grow(l, target, min_leaf, min_dec_gini)
            tree['right'] = tree_grow(r, target, min_leaf, min_dec_gini)
        else:  # return a leaf
            return leaf(dataframe[target])
    else:
        return leaf(dataframe[target])
    return tree

def leaf(class_label):
    # majority class of the node
    tmp = {}
    for i in class_label:
        tmp[i] = tmp.get(i, 0) + 1
    s = pd.Series(tmp).sort_values(ascending=False)
    return s.index[0]

def gini_cal(class_label):
    # Gini impurity for a binary 0/1 label
    p_1 = sum(class_label) / len(class_label)
    p_0 = 1 - p_1
    return 1 - (pow(p_0, 2) + pow(p_1, 2))

def dataSplit(dataframe, split_fea, split_val):
    left_node = dataframe[dataframe[split_fea] <= split_val]
    right_node = dataframe[dataframe[split_fea] > split_val]
    return left_node, right_node

def best_split_col(dataframe, target_name):
    best_fea = ''  # modified 1116
    best_split_point = 0
    col_list = list(dataframe.columns)
    col_list.remove(target_name)
    gini_0 = gini_cal(dataframe[target_name])
    n = len(dataframe)
    gini_dec = -99999999
    for col in col_list:
        node = dataframe[[col, target_name]]
        unique = node.groupby(col).count().index
        for split_point in unique:  # each unique value is a candidate split
            left_node, right_node = dataSplit(node, col, split_point)
            if len(left_node) > 0 and len(right_node) > 0:
                gini_col = (gini_cal(left_node[target_name]) * (len(left_node) / n)
                            + gini_cal(right_node[target_name]) * (len(right_node) / n))
                if (gini_0 - gini_col) > gini_dec:
                    gini_dec = gini_0 - gini_col  # decrease of impurity
                    best_fea = col
                    best_split_point = split_point
    return best_fea, best_split_point, gini_dec

def model_prediction(model, row):  # row is a one-row DataFrame
    fea = model['fea']
    val = model['val']
    if row[fea].tolist()[0] <= val:  # take the matching branch
        branch = model['left']
    else:
        branch = model['right']
    if isinstance(branch, dict):  # an internal node: recurse
        return model_prediction(branch, row)
    return branch  # a leaf: the predicted class
```
In fact, the above code leaves a lot of room for efficiency improvements: even though the dataset is not very large, choosing large input parameters, such as generating 100 trees, takes quite a long time. Submitting the predictions to Kaggle for evaluation shows that the accuracy on the test set is not very high, a little lower than that of the corresponding scikit-learn package (0.77512) :-( To improve accuracy, there are two main directions: constructing new features and tuning the parameters of the existing model.
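For the parameter-tuning direction, one common approach is an exhaustive search over a small grid. This is a sketch using scikit-learn's `GridSearchCV` on synthetic data; the grid values are illustrative, and the scikit-learn parameter names mirror, rather than equal, the post's `min_leaf` and `n_trees`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

param_grid = {
    'n_estimators': [20, 100],    # more trees usually helps, at CPU cost
    'min_samples_leaf': [1, 30],  # counterpart of the post's min_leaf
}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(X, y)  # best_params_ then holds the winning combination
print(sorted(search.best_params_.keys()))  # ['min_samples_leaf', 'n_estimators']
```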
This is just one worked example; you are welcome to suggest improvements to my modelling approach and algorithm implementation.