[Machine Learning Python Practice (5)] Ensemble Learning with Sklearn


1. Ensemble learning

An ensemble classification model makes its decision by combining the predictions of multiple classifiers. There are generally two approaches:
1) Build multiple independent classification models on the same training data at the same time, then make the final classification decision by majority vote. This is the idea behind the random forest classifier: build many decision trees simultaneously on the same training data, where each tree selects its features at random rather than ranking every feature dimension by its effect on the prediction. The random feature selection keeps the individual trees diverse. (A minimal sketch of both approaches follows the next item.)

2) Build multiple classification models in a fixed order, with dependencies between them. In general, each newly added model should contribute to the overall performance of the existing ensemble, steadily improving the updated ensemble's capability; the goal is to build a model with strong classification ability by combining classifiers with weaker classification ability. An example is the gradient boosting decision tree, which generates each decision tree in the process of minimizing the ensemble's fitting error on the training set.
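A minimal sketch of the two approaches, assuming sklearn's built-in VotingClassifier for the majority-vote strategy and GradientBoostingClassifier for the sequential strategy; the make_classification toy data and the choice of base estimators are only for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Approach 1: independent models combined by majority vote.
voter = VotingClassifier(estimators=[
    ('tree', DecisionTreeClassifier()),
    ('lr', LogisticRegression(max_iter=1000)),
    ('nb', GaussianNB()),
], voting='hard')
voter.fit(X_train, y_train)
print('voting accuracy:', voter.score(X_test, y_test))

# Approach 2: models built sequentially, each new tree fit to
# reduce the remaining error of the ensemble built so far.
booster = GradientBoostingClassifier()
booster.fit(X_train, y_train)
print('boosting accuracy:', booster.score(X_test, y_test))

With voting='hard', the VotingClassifier simply counts the class labels predicted by the base estimators; boosting instead fits each new tree against the error left by the trees already built.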

2. Example

Data set: the Titanic data from the previous article.

Code:

# coding=utf-8
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# 1. Data acquisition
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
X = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']

# 2. Data preprocessing: train/test split, feature extraction.
# Only 633 records have an age; fill the missing values with the mean,
# the strategy that introduces the least bias into the model.
X['age'].fillna(X['age'].mean(), inplace=True)

# Split the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

vec = DictVectorizer(sparse=False)
# Extract features from the training data.
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
# Extract features from the test data.
X_test = vec.transform(X_test.to_dict(orient='records'))

# 3. Ensemble model training.
# Train and predict with a single decision tree.
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_pred = dtc.predict(X_test)

# Train and predict with a random forest classifier.
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)

# Train and predict with a gradient boosting decision tree.
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)

# 4. Result reports: accuracy on the test set plus the detailed
# precision, recall and F1 for each model. Note the original code
# passes the predictions as the first argument; sklearn's documented
# order is classification_report(y_true, y_pred).
print('The accuracy of decision tree is', dtc.score(X_test, y_test))
print(classification_report(dtc_y_pred, y_test))

print('The accuracy of random forest classifier is', rfc.score(X_test, y_test))
print(classification_report(rfc_y_pred, y_test))

print('The accuracy of gradient tree boosting is', gbc.score(X_test, y_test))
print(classification_report(gbc_y_pred, y_test))
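To make the DictVectorizer step above concrete, here is a standalone sketch with toy records (the values are made up; get_feature_names_out assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction import DictVectorizer

# Toy records mimicking the Titanic features: the categorical 'pclass'
# and 'sex' become one-hot columns, numeric 'age' passes through as-is.
records = [
    {'pclass': '1st', 'age': 29.0, 'sex': 'female'},
    {'pclass': '3rd', 'age': 2.0, 'sex': 'male'},
]
vec = DictVectorizer(sparse=False)
print(vec.fit_transform(records))
print(vec.get_feature_names_out())  # requires sklearn >= 1.0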

Results: the performance of the single decision tree, the random forest, and the gradient boosting decision tree is as follows:

The accuracy of decision tree is 0.781155015198
             precision    recall  f1-score   support

          0       0.91      0.78      0.84       236
          1       0.58      0.80      0.67        93

avg / total       0.81      0.78      0.79       329

The accuracy of random forest classifier is 0.784194528875
             precision    recall  f1-score   support

          0       0.92      0.77      0.84       239
          1       0.57      0.81      0.67        90

avg / total       0.82      0.78      0.79       329

The accuracy of gradient tree boosting is 0.790273556231
             precision    recall  f1-score   support

          0       0.92      0.78      0.84       239
          1       0.58      0.82      0.68        90

avg / total       0.83      0.79      0.80       329

Conclusion:
Predictive performance: the gradient boosting decision tree beats the random forest classifier, which in turn beats the single decision tree. Industry practice often uses the random forest classification model as the baseline system.
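A hedged sketch of what "random forest as the baseline" might look like in practice, using cross_val_score on synthetic data (the data and model settings are illustrative only):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic stand-in data; in practice this would be the real task.
X, y = make_classification(n_samples=1000, random_state=0)

# The cross-validated accuracy of the random forest serves as the
# baseline that any candidate model has to beat.
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
candidate = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5).mean()
print('baseline (random forest):      %.3f' % baseline)
print('candidate (gradient boosting): %.3f' % candidate)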
