Python Decision Trees in Practice: California House Price Forecast

Source: Internet
Author: User
Tags: jupyter notebook


Environment: Anaconda, Jupyter Notebook

First, import the required modules:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Next, import the dataset:

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
print(housing.DESCR)  # dataset description

This example uses sklearn's built-in california_housing dataset; see: Python--sklearn Available data sets

Output:

California Housing dataset.

The original database is available from StatLib: http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables. This dataset contains the average house value as target variable and the following input variables (features): average income, housing average age, average rooms, average bedrooms, population, average occupation, latitude, and longitude in that order.

References
----------
Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297.

Take a look at the data:

housing.data.shape
(20640, 8)

housing.data[0]
array([   8.3252    ,   41.        ,    6.98412698,    1.02380952,
        322.        ,    2.55555556,   37.88      , -122.23      ])
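For readability, the feature matrix can be wrapped in a pandas DataFrame. This is a small sketch of my own, not part of the original post; the column name MedHouseVal for the target is my label:

import pandas as pd
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

# Label the 8 feature columns and attach the regression target for inspection
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target  # median house value, the target variable
print(df.head())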

Tree model parameters (a short illustrative sketch follows this list):

  • 1. criterion: the impurity measure, gini or entropy (regression trees use mean squared error).

  • 2. splitter: best or random. The former searches all features for the optimal split point; the latter searches a random subset of the features (useful when the data volume is large).

  • 3. max_features: None (all features), log2, sqrt, or an integer N. With fewer than 50 features, all of them are generally used.

  • 4. max_depth: can be left unset when the data volume or feature count is small; if the model has many samples and many features, try limiting the depth.

  • 5. min_samples_split: if a node holds fewer samples than min_samples_split, no further attempt is made to select an optimal feature to split on. Leave the default for small datasets; for very large datasets, increasing this value is recommended.

  • 6. min_samples_leaf: the minimum number of samples in a leaf node; a leaf with fewer samples is pruned together with its siblings. Ignore this value for small datasets; with more than 100,000 samples, try a value around 5.

  • 7. min_weight_fraction_leaf: the minimum fraction of the total sample weight a leaf node must hold; below this value the leaf is pruned together with its siblings. The default is 0, i.e. weights are ignored. In general, if many samples have missing values, or the class distribution of a classification tree is strongly skewed, sample weights are introduced and this value should be set with care.

  • 8. max_leaf_nodes: limiting the maximum number of leaf nodes can prevent overfitting. The default is None, i.e. no limit. If a limit is set, the algorithm builds the optimal decision tree within that number of leaves. With few features this value can be ignored; with many features it can be restricted, and a concrete value can be found by cross-validation.

  • 9. class_weight: weights for the sample categories, mainly to prevent categories with many training samples from biasing the tree towards them. The weight of each category can be specified explicitly, or with "balanced" the algorithm computes the weights itself, assigning higher weights to categories with few samples.

  • 10. min_impurity_split: limits the growth of the tree; if a node's impurity (Gini index, information gain, mean squared error, mean absolute error) falls below this threshold, the node generates no more children and becomes a leaf.

  • n_estimators: the number of trees to build (a random forest parameter).
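As a quick illustration (my own sketch, not from the original walkthrough; the values are arbitrary, not tuned), several of these parameters can be passed when the regressor is constructed:

from sklearn import tree

dtr = tree.DecisionTreeRegressor(
    criterion='mse',        # impurity measure; spelled 'squared_error' in newer scikit-learn
    splitter='best',        # search all features for the best split point
    max_depth=6,            # cap the tree depth to limit overfitting
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    max_leaf_nodes=None,    # no cap on the number of leaves
    random_state=42,
)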

The next step is to instantiate the model, pass the parameters, and train:

from sklearn import tree
dtr = tree.DecisionTreeRegressor(max_depth=2)
# train on two feature columns, passing two arguments: x and y
dtr.fit(housing.data[:, [6, 7]], housing.target)

Output:

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

To visualize the decision tree, first install Graphviz. export_graphviz supports a variety of aesthetic options, including coloring nodes by their class (or by value for regression) and, if desired, explicit variable and class names. IPython notebooks can also render these plots inline using the Image() function:

# To visualize the tree you first need to install Graphviz: http://www.graphviz.org/Download.php
dot_data = tree.export_graphviz(
    dtr,
    out_file=None,
    feature_names=housing.feature_names[6:8],
    filled=True,
    impurity=False,
    rounded=True
)

For Graphviz and pydotplus installation steps, see: Python--graphviz and Pydotplus installation steps

Once the Python module pydotplus is installed, you can generate PNG files (or any other supported file type) directly in Python:

# pip install pydotplus
import pydotplus
graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")
graph.write_png("graph.png")
from IPython.display import Image
Image(graph.create_png())

Divide the dataset into a training set and a test set, then train and validate:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
x_train, x_test, y_train, y_test = \
    train_test_split(housing.data, housing.target, test_size=0.1, random_state=42)
dtr = tree.DecisionTreeRegressor(random_state=42)
dtr.fit(x_train, y_train)

dtr.score(x_test, y_test)

Results:

0.637318351331017

Use a random forest:

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=42)
rfr.fit(x_train, y_train)
rfr.score(x_test, y_test)

Results:

0.79086492280964926

Select parameters with cross-validation:

from sklearn.grid_search import GridSearchCV

# The candidate parameters are usually written as a dictionary:
tree_param_grid = {'min_samples_split': list((3, 6, 9)), 'n_estimators': list((10, 50, 100))}

# The first argument is the model, the second the parameters to search; cv: number of cross-validation folds
grid = GridSearchCV(RandomForestRegressor(), param_grid = tree_param_grid, cv = 5)
grid.fit(x_train, y_train)
grid.grid_scores_, grid.best_params_, grid.best_score_

The result is:

([mean: 0.78795, std: 0.00337, params: {'min_samples_split': 3, 'n_estimators': 10},
  mean: 0.80463, std: 0.00308, params: {'min_samples_split': 3, 'n_estimators': 50},
  mean: 0.80732, std: 0.00448, params: {'min_samples_split': 3, 'n_estimators': 100},
  mean: 0.78535, std: 0.00506, params: {'min_samples_split': 6, 'n_estimators': 10},
  mean: 0.80446, std: 0.00399, params: {'min_samples_split': 6, 'n_estimators': 50},
  mean: 0.80688, std: 0.00424, params: {'min_samples_split': 6, 'n_estimators': 100},
  mean: 0.78754, std: 0.00552, params: {'min_samples_split': 9, 'n_estimators': 10},
  mean: 0.80321, std: 0.00487, params: {'min_samples_split': 9, 'n_estimators': 50},
  mean: 0.80553, std: 0.00389, params: {'min_samples_split': 9, 'n_estimators': 100}],
 {'min_samples_split': 3, 'n_estimators': 100},
 0.8073224957136084)
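Note that sklearn.grid_search and the grid_scores_ attribute belong to older scikit-learn releases; the module was removed in version 0.20. A minimal equivalent under the current API (my sketch, assuming scikit-learn >= 0.20 and the x_train/y_train split from above) would be:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid=tree_param_grid, cv=5)
grid.fit(x_train, y_train)

# cv_results_ replaces the old grid_scores_ attribute
print(grid.cv_results_['mean_test_score'])
print(grid.best_params_, grid.best_score_)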

Retrain the random forest with the parameters found by the search:

rfr = RandomForestRegressor(min_samples_split=3, n_estimators=100, random_state=42)
rfr.fit(x_train, y_train)
rfr.score(x_test, y_test)

The result is:

0.80908290496531576

pd.Series(rfr.feature_importances_, index = housing.feature_names).sort_values(ascending = False)

The result is:

MedInc        0.524257
AveOccup      0.137947
Latitude      0.090622
Longitude     0.089414
HouseAge      0.053970
AveRooms      0.044443
Population    0.030263
AveBedrms     0.029084
dtype: float64
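Since matplotlib was imported at the top, the same importances can also be plotted for a quick visual check (a small sketch of my own, not part of the original post; it reuses the rfr and housing objects from above):

import matplotlib.pyplot as plt
import pandas as pd

# Sort ascending so the most important feature ends up at the top of the chart
importances = pd.Series(rfr.feature_importances_, index=housing.feature_names).sort_values()
importances.plot.barh()
plt.xlabel('feature importance')
plt.title('Random forest feature importances')
plt.show()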

