Python Decision Trees in Practice: California House Price Forecast

Source: Internet
Author: User
Tags: jupyter notebook


Environment: Anaconda, Jupyter Notebook

First, import the required modules:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Next, import the dataset:

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
print(housing.DESCR)  # dataset description

This example uses sklearn's built-in california_housing dataset; see: Python--sklearn Available data sets

Output:

California Housing dataset.

The original database is available from StatLib: http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables. This dataset contains the average house value as target variable and the following input variables (features): average income, housing average age, average rooms, average bedrooms, population, average occupation, latitude, and longitude in that order.

References
----------
Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297.

Take a look at the data:

housing.data.shape
(20640, 8)

housing.data[0]
array([   8.3252    ,   41.        ,    6.98412698,    1.02380952,
        322.        ,    2.55555556,   37.88      , -122.23      ])
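For readability, the feature matrix can be wrapped in a pandas DataFrame. This is a small sketch of my own, not part of the original post; the column name MedHouseVal for the target is my label:

import pandas as pd
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

# Label the 8 feature columns and attach the regression target for inspection
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target  # median house value, the target variable
print(df.head())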

Tree model parameters (a short illustrative sketch follows this list):

  • 1. criterion: the impurity measure, gini or entropy (regression trees use mean squared error).

  • 2. splitter: best or random. The former searches all features for the optimal split point; the latter searches a random subset of the features (useful when the data volume is large).

  • 3. max_features: None (all features), log2, sqrt, or an integer N. With fewer than 50 features, all of them are generally used.

  • 4. max_depth: can be left unset when the data volume or feature count is small; if the model has many samples and many features, try limiting the depth.

  • 5. min_samples_split: if a node holds fewer samples than min_samples_split, no further attempt is made to select an optimal feature to split on. Leave the default for small datasets; for very large datasets, increasing this value is recommended.

  • 6. min_samples_leaf: the minimum number of samples in a leaf node; a leaf with fewer samples is pruned together with its siblings. Ignore this value for small datasets; with more than 100,000 samples, try a value around 5.

  • 7. min_weight_fraction_leaf: the minimum fraction of the total sample weight a leaf node must hold; below this value the leaf is pruned together with its siblings. The default is 0, i.e. weights are ignored. In general, if many samples have missing values, or the class distribution of a classification tree is strongly skewed, sample weights are introduced and this value should be set with care.

  • 8. max_leaf_nodes: limiting the maximum number of leaf nodes can prevent overfitting. The default is None, i.e. no limit. If a limit is set, the algorithm builds the optimal decision tree within that number of leaves. With few features this value can be ignored; with many features it can be restricted, and a concrete value can be found by cross-validation.

  • 9. class_weight: weights for the sample categories, mainly to prevent categories with many training samples from biasing the tree towards them. The weight of each category can be specified explicitly, or with "balanced" the algorithm computes the weights itself, assigning higher weights to categories with few samples.

  • 10. min_impurity_split: limits the growth of the tree; if a node's impurity (Gini index, information gain, mean squared error, mean absolute error) falls below this threshold, the node generates no more children and becomes a leaf.

  • n_estimators: the number of trees to build (a random forest parameter).
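As a quick illustration (my own sketch, not from the original walkthrough; the values are arbitrary, not tuned), several of these parameters can be passed when the regressor is constructed:

from sklearn import tree

dtr = tree.DecisionTreeRegressor(
    criterion='mse',        # impurity measure; spelled 'squared_error' in newer scikit-learn
    splitter='best',        # search all features for the best split point
    max_depth=6,            # cap the tree depth to limit overfitting
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    max_leaf_nodes=None,    # no cap on the number of leaves
    random_state=42,
)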

The next step is to instantiate the model, pass the parameters, and train:

from sklearn import tree
dtr = tree.DecisionTreeRegressor(max_depth=2)
# train on two feature columns, passing two arguments: x and y
dtr.fit(housing.data[:, [6, 7]], housing.target)

Output:

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

To visualize the decision tree, first install Graphviz. export_graphviz supports a variety of aesthetic options, including coloring nodes by their class (or by value for regression) and, if desired, explicit variable and class names. IPython notebooks can also render these plots inline using the Image() function:

# To visualize the tree you first need to install Graphviz: http://www.graphviz.org/Download.php
dot_data = tree.export_graphviz(
    dtr,
    out_file=None,
    feature_names=housing.feature_names[6:8],
    filled=True,
    impurity=False,
    rounded=True
)

For Graphviz and pydotplus installation steps, see: Python--graphviz and Pydotplus installation steps

Once the Python module pydotplus is installed, you can generate PNG files (or any other supported file type) directly in Python:

# pip install pydotplus
import pydotplus
graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")
graph.write_png("graph.png")
from IPython.display import Image
Image(graph.create_png())

Divide the dataset into a training set and a test set, then train and validate:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
x_train, x_test, y_train, y_test = \
    train_test_split(housing.data, housing.target, test_size=0.1, random_state=42)
dtr = tree.DecisionTreeRegressor(random_state=42)
dtr.fit(x_train, y_train)

dtr.score(x_test, y_test)

Results:

0.637318351331017

Use a random forest:

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=42)
rfr.fit(x_train, y_train)
rfr.score(x_test, y_test)

Results:

0.79086492280964926

Select parameters with cross-validation:

from sklearn.grid_search import GridSearchCV

# The candidate parameters are usually written as a dictionary:
tree_param_grid = {'min_samples_split': list((3, 6, 9)), 'n_estimators': list((10, 50, 100))}

# The first argument is the model, the second the parameters to search; cv: number of cross-validation folds
grid = GridSearchCV(RandomForestRegressor(), param_grid = tree_param_grid, cv = 5)
grid.fit(x_train, y_train)
grid.grid_scores_, grid.best_params_, grid.best_score_

The result is:

([mean: 0.78795, std: 0.00337, params: {'min_samples_split': 3, 'n_estimators': 10},
  mean: 0.80463, std: 0.00308, params: {'min_samples_split': 3, 'n_estimators': 50},
  mean: 0.80732, std: 0.00448, params: {'min_samples_split': 3, 'n_estimators': 100},
  mean: 0.78535, std: 0.00506, params: {'min_samples_split': 6, 'n_estimators': 10},
  mean: 0.80446, std: 0.00399, params: {'min_samples_split': 6, 'n_estimators': 50},
  mean: 0.80688, std: 0.00424, params: {'min_samples_split': 6, 'n_estimators': 100},
  mean: 0.78754, std: 0.00552, params: {'min_samples_split': 9, 'n_estimators': 10},
  mean: 0.80321, std: 0.00487, params: {'min_samples_split': 9, 'n_estimators': 50},
  mean: 0.80553, std: 0.00389, params: {'min_samples_split': 9, 'n_estimators': 100}],
 {'min_samples_split': 3, 'n_estimators': 100},
 0.8073224957136084)
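Note that sklearn.grid_search and the grid_scores_ attribute belong to older scikit-learn releases; the module was removed in version 0.20. A minimal equivalent under the current API (my sketch, assuming scikit-learn >= 0.20 and the x_train/y_train split from above) would be:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid=tree_param_grid, cv=5)
grid.fit(x_train, y_train)

# cv_results_ replaces the old grid_scores_ attribute
print(grid.cv_results_['mean_test_score'])
print(grid.best_params_, grid.best_score_)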

Retrain the random forest with the parameters found by the search:

rfr = RandomForestRegressor(min_samples_split=3, n_estimators=100, random_state=42)
rfr.fit(x_train, y_train)
rfr.score(x_test, y_test)

The result is:

0.80908290496531576

pd.Series(rfr.feature_importances_, index = housing.feature_names).sort_values(ascending = False)

The result is:

MedInc        0.524257
AveOccup      0.137947
Latitude      0.090622
Longitude     0.089414
HouseAge      0.053970
AveRooms      0.044443
Population    0.030263
AveBedrms     0.029084
dtype: float64
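Since matplotlib was imported at the top, the same importances can also be plotted for a quick visual check (a small sketch of my own, not part of the original post; it reuses the rfr and housing objects from above):

import matplotlib.pyplot as plt
import pandas as pd

# Sort ascending so the most important feature ends up at the top of the chart
importances = pd.Series(rfr.feature_importances_, index=housing.feature_names).sort_values()
importances.plot.barh()
plt.xlabel('feature importance')
plt.title('Random forest feature importances')
plt.show()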

