Python Big Data processing in detail

Source: Internet
Author: User
Tags: square root

Key points of knowledge:
The lubridate package for splitting up times | POSIXlt
Decision tree classification, then prediction with a random forest
Log-transform for fitting, with the exp function to restore the result

The training set comes from the bicycle rental data of the Kaggle Washington bike sharing program; the goal is to analyze the relationship between shared bike usage, the weather, and the time. The dataset has 11 variables and about 10,000 rows.

First look at the official data: there are two files, both covering 2011-2012. The difference is that the test file contains every day of each month but has no registered or casual user counts, while the train file covers only days 1-20 of each month and includes both user types.
Task: fill in the user counts for days 21-30 of each month. The evaluation criterion is how closely the forecast matches the real counts.
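The Kaggle bike sharing competition scores submissions by RMSLE (root mean squared logarithmic error), which is also why the log transform used later in this post matches the metric. A minimal Python sketch of the metric (the function name is mine; the post itself works in R):

```python
import math

def rmsle(pred, actual):
    """Root mean squared logarithmic error over paired forecasts and observations."""
    sq = [(math.log(p + 1) - math.log(a + 1)) ** 2 for p, a in zip(pred, actual)]
    return math.sqrt(sum(sq) / len(sq))

# A perfect forecast scores 0; relative errors on small counts
# weigh about as much as on large ones.
print(rmsle([10, 20, 30], [10, 20, 30]))   # -> 0.0
```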


First, load the packages and the data file:

library(lubridate)
library(randomForest)
library(readr)
setwd("E:")
data <- read_csv("train.csv")
head(data)

Here I hit a pitfall: no matter what I tried, R's default read.csv could not parse the file correctly, and converting to xlsx was even worse, with every time turned into a strange number like 43045. as.Date had always converted dates correctly before, but because these values carry hours and minutes, only a timestamp type would do, and even that did not work.
Finally I installed the readr package, and read_csv parsed the file smoothly.
Because test fills in the dates missing from train but lacks the user counts, merge train and test:

test$registered <- 0
test$casual <- 0
test$count <- 0
data <- rbind(train, test)

Extract the hour: a timestamp type would work, but the times here are simple, falling on the whole hour, so you can also just truncate the string directly.

data$hour1 <- substr(data$datetime, 12, 13)
table(data$hour1)
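The same string-truncation trick, sketched in Python for comparison (Python slices are 0-based and end-exclusive, so R's 1-based substr(x, 12, 13) becomes [11:13]):

```python
# Hour lives at characters 12-13 of "YYYY-MM-DD HH:MM:SS" (1-based),
# i.e. slice [11:13] in Python's 0-based indexing.
stamps = ["2011-01-01 05:00:00", "2011-01-01 17:00:00"]
hours = [s[11:13] for s in stamps]
print(hours)   # -> ['05', '17']
```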

Count the total number of uses per hour (notice how regular the table comes out):


Next, boxplots are used to look at the relationship between users, the hour of day, and the day of week. A boxplot is used rather than a histogram because it shows outliers as discrete points; those outliers are also why a logarithm is used for the fit below.
It is clear that registered and casual users differ greatly in when they use the bikes.


Next, the correlation coefficient (cor) is used to test the relationship between user counts and temperature, apparent temperature, humidity, and wind speed.

Correlation coefficient: a measure of the linear relationship between variables, used to test how strongly different series are related.
Its range is [-1, 1]; the closer to 0, the weaker the relationship.
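The statistic that R's cor() computes can be sketched in plain Python (an illustrative helper of mine, not the post's code):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient: linear association, in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # -> 1.0  (perfectly correlated)
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -> -1.0 (perfectly anti-correlated)
```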

The results show that usage is negatively correlated with wind speed, an effect even larger than that of temperature.


The next step is to bin the hour and other factors with a decision tree, and then predict with a random forest. Random forests and decision trees sound impressive, but they are in fact very widely used now, so they are well worth learning.

The decision tree model is a simple, easy-to-use, non-parametric classifier. It needs no prior assumptions about the data, computes quickly, produces results that are easy to interpret, and is robust, unafraid of noisy or missing data.
The basic steps of the decision tree model are: pick one of the N independent variables, find its best split point, and divide the data into two groups; then repeat the above steps on each group until some stopping condition is met.
Three important questions must be addressed in decision tree modeling:
how to choose the splitting variable;
how to choose the split point;
how to decide when to stop splitting.
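The "find the best split point" step above can be sketched in Python: for one numeric variable, try each candidate threshold and keep the one that minimizes the total squared error within the two resulting groups. This is a toy illustration of the idea only, not rpart's actual algorithm (which uses more elaborate criteria and pruning):

```python
def best_split(x, y):
    """Threshold t on x minimizing summed squared error of y in the
    two groups (left: x < t, right: x >= t)."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best_t, best_err = None, float("inf")
    for t in sorted(set(x))[1:]:        # every distinct value is a candidate cut
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Hourly ride counts: low at night, high in the day -> the split lands at hour 7.
hours = [0, 1, 2, 6, 7, 8, 9, 17, 18]
rides = [5, 3, 2, 4, 90, 95, 80, 85, 88]
print(best_split(hours, rides))   # -> 7
```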

Build a decision tree of registered users against the hour:

library(rpart)
library(rpart.plot)
train$hour1 <- as.integer(train$hour1)
d <- rpart(registered ~ hour1, data = train)
rpart.plot(d)


Then the hours are binned by hand according to the decision tree's output, so it's a screenful of code...

train$hour1 <- as.integer(train$hour1)
data$dp_reg <- 0
data$dp_reg[data$hour1 < 7.5] <- 1
data$dp_reg[data$hour1 >= 22] <- 2
data$dp_reg[data$hour1 >= 9.5 & data$hour1 < 18] <- 3
data$dp_reg[data$hour1 >= 7.5 & data$hour1 < 18] <- 4
data$dp_reg[data$hour1 >= 8.5 & data$hour1 < 18] <- 5
data$dp_reg[data$hour1 >= 20 & data$hour1 < 22] <- 6
data$dp_reg[data$hour1 >= 18 & data$hour1 < 20] <- 7

In the same vein, build decision trees for (hour | temperature) x (registered | casual users), and keep classifying by hand...


Manually classify the year and month, weekends and holidays, and so on:

data$year_part <- 0
data$month <- month(data$datetime)
data$year_part[data$year == '2011'] <- 1
data$year_part[data$year == '2011' & data$month > 3] <- 2
data$year_part[data$year == '2011' & data$month > 6] <- 3
data$year_part[data$year == '2011' & data$month > 9] <- 4
data$day_type <- ""
data$day_type[data$holiday == 0 & data$workingday == 0] <- "weekend"
data$day_type[data$holiday == 1] <- "holiday"
data$day_type[data$holiday == 0 & data$workingday == 1] <- "working day"
data$weekend <- 0
data$weekend[data$day == "Sunday" | data$day == "Saturday"] <- 1

Next, predict with the random forest:

In machine learning, a random forest is a classifier made up of many decision trees; its output class is the one chosen by the most individual trees.
Each split of a tree in a random forest considers not all of the features but a randomly drawn subset of them, from which the best feature is chosen. This makes the trees differ from one another, increasing the diversity of the ensemble and improving classification performance.
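The two mechanisms just described, majority voting across trees and random feature subsets at each split, can be sketched in Python (toy illustration of those steps only, not R's randomForest):

```python
from collections import Counter
import random

def forest_predict(tree_votes):
    """A random forest's classification output: the class most trees voted for."""
    return Counter(tree_votes).most_common(1)[0][0]

def split_candidates(features, mtry, rng):
    """At each split, only a random subset of mtry features is considered."""
    return rng.sample(features, mtry)

votes = ["high", "high", "low", "high", "low"]    # class votes from 5 trees
print(forest_predict(votes))                       # -> high

rng = random.Random(42)
subset = split_candidates(["hour", "temp", "humidity", "windspeed"], 2, rng)
print(len(subset))                                 # -> 2 features tried at this split
```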

ntree specifies the number of decision trees in the forest. It defaults to 500 and should usually be as large as performance permits.
mtry specifies the number of variables tried at each node's binary split. By default it is the square root of the number of variables (classification models) or one third of them (regression models); the best value generally has to be found by trying candidates one after another. (Excerpted from Datacruiser's notes.) Since I am mainly here to learn, even though the dataset has more than 10,000 rows I still only set 500 trees, and even 500 made my little computer run for quite a while.

train <- data
set.seed(1234)
train$logreg <- log(train$registered + 1)
train$logcas <- log(train$casual + 1)
fit1 <- randomForest(logreg ~ hour1 + workingday + day + holiday + day_type +
                       temp_reg + humidity + atemp + windspeed + season +
                       weather + dp_reg + weekend + year + year_part,
                     train, importance = TRUE, ntree = 250)
pred1 <- predict(fit1, train)
train$logreg <- pred1

I don't know why, but here adding day and day_part raised errors; I could only delete those two variables and compute without them. Something to investigate and patch later.
Then restore with the exp function:

train$registered <- exp(train$logreg) - 1
train$casual <- exp(train$logcas) - 1
train$count <- train$casual + train$registered
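The restore step simply inverts the earlier log(y + 1) transform with exp(pred) - 1. In Python the numerically safer equivalents are log1p/expm1; a round-trip sketch (comparison only, not the post's R code):

```python
import math

counts = [0, 3, 12, 145]                       # e.g. hourly casual-user counts
logged = [math.log1p(c) for c in counts]       # the log(y + 1) transform used for fitting
restored = [math.expm1(v) for v in logged]     # exp(pred) - 1 undoes it exactly
print([round(r) for r in restored])            # -> [0, 3, 12, 145]
```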

Finally, keep only the dates from the 20th onward, and write a new CSV file to upload.

train2 <- train[as.integer(day(data$datetime)) >= 20, ]
submit_final <- data.frame(datetime = train2$datetime, count = train2$count)
write.csv(submit_final, "submit_final.csv", row.names = F)


The original example is the second section of a Kaggle course; this post basically follows the video's approach. Since the course provides no source code, everything had to be reconstructed and debugged until it ran, and finishing the homework took two or three days on and off. Things still to improve:

Understand the three knowledge points more thoroughly (the lubridate package/POSIXlt, log-linear fitting, decision trees and random forests);
Analyze correlation with WoE and IV instead of the cor function;
Try other kinds of graphical analysis;
Re-test the random forest variables.

If you run into problems while learning, or want to get hold of learning resources, you are welcome to join the learning exchange group 626062078, and we will learn Python together!

Having finished a "grand and complete" data analysis, I do feel a sense of accomplishment!
