Python Big Data processing case

Source: Internet
Author: User

Key points of knowledge:
The lubridate package for taking date-times apart | POSIXlt
Decision-tree classification, then prediction with a random forest
Fitting on a log scale, restoring with the exp function

The training set comes from the Kaggle Washington, D.C. bike-sharing competition, which analyzes the relationship between bike-share usage, weather, and time. The dataset has 11 variables and about 10,000 rows.
https://www.kaggle.com/c/bike-sharing-demand

First, look at the official data: there are two tables, both covering 2011-2012. The difference is that the test file contains every date of each month but lacks the registered and casual user counts, while the train file covers only days 1-20 of each month and includes both user types.
Task: fill in the user counts for days 21 to month-end. The evaluation criterion is how close the forecast comes to the true counts.
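Just to make the split concrete, here is a minimal sketch (with made-up datetimes) of how day-of-month separates the two files, using lubridate:

```r
# Hypothetical example: train.csv holds days 1-20 of each month,
# test.csv holds the remaining days.
library(lubridate)

datetimes <- as.POSIXct(c("2011-01-05 08:00:00", "2011-01-25 08:00:00"),
                        tz = "UTC")
in_train <- mday(datetimes) <= 20   # mday() = day of the month
in_train
# [1]  TRUE FALSE
```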



Load files and packages first

library(lubridate)
library(randomForest)
library(readr)
setwd("E:/")
train <- read_csv("train.csv")
test <- read_csv("test.csv")   # test is needed for the merge below
head(train)

Here I hit a snag: R's default read.csv simply would not read the file in the correct format, and converting to xlsx was even worse, with every time turning into a strange number like 43045. As before, plain dates converted correctly, but because these datetimes also carry hours and minutes, only timestamps would do, and even that did not work.
In the end I installed the readr package, and read_csv parsed everything smoothly.
Because test has the dates that train lacks but is missing the user counts, the next step is to merge train and test.

test$registered=0
test$casual=0
test$count=0
data <- rbind(train, test)

Extract the hour: a timestamp would work, but the times here are simple (on the hour), so you can also just truncate the string.

data$hour1 <- substr(data$datetime, 12, 13)
table(data$hour1)
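An alternative sketch using lubridate's parser instead of fixed string positions, which also shows the POSIXlt field mentioned in the key points (the datetime here is made up):

```r
library(lubridate)

dt <- ymd_hms("2011-01-01 13:00:00")   # parse the datetime string
hour(dt)                               # 13
# base R's POSIXlt exposes the same field once the time is taken apart:
as.POSIXlt(dt)$hour                    # 13
```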

Count the total number of rides per hour (the table output is surprisingly tidy):


[Figure: 6-hour1.png]

Next come boxplots of usage against hour of day and day of week. Why boxplots rather than histograms? Because boxplots show outlying points explicitly, and below we will also fit on a log scale.
It is clear that, over the course of the day, registered and casual users have very different usage patterns.


[Figures: 5-hour-regestered.png, 5-hour-casual.png, 4-boxplot-day.png]

Next, the correlation coefficient (cor) is used to test the relationships among the user counts, temperature, apparent temperature, humidity, and wind speed.

Correlation coefficient: a measure of the linear correlation between variables, used to test how strongly different variables are related.
Its range is [-1, 1]; the closer to 0, the weaker the relationship.
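A minimal sketch of what cor() reports, on toy numbers rather than the Kaggle columns:

```r
# Toy data: rides rise with temperature and fall with wind speed.
temp  <- c(10, 15, 20, 25, 30)
rides <- c(100, 150, 210, 240, 300)
wind  <- c(30, 25, 18, 12, 5)

cor(temp, rides)   # close to +1: strong positive linear relationship
cor(wind, rides)   # close to -1: strong negative linear relationship
```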

The results show that usage is negatively correlated with wind speed, an effect even larger than that of temperature.


[Figure: cor.png]

The next step is to bin the hour and the other factors with a decision tree, and then predict with a random forest. Random forest and decision tree algorithms sound impressive, but they are very commonly used nowadays, so they are well worth learning.

The decision tree model is a simple, easy-to-use non-parametric classifier. It requires no prior assumptions about the data, is quick to compute, produces results that are easy to interpret, and is robust, unafraid of noisy or missing data.
The basic steps of the decision tree model are: pick one of the N independent variables, find its best split point, and divide the data into two groups; then repeat these steps on each group until a stopping condition is met.
Three important issues must be addressed when modeling with decision trees:
How to choose the splitting variable
How to choose the split point
How to decide when to stop splitting
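The "find the best split point" step can be sketched by brute force for a single numeric variable: try each candidate threshold and keep the one that most reduces the within-group variance (the kind of criterion rpart uses for regression trees). All names and numbers here are illustrative:

```r
# Toy sketch of best-split-point search for one numeric predictor.
best_split <- function(x, y) {
  ux   <- sort(unique(x))
  cuts <- head(ux, -1) + diff(ux) / 2          # midpoints between values
  sse  <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    # summed squared error of the two groups around their own means
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]
}

hour  <- c(1, 2, 3, 8, 9, 10)
rides <- c(5, 6, 5, 80, 90, 85)
best_split(hour, rides)   # 5.5: separates the quiet hours from the busy ones
```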

Build a decision tree of registered users against hour:

library(rpart)
library(rpart.plot)   # both needed for the tree and its plot
train$hour1 <- as.integer(train$hour1)
d <- rpart(registered ~ hour1, data = train)
rpart.plot(d)

[Figure: 3-raprt-hour1.png]


Then the hours are binned by hand according to the decision tree's split points, so this part is a wall of code...

train$hour1 <- as.integer(train$hour1)
data$dp_reg <- 0
data$dp_reg[data$hour1 < 7.5] <- 1
data$dp_reg[data$hour1 >= 22] <- 2
data$dp_reg[data$hour1 >= 9.5 & data$hour1 < 18] <- 3
data$dp_reg[data$hour1 >= 7.5 & data$hour1 < 8.5] <- 4
data$dp_reg[data$hour1 >= 8.5 & data$hour1 < 9.5] <- 5
data$dp_reg[data$hour1 >= 20 & data$hour1 < 22] <- 6
data$dp_reg[data$hour1 >= 18 & data$hour1 < 20] <- 7

In the same vein, build the (hour | temperature) x (registered | casual) decision trees and continue binning by hand...


[Figure: 3-raprt-temp.png]

Manually bin the year by quarter and classify weekends, holidays, and so on.

data$year <- year(data$datetime)     # extract the year, needed below
data$month <- month(data$datetime)
data$year_part <- 0
data$year_part[data$year == '2011'] <- 1
data$year_part[data$year == '2011' & data$month > 3] <- 2
data$year_part[data$year == '2011' & data$month > 6] <- 3
data$year_part[data$year == '2011' & data$month > 9] <- 4
data$day <- weekdays(as.Date(data$datetime))   # day-of-week name, needed below
data$day_type <- ""
data$day_type[data$holiday == 0 & data$workingday == 0] <- "weekend"
data$day_type[data$holiday == 1] <- "holiday"
data$day_type[data$holiday == 0 & data$workingday == 1] <- "working day"
data$weekend <- 0
data$weekend[data$day == "Sunday" | data$day == "Saturday"] <- 1

Next, predict with the random forest.

In machine learning, a random forest is a classifier made up of many decision trees; its output class is the one chosen by the largest number of individual trees.
At each split, a tree in a random forest does not consider all of the features: it draws a random subset of them and picks the best feature from that subset. This makes the trees differ from one another, increasing the diversity of the ensemble and improving classification performance.

ntree specifies the number of decision trees in the forest. It defaults to 500 and is usually set as large as performance permits.
mtry specifies the number of variables tried at each node's binary split; by default it is the square root of the number of variables (classification models) or one third of them (prediction models). The best value usually has to be found by trying values in succession (excerpted from Datacruiser's notes). Since this is mainly for study, although the dataset has more than 10,000 rows, I set only 500, and even 500 made my little computer run for quite a while.
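One hedged way to do that successive trial is randomForest::tuneRF(), which grows small forests at increasing mtry values and keeps the one with the lowest out-of-bag error. The toy data below stands in for the real columns:

```r
library(randomForest)
set.seed(1234)

# Toy regression data so the sketch is self-contained:
x <- data.frame(a = runif(200), b = runif(200), c = runif(200))
y <- 3 * x$a + rnorm(200, sd = 0.1)

tuned <- tuneRF(x, y, ntreeTry = 50, stepFactor = 2, improve = 0.01,
                trace = FALSE, plot = FALSE)
tuned   # a matrix of mtry values and their out-of-bag errors
```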

train <- data
set.seed(1234)
train$logreg <- log(train$registered + 1)
train$logcas <- log(train$casual + 1)
fit1 <- randomForest(logreg ~ hour1 + workingday + day + holiday + day_type +
                       temp_reg + humidity + atemp + windspeed + season +
                       weather + dp_reg + weekend + year + year_part,
                     train, importance = TRUE, ntree = 250)
pred1 <- predict(fit1, train)
train$logreg <- pred1

Here, for reasons I have not worked out, adding my day and day_part variables raised an error, so I computed with those two variables removed; something to patch later.
Then use the exp function to restore:

train$registered <- exp(train$logreg) - 1
train$casual <- exp(train$logcas) - 1
train$count <- train$casual + train$registered

Finally, keep only the dates from the 20th onward, and write a new CSV file to upload.

train2 <- train[as.integer(day(train$datetime)) >= 20, ]
submit_final <- data.frame(datetime = train2$datetime, count = train2$count)
write.csv(submit_final, "submit_final.csv", row.names = F)

Done!

The original example is the second section of the Kaggle course on the Into Gold site, and this write-up basically follows the video's approach. Because the course has no source code, everything had to be reconstructed and debugged until it ran; finishing the homework took two or three days. Remaining to-dos:

Thoroughly understand the three knowledge points (the lubridate package / POSIXlt, log-linear fitting, decision trees and random forests);
Analyze the correlations with WoE and IV instead of the cor function;
Explore other graphical representations;
Re-test the random forest variables.


Completing a "vast and complete" data analysis from start to finish really does give a sense of accomplishment!
