"Furnace-smelting AI" machine learning 019-Project case: Estimating traffic flow using the SVM regression

Source: Internet
Author: User
Tags: svm

"Furnace-smelting AI" machine learning 019-Project case: Estimating traffic flow using the SVM regression

(Python libraries and version numbers used in this article: Python 3.5, Numpy 1.14, Scikit-learn 0.19, matplotlib 2.2)

As we all know, SVM is a good classifier that handles not only linearly separable problems but also non-linear ones. Beyond that, SVM can be used to solve not only classification problems but also regression problems.

This project uses SVM regression to estimate traffic flow. The methods and procedures are very similar to those in my previous article, "Furnace-smelting AI" machine learning 018-Project case: Predicting whether an event is taking place from building entry/exit counts, and the data-processing approach is much the same.


1. Prepare the data set

The dataset used in this project comes from the UCI Machine Learning Repository; coincidentally, it is hosted on the same web page as the dataset from the previous article (the building entry/exit counts).

1.1 Understanding the dataset

This dataset records the traffic flow on the road near the stadium of the Los Angeles Dodgers baseball team around their home games. It is stored in two files, Dodgers.data and Dodgers.events; the main contents of the two files are described below.

Dodgers.data contains 50,400 samples. Each line records a timestamp (date and time, at 5-minute intervals) and the number of cars that passed during that interval, with -1 marking a missing count.

The Dodgers.events file has 81 rows. Each row records one home game: the date, the game's start and end times, and the opponent team's name, among other details.
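Before parsing, it can help to peek at the raw files. This is a small sketch I added (not in the original article); the path follows the author's layout, and the sample line shown in the comment is inferred from the parsed output later in this article:

# Peek at the first few raw lines of Dodgers.data
with open('E:\PyProjects\DataSet\BuildingInOut/Dodgers.data') as f:
    for _ in range(3):
        print(f.readline().rstrip())
# Expect lines like "4/11/2005 7:35,23": a timestamp plus the 5-minute car count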

1.2 Data preparation

The data preparation for this project mainly consists of integrating the contents of these two files into one usable dataset.

1.2.1 Step one: a file-reading error and its solution

Originally I thought I could read these two files directly with pd.read_csv(), as I have done in several previous articles. In this project, however, calling that method directly raises an error.

# 1. Prepare the dataset: load the data from the two files
import pandas as pd

feature_data_path = 'E:\PyProjects\DataSet\BuildingInOut/Dodgers.data'
feature_set = pd.read_csv(feature_data_path, header=None)
print(feature_set.info())
# print(feature_set.head())
# print(feature_set.tail())  # checked: no problem

label_data_path = 'E:\PyProjects\DataSet\BuildingInOut/Dodgers.events'
label_set = pd.read_csv(label_data_path, header=None)  # this call raises the error below
print(label_set.info())
# print(label_set.head())
# print(label_set.tail())

The second pd.read_csv() call in the code above throws the following error, which appears to be caused by the encoding of the original file.

-------------------------------------output-----------------------------------------

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 5: invalid start byte

--------------------------------------------finished-------------------------------------

Looking at the original Dodgers.events file, we can see that there is an unrecognizable character at the end of each line.

My solution was to open the Dodgers.events file in Notepad, choose "Save As", change the encoding to "UTF-8", and save it as a new file; the new file no longer contains the unknown characters.

After that, pd.read_csv() reads the new file without any problem.
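Alternatively, the re-saving step can be skipped by telling pandas which encoding to use when reading. This is a sketch under the assumption that a single-byte codec such as latin-1 fits the stray bytes; the file's true original encoding is unknown:

# Assumed alternative fix: read with an explicit single-byte encoding
label_set = pd.read_csv(label_data_path, header=None, encoding='latin-1')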

1.2.2 Step two: delete missing data and split a column

The samples in Dodgers.data contain missing data: rows whose traffic count is -1 are missing. There are many ways to handle missing data; here, because the overall sample size is large, I simply delete those rows. In addition, the first column holds the date and the time separated by a space rather than a comma, so it has to be split into two columns with the following code:

# Delete the missing data
feature_set2 = feature_set[feature_set[1] != -1]  # keep only the rows whose count is not -1
# print(feature_set2)  # looks fine
feature_set2 = feature_set2.reset_index(drop=True)
print(feature_set2.head())

# Column 0 contains both the date and the time, so split it into two columns
need_split_col = feature_set2[0].copy()
feature_set2[0] = need_split_col.map(lambda x: x.split()[0].strip())
feature_set2[2] = need_split_col.map(lambda x: x.split()[1].strip())
print(feature_set2.head())  # the split looks fine

-------------------------------------output-----------------------------------------

                0   1
0  4/11/2005 7:35  23
1  4/11/2005 7:40  42
2  4/11/2005 7:45  37
3  4/11/2005 7:50  24
4  4/11/2005 7:55  39
           0   1     2
0  4/11/2005  23  7:35
1  4/11/2005  42  7:40
2  4/11/2005  37  7:45
3  4/11/2005  24  7:50
4  4/11/2005  39  7:55

--------------------------------------------finished-------------------------------------

1.2.3 Step three: unify the date formats

Before merging the two DataFrames and comparing their dates, we need to unify the date formats. The dates read from both files are strings, but the dates from Dodgers.data look like 4/11/2005 while those from Dodgers.events look like 05/01/05, so the two strings obviously cannot be compared directly. Fortunately, pandas has a built-in to_datetime function that can parse both formats directly. The code is:

# Unify the date format in the two DataFrames: the dates are still strings
# in different formats, so they cannot be compared yet
feature_set2[0] = pd.to_datetime(feature_set2[0])
print(feature_set2[0][:5])  # print the first 5 rows of column 0
label_set[0] = pd.to_datetime(label_set[0])
print(label_set[0][:5])

-------------------------------------output-----------------------------------------

0   2005-04-11
1   2005-04-11
2   2005-04-11
3   2005-04-11
4   2005-04-11
Name: 0, dtype: datetime64[ns]
0   2005-04-12
1   2005-04-13
2   2005-04-15
3   2005-04-16
4   2005-04-17
Name: 0, dtype: datetime64[ns]

--------------------------------------------finished-------------------------------------

1.2.4 Step four: merge the two files into one dataset

When merging the files we need to decide which feature attributes are useful for the machine-learning task. The feature columns selected here are: date, time, opponent team name, and whether the moment falls during a game. So we take the date and time from Dodgers.data, take the opponent team name and the in-game flag from Dodgers.events, and put them into one dataset. The specific code is as follows:

# Merge the two files into one dataset
feature_set2[3] = 'NoName'  # opponent team name, temporarily initialized to 'NoName'
feature_set2[4] = 0         # in-game flag, initialized to 0

def calc_mins(time_str):
    nums = time_str.split(':')
    return 60 * int(nums[0]) + int(nums[1])  # convert a time string to minutes

for row_id, date in enumerate(label_set[0]):  # take each game date from the labels
    temp_df = feature_set2[feature_set2[0] == date]
    if temp_df.empty:  # no traffic rows on this date (an empty DataFrame is never None)
        continue
    # As long as there is a game on this day, write the opponent team name into
    # column 3, whether or not the game is in progress at that moment
    rows = temp_df.index.tolist()
    feature_set2.loc[rows, 3] = label_set.iloc[row_id, 4]
    start_min = calc_mins(label_set.iloc[row_id, 1])
    stop_min = calc_mins(label_set.iloc[row_id, 2])
    for row in temp_df[2]:  # determine whether the time falls inside the game period
        feature_min = calc_mins(row)
        if feature_min >= start_min and feature_min <= stop_min:
            feature_row = temp_df[temp_df[2] == row].index.tolist()
            feature_set2.loc[feature_row, 4] = 1
# feature_set2.to_csv('D:/feature_set2_dodgers.csv')  # save and inspect; looks fine

Opening the saved feature_set2_dodgers.csv, we find many rows marked NoName, which means there was no game that day and therefore no opponent name to match. There are several ways to handle the NoName samples, depending on the specific need: they can be kept as a separate category for training, or simply deleted. Here I just delete them.
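Incidentally, a quick way to see how many unmatched rows there are before deleting them is to count the values in column 3. This check is my addition, not part of the original code:

# Count how many rows carry each opponent name; 'NoName' rows had no game that day
print(feature_set2[3].value_counts().head())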

1.2.5 Step five: convert dates to day-of-week and save the dataset

This part mainly converts the date to the day of the week and saves the dataset to disk, so that next time the file can be read directly. The code is:

feature_set3 = feature_set2[feature_set2[3] != 'NoName'].reset_index(drop=True)  # drop the NoName samples
# Further processing: a calendar date never recurs in the future, so it is not a
# suitable feature; the day of the week can be used instead
feature_set3[5] = feature_set3[0].map(lambda x: x.strftime('%w'))  # convert the date to a day-of-week number
feature_set3 = feature_set3.reindex(columns=[0, 2, 5, 3, 4, 1])
print(feature_set3.tail())  # check that the conversion is fine
feature_set3.to_csv('E:\PyProjects\DataSet\BuildingInOut/Dodgers_Sorted_Set.txt')  # save the prepared dataset so it can be loaded directly next time

-------------------------------------output-----------------------------------------

                0      2  5        3  4   1
22411  2005-09-29  23:35  4  Arizona  0   9
22412  2005-09-29  23:40  4  Arizona  0  13
22413  2005-09-29  23:45  4  Arizona  0  11
22414  2005-09-29  23:50  4  Arizona  0  14
22415  2005-09-29  23:55  4  Arizona  0  17

--------------------------------------------finished-------------------------------------
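A note on strftime('%w'): it returns the weekday as a string digit, '0' for Sunday through '6' for Saturday. A quick sanity check against the output above:

from datetime import datetime
# 2005-09-29 was a Thursday, which matches the 4 shown in column 5 above
print(datetime(2005, 9, 29).strftime('%w'))  # -> '4'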

########################### Summary ###########################

1. The main difficulty of this project again lies in the data processing; the data-preparation approach is much the same as in the previous article.

#################################################################


2. Build the SVM regression model

The key to building a regression model with SVM is to import the SVR module instead of the SVC module used for classification, and to adjust the SVR parameters accordingly. The code for the SVM regression model is:

from sklearn.svm import SVR  # note: this time we import SVR rather than SVC
regressor = SVR(kernel='rbf', C=10.0, epsilon=0.2)  # these parameter values came from optimization
regressor.fit(train_X, train_y)

-------------------------------------output-----------------------------------------

SVR(C=10.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma='auto',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

--------------------------------------------finished-------------------------------------
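Note that the snippet above uses train_X and train_y without showing how they were built. A minimal sketch of one way to produce them, assuming the Dodgers_Sorted_Set.txt saved earlier and an integer encoding of the non-numeric columns (the column names and the 75/25 split are my assumptions, not the author's original code):

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('E:\PyProjects\DataSet\BuildingInOut/Dodgers_Sorted_Set.txt', index_col=0)
df.columns = ['date', 'time', 'weekday', 'opponent', 'in_game', 'count']
for col in ['time', 'opponent']:  # encode the non-numeric columns as integers
    df[col] = LabelEncoder().fit_transform(df[col])
X = df[['time', 'weekday', 'opponent', 'in_game']].values.astype(float)
y = df['count'].values
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=42)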

After the model has been defined and trained, it needs to be evaluated on a test set. Here are the test code and its output:

y_predict_test = regressor.predict(test_X)

# Evaluate the model with several regression metrics
# (note: sklearn's convention is (y_true, y_pred); the argument order here follows the original code)
import sklearn.metrics as metrics
print('Mean absolute error: {}'.format(
    round(metrics.mean_absolute_error(y_predict_test, test_y), 2)))
print('Mean squared error (MSE): {}'.format(
    round(metrics.mean_squared_error(y_predict_test, test_y), 2)))
print('Median absolute error: {}'.format(
    round(metrics.median_absolute_error(y_predict_test, test_y), 2)))
print('Explained variance score: {}'.format(
    round(metrics.explained_variance_score(y_predict_test, test_y), 2)))
print('R2 score: {}'.format(
    round(metrics.r2_score(y_predict_test, test_y), 2)))

-------------------------------------output-----------------------------------------

Mean absolute error: 5.16
Mean squared error (MSE): 50.45
Median absolute error: 3.75
Explained variance score: 0.63
R2 score: 0.62

--------------------------------------------finished-------------------------------------

It seems the results are not very good; perhaps the SVR parameters need further optimization.
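One common way to optimize them is a grid search over C, epsilon, and the kernel. A sketch with illustrative parameter ranges (my values, not from the article):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'kernel': ['rbf'], 'C': [1.0, 10.0, 100.0],
              'epsilon': [0.1, 0.2, 0.5]}
grid = GridSearchCV(SVR(), param_grid, cv=3, scoring='neg_mean_absolute_error')
grid.fit(train_X, train_y)
print(grid.best_params_)  # use these to re-train the final regressor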

Many friends have left messages asking how to save and reload a trained SVM model. I covered this in an earlier article; see: "Furnace-smelting AI" machine learning 003-Creation, testing, saving and loading of a simple linear regression model.
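For completeness, a minimal sketch of saving and reloading the trained regressor with joblib (one common approach; the linked article may use a different one):

from sklearn.externals import joblib  # in scikit-learn 0.19; newer versions use a top-level `import joblib`
joblib.dump(regressor, 'traffic_svr.model')  # save the trained model to disk
regressor2 = joblib.load('traffic_svr.model')  # reload; predicts identically to `regressor`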


Note: The code for this section has been uploaded to my GitHub; you are welcome to download it.

Resources:

1. Python Machine Learning Cookbook, Prateek Joshi; Chinese translation by Tao Junjie and Chen Xiaoli

"Furnace-smelting AI" machine learning 019-Project case: Estimating traffic flow using the SVM regression
