Linear regression analysis and forecast of Shenzhen house prices

Source: Internet
Author: User

1. Data Cleansing
data = data.rename(columns={'Unnamed: 0': 'ID'})
Name the first column 'ID'.

data = data[data['Rooms'] != 'Unknown']
data['Rooms'] = data['Rooms'].astype(int)
data['Living Rooms'] = data['Living Rooms'].astype(int)
data['Height'] = data['Height'].astype(int)
data = data[data['Undergroud'] != 'Unknown']
data['Undergroud'] = data['Undergroud'].astype(int)
data['Distance to Underground'] = data['Distance to Underground'].astype(int)
# print(data.describe())
Filter out the rows where 'Rooms' is 'Unknown' and convert that column to int, then do the same for the other numeric columns such as 'Living Rooms' and 'Height'. (This should probably be a loop.)
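The loop hinted at above could look roughly like this. It is a minimal sketch, not the code used for the post: the helper name to_int_columns is hypothetical, and the three-row DataFrame is made up to stand in for the real data.

```python
import pandas as pd

def to_int_columns(data, columns, unknown='Unknown'):
    """Drop rows holding the placeholder in any listed column, then cast to int."""
    for col in columns:
        data = data[data[col] != unknown]
    data = data.copy()  # work on a copy so the caller's frame is untouched
    for col in columns:
        data[col] = data[col].astype(int)
    return data

# Made-up rows, only to exercise the helper.
sample = pd.DataFrame({
    'Rooms': ['3', 'Unknown', '2'],
    'Living Rooms': ['2', '1', '1'],
    'Undergroud': ['1', '0', 'Unknown'],
})
cleaned = to_int_columns(sample, ['Rooms', 'Living Rooms', 'Undergroud'])
print(cleaned)
```

Unlike the line-by-line version, this checks every listed column for 'Unknown', which is harmless for columns that never contain it.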
# print(data.groupby('Rooms').count())
data = data[(data['Rooms'] > 0) & (data['Rooms'] < 8)]
data = data[data['Rooms'] > data['Living Rooms']]
Counting by 'Rooms' showed listings with 0 rooms and listings with 8 or more rooms. These were removed and the result assigned back to data. Rows with abnormal 'Living Rooms' values were removed as well: the second filter keeps only rows where 'Rooms' is strictly greater than 'Living Rooms'.
(Also, data.groupby('Rooms').count() prints the same count in many columns, so I later added ['price'] to get a single column.)
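To illustrate the two points above, here is a small sketch on made-up rows (the values are invented, only the column names come from this post). It shows the single-column count and the two filters.

```python
import pandas as pd

# Invented listings: one with 0 rooms, one with 9 rooms, one with as many
# living rooms as rooms.
sample = pd.DataFrame({
    'Rooms': [0, 2, 2, 3, 9],
    'Living Rooms': [0, 1, 2, 1, 2],
    'price': [100, 300, 320, 450, 900],
})

# Selecting one column gives one count per group instead of a repeated
# count in every column.
rooms_count = sample.groupby('Rooms')['price'].count()
print(rooms_count)

# Same two filters as in the post.
filtered = sample[(sample['Rooms'] > 0) & (sample['Rooms'] < 8)]
filtered = filtered[filtered['Rooms'] > filtered['Living Rooms']]
print(filtered)
```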

# plt.plot(data['price'], 'o', alpha=0.5)
# plt.show()
# plt.plot(data['Height'], '-', alpha=0.5)
# plt.show()
data = data[data['price'] < 5000]
Then plot 'price' and 'Height'. 'Height' shows no outliers, so only 'price' is restricted, to values below 5000.

dummy_decoration = pd.get_dummies(data['decoration'], prefix='decoration')
dummy_region = pd.get_dummies(data['region'], prefix='region')
dummy_orientation = pd.get_dummies(data['Orientation'], prefix='Orientation')
dummy_floor = pd.get_dummies(data['floor'], prefix='floor')
dummy_elevator = pd.get_dummies(data['has elevator'], prefix='elevator')
Handle the string-valued (categorical) variables with dummy variables. (Perhaps this could be wrapped in a function and called directly?)

data = pd.merge(data, dummy_decoration, right_index=True, left_index=True)
data = pd.merge(data, dummy_region, right_index=True, left_index=True)
data = pd.merge(data, dummy_orientation, right_index=True, left_index=True)
data = pd.merge(data, dummy_floor, right_index=True, left_index=True)
data = pd.merge(data, dummy_elevator, right_index=True, left_index=True)
Attach the new dummy variables to data. (Repeating this line by line is clumsy; it should be a function call.)
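One possible shape for such a function is sketched below. The name add_dummies and the sample rows are hypothetical. It also uses pd.concat along axis=1 instead of the repeated pd.merge calls; with a shared, unique index the two give the same row alignment.

```python
import pandas as pd

def add_dummies(data, columns_to_prefix):
    """Append one block of dummy columns per categorical column."""
    frames = [data]
    for column, prefix in columns_to_prefix.items():
        frames.append(pd.get_dummies(data[column], prefix=prefix))
    # Every dummy frame shares data's index, so concat lines the rows up.
    return pd.concat(frames, axis=1)

# Made-up rows, only to exercise the helper.
sample = pd.DataFrame({
    'decoration': ['simple', 'delicate', 'simple'],
    'has elevator': ['yes', 'no', 'yes'],
    'price': [300, 450, 320],
})
with_dummies = add_dummies(sample, {'decoration': 'decoration',
                                    'has elevator': 'elevator'})
print(with_dummies.columns.tolist())
```

The original categorical columns stay in the frame, as in the post, and are dropped later when X is built.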
# X = data.loc[:, ('Size', 'Rooms', 'Living Rooms', 'Height', 'Undergroud',
#                  'Distance to Underground', 'dummy_decoration', 'dummy_region',
#                  'dummy_orientation', 'dummy_floor', 'dummy_elevator')]
# print(X)
This raised an error. I later understood why: data has no column named dummy_decoration, only columns such as decoration_delicate/simple. So this approach was abandoned.
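If selecting columns had been kept instead of dropping them, the dummy columns could have been picked up by prefix. The following is only a sketch on an invented frame: region_Nanshan is a made-up column name, and the prefixes assume the prefix= values passed to pd.get_dummies earlier.

```python
import pandas as pd

# Invented frame shaped like data after the dummy columns are attached.
sample = pd.DataFrame({
    'Size': [60, 89],
    'Rooms': [2, 3],
    'decoration_delicate': [0, 1],
    'decoration_simple': [1, 0],
    'region_Nanshan': [1, 0],
    'price': [300, 450],
})

numeric_columns = ['Size', 'Rooms']
dummy_prefixes = ('decoration_', 'region_')
# str.startswith accepts a tuple, so one pass collects every dummy column.
dummy_columns = [c for c in sample.columns if c.startswith(dummy_prefixes)]

X = sample.loc[:, numeric_columns + dummy_columns]
print(X.columns.tolist())
```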
X = data
# print(X.columns)
Take a look at the column names in X.
X = X.drop(['ID', 'Built Year', 'region', 'decoration', 'floor', 'Orientation', 'has elevator', 'price',
            'Year of the property', 'Tag'], axis=1)
# print(X.info())
Y = data['price']
Instead of selecting columns, drop the unneeded variables from data and use what remains as X (though why check X for null values?), then assign 'price' to Y. At this point the data is ready.
That covers 'train.csv', but 'test.csv' needs the same data processing, so wrap it in a function:
def sz(data):
    ...
    # the data processing steps above go here
    ...

    return X, Y
That completes the data processing.


2. Linear regression
Read the data:
import pandas as pd

data1 = pd.read_csv('train.csv')
X_train, y_train = sz(data1)

data2 = pd.read_csv('test1.csv')
X_test, y_test = sz(data2)
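One thing to watch when the two files go through the same function separately: pd.get_dummies only creates columns for categories present in the file it sees, so X_train and X_test can end up with different columns. This alignment step is not in the original post. A minimal sketch of one way to line them up with DataFrame.reindex, on made-up frames with invented region names:

```python
import pandas as pd

# Made-up feature frames: the test file lacks regions A and C and has an
# extra region D that the training file never saw.
X_train = pd.DataFrame({'Size': [60, 89, 120],
                        'region_A': [1, 0, 0],
                        'region_B': [0, 1, 0],
                        'region_C': [0, 0, 1]})
X_test = pd.DataFrame({'Size': [75, 95],
                       'region_B': [1, 0],
                       'region_D': [0, 1]})

# Keep exactly the training columns, in the training order. Categories
# missing from the test file become all-zero columns; unseen ones are dropped.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)
print(X_test_aligned)
```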

Fit a linear regression on the training data, print the linear coefficients, and use X_test to get y_pred:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)
print(linreg.get_params())
print(linreg.coef_)
print(linreg.intercept_)
y_pred = linreg.predict(X_test)

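Evaluate the fitted linear model with MAPE and other metrics:
import matplotlib.pyplot as plt
from sklearn import metrics

print("MAE:", metrics.mean_absolute_error(y_test, y_pred))
# MAPE: mean of |actual - predicted| / actual, as a percentage
m = (y_test - y_pred) / y_test
a = map(abs, m)
p = sum(a) / len(m)
print("MAPE:", p * 100)
print("R2:", metrics.r2_score(y_test, y_pred))
# Predicted vs. actual price; points on the red dashed line are exact predictions
plt.plot(y_test, y_pred, 'o', alpha=0.5)
plt.plot([0, 4000], [0, 4000], 'r--')
plt.show()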
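The results listed in this post also include MSE and RMSE, which the code above does not print. As a rough sketch, all five metrics can be computed with plain NumPy from their standard definitions. The function name regression_report and the sample numbers are hypothetical, and MAPE assumes no true value is zero.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, MAPE (percent), MSE, RMSE and R2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mape = np.mean(np.abs(errors / y_true)) * 100
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    # R2 = 1 - residual sum of squares / total sum of squares
    r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {'MAE': mae, 'MAPE': mape, 'MSE': mse, 'RMSE': rmse, 'R2': r2}

# Invented values, only to exercise the function.
report = regression_report([100, 200, 400], [110, 180, 400])
for name, value in report.items():
    print(name + ':', value)
```

RMSE is simply the square root of MSE, which matches the figures reported in this post (185.79 squared is about 34518).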
Some of the results:

MAE: 123.090078901
MAPE: 23.7656379369
MSE: 34517.8775484
RMSE: 185.78987472
R2: 0.805982042193

Stroke ...













