1. Data Cleansing
data = data.rename(columns={'Unnamed: 0': 'ID'})
Give the first column the name 'ID'.
data = data[data['Rooms'] != 'Unknown']
data['Rooms'] = data['Rooms'].astype(int)
data['Living Rooms'] = data['Living Rooms'].astype(int)
data['Height'] = data['Height'].astype(int)
data = data[data['Undergroud'] != 'Unknown']
data['Undergroud'] = data['Undergroud'].astype(int)
data['Distance to Underground'] = data['Distance to Underground'].astype(int)
# print(data.describe())
Filter out the rows where a column holds 'Unknown' and convert that column to int, then do the same conversion for the other numeric columns such as 'Living Rooms' and 'Height' (this should probably be a loop).
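The repeated filter-and-cast steps above could be folded into a loop. This is a minimal sketch, assuming the column names used above and that 'Unknown' is the only non-numeric marker. `to_int_columns` is a hypothetical helper name, and the small frame is made-up data for illustration only.

```python
import pandas as pd

def to_int_columns(data, columns, missing='Unknown'):
    """Drop rows where any listed column equals `missing`, then cast those columns to int."""
    keep = pd.Series(True, index=data.index)
    for col in columns:
        keep &= (data[col] != missing)
    # Copy so the casts below do not write into a view of the original frame
    cleaned = data[keep].copy()
    for col in columns:
        cleaned[col] = cleaned[col].astype(int)
    return cleaned

# Made-up sample rows, only to show the call
sample = pd.DataFrame({
    'Rooms': ['3', 'Unknown', '2'],
    'Living Rooms': ['1', '1', '2'],
    'Undergroud': ['1', '0', 'Unknown'],
})
cleaned = to_int_columns(sample, ['Rooms', 'Living Rooms', 'Undergroud'])
print(cleaned.dtypes)
```

Building one combined mask first means every 'Unknown' row is removed before any cast runs, so `astype(int)` never sees the marker string.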
# print(data.groupby('Rooms').count())
data = data[(data['Rooms'] > 0) & (data['Rooms'] < 8)]
data = data[data['Rooms'] > data['Living Rooms']]
Counting the rows per 'Rooms' value showed listings with 0 rooms and listings with 8 or more rooms; these were removed and the result was re-assigned to data. Rows where 'Living Rooms' is not smaller than 'Rooms' were treated as abnormal and removed as well.
(Also, print(data.groupby('Rooms').count()) printed many columns that all held the same counts, so I later added ['price'].)
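The repeated columns appear because count() reports the non-null count of every column for each group. Selecting one column before counting gives a single series instead. A small sketch on made-up rows (the values are invented for illustration):

```python
import pandas as pd

# Made-up listings, including a 0-room and a 9-room row
sample = pd.DataFrame({
    'Rooms': [1, 2, 2, 0, 9],
    'Size': [30, 60, 65, 10, 300],
    'price': [900, 1500, 1600, 300, 7000],
})

# One count per 'Rooms' value instead of one per column
per_room = sample.groupby('Rooms')['price'].count()
print(per_room)
```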
# plt.plot(data['price'], 'o', alpha=0.5)
# plt.show()
# plt.plot(data['Height'], '-', alpha=0.5)
# plt.show()
data = data[data['price'] < 5000]
Then plot 'price' and 'Height': 'Height' shows no outliers, and 'price' is restricted to values below 5000.
dummy_decoration = pd.get_dummies(data['decoration'], prefix='decoration')
dummy_region = pd.get_dummies(data['region'], prefix='region')
dummy_orientation = pd.get_dummies(data['Orientation'], prefix='Orientation')
dummy_floor = pd.get_dummies(data['floor'], prefix='floor')
dummy_elevator = pd.get_dummies(data['has elevator'], prefix='elevator')
To handle the string-valued variables, use dummy variables (perhaps this could be wrapped in a function and called directly?).
data = pd.merge(data, dummy_decoration, right_index=True, left_index=True)
data = pd.merge(data, dummy_region, right_index=True, left_index=True)
data = pd.merge(data, dummy_orientation, right_index=True, left_index=True)
data = pd.merge(data, dummy_floor, right_index=True, left_index=True)
data = pd.merge(data, dummy_elevator, right_index=True, left_index=True)
Attach the new dummy variables to data (repeating the call like this is clumsy; it should be a function call).
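The create-then-attach steps above could be wrapped in one function, as the notes suggest. This is a minimal sketch, not the code used in this post. `add_dummies` is a hypothetical helper name, the column names follow the ones above, and the sample rows are made up.

```python
import pandas as pd

def add_dummies(data, prefixes):
    """For each column -> prefix pair, build dummy columns and attach them by index."""
    frames = [pd.get_dummies(data[col], prefix=prefix)
              for col, prefix in prefixes.items()]
    # The dummy frames share data's index, so concatenating along columns
    # lines the rows up the same way the index-based merges do
    return pd.concat([data] + frames, axis=1)

# Made-up rows, only to show the call
sample = pd.DataFrame({
    'decoration': ['delicate', 'simple', 'simple'],
    'has elevator': ['yes', 'no', 'yes'],
    'price': [1200, 800, 950],
})
sample = add_dummies(sample, {'decoration': 'decoration', 'has elevator': 'elevator'})
print(sample.columns.tolist())
```

pandas can also do this in one call with pd.get_dummies(data, columns=[...]). That form replaces the original string columns with the dummy columns, so the later drop would no longer need to list them.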
# X = data.loc[:, ('Size', 'Rooms', 'Living Rooms', 'Height', 'Undergroud',
#                  'Distance to Underground', 'dummy_decoration', 'dummy_region',
#                  'dummy_orientation', 'dummy_floor', 'dummy_elevator')]
# print(X)
This raised an error. I later understood why: data has no column called dummy_decoration, only columns such as decoration_delicate and decoration_simple, so I abandoned this approach.
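The selection fails because get_dummies names its output columns prefix_value, not after the variable that holds them. If selecting columns is still preferred over dropping them, the dummy columns can be picked up by prefix. A minimal sketch on made-up data; the names `base_columns` and `dummy_prefixes` are chosen here for illustration.

```python
import pandas as pd

# Made-up rows with one string column turned into dummies
sample = pd.DataFrame({
    'Size': [50, 80],
    'decoration': ['delicate', 'simple'],
})
sample = sample.join(pd.get_dummies(sample['decoration'], prefix='decoration'))

base_columns = ['Size']
dummy_prefixes = ('decoration_',)  # trailing underscore skips the original 'decoration' column
dummy_columns = [c for c in sample.columns if c.startswith(dummy_prefixes)]

X = sample.loc[:, base_columns + dummy_columns]
print(X.columns.tolist())
```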
X = data
# print(X.columns)
Take a look at the variable names in X.
X = X.drop(['ID', 'Built Year', 'region', 'decoration', 'floor', 'Orientation', 'has elevator', 'price',
            'Year of the property', 'Tag'], axis=1)
# print(X.info())
Y = data['price']
Instead of selecting columns, drop the unneeded variables from data and use what is left as X (X.info() is only there to check X for null values), then assign 'price' to Y. At this point my data is ready.
That takes care of 'train.csv', but 'test.csv' needs the same data processing, so write a function:
def sz(data):
    ...
    # the data processing steps above go here
    ...
    return X, Y
That completes the data processing.
2. Linear regression
Read the data:
import pandas as pd
data1 = pd.read_csv('train.csv')
X_train, y_train = sz(data1)
data2 = pd.read_csv('test1.csv')
X_test, y_test = sz(data2)
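One thing to watch when the same function is applied to both files: get_dummies only creates columns for the categories present in the file it sees, so X_test can end up with different columns from X_train. A minimal sketch of one way to align them, assuming a missing dummy column should be filled with 0 and a category unseen in training should be dropped. The frames and region names are made up.

```python
import pandas as pd

# Made-up feature frames whose dummy columns do not match
X_train = pd.DataFrame({
    'Size': [50, 80, 65],
    'region_A': [1, 0, 0],
    'region_B': [0, 1, 0],
    'region_C': [0, 0, 1],
})
X_test = pd.DataFrame({
    'Size': [70, 55],
    'region_A': [1, 0],
    'region_D': [0, 1],
})

# Keep exactly the training columns, in the training order
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)
print(X_test_aligned.columns.tolist())
```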
Fit a linear regression on the training data, print the linear coefficients, and use X_test to get y_pred:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print(linreg.get_params())
print(linreg.coef_)
print(linreg.intercept_)
y_pred = linreg.predict(X_test)
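LinearRegression fits ordinary least squares, so coef_ holds one weight per column of X and intercept_ is the constant term. The sketch below shows what those two values represent on made-up numbers. It uses np.linalg.lstsq rather than scikit-learn, so it is an illustration of the technique, not the code used in this post.

```python
import numpy as np

# Made-up data generated exactly from y = 3*x1 + 2*x2 + 10
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 3 * X[:, 0] + 2 * X[:, 1] + 10

# Append a column of ones so the last fitted weight plays the role of intercept_
A = np.hstack([X, np.ones((X.shape[0], 1))])
weights = np.linalg.lstsq(A, y, rcond=None)[0]
coef, intercept = weights[:-1], weights[-1]

# Predictions are the weighted sum of the columns plus the intercept
y_pred = X @ coef + intercept
print(coef, intercept)
```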
Evaluate the fitted linear model with MAPE and other metrics:
from sklearn import metrics
import matplotlib.pyplot as plt
print("MAE:", metrics.mean_absolute_error(y_test, y_pred))
m = (y_test - y_pred) / y_test
a = map(abs, m)
p = sum(a) / len(m)
print("MAPE:", p * 100)
print("R2:", metrics.r2_score(y_test, y_pred))
plt.plot(y_test, y_pred, 'o', alpha=0.5)
plt.plot([0, 4000], [0, 4000], 'r--')
plt.show()
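The MAPE computation above can also be written with vectorized numpy operations, and MSE and RMSE follow from the same errors. A minimal sketch using the standard definitions of these metrics; `regression_report` is a hypothetical helper name and the numbers passed to it are made up, not the results of this post.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, MAPE (in percent), MSE, RMSE and R2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        'MAE': np.mean(np.abs(err)),
        # Relative error per sample; assumes no true value is zero
        'MAPE': np.mean(np.abs(err / y_true)) * 100,
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'R2': 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
    }

# Made-up values, only to show the call
report = regression_report([100, 200, 400], [110, 180, 400])
for name, value in report.items():
    print(name + ":", value)
```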
Some of the results:
MAE: 123.090078901
MAPE: 23.7656379369
MSE: 34517.8775484
RMSE: 185.78987472
R2: 0.805982042193
Linear regression analysis and prediction of Shenzhen house prices