Data exploration on the extracted data consists mainly of missing-value analysis and outlier analysis. Inspecting the data shows records where the ticket price is null, and records where the minimum fare is 0 and the minimum discount rate is 0 while the total kilometers flown is greater than 0.
A null fare may mean the customer has no flight record; the other cases may come from customers flying on tickets with a discount rate of 0 or on points redemption.
We then compute the number of null values, the maximum and the minimum for each attribute, before cleaning and transforming the data. The data exploration code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd

inputfile = 'F:\\python Data Mining \\chapter7\\demo\\data\\air_data.csv'
outputfile = 'F:\\python Data Mining \\chapter7\\demo\\tmp\\tansuo.xls'

data = pd.read_csv(inputfile, encoding='utf-8')
# Summary statistics for every column, transposed so each attribute is a row
tansuo = data.describe(percentiles=[], include='all').T
# describe() counts non-null values, so nulls = total rows - count
tansuo['null'] = len(data) - tansuo['count']
tansuo = tansuo[['null', 'max', 'min']]
tansuo.columns = [u'number of NULL values', u'Maximum Value', u'Minimum Value']
tansuo.to_excel(outputfile)
# print(tansuo)
The result: the number of null values and the maximum and minimum values of each attribute are saved to the output path.
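To make the three statistics concrete, here is a minimal sketch on a hypothetical four-row table. The column names `fare` and `km` are illustrative assumptions, not columns of air_data.csv, and the sketch uses `isnull().sum()`, `max()` and `min()` directly rather than `describe()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names are illustrative only
toy = pd.DataFrame({
    'fare': [100.0, np.nan, 0.0, 250.0],
    'km': [1200, 800, 500, 3000],
})

# One row per attribute: null count, maximum and minimum
summary = pd.DataFrame({
    'null_count': toy.isnull().sum(),
    'max': toy.max(),   # NaN values are skipped by default
    'min': toy.min(),
})
print(summary)
```

Here `fare` has one null value and a minimum of 0, which is the kind of record the cleaning step has to deal with.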
Data cleaning:
Drop the records where the ticket price is null. Then keep only the records where the fare is not 0, or where the discount rate and the total kilometers flown are both 0.
Finally, the result is saved to an Excel file.
The corresponding code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd

inputfile = 'F:\\python Data Mining \\chapter7\\demo\\data\\air_data.csv'
outputfile = 'F:\\python Data Mining \\chapter7\\demo\\tmp\\clean_data.xls'

data = pd.read_csv(inputfile, encoding='utf-8')
# Keep only rows where both yearly fares are non-null
data = data[data['sum_yr_1'].notnull() & data['sum_yr_2'].notnull()]
# Keep rows with a non-zero fare in either year, or rows where the
# average discount and the total kilometers flown are both 0
index = data['sum_yr_1'] != 0
index1 = data['sum_yr_2'] != 0
index2 = (data['Avg_discount'] == 0) & (data['Seg_km_sum'] == 0)
clean = data[index | index1 | index2]
# print(clean)
clean.to_excel(outputfile)
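The two cleaning rules can be checked on a small hypothetical table. This is only a sketch: the column names `fare_y1`, `fare_y2`, `discount` and `km` are made up for illustration and stand in for the two yearly fares, the average discount and the total kilometers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names are illustrative only
toy = pd.DataFrame({
    'fare_y1': [100.0, np.nan, 0.0, 0.0],
    'fare_y2': [50.0, 80.0, 0.0, 0.0],
    'discount': [0.8, 0.9, 0.7, 0.0],
    'km': [1200, 800, 500, 0],
})

# Rule 1: both fares must be present
toy = toy[toy['fare_y1'].notnull() & toy['fare_y2'].notnull()]

# Rule 2: a non-zero fare in either year, OR discount and km both 0
keep = (
    (toy['fare_y1'] != 0)
    | (toy['fare_y2'] != 0)
    | ((toy['discount'] == 0) & (toy['km'] == 0))
)
clean = toy[keep]
print(clean.index.tolist())
```

Row 1 is dropped for its null fare and row 2 for having zero fares with a non-zero discount and distance, so rows 0 and 3 remain.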
Because the raw data has too many attributes, attribute reduction is required:
The reduction proceeds as follows:
The main influencing factors are extracted, the data is reduced to those factors, and the model is then built on the reduced data to obtain the results:
Several factors are involved; the first mainly requires extracting an elapsed time in days:
The elapsed time can be computed with numpy.timedelta64:
The result is:
res = d_load - d_ffp
data['L'] = res.map(lambda x: x / np.timedelta64(30 * 24 * 60, 'm'))
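A minimal sketch of the same date arithmetic on two hypothetical rows follows. The column names `join_date` and `load_date` are assumptions for illustration. Note that 30 * 24 * 60 minutes is exactly 30 days, so dividing by `np.timedelta64(30, 'D')` gives the same 30-day "months".

```python
import numpy as np
import pandas as pd

# Hypothetical toy dates; column names are illustrative only
toy = pd.DataFrame({
    'join_date': ['2012-01-01', '2013-06-15'],
    'load_date': ['2014-03-31', '2014-03-31'],
})

d_join = pd.to_datetime(toy['join_date'])
d_load = pd.to_datetime(toy['load_date'])
res = d_load - d_join                    # Series of timedeltas

days = res.dt.days                       # whole days between the two dates
months = res / np.timedelta64(30, 'D')   # elapsed time in 30-day months
print(days.tolist())
print(months.round(2).tolist())
```

Dividing the whole Series by a single timedelta is vectorized, so no `map` with a lambda is needed for this step.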
The attribute reduction code is as follows:
import numpy as np
import pandas as pd

inputfile = 'F:\\python Data Mining \\chapter7\\demo\\tmp\\clean_data.xls'
outputfile = 'F:\\python Data Mining \\chapter7\\demo\\tmp\\zs_data.xls'

data = pd.read_excel(inputfile)
data = data[['Load_time', 'ffp_date', 'Last_to_end', 'Flight_count', 'Seg_km_sum', 'Avg_discount']]
# Earlier attempts that did not work:
# data['L'] = pd.datetime(data['Load_time']) - pd.datetime(data['ffp_date'])
# data['L'] = int(((parse(data['Load_time']) - parse(data['ffp_date'])).days) / 30)
# These four lines of code cost me 3 hours
d_ffp = pd.to_datetime(data['ffp_date'])
d_load = pd.to_datetime(data['Load_time'])
res = d_load - d_ffp
# L: time between the two dates, in 30-day months
data['L'] = res.map(lambda x: x / np.timedelta64(30 * 24 * 60, 'm'))
data['R'] = data['Last_to_end']
data['F'] = data['Flight_count']
data['M'] = data['Seg_km_sum']
data['C'] = data['Avg_discount']
data = data[['L', 'R', 'F', 'M', 'C']]
data.to_excel(outputfile)
print('Finish')
The saved result is:
The next step is to standardize the data
and rename the columns:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

inputfile = 'F:\\python Data Mining \\chapter7\\demo\\tmp\\zs_data.xls'
outputfile = 'F:\\python Data Mining \\chapter7\\demo\\tmp\\zs_code_data.xls'

data = pd.read_excel(inputfile)
# Z-score standardization: subtract the column mean, then divide by the
# column standard deviation (the parentheses are required)
data = (data - data.mean(axis=0)) / data.std(axis=0)
# Prefix every column name with 'Z'
data.columns = ['Z' + i for i in data.columns]
# print(data.columns)
data.to_excel(outputfile)
print('Finish')
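Z-score standardization subtracts each column's mean and divides by its standard deviation. Division binds tighter than subtraction in Python, so the subtraction must be parenthesized. A minimal sketch on a hypothetical two-column table (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy values for two of the indicators
toy = pd.DataFrame({'L': [10.0, 20.0, 30.0], 'R': [1.0, 2.0, 6.0]})

# (x - mean) / std, column by column; pandas std() uses ddof=1 by default
z = (toy - toy.mean(axis=0)) / toy.std(axis=0)
z.columns = ['Z' + c for c in z.columns]
print(z)
```

After this step every column has mean 0 and standard deviation 1, so indicators measured in different units contribute on the same scale to a distance-based clustering.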
The results are as follows:
The next step is to build the model. Because the goal is to judge customer value, the customers are divided into several groups; based on the customer categories, the number of cluster centers can be set to 5.
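As an illustration of clustering into 5 groups, here is a minimal sketch of plain k-means (Lloyd's algorithm) written with NumPy only. It is not the post's own model code: the function `kmeans`, the random stand-in data and the seeds are all assumptions for illustration. In practice a library implementation such as scikit-learn's KMeans would normally be used.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (illustrative sketch). Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, labels

# Random stand-in for the standardized ZL, ZR, ZF, ZM, ZC columns
X = np.random.default_rng(42).normal(size=(200, 5))
centers, labels = kmeans(X, k=5)
print(centers.shape, np.bincount(labels, minlength=5))
```

Each row of `centers` is one cluster center in the five standardized indicators, and `labels` assigns every customer to one of the 5 groups.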
Airline Customer Value analysis