The chi-square test is a widely used hypothesis test for count data. It is a nonparametric test, used mainly to compare two or more sample rates (composition ratios) and to analyze the association between two categorical variables. The basic idea is to measure how well the theoretical (expected) frequencies agree with the actual (observed) frequencies, which is a goodness-of-fit problem. (Further reading: chi-square test, chi-square distribution)
This post skips most of the theory and focuses on carrying out a chi-square test in Python. During earlier feature (attribute) selection, outliers can be filtered with histograms, box plots, the Z-score method, and so on; see, for example, Python exploratory data analysis (EDA), where outliers of a single variable can be excluded while exploring. For discrete (or discretized) attributes, values can be grouped by equal-width binning, equal-frequency binning, etc., but such binning does not reflect the relationship with other attributes or with the outcome.
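The two binning schemes mentioned above can be sketched with pandas. This is only an illustration: the ages are made-up values, and the three labels are arbitrary names chosen for the example.

```python
import pandas as pd

# Hypothetical customer ages, used only for illustration.
ages = pd.Series([22, 25, 27, 31, 35, 38, 44, 52, 71])

# Equal-width binning: 3 intervals of the same width over the value range.
equal_width = pd.cut(ages, bins=3, labels=["low", "mid", "high"])

# Equal-frequency binning: 3 intervals holding roughly the same number of rows.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Number of rows that fall into each bin, in bin order.
width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts)
print(freq_counts)
```

Neither scheme looks at the outcome variable, which is the limitation the chi-square approach below tries to address.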
Here, the chi-square test is used to classify an attribute. Suppose we want to know whether a client's marital status affects loan repayment, and therefore how best to group its discrete values. We have a data set with each customer's "marital status". For a more elaborate analysis, "marital status + age + gender" interact fairly strongly, and those variables could be tested as well; only marital status is tested here. The data are historical: some customers have repaid and some have not.
But which of these values are outliers? Should the outliers be deleted, or merged into other classes?
First, perform a coarse classification and look at the proportions:
Table 1. Coarse classification by client's marital status
Categories with similar ratios can be merged into one class, for example "remarriage" into "married" and "first marriage" into "unmarried". That is what a purely mechanical algorithm would do; of course, we can also classify from experience: "remarriage", "first marriage", and "remarried" all actually belong to "married".
Table 2. Observed frequency of repayment for each marital status
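An experience-based merge of categories can be sketched as a simple label mapping followed by a per-class repayment share. The records, the column names `marital_status` and `repaid`, and the mapping are all hypothetical and are not the article's data.

```python
import pandas as pd

# Hypothetical records: marital status and a 0/1 repayment flag (1 = repaid).
records = pd.DataFrame({
    "marital_status": ["married", "remarried", "unmarried", "first marriage",
                       "married", "unmarried", "remarried", "divorced"],
    "repaid": [1, 1, 0, 1, 0, 1, 1, 0],
})

# Experience-based mapping from fine-grained labels to coarse classes.
coarse_map = {
    "remarried": "married",
    "first marriage": "married",
}
records["marital_coarse"] = records["marital_status"].replace(coarse_map)

# Share of repaid customers in each coarse class.
repaid_share = records.groupby("marital_coarse")["repaid"].mean()
print(repaid_share)
```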
The categories are mutually exclusive and do not overlap. Now compute the frequencies a different way: assume that the distribution of repayment within each marital status matches the overall distribution. For example, "expected number of married customers who did not repay" = 4124 × 5498 / 12405 ≈ 1828 (computed from the row and column totals and the grand total). The final result is as follows:
Table 3. Expected frequency (theoretical estimate) of repayment for each marital status
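The expected-frequency calculation can be sketched in plain Python as follows. The observed counts are hypothetical (the article's tables are not reproduced here); each expected cell is row total × column total / grand total.

```python
# Hypothetical observed counts
# (rows: not repaid / repaid; columns: three marital classes).
observed = [
    [30, 50, 20],
    [70, 150, 80],
]

row_totals = [sum(row) for row in observed]          # [100, 300]
col_totals = [sum(col) for col in zip(*observed)]    # [100, 200, 100]
grand_total = sum(row_totals)                        # 400

# Expected count for a cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
for row in expected:
    print(row)
```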
The larger the gap between the two tables, the less independent the two variables are, that is, the stronger their dependence and the better the coarse classification. The gap is measured with the chi-square distance:
χ² = Σ (A_i − E_i)² / E_i, where A_i is the observed frequency of the i-th cell and E_i is the expected frequency of the i-th cell. Substituting the values from the two tables into the formula:
This gives the chi-square value. Next, compute the chi-square degrees of freedom v:
v = (number of rows − 1) × (number of columns − 1) = (2 − 1) × (3 − 1) = 2
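The chi-square distance and the degrees of freedom described above can be sketched in plain Python. The helper names `chi_square` and `degrees_of_freedom` and the two tables are assumptions made for this example, not part of the original script.

```python
def chi_square(observed, expected):
    """Sum of (A_i - E_i)^2 / E_i over all cells of two equally shaped tables."""
    return sum(
        (a - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for a, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1) for a table without total rows or columns."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical 2 x 3 tables, not the data from the article.
observed = [[30, 50, 20], [70, 150, 80]]
expected = [[25.0, 50.0, 25.0], [75.0, 150.0, 75.0]]

chi2_value = chi_square(observed, expected)
v = degrees_of_freedom(observed)
print(chi2_value, v)
```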
Table 4. Part of the chi-square critical value table
At the usual significance level p = 0.05 with v = 2, the chi-square critical value is 5.991.
The critical value is smaller than the chi-square value of 23 obtained here, so the difference is significant and the null hypothesis is rejected.
The Python script that implements this:
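The comparison against the critical value can be sketched without a lookup table for the special case v = 2, where the chi-square survival function is exp(−x/2). The function name `chi2_critical_df2` is an assumption for this example; for other degrees of freedom a library routine such as `scipy.stats.chi2.ppf` would be the usual choice.

```python
import math


def chi2_critical_df2(p):
    """Critical value of the chi-square distribution with 2 degrees of freedom.

    For v = 2 the survival function is exp(-x / 2), so the critical value
    at significance level p is -2 * ln(p).
    """
    return -2.0 * math.log(p)


chi_square_value = 23.0              # value reported in the text
critical = chi2_critical_df2(0.05)   # significance level p = 0.05
reject_null = chi_square_value > critical
print(round(critical, 3), reject_null)
```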
# -*- coding: utf-8 -*-
# python 3.5.0
# Chi-square calculation
__author__ = 'HZC'

import sqlalchemy
import pandas as pd

SEP = "=========================================================="


class ChiSquare:
    def __init__(self, d):
        self.engine = sqlalchemy.create_engine(
            "mssql+pymssql://%s:%s@%s/%s" % (d['user'], d['pwd'], d['ins'], d['db']))

    def get_df_from_query(self, sql):
        df = pd.read_sql_query(sql, self.engine)
        return df

    # Degrees of freedom; df still contains the 'col_total' row and 'row_total' column
    def get_variance(self, df):
        row_count = df.shape[0] - 1
        col_count = df.shape[1] - 1
        v = (row_count - 1) * (col_count - 1)
        return v

    # Chi-square distance between the observed (df1) and expected (df2) tables
    def get_chi_square_value(self, df1, df2):
        df1 = df1.drop(['col_total']).drop('row_total', axis=1)
        df2 = df2.drop(['col_total']).drop('row_total', axis=1)
        # frequencies are truncated to integers before comparison
        mtr1 = df1.astype(int).values
        mtr2 = df2.astype(int).values
        mtr = ((mtr1 - mtr2) ** 2) / mtr2
        return mtr.sum()

    # Classification frequencies; col_result is expected to hold the values 0 and 1
    def get_classification(self, table_name, col_result, col_pred):
        sql = "select %s,%s from %s" % (col_result, col_pred, table_name)
        df = self.get_df_from_query(sql)
        # count the rows for each (result, category) pair
        df = df.groupby([col_result, col_pred]).size().reset_index(name='count')
        df = pd.pivot_table(df, values='count', index=col_result, columns=col_pred)
        df['row_total'] = df.sum(axis=1)
        df.loc['ratio(%)'] = df.loc[0] * 100 / df.loc[1]
        print(SEP)
        print("Coarse classification of raw data (categories with similar ratios can be merged):")
        print(SEP)
        print(df.astype(int))
        df = df.drop(['ratio(%)'])
        df.loc['col_total'] = df.sum(axis=0)
        print(SEP)
        print("Classification frequency summary (observed values):")
        print(SEP)
        print(df.astype(int))
        df2 = df.copy()
        total = df2.loc['col_total', 'row_total']
        # expected cell = column total * row total / grand total
        for col in df2:
            df2[col] = df2.loc['col_total', col] * df2['row_total'] / total
        df2 = df2.drop(['col_total'])
        df2.loc['col_total'] = df2.sum(axis=0)
        print(SEP)
        print("Expected frequency distribution (theoretical values):")
        print("The larger the difference from the table above, the less independent the two variables, the stronger the dependence, and the better the coarse classification.")
        print(SEP)
        print(df2.astype(int))
        print(SEP)
        x = self.get_chi_square_value(df, df2)  # order: (observed df, expected df)
        v = self.get_variance(df2)  # v = (number of rows - 1) * (number of columns - 1)
        print("Chi-square value: χ2 = %s" % x)
        print("Degrees of freedom: v = %s" % v)
        print(SEP)


if __name__ == "__main__":
    conn = {'user': 'User name', 'pwd': 'Password', 'ins': 'Example', 'db': 'Database'}
    cs = ChiSquare(conn)
    # the column names below are placeholders; substitute the actual column names
    cs.get_classification("V_clientinfoall", "whether to return money", "Marital Status")
    # cs.get_classification(table or view, binary result column, category column)
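For readers without access to the SQL Server view, the observed frequency table can be sketched from an in-memory DataFrame with `pd.crosstab`. The rows and the column names `repaid` and `marital_status` are hypothetical stand-ins for the query result, not the article's data.

```python
import pandas as pd

# Hypothetical stand-in for the query result:
# a 0/1 result column and a category column.
df = pd.DataFrame({
    "repaid": [0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
    "marital_status": ["married", "unmarried", "divorced", "married", "married",
                       "unmarried", "married", "divorced", "unmarried", "married"],
})

# Observed frequencies, with row and column totals in the "All" margins.
observed = pd.crosstab(df["repaid"], df["marital_status"], margins=True)
print(observed)
```

The "All" row and column play the same role as the col_total row and row_total column in the script.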
The script analyzes the relationship between only two variables at a time. Its output is as follows:
Python chi-square test