Most data mining algorithms rely on numeric or categorical features. This section covers extracting such features from a data set and selecting the best ones.
Features are the raw material of modeling, and a model represents reality in an approximate form that data mining algorithms can understand.
Another advantage of feature selection is that, by reducing the complexity of the real world, it produces a model that is easier to work with than reality itself.
Feature Selection
The VarianceThreshold transformer in scikit-learn can be used to remove features whose variance does not reach a minimum threshold.
import numpy as np

X = np.arange(30).reshape((10, 3))  # data set of 10 individuals with 3 features
print(X)
X[:, 1] = 1  # set every value in the second column to 1
print(X)

from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold()  # VarianceThreshold transformer; use it to process the data set
Xt = vt.fit_transform(X)
print(Xt)             # the second column has disappeared
print(vt.variances_)  # variance of each column

Results:

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]
 [15 16 17]
 [18 19 20]
 [21 22 23]
 [24 25 26]
 [27 28 29]]
[[ 0  1  2]
 [ 3  1  5]
 [ 6  1  8]
 [ 9  1 11]
 [12  1 14]
 [15  1 17]
 [18  1 20]
 [21  1 23]
 [24  1 26]
 [27  1 29]]
[[ 0  2]
 [ 3  5]
 [ 6  8]
 [ 9 11]
 [12 14]
 [15 17]
 [18 20]
 [21 23]
 [24 26]
 [27 29]]
[74.25  0.   74.25]
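By default VarianceThreshold only drops zero-variance columns. Its threshold parameter (part of the scikit-learn API) lets you discard any feature whose variance does not exceed a chosen cut-off. A minimal sketch on the same toy data set, with the cut-off of 50 chosen arbitrarily for illustration:

from sklearn.feature_selection import VarianceThreshold

# keep only columns whose variance exceeds 50; columns 0 and 2 have
# variance 74.25 and survive, the constant middle column is removed
vt_strict = VarianceThreshold(threshold=50)
Xt_strict = vt_strict.fit_transform(X)
print(Xt_strict.shape)  # (10, 2)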
Example: predicting whether a person earns more than $50,000 a year from the Adult dataset, using features to model a complex real-world problem.
import os
import pandas as pd

data_folder = os.path.join(os.getcwd(), "Data", "Adult")
adult_filename = os.path.join(data_folder, "adult.data.txt")
adult = pd.read_csv(adult_filename, header=None,
                    names=["Age", "Work-Class", "fnlwgt",
                           "Education", "Education-Num",
                           "Marital-Status", "Occupation",
                           "Relationship", "Race", "Sex",
                           "Capital-gain", "Capital-loss",
                           "Hours-per-week", "Native-Country",
                           "Earnings-Raw"])
# drop rows consisting entirely of invalid values (inplace=True modifies the
# current data frame instead of creating a new one)
adult.dropna(how='all', inplace=True)
# print(adult["Work-Class"].unique())  # the data frame's unique method lists all distinct work classes
# discretization: turn the continuous hours-per-week value into a categorical feature
adult["LongHours"] = adult["Hours-per-week"] > 40

# test the performance of individual features on the Adult dataset
X = adult[["Age", "Education-Num", "Capital-gain",
           "Capital-loss", "Hours-per-week"]].values
y = (adult["Earnings-Raw"] == ' >50K').values  # note the leading space in the raw data

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# SelectKBest transformer scored with the chi-squared function
transformer = SelectKBest(score_func=chi2, k=3)
Xt_chi2 = transformer.fit_transform(X, y)  # preprocess and transform the data set in one call
print(transformer.scores_)  # relevance score of each column

from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in scikit-learn >= 0.20

clf = DecisionTreeClassifier(random_state=14)
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy')
print(scores_chi2)
Results:
[8.60061182e+03 2.40142178e+03 8.21924671e+07 1.37214589e+06
 6.47640900e+03]
[0.82577851 0.82992445 0.83009306]  # accuracy reaches about 83%
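To see which three of the five candidate features the chi-squared scorer kept, you can ask the fitted transformer directly; get_support is part of scikit-learn's feature-selector interface. A minimal sketch, reusing transformer from the listing above:

feature_names = ["Age", "Education-Num", "Capital-gain",
                 "Capital-loss", "Hours-per-week"]
# get_support(indices=True) returns the indices of the selected columns
chosen = transformer.get_support(indices=True)
print([feature_names[i] for i in chosen])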
Creating Features
Strong correlation between features, that is, feature redundancy, increases the difficulty of algorithmic processing. For this reason we create new features from the existing ones, as in the PCA example below.
import os
import numpy as np
import pandas as pd

data_folder = os.path.join(os.getcwd(), "Data")
data_filename = os.path.join(data_folder, "Adult", "ad.data.txt")

# The first few features are numeric, but pandas will read them in as
# strings. To fix this, write a converter that turns a string containing a
# number into that number and everything else into NaN.
def convert_number(x):
    try:
        return float(x)
    except ValueError:
        return np.nan

# map every feature column to the numeric converter (the dictionary must be
# filled in before read_csv is called) ...
converters = {i: convert_number for i in range(1558)}
# ... and convert the class value in the last column from string to 1/0
converters[1558] = lambda x: 1 if x.strip() == "ad." else 0

ads = pd.read_csv(data_filename, header=None, converters=converters)
# print(ads[:5])
ads.dropna(inplace=True)  # delete empty rows

# extract the X matrix and y array for the classification algorithm
X = ads.drop(1558, axis=1).values
y = ads[1558]

from sklearn.decomposition import PCA

# Principal Component Analysis (PCA) finds combinations of features that
# describe the data set with less information. A model built on the
# PCA-transformed data not only approximates the original data set but can
# also improve accuracy on the classification task.
pca = PCA(n_components=5)
Xd = pca.fit_transform(X)
np.set_printoptions(precision=3, suppress=True)
print(pca.explained_variance_ratio_)  # proportion of the variance explained by each component

from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in scikit-learn >= 0.20

clf = DecisionTreeClassifier(random_state=14)
scores_reduced = cross_val_score(clf, Xd, y, scoring='accuracy')
print(scores_reduced)

# plot the first two features returned by PCA
from matplotlib import pyplot as plt

classes = set(y)
colors = ['red', 'green']
for cur_class, color in zip(classes, colors):
    mask = (y == cur_class).values
    plt.scatter(Xd[mask, 0], Xd[mask, 1], marker='o',
                color=color, label=int(cur_class))
plt.legend()
plt.show()

Results:

[0.854 0.145 0.001 0.    0.   ]
[0.944 0.924 0.925]
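Each principal component returned by PCA is a weighted combination of the 1,558 original features, and those weights are exposed through the fitted transformer's components_ attribute. A minimal sketch, reusing pca from the listing above, that lists the original features contributing most to the first component:

import numpy as np

# row 0 of components_ holds the weights of the first principal component
first_component = pca.components_[0]
# indices of the five original features with the largest absolute weights
top_features = np.argsort(np.abs(first_component))[::-1][:5]
print(top_features)                   # which original columns dominate the component
print(first_component[top_features])  # their weights in the first component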