Python Data Mining (Extracting Features from a Data Set)

Source: Internet
Author: User

Most data mining algorithms rely on numeric or categorical features. The work therefore starts with extracting numeric and categorical features from a data set and then selecting the best of them.

Features are used to build models, and a model represents reality in an approximate way that data mining algorithms can understand.

Another advantage of feature selection is that it reduces the complexity of the real world, so the model is easier to work with than reality itself.

Feature Selection

The VarianceThreshold transformer in scikit-learn removes features whose variance does not reach a minimum threshold.

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    X = np.arange(30).reshape((10, 3))  # data set with 10 samples and 3 features
    print(X)
    X[:, 1] = 1  # set every value in the second column to 1
    print(X)

    vt = VarianceThreshold()  # VarianceThreshold transformer, used to process the data set
    Xt = vt.fit_transform(X)
    print(Xt)  # the second column disappears
    print(vt.variances_)  # variance of each column

Results:

    [[ 0  1  2]
     [ 3  4  5]
     [ 6  7  8]
     [ 9 10 11]
     [12 13 14]
     [15 16 17]
     [18 19 20]
     [21 22 23]
     [24 25 26]
     [27 28 29]]
    [[ 0  1  2]
     [ 3  1  5]
     [ 6  1  8]
     [ 9  1 11]
     [12  1 14]
     [15  1 17]
     [18  1 20]
     [21  1 23]
     [24  1 26]
     [27  1 29]]
    [[ 0  2]
     [ 3  5]
     [ 6  8]
     [ 9 11]
     [12 14]
     [15 17]
     [18 20]
     [21 23]
     [24 26]
     [27 29]]
    [74.25  0.   74.25]
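What the transformer does can be approximated with plain NumPy: compute the variance of each column and keep only the columns above the threshold. The sketch below is an illustration under that assumption, not scikit-learn's implementation; `drop_low_variance` is a hypothetical helper name.

```python
import numpy as np

def drop_low_variance(X, threshold=0.0):
    """Keep only the columns of X whose variance is above `threshold`.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    variances = X.var(axis=0)      # population variance of each column
    keep = variances > threshold   # boolean mask of the columns to keep
    return X[:, keep], variances

X = np.arange(30).reshape((10, 3))
X[:, 1] = 1                        # make the second column constant
Xt, variances = drop_low_variance(X)
print(variances)
print(Xt.shape)
```

Raising `threshold` above zero would also drop columns that vary only slightly, which is the same knob the `threshold` parameter of VarianceThreshold exposes.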

Example: using the Adult dataset to predict whether a person earns more than $50,000 a year, with features standing in for a complex real-world situation.

    import os
    import pandas as pd

    data_folder = os.path.join(os.getcwd(), 'Data', 'Adult')
    adult_filename = os.path.join(data_folder, 'Adult.data.txt')
    adult = pd.read_csv(adult_filename, header=None,
                        names=["Age", "Work-class", "FNLWGT",
                               "Education", "Education-num",
                               "Marital-status", "Occupation",
                               "Relationship", "Race", "Sex",
                               "Capital-gain", "Capital-loss",
                               "Hours-per-week", "Native-country",
                               "Earnings-raw"])
    # Drop rows that are entirely empty. inplace=True changes the current
    # data frame instead of creating a new one.
    adult.dropna(how='all', inplace=True)
    # print(adult["Work-class"].unique())  # unique() on a column lists all work classes

    # Discretization: turn a continuous feature into a categorical one.
    adult["longhours"] = adult["Hours-per-week"] > 40

    # Test how well individual features perform on the Adult dataset.
    X = adult[["Age", "Education-num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values
    # The raw UCI file separates fields with ", ", so the label keeps a leading space.
    y = (adult["Earnings-raw"] == ' >50K').values

    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2

    # Initialize a SelectKBest transformer that scores features with the chi-squared function.
    transformer = SelectKBest(score_func=chi2, k=3)
    # fit_transform scores the features and keeps the three best columns.
    Xt_chi2 = transformer.fit_transform(X, y)
    print(transformer.scores_)  # chi-squared score of each column

    from sklearn.tree import DecisionTreeClassifier
    # cross_val_score lives in sklearn.model_selection; the old
    # sklearn.cross_validation module has been removed.
    from sklearn.model_selection import cross_val_score

    clf = DecisionTreeClassifier(random_state=14)
    # cv=3 matches the three scores shown below (older scikit-learn versions defaulted to 3 folds).
    scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy', cv=3)
    print(scores_chi2)

Results:

    [8.60061182e+03 2.40142178e+03 8.21924671e+07 1.37214589e+06
     6.47640900e+03]
    [0.82577851 0.82992445 0.83009306]  # accuracy reaches about 83%
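To make the scores above less opaque, here is a rough NumPy sketch of a chi-squared score for non-negative features against a binary target: it compares the per-class sum of each feature with the sum expected if the feature were independent of the class. It follows the idea behind the chi-squared scoring function but is not scikit-learn's code; `chi2_scores` and the toy data are made up for illustration, and the sketch assumes no feature column sums to zero.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared score of each non-negative feature against binary labels.

    Hypothetical helper for illustration; assumes every column has a non-zero sum.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)
    Y = np.column_stack([~y, y]).astype(float)          # one indicator column per class
    observed = Y.T @ X                                  # per-class sum of each feature
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))  # sums expected under independence
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: the first feature separates the classes, the second is constant.
X = np.array([[5, 1], [5, 1], [0, 1], [0, 1]])
y = np.array([True, True, False, False])

scores = chi2_scores(X, y)
top = np.argsort(scores)[::-1][:1]  # index of the single best-scoring feature
print(scores)
print(top)
```

A feature whose values are spread evenly across the classes scores zero, while a feature concentrated in one class scores high, which is why selecting the k highest scores keeps the most class-dependent columns.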

Creating Features

Strong correlation between features, or outright redundant features, makes the data harder for algorithms to process. For this reason, it can help to create new features from the existing ones.
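One simple way to spot the kind of redundancy described here is to look at pairwise correlations between feature columns. The snippet below is a minimal sketch on made-up numbers; the feature names and the 0.95 cutoff are assumptions chosen for illustration.

```python
import numpy as np

# Made-up toy features: `minutes` carries exactly the same information as `hours`.
hours = np.array([20.0, 35.0, 40.0, 45.0, 60.0])
minutes = hours * 60
age = np.array([52.0, 23.0, 41.0, 30.0, 36.0])

features = np.column_stack([hours, minutes, age])
corr = np.corrcoef(features, rowvar=False)  # 3x3 correlation matrix between columns

# Report column pairs whose absolute correlation exceeds an arbitrary cutoff.
n = corr.shape[0]
redundant = [(i, j) for i in range(n) for j in range(i + 1, n)
             if abs(corr[i, j]) > 0.95]
print(np.round(corr, 3))
print(redundant)
```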

    import os
    import numpy as np
    import pandas as pd

    data_folder = os.path.join(os.getcwd(), "Data")
    data_filename = os.path.join(data_folder, "Adult", "Ad.data.txt")

    # The first few features are numeric, but pandas will treat them as strings.
    # To fix this, we write a function that converts a string containing only a
    # number into a number and converts everything else to NaN.
    def convert_number(x):
        try:
            return float(x)
        except ValueError:
            return np.nan

    # Register the numeric converter for every feature column.
    converters = {i: convert_number for i in range(1558)}
    # Convert the class value from a string to a number.
    converters[1558] = lambda x: 1 if x.strip() == "ad." else 0

    ads = pd.read_csv(data_filename, header=None, converters=converters)
    # print(ads[:5])
    ads.dropna(inplace=True)  # delete rows with missing values

    # Extract the X matrix and the y array for the classification algorithm.
    X = ads.drop(1558, axis=1).values
    y = ads[1558]

    from sklearn.decomposition import PCA

    # The purpose of principal component analysis (PCA) is to find a small set of
    # feature combinations that describes the data set with less information.
    # A model built on the PCA-transformed data can approximate the original
    # data set and may even improve accuracy on the classification task.
    pca = PCA(n_components=5)
    Xd = pca.fit_transform(X)
    np.set_printoptions(precision=3, suppress=True)
    print(pca.explained_variance_ratio_)  # share of variance explained by each component

    from sklearn.tree import DecisionTreeClassifier
    # cross_val_score lives in sklearn.model_selection; the old
    # sklearn.cross_validation module has been removed.
    from sklearn.model_selection import cross_val_score

    clf = DecisionTreeClassifier(random_state=14)
    # cv=3 matches the three scores shown below (older scikit-learn versions defaulted to 3 folds).
    scores_reduced = cross_val_score(clf, Xd, y, scoring='accuracy', cv=3)
    print(scores_reduced)

    # Plot the first two features returned by PCA.
    from matplotlib import pyplot as plt

    classes = set(y)
    colors = ['red', 'green']
    for cur_class, color in zip(classes, colors):
        mask = (y == cur_class).values
        plt.scatter(Xd[mask, 0], Xd[mask, 1], marker='o', color=color, label=int(cur_class))
    plt.legend()
    plt.show()

Results:

    [0.854 0.145 0.001 0.    0.   ]
    [0.944 0.924 0.925]
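The explained variance ratios printed above can be reproduced in principle with a few lines of NumPy: center the data, take a singular value decomposition, and divide each squared singular value by their total. The sketch below assumes this standard SVD formulation; `pca_sketch` and the synthetic data are hypothetical and only meant to show the mechanics, not to replace scikit-learn's PCA.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Project X onto its first principal components using an SVD.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    Xc = X - X.mean(axis=0)                            # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are the components
    explained = s ** 2 / np.sum(s ** 2)                # share of variance per component
    Xd = Xc @ Vt[:n_components].T                      # coordinates in the reduced space
    return Xd, explained[:n_components]

# Synthetic data: two perfectly correlated columns plus a little noise.
rng = np.random.default_rng(14)
base = rng.normal(size=(100, 1))
X = np.hstack([base, 2 * base, rng.normal(scale=0.1, size=(100, 1))])

Xd, ratio = pca_sketch(X, n_components=2)
print(Xd.shape)
print(ratio)
```

Because two of the three columns carry the same information, almost all of the variance ends up in the first component, which mirrors how a handful of components can summarize the 1,558 columns of the ads data.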
