Machine learning: automatically filtering features

Select Attributes (the Weka book translates "features" as "attributes", so the attributes discussed here are really just features)
The purpose of attribute selection:
to search through all possible combinations of attributes in the data and find the subset of attributes that predicts best.
That is, suppose that with the current 6 attributes the accuracy is 80%, but with only 5 of them (a subset) the accuracy rises to 90%; the smaller subset is then the better choice.
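A minimal sketch of this idea in sklearn (not part of Weka; the iris data, the logistic-regression classifier and the hand-picked column subset below are only illustrative assumptions, and the exact accuracies will differ) compares the cross-validated accuracy with all attributes against a subset:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

dataset = datasets.load_iris()
model = LogisticRegression(max_iter=1000)

# accuracy with all 4 attributes
print(cross_val_score(model, dataset.data, dataset.target, cv=5).mean())
# accuracy with a 2-attribute subset (columns 2 and 3: petal length and width)
print(cross_val_score(model, dataset.data[:, [2, 3]], dataset.target, cv=5).mean())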

Attribute selection is different from PCA dimensionality reduction.
Attribute selection and PCA share the same goal: both reduce the number of features and therefore the amount of computation.
However, PCA compresses the original features by mapping them into a new space, which makes the original features unrecognizable and produces new ones. In other words, if you start with 20 features, after PCA
you might end up with 5 new features (and you cannot tell which original features these 5 represent). These 5 features may not completely replace the original 20, but if they retain, say, 98% of the information, that is usually an acceptable trade-off.
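To make the contrast concrete, here is a minimal sklearn sketch (assuming the iris data; the component count is arbitrary) that compresses the original features into 2 new PCA components and reports how much of the variance, i.e. the "information", they retain:

from sklearn import datasets
from sklearn.decomposition import PCA

dataset = datasets.load_iris()
pca = PCA(n_components=2)                     # compress the 4 original features into 2 new components
data_new = pca.fit_transform(dataset.data)
print(data_new.shape)                         # (150, 2)
print(pca.explained_variance_ratio_.sum())    # fraction of the variance retained, roughly 0.98 on iris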

In the Weka Explorer there is a Select Attributes tab for attribute selection.
After entering Select Attributes
you will find two settings, Attribute Evaluator and Search Method, which must be used in combination.

Attribute evaluators
Subset evaluators
CfsSubsetEval evaluates each attribute's predictive ability together with its redundancy, preferring attributes that are highly correlated with the class attribute but have little correlation with one another.
WrapperSubsetEval evaluates attribute sets with a classifier (any classification algorithm can be used), estimating the quality of each subset with cross-validation.
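WrapperSubsetEval itself is a Weka component, but the same wrapper idea can be sketched in sklearn with SequentialFeatureSelector (available in scikit-learn 0.24 and later); the classifier and parameter values below are only illustrative assumptions:

from sklearn import datasets
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

dataset = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=3)                     # any classifier can be wrapped
sfs = SequentialFeatureSelector(knn, n_features_to_select=2)  # greedy forward search, each subset scored by cross-validation
sfs.fit(dataset.data, dataset.target)
print(sfs.get_support())                                      # boolean mask of the selected attributes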

Single-attribute evaluators
ReliefFAttributeEval is an instance-based evaluator: it repeatedly samples instances at random and examines neighboring instances of the same and of different classes. It works on
both discrete-class and continuous-class data. Its parameters include the number of instances to sample, the number of neighbors to check, whether nearest neighbors are weighted by distance, and an
exponential function that controls how the weights decay with distance.

The InfoGainAttributeEval evaluator scores an attribute by measuring its information gain with respect to the class.
The GainRatioAttributeEval evaluator scores an attribute by measuring its gain ratio with respect to the class.
The SymmetricalUncertAttributeEval evaluator scores an attribute by measuring its symmetric uncertainty with respect to the class.
The OneRAttributeEval evaluator uses the accuracy measure of the OneR classifier.
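Weka computes these scores internally; as a rough sklearn analogue (an assumption for illustration, not something the tools above expose), you can rank each attribute by its mutual information with the class, which is close in spirit to information gain:

from sklearn import datasets
from sklearn.feature_selection import mutual_info_classif
import numpy as np

dataset = datasets.load_iris()
mi = mutual_info_classif(dataset.data, dataset.target, random_state=0)  # mutual information between each attribute and the class
print(mi)
print(np.argsort(mi)[::-1])   # attribute indices ranked from most to least informative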

Search methods
The search method traverses the attribute space looking for good subsets, and the evaluator measures the quality of each candidate subset.
The BestFirst search method performs greedy hill climbing with backtracking.
The GreedyStepwise search method greedily searches the space of attribute subsets.

Ranker is not a method for searching attribute subsets; instead, it ranks individual attributes.

How to automatically filter features in sklearn
There are two approaches, and sklearn has a module for each:
1. sklearn.feature_selection
2. sklearn.ensemble (the ensemble module; tree-based models expose a feature_importances_ attribute, and you can filter features by sorting on it)

sklearn.feature_selection

# coding=utf-8
# Created on 2018-3-1
# RFE and RFECV
# Function: feature selection by filtering subsets with a classifier
# For parameter details see
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV
# One drawback of this method: when the data set is very large but memory is small, there is no sharded (partial-fit) training

from sklearn import datasets
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

dataset = datasets.load_iris()                 # load data
print(dataset.data.shape)                      # (150, 4)

model = LogisticRegression()                   # build the classifier
rfe = RFE(model)
rfe = rfe.fit(dataset.data, dataset.target)
print(rfe.n_features_)                         # number of selected features
# 2
print(rfe.support_)                            # which features were selected
# [False  True False  True] -> the 2nd and 4th features were selected
print(rfe.ranking_)                            # feature ranking
# [3 1 2 1] -> two features are tied for first place
print(rfe.get_support(True))                   # indices of the selected features
# [1 3]

# RFECV: compared with RFE, the classifier's result is validated with cross-validation
rfecv = RFECV(model)
rfecv = rfecv.fit(dataset.data, dataset.target)
print(rfecv.n_features_)                       # number of selected features
# 3
print(rfecv.support_)                          # which features were selected
# [False  True  True  True] -> the 2nd, 3rd and 4th features were selected
print(rfecv.ranking_)                          # feature ranking
# [2 1 1 1] -> three features are tied for first place

# SelectPercentile
# Function: compute a score for every feature; the higher the score, the better the feature
# For parameter details see
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile

from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np

# f_classif is one of several built-in scoring functions; keep the 10% most significant features
selectorper = SelectPercentile(f_classif, percentile=10)
selectorper = selectorper.fit(dataset.data, dataset.target)
print(selectorper.scores_)                     # feature scores
# [ 119.26450218   47.3644614  1179.0343277   959.32440573]
print(selectorper.pvalues_)                    # feature p-values
# [1.66966919e-31 1.32791652e-16 3.05197580e-91 4.37695696e-85]
scores = -np.log10(selectorper.pvalues_)
print(scores)
# [30.77736957 15.87682923 90.51541891 84.35882772]
scores /= scores.max()
print(scores)
# [0.3400235  0.17540469 1.         0.93198296]

# SelectKBest
# Function: select the k highest-scoring features

from sklearn.feature_selection import SelectKBest

# f_classif is a built-in scoring function; keep the top k highest-scoring features
selectkb = SelectKBest(f_classif, k=2)
selectkb = selectkb.fit(dataset.data, dataset.target)
print(selectkb.scores_)                        # feature scores
# [ 119.26450218   47.3644614  1179.0343277   959.32440573]
print(selectkb.pvalues_)                       # feature p-values
# [1.66966919e-31 1.32791652e-16 3.05197580e-91 4.37695696e-85]
print(selectkb.get_support())                  # which features were selected
# [False False  True  True]
print(selectkb.get_support(True))              # indices of the selected features
# [2 3]  (note: these indices are not sorted by score)

sklearn.ensemble

# coding=utf-8
# Created on 2018-3-1

from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

dataset = datasets.load_iris()            # load data
model = ExtraTreesClassifier()            # build an extra-trees (decision-tree ensemble) model
model.fit(dataset.data, dataset.target)
print(model.feature_importances_)         # importance of each feature
# [0.13826514  0.09158596  0.28037233  0.48977657]
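
The example above only prints the importances; to actually filter the data with them, one option (a sketch using sklearn's SelectFromModel, whose default threshold for tree-based models is the mean importance) looks like this:

from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

dataset = datasets.load_iris()
selector = SelectFromModel(ExtraTreesClassifier(random_state=0))  # fits the tree ensemble internally
selector.fit(dataset.data, dataset.target)
data_reduced = selector.transform(dataset.data)   # keep only attributes whose importance exceeds the mean
print(selector.get_support())                     # which attributes survive the threshold
print(data_reduced.shape)                         # (150, k) with k <= 4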
