Feature Selection, Part One: The Filter Approach

Source: Internet
Author: User

This section covers only methods that select a subset of the original features. Methods that map a high-dimensional feature space to a low-dimensional one (such as PCA) are not covered.

I. Univariate methods

Advantages: Fast to compute and independent of the classifier.

Disadvantages: They ignore relationships between features and ignore any interaction with the classifier (the selection cannot be adjusted to improve performance while the model is being trained).

1. Chi-square test

The main content is adapted from http://blog.sina.com.cn/s/blog_6622f5c30101datu.html

The idea of the chi-square test is to judge whether a hypothesis holds by measuring the deviation between the observed values and the theoretical (expected) values. The null hypothesis H0 states that the observed values do not differ from the theoretical values. We first assume that H0 is true and, on that basis, compute the chi-square value, which measures the degree of deviation between the observed and theoretical values. Using the chi-square distribution and the degrees of freedom, we then obtain the probability p of seeing the current statistic, or a more extreme one, when H0 holds. If the p-value is small, the null hypothesis should be rejected. Otherwise, the null hypothesis cannot be rejected.

In feature selection, we take the null hypothesis H0 to be: a given feature is unrelated to category C. The larger the computed chi-square value, the stronger the evidence against H0, that is, the more strongly the feature is associated with category C and the more important the feature is.
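The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the coin-flip counts are hypothetical, the function names are made up for this example, and the p-value helper is valid only for 1 degree of freedom (where the upper tail of the chi-square distribution equals erfc(sqrt(x/2))).

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_one_dof(chi2):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical data: 100 coin flips; H0 says the coin is fair.
observed = [60, 40]   # heads, tails actually seen
expected = [50, 50]   # heads, tails expected under H0

chi2 = chi_square_statistic(observed, expected)
p = p_value_one_dof(chi2)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

With a conventional threshold of 0.05 (an assumption, not something the text above fixes), a p-value below the threshold would lead to rejecting H0.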

Feature selection     | Belongs to DNA-binding protein | Does not belong to DNA-binding protein | Total
Contains "AA"         | A                              | B                                      | A+B
Does not contain "AA" | C                              | D                                      | C+D
Total                 | A+C                            | B+D                                    | N

A: The number of DNA-binding proteins that contain the AA fragment

B: The number of non-DNA-binding proteins that contain the AA fragment

C: The number of DNA-binding proteins that do not contain the AA fragment

D: The number of non-DNA-binding proteins that do not contain the AA fragment

Null hypothesis: the AA fragment is unrelated to DNA-binding proteins.

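Under the null hypothesis, the proportion of DNA-binding proteins that contain AA should be the same as the proportion of all samples that contain AA, namely (A+B)/N. The theoretical value for the first cell is therefore:

$$E_{11} = (A + C) \cdot \frac{A + B}{N}$$

and the deviation of that cell is:

$$D_{11} = \frac{(A - E_{11})^2}{E_{11}}$$

Similarly, you can calculate D_{12}, D_{21}, and D_{22}. Their sum is the chi-square value:

$$\chi^2 = D_{11} + D_{12} + D_{21} + D_{22} = \frac{N (AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)}$$

Because we only need relative values for ranking features, and N, A+C, and B+D are the same for every feature in a given data set, these factors can be dropped:

$$\chi^2 \propto \frac{(AD - BC)^2}{(A + B)(C + D)}$$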
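As a rough check on this calculation, the sketch below computes the chi-square value for a 2x2 table in three ways: cell by cell from the theoretical values, with the standard closed form N(AD-BC)^2 / ((A+B)(C+D)(A+C)(B+D)), and with a relative score that drops the factors shared by all features. The fragment names and counts are hypothetical, chosen only so that both fragments come from the same data set (100 DNA-binding and 100 non-DNA-binding proteins).

```python
def expected_counts(a, b, c, d):
    """Theoretical cell counts under H0, in the order (A, B, C, D)."""
    n = a + b + c + d
    return ((a + c) * (a + b) / n, (b + d) * (a + b) / n,
            (a + c) * (c + d) / n, (b + d) * (c + d) / n)

def chi_square_2x2(a, b, c, d):
    """Chi-square value as the sum of per-cell deviations (O - E)^2 / E."""
    observed = (a, b, c, d)
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected_counts(a, b, c, d)))

def chi_square_closed_form(a, b, c, d):
    """Equivalent closed form for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def relative_score(a, b, c, d):
    """Score with N, A+C and B+D dropped; valid only for ranking
    features computed on the same data set."""
    return (a * d - b * c) ** 2 / ((a + b) * (c + d))

# Hypothetical counts (A, B, C, D) for two fragments.
tables = {"AA": (40, 10, 60, 90), "GK": (30, 25, 70, 75)}

by_full = sorted(tables, key=lambda f: chi_square_2x2(*tables[f]), reverse=True)
by_relative = sorted(tables, key=lambda f: relative_score(*tables[f]), reverse=True)
print(by_full, by_relative)
```

Both orderings agree because the relative score differs from the full chi-square value only by a factor that is constant across features.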

Comment: The calculation records only whether a fragment is present in a protein, not how many times it occurs. As a result, a fragment that occurs once in every sample of a certain type of protein gets a larger chi-square value than a fragment that occurs 10 times in each of 99% of the samples of that type. This is the "low-frequency word defect".

I think chi-square tests can be applied in bioinformatics to test how a particular feature relates to a specific category. But there is a question: won't both positive and negative samples contain the AA fragment, differing only in how often it occurs? If so, this method will not work well, because it does not take into account the frequency of a feature within a protein sequence. I still think the method is worth investigating; we need to check our group's feature extraction method to see whether it is applicable.
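The defect can be reproduced with a small, entirely hypothetical data set: 100 positive and 100 negative samples, a fragment X that occurs once in every positive sample, and a fragment Y that occurs 10 times in 99 of the positive samples. The helper names below are invented for this sketch, and neither fragment is assumed to appear in any negative sample.

```python
def presence_table(pos_counts, neg_counts):
    """Build (A, B, C, D) from per-sample occurrence counts.
    Only presence (count > 0) is used; the counts themselves are discarded."""
    a = sum(1 for k in pos_counts if k > 0)
    b = sum(1 for k in neg_counts if k > 0)
    c = len(pos_counts) - a
    d = len(neg_counts) - b
    return a, b, c, d

def chi_square_2x2(a, b, c, d):
    """Closed-form chi-square value for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

negatives = [0] * 100              # neither fragment occurs in negative samples
x_positives = [1] * 100            # X: once in every positive sample
y_positives = [10] * 99 + [0]      # Y: 10 times in 99% of positive samples

chi_x = chi_square_2x2(*presence_table(x_positives, negatives))
chi_y = chi_square_2x2(*presence_table(y_positives, negatives))
print(f"X: {chi_x:.2f}  Y: {chi_y:.2f}")
```

X scores higher than Y even though Y occurs roughly ten times as often overall, because the table only sees presence or absence.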
