This section covers only methods that select a subset of features; methods that map a high-dimensional feature space to a low-dimensional one (such as PCA) are not covered.
I. Univariate
Advantages: Fast to compute; independent of the classifier
Disadvantages: Ignores relationships between features and ignores the interaction with the classifier (the selection cannot be adjusted to improve performance while training the model)
1. Chi-square test
Main content adapted from http://blog.sina.com.cn/s/blog_6622f5c30101datu.html
The idea of the chi-square test is to judge whether a theory is correct by examining the deviation between the observed values and the theoretical (expected) values. The null hypothesis H0 states that the observed values do not differ from the theoretical values. We first assume that H0 holds and, on that basis, compute the chi-square value, which measures how far the observed values deviate from the theoretical ones. From the chi-square distribution and the degrees of freedom, we then obtain the p-value: the probability, under H0, of a statistic at least as extreme as the one observed. If the p-value is small, the null hypothesis is rejected; otherwise, it cannot be rejected.
In feature selection, we take the null hypothesis H0 to be: a given feature is unrelated to category C. The larger the computed chi-square value, the more strongly the feature is related to category C, and therefore the more important the feature is.
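As a small illustration of the last step (turning a chi-square value into a p-value), here is a minimal sketch that assumes 1 degree of freedom, which is the case for a 2x2 table. The function name `chi2_p_value_1dof` is my own and not from the referenced post. It relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math


def chi2_p_value_1dof(chi2_stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    Assumption: exactly 1 degree of freedom (e.g. a 2x2 contingency table).
    For 1 dof, P(X >= x) = erfc(sqrt(x / 2)).
    """
    if chi2_stat < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(chi2_stat / 2.0))


# 3.841 is the usual 5% critical value for 1 degree of freedom.
p = chi2_p_value_1dof(3.841)
print(f"p-value: {p:.4f}")
```

A larger chi-square value gives a smaller p-value, so ranking features by chi-square value and ranking them by p-value (ascending) give the same order when the degrees of freedom are the same.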
Feature selection | Belongs to DNA-binding protein | Does not belong to DNA-binding protein | Total
Contains "AA" | A | B | A+B
Does not contain "AA" | C | D | C+D
Total | A+C | B+D | N
A: Number of DNA-binding proteins that contain the AA fragment
B: Number of non-DNA-binding proteins that contain the AA fragment
C: Number of DNA-binding proteins that do not contain the AA fragment
D: Number of non-DNA-binding proteins that do not contain the AA fragment
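The four counts can be obtained by a single pass over the labeled sequences. The following is only a sketch: the function name `contingency_counts`, the boolean label convention, and the toy sequences are all made up for illustration and do not come from any real dataset.

```python
def contingency_counts(sequences, labels, fragment="AA"):
    """Return (A, B, C, D) for presence of `fragment` versus class label.

    labels[i] is True if sequences[i] is a DNA-binding protein.
    Only presence or absence of the fragment is recorded, not its frequency.
    """
    a = b = c = d = 0
    for seq, is_binding in zip(sequences, labels):
        has_fragment = fragment in seq
        if has_fragment and is_binding:
            a += 1  # DNA-binding, contains the fragment
        elif has_fragment:
            b += 1  # not DNA-binding, contains the fragment
        elif is_binding:
            c += 1  # DNA-binding, does not contain the fragment
        else:
            d += 1  # not DNA-binding, does not contain the fragment
    return a, b, c, d


# Toy data, invented for illustration only.
sequences = ["MAAK", "GAAL", "MKLV", "AAGT", "MKGT", "LLVK"]
labels = [True, True, True, False, False, False]
print(contingency_counts(sequences, labels))
```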
Null hypothesis: the AA fragment is unrelated to DNA-binding proteins.
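Under the null hypothesis, the proportion of DNA-binding proteins that contain AA should be the same as the proportion of all samples that contain AA, (A+B)/N, so the theoretical value should be:
$$E_{11} = (A + C) \cdot \frac{A + B}{N}$$
and the deviation of the observed value from it is:
$$d_{11} = \frac{(A - E_{11})^2}{E_{11}}$$
Similarly, you can calculate d12, d21 and d22. The chi-square value is their sum:
$$\chi^2 = d_{11} + d_{12} + d_{21} + d_{22} = \frac{N (AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)}$$
For a given category, N, A+C and B+D are the same for every feature, and we only need relative values to rank the features, so:
$$\chi^2 \propto \frac{(AD - BC)^2}{(A + B)(C + D)}$$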
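To make the calculation concrete, here is a rough sketch that computes the chi-square value of a 2x2 table from A, B, C and D using the standard closed form N(AD-BC)^2 / ((A+B)(C+D)(A+C)(B+D)), together with a relative score that drops the factors shared by all features of one category. The function names and the example counts are hypothetical.

```python
def chi2_full(a, b, c, d):
    """Chi-square statistic of a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column carries no information
    return n * (a * d - b * c) ** 2 / denom


def chi2_relative(a, b, c, d):
    """Score proportional to chi2_full when N, A+C and B+D are fixed."""
    denom = (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denom


# Hypothetical features for one category with 50 positive and 50 negative samples.
features = {
    "f1": (40, 10, 10, 40),
    "f2": (30, 20, 20, 30),
    "f3": (45, 25, 5, 25),
}
by_full = sorted(features, key=lambda k: chi2_full(*features[k]), reverse=True)
by_relative = sorted(features, key=lambda k: chi2_relative(*features[k]), reverse=True)
print(by_full, by_relative)
```

Both scores rank the features in the same order here, because the class totals A+C and B+D and the sample count N are identical for every feature.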
Note: Because the calculation does not take into account how often a fragment occurs within a protein, a fragment that occurs once in every sample of a certain type of protein gets a larger chi-square value than a fragment that occurs 10 times in each of 99% of the samples of that type. This is the "low-frequency word defect".
I think the chi-square test can be applied in bioinformatics to test how a particular feature relates to a specific category. One question remains, though: won't both positive and negative samples contain the AA fragment, differing only in frequency? If so, this method will not work, because it does not take into account how often a feature occurs within a protein sequence. Still, I think the method is worth investigating; we need to check our group's feature extraction method to see whether it is applicable.
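The defect can be reproduced with a few lines. This is a sketch under made-up assumptions: 100 positive and 100 negative samples, neither fragment appearing in any negative sample, and a hypothetical helper `chi2_from_occurrences` that receives per-sample occurrence counts but, like the 2x2 table, only uses presence or absence.

```python
def chi2_from_occurrences(pos_counts, neg_counts):
    """Chi-square value of a fragment from per-sample occurrence counts.

    Only presence (count > 0) enters the 2x2 table; how many times the
    fragment occurs within a sample is discarded.
    """
    a = sum(1 for k in pos_counts if k > 0)
    b = sum(1 for k in neg_counts if k > 0)
    c = len(pos_counts) - a
    d = len(neg_counts) - b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


negatives = [0] * 100  # assumption: the fragment never occurs in negative samples
once_everywhere = chi2_from_occurrences([1] * 100, negatives)
ten_times_in_99 = chi2_from_occurrences([10] * 99 + [0], negatives)
print(once_everywhere > ten_times_in_99)
```

The fragment that appears only once per sample scores higher, even though the other fragment is far more abundant in almost all positive samples.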
Feature Selection (1): Filter approach