A few days ago, I asked a question on Weibo: "A binary classification problem with 5400 training samples, 600 test samples (the test and training sets do not overlap), and 10000-dimensional features. Training an SVM with an RBF kernel gives a test error of 50% (no better than random guessing), while a linear kernel reaches 80% accuracy. Is this normal?"
Many experts, including Yu Kai, instructor Mu, and Shan Shiguang, answered the question enthusiastically. I would like to thank them here without naming everyone one by one. Their answers covered the properties of the RBF kernel, the spatial distribution of features, geometric intuition in high dimensions, and ways of reasoning about such problems; I was dazzled and benefited a lot.
It turns out the problem was caused by an improper setting of the gamma parameter when training with the RBF kernel. My modifications and results are as follows: I applied L2 normalization to each sample's feature vector (after which a gamma value near 1 works well, and small changes to it have little impact on performance), and the accuracy with the RBF kernel rose to 86%. The linear kernel without normalization reached 84% (the features were adjusted relative to the version described in the Weibo post, which is why the linear kernel's performance also improved; note also that L2 normalization changes the relative values of the same feature across different samples, so the linear version was not L2-normalized).
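The normalize-then-train recipe above can be sketched with scikit-learn. This is a minimal illustration on synthetic stand-in data (the post's actual 5400/600-sample, 10000-dimensional features are not available, so the shapes here are shrunk); the key step is the per-sample L2 normalization, after which gamma ≈ 1 is a sensible scale.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

# Synthetic stand-in for the post's data (dimensions shrunk so this runs fast).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(100, 50))
y_test = (X_test[:, 0] > 0).astype(int)

# L2-normalize each sample's feature vector. Every row then has unit norm,
# so pairwise squared distances lie in [0, 4] and gamma near 1 matches the
# scale of the data, as described in the post.
X_train_n = normalize(X_train, norm="l2")
X_test_n = normalize(X_test, norm="l2")

# Train an RBF-kernel SVM with gamma = 1 on the normalized features.
clf = SVC(kernel="rbf", gamma=1.0).fit(X_train_n, y_train)
acc = clf.score(X_test_n, y_test)
```

Without the normalization, distances in a 10000-dimensional raw feature space can be large, and a fixed gamma is then far too wide or too narrow for the kernel; normalization pins the distance scale so gamma becomes easy to choose.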
What you need to pay attention to here is the physical meaning of gamma. Gamma controls the width of the RBF kernel: it determines the range of influence of the Gaussian centered on each support vector, and therefore the generalization performance. My understanding: if gamma is too large, each support vector only influences its immediate neighborhood, so the classifier handles unknown samples very poorly; training accuracy can be very high while test accuracy stays low, which is what we usually call overfitting. If gamma is too small, the kernel over-smooths: the model cannot reach a particularly high accuracy even on the training set, which in turn hurts test accuracy.

Also note the relationship between sigma and gamma in the RBF formula. The RBF kernel is typically defined as K(x, z) = exp(-d(x, z)^2 / (2 * sigma^2)), which can be rewritten in terms of gamma as K(x, z) = exp(-gamma * d(x, z)^2), where gamma = 1 / (2 * sigma^2).

In addition, there are two clear conclusions.

Conclusion 1: having fewer samples than feature dimensions does not necessarily lead to overfitting. See Yu Kai's comment: "That's not the reason. With an RBF kernel, the effective dimension of the system does not exceed the number of samples, and has no trivial relationship with the feature dimension."
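The sigma/gamma equivalence stated in the formula above is easy to verify numerically: the two parameterizations of the RBF kernel produce identical values for any pair of points once gamma is set to 1 / (2 * sigma^2). A small self-contained check (the point values here are arbitrary):

```python
import numpy as np

def rbf_sigma(x, z, sigma):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def rbf_gamma(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 3.5])
sigma = 0.8
gamma = 1.0 / (2.0 * sigma ** 2)  # the conversion from the text

k1 = rbf_sigma(x, z, sigma)
k2 = rbf_gamma(x, z, gamma)
# k1 and k2 are equal: the two forms are the same kernel.
```

This also makes the width intuition concrete: large gamma means small sigma, i.e. a narrow Gaussian around each support vector, and vice versa.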
Conclusion 2: the RBF kernel should perform similarly to the linear kernel (in theory, the RBF kernel can simulate a linear kernel). It may come out somewhat better or somewhat worse, but there shouldn't be a large gap. Of course, on many problems, for example when the dimension is very high or the number of samples is large, we prefer a linear kernel: the accuracy is comparable, but in terms of training speed and model size the linear kernel performs much better.
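Conclusion 2 can be illustrated with scikit-learn by training both kernels on the same high-dimensional data. This is an illustrative sketch on generated data, not the post's dataset; `LinearSVC` is used for the linear case because it is the faster linear-specific solver.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Generated high-dimensional problem (stand-in data, not the post's features).
X, y = make_classification(n_samples=400, n_features=300,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF kernel: the model stores its support vectors, so it is larger and
# slower to evaluate on high-dimensional data.
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr).score(X_te, y_te)

# Linear kernel via LinearSVC: the model is just one weight per feature.
lin_acc = LinearSVC(max_iter=10000).fit(X_tr, y_tr).score(X_te, y_te)
```

On problems like this the two accuracies are usually close, which is the point of the conclusion: when they are, the linear model wins on speed and size.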
Instructor Mu left another comment that can help beginners better understand SVM: "NOTE: with an RBF kernel, the learned weights are attached to examples, i.e. the support vectors; what a linear kernel learns is a weight per feature, which amounts to a feature weighting function or feature selection."
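The contrast in that comment is visible directly in scikit-learn's fitted attributes: a linear SVM exposes one weight per feature (`coef_`), while an RBF SVM's model is a weighted combination of training examples (`support_vectors_` weighted by `dual_coef_`). A small sketch on toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy binary problem: the label depends on feature 0 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)

lin = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

# Linear kernel: coef_ has shape (1, n_features) -- weights over FEATURES.
feature_weights = lin.coef_

# RBF kernel: dual_coef_ has shape (1, n_support_vectors) -- weights over
# EXAMPLES (the support vectors stored in rbf.support_vectors_).
example_weights = rbf.dual_coef_
```

So the linear model can be read as feature weighting/selection, while the RBF model can only be read as a weighting of (support) examples, exactly as the comment says.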
Reprinted: http://blog.sina.com.cn/s/blog_6ae183910101cxbv.html