[ML] VC Dimension

Source: Internet
Author: User
Tags: svm, rbf, kernel
The VC dimension (Vapnik-Chervonenkis dimension) is an important measure of the learning capacity of a function set, defined in statistical learning theory to study the rate of uniform convergence and the generalization ability of the learning process. The traditional definition is: for a set of indicator functions, if there exist h samples that the functions in the set can separate into all 2^h possible labelings, the function set is said to shatter those h samples. The VC dimension of the function set is the maximum number h of samples it can shatter. If the function set can shatter any number of samples, its VC dimension is infinite. The VC dimension of a set of bounded real-valued functions can be defined by converting the functions into indicator functions via a threshold. The VC dimension reflects the learning capacity of the function set: the larger the VC dimension, the more complex the learning machine (the larger its capacity). Unfortunately, there is currently no general theory for computing the VC dimension of an arbitrary function set; it is known only for some special function sets. For example, in n-dimensional space, the VC dimension of linear classifiers and linear real functions is n + 1.
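To make the definition concrete, here is the standard formalization (the notation is mine, not from the original text):

```latex
% Growth function: the number of distinct labelings H realizes on m points
\Pi_H(m) = \max_{x_1,\dots,x_m} \bigl| \{\, (h(x_1),\dots,h(x_m)) : h \in H \,\} \bigr|

% H shatters m points iff it realizes all 2^m labelings;
% the VC dimension is the largest such m
\mathrm{VCdim}(H) = \max \{\, m : \Pi_H(m) = 2^m \,\}
```

For linear classifiers in the plane (n = 2), this gives VCdim = n + 1 = 3, which is exactly the three-points/four-points argument worked out below.

================================================================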

"The SVM method is based on the VC Dimension Theory of the Statistical Learning Theory and the minimum structure risk principle"

Structural risk

Structural risk = empirical risk + confidence risk

Empirical risk = the error of the classifier on the given training samples

Confidence risk = the error of the classifier's predictions on unknown data

The confidence risk depends on two factors:

  • The number of samples: the more training samples are given, the more likely the learned result is to be correct, and the lower the confidence risk;
  • The VC dimension of the hypothesis class: the larger the VC dimension, the worse the generalization ability and the larger the confidence risk.

To reduce the confidence risk, therefore, increase the number of samples and reduce the VC dimension.

Introduction to Machine Learning: Assume we have a dataset containing N points. These N points can each be labeled as positive or negative, so there are 2^N possible labelings; N data points can therefore define 2^N different learning problems. In the notation below, points before the comma are positive and points after the comma are negative. For example, one point can be labeled in two ways, (a, ) and ( , a); two points can be labeled in four ways, (ab, ), ( , ab), (a, b), (b, a); and so on. If for any of these problems we can find a hypothesis h ∈ H that separates the positive examples from the negative ones, we say that H shatters the N points. That is to say, any learning problem that can be defined with the N points can be learned without error by a hypothesis drawn from H. The maximum number of points that can be shattered by H is called the VC dimension of H (Vapnik and Chervonenkis), written VC(H), and it measures the capacity of the hypothesis class H.
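As a quick illustration of the counting above, here is a small Python sketch (my own, not from the original article) that enumerates the 2^N labelings of N points in the (positives, negatives) notation:

```python
from itertools import product

def dichotomies(points):
    """Enumerate all 2**N ways to label N points as positive or negative."""
    for labels in product((+1, -1), repeat=len(points)):
        pos = [p for p, y in zip(points, labels) if y == +1]
        neg = [p for p, y in zip(points, labels) if y == -1]
        yield pos, neg

# Prints the four labelings of two points a, b: (ab, ), (a, b), (b, a), ( , ab)
for pos, neg in dichotomies(["a", "b"]):
    print(f"({''.join(pos)}, {''.join(neg)})")
```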

================================================================

From this it is concluded that three points in the plane can be shattered by a straight line, but four points cannot.

I understand this conclusion as follows:

(1) Three points in the plane can be shattered by a straight line: a straight line divides a set of points into two groups, and for three points there are 2^3 = 8 labeled ways to divide them into two groups. Here A, B, and C denote the three points, +1 and -1 denote the group labels, and {A → -1, BC → +1} means that A is in the group labeled -1 while B and C are in the group labeled +1; that is one labeling. By the same reasoning, there are eight labelings in total (all realizable by a line, provided the three points are not collinear):

{A → -1, BC → +1}, {A → +1, BC → -1}

{B → -1, AC → +1}, {B → +1, AC → -1}

{C → -1, AB → +1}, {C → +1, AB → -1}

{ABC → -1}, {ABC → +1}

(2) No four points can be shattered. If they could, there would have to be 2^4 = 16 realizable labelings. Dividing four points into two groups gives three cases by group size: (1, 3), where one group has one point and the other has three; (2, 2), where the points are split evenly; and (0, 4), where one group is empty and the other holds all four points. In the first case, each of the four points can be singled out in turn, giving eight labelings:

{A → -1, BCD → +1}, {A → +1, BCD → -1}

{B → -1, ACD → +1}, {B → +1, ACD → -1}

{C → -1, ABD → +1}, {C → +1, ABD → -1}

{D → -1, ABC → +1}, {D → +1, ABC → -1}

In the second case, only four labelings are realizable:

{AB → -1, CD → +1}, {AB → +1, CD → -1}

{AC → -1, BD → +1}, {AC → +1, BD → -1}

No straight line can put A and D in one group and B and C in the other, because A and D occupy one diagonal and B and C occupy the other (take the four points to be the corners of a square; this is easy to see from a sketch).

In the third case, there are two labelings:

{ABCD → -1}

{ABCD → +1}

Therefore, only 8 + 4 + 2 = 14 labelings are realizable in total, which falls short of the required 2^4 = 16. Hence no straight line can shatter these four points. (Strictly speaking, the conclusion requires that no configuration of four points can be shattered: if the four points form a convex quadrilateral, the diagonal labeling fails as above, and if one point lies inside the triangle formed by the other three, the labeling that separates the inner point from the outer three fails.)
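The counting above can be verified by brute force. Below is a small Python sketch (my own illustration, assuming scikit-learn is installed) that tests every labeling of the four corners of a square for linear separability; a labeling counts as realizable if a nearly hard-margin linear SVM reproduces it exactly:

```python
from itertools import product

import numpy as np
from sklearn.svm import SVC

# Four corners of a square; A and D are diagonal, as are B and C
points = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)  # A, B, C, D

def is_linearly_separable(X, y):
    """Fit a (nearly) hard-margin linear SVM; the labeling is realizable
    iff the classifier reproduces it exactly on the training points."""
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    return bool(np.all(clf.predict(X) == y))

realizable = 0
for labels in product((-1, +1), repeat=len(points)):
    y = np.array(labels)
    if len(set(labels)) == 1:
        realizable += 1  # all points on one side: trivially realizable
    elif is_linearly_separable(points, y):
        realizable += 1

print(realizable, "of", 2 ** len(points))  # 14 of 16, matching the count above
```

Only the two diagonal (XOR-style) labelings fail, which is exactly the 16 − 2 = 14 computed by hand.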

================================================================

The essence of the VC dimension and structural risk minimization

http://hi.baidu.com/jipyjyxfkubhtxq/item/a04bc83f71f33f9cb90c03f9

With a limited number of training samples, when the sample size n is fixed, a higher VC dimension means a more complex learning machine. The VC dimension reflects the learning capacity of the function set: the larger the VC dimension, the more complex the learning machine (the larger its capacity).

So-called structural risk minimization means reducing the VC dimension of the learning machine while maintaining classification accuracy (low empirical risk), so that the expected risk of the learning machine over the entire sample distribution is kept under control.

Generalization bound (the relationship between empirical risk and actual risk). Why introduce it? Because a small training error on the training set alone is not enough: if the generalization ability is insufficient, overfitting may occur. A confidence interval is therefore introduced to relate the empirical error to the actual expected error:

R(ω) ≤ Remp(ω) + Φ(n/h)

Here R(ω) is the expected risk, Remp(ω) is the empirical risk, i.e., the training error (zero when all training samples are classified correctly), and Φ(n/h) is the confidence interval, which depends on the number of samples n and the VC dimension h. In this formula, the confidence interval Φ decreases monotonically as n/h increases. That is, when n/h is small, the confidence interval is large, and there is a large gap between the empirical risk and the actual risk; in that regime, applying the empirical risk minimization principle may yield a solution with poor generalization. If the number of samples is large, so that n/h is large, the confidence interval is small, and the solution obtained by empirical risk minimization is close to the actual optimum.

Two factors therefore affect the upper bound on the expected risk: the size n of the training set and the VC dimension h. While maintaining classification accuracy (low empirical risk), reducing the VC dimension of the learning machine controls its expected risk over the entire sample distribution; this is the origin of structural risk minimization (SRM). With a limited number of training samples, when n is fixed, a larger VC dimension (a more complex learning machine) means a larger confidence interval and a greater gap between the true risk and the empirical risk; this is why overfitting occurs. In the machine learning process, we should therefore not only minimize the empirical risk, but also minimize the VC dimension to narrow the confidence interval and thereby obtain a smaller actual risk, i.e., better generalization to future samples. The bound depends on both the VC dimension of the learning machine and the number of training samples.
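One standard explicit form of this bound (from Vapnik's theory, as given in Burges' SVM tutorial; the original article does not spell it out) holds with probability 1 − η:

```latex
R(\omega) \;\le\; R_{\mathrm{emp}}(\omega) \;+\;
\underbrace{\sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}}_{\Phi(n/h)}
```

where h is the VC dimension and n the number of training samples; the square-root term is the confidence interval Φ(n/h), and it indeed shrinks as n/h grows.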

For the linearly separable case, the optimal classification surface requires not only that the classification surface correctly separate the two classes of samples (training error rate 0), but also that the margin between the two classes be as large as possible. (Why the maximum margin? This turns out to be well-founded, though it puzzled me for a long time.) For the set of hyperplanes with margin Δ, the VC dimension h satisfies the following relationship:

h = f(1/Δ²)

where f(·) is a monotonically increasing function; that is, h grows with the inverse square of the margin. Therefore, once the training samples are fixed, the larger the classification margin, the smaller the VC dimension of the corresponding set of classification hyperplanes. By the structural risk minimization principle, the former condition ensures that the empirical risk is minimized (both the empirical and the expected risk depend on the choice of the family of learning functions), while the latter, by maximizing the classification margin, makes the VC dimension as small as possible; in effect it minimizes the confidence interval in the generalization bound, and thereby the true risk. Note: a large confidence interval means a large gap between the true risk and the empirical risk.
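For reference, a commonly quoted concrete version of this relationship (due to Vapnik; not spelled out in the original text, and conventions for the margin Δ and data radius R vary across sources) bounds the VC dimension of margin-Δ hyperplanes on data contained in a ball of radius R in n-dimensional space:

```latex
h \;\le\; \min\!\left( \left\lceil \frac{R^2}{\Delta^2} \right\rceil,\; n \right) + 1
```

Doubling the margin can thus cut the capacity term by a factor of four, independently of the ambient dimension n when R²/Δ² is the smaller term.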

After this explanation, things finally clicked. Oh, so that's how it is. To sum up: when the training samples are linearly separable, we require that all of them be classified correctly, i.e., the famous constraint y_i(w · x_i + b) ≥ 1 for every sample (so the empirical risk Remp is 0), while maximizing the classification margin 2/‖w‖, which is equivalent to minimizing Φ(w) = (1/2)‖w‖²; this is what gives the classifier its best generalization performance.
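A minimal sketch of this hard-margin setup (my illustration, assuming scikit-learn; a very large C approximates the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters of toy points
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, +1, +1])

# A very large C approximates the hard-margin SVM:
# minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1
clf = SVC(kernel="linear", C=1e8).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("constraint values y_i(w.x_i + b):", y * (X @ w + b))  # all >= 1 (up to tolerance)
print("geometric margin 2/||w||:", 2 / np.linalg.norm(w))
```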

With the linearly separable case explained, we know that in many situations the data cannot be linearly separated. So what changes? The essential difference is that misclassified samples are now tolerated (which was not at all obvious to me at first); precisely because misclassified samples are allowed, the soft-margin classification hyperplane is the hyperplane with the maximum classification margin once the misclassified samples are accounted for. Here we meet a new term: the slack variable. What is it for? It controls the misclassified samples, and through it the empirical risk becomes tied to the slack terms. C is the coefficient in front of the slack term: C > 0 is a user-defined penalty factor that controls how strongly misclassified samples are punished, i.e., the trade-off between fitting the training samples and the machine's generalization ability. The smaller C is, the lighter the penalty, the larger the tolerated training error, and the larger the structural risk; the larger C is, the heavier the penalty and the tighter the constraint on misclassified samples, but the margin term then carries relatively less weight and the confidence-interval term more, so the system's generalization ability deteriorates. An appropriate C must therefore be chosen.
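Written out, the soft-margin primal problem this paragraph describes (standard formulation; notation mine) is:

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \;\ge\; 1 - \xi_i,\;\; \xi_i \ge 0
```

with slack variables ξ_i measuring each sample's constraint violation and C trading margin width against training error.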

(3) Select the kernel function.

There are many kinds of kernel functions, such as the linear kernel, polynomial kernel, sigmoid kernel, and radial basis function (RBF) kernel. Here the RBF kernel, K(x, y) = exp(−γ‖x − y‖²) with γ > 0, is selected as the SVM kernel. Because the RBF kernel maps samples into a higher-dimensional space, it can handle the case where the relationship between class labels and features is nonlinear. Keerthi et al. [25] proved that an SVM with the linear kernel and some penalty parameter C performs the same as the RBF kernel with certain parameters (C, γ) (where C is the penalty factor and γ the kernel parameter). For some parameters, the sigmoid kernel behaves similarly to the RBF kernel [26]. In addition, the RBF kernel has fewer hyperparameters than the polynomial kernel, and the number of hyperparameters directly affects the complexity of model selection. It also matters that the RBF kernel values satisfy 0 < K_ij ≤ 1, in contrast to the polynomial kernel, whose values may tend to infinity (when γ x_i · x_j + r > 1) or to zero (when 0 < γ x_i · x_j + r < 1) as the degree grows, an enormous span. Furthermore, it must be noted that the sigmoid kernel is not valid under some parameters (i.e., it is not the inner product of two vectors in any feature space).
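A quick numerical check of the kernel-value ranges described above (my illustration; the data and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
gamma, r, degree = 0.5, 1.0, 8

# RBF kernel: always in (0, 1]
k_rbf = np.exp(-gamma * np.sum((x - z) ** 2))

# Polynomial kernel: (gamma <x,z> + r)^degree can explode or vanish,
# depending on whether |gamma <x,z> + r| is above or below 1
k_poly = (gamma * x @ z + r) ** degree

print(f"RBF kernel value:        {k_rbf:.6f}")   # bounded in (0, 1]
print(f"Polynomial kernel value: {k_poly:.6f}")  # potentially huge or tiny
```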

(4) Use cross-validation to find the best parameters C and γ. When using the RBF kernel, the two parameters C and γ must be considered. Because there is no prior knowledge to guide the parameter choice, some kind of model selection (parameter search) must be performed. The objective is for the classifier to predict unknown data (i.e., the test set) with high classification accuracy. It is worth noting that a high training accuracy (the accuracy of the classifier on training data whose class labels are already known) does not guarantee high prediction accuracy on the test set. Therefore, cross-validation is commonly used to estimate prediction accuracy.

In k-fold cross-validation, the training set is divided into k subsets of equal size. Each subset in turn is used for testing while the remaining k − 1 subsets are used to train the classifier. In this way, every subset of the training set is predicted exactly once, and the cross-validation accuracy is the average percentage of correctly classified data over the k rounds. This helps guard against overfitting.
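A compact sketch of this grid search with k-fold cross-validation (my illustration, assuming scikit-learn; the exponentially spaced grid over C and γ follows common practice for RBF SVMs, and the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy dataset standing in for the real training set
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Exponentially spaced grid over (C, gamma), scored by 5-fold CV accuracy
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best (C, gamma):", search.best_params_)
print("cross-validation accuracy:", search.best_score_)
```

The pair (C, γ) with the best cross-validation accuracy is then used to retrain on the full training set before predicting the test set.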

================================================================

http://www.caogenit.com/caogenxueyuan/yingyongfangxiang/rengongzhineng/1194.html
http://blog.csdn.net/marising/article/details/5888531
http://liujian.jigsy.com/entries/blog/vc%E7%BB%B4-vapnik-chervonekis-theorem
