For more information, please refer to http://www.blogjava.net/zhenandaci/archive/2009/02/13/254578.html
Support Vector Machines (SVM), first proposed by Cortes and Vapnik in 1995, show many unique advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems, and can also be applied to other machine learning problems such as function fitting [10].
The support vector machine method is built on the VC-dimension theory of statistical learning theory and the structural risk minimization principle. Based on limited sample information, it seeks the best tradeoff between model complexity (that is, the learning accuracy achieved on the given training samples) and learning ability (that is, the ability to classify unseen samples without error), with the aim of obtaining the best generalization ability [14].
The VC dimension is a measure of the capacity of a function class, and can be understood as the complexity of the learning problem.
Statistical learning therefore introduces the concept of the generalization error bound: the true risk should be described by two parts. The first is the empirical risk, which represents the classifier's error on the given training samples; the second is the confidence risk, which represents the extent to which we can trust the classifier to classify unknown texts. Obviously, the second part cannot be computed exactly, so only an interval estimate can be given, which means the overall error can only be bounded from above rather than computed precisely (hence the term generalization error bound, rather than generalization error).
The confidence risk is related to two quantities. The first is the number of samples: obviously, the larger the given sample size, the more likely our learning result is to be correct, and the smaller the confidence risk. The second is the VC dimension of the classification function: obviously, the larger the VC dimension, the worse the generalization ability, and the larger the confidence risk.
The formula for the generalization error bound is:
R(w) ≤ R_emp(w) + Φ(n/h)
In this formula, R(w) is the true risk, R_emp(w) is the empirical risk, and Φ(n/h) is the confidence risk, where n is the number of samples and h is the VC dimension. The goal of statistical learning thus shifts from minimizing the empirical risk alone to minimizing the sum of the empirical risk and the confidence risk, that is, to minimizing the structural risk.
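As a hedged illustration (the exact functional form below is one commonly quoted version of Vapnik's bound and is an assumption of this note, not a formula from the original text), the confidence term depends only on the sample size n, the VC dimension h, and a confidence level η, and it behaves exactly as described above: it shrinks as n grows and grows with h.

```python
import math

def vc_confidence(n, h, eta=0.05):
    """Confidence term of one commonly quoted VC bound (an assumption of
    this sketch, not taken from the original article):
        sqrt((h * (ln(2n/h) + 1) - ln(eta/4)) / n)
    n   -- number of training samples
    h   -- VC dimension of the function class
    eta -- the bound holds with probability 1 - eta
    """
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# The confidence risk shrinks as the sample size grows and grows with the VC dimension.
for n in (100, 1000, 10000):
    print(n, round(vc_confidence(n, h=50), 3))
```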
SVM handles linearly non-separable sample data mainly through slack variables (also called penalty variables) and the kernel function technique.
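For a concrete, hedged sketch of these two mechanisms (the library, dataset, and parameter values are my own choices, not part of the original text): in scikit-learn's SVC, the parameter C weighs the penalty on the slack variables, and kernel selects the kernel function.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the input space.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# C weighs the slack (penalty) variables; kernel="rbf" maps the data
# implicitly into a higher-dimensional space via a kernel function.
clf = SVC(C=1.0, kernel="rbf", gamma="scale")
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```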
In the case of text categorization, we can let the computer examine the training samples we provide, each of which consists of a vector (the feature vector of the text) and a label (which identifies the category the sample belongs to), written as:
D_i = (x_i, y_i)
where x_i is the text vector (its number of dimensions is very high) and y_i is the class label.
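As an illustrative sketch (the corpus, labels, and use of TF-IDF features are assumptions for demonstration, not from the original text), this is how such (x_i, y_i) pairs might be built for text categorization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; in practice each x_i would have thousands of dimensions.
texts = ["the stock market rose today",
         "the team won the football match",
         "shares fell after the earnings report",
         "the coach praised the players"]
labels = [1, -1, 1, -1]          # y_i: 1 = finance, -1 = sports (illustrative)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # each row is one text vector x_i
print(X.shape)                        # (number of samples, number of features)
```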
In binary linear classification, the label that represents the class takes only two values, +1 and -1 (indicating whether the sample belongs to the class or not). With this notation, we can define the margin of a sample point with respect to a hyperplane:
δ_i = y_i(w·x_i + b)
At first glance this formula looks unremarkable and hard to justify; it is simply a definition. But if we transform it a little, we can see something interesting.
First notice that if a sample belongs to the class, then w·x_i + b > 0 (recall that our classifier g(x) = w·x + b assigns classes according to whether g(x) is greater or less than zero), and y_i is also greater than 0; if the sample does not belong to the class, then w·x_i + b < 0 and y_i is less than 0. This means that y_i(w·x_i + b) is always greater than 0, and its value equals |w·x_i + b| (that is, |g(x_i)|).
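A small numerical sketch of the quantity y_i(w·x_i + b) (the weight vector, bias, and sample values are made up for illustration):

```python
import numpy as np

# Illustrative values (not from the article): a 2-D weight vector and bias.
w = np.array([2.0, -1.0])
b = 0.5

x_i = np.array([1.0, 0.2])   # a sample point
y_i = 1                      # its label (+1 or -1)

g = w @ x_i + b              # g(x_i) = w.x_i + b
functional_margin = y_i * g  # y_i * (w.x_i + b); positive iff correctly classified
print(functional_margin, abs(g))   # the two values coincide when the label is correct
```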
Now let us normalize w and b, that is, replace the original w and b with w/||w|| and b/||w||; the margin can then be written as

δ_i = y_i((w/||w||)·x_i + b/||w||) = |g(x_i)| / ||w||
Does this formula look a bit familiar? Yes, it is exactly the formula from analytic geometry for the distance from a point x_i to the line g(x) = 0! (Generalized, it is the distance to the hyperplane g(x) = 0, where g(x) = 0 is the classification hyperplane referred to in the previous section.)
A small tip: what does the symbol ||w|| mean? ||w|| is called the norm of the vector w, and it is a measure of a vector's length. The length of a vector that we usually speak of is in fact its 2-norm; the norm in its most general form is the p-norm, which can be written as the following expression.
For a vector w = (w1, w2, w3, ..., wn), its p-norm is

||w||_p = (|w1|^p + |w2|^p + ... + |wn|^p)^(1/p)
Setting p to 2 gives exactly the traditional vector length. When we do not specify p and simply write ||w||, it means either that we do not care which norm is used (the statement holds for any of them), or that the value of p has already been given above and is not repeated for convenience of exposition.
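A short sketch of the p-norm definition (the vector values are illustrative only):

```python
import numpy as np

w = np.array([3.0, -4.0, 12.0])   # illustrative vector (not from the article)

# p-norm computed from the definition, and via numpy for comparison.
def p_norm(v, p):
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

for p in (1, 2, 3):
    print(p, p_norm(w, p), np.linalg.norm(w, ord=p))

# p = 2 gives the ordinary Euclidean length: sqrt(3^2 + 4^2 + 12^2) = 13.
```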
When the normalized w and b replace the original values, the margin has a special name: the geometric margin. The geometric margin is the Euclidean distance from the point to the hyperplane, and below we will simply call it the "distance". The above is the distance from a single point to the hyperplane (that is, the margin; we no longer distinguish between the two words); we can likewise define the distance from a set of points (that is, a set of samples) to a hyperplane as the distance from the point in the set closest to that hyperplane. The following diagram shows the practical meaning of the geometric margin more intuitively:
H is the classification surface, while H1 and H2 are planes parallel to H that pass through the two samples closest to H; the distance between H1 and H (and likewise between H2 and H) is the geometric margin.
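A minimal sketch of the geometric margin computation (the hyperplane and sample values are invented for illustration): the distance of each point to the plane is y_i(w·x_i + b)/||w||, and the margin of the whole set is the distance of its closest point.

```python
import numpy as np

# Illustrative hyperplane g(x) = w.x + b = 0 and a few samples (my own numbers).
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.2],
              [0.3, 1.5],
              [-1.0, -0.4]])
y = np.array([1, -1, -1])

# Geometric margin of each point: y_i * (w.x_i + b) / ||w||.
margins = y * (X @ w + b) / np.linalg.norm(w)
print(margins)

# Margin of the whole sample set: distance of the closest point to the plane.
print("set margin:", margins.min())
```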
The reason we care so much about the geometric margin is that it is related to the number of misclassifications made on the samples:

number of misclassifications ≤ (2R/δ)²

where δ is the margin of the sample set with respect to the classifier, and R = max ||x_i||, i = 1, ..., n; that is, R is the length of the longest sample vector (it represents how widely the samples are spread out), where x_i is the i-th sample expressed as a vector. There is no need to examine the precise definition and derivation of the misclassification count; just remember that it represents, to some extent, the classifier's error. As can be seen from the above, the upper bound on the number of misclassifications is determined by the geometric margin (given a fixed set of samples, of course).
So it is now clear why the geometric margin is used as the index for evaluating how good a solution is: the larger the geometric margin, the lower the upper bound on the error.
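Continuing the illustrative numbers from the previous sketch (R, δ, and the sample values are assumptions, not data from the article), the bound can be evaluated directly:

```python
import numpy as np

# R is the length of the longest sample vector; delta is the margin of the
# sample set to the classifier (taken from the earlier sketch).
X = np.array([[1.0, 0.2],
              [0.3, 1.5],
              [-1.0, -0.4]])
R = np.linalg.norm(X, axis=1).max()
delta = 0.179

# Upper bound on the number of misclassifications: a larger margin gives a lower bound.
print("bound:", (2 * R / delta) ** 2)
```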
SVM Learning Experience