Getting Started with SVM (i) -- The boilerplate introduction to SVM
Support Vector Machines (SVM), first proposed by Cortes and Vapnik in 1995, show many unique advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems, and can also be extended to other machine learning problems such as function fitting.
The support vector machine method is built on the VC dimension theory of statistical learning theory and the principle of structural risk minimization. Based on limited sample information, it seeks the best tradeoff between the complexity of the model (that is, the accuracy of learning the given training samples) and the learning ability (that is, the ability to classify arbitrary samples without error), in order to obtain the best generalization ability.
The above is the passage most often cited in the academic literature on SVM. It is a bit formulaic, so let me break it down and explain it.
Vapnik is a giant of statistical machine learning; that goes without saying. His book Statistical Learning Theory is a classic that gives a complete exposition of the ideas of statistical learning. In it he argues in detail that the essential difference between statistical learning and traditional machine learning is that statistical learning can precisely characterize what the learning outcome will be and can answer questions such as how many samples are needed. Compared with this precision, traditional machine learning is basically crossing the river by feeling for stones: building a classification system with traditional methods is more of a craft, where one person's result may be very good while another person using a similar method does very badly, with little guidance or principle behind it.
The so-called VC dimension is a measure of a function class; it can be loosely understood as the complexity of a problem: the higher the VC dimension, the more complex the problem. Because SVM is formulated in terms of the VC dimension, we will see later that the problems SVM solves are independent of the dimensionality of the samples (even samples with tens of thousands of dimensions are fine), which makes SVM well suited to text classification (with the help of kernel functions, of course).
Structural risk minimization sounds lofty, but what it actually says is nothing more than the following.
Machine learning is essentially an approximation of the true model of a problem (we choose an approximate model that we think is good, and this approximation is called a hypothesis). But without doubt, the true model is unknown (if we knew it, why would we need machine learning? We could simply solve the problem with the true model directly, right?). Since the true model is unknown, we cannot know how far the hypothesis we have chosen is from the true solution of the problem. For example, we believe that the universe was born 15 billion years ago in a big bang; this hypothesis can describe many of the phenomena we observe, but how far is it from the true model of the universe? Nobody can say, because we simply do not know what the true model of the universe is.
The gap between the hypothesis and the true solution of the problem is called risk (more strictly, the accumulation of this error is called risk). After we choose a hypothesis (or, more intuitively, after we obtain a classifier), the true error is unknown, but we can approximate it with quantities we can actually measure. The most intuitive idea is to use the difference between the classifier's predictions on the sample data and the true results (the samples are already labeled data, so their labels are accurate). This difference is called the empirical risk Remp(w). Earlier machine learning methods took minimizing the empirical risk as the goal, but it was later discovered that many classification functions can easily reach 100% accuracy on the training set while performing terribly on real data (that is, their generalization ability is poor). In such cases one has chosen a classification function complex enough (with a very high VC dimension) to memorize every training sample precisely, yet it misclassifies data outside the training set. Looking back at the principle of empirical risk minimization, we find that its premise is that the empirical risk must indeed approximate the true risk (in jargon, that the two are consistent). But can it really? The answer is no, because the number of training samples is a drop in the bucket compared with the number of texts in the real world, and the empirical risk minimization principle only guarantees no error on this tiny fraction of samples; it certainly cannot guarantee correctness on the far larger proportion of real texts.
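To pin the definition down, the empirical risk is usually written (this is the standard textbook form, added here for concreteness rather than quoted from the passage above) as the average loss over the n training samples:

Remp(w) = (1/n) Σi L(yi, f(xi, w)), i = 1, ..., n

where, for classification, the loss L is simply 0 when the prediction f(xi, w) matches the label yi and 1 when it does not, so minimizing Remp(w) amounts to minimizing the training error rate.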
Statistical learning therefore introduces the concept of the generalization error bound: the true risk should be described by two parts. The first is the empirical risk, which represents the classifier's error on the given samples; the second is the confidence risk, which represents the extent to which we can trust the classifier's results on unknown texts. Obviously the second part cannot be computed exactly; only an estimated interval can be given, which means the whole error can only be computed as an upper bound rather than an exact value (hence the name generalization error bound rather than generalization error).
The confidence risk is related to two quantities. One is the number of samples: obviously, the larger the given sample set, the more likely our learning result is to be correct, and the smaller the confidence risk. The other is the VC dimension of the classification function: obviously, the larger the VC dimension, the worse the generalization ability, and the larger the confidence risk.
The formula for the generalization error bound is:
R(w) ≤ Remp(w) + Φ(n/h)
In this formula, R(w) is the true risk, Remp(w) is the empirical risk, and Φ(n/h) is the confidence risk, where n is the number of samples and h is the VC dimension. The goal of statistical learning thus changes from minimizing the empirical risk alone to minimizing the sum of the empirical risk and the confidence risk, that is, minimizing the structural risk.
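For reference, one commonly quoted explicit form of the confidence term (a standard Vapnik-style bound added here for concreteness; it is not part of the original passage) states that, with probability at least 1 − η over the draw of the n samples,

Φ(n/h) = sqrt( ( h (ln(2n/h) + 1) − ln(η/4) ) / n )

where h is the VC dimension. The exact shape does not matter here; what matters is that the term grows with h and shrinks as n grows, exactly the behaviour described above.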
SVM is precisely such an algorithm that strives to minimize the structural risk.
Other features of SVM are relatively easy to understand.
"Small sample" does not mean that the absolute number of samples is small (in fact, for almost any algorithm, more samples bring better results); it means that, relative to the complexity of the problem, the SVM algorithm requires comparatively few samples.
Nonlinearity refers to the fact that SVM is good at handling sample data that is not linearly separable, mainly through slack variables (also called penalty variables) and the kernel function technique. This part is the essence of SVM and will be discussed in detail later. One more remark: whether text classification is a linearly separable problem has never been settled, so we should not casually simplify matters by assuming it is. Until we get to the bottom of the question, we had better treat it as linearly non-separable (after all, the linearly separable case is just a special case of the non-separable one, and we have never been afraid of a method being too general).
High-dimensional pattern recognition refers to samples whose dimensionality is very high. For example, in the vector representation of text, if we do not apply the dimensionality reduction discussed in another series of articles ("Introduction to Text Classification"), tens of thousands of dimensions is perfectly normal. Other algorithms basically cannot cope with this, but SVM can, mainly because the classifier SVM produces is very concise: the amount of sample information it uses is very small (only those samples called "support vectors" are needed, more on that later), so even when the sample dimensionality is very high, it does not cause much trouble for storage or computation (compare this with the KNN algorithm, which uses all samples for classification; with many samples, each of very high dimension, it would never finish...).
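As a quick illustration of the "concise classifier" point, here is a minimal sketch (using scikit-learn and synthetic data of my own choosing, not anything prescribed by the text) that trains a linear-kernel SVM on fairly high-dimensional data and reports how many of the training samples end up as support vectors:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for high-dimensional data (not real text vectors).
X, y = make_classification(n_samples=500, n_features=5000,
                           n_informative=50, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("training samples:", X.shape[0])
print("support vectors: ", clf.support_vectors_.shape[0])
# Only the support vectors are needed to evaluate the decision function,
# so the stored model stays small even though each sample has 5000 dimensions.

The exact counts will vary with the data, but the number of support vectors is typically much smaller than the number of training samples.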
The next section begins the formal discussion of SVM. Don't mind if I go into a lot of detail.
Getting Started with SVM (ii) -- Linear Classifiers, Part 1
A linear classifier (which in a certain sense can also be called a perceptron) is the simplest, yet still very effective, form of classifier. In the linear classifier we can see how the ideas behind SVM took shape and encounter many of SVM's core concepts.
Here is a small example: a classification problem with only two classes of samples in a two-dimensional space.
C1 and C2 are the two categories to be distinguished, and their samples are scattered in the two-dimensional plane as shown in the figure. The line in the middle is a classification function that separates the two classes of samples completely. In general, if a linear function can separate the samples completely and correctly, we say the data is linearly separable; otherwise it is linearly non-separable.
What do we mean by a linear function? In one-dimensional space it is a point, in two-dimensional space a straight line, in three-dimensional space a plane, and so on. If we do not care about the dimensionality of the space, this kind of linear function has a unified name: the hyperplane.
In fact, a linear function is a real-valued function (that is, its value is a continuous real number), while our classification problem (for example the binary problem here, answering whether a sample belongs to a category or not) requires discrete output values, say 1 to indicate that the sample belongs to category C1 and 0 to indicate that it does not (not belonging to C1 here also means belonging to C2). In that case we only need to attach a threshold to the real-valued function and decide the category by whether the function value is above or below the threshold. For example, suppose we have the linear function:
g(x) = wx + b
We can take the threshold to be 0, so that when a sample xi needs to be classified, we look at the value of g(xi): if g(xi) > 0, xi is assigned to category C1; if g(xi) < 0, it is assigned to C2 (if the value is exactly 0, we refuse to decide, heh). This is equivalent to wrapping g(x) in the sign function sgn(), i.e., f(x) = sgn[g(x)] is our real discriminant function.
Three points deserve attention about the expression g(x) = wx + b. First, the x in the formula is not the horizontal coordinate of the two-dimensional coordinate system but the vector representation of a sample: if a sample point has coordinates (3, 8), then xT = (3, 8), not x = 3 (vectors are conventionally column vectors, hence the transpose). Second, this form is not confined to the two-dimensional case; the same expression can be used in n-dimensional space, except that w becomes an n-dimensional vector (in this two-dimensional example, w is a two-dimensional vector; for brevity we will not distinguish below between a column vector and its transpose, as the alert reader will notice). Third, g(x) is not the expression of the line in the middle; the expression of that line is g(x) = 0, i.e., wx + b = 0, and we also call this function the classification surface.
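To make the discriminant concrete, here is a minimal NumPy sketch of f(x) = sgn[g(x)] = sgn(wx + b) in the two-dimensional case; the particular values of w and b are made up purely for illustration:

import numpy as np

# Hypothetical parameters of g(x) = w.x + b, chosen only for illustration.
w = np.array([1.0, -2.0])    # weight vector (two-dimensional here)
b = 0.5                      # bias term

def g(x):
    """Real-valued linear function g(x) = w.x + b."""
    return np.dot(w, x) + b

def f(x):
    """Discriminant f(x) = sgn(g(x)): +1 means class C1, -1 means class C2."""
    return np.sign(g(x))

x_i = np.array([3.0, 8.0])   # the sample point (3, 8) mentioned above
print(g(x_i), f(x_i))        # g = 3 - 16 + 0.5 = -12.5, so f = -1.0 (class C2)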
In fact it is easy to see that the dividing line in the middle is not unique: if we rotate it a little, it still achieves the effect described above as long as it does not misclassify any sample of either class, and a slight translation also works. This raises a question: when there are several classification functions for the same problem, which one is better? Obviously we need an indicator to quantify "better", and the one commonly used is called the "classification margin". In the next section we will talk about the classification margin and also fill in the relevant mathematical background.
Getting Started with SVM (iii) -- Linear Classifiers, Part 2
The last section mentioned that text classification posed this way is an ill-posed problem (a problem with more than one solution is called ill-posed), so we need an indicator to measure the quality of a solution (that is, the classification model we build through training), and the classification margin is a good such indicator.
In text classification, we let the computer look at the training samples we provide, each of which consists of a vector (the vector of text features) and a label (which identifies which category the sample belongs to), written as follows:
Di = (xi, yi)
xi is the text vector (its dimensionality is very high) and yi is the classification label.
In binary linear classification, the label that marks the class takes only two values, 1 and -1 (indicating whether the sample belongs to the class or not). With this label we can define the margin of a sample point with respect to a hyperplane:
δi = yi(wxi + b)
At first glance this formula holds no mystery, and there seems to be no particular reason behind it; it is just a definition. But if we manipulate it a little, some interesting things emerge.
First notice that if a sample belongs to the class, then wxi + b > 0 (remember? This is because the g(x) = wx + b we chose classifies by whether its value is above or below zero) and yi is also greater than 0; if the sample does not belong to the class, then wxi + b < 0 and yi is less than 0. This means that yi(wxi + b) is always greater than 0, and its value is exactly |wxi + b| (in other words, |g(xi)|).
Now let us normalize w and b, that is, replace the original w and b with w/||w|| and b/||w||. The margin can then be written as

δi = yi(wxi + b)/||w|| = |g(xi)|/||w||
Doesn't this formula look a bit familiar? Yes, it is none other than the formula from analytic geometry for the distance from a point xi to the line g(x) = 0! (More generally, it is the distance to the hyperplane g(x) = 0, where g(x) = 0 is the classification hyperplane mentioned in the previous section.)
A small tip: what does the symbol ||w|| mean? ||w|| is called the norm of the vector w, and the norm is a measure of the length of a vector. The "length" of a vector that we usually speak of is actually its 2-norm. The most general form of the norm is the p-norm, which can be written as the following expression:
for a vector w = (w1, w2, w3, ..., wn), its p-norm is

||w||p = (|w1|^p + |w2|^p + ... + |wn|^p)^(1/p)
If you set p to 2, you recover the traditional vector length. When we do not specify p, as when we simply write ||w||, it means either that we do not care which norm it is (any p will do) or that the value of p has already been stated above and is not repeated for convenience.
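A small sketch of the p-norm using NumPy's norm routine (the vector here is an arbitrary example):

import numpy as np

w = np.array([3.0, 4.0])              # arbitrary example vector
print(np.linalg.norm(w, ord=1))       # 1-norm: |3| + |4| = 7
print(np.linalg.norm(w, ord=2))       # 2-norm: sqrt(3^2 + 4^2) = 5, the usual "length"
print(np.linalg.norm(w, ord=np.inf))  # infinity-norm: max(|3|, |4|) = 4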
When the normalized w and b replace the original values, the margin has a special name: the geometric margin. The geometric margin is the Euclidean distance from the point to the hyperplane, and below we will simply call it "distance". The above is the distance from a single point to a hyperplane (that is, its margin; we no longer distinguish the two words). We can likewise define the distance from a set of points (that is, a sample set) to a hyperplane as the distance from the point in the set closest to that hyperplane. The following figure shows the concrete meaning of the geometric margin more intuitively:
H is the classification surface, while H1 and H2 are parallel to H and pass through the samples of each class that are closest to H; the distance between H1 and H (and equally between H2 and H) is the geometric margin.
The reason for caring so much about the geometric margin is that there is a relationship between the geometric margin and the number of misclassifications made on the sample set:

number of misclassifications ≤ (2R/δ)²
where δ is the margin of the sample set to the classification hyperplane and R = max ||xi||, i = 1, ..., n, that is, R is the length of the longest sample vector xi (in other words, how widely the samples are spread). There is no need to dig into the precise definition and derivation of the misclassification count; just remember that it represents, to some extent, the error of the classifier. As the bound shows, the upper bound on the number of misclassifications is determined by the geometric margin! (Given the samples, of course.)
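The following sketch (a toy illustration with made-up numbers and a hand-picked hyperplane, not a derivation) computes the geometric margin δ of a small linearly separable sample set, together with R = max ||xi|| and the resulting bound (2R/δ)²:

import numpy as np

# Toy, hand-made linearly separable data; labels are +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A hyperplane wx + b = 0 that separates them, chosen by hand for illustration.
w = np.array([1.0, 1.0])
b = 0.0

# Geometric margin of each sample: yi(wxi + b) / ||w||.
margins = y * (X @ w + b) / np.linalg.norm(w)
delta = margins.min()                  # margin of the whole set = closest point
R = np.linalg.norm(X, axis=1).max()    # length of the longest sample vector

print("delta =", delta)
print("R     =", R)
print("bound on misclassifications: (2R/delta)^2 =", (2 * R / delta) ** 2)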
At this point we understand why the geometric margin is chosen as the indicator for evaluating a solution: the larger the geometric margin, the smaller the upper bound on the error. Maximizing the geometric margin is therefore the goal of our training phase. And, contrary to what some half-baked authors say, maximizing the classification margin is not an SVM patent; it is an idea that has been around since the era of linear classifiers.
SVM Learning Notes