SVM Introduction (1) to (3): A Refresher

SVM Introduction (1): Overview of SVM

The support vector machine (SVM) was first proposed by Cortes and Vapnik in 1995. It has many unique advantages in small-sample, non-linear, and high-dimensional pattern recognition, and it can be extended to function fitting and other machine learning problems [10].
The SVM method is built on the VC dimension theory of statistical learning theory and the principle of structural risk minimization. Based on limited sample information, it seeks the best trade-off between model complexity (that is, the accuracy of learning on the given training samples) and learning ability (that is, the ability to classify arbitrary new samples without error), in order to obtain the best generalization ability [14].

The above is the introduction to SVM most often quoted in the academic literature. I will break it down and explain it piece by piece.

Vapnik is a master of statistical machine learning; his book on statistical learning theory is a masterpiece that lays out the ideas of the field. The book demonstrates in detail that what distinguishes statistical machine learning from traditional machine learning is that statistical machine learning can make precise statements about learning results and answer questions such as how many samples are required. Compared with this precision, traditional machine learning is largely a matter of crossing the river by feeling for stones: building a classification system with traditional methods is more of a craft, where one person may get good results while another gets poor ones, with little guidance from principles.

The so-called VC dimension is a measure of a function class; it can loosely be understood as the complexity of a problem. The higher the VC dimension, the more complex the problem. It is precisely because SVM focuses on the VC dimension that, as we will see later, SVM can solve problems almost independently of the dimensionality of the samples (even when the samples have tens of thousands of dimensions), which makes SVM well suited to text classification. Of course, kernel functions are introduced to support this capability.

Structural risk minimization can be explained roughly as follows.

Machine learning is essentially an approximation of the true model of a problem. We choose an approximation that we believe is good, and this approximation is called a hypothesis. There is no doubt, however, that the true model is unknown (if it were known, why would we need machine learning? We could solve the problem directly with the true model, right?). Since we do not know the true model, we cannot know the gap between our chosen hypothesis and the true solution. For example, we believe that the universe was born about 15 billion years ago. This hypothesis describes many of the phenomena we observe, but how far does it deviate from the true model of the cosmos? No one can say, because we simply do not know what the true model of the cosmos is.

The error between the hypothesis and the true solution of the problem is called risk (more strictly, the accumulation of errors is called risk). After we select a hypothesis (more concretely, after we obtain a classifier), we cannot know the true risk, but we can approximate it with a quantity we can actually measure. The most intuitive idea is to use the difference between the classifier's results on the sample data and the true labels (since the samples are labeled and therefore accurate data). This difference is called the empirical risk Remp(w). In the past, machine learning took minimizing empirical risk as its goal, but it was later found that many classification functions can easily achieve 100% accuracy on the training set yet perform terribly in actual classification (that is, they generalize poorly). In such cases a classification function that is complex enough (with a very high VC dimension) can memorize every training sample precisely, while misclassifying data outside the training set. Looking back at the principle of empirical risk minimization, its main premise is that the empirical risk must genuinely approach the true risk (in jargon, this is called consistency). But does it? The answer is no, because the number of training samples is a drop in the bucket compared with the number of texts to be classified in the real world. Empirical risk minimization only guarantees low error on that tiny fraction of samples; it certainly cannot guarantee that there are no errors on the vastly larger body of real texts.

Therefore, statistical learning introduces the concept of the generalization error bound: the true risk should be characterized by two parts. The first is the empirical risk, which represents the classifier's error on the given samples; the second is the confidence risk, which represents the extent to which we can trust the classifier's results on unknown texts. Obviously, the second part cannot be computed exactly; only an estimated range can be given, so the whole expression is only an upper bound on the error rather than an exact value (which is why it is called a generalization error bound and not the generalization error).

The confidence risk depends on two quantities. One is the number of samples: obviously, the more samples we are given, the more likely our learning results are to be correct, and the lower the confidence risk. The other is the VC dimension of the classification function: obviously, the larger the VC dimension, the worse the generalization ability, and the larger the confidence risk.

The formula for the generalization error bound is:

R(w) ≤ Remp(w) + Φ(n/h)

In this formula, R(w) is the true risk, Remp(w) is the empirical risk, and Φ(n/h) is the confidence risk. The goal of statistical learning thus shifts from minimizing empirical risk alone to minimizing the sum of empirical risk and confidence risk, that is, to minimizing structural risk.
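
For reference, a commonly quoted concrete form of this bound (Vapnik's VC bound; the exact expression is not spelled out above, so this is the standard statement rather than the original author's) holds with probability at least 1 − η for n samples and a function class of VC dimension h:

    R(w) \le R_{emp}(w) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}

The square-root term plays the role of the confidence risk Φ(n/h): it grows with the VC dimension h and shrinks as the number of samples n grows, matching the discussion above.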

SVM is precisely such an algorithm: one that strives to minimize structural risk.

Other features of SVM are easy to understand.

"Small sample" does not mean that the absolute number of samples is small (in fact, more samples almost always help any algorithm); it means that, relative to the complexity of the problem, SVM requires comparatively few samples.

"Non-linear" refers to SVM being good at handling cases where the sample data is not linearly separable, mainly through slack variables (also called penalty variables) and kernel functions. This part is the essence of SVM and will be discussed in detail later. Frankly, whether the text classification problem is linearly separable is still an open question, so we cannot simply assume linear separability for convenience; we have to start by treating the data as possibly non-separable (linear separability is just a special case anyway, and we have never been afraid of a method being too general).
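
As a rough sketch of what slack variables and kernels buy in practice (this uses scikit-learn and toy XOR-style data, neither of which appears in the original text; it only illustrates the idea, not the derivation that comes later in the series):

    import numpy as np
    from sklearn.svm import SVC

    # Four XOR-style points: no straight line can separate the two classes.
    X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
    y = np.array([1, 1, -1, -1])

    # Linear kernel with penalty parameter C: the slack variables let the fit
    # tolerate misclassified points instead of failing outright.
    linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # RBF kernel: implicitly maps the points into a space where they become separable.
    rbf_clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)

    print(linear_clf.score(X, y))  # cannot reach 1.0: XOR is not linearly separable
    print(rbf_clf.score(X, y))     # 1.0 here: the kernel handles the non-linearity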

"High-dimensional pattern recognition" refers to the samples having a high dimension. For example, if the vector representation of a text is not put through the dimensionality reduction described in another series of articles (Getting Started with Text Classification), tens of thousands of dimensions are normal. Most other algorithms basically cannot cope with this, but SVM can, mainly because the classifier SVM produces is simple and uses very little of the sample information (only the samples called "support vectors" are used, which will be covered later), so even very high-dimensional samples do not cause much trouble in storage or computation. (By contrast, KNN uses all samples for classification; with a huge number of samples, each of high dimension, things quickly become unmanageable.)
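
A small sketch of the "only support vectors are used" point (scikit-learn again, with synthetic high-dimensional data made up purely for the illustration):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # 200 synthetic "documents", each a 10,000-dimensional feature vector,
    # labeled by a simple rule on the first coordinate.
    X = rng.normal(size=(200, 10_000))
    X[:100, 0] += 5.0
    X[100:, 0] -= 5.0
    y = np.array([1] * 100 + [-1] * 100)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # The fitted model is defined entirely by its support vectors: the decision
    # values can be reproduced from support_vectors_, dual_coef_ and intercept_.
    manual = clf.dual_coef_ @ (clf.support_vectors_ @ X[:5].T) + clf.intercept_
    print(np.allclose(manual, clf.decision_function(X[:5])))   # True
    print(len(clf.support_), "support vectors out of", len(X), "training samples")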

The next section will officially begin the discussion of SVM. Bear with me.

 

SVM Introduction (2): Linear Classifiers, Part 1

A linear classifier (in a certain sense, also called a perceptron) is the simplest and yet quite effective form of classifier. Within the linear classifier we can see how the ideas of SVM take shape and meet many of SVM's core concepts.

Consider a small example: a classification problem in a two-dimensional space with only two classes of samples.

C1 and C2 are the two classes to be distinguished, shown as sample points in the two-dimensional plane. The straight line in the middle is a classification function that separates the two classes completely and correctly. In general, if a linear function can completely and correctly separate the samples, we say the data is linearly separable; otherwise it is non-linear (linearly inseparable).

What is a linear function? In a one-dimensional space it is a point, in a two-dimensional space a straight line, in a three-dimensional space a plane, and so on. If we do not fix the dimension of the space, this kind of linear function has a unified name: the hyperplane.

In fact, a linear function is a real-valued function (that is, its value is a continuous real number), whereas our classification problem (for example, the binary classification here, which answers whether a sample belongs to a category or not) requires a discrete output, for example 1 meaning a sample belongs to class C1 and 0 meaning it does not (and if it does not belong to C1, it belongs to C2). In this case we only need to attach a threshold to the real-valued function and decide the category by checking whether the value of the classification function is greater or less than the threshold. For example, suppose we have the linear function

g(x) = wx + b

We can set the threshold to 0, so that when a sample xi needs to be classified we look at the value of g(xi): if g(xi) > 0, xi is assigned to class C1; if g(xi) < 0, to class C2 (when it equals 0, we simply refuse to decide, ha). This is equivalent to attaching a sign function sgn() to g(x), so that f(x) = sgn[g(x)] is our actual decision function.
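
As a minimal sketch of this decision rule (the weights here are made up for illustration, not learned from data):

    import numpy as np

    def g(x, w, b):
        # Linear score g(x) = w.x + b
        return np.dot(w, x) + b

    def f(x, w, b):
        # Decision function f(x) = sgn(g(x)): +1 -> class C1, -1 -> class C2, 0 -> undecided
        return int(np.sign(g(x, w, b)))

    w = np.array([0.4, -0.7])      # hypothetical weight vector
    b = 0.1                        # hypothetical bias
    x_i = np.array([3.0, 8.0])     # the sample point (3, 8) mentioned below
    print(f(x_i, w, b))            # prints -1, i.e. class C2 for these made-up weights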

Note three things about the expression g(x) = wx + b. First, the x in the formula is not the horizontal coordinate in the two-dimensional plane but the vector representation of the sample: for example, if a sample point has coordinates (3, 8), then xT = (3, 8), not x = 3 (by convention vectors are column vectors, so when written as a row vector the transpose is added). Second, this form is not limited to the two-dimensional case; in an n-dimensional space we can still use the same expression, with w becoming an n-dimensional vector (in the two-dimensional example, w is a two-dimensional vector; for brevity, no distinction is made below between a column vector and its transpose, as attentive readers will notice). Third, g(x) is not the expression of the line in the middle of the figure; that line is given by g(x) = 0, that is, wx + b = 0, and we also call this the classification surface.

In fact, it is easy to see that the dividing line in the middle is not unique: we can rotate it slightly, and as long as neither class is misclassified it still achieves the result above; we can also translate it a little. This raises a question: when there are multiple classification functions for the same problem, which one is better? Clearly we first need an indicator that quantifies how "good" a classifier is, and the usual indicator is called the classification margin. Next we will look at the classification margin in detail and fill in the relevant mathematical background.

SVM Introduction (3): Linear Classifiers, Part 2

Last time we noted that text classification is an ill-posed problem (a problem with more than one solution is called ill-posed), so we need an indicator to measure the quality of a solution (that is, of the classification model we build through training), and the classification margin is a good such indicator.

When classifying texts, we let the computer look at the training samples we provide. Each sample consists of a vector (the vector of text features) and a label (which indicates the sample's category), written as follows:

Di = (xi, yi)

Here xi is the text vector (of high dimension) and yi is the class label.

In binary linear classification, the class label takes only two values, 1 and -1 (indicating whether the sample belongs to the category or not). With this notation we can define the margin of a sample point with respect to a hyperplane:

δi = yi(wxi + b)

At first glance this formula has nothing mysterious about it, and no justification is needed: it is just a definition. But some interesting things emerge when we transform it.

First notice that if a sample belongs to the category, then wxi + b > 0 (remember? The g(x) = wx + b we chose decides the class by being greater or less than 0), and yi is also greater than 0; if it does not belong to the category, then wxi + b < 0 and yi is less than 0. In either case yi(wxi + b) is always greater than 0, and its value equals |wxi + b|, that is, |g(xi)|.

Now we normalize w and b, replacing them with w/||w|| and b/||w|| respectively, so the margin can be written as δi = yi((w/||w||)·xi + b/||w||) = |g(xi)| / ||w||.

Doesn't this formula look familiar? That's right: it is exactly the formula for the distance from the point xi to the line g(x) = 0 (recall that g(x) = 0 is the classification hyperplane mentioned in the previous section).

A small tip: what does the symbol ||w|| mean? ||w|| denotes the norm of the vector w, a measure of the vector's length. The "length" of a vector that we usually speak of is in fact its 2-norm. The most general form is the p-norm, which can be written as follows.

For a vector w = (w1, w2, w3, ..., wn),

its p-norm is ||w||p = (|w1|^p + |w2|^p + ... + |wn|^p)^(1/p).

When p = 2, isn't this exactly the usual vector length? When p is not specified, as in ||w|| here, it either means we do not care which norm is used (any norm will do), or the value of p has already been stated and is omitted for convenience.
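
A quick numeric sketch of these norms (using NumPy's norm routine; the vector is arbitrary):

    import numpy as np

    w = np.array([3.0, -4.0, 12.0])

    # p-norm: (|w1|^p + |w2|^p + ... + |wn|^p)^(1/p)
    print(np.linalg.norm(w, ord=1))            # 1-norm: 3 + 4 + 12 = 19
    print(np.linalg.norm(w, ord=2))            # 2-norm (the usual length): sqrt(9 + 16 + 144) = 13
    print((np.abs(w) ** 3).sum() ** (1 / 3))   # 3-norm computed directly from the definition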

When the normalized w and b replace the original values, the margin has a special name: the geometric margin. The geometric margin is exactly the Euclidean distance from the point to the hyperplane, and below we will simply call it the "distance". The above is the distance from a single point to a hyperplane (that is, its margin; from here on we will not distinguish between the two words). We can likewise define the distance from a set of points (that is, a group of samples) to a hyperplane as the distance from the hyperplane to the point in the set that is closest to it. The following figure illustrates the meaning of the geometric margin:

H is the classification surface, while H1 and H2 are lines parallel to H that pass through the two samples closest to H. The distance between H1 and H (equally, between H2 and H) is the geometric margin.
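
A small sketch of these definitions (hypothetical w, b, and sample points, not taken from the figure):

    import numpy as np

    w = np.array([2.0, 1.0])                   # hypothetical hyperplane parameters
    b = -4.0

    X = np.array([[3.0, 1.0], [4.0, 2.0], [1.0, 0.5], [0.0, 1.0]])   # sample vectors
    y = np.array([1, 1, -1, -1])               # class labels (+1 / -1)

    g = X @ w + b                              # g(xi) = w.xi + b for every sample
    functional_margin = y * g                  # yi * g(xi), positive when correctly classified
    geometric_margin = functional_margin / np.linalg.norm(w)   # divide by ||w||: distance to the plane

    print(geometric_margin)                    # per-sample distances to the hyperplane
    print(geometric_margin.min())              # margin of the whole set: distance of the closest sample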

The reason we care so much about the geometric margin is that it is related to the number of misclassifications a classifier can make:
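
A classical bound of this kind is the perceptron mistake bound usually attributed to Novikoff, quoted here in one common form (constant factors differ between statements):

    \text{number of misclassifications} \;\le\; \left(\frac{2R}{\delta}\right)^{2}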

 

Here δ is the margin from the sample set to the classification surface, and R = max ||xi||, i = 1, ..., n; that is, R is the length of the longest sample vector (xi is the i-th sample represented as a vector), which reflects how spread out the samples are. We do not need to go into the precise definition and derivation of the misclassification count; just remember that it represents, to some extent, the classifier's error. The formula shows that the upper bound on the number of misclassifications is determined by the geometric margin (for a given sample set, of course)!

At this point we understand why the geometric margin was chosen as the indicator for evaluating how good a solution is: the larger the geometric margin, the smaller the upper bound on the error. Therefore maximizing the geometric margin is our goal during training. Incidentally, contrary to what some half-baked authors have written, maximizing the classification margin is not an SVM patent; the idea existed as early as the era of linear classifiers.

Link: http://blog.sina.com.cn/s/blog_5e310be90100yxtd.html
