SVM Classification in Relation Extraction: Handling Unbalanced Data with Slack Variables and Penalty Factors


1. Problem Description

Relation (link) extraction here means extracting from product reviews the target phrase that describes a product feature and the opinion phrase that modifies that target. It is an important task in opinion mining, and many data mining and NLP papers address it. The basic approaches are:

(1) Select candidate target nodes and candidate opinion nodes from the sentence parse tree (produced, for example, by the Stanford parser), extract features for every candidate target-opinion combination, and train an SVM classifier on these pairs.

(2) Given seed opinion words, use path rules over the parse tree to extract new targets and opinions over multiple iterations; this is called double propagation (Qiu et al., IJCAI '09).

(3) Relation extraction using a tree kernel (Wu et al., EMNLP '09).

If we adopt the first approach, we run into an imbalance between positive and negative samples when training the SVM. The reason is that even if candidate target and opinion extraction achieves 100% recall and precision, most target-opinion combinations are negative examples, and only a very small proportion are positive. In my experiments the negative samples far outnumbered the positive ones, and when an SVM was trained directly, all test samples were classified as negative. This is known as the unbalanced data classification problem.

2. Solution

Set different penalty factors for different classes, using libsvm's -c and -wi options. For a detailed explanation, see the section on unbalanced data and solving the two-variable sub-problem in "LIBSVM: A Library for Support Vector Machines". C is the penalty factor; different penalty values can be assigned to the positive and negative classes, and this value can be understood as the cost the classifier being trained pays for misclassifying a sample of that class.

For example, take an unbalanced dataset in which the positive class +1 accounts for 10% of the samples and the negative class -1 accounts for 90%. In principle, when we train an SVM classifier, misjudging a +1 sample as -1 should incur a larger penalty. Compare the example in the libsvm FAQ: "svm-train -s 0 -c 10 -w1 1 -w-1 5 data_file" makes the penalty for class -1 larger; note that the -wi option applies to C-SVC only.
So, to handle the unbalanced data in relation extraction, the command should be: svm-train -s 0 -c 10 -w1 9 -w-1 1 data_file. Using C-SVC, the penalty for misclassifying a +1 sample is 10 * 9 = 90, while for a -1 sample it is 10 * 1 = 10. This setting makes the classifier more willing to predict positive examples, so the output is no longer all -1. We can then compare precision and recall under different penalty factor settings.
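The same weighting scheme can be reproduced outside the libsvm command-line tools. Below is a minimal sketch using scikit-learn's SVC (which wraps libsvm), where class_weight plays the role of the -wi options, so the effective penalty for class k is C * class_weight[k]; the random features and labels are placeholders for whatever features you extract from candidate target-opinion pairs.

import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))              # placeholder pair features
y = np.where(rng.random(1000) < 0.1, 1, -1)  # roughly 10% positive, 90% negative

# Equivalent in spirit to: svm-train -s 0 -c 10 -w1 9 -w-1 1 data_file
clf = SVC(kernel="linear", C=10, class_weight={1: 9, -1: 1})
clf.fit(X, y)

# Evaluated on the training data only to illustrate the API.
pred = clf.predict(X)
print(precision_recall_fscore_support(y, pred, labels=[1], zero_division=0))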

3. Slack Variables and Penalty Factors in SVM

This part is adapted from another blog post.

3.1 Slack Variables

Suppose we have mapped a text classification problem that is not linearly separable in its original space into a high-dimensional space, where it becomes linearly separable, like this:

There are thousands of circular and square points (after all, this is the number of documents in our training set, which is of course large). Now imagine another training set that has just one more article than the original one. After mapping it into the high-dimensional space (with the same kernel function, of course), one extra sample point appears, but it lands in a position like this:

It is the yellow square point in the figure, and therefore a negative sample. This single sample turns a linearly separable problem into a linearly inseparable one. Problems of this kind, where only a few points prevent linear separation, are called approximately linearly separable problems.

Common sense would ask: if ten thousand points obey a certain law (and are therefore linearly separable) and one point does not, does that one point reveal some aspect our classification rule has overlooked (so that the rule should be modified to accommodate it)?

In fact, it is far more likely that this sample point is simply an error: it is noise, introduced by the person who manually labeled the training set. So we would simply ignore this point and keep using the original classifier, and the result would not be affected at all.

However, this tolerance of noise comes from human judgment; our program has none. The original optimization problem must take every sample point into account (none can be ignored, because how would the program know which one to drop?), and on that basis it looks for the maximum geometric margin between the positive and negative classes; since the geometric margin is a distance and therefore non-negative, a noise point like the one above makes the whole problem infeasible. This formulation is called "hard margin" classification, because it requires every sample point to lie at least a certain distance from the separating hyperplane.

The example above also shows that the result of hard margin classification is easily dictated by a handful of points, which is dangerous (there is a saying that truth is always in the hands of the few, but that is just a small group amusing itself; we still have to be democratic).

But the fix is also obvious: simply allow some points to fall short of the original distance requirement. Since the distance scales of points differ across training sets, working with functional margins (rather than geometric margins) keeps the expressions concise. Our original requirement on the sample points was:

$$ y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, l $$
This means that even the sample points closest to the separating hyperplane have a functional margin of at least 1. To introduce fault tolerance, we add a slack variable to the hard threshold of 1, that is, we allow:

$$ y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l $$
Because the slack variables are non-negative, the net effect is that the margin is allowed to be smaller than 1. When the margin of some points falls below 1 (such points are called outliers), we are giving up on classifying them exactly, which is a loss for our classifier. But giving up these points also brings a benefit: the separating hyperplane no longer has to bend toward them, so a larger geometric margin can be obtained (viewed in the low-dimensional space, the classification boundary is also smoother). Obviously, we must weigh the loss against the benefit. The benefit is clear: the larger the margin we obtain, the better. Recall the original optimization problem for hard margin classification:

$$ \min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \ \ i = 1, \dots, l $$
Here ||w||^2 is our objective function (the coefficient 1/2 is incidental), and we want it to be as small as possible, so the loss must be something that makes it larger (something that makes it smaller would hardly count as a loss; we wanted the objective smaller anyway). There are two common ways to measure the loss. Some people prefer

$$ \sum_{i=1}^{l} \xi_i^2 $$

while others prefer

$$ \sum_{i=1}^{l} \xi_i $$
where l is the number of samples. There is no big difference between the two: the first choice gives what is called a second-order soft margin classifier, the second a first-order soft margin classifier. When the loss is added to the objective function, it needs a weight, the penalty factor (the cost parameter C in libsvm). The optimization problem then becomes:

$$ \min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, l $$
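As a minimal sketch (my example, not from the original post), the objective above can be evaluated for a trained linear SVM: scikit-learn's SVC with a linear kernel exposes w and b, and the slack of each sample is xi = max(0, 1 - y_i (w · x_i + b)).

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
y = np.where(y01 == 0, -1, 1)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

margins = y * (X @ w + b)              # functional margins y_i (w . x_i + b)
xi = np.maximum(0.0, 1.0 - margins)    # slack variables
objective = 0.5 * np.dot(w, w) + C * xi.sum()
print(f"outliers (xi > 0): {(xi > 0).sum()}, objective value: {objective:.3f}")

Increasing C reduces the total slack the solver tolerates, but usually at the cost of a smaller geometric margin (a larger ||w||).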
Note the following points:

(1) Not every sample point effectively has a slack variable attached to it. Only the "outliers" do; equivalently, the slack variables of all non-outlier points are equal to 0. (For the negative class, the outliers are the negative sample points to the right of H2 in the preceding figure; for the positive class, they are the positive sample points to the left of H1.)
(2) The value of a slack variable measures how far the corresponding point has strayed: the larger the value, the farther away the point is.
(3) The penalty factor C determines how much you care about the loss caused by the outliers. Clearly, for the same slack values, the larger C is, the more the outliers add to the objective function, which implies you are more reluctant to give them up. In the most extreme case, setting C to infinity means that the slightest outlier makes the objective infinite and the problem infeasible, and the formulation degenerates back to the hard margin problem.
(4) The penalty factor C is not a variable of the optimization problem. When the problem is solved, C is a value you must specify in advance. With that value fixed, you solve for a classifier and evaluate it on test data; if the result is not good enough, you change C, solve the optimization problem again, obtain another classifier, and check the effect again. This is parameter tuning (see the sketch after this list), which is not at all the same thing as solving the optimization problem itself: while the optimization problem is being solved, C is always a fixed value. Keep that in mind.
(5) Despite the addition of slack variables, this is still an optimization problem (yes, that borders on a truism), and the process of solving it is no more special than for the original hard margin problem.
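Here is a minimal sketch of the tuning loop from note (4), using scikit-learn's GridSearchCV to try several values of C and keep the one with the best cross-validated score; the synthetic data is a placeholder.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # C stays fixed within each fit
    cv=5,
    scoring="f1",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))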

At a high level, the solving process goes like this: first fix a w, that is, fix the three lines in the figure above; then see how large the margin is and how many outliers there are, and compute the value of the objective function; then switch to another set of three lines (you can see that moving the separating line turns some former outliers into non-outliers and some non-outliers into outliers) and compute the objective again; iterate in this way until the w that minimizes the objective function is found.

After all this rambling, the reader can probably sum it up at once: slack variables are just another way of handling linearly inseparable problems. But wait, wasn't the kernel function also introduced to deal with linear inseparability? Why use two methods for one problem?
In fact, the two address subtly different situations. The usual workflow, again taking text classification as the example, is this: in the original low-dimensional space the samples are badly inseparable, and no matter how you place the separating hyperplane there will always be a large number of outliers. At that point you map the samples into a high-dimensional space with a kernel function; the result may still not be perfectly separable, but it is much closer to linear separability than in the original space (that is, it reaches the approximately linearly separable state), and then it is far simpler and more effective to handle the few remaining "stubborn" outliers with slack variables.
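A minimal sketch of that division of labor, on synthetic two-moons data (my example, not the original post's): the RBF kernel handles the gross non-linearity, while a finite C lets the soft margin absorb the few noisy points instead of chasing them.

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

rbf_soft = SVC(kernel="rbf", C=1.0)        # kernel mapping + soft margin
linear_soft = SVC(kernel="linear", C=1.0)  # soft margin alone, no mapping

print("rbf + soft margin   :", cross_val_score(rbf_soft, X, y, cv=5).mean())
print("linear + soft margin:", cross_val_score(linear_soft, X, y, cv=5).mean())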

The soft margin formulation above is in fact the most common form of the SVM. At this point we have a complete SVM framework: simply put, the SVM is a soft margin linear classification method equipped with kernel functions.

3.2 Penalty Factor

What we discuss next is, strictly speaking, not the slack variable itself but something introduced along with it, so this is a fitting place for it: the penalty factor C. Look again at the optimization problem after the slack variables are introduced:

$$ \min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, l $$
Note where C appears, and recall its role (it expresses how much you care about the outliers; the larger C is, the less willing you are to give them up). This is how the SVM formulation has always been written and used. However, nothing says the same penalty factor must be applied to every slack variable: we can give each outlier its own C value, meaning we care about each sample to a different degree. Some samples may be lost if they are lost and misclassified if they are misclassified, and they get a relatively small C; others absolutely must not be misclassified (say, an official document issued by the central government, joking aside), and they get a very large C.
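A minimal sketch of this per-sample idea (my example, not part of the original text): scikit-learn's SVC.fit accepts a sample_weight array that scales C for each individual sample, so "must not misclassify" examples can be given a much larger effective penalty.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

weights = np.ones(len(y))
weights[:10] = 100.0  # pretend the first 10 samples are the "must not lose" ones

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y, sample_weight=weights)  # effective penalty for sample i is C * weights[i]
print(clf.score(X, y))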

Of course, practical use is not that extreme, but a common variation of this idea is used to deal with the "skew" problem in classification.
First, what is sample skew, also called class imbalance? It means that the classes involved in classification (two or more) differ greatly in their number of samples. For example, the positive class has 10,000 samples while the negative class provides only 100, which causes an obvious problem. Look at the figure below:


The square points are the negative class, and H, H1, and H2 are the separating hyperplane and margin boundaries computed from the given samples. Because there are so few negative samples, some negative samples that should exist were simply not provided; had the two gray square points also been given, the computed boundaries would instead be H', H2', and H1, clearly different from the previous result. In fact, the more negative samples we are given, the more likely it is that some of them appear near the gray points, and the closer the computed result gets to the true separating hyperplane. Because of the skew, however, the abundant positive class can push the negative class aside and degrade the accuracy of the result.

One way to deal with skewed datasets is to work on the penalty factor. As you may have guessed, the class with fewer samples (here the negative class) is given a larger penalty factor, indicating that we value these samples highly (they are few to begin with, and if we discard some of them the negative class fares even worse). The slack-variable part of the objective function then becomes:

$$ C_{+} \sum_{i=1}^{p} \xi_i + C_{-} \sum_{j=p+1}^{p+q} \xi_j $$
where i = 1, ..., p indexes the positive samples and j = p+1, ..., p+q indexes the negative samples. The libsvm package uses this approach to deal with skew.

How, then, are C+ and C- determined? Their absolute sizes are found by experiment (parameter tuning), but their ratio can be fixed by some rule. A very intuitive rule is to use the ratio of the two classes' sample counts: in the example just given, if C+ is 5, then C- can be set to 500 (because 10,000 : 100 = 100).
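As a small sketch of that count-based rule (matching the 10,000 : 100 example above), the per-class weights and the corresponding libsvm command-line flags can be derived directly from the label counts:

from collections import Counter

labels = [+1] * 10000 + [-1] * 100             # 10,000 positive vs 100 negative samples
counts = Counter(labels)

C_plus = 5
C_minus = C_plus * counts[+1] // counts[-1]    # 5 * (10000 / 100) = 500

# The same ratio expressed through libsvm's per-class weight options,
# where -c is the shared base penalty and -wi the per-class multiplier:
ratio = counts[+1] // counts[-1]
cmd = f"svm-train -s 0 -c {C_plus} -w1 1 -w-1 {ratio} data_file"
print(C_minus, cmd)   # 500  svm-train -s 0 -c 5 -w1 1 -w-1 100 data_file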

But this is not good enough. Look back at the figure and you will see that the real reason the positive class can "bully" the negative class is not that the negative class has few samples, but that the negative samples are not spread widely enough (they do not extend over the region the negative class should occupy). Here is a concrete example: suppose we want to separate political articles from sports articles. There are many political articles, while the sports class provides only a few articles, all about basketball; the classifier will then clearly be biased toward the political class. Now suppose we add more sports articles, but they are still all about basketball (no football, volleyball, racing, swimming, and so on). Even if the number of sports articles matches the number of political articles, the result will still lean toward the political class, because the sports samples are too concentrated! So a better way to fix the ratio of C+ to C- is to measure how the two classes are distributed, for example by the amount of space each occupies: find for the negative class a "super ball", the high-dimensional analogue of a ball, that encloses all the negative samples, find one for the positive class as well, and compare the two radii to get a rough picture of their distributions. Obviously, the class with the larger radius is spread more widely and should receive the smaller penalty factor.
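A minimal sketch of that spread-based heuristic (my own illustration, with the true minimum enclosing ball replaced by a rough proxy, the maximum distance from the class centroid): the class with the wider spread gets the smaller penalty weight.

import numpy as np

def class_radius(X):
    # Rough spread estimate: max distance from the class centroid.
    centroid = X.mean(axis=0)
    return np.linalg.norm(X - centroid, axis=1).max()

rng = np.random.default_rng(0)
X_pos = rng.normal(0.0, 3.0, size=(500, 20))   # widely spread class
X_neg = rng.normal(5.0, 0.5, size=(50, 20))    # narrow, concentrated class

r_pos, r_neg = class_radius(X_pos), class_radius(X_neg)

# Penalty weights inversely proportional to the radii, normalized to the positive class.
w_pos = 1.0
w_neg = r_pos / r_neg
print(f"radii {r_pos:.2f} vs {r_neg:.2f} -> weights {w_pos:.1f} vs {w_neg:.1f}")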

However, even this is not good enough, because some classes really are very concentrated. That has nothing to do with how many samples were provided; it is a property of the class itself (some topics simply cover a narrow area; the spread of computer articles, for example, is clearly smaller than that of general culture articles). In such cases, even if the radii of the two balls differ greatly, the two classes should not be given different penalty factors.

At this point the reader may be going mad: so the problem simply cannot be solved? Indeed, as it turns out, there is no complete solution; you can only pick a method that is simple but useful for your needs (libsvm, for example, simply uses the sample-count ratio).

