SVM Learning (1): SVM Concept


First, we will introduce the concept of SVM.

Support vector machines were first proposed by Cortes and Vapnik in 1995. They show many distinctive advantages in solving small-sample, non-linear, and high-dimensional pattern recognition problems, and they can also be applied to other machine learning problems such as function fitting.
The SVM method is built on the VC dimension theory and the structural risk minimization principle of statistical learning theory. Based on limited sample information, it seeks the best trade-off between model complexity (i.e., the learning accuracy on the given training samples) and learning ability (i.e., the ability to classify arbitrary samples without error), in order to obtain the best generalization ability.

The above is the introduction to SVM most often quoted in the academic literature. I will break it down and explain it piece by piece.

Vapnik is a master of statistical machine learning; his published Statistical Learning Theory is a masterpiece that lays out the ideas of the field and needs no further praise. In that book he demonstrates in detail the essential difference between statistical machine learning and traditional machine learning: statistical machine learning can precisely characterize the learning result and answer a series of questions, such as how many samples are required. Compared with this precise way of thinking, traditional machine learning is basically a matter of crossing the river by feeling for the stones. Building a classification system with traditional methods is entirely a craft: one person may get good results while another, using much the same approach, gets poor ones, because there is no guiding theory or principle.

The so-called VC dimension is a measure of a class of functions and can be understood, roughly, as the complexity of a problem: the higher the VC dimension, the more complex the problem. It is precisely because SVM pays attention to the VC dimension that, as we will see later, the SVM solution is independent of the dimensionality of the samples (even when the samples have tens of thousands of dimensions), which makes SVM well suited to text classification. Of course, it is the introduction of kernel functions that provides this capability.

Structural risk minimization sounds lofty, but it amounts to the following.

Machine learning is essentially an approximation to the real model of a problem (we choose an approximate model that we believe is good; this approximate model is called a hypothesis). The real model is, without doubt, unknown (if we knew it, why would we need machine learning? We could just solve the problem with the real model directly, right?). Since we do not know the real model, we cannot know how far the hypothesis we choose is from the true solution of the problem. For example, we believe the universe was born about 15 billion years ago. This hypothesis describes many of the phenomena we observe, but how far is it from the real model of the universe? No one can say, because we have no idea what the real model of the universe actually is.

The error between the hypothesis and the true solution is called risk (more strictly, the accumulation of error is called risk). After we choose a hypothesis (more concretely, after we obtain a classifier), we cannot know the true error, but we can approximate it with quantities we can actually measure. The most intuitive idea is to use the difference between the classifier's results on the sample data and the true results (the samples are labeled, and therefore accurate, data). This difference is called the empirical risk Remp(w). Earlier machine learning methods took minimizing the empirical risk as their goal, but it was later found that many classification functions can easily reach 100% accuracy on the sample set and yet perform terribly in real classification (that is, they have poor generalization ability). In such cases, a sufficiently complex classification function (one with a high VC dimension) has been chosen that precisely memorizes every sample but misclassifies the data outside the sample. Looking back at the principle of empirical risk minimization, we find that its key premise is that the empirical risk must actually converge to the true risk (in the jargon, the two must be consistent). Can it, in practice? The answer is no, because the number of samples is a drop in the bucket compared with the number of texts to be classified in the real world. Minimizing empirical risk only guarantees no error on this tiny fraction of samples; it certainly cannot guarantee no error on the much larger fraction of real texts.
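To make this concrete, here is a minimal sketch of empirical risk minimization gone wrong, assuming Python with scikit-learn; the synthetic dataset and the 1-nearest-neighbour "memorizing" classifier are illustrative choices of mine, not something from the original article:

```python
# A model that memorizes the training set: empirical risk goes to (near) zero,
# but the error on unseen data stays large, i.e. poor generalization ability.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# A small, noisy sample drawn from an unknown "real model".
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1-nearest-neighbour simply remembers every training point.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))  # close to 1.0
print("test accuracy:    ", model.score(X_test, y_test))    # noticeably lower
```

Minimizing the empirical risk alone rewards exactly this kind of memorization, which is why it is not a reliable goal on its own.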

Statistical learning therefore introduces the concept of the generalization error bound: the true risk should be characterized by two parts. The first is the empirical risk, which represents the classifier's error on the given samples. The second is the confidence risk, which represents the extent to which we can trust the classifier's results on unknown texts. Obviously, the second part cannot be computed exactly; only an estimated range can be given, so the total error can only be bounded from above rather than computed precisely (which is why it is called a generalization error bound, not a generalization error).

The confidence risk depends on two quantities. One is the sample count: obviously, the more samples we are given, the more likely our learning result is to be correct, and the lower the confidence risk. The other is the VC dimension of the classification function: obviously, the larger the VC dimension, the worse the generalization ability and the larger the confidence risk.

The formula for the generalization error bound is:

R(w) ≤ Remp(w) + Φ(n/h)

Here R(w) is the true risk, Remp(w) is the empirical risk, and Φ(n/h) is the confidence risk, where n is the number of samples and h is the VC dimension. The goal of statistical learning thus shifts from minimizing the empirical risk alone to minimizing the sum of the empirical risk and the confidence risk, that is, to minimizing the structural risk. SVM is precisely an algorithm that strives to minimize the structural risk.
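To give Φ(n/h) a concrete shape, here is a small sketch of one standard form of the VC confidence term from statistical learning theory, Φ = sqrt((h(ln(2n/h) + 1) − ln(η/4)) / n), which holds with probability at least 1 − η; the Python function and the sample values of n, h, and η below are my own illustrative assumptions:

```python
import math

def vc_confidence(n, h, eta=0.05):
    """One common form of the confidence-risk term in the VC bound:
    R(w) <= Remp(w) + sqrt((h * (ln(2n/h) + 1) - ln(eta/4)) / n),
    holding with probability at least 1 - eta."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# More samples (larger n) shrink the confidence risk;
# a higher VC dimension h inflates it.
for n, h in [(1000, 10), (10000, 10), (1000, 100)]:
    print(f"n={n:6d}  h={h:4d}  confidence risk ~ {vc_confidence(n, h):.3f}")
```

As the printout suggests, increasing the sample count shrinks the confidence risk, while increasing the VC dimension inflates it, matching the qualitative description above.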

Other features of SVM are easy to understand.
Small sample does not mean that the absolute number of samples is small (in fact, for any algorithm, more samples always give better results); rather, it means that compared with the complexity of the problem, SVM requires relatively few samples.
Non-linearity means that SVM is good at handling sample data that is not linearly separable, mainly through slack variables (also called penalty variables) and the kernel function technique. This part is the essence of SVM and will be discussed in detail later. To say a bit more: there is still no conclusive answer as to whether text classification is a linearly separable problem, so we cannot simply assume it is and simplify accordingly; until the matter is settled, we have to treat it as linearly non-separable (after all, the linearly separable case is just a special case of the non-separable one, and we have never been afraid of a method being too general).
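As a minimal illustration of this point (assuming Python with scikit-learn; the concentric-circles dataset and the RBF kernel are illustrative choices of mine, not the article's), a kernel SVM separates data that no straight line can, while the parameter C prices violations of the margin, i.e., the slack variables:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: the two classes cannot be separated by any straight line.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space where
# a linear separator exists; C penalizes the slack (margin violations).
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)
lin = SVC(kernel="linear", C=1.0).fit(X, y)

print("RBF-kernel training accuracy:   ", rbf.score(X, y))  # close to 1.0
print("linear-kernel training accuracy:", lin.score(X, y))  # much lower
```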
High-dimensional pattern recognition means that the sample dimensionality is very high. For example, if the vector representation of a text has not been through the dimensionality reduction discussed in another series of articles (Getting Started with Text Classification), tens of thousands of dimensions is perfectly normal. Other algorithms are basically unable to cope, but SVM can, mainly because the classifier SVM produces is very compact and uses very little of the sample information (it uses only the samples called "support vectors"; more on this later), so even very high sample dimensionality does not cause much trouble for storage or computation. (By contrast, kNN uses all the samples when classifying; with a huge number of samples, each of high dimensionality, life becomes impossible...)
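Here is a small sketch of the "few support vectors" point (again assuming scikit-learn; the synthetic high-dimensional dataset stands in for vectorized text and is my own choice):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# High-dimensional samples, as in text classification without dimensionality reduction.
X, y = make_classification(n_samples=500, n_features=5000, n_informative=50,
                           random_state=0)

clf = SVC(kernel="linear").fit(X, y)

# The decision function depends only on the support vectors, typically a fraction
# of the training set, so storage and prediction stay manageable even though each
# sample has thousands of dimensions.
print("training samples:", X.shape[0])
print("support vectors: ", clf.support_vectors_.shape[0])
```

By contrast, kNN must store and compare against every training sample at prediction time, which is exactly the burden described above.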

The next section begins the formal discussion of the SVM algorithm. Please bear with me if I have been long-winded!

(Sina Weibo: @quanliang_machine learning)

Reprinted from: http://www.blogjava.net/zhenandaci/archive/2009/02/13/254519.html
