SVM (Support Vector Machine)
Support vectors: the data points in the training set that are most difficult to classify, namely those closest to the classification decision surface.
"Machine": that is, machine machines, is actually an algorithm. In the field of machine learning, some algorithms are often regarded as a machine (or learning machine, predictive function, learning function, etc.).
SVM is a supervised learning method: the categories of the training points are known, as is the correspondence between each training point and its category.
SVM is mainly used for learning, classification, and prediction (regression) on small-sample data; in this respect it resembles instance-based learning methods such as case-based reasoning and decision-tree induction algorithms.

1.1 The theoretical basis of support vector machines
– 1.1.1 Empirical risk minimization (ERM)
– 1.1.2 The key theorem and the VC dimension
– 1.1.3 Structural risk minimization (SRM)
1.2 Mathematical derivation of SVM
– 1.2.1 The maximum-margin hyperplane
– 1.2.2 The Lagrange multiplier method
– 1.2.3 KKT conditions and the dual transformation
– 1.2.4 The classifier function
– 1.2.5 Mapping to a high-dimensional space
– 1.2.6 The kernel function method
– 1.2.7 Slack variables for outliers
1.3 The SMO algorithm

1.1 The theoretical basis of support vector machines
Support vector machines are based on statistical learning theory, which addresses the structure-selection and local-minimum problems (over-fitting and under-fitting) of second-generation neural networks.
Statistical learning theory analyzes, from a statistical standpoint, the gap between the empirical risk and the true risk for finite samples; it introduces the concept of the confidence interval, establishes the theory of structural risk minimization, and provides a unified evaluation framework for all kinds of machine learning algorithms.

1.1.1 Empirical risk minimization (ERM)
First, let us start with the concepts of the population and the sample. The population can be understood as the objective thing (system) itself; in statistics, the population can be understood as a probability distribution. Knowing this true distribution exactly is difficult (and in practice impossible), so we draw representative objects from the population, and these form a sample of the population.
We use the distribution of the sample as an approximate model, or empirical model, of the population. The risk represents the error between the empirical model and the true model.
If a classifier treats the distribution of the empirical model as that of the population, we call the error of this classifier the empirical risk.
For convenience, we evaluate this error with a loss function.
Suppose the samples are $(x_1,y_1),\dots,(x_n,y_n) \in \mathbb{R}^N \times \mathbb{R}$. For discrete samples the empirical risk can be written as

$$R_{emp}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i,\alpha)\bigr)$$

where $R_{emp}(\alpha)$ is the so-called empirical risk, $f(x_i,\alpha)$ is the objective function used to minimize the empirical risk, and $L(y_i, f(x_i,\alpha))$ is the loss function, representing for each $x_i$ the deviation between the label $y_i$ and the output of the objective function.
Examples include the global error function in a BP neural network, or the updating of the slope and intercept in the gradient descent method.
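To make the formula concrete, here is a minimal Python sketch (my own illustration, not from the original text) that computes $R_{emp}(\alpha)$ for a linear model $f(x,\alpha)=\alpha^\top x$ with a squared loss; the data and parameter values are invented purely for demonstration.

```python
# Minimal sketch (illustrative only): empirical risk R_emp(alpha) as the
# average loss over n samples, here with a squared loss and a linear model.
import numpy as np

def empirical_risk(X, y, alpha, loss=lambda y_true, y_pred: (y_true - y_pred) ** 2):
    """R_emp(alpha) = (1/n) * sum_i L(y_i, f(x_i, alpha)), with f(x, alpha) = alpha . x."""
    predictions = X @ alpha            # f(x_i, alpha) for every sample
    return np.mean(loss(y, predictions))

# Toy data: 4 samples in R^2 with real-valued labels (made up for the example).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, -1.0, 0.5, 1.5])
alpha = np.array([1.0, -1.0])

print(empirical_risk(X, y, alpha))     # average squared deviation over the sample
```

Swapping in a different loss (for example, the 0-1 loss for classification or the hinge loss used by SVMs) only changes the `loss` argument; the averaging structure of $R_{emp}$ stays the same.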
In second-generation neural networks, the minimum empirical risk is used in place of the true risk; this is the empirical risk minimization (ERM) principle, and such networks take minimizing the empirical risk as the main basis for measuring the accuracy of the algorithm. We find that this does not achieve the globally optimal classification effect and easily causes over-fitting, for two reasons: first, the sample may not be representative; second, the theory behind the learning algorithm is incomplete, that is, the method of estimating the true risk is incomplete.
– Over-fitting / over-learning: the training error is made so small that the generalization ability decreases, i.e., the true risk increases.
– Generalization ability: the accuracy of predictions on unknown (unseen) samples.
Solving these problems leads, via statistical methods, to the question of learning consistency.

1.1.2 The key theorem and the VC dimension
Learning consistency: the problem of relating the empirical risk to the true risk.
The first definition of statistical learning / the definition of learning consistency
If an ERM algorithm over a function set $Q(x,\alpha)$ makes both the true risk and the empirical risk converge to the infimum (lower bound) of $R(\alpha)$, then the ERM algorithm satisfies learning consistency.
PS: this parallels the discussion above.
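For reference, a common formal statement of this condition (following Vapnik's statistical learning theory; the symbol $\alpha_n$, denoting the function selected by ERM from $n$ samples, is introduced here and does not appear in the text above):

$$R(\alpha_n) \xrightarrow{\,P\,} \inf_{\alpha} R(\alpha), \qquad R_{emp}(\alpha_n) \xrightarrow{\,P\,} \inf_{\alpha} R(\alpha) \qquad \text{as } n \to \infty,$$

that is, both the true risk and the empirical risk of the ERM solution converge in probability to the smallest achievable true risk.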
The key theorem of learning theory
The key theorem transforms the problem of learning consistency into a convergence problem: the function found by minimizing the empirical risk should approximate the function minimizing the true (expected) risk. In other words, the empirical risk minimization principle satisfies the learning-consistency condition only if the convergence holds uniformly over the function set $Q(x,\alpha)$, i.e., even for the worst-case function in the set.
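A standard way of writing this worst-case (uniform) convergence requirement, following Vapnik's formulation of the key theorem (the precise expression is supplied here for concreteness and is not spelled out in the text above):

$$\lim_{n \to \infty} P\Bigl(\sup_{\alpha}\bigl(R(\alpha) - R_{emp}(\alpha)\bigr) > \varepsilon\Bigr) = 0 \qquad \text{for every } \varepsilon > 0.$$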
The tool used to measure the capacity of the function set is the VC dimension.
Definition of the VC dimension: suppose there exists a set of $h$ samples that can be separated by functions from the function set into all possible $2^h$ labelings (the sample set is then said to be shattered); the VC dimension of the function set is the largest such $h$.
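As a concrete illustration (my own example, not taken from the original text): linear classifiers in the plane can shatter three non-collinear points but not the four-point XOR configuration, so their VC dimension is 3. The Python sketch below checks this empirically using scikit-learn's SVC with a linear kernel; the large value of C approximates a hard-margin classifier.

```python
# Illustrative sketch: verify that a linear classifier shatters 3 non-collinear
# points in R^2 but cannot shatter the 4-point XOR configuration.
import itertools
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """Return True if a linear classifier realizes every one of the 2^h labelings."""
    h = len(points)
    for labels in itertools.product([0, 1], repeat=h):
        if len(set(labels)) < 2:
            continue                       # single-class labelings are trivially realizable
        clf = SVC(kernel="linear", C=1e6)  # large C: effectively a hard margin
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False                   # this labeling cannot be realized
    return True

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # non-collinear
xor_points   = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # XOR layout

print(can_shatter(three_points))  # True  -> 3 points can be shattered
print(can_shatter(xor_points))    # False -> the XOR labeling is not linearly separable
```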