What is machine learning?
Machine learning studies how computers simulate or implement human learning behaviors in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of AI.
General classification of machine learning:
1) Classification (pattern recognition): the system analyzes an input unknown pattern (a description of that pattern) against known classification knowledge to determine the category of the input pattern, for example handwriting recognition (identifying which digit was written).
2) Problem solving: given a goal state, the system must find a sequence of actions that transforms the current state into the goal state.
SVM is generally used for classification (usually the two-class case is handled first and then extended to multiple classes, in the spirit of "one begets two, two begets three, three begets all things").
Problem description
Vector representation: assume a sample has n variables (features): x = (x1, x2, ..., xn)^T
Sample representation:
SVM linear classifier
SVM evolved from the optimal separating hyperplane for linearly separable data. The optimal hyperplane must not only separate the two classes correctly (zero training error) but also maximize the classification margin. SVM seeks a hyperplane that satisfies the classification requirement while keeping the points in the training set as far from it as possible, that is, a separating surface that maximizes the blank region (margin) on both sides of the training set.
The training samples that lie on the hyperplanes H1 and H2, which pass through the points of each class closest to the separating surface and are parallel to the optimal hyperplane, are called support vectors.
Legend:
Problem description:
Assume that the training data is:
It can be divided into a hyperplane:
Normalization:
The classification margin then equals:
That is, maximizing the margin 2/||w|| is equivalent to minimizing (1/2)||w||^2.
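As a quick numeric illustration (w and b below are made-up toy values, not from the text): the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, and after normalization the support vectors satisfy |w·x + b| = 1, so the full margin is 2/||w||:

```python
import numpy as np

# Hyperplane w.x + b = 0; hypothetical toy values for illustration.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0

def distance(x, w, b):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# A support vector satisfies |w.x + b| = 1 after normalization,
# so its distance is 1/||w|| and the full margin is 2/||w||.
x_sv = np.array([2.0, 0.0])        # w.x + b = 6 - 5 = 1 -> on H1
print(distance(x_sv, w, b))        # 1/5 = 0.2
print(2.0 / np.linalg.norm(w))     # margin = 2/||w|| = 0.4
```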
The following two figures give an intuitive feel. Which separator is better?
Take a look at the figure below:
Next we begin to optimize the formula above. Since the derivation requires the Lagrange multiplier method and the KKT conditions, let's first review the relevant background. When solving constrained optimization problems, the Lagrange multiplier method and the KKT conditions are two very important tools. For optimization problems with equality constraints, the optimal value can be obtained with the Lagrange multiplier method; if there are inequality constraints, the KKT conditions can be used. Of course, the conditions obtained by these two methods are only necessary; only for convex functions are they also sufficient. The KKT conditions are a generalization of the Lagrange multiplier method. When I first studied this, I only knew how to apply the two methods mechanically, without understanding why the Lagrange multiplier and KKT conditions work, or why the optimal value must be obtained this way.
Lagrange multipliers and the KKT conditions
Definition: given an optimization problem:
Minimize the objective function:
Subject to the constraint:
Define the Lagrange function:
Set the partial derivatives to zero:
Solving these equations yields the optimal value. This is the Lagrange multiplier method.
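To see the machinery on a concrete toy problem (my own example, not from the text): minimize f(x, y) = x² + y² subject to x + y = 1. Setting the partial derivatives of the Lagrangian L = f + λ·h to zero gives a small linear system:

```python
import numpy as np

# Toy problem (illustrative assumption, not from the text):
# minimize f(x, y) = x^2 + y^2  subject to  h(x, y) = x + y - 1 = 0.
# Stationarity of L = f + lam*h gives the linear system:
#   2x + lam = 0
#   2y + lam = 0
#   x + y    = 1
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, rhs)
print(x, y, lam)   # x = y = 0.5, lam = -1.0
```

At (0.5, 0.5) the gradient of f is parallel to the gradient of h, exactly the tangency condition discussed below.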
The Lagrange multiplier method above is not enough to solve all our problems; next we introduce inequality constraints.
Minimize the objective function:
The constraints become:
Define the Lagrange function:
We can list the equations:
The newly added conditions are called the KKT conditions.
KKT conditions
How do we obtain the optimal value for an optimization problem with inequality constraints? The common method is the KKT conditions. Similarly, write all the inequality constraints, equality constraints, and the objective function into a single expression L(a, b, x) = f(x) + a·g(x) + b·h(x). The KKT conditions state that the optimal value must satisfy the following:
1. The derivative of L(a, b, x) with respect to x is zero;
2. h(x) = 0;
3. a·g(x) = 0;
Solving these three equations yields the candidate optimal values. The third condition is very interesting: since g(x) <= 0, satisfying it requires a = 0 or g(x) = 0. This is the source of many important SVM properties, such as the concept of support vectors.
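A toy instance (again my own, not from the text) showing complementary slackness at work: minimize f(x) = x² subject to g(x) = 1 - x <= 0. Case analysis on a·g(x) = 0 picks out the optimum:

```python
# Toy inequality-constrained problem (illustrative, not from the text):
# minimize f(x) = x^2  subject to  g(x) = 1 - x <= 0.
# L(a, x) = x^2 + a*(1 - x); KKT: dL/dx = 2x - a = 0, a >= 0, a*g(x) = 0.

# Case 1: a = 0  ->  x = 0, but g(0) = 1 > 0 is infeasible, discard.
# Case 2: g(x) = 0  ->  x = 1, then a = 2x = 2 >= 0, all conditions hold.
x_opt, a_opt = 1.0, 2.0
assert 2 * x_opt - a_opt == 0        # stationarity
assert 1 - x_opt <= 0                # primal feasibility
assert a_opt >= 0                    # dual feasibility
assert a_opt * (1 - x_opt) == 0      # complementary slackness
print("optimum x =", x_opt, "f(x) =", x_opt ** 2)
```

In the SVM dual, the same dichotomy (a = 0 or g = 0) is what separates non-support vectors from support vectors.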
2. Why do the Lagrange multiplier method and the KKT conditions yield the optimal value?
Why do these conditions give the optimal value? Let's start with the Lagrange multiplier method. Suppose our objective function is z = f(x), where x is a vector. As z takes different values, we get a family of surfaces whose projections onto the x-space form contour lines; for example, with objective f(x, y), the dotted lines in the figure are contours. Now suppose our constraint is g(x) = 0, a curve in the plane (or on a surface). If g(x) crosses a contour line, the crossing point is a feasible value satisfying both the equality constraint and that contour's objective value, but it is certainly not optimal: crossing means there are other contour lines inside or outside it whose intersections with g(x) give larger or smaller values. The optimal value can only occur where a contour line is tangent to the constraint curve, as shown in the figure; that is, at that point the contour line and the constraint curve have normal vectors pointing in the same direction. Therefore at the optimum the gradient of f(x) must equal a times the gradient of g(x), where the constant a expresses that the two sides point in the same direction. This equation is exactly what you get by differentiating L(a, x) with respect to its parameters. (I'm not sure the description above is clear; if you happen to be physically near me, contact me directly and I can explain in person. Note: the figure is from Wikipedia.)
The KKT conditions are necessary conditions for an optimization problem satisfying strong duality. They can be understood as follows. We want min f(x), with L(a, b, x) = f(x) + a·g(x) + b·h(x) and a >= 0. We can write f(x) as max_{a,b} L(a, b, x). Why? Because h(x) = 0 and g(x) <= 0, the term a·g(x) is <= 0, so when maximizing over a and b, L(a, b, x) attains its maximum only when a·g(x) = 0; otherwise the constraints are not respected. Thus max_{a,b} L(a, b, x) equals f(x) whenever the constraints are satisfied, and our objective can be written as min_x max_{a,b} L(a, b, x). Now consider the dual expression max_{a,b} min_x L(a, b, x). Because our problem satisfies strong duality (strong duality means the optimal value of the dual equals the optimal value of the primal), at the optimum x0 we have f(x0) = max_{a,b} min_x L(a, b, x) = min_x max_{a,b} L(a, b, x) = f(x0). Let's look at what happens in the middle of this chain:
f(x0) = max_{a,b} min_x L(a, b, x) = max_{a,b} min_x [f(x) + a·g(x) + b·h(x)] = max_{a,b} [f(x0) + a·g(x0) + b·h(x0)] = f(x0)
We can see that the chain above essentially says that min_x f(x) + a·g(x) + b·h(x) attains its minimum at x0. By Fermat's theorem, the derivative of the function f(x) + a·g(x) + b·h(x) is zero there, that is,
∇f(x) + a·∇g(x) + b·∇h(x) = 0
This is the first of the KKT conditions: the derivative of L(a, b, x) with respect to x is zero.
As stated previously, a·g(x) = 0 gives the third KKT condition, and of course the known condition h(x) = 0 must also hold. So the optimal value of an optimization problem satisfying strong duality must satisfy the KKT conditions, that is, the three conditions described above. The KKT conditions can thus be viewed as a generalization of the Lagrange multiplier method.
That was a digression; now let's continue our SVM journey.
After the derivation using the Lagrange multiplier method and the KKT conditions,
the final problem becomes:
Maximize:
Condition:
This is a famous quadratic programming (QP) problem. The decision plane is obtained from the optimal solution of this problem.
Slack variables
Because errors may also occur during data collection,
we introduce slack variables to relax the optimization problem.
The formula becomes
Finally, it is transformed into the following optimization problems:
Here C is the penalty factor, a user-specified coefficient that indicates how much to penalize the misclassified points. When C is very large, there will be fewer misclassified points, but overfitting may be more severe; when C is very small, there may be many misclassified points, and the resulting model may be inaccurate.
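The role of C can be seen in a small sketch (the toy data and weights below are made up for illustration): each point's slack is its hinge loss max(0, 1 - y(w·x + b)), and the soft-margin objective is (1/2)||w||² + C·Σξ:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """1/2 ||w||^2 + C * sum of slack variables (hinge losses)."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slacks.sum(), slacks

# Hypothetical toy data: the third point sits on the decision boundary,
# so it violates its margin and picks up a positive slack.
X = np.array([[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 1.0]), -1.0

obj, slacks = soft_margin_objective(w, b, X, y, C=1.0)
print(slacks)  # [0. 0. 1.] -- only the boundary point has slack
print(obj)     # 0.5 * 2 + 1.0 * 1 = 2.0
```

Increasing C inflates the objective's penalty for that slack, pushing the optimizer toward classifiers with fewer violations at the cost of a narrower margin.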
The formula above looks complicated. Now let's tame it.
......
... (working through the formulas on scratch paper is too tedious to reproduce here)
The final result is:
Maximize:
Condition:
Oh, doesn't it look just like the earlier form? How nice it is to know some math.
This formula looks beautiful, but in most cases it can only handle the linearly separable case; it can only process linearly separable samples. If the given samples cannot be separated linearly, the result is simple: the linear classifier's solving program will loop forever and never find a solution. But don't be afraid; we still have a killer move left. Next we extend to a new territory: kernel functions. I believe everyone has heard this story: the idea was proposed in the 1960s, but it did not take off until the 1990s (its second spring), largely revived by Vapnik. This also shows that people in computer science should read more classics; being old does not make a paper not worth reading, and some masters' papers are still enlightening. Enough nonsense, back to the topic.
Kernel functions
So what exactly is a kernel function?
First, let's introduce the concept of the VC dimension.
In order to study the convergence speed and generalization ability of learning processes consistent with empirical risk minimization over function sets, statistical learning theory (SLT) defines indicators that measure the capacity of function sets; the most important of these is the VC dimension (Vapnik-Chervonenkis dimension).
VC dimension definition: for an indicator function set (functions taking only the values 0 and 1), if there exist h samples that can be separated by functions in the set in all 2^h possible ways, we say the function set can shatter those h samples. The VC dimension of the function set is the largest number of samples it can shatter.
If for any number of samples there is always a function in the set that can shatter them, the VC dimension of the set is infinite.
As the figure shows, three points can be linearly separated under any labeling.
What if there are four points? Haha, with the four points on the right split into two classes as shown, they may not be separable.
With four points, a single straight line may no longer be enough.
Generally, the larger the VC dimension, the stronger the learning ability, but the more complicated the learning machine.
Currently, there is no general theory for calculating the VC dimensions of any function set. Only the VC dimensions of some special function sets can be accurately known.
In the n-dimensional real space, the VC Dimension of the linear classifier and linear real function is n + 1.
The VC Dimension of sin (ax) is infinite.
How to calculate the VC dimension of a given learning function set is a difficult problem to be solved in the current research on Statistical Learning Theory. If you are interested, you can study it.
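The claims about three and four points can be checked by brute force. The sketch below (my own toy check, not from the original text) uses a perceptron with a generous epoch budget as the separability test; for these tiny, well-margined point sets the perceptron convergence theorem guarantees it finds a separating line whenever one exists, while an inseparable labeling can never reach zero errors:

```python
import itertools
import numpy as np

def linearly_separable(X, y, epochs=200):
    """Perceptron check: True if a line (with bias) reaches zero errors
    within the epoch budget. Generous for these well-margined toy sets."""
    Xa = np.hstack([X, np.ones((len(X), 1))])  # augment with bias term
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xa, y):
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi
                errors += 1
        if errors == 0:
            return True
    return False

def shattered(X):
    """True if every +/-1 labeling of the points in X is separable."""
    return all(linearly_separable(X, np.array(lab))
               for lab in itertools.product([-1.0, 1.0], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
xor4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(shattered(three))  # True: a line shatters 3 points -> VC dim >= 3
print(shattered(xor4))   # False: the XOR labeling is not separable
```

This is exactly the statement that the VC dimension of linear classifiers in the plane is 3 (= n + 1 with n = 2).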
Now let's talk about mapping.
The example is shown below:
The following section is from Baidu Wenku: http://wenku.baidu.com/view/8c17ebda5022aaea998f0fa8.html
I think it is written better than I could write it, so I chose to stand on the shoulders of giants.
Set all the points in the red segment between endpoints a and b on the horizontal axis as the positive class, and the points in the black segments on both sides as the negative class. Can we find a linear function that correctly separates the two classes? No: a linear function in two-dimensional space is a straight line, and clearly no line meets the requirement.
But we can find a curve, such as the following one:
Clearly, you can determine the class of a point by whether it lies above or below the curve (equivalently, take a point on the horizontal axis and compute the function value there: points of the negative class have function value greater than 0, and points of the positive class less than 0). This curve is the familiar quadratic curve; its function expression can be written as follows:
The problem is that this is not a linear function. But note that we can construct a new vector y with:
In this way g(x) can be converted into f(y) = <a, y>. You can substitute y and a back in and check that the value equals the original g(x). If the inner-product notation is not clear to you, written out, the form of f(y) is:
g(x) = f(y) = a1y1 + a2y2 + ...
In a space of any dimension, a function of this form is a linear function (except that a and y are multidimensional vectors), because the degree of the independent variable y is no greater than 1.
Do you see the point? A problem that was not linearly separable in two-dimensional space becomes linearly separable after being mapped into four-dimensional space! This is the basic idea for solving linearly inseparable problems: transform to a higher-dimensional space where the data becomes linearly separable.
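The one-dimensional interval example above can be verified in code. Assuming, for illustration, that the red segment runs between a = 1 and b = 3 (the actual endpoints in the figure are not recoverable), the quadratic g(x) = (x - a)(x - b) is negative inside the interval and positive outside, and it becomes a linear function of the mapped vector y = (x, x²):

```python
import numpy as np

# 1-D interval example: points inside (a, b) are positive class, points
# outside are negative. No point on the axis separates them linearly,
# but g(x) = (x - a)(x - b) does: g < 0 inside, g > 0 outside.
a, b = 1.0, 3.0   # assumed endpoints, for illustration only

def phi(x):
    """Map the scalar x into 2-D feature space: y = (x, x^2)."""
    return np.array([x, x * x])

# In feature space, g(x) becomes the LINEAR function <w, y> + c:
# (x - a)(x - b) = x^2 - (a + b)x + ab
w = np.array([-(a + b), 1.0])
c = a * b

for x in [0.0, 2.0, 4.0]:
    g = np.dot(w, phi(x)) + c
    print(x, g)   # 0.0 -> 3.0 (neg), 2.0 -> -1.0 (pos), 4.0 -> 3.0 (neg)
```

The separating rule in feature space is just the sign of a linear function, which is exactly what a linear classifier can learn.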
The key part of the transformation is finding the mapping from x to y. Unfortunately, there is no systematic way to find this mapping (that is to say, one basically guesses and pieces it together). For our text classification problem specifically, a text is represented as a vector of thousands of dimensions; even at such high dimensionality it is often still not linearly separable and needs to be mapped to an even higher-dimensional space. The difficulty is easy to imagine.
Why is f(y) = <a, y> a function in four-dimensional space?
You may not see it at first glance. Look at our function definition in two-dimensional space:
g(x) = ax + b
The variable x here is one-dimensional. Why do we call this a function in two-dimensional space? Because we haven't written out the other variable; its complete form is actually
y = g(x) = ax + b
that is,
y = ax + b
How many variables are there? Two, so it is a function in two-dimensional space.
Now look again at
f(y) = <a, y>
Here y is a three-dimensional variable, so f(y) is a function in four-dimensional space.
Let's use a concrete text classification example to see how classification by mapping to a high-dimensional space works. Imagine that the original space of the text classification problem is 1000-dimensional (that is, each document to be classified is represented as a 1000-dimensional vector) and that in this space the problem is not linearly separable. Now we map to a 2000-dimensional space, where we have a linear function
f(x') = <w', x'> + b
Note the prime in the upper right corner of the vectors. This function can separate the mapped problem. Both w' and x' are 2000-dimensional vectors, except that w' is a fixed value while x' is a variable (well, strictly speaking this function is 2001-dimensional, haha). Our input is now a 1000-dimensional vector x. The classification process is: first map x to a 2000-dimensional vector x', then take the inner product of the transformed vector x' with the vector w', add b to that inner product value, and look at the result; whether it is greater or less than the threshold gives the classification result.
What do you notice? We actually only care about the value of the inner product in the high-dimensional space: once that value is computed, the classification result follows. Furthermore, x' is obtained by transforming x, so in a broad sense it can be called a function of x (given an x, an x' is determined, right?), and w' is a constant obtained by transforming some constant w in the low-dimensional space. So given values of w and x, there is a definite f(x') value corresponding to them. This makes us wonder: could there be a function K(w, x) that accepts inputs from the low-dimensional space and directly computes the high-dimensional inner product value <w', x'>?
If such a function existed, then when the low-dimensional input x is given,
g(x) = K(w, x) + b
f(x') = <w', x'> + b
the two functions would compute exactly the same results, and we would no longer need to find the mapping at all: we could use the low-dimensional input directly to evaluate g(x). (Note again that this g(x) is in general not a linear function of x, because you cannot guarantee that the degree of x in the expression K(w, x) is no greater than 1.)
Fortunately, such functions K(w, x) really do exist. (We find that most of the problems we humans can solve turn out this way: there is always some opportunistic shortcut waiting to be found, which makes one feel rather small.) They are called kernel functions (kernels), and there is more than one of them; in fact, any function satisfying Mercer's condition can serve as a kernel function. The essential role of a kernel function is to accept two vectors in the low-dimensional space and compute the inner product of their images in some high-dimensional space under a certain transformation. Several commonly used kernel functions are listed in papers and textbooks, so I will not copy them here (lazy!).
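The claim that a kernel computes a high-dimensional inner product from low-dimensional inputs can be verified directly for the degree-2 polynomial kernel (a standard example, not one the text lists): K(x, z) = (x·z)² on 2-D inputs equals the inner product of the explicit feature maps φ(x) = (x₁², √2·x₁x₂, x₂²):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input -> 3-D feature space."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = <x, z>^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(poly_kernel(x, z))        # (1*3 + 2*4)^2 = 121
print(np.dot(phi(x), phi(z)))   # same value, via the explicit feature map
```

The kernel never constructs φ(x) explicitly; it gets the same number with one dot product and a square, which is the whole computational point.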
Looking back at the linear classifier we mentioned above, its form should be:
Now this is a linear function in the high-dimensional space (to distinguish the functions and vectors of the low-dimensional and high-dimensional spaces, I renamed the function and added a prime to both w and x), and we can substitute for it a function in the low-dimensional space (again, this low-dimensional function is no longer linear):
What else do you notice? The α, y, and b appearing in f(x') and in g(x) are all the same! That is to say, even though the problem is not linearly separable, we can still solve it exactly as if it were a linear problem, except that wherever an inner product is required we substitute the kernel function we selected. The α's obtained this way, combined with the chosen kernel function, give us the classifier!
After understanding the above, you will naturally ask the following two questions:
1. Since there are many kernel functions, how should we choose one for a specific problem?
2. What if the problem is still not linearly separable even after the kernel function maps it to a high-dimensional space?
The first question can be answered now: the selection of kernel functions lacks guiding principles! Observations from various experiments (not just text classification) do show that some kernels work well on some problems and poorly on others, but in general the radial basis function (RBF) kernel rarely goes far wrong. (When I built a text classification system, I used the radial basis kernel; without any parameter tuning, precision and recall exceeded 85% for most categories.)
For an intuitive feel, see the figure:
Two common kernel functions:
Polynomial kernel functions:
Gaussian Kernel function:
Definition:
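To make the Gaussian kernel concrete, here is a small sketch. The formula in the original is an image, so the form below is the common convention and thus an assumption: K(x, z) = exp(-||x - z||² / (2σ²)). A valid kernel matrix must be symmetric, and for the Gaussian kernel K(x, x) = 1:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Common convention (assumed): K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Toy points, chosen only for illustration.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
n = len(X)
K = np.array([[gaussian_kernel(X[i], X[j]) for j in range(n)]
              for i in range(n)])
print(np.allclose(K, K.T))           # a kernel matrix is symmetric
print(np.allclose(np.diag(K), 1.0))  # K(x, x) = 1 for the Gaussian kernel
print(K[0, 1])                       # exp(-1/2) for points at distance 1
```

The parameter σ controls how fast similarity decays with distance; very small σ makes K nearly the identity matrix, which is one way overfitting creeps in.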
After the kernel function is introduced, the problem takes the same form as in the linear case, as follows:
where
We have the formula, but how do we actually solve it? I will take you step by step through solving this problem; hands-on programming will give you an intuitive understanding of it. (PS: everyone relies too heavily on libsvm, which does not help deep study and understanding, and implementing it yourself is more satisfying.)
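As a warm-up before that, here is a minimal sketch (a toy example with made-up points, not the full solver we will build): projected gradient ascent on the hard-margin dual for two points. With x1 = (1, 1), y1 = +1 and x2 = (-1, -1), y2 = -1, the constraint a1·y1 + a2·y2 = 0 forces a1 = a2 = a, so the dual max Σaᵢ - ½ΣΣ aᵢaⱼyᵢyⱼ<xᵢ, xⱼ> collapses to the 1-D problem max 2a - 4a², a >= 0:

```python
import numpy as np

# Toy hard-margin SVM dual, reduced to one variable (see lead-in).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

alpha = 0.0
for _ in range(1000):                 # projected gradient ascent
    grad = 2.0 - 8.0 * alpha          # d/da of (2a - 4a^2)
    alpha = max(0.0, alpha + 0.05 * grad)

# Recover the primal solution: w = sum_i alpha_i * y_i * x_i
w = (alpha * y[0]) * X[0] + (alpha * y[1]) * X[1]
b = y[0] - np.dot(w, X[0])            # from y1 * (w.x1 + b) = 1
print(alpha)                          # 0.25
print(w, b)                           # [0.5 0.5], 0.0
print(2 / np.linalg.norm(w))          # margin 2/||w|| = 2*sqrt(2)
```

Both points end up with alpha > 0, i.e., both are support vectors, and the margin 2√2 equals the distance between them, as it must for two opposite-class points.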