The perceptron is one of the oldest classification methods. By today's standards its model is not strong in generalization, but its principles are still worth studying: by simply modifying the loss function it can be developed into the support vector machine, and by simply stacking it can be developed into a neural network, so it still holds a certain position.
This article gives a brief introduction to the principles of the perceptron.
The Perceptron Model
The idea behind the perceptron is simple. Suppose we have many boys and girls standing on a platform; the perceptron model tries to find a straight line that separates all the boys from all the girls.
In three-dimensional or higher-dimensional space, the perceptron model tries to find a hyperplane that separates the two classes.
Of course, you may ask: what if no such straight line can be found? If we cannot find one, the classes are not linearly separable, which means the perceptron model is not suitable for that data.
The biggest prerequisite for using the perceptron is that the data be linearly separable, which severely limits its applicability. Its classification competitors can handle the non-separable case: the support vector machine uses the kernel trick to make the data separable in a higher-dimensional space, and the neural network does so through activation functions and additional hidden layers.
In mathematical language: we have m samples, each with an n-dimensional feature vector and a binary label, as follows:

(x1, y1), (x2, y2), …, (xm, ym), where xi ∈ R^n and yi ∈ {+1, -1}
Our goal is to find a hyperplane

θ1·x1 + θ2·x2 + … + θn·xn + b = 0

such that the samples of one class satisfy

θ·xi + b > 0

and the samples of the other class satisfy

θ·xi + b < 0

The data are then linearly separable. If the data are linearly separable, such a hyperplane is generally not unique, which means the perceptron model can have multiple solutions.
To simplify this hyperplane notation, we add a constant feature x0 = 1 and absorb b into the weights as θ0, so the hyperplane becomes

θ0·x0 + θ1·x1 + … + θn·xn = 0

In vector form this is

θ·x = 0

where θ is an (n+1)×1 vector, x is an (n+1)×1 vector, and · denotes the inner product.
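This absorption of the bias into the weight vector can be checked numerically. A minimal sketch (the concrete numbers are my own example, not from the original):

```python
import numpy as np

# Original form: theta . x + b, with theta and x of length n.
theta = np.array([2.0, -1.0, 0.5])
b = 0.7
x = np.array([1.5, 3.0, -2.0])

# Augmented form: prepend x0 = 1 to x and b to theta,
# turning theta . x + b into a single inner product.
theta_aug = np.concatenate(([b], theta))   # (n+1)-vector
x_aug = np.concatenate(([1.0], x))         # (n+1)-vector

# Both forms give the same value for the decision function.
assert np.isclose(theta @ x + b, theta_aug @ x_aug)
```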
From here on we use the vector form for the hyperplane. When the bias is kept separate, θ is called the weight and b the bias, and the full expression of the hyperplane is θ·x + b = 0.
The perceptron model can then be defined as y = sign(θ·x + b), where

sign(z) = +1 if z ≥ 0, and -1 if z < 0
If we call sign the activation function, the difference between the perceptron and logistic regression is exactly this activation: logistic regression uses the sigmoid instead of sign.
sign(z) maps values greater than 0 to +1 and values less than 0 to -1, while for the sigmoid, outputs greater than 0.5 are classified as 1 and outputs less than 0.5 as 0. For this reason sign is called a unit step function, and logistic regression can also be viewed as a probability estimate.
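The two activation functions can be compared directly. A small sketch (the function names are mine):

```python
import math

def sign(z):
    """Unit step used by the perceptron: z >= 0 maps to +1, z < 0 to -1."""
    return 1 if z >= 0 else -1

def sigmoid(z):
    """Smooth activation used by logistic regression, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# sign gives a hard +1/-1 label directly; sigmoid gives a
# probability that is then thresholded at 0.5.
print(sign(2.0), sign(-2.0))           # 1 -1
print(1 if sigmoid(2.0) > 0.5 else 0)  # 1
```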
The Perceptron Loss Function
We now know the perceptron model, we know its task is binary classification, and we know the form of the hyperplane. The key question is how to learn the hyperplane parameters w and b, and for that we need a learning strategy.
As with any machine learning model, we first find a loss function, turn learning into an optimization problem, update with gradient descent or similar methods, and finally learn the model parameters w and b.
We naturally think of using the number of misclassified points as the loss function: the fewer misclassifications the better, and classifying every point correctly is exactly what the perceptron aims to do.
Unfortunately, such a loss function is not continuous in w and b (there is simply no differentiable expression for the count of misclassified points), so it cannot be optimized.
So we switch to another choice: the total distance from the misclassified points to the hyperplane (intuitively, the smaller this total distance, the better). The distance from a point x0 to the hyperplane is

(1/||w||) · |w·x0 + b|
We also know that every misclassified point satisfies

-yi·(w·xi + b) > 0

because when the true label of a data point is +1 and we misclassify it, we predict -1, so w·xi + b < 0; and when the true label is -1 and we misclassify it, we predict +1, so w·xi + b > 0. In both cases -yi·(w·xi + b) > 0.
We can therefore remove the absolute-value sign and obtain the distance of a misclassified point:

-(1/||w||) · yi·(w·xi + b)

Summing these gives the total distance, where M is the set of misclassified points:

-(1/||w||) · Σ_{xi∈M} yi·(w·xi + b)
In this way we obtain the initial loss function of the perceptron model. If we do not consider the 1/||w|| factor (the reason is explained below), the loss function becomes

L(w, b) = -Σ_{xi∈M} yi·(w·xi + b)
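The loss L(w, b) = -Σ_{xi∈M} yi·(w·xi + b) over the misclassified set can be written down directly. A sketch using a toy dataset of my own:

```python
import numpy as np

def perceptron_loss(w, b, X, y):
    """Perceptron loss: -sum of y_i*(w.x_i + b) over misclassified points.

    A point is misclassified when y_i*(w.x_i + b) <= 0, so each term
    contributes a non-negative amount; the loss is 0 only when every
    point is classified correctly.
    """
    margins = y * (X @ w + b)            # y_i*(w.x_i + b) for every point
    misclassified = margins <= 0         # boolean mask for the set M
    return (-margins[misclassified]).sum()

# Toy data: one point on each side of the line x1 + x2 = 0.
X = np.array([[2.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1])
w = np.array([1.0, 1.0])
print(perceptron_loss(w, 0.0, X, y))   # 0.0 -- both points correct
print(perceptron_loss(-w, 0.0, X, y))  # 5.0 -- both points misclassified
```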
To recap: the perceptron model is f(x) = sign(w·x + b), its task is binary classification, and to obtain the model we need to learn the parameters w and b.
That requires a learning strategy for iteratively updating w and b, which means we need a loss function.
The natural choice is the number of misclassified points, but since it cannot be updated by gradient methods, we represent the loss instead by the distance from the misclassified points to the hyperplane, then drop the 1/||w|| factor to obtain our final loss function.
Why can we ignore 1/||w|| rather than use the full total-distance expression as the loss function?
The perceptron's task is binary classification, and in the end it does not care how far the hyperplane is from each point (which is why we may drop ||w||); it only cares whether every point is classified correctly (that is, it considers the number of misclassified points). For instance, for the red and green separating lines below, both are equally good from the perceptron's point of view.
By the evaluation criterion of the SVM, however, the green line is better than the red one (you will understand this when you study support vector machines).
So we can drop ||w|| outright: at this stage we only consider the misclassified points, and whenever a misclassified point appears we perform a gradient-descent update on w and b.
This brings us back to our original wish of using the number of misclassifications as the loss;
introducing the distance merely turned it into a differentiable form. (One last remark: I personally think that keeping ||w||, i.e. using the distance itself as the loss function, would also yield a correct separating hyperplane in the end; the gradient would just be more complicated. Or perhaps, since the perceptron inherently distinguishes by misclassified points, it simply has no use for that loss function.)
Above we derived the loss function of the perceptron:

L(w, b) = -Σ_{xi∈M} yi·(w·xi + b)

where M is the set of all misclassified points. For a fixed M this is a convex function of w and b, and it can be minimized by gradient descent or quasi-Newton methods; gradient descent is the most common choice.
However, ordinary batch gradient descent (BGD), which computes the gradient as an average over all samples, is not feasible here.
The reason is a restriction in our loss function: only the misclassified samples, those in the set M, participate in the optimization.
So we cannot use the usual batch gradient descent and must use stochastic gradient descent (SGD) or mini-batch gradient descent (MBGD).
The perceptron model chooses stochastic gradient descent, meaning that at each step we update the parameters using a single misclassified point.
The partial derivatives of the loss function with respect to w and b are

∂L/∂w = -Σ_{xi∈M} yi·xi
∂L/∂b = -Σ_{xi∈M} yi

so the gradient-descent update for a single misclassified point (xi, yi) is

w ← w + η·yi·xi
b ← b + η·yi
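A single stochastic update on one misclassified point looks like this (a sketch; the learning rate and the data point are my own choices):

```python
import numpy as np

eta = 0.5                      # learning rate
w = np.array([0.0, 0.0])
b = 0.0

# A point with true label +1 that the current model misclassifies:
xi, yi = np.array([3.0, 3.0]), 1
assert yi * (w @ xi + b) <= 0  # confirm it is misclassified

# SGD update: w <- w + eta*yi*xi,  b <- b + eta*yi
w = w + eta * yi * xi
b = b + eta * yi

# The point's margin yi*(w.xi + b) has moved in the right direction:
print(yi * (w @ xi + b))  # 9.5
```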
We can now give the complete perceptron learning algorithm:
(1) Choose initial values w0, b0 (equivalent to starting from some initial hyperplane).
(2) Select a data point (xi, yi) from the training set. (Points are drawn arbitrarily; if every point has been checked and no misclassified point remains, the algorithm ends; otherwise go to (3).)
(3) If yi·(w·xi + b) ≤ 0 (that is, (xi, yi) is a misclassified point), update the parameters as follows:

w ← w + η·yi·xi
b ← b + η·yi

where η is the step size, also called the learning rate in statistical learning.
(4) Go back to (2), until there are no misclassified points in the training set.
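The four steps above can be sketched as a training loop. The toy dataset and names are my own; the data are assumed linearly separable, hence the safety cap on epochs:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Perceptron learning algorithm via stochastic gradient descent.

    Sweeps the training set repeatedly; whenever a point satisfies
    y_i*(w.x_i + b) <= 0 (misclassified), applies the update
    w <- w + eta*y_i*x_i, b <- b + eta*y_i. Stops when a full sweep
    finds no misclassified point.
    """
    w = np.zeros(X.shape[1])   # step (1): initial w0, b0
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):           # step (2): scan the data
            if yi * (w @ xi + b) <= 0:     # step (3): misclassified?
                w = w + eta * yi * xi
                b = b + eta * yi
                mistakes += 1
        if mistakes == 0:                  # step (4): converged
            break
    return w, b

# Toy linearly separable data: class +1 up-right, class -1 down-left.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = train_perceptron(X, y)
print(all(np.sign(X @ w + b) == y))  # True
```

Note that the solution depends on the order in which points are visited and on the initial values, which is exactly the non-uniqueness of the separating hyperplane mentioned earlier.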
That is the core theory of the perceptron; interested readers can go further into the proof of convergence and the dual form of the algorithm.
Although the perceptron is no longer a widely used algorithm in practice, it is worth studying carefully, and it is also worth understanding why the dual form of the perceptron algorithm is faster than the original form in practical applications.
Reference: Li Hang, Statistical Learning Methods