Original address: http://cs231n.github.io/linear-classify/
##############################
Table of Contents:
1. Introduction to linear classification
2. Linear score function
3. Interpreting a linear classifier
4. Loss function
4.1. Multiclass Support Vector Machine
4.2. Softmax classifier
4.3. SVM vs. Softmax
5. Interactive web demo
6. Summary
##############################
Linear classification
The previous section introduced the problem of image classification: assigning a single label from a fixed set of categories to a test image. We also described the k-Nearest Neighbor (kNN) classifier, which labels a test image by comparing it to the (labeled) images in the training set. As we saw, kNN has the following drawbacks:
1) The classifier must store all of the training data in order to compare it with future test data. This is space-inefficient because datasets can be very large.
2) Classifying a test image is computationally expensive because it requires a comparison with every training image.
Overview. We are now going to develop a more powerful approach to image classification that we will eventually extend naturally to entire neural networks and convolutional neural networks. The approach has two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we minimize the loss function with respect to the parameters of the score function.
Parameterized mapping from images to label scores
The first component of this approach is a score function that maps the pixel values of an image to confidence scores for each class. We will develop the approach with a concrete example. As before, assume a training dataset of images $x_i \in R^D$, each associated with a label $y_i$, where $i = 1 \dots N$ and $y_i \in 1 \dots K$. That is, we have N examples (each with dimensionality D) and K distinct categories. For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc.). We now define the score function $f: R^D \mapsto R^K$ that maps the raw image pixels to class scores.
Linear classifier. In this module we start with arguably the simplest possible function, a linear mapping:
$$f(x_i, W, b) = W x_i + b$$
In the above equation, we assume that the image $x_i$ has all of its pixels flattened out into a single column vector of shape [D x 1]. The matrix W (of size [K x D]) and the vector b (of size [K x 1]) are the parameters of the function. In CIFAR-10, $x_i$ contains all pixels of the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in W are often called the weights, and b is called the bias vector because it influences the output scores without interacting with the actual data $x_i$. However, you will often hear people use the terms weights and parameters interchangeably.
Some notes:
1) The single matrix multiplication $W x_i$ effectively evaluates 10 separate classifiers in parallel (one for each class), where each classifier is a row of W.
2) We think of the input data $(x_i, y_i)$ as given and fixed, but we have control over the setting of the parameters W, b. Our goal is to set these so that the computed scores match the ground truth labels across the whole training set. We will go into much more detail about how this is done, but intuitively we want the correct class to have a score that is higher than the scores of the incorrect classes.
3) An advantage of this approach is that the training data is used only to learn the parameters W, b. Once learning is complete, we can discard the entire training set and keep only the learned parameters, because a new test image can simply be forwarded through the function and classified based on the computed scores.
4) Finally, classifying a test image involves a single matrix multiplication and addition, which is significantly faster than comparing the test image to all training images.
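The linear score function described above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy sizes (D = 4 features and K = 3 classes instead of CIFAR-10's 3072 and 10); the names `score`, `W`, `b` and `x` are chosen here for the example and the weights are random rather than learned.

```python
import numpy as np

# Hypothetical toy sizes standing in for CIFAR-10 (D = 3072, K = 10).
D, K = 4, 3

rng = np.random.default_rng(0)
W = rng.standard_normal((K, D)) * 0.01                # weights, shape [K x D]
b = np.zeros(K)                                       # biases, shape [K]
x = rng.integers(0, 256, size=D).astype(np.float64)   # one flattened "image"

def score(x, W, b):
    """Linear score function f(x, W, b) = W x + b; returns K class scores."""
    return W.dot(x) + b

scores = score(x, W, b)
predicted_class = int(np.argmax(scores))  # class with the highest score
```

Each entry of `scores` is the dot product of one row of `W` with `x` plus the matching bias, which is the sense in which the K classifiers are evaluated in parallel.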
Interpreting a linear classifier
A linear classifier computes the score of a class as a weighted sum of all of the pixel values across all 3 color channels of the image. Depending on the values we set for these weights, the function can "like" or "dislike" (depending on the sign of each weight) certain colors at certain positions in the image. For instance, the "ship" class might be more likely if there is a lot of blue around the edges of an image (which would likely correspond to water). You might then expect the "ship" classifier to have many positive weights across its blue channel (the presence of blue increases the "ship" score) and negative weights in the red/green channels (the presence of red/green decreases the "ship" score).
Analogy of images as high-dimensional points. Since the images are stretched into high-dimensional column vectors, we can interpret each image as a single point in this space (e.g. each image in CIFAR-10 is a point in a 3072-dimensional space of 32x32x3 pixels). Analogously, the entire dataset is a (labeled) set of points.
Since we defined the score of each class as a weighted sum of all image pixels, each class score is a linear function over this space. We cannot visualize 3072-dimensional spaces, but if we imagine squashing all of those dimensions into only two, each class's classifier can be pictured as a line in the plane: the class score is zero along the line and increases linearly as we move away from it on one side.
As we saw above, every row of W is a classifier for one of the classes. The geometric interpretation is that as we change one of the rows of W, the corresponding line in pixel space rotates in a different direction. The biases b, on the other hand, allow our classifiers to translate the lines. In particular, note that without the bias terms, plugging in $x_i = 0$ would always give a score of zero regardless of the weights, so all lines would be forced to cross the origin.
Interpretation of linear classifiers as template matching. Another interpretation of the weights W is that each row of W corresponds to a template (sometimes also called a prototype) for one of the classes. The score of each class for an image is then obtained by comparing each template with the image using an inner product (dot product) one by one, and the class with the best match is the prediction. In this view, the linear classifier is doing template matching, where the templates are learned. Another way to think of it is that we are still effectively doing nearest neighbor classification, but instead of having thousands of training images we use only a single image per class (which we learn, and which does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.
Additionally, in the templates learned on CIFAR-10, the horse template seems to contain a two-headed horse, which is due to the dataset containing both left- and right-facing horses. The linear classifier merges these two modes of horses into a single template. Similarly, the car classifier seems to have merged several modes into a single template that has to identify cars from all sides and of all colors. In particular, this template ended up being red, which hints that there are more red cars in the CIFAR-10 dataset than cars of any other color. The linear classifier is too weak to properly account for different-colored cars, but as we will see later, neural networks will allow us to perform this task. Looking ahead a bit, a neural network will be able to develop intermediate neurons in its hidden layers that could detect specific car types (e.g. a green car facing left, a blue car facing front, etc.), and neurons in the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.
Bias trick. Before moving on, we want to mention a common simplifying trick for representing the two parameters W, b as one. Recall that we defined the score function as:
$$f(x_i, W, b) = W x_i + b$$
As we proceed through the material it is a little cumbersome to keep track of two sets of parameters (the biases b and the weights W) separately. A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them, by extending the vector $x_i$ with one additional dimension that always holds the constant 1 (a default bias dimension). With the extra dimension, the new score function simplifies to a single matrix multiplication:
$$f(x_i, W) = W x_i$$
With our CIFAR-10 example, $x_i$ is now [3073 x 1] instead of [3072 x 1] (the extra dimension holding the constant 1), and W is now [10 x 3073] instead of [10 x 3072]. The extra column of W now corresponds to the bias b.
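The bias trick can be checked numerically. The sketch below assumes toy sizes (K = 3, D = 4) and random values; `W_ext` and `x_ext` are names introduced here for the extended matrix and vector.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                          # toy sizes (assumed for illustration)
W = rng.standard_normal((K, D))      # weights, [K x D]
b = rng.standard_normal(K)           # biases, [K]
x = rng.standard_normal(D)           # one input vector, [D]

# Append b as an extra column of W, and a constant 1 as an extra entry of x.
W_ext = np.hstack([W, b[:, None]])   # [K x (D + 1)]
x_ext = np.append(x, 1.0)            # [D + 1]

scores_two_params = W.dot(x) + b     # original form: W x + b
scores_one_param = W_ext.dot(x_ext)  # bias trick: a single matrix multiply
```

Both expressions produce the same scores, because the extra column of `W_ext` is multiplied by the constant 1 and therefore contributes exactly `b`.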
Image data preprocessing. In the examples above we used the raw pixel values (which range over [0 ... 255]). In machine learning, it is very common practice to normalize the input features (in the case of images, every pixel is thought of as a feature). In particular, it is important to center the data by subtracting the mean from every feature. For images, this corresponds to computing a mean image across the training images and subtracting it from every image, which gives images whose pixels range over approximately [-127 ... 127]. A further common step is to scale each input feature so that its values range over [-1, 1]. Of these, zero-mean centering is arguably more important, but we will have to wait for its justification until we understand the dynamics of gradient descent.
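A minimal sketch of this preprocessing, assuming a tiny random stand-in for the dataset (5 training and 2 test "images" of 6 pixels each). Dividing by 255 is one conservative choice of scale made here so that the result is guaranteed to lie in [-1, 1]; other scalings are equally plausible.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in data: rows are flattened images with values in [0, 255].
X_train = rng.integers(0, 256, size=(5, 6)).astype(np.float64)
X_test = rng.integers(0, 256, size=(2, 6)).astype(np.float64)

# The mean image is computed from the training images only.
mean_image = X_train.mean(axis=0)

X_train_centered = X_train - mean_image
X_test_centered = X_test - mean_image   # reuse the training mean at test time

# Optional scaling step: centered values lie in [-255, 255],
# so dividing by 255 keeps every feature within [-1, 1].
X_train_scaled = X_train_centered / 255.0
X_test_scaled = X_test_centered / 255.0
```

Note that the test images are centered with the training mean rather than their own, so that training and test data pass through exactly the same transformation.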
Loss function
In the previous section we defined a function from the pixel values to class scores, which was parameterized by a set of weights W. Moreover, we saw that we have no control over the data $(x_i, y_i)$ (it is fixed and given), but we do have control over these weights, and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.
For example, going back to the earlier example image of a cat and its scores for the classes "cat", "dog" and "ship", that particular set of weights was not very good at all: we fed in pixels that depict a cat, but the cat score came out very low (-96.8) compared to the other classes (dog score 437.9 and ship score 61.95). We are going to measure our unhappiness with outcomes such as this one with a loss function (sometimes also referred to as the cost function or the objective). Intuitively, the loss will be high if we are doing a poor job of classifying the training data, and it will be low if we are doing well.
Multiclass Support Vector Machine loss
There are several ways to define the details of the loss function. As a first example we will develop a commonly used loss called the Multiclass Support Vector Machine (SVM) loss. The SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin $\Delta$. It is sometimes helpful to anthropomorphise the loss function in this way: the SVM "wants" a certain outcome in the sense that the outcome would yield a lower loss (which is good).
Let's now get more precise. Recall that for the i-th example we are given the pixels of the image $x_i$ and the label $y_i$ that specifies the index of the correct class. The score function takes the pixels and computes the vector $f(x_i, W)$ of class scores, which we will abbreviate to s (short for scores). For example, the score for the j-th class is the j-th element: $s_j = f(x_i, W)_j$. The Multiclass SVM loss for the i-th example is then formalized as follows:
$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$
Example. Let's unpack this with an example to see how it works. Suppose that we have three classes that receive the scores s = [13, -7, 11], and that the first class is the true class (i.e. $y_i = 0$). Also assume that $\Delta$ (a hyperparameter we will go into more detail about soon) is 10. The expression above sums over all incorrect classes ($j \neq y_i$), so we get two terms:
$$L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10)$$
You can see that the first term gives zero, since [-7 - 13 + 10] is a negative number, which is then thresholded to zero by the max(0, -) function. We get zero loss for this pair because the correct class score (13) is greater than the incorrect class score (-7) by at least the margin of 10. In fact the difference is 20, which is much greater than 10, but the SVM only cares that the difference is at least 10; any additional difference above the margin is clamped at zero by the max operation. The second term computes [11 - 13 + 10], which gives 8. That is, even though the correct class had a higher score than the incorrect class (13 > 11), it was not greater by the desired margin of 10. The difference was only 2, which is why the loss comes out to 8 (i.e. how much higher the difference would have to be to meet the margin). In summary, the SVM loss function wants the score of the correct class $y_i$ to be larger than the incorrect class scores by at least $\Delta$. If this is not the case, we accumulate loss.
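The arithmetic of this worked example can be reproduced with a short sketch. It assumes the scores [13, -7, 11], the first class as the correct one, and a margin of 10, as in the text; `svm_loss_single` is a helper name introduced here for illustration.

```python
import numpy as np

def svm_loss_single(scores, correct_class, delta):
    """Return the per-class hinge terms and their sum for one example."""
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0   # the correct class does not contribute
    return margins, float(np.sum(margins))

scores = np.array([13.0, -7.0, 11.0])
margins, loss = svm_loss_single(scores, correct_class=0, delta=10.0)
# margins is [0, 0, 8]: the class scoring -7 clears the margin,
# the class scoring 11 falls short of it by 8, so the loss is 8.
```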
Note that in this particular module we are working with linear score functions ($f(x_i; W) = W x_i$), so we can also rewrite the loss function in this equivalent form:
$$L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$
where $w_j$ is the j-th row of W reshaped as a column. However, this will not necessarily be the case once we start to consider more complex forms of the score function f.
A last piece of terminology before we finish this section: the threshold at zero, the max(0, -) function, is often called the hinge loss. You will sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form $\max(0, -)^2$ and penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but on some datasets the squared hinge loss can work better. This can be determined during cross-validation.
Regularization. There is one bug with the loss function presented above. Suppose that we have a dataset and a set of parameters W that correctly classifies every example (i.e. all scores are such that all the margins are met, and $L_i = 0$ for all i). The issue is that this set of W is not necessarily unique: there might be many similar W that correctly classify the examples. One easy way to see this is that if some parameters W correctly classify all examples (so the loss is zero for each example), then any multiple of these parameters $\lambda W$ with $\lambda > 1$ will also give zero loss, because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. For example, if the difference in scores between the correct class and the nearest incorrect class was 15, then multiplying all elements of W by 2 would make the new difference 30.
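The ambiguity can be seen on a tiny example. The weights and input below are made up so that the score gap is 15, matching the numbers in the text; `hinge_loss` is a helper name introduced here, and the margin of 1.0 is an assumption.

```python
import numpy as np

def hinge_loss(scores, correct_class, delta=1.0):
    """Multiclass hinge loss for one example."""
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0
    return float(np.sum(margins))

# Hypothetical 2-class weights and input, chosen to give a score gap of 15.
W = np.array([[4.0, 1.0],
              [1.0, 0.0]])
x = np.array([5.0, 0.0])

scores = W.dot(x)                   # [20, 5]
scores_scaled = (2.0 * W).dot(x)    # [40, 10]

gap = scores[0] - scores[1]                       # 15
gap_scaled = scores_scaled[0] - scores_scaled[1]  # 30

loss = hinge_loss(scores, correct_class=0)
loss_scaled = hinge_loss(scores_scaled, correct_class=0)
```

Both `W` and `2 * W` reach zero data loss, so the data loss alone cannot distinguish between them.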
In other words, we wish to encode some preference for a certain set of weights W over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty R(W). The most common regularization penalty is the L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters:
$$R(W) = \sum_k \sum_l W_{k,l}^2$$
In the expression above, we are summing up all the squared elements of W. Notice that the regularization function is not a function of the data; it is based only on the weights. Including the regularization penalty completes the full Multiclass SVM loss, which is made up of two components: the data loss (the average loss $L_i$ over all examples) and the regularization loss. That is, the full Multiclass SVM loss becomes:
$$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$$
or, expanding this out in its full form:
$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k \sum_l W_{k,l}^2$$
where N is the number of training examples. As you can see, we append the regularization penalty to the loss objective, weighted by a hyperparameter $\lambda$. There is no simple way of setting this hyperparameter, and it is usually determined by cross-validation.
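The full loss (data loss plus L2 regularization) could be sketched as follows. This version deliberately keeps an explicit loop over examples for readability; the function name `svm_loss_full`, the default `reg=0.1` (standing in for the regularization weight) and the toy data are all assumptions made for illustration.

```python
import numpy as np

def svm_loss_full(X, y, W, delta=1.0, reg=0.1):
    """
    Full multiclass SVM loss: mean hinge loss plus reg * sum(W^2).
    - X: [D x N] array, one example per column
    - y: [N] integer array of correct class indices
    - W: [K x D] weight matrix
    Returns (total_loss, data_loss, reg_loss).
    """
    num_examples = X.shape[1]
    data_loss = 0.0
    for i in range(num_examples):
        scores = W.dot(X[:, i])
        margins = np.maximum(0.0, scores - scores[y[i]] + delta)
        margins[y[i]] = 0.0
        data_loss += np.sum(margins)
    data_loss /= num_examples
    reg_loss = reg * np.sum(W * W)
    return data_loss + reg_loss, data_loss, reg_loss

# Toy data: D = 2 features, N = 2 examples, K = 3 classes.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([0, 2])

total_zero, data_zero, reg_zero = svm_loss_full(X, y, np.zeros((3, 2)))
total_ones, data_ones, reg_ones = svm_loss_full(X, y, np.ones((3, 2)))
```

With all-zero weights every score is 0, so each of the two incorrect classes contributes the margin and the data loss is 2.0 with no regularization loss. With all-one weights the scores are again tied, so the data loss is unchanged, but the regularization term adds 0.1 * 6 = 0.6.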
In addition to the motivation we provided above, there are many desirable properties to including the regularization penalty, many of which we will come back to in later sections. For example, it turns out that including the L2 penalty leads to the appealing max margin property in SVMs (see the CS229 lecture notes for full details if you are interested).
The most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. For example, suppose that we have an input vector x = [1, 1, 1, 1] and two weight vectors $w_1$ = [1, 0, 0, 0] and $w_2$ = [0.25, 0.25, 0.25, 0.25]. Then $w_1^T x = w_2^T x = 1$, so both weight vectors lead to the same dot product, but the L2 penalty of $w_1$ is 1.0 while the L2 penalty of $w_2$ is only 0.25. Therefore, according to the L2 penalty the weight vector $w_2$ would be preferred, since it achieves a lower regularization loss. Intuitively, this is because the weights in $w_2$ are smaller and more diffuse. Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account to small amounts rather than a few input dimensions very strongly. As we will see later in the class, this effect can improve the generalization performance of the classifiers on test images and lead to less overfitting.
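The comparison of the two weight vectors can be verified directly. The sketch assumes the vectors x = [1, 1, 1, 1], w1 = [1, 0, 0, 0] and w2 = [0.25, 0.25, 0.25, 0.25], which are consistent with the penalties of 1.0 and 0.25 quoted in the text.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])        # all weight on one input dimension
w2 = np.array([0.25, 0.25, 0.25, 0.25])    # weight spread over all dimensions

score1 = w1.dot(x)            # 1.0
score2 = w2.dot(x)            # 1.0, identical score

penalty1 = np.sum(w1 ** 2)    # 1.0
penalty2 = np.sum(w2 ** 2)    # 0.25, so the L2 penalty prefers w2
```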
Note that biases do not have the same effect since, unlike the weights, they do not control the strength of influence of an input dimension. Therefore, it is common to regularize only the weights W but not the biases b. In practice, however, this often turns out to have a negligible effect. Lastly, note that due to the regularization penalty we can never achieve a loss of exactly 0.0 on all examples, because this would only be possible in the pathological setting of W = 0.
Code. Here is the loss function (without regularization) implemented in Python, in both unvectorized and half-vectorized form:

    import numpy as np

    def L_i(x, y, W):
        """
        Unvectorized version. Compute the multiclass SVM loss for a single example (x, y).
        - x is a vector (1-D array) representing an image (e.g. of length 3073 in CIFAR-10)
          with an appended bias dimension in the 3073-rd position (i.e. bias trick)
        - y is an integer giving the index of the correct class (e.g. between 0 and 9 in CIFAR-10)
        - W is the weight matrix (e.g. 10 x 3073 in CIFAR-10)
        """
        delta = 1.0  # see notes about delta later in this section
        scores = W.dot(x)  # scores has size 10, the scores for each class
        correct_class_score = scores[y]
        D = W.shape[0]  # number of classes, e.g. 10
        loss_i = 0.0
        for j in range(D):  # iterate over all wrong classes
            if j == y:
                # skip the true class to only loop over incorrect classes
                continue
            # accumulate loss for the i-th example
            loss_i += max(0, scores[j] - correct_class_score + delta)
        return loss_i

    def L_i_vectorized(x, y, W):
        """
        A faster half-vectorized implementation. Half-vectorized refers to the
        fact that for a single example the implementation contains no for loops,
        but there is still one loop over the examples (outside this function).
        """
        delta = 1.0
        scores = W.dot(x)
        # compute the margins for all classes in one vector operation
        margins = np.maximum(0, scores - scores[y] + delta)
        # at the y-th position scores[y] - scores[y] cancels and leaves delta. We want
        # to ignore the y-th position and only consider the margins of the wrong classes
        margins[y] = 0
        loss_i = np.sum(margins)
        return loss_i

    def L(X, y, W):
        """
        Fully-vectorized implementation:
        - X holds all the training examples as columns (e.g. 3073 x 50,000 in CIFAR-10)
        - y is an array of integers specifying the correct class (e.g. 50,000-D array)
        - W are the weights (e.g. 10 x 3073)
        """
        # evaluate loss over all examples in X without using any for loops
        # left as an exercise to the reader in the assignment
        raise NotImplementedError
The takeaway from this section is that the SVM loss takes one particular approach to measuring how consistent the predictions on the training data are with the ground truth labels. Additionally, making good predictions on the training set is equivalent to minimizing the loss.
Practical considerations
Setting Delta. Note that we brushed over the hyperparameter $\Delta$ and its setting. What value should it be set to, and do we have to cross-validate it? It turns out that this hyperparameter can safely be set to $\Delta = 1.0$ in all cases. The hyperparameters $\Delta$ and $\lambda$ seem like two different hyperparameters, but in fact they both control the same tradeoff: the tradeoff between the data loss and the regularization loss in the objective. The key to understanding this is that the magnitude of the weights W has a direct effect on the scores (and hence also on their differences): as we shrink all values inside W the score differences become lower, and as we scale up the weights the score differences all become higher. Therefore, the exact value of the margin between the scores (e.g. $\Delta = 1$ or $\Delta = 100$) is in some sense meaningless, because the weights can shrink or stretch the differences arbitrarily. Hence, the only real tradeoff is how large we allow the weights to grow (through the regularization strength $\lambda$).
Relation to the binary Support Vector Machine. You may be coming to this class with previous experience with binary Support Vector Machines, where the loss for the i-th example can be written as:
$$L_i = C \max(0, 1 - y_i w^T x_i) + R(W)$$
where C is a hyperparameter and $y_i \in \{-1, 1\}$. You can convince yourself that the formulation presented in this section contains the binary SVM as a special case when there are only two classes. That is, with only two classes the multiclass loss reduces to the binary SVM loss shown above. Also, C in this formulation and $\lambda$ in our formulation control the same tradeoff and are related through the reciprocal relation $C \propto \frac{1}{\lambda}$.
Aside: optimization in the primal. If you are coming to this class with previous knowledge of SVMs, you may have also heard of kernels, duals, the SMO algorithm, etc. In this class (as is the case with neural networks in general) we will always work with the optimization objectives in their unconstrained primal form. Many of these objectives are technically not differentiable (e.g. the max(x, y) function is not, because it has a kink when x = y), but in practice this is not a problem and it is common to use a subgradient.
Aside: other multiclass SVM formulations. It is worth noting that the Multiclass SVM presented in this section is one of a few ways of formulating the SVM over multiple classes. Another commonly used form is the One-vs-All (OVA) SVM, which trains an independent binary SVM for each class vs. all other classes. Related, but less common in practice, is the All-vs-All (AVA) strategy. Our formulation follows the Weston and Watkins 1999 version, which is more powerful than OVA (in the sense that you can construct multiclass datasets where this version can achieve zero data loss but OVA cannot; see the paper for details if interested). The last formulation you may see is the Structured SVM, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect runner-up class. Understanding the differences between these formulations is outside the scope of the class. The version presented in these notes is a safe bet to use in practice, but the arguably simplest OVA strategy is likely to work just as well (as also argued by Rifkin et al. 2004 in In Defense of One-Vs-All Classification).
Softmax classifier
It turns out that the SVM is one of two commonly seen classifiers. The other popular choice is the Softmax classifier, which has a different loss function. If you have heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. The Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping $f(x_i; W) = W x_i$ stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \quad \text{or equivalently} \quad L_i = -f_{y_i} + \log \sum_j e^{f_j}$$
where we use the notation $f_j$ to mean the j-th element of the vector of class scores f. As before, the full loss for the dataset is the mean of $L_i$ over all training examples together with a regularization term R(W). The function $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$ is called the softmax function: it takes a vector of arbitrary real-valued scores (in z) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss involves this softmax function.
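A minimal sketch of the softmax function and the resulting cross-entropy loss for a single example follows. The score vector is made up, and subtracting the maximum score inside `softmax` is a standard stability measure that does not change the result; `softmax` and `cross_entropy_loss` are names introduced here.

```python
import numpy as np

def softmax(z):
    """Squash a score vector into positive values that sum to one."""
    shifted = z - np.max(z)      # does not change the output, avoids overflow
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

def cross_entropy_loss(f, correct_class):
    """Negative log of the normalized probability of the correct class."""
    return float(-np.log(softmax(f)[correct_class]))

f = np.array([1.0, -2.0, 0.0])   # hypothetical class scores
p = softmax(f)
loss = cross_entropy_loss(f, correct_class=0)

# The equivalent form: -f_y + log(sum_j exp(f_j))
loss_equivalent = float(-f[0] + np.log(np.sum(np.exp(f))))
```

The two ways of writing the loss agree up to floating-point error, which is a quick sanity check on either implementation.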
Information theory view. The cross-entropy between a "true" distribution p and an estimated distribution q is defined as:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ($q = e^{f_{y_i}} / \sum_j e^{f_j}$ as seen above) and the "true" distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. $p = [0, \ldots, 1, \ldots, 0]$ contains a single 1 at the $y_i$-th position). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as $H(p, q) = H(p) + D_{KL}(p \| q)$, and the entropy of the delta function p is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.
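The decomposition of cross-entropy into entropy plus KL divergence can be checked numerically for a one-hot "true" distribution. The distributions `p` and `q` below are made up for illustration, and terms with p(x) = 0 are skipped under the usual convention that 0 log 0 = 0.

```python
import numpy as np

def cross_entropy(p, q):
    mask = p > 0                 # terms with p(x) = 0 contribute nothing
    return float(-np.sum(p[mask] * np.log(q[mask])))

def entropy(p):
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def kl_divergence(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.0, 1.0, 0.0])    # all mass on the correct class (index 1)
q = np.array([0.2, 0.7, 0.1])    # hypothetical predicted probabilities

h_pq = cross_entropy(p, q)       # equals -log(0.7)
h_p = entropy(p)                 # zero for a one-hot distribution
kl_pq = kl_divergence(p, q)
```

Because `h_p` is zero here, the cross-entropy and the KL divergence coincide, which is why minimizing one minimizes the other.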
Probabilistic interpretation. Looking at the expression
$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$$
we can interpret it as the (normalized) probability assigned to the correct label $y_i$ given the image $x_i$ and parameterized by W. To see this, remember that the Softmax classifier interprets the scores inside the output vector f as the unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one. In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE). A nice feature of this view is that we can now also interpret the regularization term R(W) in the full loss function as coming from a Gaussian prior over the weight matrix W, where instead of MLE we are performing Maximum a posteriori (MAP) estimation. We mention these interpretations to help your intuitions, but the full details of this derivation are beyond the scope of this class.
Practical issues: numeric stability. When you are writing code for computing the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant C and push it into the sum, we get the following (mathematically equivalent) expression:
$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$$
We are free to choose the value of C. This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for C is to set $\log C = -\max_j f_j$. This simply states that we should shift the values inside the vector f so that the highest value is zero. In code:

    import numpy as np

    f = np.array([123, 456, 789])  # example with 3 classes, each having large scores
    p = np.exp(f) / np.sum(np.exp(f))  # Bad: numeric problem, potential blowup

    # instead: first shift the values of f so that the highest number is 0:
    f -= np.max(f)  # f becomes [-666, -333, 0]
    p = np.exp(f) / np.sum(np.exp(f))  # safe to do, gives the correct answer

Possibly confusing naming conventions. To be precise, the SVM classifier uses the hinge loss, sometimes also called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it does not make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.
SVM vs. Softmax
The following points may help clarify the distinction between the Softmax and SVM classifiers:
The Softmax classifier provides "probabilities" for each class. Unlike the SVM, which computes uncalibrated and not easily interpretable scores for all classes, the Softmax classifier allows us to compute "probabilities" for all labels. For example, given an image the SVM classifier might give scores such as [12.5, 0.6, -23.0] for the classes "cat", "dog" and "ship". The Softmax classifier can instead compute the probabilities of the three labels, for instance [0.9, 0.09, 0.01], which allows you to interpret its confidence in each class. The reason we put the word "probabilities" in quotes, however, is that how peaky or diffuse these probabilities are depends directly on the regularization strength $\lambda$, which you are in charge of as input to the system. For example, suppose that the unnormalized log-probabilities for some three classes come out to be [1, -2, 0]. The softmax function would then compute:
$$[1, -2, 0] \rightarrow [e^1, e^{-2}, e^0] = [2.71, 0.14, 1] \rightarrow [0.7, 0.04, 0.26]$$
where the steps taken are to exponentiate and then normalize to sum to one. Now, if the regularization strength $\lambda$ were higher, the weights W would be penalized more and this would lead to smaller weights. For example, suppose that the weights became half as large, giving scores of [0.5, -1, 0]. The softmax would now compute:
$$[0.5, -1, 0] \rightarrow [e^{0.5}, e^{-1}, e^0] = [1.65, 0.37, 1] \rightarrow [0.55, 0.12, 0.33]$$
where the probabilities are now more diffuse. Moreover, in the limit where the weights go towards tiny numbers due to very strong regularization, the output probabilities would be near uniform. Hence, the probabilities computed by the Softmax classifier are better thought of as confidences where, similar to the SVM, the ordering of the scores is interpretable, but the absolute numbers (or their differences) technically are not.
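The effect of shrinking the weights on the softmax output can be reproduced with a short sketch. It assumes example scores of [1, -2, 0] and models "weights half as large" by simply halving the scores, which holds for a linear score function without a separate bias term.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)      # stability shift, does not change the output
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

scores = np.array([1.0, -2.0, 0.0])   # assumed unnormalized log-probabilities
p_original = softmax(scores)          # roughly [0.7, 0.04, 0.26]
p_halved = softmax(0.5 * scores)      # roughly [0.55, 0.12, 0.33]
p_tiny = softmax(1e-6 * scores)       # very strong shrinkage: nearly uniform
```

The ordering of the classes is the same in all three outputs; only how peaked the distribution is changes with the scale of the weights.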
In practice, SVM and Softmax are usually comparable. The performance difference between the SVM and Softmax is usually very small, and different people will have different opinions on which classifier works better. Compared to the Softmax classifier, the SVM is a more local objective, which could be thought of either as a bug or a feature. Consider an example that achieves the scores [10, -2, 3] and where the first class is correct. An SVM (e.g. with a desired margin of $\Delta = 1$) will see that the correct class already has a score higher than the other classes by at least the margin, and it will compute a loss of zero. The SVM does not care about the details of the individual scores: if they were instead [10, -100, -100] or [10, 9, 9] the SVM would be indifferent, since the margin of 1 is satisfied and hence the loss is zero. However, these scenarios are not equivalent to a Softmax classifier, which would accumulate a much higher loss for the scores [10, 9, 9] than for [10, -100, -100]. In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes a lower probability, and the loss would always get better. The SVM, however, is happy once the margins are satisfied and it does not micromanage the exact scores beyond this constraint. This can intuitively be thought of as a feature: for example, a car classifier that is likely spending most of its "effort" on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud.
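The contrast between the two losses can be made concrete with a small sketch. It assumes three score vectors, [10, -2, 3], [10, -100, -100] and [10, 9, 9], with the first class correct and a margin of 1; `svm_loss` and `softmax_loss` are helper names introduced here.

```python
import numpy as np

def svm_loss(scores, correct_class, delta=1.0):
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0
    return float(np.sum(margins))

def softmax_loss(scores, correct_class):
    shifted = scores - np.max(scores)                        # stability shift
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))    # log-softmax
    return float(-log_probs[correct_class])

examples = [
    np.array([10.0, -2.0, 3.0]),
    np.array([10.0, -100.0, -100.0]),
    np.array([10.0, 9.0, 9.0]),
]

svm_losses = [svm_loss(s, 0) for s in examples]
softmax_losses = [softmax_loss(s, 0) for s in examples]
```

All three SVM losses are zero because the margin is met in every case, while the softmax loss is largest for [10, 9, 9], smaller for [10, -2, 3], and essentially zero for [10, -100, -100].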
Interactive Web Demo
Summary
1) We defined a score function from image pixels to class scores (in this section, a linear function that depends on weights W and biases b).
2) Unlike the kNN classifier, the advantage of this parametric approach is that once we learn the parameters we can discard the training data. Additionally, the prediction for a new test image is fast, since it requires a single matrix multiplication with W rather than an exhaustive comparison with every training example.
3) We introduced the bias trick, which allows us to fold the bias vector into the weight matrix so that we only have to keep track of one parameter matrix.
4) We defined a loss function (we introduced two commonly used losses for linear classifiers: the SVM and the Softmax) that measures how compatible a given set of parameters is with the ground truth labels in the training dataset. We also saw that the loss function was defined in such a way that making good predictions on the training data is equivalent to having a small loss.
We have now seen one way to take a dataset of images and map each one to class scores based on a set of parameters, and we have seen two examples of loss functions that we can use to measure the quality of the predictions. But how do we efficiently determine the parameters that give the best (lowest) loss? This process is optimization, and it is the topic of the next section.
Further Reading
1) Deep Learning using Linear Support Vector Machines by Charlie Tang presents some results claiming that the L2-SVM outperforms Softmax.