A linear classifier for image classification consists of two parts. The first is the score function (hypothesis function), which maps the raw image data to class scores. The second is the loss function, which measures how well the predicted scores agree with the true labels. Training then becomes an optimization problem: the parameters of the score function are updated to minimize the value of the loss function.

__Parameterized mapping from images to label scores__
The first part of the method is to define a score function that maps the pixel values of an image to a score for each class. A score indicates how confident the classifier is that the image belongs to that class. A concrete example makes this clearer. Suppose there is a training set of images $x_i \in \mathbb{R}^D$, each with a corresponding class label $y_i$. Here $i = 1, ..., N$ and $y_i \in \{1, ..., K\}$. That is, there are **N** examples, each of dimension **D**, and **K** distinct classes.

For example, in CIFAR-10 the training set has **N** = 50,000 images, each with **D** = 32 x 32 x 3 = 3072 pixels, and **K** = 10, because the images fall into 10 distinct classes (dog, cat, automobile, etc.). The score function is then defined as $f: \mathbb{R}^D \rightarrow \mathbb{R}^K$, a mapping from raw image pixels to class scores.

**Linear classifier**: In this model, we start with the simplest possible function, a linear mapping:

\[f(x_i, W, b) = W x_i + b\]

In the formula above, each image $x_i$ is assumed to be flattened into a single column vector of size [D x 1]. The matrix **W** (of size [K x D]) and the column vector **b** (of size [K x 1]) are the **parameters** of the function. Taking CIFAR-10 as the example again, $x_i$ contains all the pixels of the i-th image flattened into a [3072 x 1] column vector, **W** is [10 x 3072], and **b** is [10 x 1]. So 3072 numbers go into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in **W** are called the **weights**. **b** is called the **bias vector**, because it affects the output scores without interacting with the actual data $x_i$. In practice, the terms **weights** and **parameters** are often used interchangeably.

A few things to note:

- First, the single matrix multiplication $W x_i$ effectively evaluates 10 separate classifiers in parallel (one per class), where each classifier is one row of **W**.
- The input data $(x_i, y_i)$ is given and fixed, but the parameters **W** and **b** are under our control. The goal is to set these parameters so that the computed scores match the ground-truth labels across the whole training set.
- One advantage of this approach is that the training data is used only to learn the parameters **W** and **b**. Once training is complete, the training set can be discarded and only the learned parameters are kept, because a new test image can simply be passed through the function and classified by its computed scores.
- Once the parameters are learned, classifying a test image takes a single matrix multiplication and a vector addition. This is much faster than the kNN method, which compares the test image against every training image for each prediction.
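The score function and the prediction step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `linear_scores`, the random placeholder weights, and the fake image are all assumptions, and 1-D arrays stand in for the column vectors used in the text.

```python
import numpy as np

def linear_scores(W, b, x):
    """Return the K class scores f(x, W, b) = W x + b for one flattened image."""
    return W @ x + b

# CIFAR-10 sized shapes: D = 32*32*3 = 3072 pixels, K = 10 classes.
D, K = 3072, 10
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((K, D))   # placeholder weights, not trained
b = np.zeros(K)                           # placeholder biases
x = rng.integers(0, 256, size=D).astype(np.float64)  # fake flattened image

scores = linear_scores(W, b, x)           # one score per class
predicted_class = int(np.argmax(scores))  # index of the highest-scoring class
print(scores.shape, predicted_class)
```

Prediction needs only `W` and `b`, which is why the training set can be dropped once the parameters are learned.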

__Understanding linear classifiers__
A linear classifier computes the score of a class as a weighted sum of all pixel values across all 3 color channels. Depending on the values learned for the weights, the function can like or dislike (depending on the sign of each weight) certain colors at certain positions in the image. For example, you can imagine that images of the "ship" class tend to have a lot of blue around the edges (corresponding to water). You would then expect the "ship" classifier to have many positive weights on the blue channel (the presence of blue increases the "ship" score) and negative weights on the red and green channels (their presence decreases the "ship" score).

For example, suppose that a single-channel image has only 4 pixels and that there are 3 classes: cat, dog, and ship. The image pixels are first flattened into a column vector, which is then multiplied by **W** to obtain a score for each class. Note that this particular **W** is not good at all: it gives the cat image a very low cat score, and in fact it believes it is looking at a dog.

**Images as high-dimensional points**: Since each image is flattened into a high-dimensional column vector, we can interpret it as a single point in that space (for CIFAR-10, each image is a point in a 3072-dimensional space). The entire dataset is then a set of points, each with one class label.

Since the score of each class is defined as a weighted sum of all image pixels, each class score is a linear function over this space. We cannot visualize a 3072-dimensional space, but if we imagine squashing all of those dimensions into two, we can picture what the classifier is doing:

The image space, sketched in two dimensions. Each image is a point, and three classifiers are shown. Taking the car classifier (in red) as an example, the red line is the set of points in the space that get a car score of 0, and the red arrow shows the direction in which the score increases. All points to the right of the red line have positive scores that increase linearly, and all points to the left have negative scores that decrease linearly.

__The weight parameters of the linear classifier__

As seen above, each row of **W** is a classifier for one of the classes. The geometric interpretation of these numbers is that changing one row of **W** rotates the corresponding line in pixel space in different directions. The bias **b** allows the classifier to translate the lines. Note that without the bias terms, an input of $x_i = 0$ would always produce a score of 0 regardless of the weights, so every line would be forced to pass through the origin.

**Linear classifier as template matching**: Another interpretation of the weights **W** is that each row corresponds to a template (sometimes called a prototype) for one class. The score of each class for an image is obtained by comparing each template with the image using an inner product, one template at a time, to find the one that fits best. From this point of view, the linear classifier is doing template matching, where the templates are learned. Another way to think about it is that we are still effectively doing nearest neighbor, but instead of comparing against thousands of training images we use only a single image per class (an image that is learned and does not have to be one of the training images), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.

Here is an example of the weights learned at the end of training on CIFAR-10. Notice that the ship template contains a lot of blue pixels, as expected. This template therefore gives a high score when its inner product is taken with an image of a ship on the sea.

You can see that the horse template looks like a two-headed horse, because the training set contains horses facing both left and right. The linear classifier merges these two modes into a single template. Similarly, the car template seems to merge several modes into one template that has to identify cars of all colors seen from all sides. The car in this template ends up red, which hints that there are more red cars in the CIFAR-10 training set than cars of any other color. The linear classifier is too weak to account properly for cars of different colors, but as we will see later, neural networks can do this. A neural network can develop intermediate neurons in its hidden layers that detect specific kinds of cars (for example a green car facing left, or a blue car facing front). Neurons in the next layer can then combine these detectors through a weighted sum into a more accurate car score.

**Bias trick**: Before moving on, it is worth mentioning a common trick that merges the two parameters **W** and **b** into one. Recall that the score function was defined as:

\[f(x_i, W, b) = W x_i + b\]

Keeping track of two sets of parameters separately (the weights $W$ and the biases $b$) is a little cumbersome. A common trick is to combine them into a single matrix by extending the vector $x_i$ with one additional dimension that always holds the constant 1, a default bias dimension. With the extra dimension, the score function simplifies to a single matrix multiplication:

\[f(x_i, W) = W x_i\]

Taking CIFAR-10 as the example again, $x_i$ is now **[3073 x 1]** instead of [3072 x 1] (the extra dimension holds the constant 1), and **W** is now **[10 x 3073]**. The extra column of **W** holds the bias values **b**.

Illustration of the bias trick. On the left, a matrix multiplication is followed by adding the bias vector. On the right, a bias column is appended to the weight matrix and a dimension holding the constant 1 is appended to every input vector, so that a single matrix multiplication is enough. The two are equivalent. With the form on the right, we only have to learn a single weight matrix instead of separate weights and biases.
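The equivalence of the two forms can be checked numerically. The sketch below uses tiny random arrays (an assumption for illustration, matching the 4-pixel, 3-class toy example) and the hypothetical names `W_ext` and `x_ext` for the extended matrix and vector.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                        # 3 classes, 4 pixels
W = rng.standard_normal((K, D))
b = rng.standard_normal(K)
x = rng.standard_normal(D)

# Append b as an extra column of W, and a constant 1 as an extra entry of x.
W_ext = np.hstack([W, b[:, None]])   # shape (K, D + 1)
x_ext = np.append(x, 1.0)            # shape (D + 1,)

separate = W @ x + b                 # multiply, then add the bias
merged = W_ext @ x_ext               # a single multiplication
print(np.allclose(separate, merged))
```

The last column of `W_ext` is multiplied by the constant 1, so it contributes exactly `b` to each score.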

__Image data preprocessing__

In the examples above, all images used the raw pixel values (from 0 to 255). In machine learning it is common to normalize the input features. For image classification, every pixel can be treated as a feature. In practice, it is important to **center** the data by subtracting the mean from each feature. For images, this means computing a mean image over all the training images and subtracting it from every image, so that pixel values range roughly over [-127, 127]. A further common step is to scale each input feature so that its values range over [-1, 1]. Of the two steps, **zero-mean centering** matters more, because it gives the loss function a more regular shape for gradient descent.

__Loss function__

In the score function, the training data $(x_i, y_i)$ is given and cannot be modified. However, the weights can be adjusted so that the output of the score function agrees with the ground-truth labels in the training set, that is, so that the correct class receives the highest score.

Returning to the earlier cat image example, the score function produced scores for the three classes cat, dog, and ship. The weights in that example were very poor: the cat score was very low (-96.8), while the dog score (437.9) and the ship score (61.95) were comparatively high. We measure our unhappiness with results like this using a **loss function** (sometimes also called the **cost function**). Intuitively, the larger the gap between the output of the score function and the true result, the larger the loss, and vice versa.

__Multiclass Support Vector Machine__

The loss function can take many specific forms. Here the SVM is approached through the hinge loss. The SVM loss wants the score of the correct class to be higher than the score of every incorrect class by at least a fixed margin $\delta$. You can think of the loss function as a person: the SVM has its own taste in outcomes, and the lower the loss an outcome yields, the more the SVM likes it.

Recall that the i-th example consists of the pixels of the image $x_i$ and the label $y_i$ that gives the index of the correct class. The score function takes the pixels and computes the vector of class scores $f(x_i, W)$, which we abbreviate as $s$. For example, the score for class j is the j-th element: $s_j = f(x_i, W)_j$. The multiclass SVM loss for the i-th example is then defined as:

\[L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \delta)\]

**Example**: Let us work through an example to see how the formula is evaluated. Suppose there are 3 classes and the scores are $s = [13, -7, 11]$. The first class is the correct one, that is, $y_i = 0$. Also assume that $\delta$ is 10. The formula sums over all incorrect classes ($j \neq y_i$), so we get two terms:

\[L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10)\]

The first term is 0, because $-7 - 13 + 10$ is negative and is thresholded to 0 by the $\max(0, -)$ function. The loss for this pair of classes is 0 because the correct class score (13) exceeds the incorrect class score (-7) by 20, which is more than the margin of 10. The SVM only cares that the difference is at least 10; any additional difference beyond the margin still counts as a loss of 0. The second term evaluates $11 - 13 + 10$, which gives 8. Even though the correct class has a higher score than the incorrect class (13 > 11), the difference is only 2, which is less than the desired margin of 10, and so the loss comes out to 8 (the amount by which the margin is violated). In short, the SVM loss wants the score of the correct class $y_i$ to exceed the scores of the incorrect classes by at least $\delta$. If this is not the case, loss accumulates.
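The arithmetic of this example can be reproduced with a short NumPy sketch. The helper name `svm_loss_single` is hypothetical, and the scores, label, and margin are the ones used in the example above.

```python
import numpy as np

def svm_loss_single(scores, correct_class, delta):
    """Multiclass SVM (hinge) loss for one example, given its class scores."""
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0   # the sum skips j == y_i
    return margins.sum()

scores = np.array([13.0, -7.0, 11.0])
loss = svm_loss_single(scores, correct_class=0, delta=10.0)
print(loss)  # 8.0
```

The per-class margins before summing are `[0, 0, 8]`: the second class clears the margin, and the third class violates it by 8.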

Since this model uses a linear score function ($f(x_i, W) = W x_i$), we can rewrite the loss function in an equivalent form:

\[L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \delta)\]

where $w_j$ is the j-th row of $W$ reshaped as a column vector. This will no longer be the case once we consider more complex forms of the score function.

Before closing this section, it is worth mentioning the threshold-at-zero function $\max(0, -)$, which is often called the **hinge loss**. You will sometimes hear of the squared hinge loss SVM (or L2-SVM), which uses $\max(0, -)^2$ and so penalizes violated margins more strongly (quadratically rather than linearly). The unsquared version is more standard, but on some datasets the squared hinge loss works better. Cross-validation can be used to decide between them.

We are always somewhat unhappy with the predictions on the training set, and the loss function quantifies how unhappy we are.

—————————————————————————————————————————

The multiclass SVM "wants" the score of the correct class to be higher than the scores of all other classes by at least the margin $\delta$. If the score of any other class falls inside the red region (within $\delta$ of the correct class score) or higher, loss accumulates. Otherwise the loss is 0. Our goal is to find weights that satisfy this constraint for all examples in the training set while keeping the total loss as low as possible.

—————————————————————————————————————————

**Regularization**: There is a problem with the loss function above. Suppose we have a dataset and a set of weights **W** that classifies every example correctly (that is, all margins are met and $L_i = 0$ for all $i$). The issue is that this **W** is not necessarily unique: there may be many similar **W** that classify all the examples correctly. A simple example: if **W** classifies every example correctly, so that the loss is 0 for each, then any multiple $\lambda W$ with $\lambda > 1$ also gives zero loss, because the transformation stretches all scores uniformly and therefore also stretches their absolute differences. For example, if the difference between the score of the correct class and that of the nearest incorrect class is 15, then multiplying all elements of **W** by 2 makes the difference 30.
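The scaling argument can be checked numerically. The sketch below assumes a small random weight matrix and input with no separate bias term (as after the bias trick), and the helper name `score_gaps` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # 3 classes, 4 input dimensions
x = rng.standard_normal(4)

def score_gaps(W, x, correct_class):
    """Differences s_{y_i} - s_j between the correct class score and every class score."""
    s = W @ x
    return s[correct_class] - s

gaps = score_gaps(W, x, correct_class=0)
gaps_scaled = score_gaps(2.0 * W, x, correct_class=0)
print(np.allclose(gaps_scaled, 2.0 * gaps))
```

Doubling `W` doubles every gap, so any margin that was already satisfied stays satisfied, which is why the data loss alone cannot pick out a unique `W`.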

In other words, we want to encode a preference for some weights **W** over others in order to remove this ambiguity. We can do so by adding a **regularization penalty** $R(W)$ to the loss function. The most common regularization penalty is the L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters:

\[R(W) = \sum_k \sum_l W_{k,l}^2\]

In the expression above, the squares of all elements of $W$ are summed. Note that the regularization function is not a function of the data; it depends only on the weights. With the regularization penalty included, the full multiclass SVM loss has two parts: the **data loss**, which is the average loss $L_i$ over all examples, and the **regularization loss**. The full formula is:

\[L = \frac{1}{N} \sum_i L_i + \lambda R(W)\]

Expanded in full:

\[L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max(0, f(x_i, W)_j - f(x_i, W)_{y_i} + \delta) + \lambda \sum_k \sum_l W_{k,l}^2\]

where $N$ is the number of training examples. The regularization penalty is now part of the loss function, weighted by the hyperparameter $\lambda$. There is no simple way to set this hyperparameter, and it is usually determined by cross-validation.
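As a rough sketch of how the full loss could be computed (the mean of the per-example hinge losses plus `lam` times the sum of squared weights), here is a fully vectorized NumPy version. The function name `svm_loss_full`, the column-per-example layout of `X`, and the default values of `delta` and `lam` are assumptions made for illustration.

```python
import numpy as np

def svm_loss_full(W, X, y, delta=1.0, lam=1e-3):
    """Mean multiclass hinge loss over N examples plus lam * sum(W**2).

    W: (K, D) weights, X: (D, N) one example per column, y: (N,) integer labels.
    """
    N = X.shape[1]
    scores = W @ X                                   # (K, N) class scores
    correct = scores[y, np.arange(N)]                # (N,) score of the true class
    margins = np.maximum(0.0, scores - correct + delta)
    margins[y, np.arange(N)] = 0.0                   # skip the j == y_i terms
    data_loss = margins.sum() / N
    reg_loss = lam * np.sum(W * W)
    return data_loss + reg_loss

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))      # 5 fake examples with 4 features each
y = np.array([0, 1, 2, 1, 0])
W_zero = np.zeros((3, 4))
print(svm_loss_full(W_zero, X, y))   # 2.0
```

With all-zero weights every score is 0, so each of the K - 1 = 2 incorrect classes contributes exactly `delta = 1.0` per example and the regularization term vanishes, which gives 2.0.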

Beyond the motivation above, including the regularization penalty brings many desirable properties, most of which are covered in later sections. For example, including the L2 penalty leads to the appealing **max margin** property of SVMs. (See the CS229 course for details if you are interested.)

The most appealing property is that penalizing large weights tends to improve generalization, because it means that no single input dimension can have a very large influence on the scores all by itself. For example, suppose we have an input vector $x = [1, 1, 1, 1]$ and two weight vectors $w_1 = [1, 0, 0, 0]$ and $w_2 = [0.25, 0.25, 0.25, 0.25]$. Then $w_1^T x = w_2^T x = 1$, so both weight vectors give the same inner product, but the L2 penalty of $w_1$ is 1.0 while the L2 penalty of $w_2$ is only 0.25. According to the L2 penalty, $w_2$ is therefore preferred, because its regularization loss is lower. Intuitively, this is because the weights in $w_2$ are smaller and more diffuse. Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account in small amounts rather than relying strongly on a few of them. As later sections show, this effect can improve the generalization of the classifier and reduce *overfitting*.
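The comparison can be verified directly. The vectors below are assumed values, chosen so that both weight vectors give the same inner product with the input while their L2 penalties come out to the 1.0 and 0.25 quoted above.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])        # all weight on one dimension
w2 = np.array([0.25, 0.25, 0.25, 0.25])    # weight spread over all dimensions

print(w1 @ x, w2 @ x)                  # 1.0 1.0
print(np.sum(w1**2), np.sum(w2**2))    # 1.0 0.25
```

Both vectors score the input identically, so only the regularization term distinguishes them, and it favors the more diffuse `w2`.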

Note that, unlike the weights, the biases do not have this effect, because they do not control the strength of influence of any input dimension. It is therefore common to regularize only the weights **W** and not the biases **b**. In practice, though, this choice often turns out to have a negligible effect. Finally, because of the regularization penalty, it is not possible to reach a loss of exactly 0 on all examples, since that would only happen in the pathological case $W = 0$.

**Code**: The following is a Python implementation of the loss function without the regularization term, in both unvectorized and half-vectorized form:
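A minimal sketch of the two forms, assuming NumPy, 1-D arrays for the flattened image, and a default margin of `delta=1.0`. The function names are illustrative, and regularization is deliberately left out.

```python
import numpy as np

def L_i_unvectorized(x, y, W, delta=1.0):
    """Loss for one example, with an explicit loop over classes.

    x: (D,) flattened image (with a trailing 1 if the bias trick is used)
    y: integer index of the correct class
    W: (K, D) weight matrix
    """
    scores = W @ x
    correct_class_score = scores[y]
    loss_i = 0.0
    for j in range(W.shape[0]):
        if j == y:
            continue  # the sum runs over incorrect classes only
        loss_i += max(0.0, scores[j] - correct_class_score + delta)
    return loss_i

def L_i_half_vectorized(x, y, W, delta=1.0):
    """Same loss for one example, with the loop over classes vectorized."""
    scores = W @ x
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0  # drop the j == y term, which would otherwise equal delta
    return float(np.sum(margins))

# Weights chosen so that the scores are [13, -7, 11], as in the worked example.
W = np.array([[13.0], [-7.0], [11.0]])
x = np.array([1.0])
print(L_i_unvectorized(x, 0, W, delta=10.0), L_i_half_vectorized(x, 0, W, delta=10.0))  # 8.0 8.0
```

Both functions handle a single example. A fully vectorized version would process all examples at once with one matrix product.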

The takeaway from this section is that the SVM loss takes one particular approach to measuring how consistent the predictions on the training data are with the ground-truth labels. In addition, making good predictions on the training set is equivalent to minimizing the loss.

The next step is to find the weights that minimize the loss.

__Practical considerations__

**Setting Delta**: You may have noticed that the discussion above glossed over the hyperparameter $\delta$ and how to set it. What value should it take, and does it need to be cross-validated? It turns out that $\delta = 1.0$ is a safe choice in most cases. The hyperparameters $\delta$ and $\lambda$ look like two different hyperparameters, but in fact they both control the same tradeoff: the tradeoff between the data loss and the regularization loss in the objective. The key to understanding this is that the magnitude of the weights **W** directly affects the scores (and hence also their differences): as we shrink all the values in **W**, the score differences become smaller, and as we scale the weights up, the differences become larger. The exact value of the margin between scores (for example $\delta = 1$ or $\delta = 100$) is therefore in some sense meaningless, because the weights can shrink or stretch the differences arbitrarily. In other words, the only real tradeoff is how large we allow the weights to grow (controlled by the regularization strength $\lambda$).

**Relation to the binary Support Vector Machine**: Before this course you may have had some experience with the binary support vector machine, where the loss for the i-th example is written as:

\[L_i = C \max(0, 1 - y_i w^T x_i) + R(W)\]

where $C$ is a hyperparameter and $y_i \in \{-1, 1\}$. The SVM formulation introduced in this section contains the formula above as a special case: with only two classes, the multiclass formula reduces to the binary SVM formula. $C$ in the binary formula and $\lambda$ in the multiclass formula control the same tradeoff, and they are related by $C \propto \frac{1}{\lambda}$.

**Note: optimization in the primal**. If you studied SVMs before this course, you may have heard of kernels, duals, and the SMO algorithm. In this course (as is mostly the case with neural networks), the loss function is always optimized in its unconstrained primal form. Many of these loss functions are technically not differentiable (for example, $\max(x, y)$ is not differentiable where $x = y$), but in practice this is not a problem, because a subgradient can usually be used.

**Note: other multiclass SVM formulations**. The multiclass SVM presented in these notes is only one of several ways of formulating an SVM over multiple classes. Another commonly used form is the *One-vs-All* (OVA) SVM, which trains an independent binary SVM for each class against all other classes. A related but less common strategy is *All-vs-All* (AVA). Our formulation follows the Weston and Watkins 1999 version, which is more powerful than OVA in the sense that you can construct multiclass datasets on which this version achieves zero data loss but OVA cannot (see the paper for details if you are interested). The last formulation worth knowing is the structured SVM, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect class. Understanding the differences between these formulations is beyond the scope of this course. The version presented in these notes is safe to use in practice, although the arguably simplest OVA strategy is likely to work just as well (as also argued by Rifkin et al. 2004 in In Defense of One-Vs-All Classification).

Reference:

https://zhuanlan.zhihu.com/p/20918580

cs231n Note (i) linear classifier