A loss function is used to measure the degree of inconsistency between the model's prediction f(x) and the true value Y. It is a non-negative real-valued function, usually written L(Y, f(x)); the smaller the loss, the better the robustness of the model. The loss function is the core of the empirical risk function and is also an important component of the structural risk function. The structural risk of a model consists of the empirical risk term plus a regularization term, and can usually be expressed as follows:
$$\theta^* = \arg\min_{\theta}\ \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, f(x_i; \theta)\big) + \lambda\, \Phi(\theta)$$
Here, the mean term in front is the empirical risk, L is the loss function, and the Φ(θ) that follows is the regularization term (regularizer) or penalty term, which can be L1, L2, or some other regular function. The whole expression means finding the value of θ that minimizes this objective. Several common loss functions are listed below.
I. Log loss function (logistic regression)
Some people may think that the loss function of logistic regression is the squared loss, but it is not. The squared loss function can be derived from linear regression under the assumption that the samples follow a Gaussian distribution, whereas logistic regression does not use the squared loss. In the derivation of logistic regression, the samples are assumed to follow a Bernoulli (0-1) distribution; the likelihood function for that distribution is written down, then the logarithm is taken and the extremum is found. Logistic regression does not literally maximize the likelihood function; rather, following the maximum-likelihood idea, it derives its empirical risk function as the minimization of the negative likelihood (i.e. max F(y, f(x)) → min −F(y, f(x))). From the point of view of the loss function, this is exactly the log loss function.
Log loss function in standard form:
$$L\big(Y, P(Y|X)\big) = -\log P(Y|X)$$
As just mentioned, the logarithm is taken to make the maximum likelihood estimate easier to compute: in MLE, direct differentiation is difficult, so one usually takes the logarithm first and then differentiates to find the extremum. The loss function L(Y, P(Y|X)) expresses how likely sample X is to belong to class Y, so we want to make the probability P(Y|X) as large as possible (in other words, using the known sample distribution, find the parameter values that are most likely to have produced the observed data). Because the log function is monotonically increasing, log P(Y|X) reaches its maximum at the same point, so after adding a minus sign in front, maximizing P(Y|X) is equivalent to minimizing L.
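To make the link between the Bernoulli likelihood and the log loss explicit, here is a short sketch of the derivation (the notation $p_i = P(y_i = 1 \mid x_i)$ with 0/1 labels is mine, not from the original text). The likelihood of N independent samples is $\prod_{i=1}^{N} p_i^{y_i}(1-p_i)^{1-y_i}$; taking the logarithm and negating it gives the empirical risk to minimize:
$$\min_{\theta}\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big]$$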
The expression for P(Y=y|x) in logistic regression is as follows:
$$P(Y=y|x) = \frac{1}{1 + \exp\big(-y f(x)\big)}$$
Substituting this into the formula above, the logistic loss function expression is obtained:
$$L\big(y, P(Y=y|x)\big) = \log\Big(1 + \exp\big(-y f(x)\big)\Big)$$
The final objective of logistic regression is as follows (y_ij indicates whether sample i belongs to class j, and p_ij is the predicted probability of that class):
$$J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log p_{ij}$$
For binary classification M equals 2; for multi-class problems, M is the total number of classes. One point worth explaining: some people think logistic regression uses the squared loss because, when gradient descent is used to find the optimal solution, its update formula looks very similar to the one derived from the squared loss, which gives a misleading impression.
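As an illustration only (not from the original post), here is a minimal NumPy sketch of the binary log loss computed from predicted probabilities; the clipping constant eps is an assumption added for numerical stability:

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    """Binary log loss (cross-entropy) for labels in {0, 1}."""
    p = np.clip(p_pred, eps, 1.0 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.4])
print(log_loss(y, p))  # smaller is better
```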
For more detail, this PDF is a useful reference: Lecture 6: logistic regression.pdf.
II. Quadratic loss function (least squares, ordinary least squares)
The least squares method is a form of linear regression; OLS turns regression into a convex optimization problem. Linear regression assumes that both the samples and the noise follow a Gaussian distribution (why Gaussian? there is a small hidden knowledge point here, the central limit theorem; see "The Central Limit Theorem"), and the least squares equations can then be derived via maximum likelihood estimation (MLE). The basic principle of least squares is that the best fitting line should minimize the sum of distances from each point to the regression line, i.e. the sum of squared distances. In other words, OLS is based on distance, and the distance used is the familiar Euclidean distance. Why choose Euclidean distance as the error metric (i.e. mean squared error, MSE)? Mainly for the following reasons: it is simple and easy to compute; Euclidean distance is a good similarity measure; and its properties are unchanged after transformation to a different representation domain.
The standard form of squared loss (square loss) is as follows:
$$L\big(Y, f(X)\big) = \big(Y - f(X)\big)^2$$
When there are n samples, the loss function becomes:
$$L\big(Y, f(X)\big) = \sum_{i=1}^{n} \big(Y_i - f(X_i)\big)^2$$
Y − f(X) denotes the residual, so the whole expression is the sum of squared residuals, and our aim is to minimize this objective (note: no regularization term has been added to the formula), i.e. to minimize the residual sum of squares (RSS).
In practical applications, the mean squared error (MSE) is usually used as the measure, with the following formula:
$$MSE = \frac{1}{n}\sum_{i=1}^{n} \big(\tilde{y}_i - y_i\big)^2$$
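Purely as an illustration (not from the original text), here is a minimal NumPy sketch of RSS and MSE for a fitted model's predictions:

```python
import numpy as np

def rss(y_true, y_pred):
    """Residual sum of squares."""
    return np.sum((y_true - y_pred) ** 2)

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.7])
print(rss(y_true, y_pred), mse(y_true, y_pred))
```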
Since linear regression was mentioned above, one extra remark: we usually speak of two kinds of linearity. One is that the dependent variable y is a linear function of the independent variable x; the other is that y is a linear function of the parameters. In machine learning it usually refers to the latter.
III. Exponential loss function (AdaBoost)
Anyone who has studied the AdaBoost algorithm knows that it is a special case of the forward stagewise additive algorithm: it is an additive model, and its loss function is the exponential function. In AdaBoost, after the m-th iteration we obtain f_m(x):
$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$
The goal of each AdaBoost iteration is to find the α and G that minimize the following expression:
$$(\alpha_m, G_m) = \arg\min_{\alpha, G} \sum_{i=1}^{N} \exp\Big[-y_i \big(f_{m-1}(x_i) + \alpha\, G(x_i)\big)\Big]$$
The standard form of the exponential loss function (exp-loss) is as follows:
$$L\big(y, f(x)\big) = \exp\big(-y f(x)\big)$$
It can be seen that AdaBoost's objective is exactly the exponential loss. Given n samples, the AdaBoost loss function is:
$$L\big(y, f(x)\big) = \frac{1}{n}\sum_{i=1}^{n} \exp\big(-y_i f(x_i)\big)$$
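As an illustrative sketch of my own (not part of the original post), the exponential loss for ±1 labels and real-valued scores f(x) can be computed like this:

```python
import numpy as np

def exp_loss(y, f_x):
    """Exponential loss for labels y in {-1, +1} and real-valued scores f(x)."""
    return np.mean(np.exp(-y * f_x))

y   = np.array([1, -1, 1, -1])
f_x = np.array([2.0, -1.5, 0.3, 0.8])   # the last sample is misclassified
print(exp_loss(y, f_x))  # misclassified points are penalized exponentially
```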
For the derivation of AdaBoost, refer to the Wikipedia article on AdaBoost or p. 145 of Statistical Learning Methods.
IV. Hinge loss function (SVM)
In machine learning algorithms, the hinge loss function is closely tied to the SVM. For the linear support vector machine, the optimization problem can be written equivalently as:
$$\min_{w,b}\ \sum_{i=1}^{N} \big[1 - y_i\,(w \cdot x_i + b)\big]_{+} + \lambda \lVert w \rVert^2$$
Now transform this expression by letting:
$$\big[1 - y_i\,(w \cdot x_i + b)\big]_{+} = \xi_i$$
So the original form becomes:
$$\min_{w,b}\ \sum_{i=1}^{N} \xi_i + \lambda \lVert w \rVert^2$$
If we take λ = 1/(2C), the expression can be written as:
$$\min_{w,b}\ \frac{1}{C}\left(\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i\right)$$
As you can see, this is very similar to the standard soft-margin SVM objective:
$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i$$
The first half, [1 − y_i(w·x_i + b)]_+, is exactly the hinge loss function, and the term that follows is equivalent to an L2 regularization term.
The standard form of the hinge loss function:
$$L(y) = \max\big(0,\ 1 - y\,\tilde{y}\big), \quad y = \pm 1$$
It can be seen that when y·ỹ ≥ 1, L(y) = 0.
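As a quick illustration (my own sketch, not from the original), the hinge loss for ±1 labels and raw decision values ỹ:

```python
import numpy as np

def hinge_loss(y, y_pred):
    """Hinge loss for labels y in {-1, +1} and raw decision values y_pred."""
    return np.mean(np.maximum(0.0, 1.0 - y * y_pred))

y      = np.array([1, -1, 1])
y_pred = np.array([1.5, -0.3, 2.0])   # the middle sample falls inside the margin
print(hinge_loss(y, y_pred))  # zero only when y * y_pred >= 1 for every sample
```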
For more information, refer to Hinge-loss.
One more note: in LIBSVM a total of 4 kernel functions can be selected; the corresponding values of the -t parameter are: 0 - linear kernel, 1 - polynomial kernel, 2 - RBF kernel, 3 - sigmoid kernel.
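LIBSVM itself is driven from the command line, but the same four kernels are exposed through scikit-learn's SVC class, which is built on LIBSVM. The sketch below is my own illustration with made-up toy data:

```python
from sklearn.svm import SVC

# The four LIBSVM kernels (-t 0..3) map to these kernel names in scikit-learn.
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.predict([[0.9, 0.9]]))
```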
V. Other loss functions
In addition to the loss functions above, other commonly used ones include (a short sketch of both follows the list):
0-1 loss function
Absolute value loss function
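As a sketch of my own (using the standard definitions of these two losses, which the text above does not spell out), both can be written in a few lines of NumPy:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 1 for each misclassified sample, 0 otherwise (averaged here)."""
    return np.mean(y_true != y_pred)

def absolute_loss(y_true, y_pred):
    """Absolute value loss: |Y - f(X)| (averaged here)."""
    return np.mean(np.abs(y_true - y_pred))

print(zero_one_loss(np.array([1, 0, 1]), np.array([1, 1, 1])))
print(absolute_loss(np.array([3.0, 5.0]), np.array([2.5, 5.4])))
```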
Finally, take a look at the visualization of these loss functions plotted together: pay attention to the horizontal and vertical axes, and check which loss function each curve represents; looking at it a few times helps the material sink in.
OK, that's it for now, time for a rest. Finally, remember: the more parameters a model has, the more complex it is, and the more complex the model, the more easily it overfits. Overfitting means that the model performs much better on the training data than on the test set. At that point regularization can be considered: by setting a hyperparameter in front of the regularization term to balance the loss function against the penalty, the scale of the parameters is reduced, the model is simplified, and the model gains better generalization ability.
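As one last illustrative sketch (not from the original), this is how a regularized objective of the form described above, empirical loss plus λ times an L2 penalty, might look in NumPy; lam is the hyperparameter being discussed:

```python
import numpy as np

def regularized_objective(w, X, y, lam):
    """Squared loss on (X, y) for linear weights w, plus an L2 penalty on w."""
    residuals = y - X @ w
    empirical_risk = np.mean(residuals ** 2)
    penalty = lam * np.sum(w ** 2)
    return empirical_risk + penalty

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])
print(regularized_objective(w, X, y, lam=0.1))
```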