Nearly all machine learning algorithms work by minimizing or maximizing a function, which we call the "objective function". The functions we minimize are called "loss functions". A loss function measures how well a predictive model predicts the expected outcome. The most common way to find the minimum of a function is gradient descent. Think of the loss function as an undulating mountain range: gradient descent is like sliding down from the top of a mountain toward its lowest point.
No single loss function works for all kinds of data. The choice of loss function depends on many factors, including the presence of outliers, the machine learning algorithm being used, the time efficiency of running gradient descent, how easy it is to compute the derivative of the function, and the confidence required in the predictions. The purpose of this post is to help you understand the different loss functions.
Loss functions can be broadly divided into two categories: classification loss and regression loss. In this post we will focus on five kinds of regression losses.
Regression functions predict a real value; classification functions predict a label.
▌ Regression Losses
1. Mean Squared Error, Quadratic Loss, L2 Loss
Mean squared error (MSE) is the most commonly used regression loss function. MSE is the average of the squared differences between the target values and the predicted values.
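Written out explicitly (the notation y_i for the true value and ŷ_i for the prediction is added here for clarity, not from the original):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$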
Below is a plot of the MSE function, where the true target value is 100 and the predicted values range from -10,000 to 10,000. The MSE loss (y-axis) reaches its minimum when the predicted value (x-axis) equals 100. The loss ranges from 0 to ∞.
MSE loss (y-axis) vs. predicted values (x-axis)
2. Mean Absolute Error, L1 Loss
Mean absolute error (MAE) is another loss function used for regression models. MAE is the average of the absolute differences between the target values and the predicted values. It therefore measures the average magnitude of the errors in a set of predictions, without considering their direction. (If we also considered direction, we would get the mean bias error, MBE, which is the average of the signed residuals.) MAE also ranges from 0 to ∞.
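In the same notation as above:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$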
MAE loss (y-axis) vs. predicted values (x-axis)
MSE vs. MAE (L2 Loss vs. L1 Loss)
In short, the squared error is easier to solve for, but the absolute error is more robust to outliers. Let's dig a bit deeper to understand why.
Whenever we train a machine learning model, our goal is to find the point that minimizes the loss function. Of course, both of these loss functions reach their minimum when the predicted value exactly equals the true value.
Let's quickly go through the Python code for the two loss functions. We can write our own functions or use the built-in metric functions from sklearn:
import numpy as np

# true: array of true target values
# pred: array of predicted values

def mse(true, pred):
    # mean squared error: average of the squared differences
    return np.mean((true - pred) ** 2)

def mae(true, pred):
    # mean absolute error: average of the absolute differences
    return np.mean(np.abs(true - pred))

# the same metrics are also available in sklearn
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
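A quick sanity check of the handwritten functions against sklearn might look like this (the sample arrays below are invented purely for illustration):

# hypothetical example values, just for illustration
y_true = np.array([100.0, 102.0, 98.0, 101.0])
y_pred = np.array([99.0, 103.0, 97.0, 100.0])

print(mse(y_true, y_pred), mean_squared_error(y_true, y_pred))   # both = 1.0
print(mae(y_true, y_pred), mean_absolute_error(y_true, y_pred))  # both = 1.0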
Let's look at the MAE and RMSE values for two examples (RMSE, root mean squared error, is simply the square root of the MSE, which puts it on the same scale as MAE). In the first example, the predictions are close to the true values and the errors vary little across observations. In the second example, there is one outlying observation with a large error.
Left: the errors are close to one another. Right: one error is far larger than the others.
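We can reproduce this kind of comparison with a small sketch, reusing the mse and mae functions defined above (the numbers are invented for illustration, not taken from the original figures):

# example 1: all predictions are close to the true values
y_true_1 = np.array([100.0] * 10)
y_pred_1 = y_true_1 + np.array([1.0, -1.0, 2.0, -2.0, 1.0, -1.0, 2.0, -2.0, 1.0, -1.0])

# example 2: the same predictions, but one is a large outlier
y_pred_2 = y_pred_1.copy()
y_pred_2[0] = 160.0  # single outlying prediction

for name, pred in [("no outlier", y_pred_1), ("with outlier", y_pred_2)]:
    print(name,
          "MAE =", mae(y_true_1, pred),
          "RMSE =", np.sqrt(mse(y_true_1, pred)))
# the single outlier inflates RMSE far more than MAE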
What do we see from them? How do we choose which loss function to use?
Because MSE squares the error (e = y - y_predicted), the loss grows rapidly once |e| > 1. If there is an outlier in our data, e will be very large and e² will be much larger than |e|. As a result, a model trained with MSE loss gives more weight to outliers than a model trained with MAE loss. In the second example above, a model trained with RMSE loss will be adjusted to reduce that single outlier's error at the expense of the other, normal data points, which ultimately degrades the model's overall performance.
MAE loss is useful when the training data is corrupted by outliers, i.e., when the training data (rather than the test data) mistakenly contains unrealistically large or negative values.
Intuitively, we can think about it like this: if we had to give a single prediction for all observations, the prediction that minimizes the MSE would be the mean of all target values, while the prediction that minimizes the MAE would be the median of all target values. We know the median is more robust to outliers than the mean, which makes MAE more robust to outliers than MSE.
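A tiny numerical sketch of this intuition, again reusing the functions above (the data values are made up):

# target values containing one outlier
targets = np.array([10.0, 11.0, 12.0, 13.0, 100.0])

mean_pred = np.mean(targets)      # 29.2
median_pred = np.median(targets)  # 12.0

# a single constant prediction for every observation
print("MSE at mean:",   mse(targets, np.full_like(targets, mean_pred)))
print("MSE at median:", mse(targets, np.full_like(targets, median_pred)))
print("MAE at mean:",   mae(targets, np.full_like(targets, mean_pred)))
print("MAE at median:", mae(targets, np.full_like(targets, median_pred)))
# MSE is smallest at the mean; MAE is smallest at the median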
A big problem with using MAE loss (especially for neural networks) is that its gradient is constant: even when the loss is small, the gradient stays large, which is bad for learning. To deal with this, we can use a learning rate that decreases dynamically as we approach the minimum. MSE behaves well in this respect even with a fixed learning rate: the gradient of the MSE loss is large when the loss is large and shrinks as the loss approaches 0, making the model more precise at the end of training.
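To make that concrete, the per-sample gradients with respect to the prediction ŷ (written out here for reference, in the same notation as above) are:

$$\frac{\partial}{\partial \hat{y}}\,(y-\hat{y})^2 = -2\,(y-\hat{y}), \qquad \frac{\partial}{\partial \hat{y}}\,\lvert y-\hat{y}\rvert = -\operatorname{sign}(y-\hat{y})$$

The MSE gradient is proportional to the error, while the MAE gradient always has magnitude 1, no matter how small the error is.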
So which loss function should you use?
If outliers represent anomalies that matter for the business and should be detected, we should use MSE. If, on the other hand, we believe the outliers merely represent corrupted data, we should choose MAE as the loss.
I recommend reading the following article, which includes a good study comparing the performance of regression models using L1 loss and L2 loss, both with and without outliers. Keep in mind that L1 and L2 loss are just other names for MAE and MSE, respectively.
Address:
http://rishy.github.io/ml/2015/07/28/l1-vs-l2-loss/
To summarize: L1 loss is more robust to outliers, but its derivative is discontinuous, which makes finding the solution less efficient. L2 loss is sensitive to outliers, but it yields a more stable closed-form solution (obtained by setting its derivative to 0).
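For linear regression, that closed-form solution is the familiar normal equation (stated here for reference; X is the design matrix and y the target vector, notation not in the original):

$$\hat{\beta} = \left(X^{\top}X\right)^{-1}X^{\top}y$$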
Problems with both loss functions: it is possible that neither gives ideal predictions. For example, suppose 90% of the observations in our data have a true target value of 150, while the remaining 10% have true target values between 0 and 30. A model trained with MAE might then predict 150 for all observations, ignoring the 10% of outlying cases, because it tries to move toward the median. Conversely, a model trained with MSE would give many predictions in the 0 to 30 range, because it is pulled toward the outliers. Both outcomes are undesirable in many business settings.
What can we do in this situation? A simple fix is to transform the target variable. Another is to try a different loss function. This is the motivation for our third loss function: Huber loss.
3. Huber Loss, Smooth Mean Absolute Error
Huber loss is less sensitive to outliers in the data than squared error loss, and it is also differentiable at 0. It is basically absolute error, but it becomes quadratic when the error is small. How small the error has to be for the loss to become quadratic is controlled by a hyperparameter, delta (δ), which can be tuned. When δ approaches 0, Huber loss approaches MAE, and when δ approaches ∞, it approaches MSE.
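A minimal sketch of Huber loss in the same numpy style as the earlier functions (the function name and the default delta value here are my own choices, not from the original):

def huber(true, pred, delta=1.0):
    error = true - pred
    is_small = np.abs(error) <= delta
    # quadratic for small errors, linear for large errors
    squared = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(is_small, squared, linear))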