Reposted from: http://blog.csdn.net/ybdesire/article/details/73695163

Problem
When computing log loss for a multi-class problem with sklearn, code like the following raises an error. Here y_true holds the true labels and y_pred the predicted labels:

y_true = [0, 1, 3]
y_pred = [1, 2, 1]
log_loss(y_true, y_pred)

ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 3]
What is going on here?

The error arises from not understanding how log loss is computed: log loss requires its inputs to be one-hot encoded. The problem can be fixed with OneHotEncoder, as follows.
from sklearn.metrics import log_loss
from sklearn.preprocessing import OneHotEncoder

# note: n_values was deprecated in sklearn 0.20 (newer versions use categories)
one_hot = OneHotEncoder(n_values=4, sparse=False)
y_true = one_hot.fit_transform([[0], [1], [3]])
y_pred = one_hot.fit_transform([[1], [2], [1]])
log_loss(y_true, y_pred)
So, what exactly is the calculation process of log loss? It is explained in detail below.

Log Loss Calculation in Detail
First, let's look at the log loss formula:

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{i,j}\log(p_{i,j})$$
The symbols in this formula mean:
- N: the number of samples
- M: the number of classes; in the multi-class example above, M is 4
- y_{i,j}: 1 if sample i belongs to class j, otherwise 0
- p_{i,j}: the probability that sample i is predicted as class j
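As a sketch, the textbook formula above translates directly into NumPy. The probability matrix below is made up purely for illustration (rows sum to 1, as proper probabilities should):

```python
import numpy as np

# hypothetical example: 3 samples, 4 classes
# y encodes the true class one-hot; p holds predicted probabilities
y = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]])
p = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.2, 0.1, 0.1, 0.6]])

n = y.shape[0]
# -(1/N) * sum over samples and classes of y_ij * log(p_ij)
logloss = -(y * np.log(p)).sum() / n
print(logloss)
```

Only the entries where y is 1 contribute, so the loss is the average negative log probability assigned to the true class.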
We use the following data to illustrate the computation:

y_true = [0, 1, 3]
y_pred = [1, 2, 1]

Computing the log loss
First, we know that N = 3 (3 samples) and M = 4 (4 classes: 0, 1, 2, 3).

So Y and P are both 3x4 one-hot matrices:

y = array([[1, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 0, 1]])
p = array([[0, 1, 0, 0],
           [0, 0, 1, 0],
           [0, 1, 0, 0]])
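As a sketch, the 3x4 one-hot matrices can be built with plain NumPy, equivalent to what OneHotEncoder produces for these labels:

```python
import numpy as np

y_true = [0, 1, 3]
y_pred = [1, 2, 1]
n, m = 3, 4  # 3 samples, 4 classes

# put a 1 in column k of row i when sample i has label k
Y = np.zeros((n, m))
Y[np.arange(n), y_true] = 1
P = np.zeros((n, m))
P[np.arange(n), y_pred] = 1
```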
However, taking the log of the P matrix is a problem: log(0) is negative infinity. sklearn solves this by replacing every 0 in P with 1e-15:
p = array([[1.00000000e-15, 1.00000000e+00, 1.00000000e-15, 1.00000000e-15],
           [1.00000000e-15, 1.00000000e-15, 1.00000000e+00, 1.00000000e-15],
           [1.00000000e-15, 1.00000000e+00, 1.00000000e-15, 1.00000000e-15]])
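This replacement can be sketched with np.clip; sklearn clips probabilities into the range [1e-15, 1 − 1e-15] so that np.log stays finite:

```python
import numpy as np

eps = 1e-15
p = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

# clip probabilities away from 0 (and 1) so np.log never sees a 0
p_clipped = np.clip(p, eps, 1 - eps)
```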
Also, after debugging (see the reference below on how to debug sklearn source), it turns out that sklearn makes a small change to the log loss formula: the 1/N factor is moved inside the log, applied to p.
$$\mathrm{logloss} = -\sum_{i=1}^{N}\sum_{j=1}^{M} y_{i,j}\log\left(\frac{1}{N}\,p_{i,j}\right)$$
This change corresponds to the following line in the source code:

y_pred /= y_pred.sum(axis=1)[:, np.newaxis]
So the two matrices become:

# the P above divided by 3
p = array([[3.33333333e-16, 3.33333333e-01, 3.33333333e-16, 3.33333333e-16],
           [3.33333333e-16, 3.33333333e-16, 3.33333333e-01, 3.33333333e-16],
           [3.33333333e-16, 3.33333333e-01, 3.33333333e-16, 3.33333333e-16]])
y = array([[1, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 0, 1]])
With Y and P in hand, the log loss is computed by element-wise multiplication and summation:

loss = -(y * np.log(p)).sum(axis=1)  # per-sample losses
loss = loss.sum()                    # total
The final log loss is 106.91216605.

Summary

- To avoid log(0), sklearn replaces 0 with 1e-15 in the log loss calculation.
- sklearn's calculation differs slightly from the textbook log loss formula.

References

- sklearn's log_loss source: https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/metrics/classification.py#L1544
- How to debug a Python third-party library dynamically: http://blog.csdn.net/ybdesire/article/details/54649211
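Putting the steps above together, the whole computation described in this post can be reproduced with a short NumPy sketch: one-hot encode both label vectors, replace 0 with 1e-15, apply the 1/N factor, then sum:

```python
import numpy as np

y_true = [0, 1, 3]
y_pred = [1, 2, 1]
n, m = 3, 4  # 3 samples, 4 classes

# one-hot encode both label vectors
Y = np.zeros((n, m))
Y[np.arange(n), y_true] = 1
P = np.zeros((n, m))
P[np.arange(n), y_pred] = 1

P = np.clip(P, 1e-15, 1 - 1e-15)  # avoid log(0)
P = P / n                          # the 1/N factor moved inside the log
loss = -(Y * np.log(P)).sum()
print(loss)  # ≈ 106.91216605
```

All three predictions miss the true class, so each sample contributes −log(1e-15 / 3) ≈ 35.637, giving the total of 106.912.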