This issue's answers, grounded in a concrete Hulu business case, are both interesting and easy to follow. Let's dig in!

Today's content is

**"Evaluating Classification, Ranking, and Regression Models"**

**Scenario Description**

In model evaluation, classification, ranking, and regression problems each call for different evaluation metrics. Among the many metrics available, most reflect only part of a model's capability; without a sensible, comprehensive combination of them, we may not only fail to find the model's real problems, but even draw wrong conclusions. Against the background of Hulu's business, we set up several hypothetical model-evaluation scenarios to see whether we can spot problems with the chosen metrics, or with the models themselves.

**Problem description**

The limitations of accuracy

The tradeoff between precision and recall

The "unexpected" behavior of root mean square error (RMSE)

*Knowledge Points:*

*Accuracy* *Precision*

*Recall* *Root Mean Square Error (RMSE)*

**Solutions and Analysis**

**1. The limitations of accuracy**

Hulu's luxury-goods advertisers want to target their ads at luxury users. Hulu obtained data on a portion of luxury users through a third-party DMP (Data Management Platform) and used it as the training and test sets to train a luxury-user classification model. The model's accuracy exceeds 95%, yet during actual ad delivery it still serves most ads to non-luxury users. What might be the cause?

*Difficulty: 1 star*

Before answering this question, let us first clarify the definition of classification accuracy: accuracy is the proportion of correctly classified samples among the total number of samples,

Accuracy = n_correct / n_total

where n_correct is the number of correctly classified samples and n_total is the total number of samples.

Accuracy is the simplest and most intuitive evaluation metric for classification problems, but it has an obvious flaw: when the class proportions in the sample are very imbalanced, the majority class dominates the metric. For example, if negative samples make up 99% of the data, a classifier that predicts every sample as negative still achieves 99% accuracy.

With this in mind, we can solve the problem. Luxury users are clearly only a small fraction of all Hulu users, so a high overall classification accuracy does not imply high accuracy on the luxury-user class. Since online delivery serves ads only to users the model labels "luxury", the insufficient accuracy on that class gets magnified. To address this, it is more effective to evaluate the model with the average per-class accuracy (the arithmetic mean of the accuracy on each class).
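As a minimal sketch of the point above (all names and numbers are hypothetical), the snippet below contrasts overall accuracy with average per-class accuracy on an imbalanced binary sample:

```python
# Overall accuracy vs. average per-class accuracy on imbalanced data.
from collections import defaultdict

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def average_per_class_accuracy(y_true, y_pred):
    # Arithmetic mean of the accuracy computed within each class.
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# 98 non-luxury users (0), 2 luxury users (1); model predicts 0 for everyone.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100

print(accuracy(y_true, y_pred))                    # 0.98
print(average_per_class_accuracy(y_true, y_pred))  # 0.5
```

The degenerate classifier looks excellent under overall accuracy but mediocre under the per-class average, which is exactly the failure mode described above.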

In fact, this is a relatively open question that requires the interviewee to troubleshoot step by step from the observed phenomenon. The standard answer is not limited to metric selection: even with well-chosen metrics there can still be model over-fitting or under-fitting, problems with how the training and test sets were split, differences between the offline evaluation and online sample distributions, and so on. Metric choice, however, is the factor that is easiest to spot and most likely to affect the evaluation's conclusions.

**2. The tradeoff between precision and recall**

Hulu provides fuzzy search for videos. The precision of the top 5 results returned by the search ranking model is very high, yet in actual use, users still often cannot find the videos they are looking for, especially some less popular shows. Where might the problem lie?

*Difficulty: 1 star*

To answer this question, we first need to clarify two concepts: precision and recall.

Precision: the proportion of samples the classifier judges to be positive that are truly positive.

Recall: the proportion of truly positive samples that the classifier correctly identifies as positive.

Because a ranking model usually has no exact threshold for deciding which results are positive or negative, the precision and recall of its top N returned results are often used to measure its performance. That is, we treat the top N results returned by the ranking model as its predicted positive samples and compute Precision@N and Recall@N.
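A minimal sketch of Precision@N and Recall@N for a single query, assuming `ranked` is the model's ordered result list and `relevant` is the set of truly relevant items (all identifiers here are made up):

```python
# Precision@N / Recall@N: treat the top N ranked results as predicted positives.
def precision_at_n(ranked, relevant, n):
    top = ranked[:n]
    return sum(r in relevant for r in top) / n

def recall_at_n(ranked, relevant, n):
    top = ranked[:n]
    return sum(r in relevant for r in top) / len(relevant)

ranked = [f"v{i}" for i in range(1, 6)]       # top 5 returned results
relevant = {f"v{i}" for i in range(1, 101)}   # 100 truly relevant videos

print(precision_at_n(ranked, relevant, 5))  # 1.0
print(recall_at_n(ranked, relevant, 5))     # 0.05
```

This reproduces the scenario in the problem: a perfect Precision@5 coexisting with a recall of only 5%.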

Precision and recall are two contradictory yet unified metrics. To raise precision, the classifier must predict a sample as positive only when it is "more confident"; but an overly conservative classifier then misses many "unconfident" true positives, lowering recall.

Back to the problem. The problem states that Precision@5 is very good, meaning the quality of the ranking model's top 5 results is very high. In actual use, however, users hunting for less popular videos tend to look further down the result list, even paging through, to find the target video. According to the problem, users often still cannot find what they want, which shows that the model fails to surface the relevant videos. Clearly, the problem lies in recall: if there are 100 relevant results, then even if Precision@5 reaches 100%, Recall@5 is only 5%. When evaluating precision, should we also look at recall? Going further, should we observe different choices of top N? Further still, should we choose higher-order evaluation metrics that more comprehensively reflect the model's performance on both precision and recall?

The answer is clearly yes. To comprehensively assess the quality of a ranking model, we should not only look at the model's Precision@N and Recall@N for different values of N, but ideally also draw the corresponding Precision-Recall (P-R) curve. Here we briefly introduce how a P-R curve is drawn.

The horizontal axis of the P-R curve is recall and the vertical axis is precision. For a ranking model, a point on the P-R curve represents the recall and precision at one particular positive-sample threshold (samples scoring above the threshold are predicted positive; below it, negative). The entire P-R curve is generated by sliding the threshold from highest to lowest. In the figure, the solid line is the P-R curve of model A and the dashed line is that of model B. The point near 0 on the horizontal axis represents each model's precision and recall at its maximum positive-sample threshold. We can see that when recall is close to 0, model A's precision is 0.9 while model B's is 1, which indicates that model B's highest-scoring samples are all true positives, whereas model A makes prediction errors even on its highest-scoring samples. As recall increases, precision decreases overall; when recall reaches 1, model A's precision actually surpasses model B's. This fully illustrates that the precision and recall at a single point cannot comprehensively measure a model's performance; only the overall behavior of the P-R curve allows a more complete assessment.
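The threshold-sliding procedure above can be sketched as follows, over hypothetical model scores and labels (sweeping the threshold from highest to lowest is equivalent to walking down the score-sorted list):

```python
# Trace a P-R curve by sliding the positive-sample threshold high to low.
def pr_curve(scores, labels):
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    n_pos = sum(labels)
    points, tp = [], 0
    for k, (_, label) in enumerate(pairs, start=1):
        tp += label
        precision = tp / k       # true positives among the top-k predictions
        recall = tp / n_pos      # fraction of all positives recovered so far
        points.append((recall, precision))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 1]  # hypothetical ground truth
for r, p in pr_curve(scores, labels):
    print(f"recall={r:.2f}  precision={p:.2f}")
```

Each printed pair is one point on the curve; plotting them (recall on x, precision on y) yields the P-R curve described above.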

(Figure from Fawcett, Tom. "An introduction to ROC analysis." Pattern Recognition Letters 27.8 (2006): 861-874.)

In addition, the F1 score and the ROC curve can also comprehensively reflect the performance of a ranking model. The F1 score is the harmonic mean of precision and recall, defined as:

F1 = 2 × precision × recall / (precision + recall)
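A one-function sketch of the definition, applied to the earlier hypothetical search scenario:

```python
# F1: harmonic mean of precision and recall.
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect precision cannot mask very low recall:
print(f1_score(1.0, 0.05))  # ~0.095
```

Because the harmonic mean is dominated by the smaller of the two values, F1 punishes exactly the lopsided precision/recall situation from the problem.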

We gave a detailed introduction to the ROC curve in a previous article; interested readers can look back through the account's historical posts.

**3. The "unexpected" behavior of root mean square error**

As a streaming media company, Hulu has a large catalog of American TV series, and predicting each series' traffic trend matters greatly for ad placement and user growth. We want to build a regression model to predict a series' traffic trend, but no matter which regression model we use, the RMSE (Root Mean Square Error) metric remains very high. In fact, however, the model's prediction error is below 1% over 95% of the time intervals, which is quite a good result. What is the most likely cause of the high RMSE?

*Difficulty: 1 star*

As we know, RMSE (Root Mean Square Error) is often used to measure the quality of a regression model, but by the problem's description this metric has failed here. Let us first look at how RMSE is computed:

RMSE = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n )

where yᵢ is the ground-truth value of the i-th sample point, ŷᵢ is its predicted value, and n is the number of sample points.

In general, RMSE reflects well how far a regression model's predictions deviate from the true values, but when the data contain very large outliers, even a handful of points can make the RMSE metric look very poor.

Back to the problem: the model's prediction error is below 1% over 95% of the time intervals, so it clearly performs well most of the time, yet RMSE remains poor. The most likely cause is severe outliers in the remaining 5% of intervals. In real traffic-forecasting problems, such noise points arise easily: series with especially low traffic, newly released series, award-winning series, and even traffic spikes triggered by related social-media events can all become sources of outliers.
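A minimal sketch with made-up traffic numbers: 19 of 20 points are predicted within 1%, yet one outlier dominates the RMSE:

```python
# RMSE is dominated by a single large outlier.
import math

def rmse(y_true, y_pred):
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

y_true = [100.0] * 19 + [10000.0]  # one spike, e.g. from a social-media event
y_pred = [100.5] * 19 + [110.0]    # within 1% everywhere except the spike

print(rmse(y_true, y_pred))  # far larger than the typical per-point error
```

Despite 95% of the predictions being nearly perfect, the squared error of the single missed spike drives RMSE into the thousands.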

So what can be done? There are three angles. First, if we regard these outliers as "noise points", we should filter them out during the data preprocessing stage. Second, if we do not consider them noise, we need to further improve the model's predictive ability so that it captures the mechanism generating the outliers; how a traffic-forecasting model does this is a big topic that we will not expand on here. Third, we can find a more appropriate metric to evaluate the model: there are metrics more robust than RMSE, such as MAPE (Mean Absolute Percent Error), defined as:

MAPE = Σᵢ₌₁ⁿ |(yᵢ − ŷᵢ) / yᵢ| × 100 / n

Compared with RMSE, MAPE normalizes the error at each point, eliminating the influence that an individual outlier's absolute error would otherwise exert.
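Continuing the same hypothetical traffic data, a sketch of MAPE shows the outlier's bounded influence:

```python
# MAPE: each point's error is normalized by its own true value.
def mape(y_true, y_pred):
    n = len(y_true)
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) * 100 / n

y_true = [100.0] * 19 + [10000.0]  # same data that produced a huge RMSE
y_pred = [100.5] * 19 + [110.0]

print(mape(y_true, y_pred))  # a modest single-digit percentage
```

The spike that inflated RMSE contributes only its own (large but bounded) percentage error here, so the metric stays in line with the model's generally good performance. Note that MAPE requires all true values to be nonzero.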

**Summary and extension**

In this article, we used three hypothetical Hulu scenarios to explain the importance of choosing evaluation metrics. Every metric has its value, but evaluating a model through a single metric often leads to one-sided or even wrong conclusions. Only by verifying test results through a set of complementary evaluation metrics can we better locate and solve a model's problems, and thus better serve real business scenarios.

**Next Topic Preview**

**"Feature Engineering: Numerical Features"**

**Scenario Description**

Feature engineering means, for a given problem, finding effective features and processing them into an input form suitable for the model. Machine learning has a classic saying, "garbage in, garbage out": if the input data is garbage, the results will be garbage too. Evidently, a model's success depends not only on model selection, but also on whether we have found effective inputs for the specific problem. Commonly used data can be divided into structured and unstructured data. Structured data can be viewed as tables in a relational database: each column has a clear definition and contains two basic types, numerical and categorical. Unstructured data mainly includes text and image data, where all the information is mixed together with no clear category definitions, and each data item varies in size.

**Problem description**

1. Why do numerical features need to be normalized?

2. How should categorical features be handled?

3. How should high-dimensional combined features be handled?

4. How can combined features be found effectively?

Hulu Machine Learning Q&A Series | 21: Evaluating Classification, Ranking, and Regression Models