7 Machine Learning System Design
Contents
7 Machine Learning System Design
7.1 Prioritizing
7.2 Error Analysis
7.3 Error Metrics for Skewed Classes
7.3.1 Precision/Recall
7.3.2 Trading Off Precision and Recall: F1 Score
7.4 Data for Machine Learning
7.1 Prioritizing
When we set out to design a machine learning system for a practical problem, where should we spend our time so that the system makes fewer errors? Taking a junk e-mail classifier (a spam classifier) as an example, we could consider the following options:
- Collect lots of data
- Develop sophisticated features based on mail routing information (from the email header).
- Develop sophisticated features for the message body, e.g. should "discount" and "discounts" be treated as the same word? What about "deal" and "Dealer"? Features about punctuation?
- Develop sophisticated algorithms to detect misspellings (e.g. M0rtgage, Med1cine, w4tches).
It's hard to say which of the above is most effective, and each one often takes a lot of time to investigate.
7.2 Error Analysis
The recommended practices for solving machine learning problems are:
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide whether more data, more features, etc. are likely to help.
- Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you can spot any systematic trend in the types of examples it makes errors on.
It is very important to reduce the error to a single numerical value; otherwise it is difficult to judge the performance of the learning algorithm we are designing.
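As a minimal sketch of this workflow (the labels and predictions here are hypothetical; in practice they come from the quick first implementation evaluated on the cross-validation set):

```python
import numpy as np

# Hypothetical labels and predictions on the cross-validation set
# (in practice these come from your quickly implemented first model).
y_cv = np.array([1, 0, 0, 1, 0, 1, 0, 0])
pred_cv = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# A single-number metric: the misclassification error on the CV set.
cv_error = np.mean(pred_cv != y_cv)
print(f"cross-validation error: {cv_error:.3f}")  # 0.250

# Error analysis: pull out the misclassified examples to inspect by hand.
misclassified = np.where(pred_cv != y_cv)[0]
print("indices to examine manually:", misclassified)  # [2 3]
```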
7.3 Error Metrics for Skewed Classes
Sometimes it is hard to say whether a reduction in error really improves the algorithm. Take cancer classification as an example:
We trained a logistic regression model to predict whether a patient has cancer (y = 1 if cancer, y = 0 otherwise), and measured a 1% error on the test set (i.e. a 99% correct diagnosis rate). However, only 0.5% of patients actually have cancer, so if we ignore the features entirely and simply predict y = 0 for everyone, the error rate is only 0.5%. The model we worked hard to train has a larger error than blindly predicting y = 0. How annoying! But think about it: is always predicting y = 0 really better than the model we trained? Suppose we need to predict for a patient who in fact already has cancer (y = 1); the former can never predict this correctly, while the latter at least has a chance. From this point of view, our trained model seems better. This also shows that algorithm 1 having a smaller error than algorithm 2 does not necessarily mean algorithm 1 is better.
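A minimal numeric sketch of this pitfall, assuming a made-up test set with a 0.5% cancer rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test set: 10,000 patients, 0.5% of whom have cancer (y = 1).
y_test = (rng.random(10_000) < 0.005).astype(int)

# A "classifier" that ignores the features and always predicts y = 0.
pred_always_zero = np.zeros_like(y_test)

# Its error equals the positive rate (about 0.5%), beating the 1% error of
# the trained model, yet it can never identify a single cancer patient.
print("error of always predicting y = 0:", np.mean(pred_always_zero != y_test))
```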
7.3.1 Precision/Recall
The situation above typically arises with skewed classes, that is, when one class far outnumbers the other. In this case, we need a different way to measure the performance of a learning algorithm.
We define precision and recall as follows; they measure the performance of the algorithm from two different angles (taking y = 1 as the rare class we care about):
Precision = #True positives / #Predicted positives = #True positives / (#True positives + #False positives)
Recall = #True positives / #Actual positives = #True positives / (#True positives + #False negatives)
So when we always predict y = 0, Recall = 0/(0 + #False negatives) = 0: even though the error is small, the recall is far too low.
Note that if an algorithm predicts all cases as negative, precision is undefined, because then #Predicted positives = 0, and dividing by 0 is meaningless.
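The definitions above translate directly into code. This is a small sketch with hypothetical labels, treating y = 1 as the positive class:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall, treating y = 1 as the (rare) positive class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    predicted_pos = tp + fp
    # Precision is undefined when nothing is predicted positive.
    precision = tp / predicted_pos if predicted_pos > 0 else float("nan")
    recall = tp / (tp + fn) if tp + fn > 0 else float("nan")
    return precision, recall

# Hypothetical labels and predictions.
y_true = np.array([1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0])
print(precision_recall(y_true, y_pred))  # (0.666..., 0.666...)

# Always predicting y = 0: recall is 0 and precision is undefined.
print(precision_recall(y_true, np.zeros_like(y_true)))  # (nan, 0.0)
```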
7.3.2 Trading Off Precision and Recall: F1 Score
In section 7.2 we mentioned that it is very important to reduce the error to a single value, because that lets us easily compare the pros and cons of different algorithms. Now that we have two metrics, precision and recall, we need to weigh them against each other. Suppose a logistic regression model is used to predict whether a patient has cancer; consider the following scenarios:
Scenario 1: if a healthy person is wrongly diagnosed with cancer, he or she will be subjected to unnecessary psychological and physical stress, so we want to be very confident before predicting that a patient has cancer (y = 1). One way is to raise the threshold, for example to 0.7:
Predict 1 if: hθ(x) ≥ 0.7
Predict 0 if: hθ(x) < 0.7
In this case, by the definitions in section 7.3.1, we get higher precision, but the recall becomes lower.
Scenario 2: if a patient who actually has cancer is wrongly diagnosed as cancer-free, he or she may lose valuable life because treatment comes too late. To avoid missing cancer patients, one way is to lower the threshold, say to 0.3:
Predict 1 if: hθ(x) ≥ 0.3
Predict 0 if: hθ(x) < 0.3
In this case we obtain higher recall, but precision falls.
Scenario 1 and scenario 2 may seem contradictory; in fact, precision and recall generally trade off in the following way: a high threshold corresponds to high precision and low recall, while a low threshold corresponds to low precision and high recall.
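A quick illustration of this relationship, using hypothetical model outputs hθ(x) and labels (a sketch, not the note's original figure):

```python
import numpy as np

# Hypothetical logistic-regression outputs hθ(x) and true labels.
scores = np.array([0.95, 0.85, 0.75, 0.6, 0.45, 0.4, 0.2, 0.1])
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 0])

# Raising the threshold raises precision and lowers recall, and vice versa.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    recall = tp / (tp + fn)
    print(f"threshold {threshold}: precision={precision:.2f}, recall={recall:.2f}")
```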
Thus we have to make a trade-off between precision and recall. Consider comparing several algorithms by their precision and recall values, where algorithm 3 constantly predicts y = 1 (hence its recall = 1).
To combine the two measurements precision (P) and recall (R) into a single value, a simple idea is to take their average: (P + R)/2. But this approach is not ideal, because the average would rate algorithm 3 as optimal, even though a constant prediction of y = 1 gives very low precision.
In fact, a better way is to define the F1 score based on the harmonic mean of precision and recall:
1/F1 = (1/P + 1/R)/2
which gives F1 = 2PR/(P + R).
To make F1 large, both P and R need to be large at the same time. In particular:
- P = 0, R = 1 gives F1 = 0
- P = 1, R = 0 gives F1 = 0
- P = 1, R = 1 gives F1 = 1
Note: we should evaluate the F1 score on the cross-validation set, and avoid tuning it against the test set.
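Putting the pieces together, here is a sketch of choosing the threshold by the F1 score on the cross-validation set (the scores and labels are hypothetical):

```python
import numpy as np

def f1_at(scores, y_true, threshold):
    """F1 = 2PR/(P + R) for predictions thresholded at the given value."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0  # covers P = 0 or R = 0 (and undefined precision)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical cross-validation scores hθ(x) and labels.
scores_cv = np.array([0.95, 0.85, 0.75, 0.6, 0.45, 0.4, 0.2, 0.1])
y_cv = np.array([1, 1, 1, 0, 1, 0, 1, 0])

# Pick the threshold with the highest F1 on the cross-validation set,
# leaving the test set untouched for the final evaluation.
thresholds = np.arange(0.1, 1.0, 0.1)
best = max(thresholds, key=lambda t: f1_at(scores_cv, y_cv, t))
print("best threshold by cross-validation F1:", round(float(best), 1))
```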
7.4 Data for Machine Learning
How much data is enough for us to train the learning algorithm?
Usually, an "inferior" algorithm, if given enough data to learn from, often beats a "superior" algorithm that lacks data. Hence a consensus in the machine learning community:
"It's not who has the best algorithm that wins. It's who has the most data."
(I think this is the charm of big data.)
Note that in order to make full use of the data, we should choose features that contain enough information. A common rule of thumb: given the input x, can a human expert confidently predict y?
The rationale for using a large amount of data is this: use a learning algorithm with many parameters (e.g. logistic regression or linear regression with many features, or a neural network with many hidden units) to ensure the bias is small, and then use a very large training set to greatly reduce overfitting (low variance), so as to obtain a model with low error on the test set and strong generalization ability.
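A minimal sketch of this principle (all data here is synthetic and hypothetical): fit a low-bias, high-parameter model, here a degree-9 polynomial, on growing training sets and watch the train/test gap (the variance) shrink:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic 1-D regression data: a smooth target plus noise."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=n)
    return x, y

def train_test_mse(n_train, degree=9):
    """Fit a high-parameter (degree-9) polynomial; return train/test MSE."""
    x_tr, y_tr = make_data(n_train)
    x_te, y_te = make_data(2_000)
    coeffs = np.polyfit(x_tr, y_tr, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    return mse(x_tr, y_tr), mse(x_te, y_te)

# With few examples the low-bias model overfits (large train/test gap);
# as the training set grows, the gap (the variance) shrinks.
for n in (30, 300, 30_000):
    train_mse, test_mse = train_test_mse(n)
    print(f"n={n:>6}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```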