The process of building a machine learning algorithm:
- Quickly build a simple algorithm and test its performance on a cross-validation set.
- Plot the learning curve to check whether the algorithm has a high-bias or a high-variance problem, then choose the corresponding remedy.
- Do error analysis: look at the examples the algorithm gets wrong and check whether they show a systematic trend.
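The learning-curve step can be sketched as follows. This is a minimal illustration, not a prescribed method: the data set is synthetic, the model is a one-variable least-squares line, and the names fit_line, mse and learning_curve are made up for this example. The target is quadratic, so the line underfits and both errors level off at a similar, high value, which is the typical high-bias picture.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:  # identical x values: fall back to a constant model
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x

def mse(a, b, xs, ys):
    """Mean squared error of the line y = a*x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train, cv, sizes):
    """Return (m, train_error, cv_error) for each training-set size m."""
    cv_x, cv_y = zip(*cv)
    rows = []
    for m in sizes:
        xs, ys = zip(*train[:m])  # train on the first m examples only
        a, b = fit_line(xs, ys)
        rows.append((m, mse(a, b, xs, ys), mse(a, b, cv_x, cv_y)))
    return rows

rng = random.Random(0)
# Hypothetical data: quadratic target plus noise, so a straight line underfits.
data = [(x, x * x + rng.gauss(0, 0.1))
        for x in (rng.uniform(-2, 2) for _ in range(200))]
train, cv = data[:150], data[150:]

curve = learning_curve(train, cv, sizes=[2, 5, 10, 50, 150])
for m, train_err, cv_err in curve:
    print(f"m={m:3d}  train={train_err:.3f}  cv={cv_err:.3f}")
```

If the training error stayed low while the cross-validation error stayed much higher, the gap would point to high variance instead.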
Evaluate algorithm performance
Skewed classes: most instances in the training set belong to one class, and the other class has very few instances or none.
With skewed classes, the error rate (or accuracy) alone is not enough to judge the algorithm, because always predicting the majority class already gives a low error. Other evaluation measures should be used.
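A small made-up example shows why the error rate misleads here. The class counts (995 negatives, 5 positives) are assumptions chosen only for illustration.

```python
# Hypothetical skewed data set: 995 negatives (0) and 5 positives (1).
labels = [0] * 995 + [1] * 5

# A "classifier" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
positives_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(f"accuracy = {accuracy:.3f}")
print(f"positives found = {positives_found} of {sum(labels)}")
```

The accuracy is 99.5% even though not a single positive instance is detected.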
TP (true positive): predicted positive, actually positive
FP (false positive): predicted positive, actually negative
FN (false negative): predicted negative, actually positive
TN (true negative): predicted negative, actually negative
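Precision: TP/(TP + FP), the fraction of predicted positives that are actually positive; the higher the better
Recall: TP/(TP + FN), the fraction of actual positives that are predicted positive; the higher the better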
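The four counts and the two ratios above can be computed as in this sketch. The function names and the toy labels are illustrative, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions fix.

```python
def confusion_counts(predictions, labels):
    """Count TP, FP, FN, TN for binary predictions and labels (1 = positive)."""
    tp = fp = fn = tn = 0
    for p, y in zip(predictions, labels):
        if p == 1 and y == 1:
            tp += 1
        elif p == 1 and y == 0:
            fp += 1
        elif p == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN); 0.0 if undefined (assumed)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data, made up for the example.
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(predictions, labels)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```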
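PR curve: plot precision against recall as the classification threshold is adjusted. Lowering the threshold gives higher recall and usually lower precision; the curve of a better classifier bows toward the upper right.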
How to select the threshold automatically: compute the F1 score, F1 = 2PR/(P + R), for each candidate threshold and choose the threshold with the highest F1.
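Threshold selection by F1 might look like the sketch below. The scores, labels and candidate thresholds are invented for the example, and treating F1 as 0.0 when there are no true positives is an assumed convention.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 = 2PR/(P+R) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:  # precision or recall is zero or undefined (assumed: F1 = 0)
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

# Made-up classifier scores with their true labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

best = best_threshold(scores, labels, candidates)
for t in candidates:
    print(f"threshold={t:.1f}  F1={f1_at_threshold(scores, labels, t):.3f}")
print("best threshold:", best)
```

In practice the scores would come from the cross-validation set, so that the threshold is not tuned on the training data.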
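TPR (true positive rate, sensitivity): TP/(TP + FN)
FPR (false positive rate, 1 - specificity): FP/(TN + FP)
ROC curve: plot TPR against FPR as the threshold is adjusted; the curve of a better classifier bows toward the upper left.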
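The points of an ROC curve can be computed as in this sketch, one (FPR, TPR) pair per threshold. The function name, scores, labels and thresholds are assumptions for the example; it also assumes the data contains at least one positive and one negative instance.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) point per threshold, predicting 1 if score >= t."""
    pos = sum(labels)            # actual positives = TP + FN
    neg = len(labels) - pos      # actual negatives = TN + FP
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Made-up classifier scores with their true labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

# Thresholds from strict to lenient, so the curve runs from (0, 0) to (1, 1).
points = roc_points(scores, labels, thresholds=[1.0, 0.9, 0.7, 0.5, 0.3, 0.0])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```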
Data issues
Getting a lot of data is in many cases a good way to obtain a high-performance learning algorithm, but do not collect large amounts of data blindly: more data helps mainly when the algorithm suffers from high variance.
A better way: train a learning algorithm with many parameters (low bias) on a lot of data (low variance, less overfitting). This combination often yields a high-performance algorithm.
Machine learning System Design----Learning system