Common Pitfalls in Machine Learning
Over the past few years I have worked on numerous different machine learning problems. Along the way I have fallen foul of many sometimes subtle and sometimes not-so-subtle pitfalls when building models. Falling into these pitfalls often means that when you think you have a great model, in real life it actually performs terribly. If business decisions are being made based on your models, you want them to be right!
I hope to convey these pitfalls to you and offer advice on avoiding and detecting them. This is by no means an exhaustive list and I would welcome comments on other common pitfalls I have not covered.
In particular, I have not tried to cover the pitfalls you might come across when trying to build production machine learning systems. This article is focused more on the prototyping/model building stage.
Finally, while some of this is based on my own experience, I have also drawn upon John Langford's Clever Methods of Overfitting, Ben Hamner's talk on Machine Learning Gremlins and Kaufman et al.'s Leakage in Data Mining. All are worth looking at.
Traditional Overfitting
Traditional overfitting is where you fit an overly complicated model to the training dataset, for example by allowing it to have too many free parameters compared to the number of training points.
To detect this you should always use a test set (preferably with cross-validation). Plot your train and test performance and compare them. You should expect to see a graph like Figure 1. As your model complexity increases, your train error goes to zero. However, your test error follows an elbow shape: it improves up to a point, then gets worse again.
Figure 1. Train/test performance when varying number of parameters in the model
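To make this concrete, here is a minimal sketch in R of the kind of check behind Figure 1, using simulated data and polynomial regression purely for illustration:

# Fit polynomial models of increasing degree and compare train vs test error
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
test_idx <- sample(n, n / 4)
train_df <- data.frame(x = x[-test_idx], y = y[-test_idx])
test_df  <- data.frame(x = x[test_idx],  y = y[test_idx])
errs <- sapply(1:15, function(d) {
  fit <- lm(y ~ poly(x, d), data = train_df)
  c(train = mean((train_df$y - predict(fit))^2),
    test  = mean((test_df$y  - predict(fit, newdata = test_df))^2))
})
# Train error keeps falling with degree; test error follows the elbow shape
matplot(t(errs), type = "l", xlab = "Polynomial degree", ylab = "Mean squared error")
legend("topright", legend = c("train", "test"), col = 1:2, lty = 1:2)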
Potential ways to avoid traditional overfitting:
- Obtain more training data
- Use a simpler predictor function
- Add some form of regularisation (early stopping, pruning, dropout, etc.) to the model
- Integrate over many predictors
More training data
In general, the more training data you have the better, but it can be expensive to obtain.
Before going down this route it's worth seeing if more training data would actually help. The usual thing to do is to plot a learning curve, showing how the training sample size affects your error:
Figure 2. Example Learning curves
In the left-hand graph, the gradient of the line at the maximum training size is still very steep. Clearly here more training data would help.
In the right-hand graph, we have started to reach a plateau, so more training data is not going to help much.
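As a rough sketch of how you might produce such a learning curve in R, assuming hypothetical data frames train_df and test_df with a numeric response y and an illustrative linear model:

# Train on increasing fractions of the data and track the resulting test error
fractions <- seq(0.1, 1, by = 0.1)
test_err <- sapply(fractions, function(f) {
  idx <- sample(nrow(train_df), size = floor(f * nrow(train_df)))
  fit <- lm(y ~ ., data = train_df[idx, ])
  mean((test_df$y - predict(fit, newdata = test_df))^2)
})
plot(fractions * nrow(train_df), test_err, type = "b",
     xlab = "Training set size", ylab = "Test MSE")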
Simpler Predictor Function
Ways you could use a simpler predictor:
- Use a more restricted model, e.g. logistic regression instead of a neural network.
- Use your test/train graph (Figure 1) to find an appropriate level of model complexity.
Regularisation
Many techniques have been developed to penalise models that are overly complicated (lasso, ridge, dropout, etc.). Usually this involves setting some form of hyper-parameter. One danger here is tuning the hyper-parameter to fit the test data, which we'll discuss in Parameter Tweak Overfitting.
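As a minimal sketch, a penalised model might be fit with the glmnet package, assuming a numeric feature matrix x and response vector y (both names are placeholders):

library(glmnet)
# alpha = 1 gives the lasso, alpha = 0 gives ridge regression
cv_fit <- cv.glmnet(x, y, alpha = 1)
cv_fit$lambda.min               # penalty strength chosen by cross-validation
coef(cv_fit, s = "lambda.min")  # the lasso shrinks many coefficients to exactly zero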
Integrate over many predictors
In Bayesian inference, to predict a new data point x we use the posterior predictive distribution:
p(x | D, α) = ∫ p(x | θ) p(θ | D, α) dθ
where α is the hyper-parameters, D is our training data and θ is the model parameters. Essentially we integrate out the parameters, weighting each one by how likely it is given the data.
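A full posterior integral is rarely computed exactly. As a rough, non-Bayesian sketch of the same idea, you can average the predictions of many models fit to bootstrap resamples of the data (bagging); train_df and new_df below are hypothetical data frames and the linear model is just a placeholder:

# Average predictions over many models fit to bootstrap resamples
set.seed(42)
n_models <- 100
preds <- replicate(n_models, {
  idx <- sample(nrow(train_df), replace = TRUE)
  fit <- lm(y ~ ., data = train_df[idx, ])
  predict(fit, newdata = new_df)
})
avg_pred <- rowMeans(preds)  # each column is one model; average across them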
Parameter Tweak Overfitting
This is probably the type of overfitting I see most commonly. Say you use cross-validation to produce the plot in Figure 1. Based on the plot you decide that 5 parameters is optimal and you state your generalisation error to be 40%.
However, you have essentially tuned your parameters to the test data set. Even if you use cross-validation, there is some level of tuning happening. This means your true generalisation error is not 40%.
Chapter 7 of The Elements of Statistical Learning discusses this in more detail. To get a reliable estimate of generalisation error, you need to put the parameter selection, model building, etc. inside an inner loop, then run an outer loop of cross-validation to estimate the generalisation error:
# Outer cross-validation loop: used only to estimate generalisation error
for (i in 1:k) {
  train <- X[folds != i, ]
  test  <- X[folds == i, ]

  # Inner cross-validation loop: used only to choose parameters
  inner_folds <- sample(rep(1:k, length.out = nrow(train)))
  for (j in 1:k) {
    inner_train <- train[inner_folds != j, ]
    inner_test  <- train[inner_folds == j, ]

    # Train a model for each candidate parameter setting
    # Predict on inner_test and record the score for each setting
  }

  # Choose the best parameters from the inner loop results
  # Retrain on all of train using those parameters
  # Test performance on test to get one estimate of generalisation error
}
For instance, in your inner loop you may fit all candidate models, each using 5-fold CV, and from this inner loop you pick the best model. In the outer loop, you run 5-fold CV, using the optimal model from the inner loop and the held-out test data to estimate your generalisation error.
Choice of Measure
You should use whatever measure is canonical to your problem or makes the most business sense (accuracy, AUC, Gini, F1, etc.). For instance, in credit default prediction, Gini is the widely accepted measure.
It is good practice to measure multiple performance statistics when validating your model, but you should focus on one for measuring improvement.
One pitfall I see all the time is using accuracy for very imbalanced problems. Telling me you achieved an accuracy of 95% when the prior is 95% means you have achieved random performance. You must use a measure that suits your problem.
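A tiny simulated illustration of why accuracy misleads here: a classifier that always predicts the majority class matches the 95% prior but has learned nothing.

set.seed(1)
labels <- rbinom(10000, 1, 0.05)   # ~5% positives
preds  <- rep(0, length(labels))   # always predict the majority class
mean(preds == labels)              # accuracy is roughly 0.95, yet the model has no skill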
Resampling Bias and Variance
When measuring your test performance you'll likely use some form of resampling to create multiple test sets:
- K-fold Cross Validation
- Repeated k-fold cross validation
- Leave One Out Cross Validation (LOOCV)
- Leave Group out Cross Validation (LGOCV)
- Bootstrap
Ideally we want to use a method that achieves low bias and low variance. By that I mean:
- Low bias: the generalisation error produced is close to the true generalisation error.
- Low variance (high precision): the spread of the re-sampled test results is small.
In general, the more folds you use, the lower the bias but the higher the variance.
You'll need to pick a resampling method to suit your problem. With large training sets, k-fold cross-validation with a k of 5 may suit. With small training sets, you'll need a larger k.
Max Kuhn has done a fantastic empirical analysis of the different methods in relation to bias and variance (Part I, Part II). In general, repeated 10-fold cross-validation seems to be quite stable across most problems in terms of bias and variance.
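As a minimal sketch, repeated 10-fold cross-validation can be run with the caret package; train_df and the choice of method = "glm" below are placeholders:

library(caret)
# 10-fold cross-validation, repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit  <- train(y ~ ., data = train_df, method = "glm", trControl = ctrl)
fit$resample  # one performance estimate per fold per repeat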
Bad Statistics
Once you have computed the results of your cross-validation, you'll have a series of measurements of your test performance. The usual thing to do here is to take the mean and report this as your performance.
However, just reporting the mean may hide the fact that there is significant variation between the measurements.
It is useful to plot the distribution of the performance estimates; this will give you an idea of how much variation there is. If you need to boil it down to a single number, computing the standard deviation (or variance) can be useful. However, I often see performance quoted like 80% +/- 2%, where 80% is the mean and 2% is the standard deviation. Personally I dislike this style of reporting as it suggests the true performance lies between 78% and 82%. I would quote them separately: mean and standard deviation.
If you want to write 80% +/- 2%, you would need to compute proper bounds on the estimate.
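A small sketch, assuming cv_scores is a numeric vector holding your resampled performance estimates:

mean(cv_scores)   # the headline number
sd(cv_scores)     # how much the estimates vary across resamples
hist(cv_scores, main = "Distribution of CV performance estimates")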
Information leakage
Information leakage occurs when data in your training labels leaks into your features. It can be even more subtle, where irrelevant features appear highly predictive just because of some sort of bias in the data you collected for training.
As an example, imagine you run an eCommerce website and want to predict converters from the people who visit your site. You build features based on the raw URLs the users visit, but take special care to remove the URLs of the conversion page (e.g. complete purchase). You split your users into converters (those reaching the conversion page) and non-converters. However, there will be URLs immediately before the conversion page (checkout page, etc.) that are present for almost all the converters and almost none of the non-converters. Your model will end up putting an extremely high weight on these features, and running your cross-validation will give your model a very high accuracy. What needs to be done here is to remove any URLs that always occur immediately before the conversion page.
Feature Selection Leakage
Another example I see regularly is applying a feature selection method that looks at the data label (mutual information, for instance) on all of the dataset. Once you have selected your features, you build the model and use cross-validation to measure your performance. However, your feature selection has already looked at all the data and picked the best features, so your choice of features leaks information about the data label. Instead, you should have performed the inner-loop cross-validation discussed previously.
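A rough sketch of doing the selection inside each fold; X, y, folds and k are assumed to exist, and the top-20-by-correlation rule is just an illustrative selection method:

for (i in 1:k) {
  train_idx <- folds != i
  # Score features using the training fold only, never the held-out fold
  scores <- apply(X[train_idx, ], 2, function(col) abs(cor(col, y[train_idx])))
  keep   <- order(scores, decreasing = TRUE)[1:20]
  # Fit the model on X[train_idx, keep] and evaluate it on X[!train_idx, keep]
}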
Detection
Often it can be difficult to spot these sorts of information leakage without domain knowledge of the problem (e.g. eCommerce, medicine, etc.).
The best advice I can offer is to look at the top features selected by your model (say the top 20 or 50). Do they make some sort of intuitive sense? If not, you potentially need to look into them further to identify whether some information leakage is occurring.
Label randomisation
A nice method to help catch feature selection leakage is to completely randomly shuffle your training labels right at the start of your data processing pipeline.
Once you get to cross-validation, if your model still appears to have some sort of signal (an AUC > 0.5, for instance), you have probably leaked information somewhere along the line.
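A minimal sketch of this sanity check, assuming a label vector y:

set.seed(123)
y_shuffled <- sample(y)  # permute the labels before any feature selection or modelling
# Re-run the entire pipeline (feature selection, model fitting, cross-validation)
# on y_shuffled. The AUC should be close to 0.5; anything clearly above that
# suggests information is leaking from the labels into the features.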
Human-loop overfitting
This is a bit of a subtle one. Essentially, picking what parameters to use and what features to include should all be done by your program. You should be able to run it end-to-end to perform all the modelling and performance estimation.
It is OK to be a bit more hands-on initially when exploring different ideas. However, your final "production" model should remove the human element as much as possible. You shouldn't be hard-coding parameter settings.
I have also seen this occur when particular examples are hard to predict and the modeller decides to just exclude them. Obviously this should not be done.
Non-stationary Distributions
Does your training data contain all the possible cases, or could it be biased? Could the potential labels change over time?
Say, for example, you build a handwritten digit classifier trained on the MNIST database; it can classify the numbers 0-9. Then someone gives it handwritten digits in Thai:
How would your classifier behave? You would possibly need to obtain handwritten digits for other languages, or add some other category that could incorporate non-Western Arabic numerals.
Sampling
Potentially, your training data may have gone through some sort of sampling procedure before it was provided to you. One significant danger here is that this was sampling with replacement and you end up with repeated data points in both the train and test set. This will cause your performance to be over-estimated.
If you have a unique ID for each row, check these are not repeated. In general, check how many rows are repeated; you might expect some, but is it more frequent than expected?
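A couple of quick checks, assuming a data frame df with an identifier column id (both names are placeholders):

sum(duplicated(df$id))  # number of repeated IDs
sum(duplicated(df))     # number of fully repeated rows
mean(duplicated(df))    # proportion of repeated rows -- is it higher than expected?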
Summary
My one key bit of advice when building machine learning models is:
If it seems too good to be true, it probably is.
I can't stress this enough: to be really good at building machine learning models you should be naturally sceptical. If your AUC suddenly jumps by several points or your accuracy becomes 100%, you should stop and really look at what you have done. Have you fallen into one of the pitfalls described?
When building your models, it's always a good idea to try the following:
- Plot learning curves to see if you need more data.
- Use Test/train error graphs to see if your model is too complicated.
- Ensure you are running inner-loop cross-validation if you are tweaking parameters.
- Use repeated k-fold cross-validation if possible.
- Check your top features-do they make sense?
- Perform the random label test.
- Check for repeated unique IDs or repeated rows.
"Reprint" COMMON Pitfalls in machine learning