Model error: overfitting, cross-validation, and the bias-variance tradeoff
Authors: Natasha Latysheva, Charles Ravarani
Posted in cambridgecoding
Introduction
In this article you will master a core concept of machine learning: the bias-variance tradeoff. The main idea is that you want to build models that are as accurate as possible while still generalizing to new data (this is generalization). The danger is that it is easy to build models that overfit the local noise in your data; such models are useless and generalize poorly, because noise is random and therefore different in every dataset. Essentially, you want to build a model that captures only the useful components of the dataset. On the other hand, a model that generalizes well but is too rigid to produce good predictions is the other extreme (this is called underfitting).
We will use the k-nearest neighbors (KNN) algorithm to discuss and demonstrate these concepts. KNN has a single, simple parameter, k, and by trying different values of k you can clearly illustrate underfitting, overfitting, and generalization. The related concept of balancing underfitting against overfitting is called the bias-variance tradeoff. Here is a table summarizing these different but related ways of describing models that overfit or underfit.
We will explain what these terms mean and how they relate to one another. We will also discuss cross-validation, which is an excellent way to evaluate a model's accuracy and generalization ability.
You will encounter these concepts in all future posts, which cover model optimization, random forests, naive Bayes, logistic regression, and how to combine different models into an ensemble model.
Generate Data
Let's start by creating an artificial dataset. You can do this easily with the dataset-generation functions in sklearn.datasets. Specifically, you will generate a relatively simple binary classification problem. To make it a little more interesting, let's make the data crescent-shaped and add some random noise. This should make it more realistic and make classifying the observations harder.
"'
Creating the datasete.g. Make_moons generates crescent-shaped Datacheck out make_classification, which generates linearly-separable data
From sklearn.datasets import make_moons
X, y = make_moons (
n_samples=500, # The number of observations
Random_state=1,
noise=0.3
)
Take a Peek
Print (X[:10,])
Print (Y[:10])
"'
```
[[ 0.50316464  0.11135559]
 [ 1.06597837 -0.63035547]
 [ 0.95663377  0.58199637]
 [ 0.33961202  0.40713937]
 [ 2.17952333 -0.08488181]
 [ 2.00520942  0.7817976 ]
 [ 0.12531776 -0.14925731]
 [ 1.06990641  0.36447753]
 [-0.76391099 -0.6136396 ]
 [ 0.55678871  0.8810501 ]]
[1 1 0 0 1 1 1 0 0 0]
```
The dataset you just generated looks like this:
"'
Import Matplotlib.pyplot as Plt
From matplotlib.colors import Listedcolorma
%matplotlib Inline # for the plots-appear inline in Jupyter notebooks
Plot the first feature against the other, color by class
Plt.scatter (X[y = = 1, 0], X[y = = 1, 1], color= "#EE3D34", marker= "X")
Plt.scatter (X[y = = 0, 0], X[y = = 0, 1], color= "#4458A7", marker= "O")
"'
Next, let's split the data into a training set and a test set. The training set is used to develop and tune the model. The test set is kept completely separate until the very end, when the finished model is run on it. Having a test set allows you to get a good estimate of how the model performs on data it has never seen before.
"'
From sklearn.cross_validation import Train_test_split
Split into training and test sets
Xtrain, XTest, ytrain, ytest = Train_test_split (X, y, random_state=1, test_size=0.5)
"'
We will use the k-nearest neighbors (KNN) classifier to predict the classes in this dataset. Chapter 2 of An Introduction to Statistical Learning gives a very good introduction to the theory behind KNN (I am a big fan of the ISLR book). You can also look at the previous article on how to implement the algorithm from scratch in Python.
Introducing the hyperparameter k in KNN
The KNN algorithm works by assigning a class label to a new data point based on its k nearest neighbors. It looks only at the k points most similar to the new data point and assigns the new point to the most common class among those neighbors. When using KNN, you have to set the value of k that you want the algorithm to use.
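To make the distance-and-vote mechanism concrete, here is a minimal from-scratch sketch of the prediction step for a single point. The helper name knn_predict_one is made up for illustration and is not part of scikit-learn; in the rest of this article we use scikit-learn's KNeighborsClassifier instead.

```python
import numpy as np
from collections import Counter

def knn_predict_one(XTrain, yTrain, x_new, k):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((XTrain - x_new) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # the predicted class is the most common class among those k neighbors
    return Counter(yTrain[nearest]).most_common(1)[0][0]

# e.g. classify the first test point using its 5 nearest neighbors
# knn_predict_one(XTrain, yTrain, XTest[0], k=5)
```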
If k is high (say k=99), the model takes a large number of neighbors into account when deciding about an unfamiliar data point. This means the model is quite constrained, because it classifies each instance based on a lot of aggregate information. In other words, a large value of k produces fairly "rigid" model behavior.
Conversely, if k is very low (say k=1 or k=2), only a small number of neighbors are considered for each classification decision. This is a very flexible and very complex model that fits the exact form of the data very closely. The model's predictions therefore depend much more on local trends in the data (crucially, including the noise).
Let's take a look at how the KNN algorithm classifies the data with k=99 versus k=1. The green line is the decision boundary learned from the training data (the threshold at which the algorithm decides whether a data point belongs to the blue or the red class).
At the end of this article you will see how to generate these images, but first let's dig into the theory.
When k=99 (left), the model fit looks a bit too smooth; it could afford to hug the data a little more closely. The model has low flexibility and low complexity. It carves out a general decision boundary. It has relatively high bias: it does not model the data well, because its picture of the underlying data-generating process is too simple and deviates from the truth. However, if you threw a slightly different dataset at it, the decision boundary would probably look quite similar. It is a stable model that would not vary much: it has low variance.
When k=1 (right), you can see the model overfitting the noise. It technically makes perfect predictions on the training set (the error shown in the lower-right corner equals 0.0), but hopefully you can see how this fit is overly sensitive to individual data points. Keep in mind that you added noise to the dataset: it looks like the model fits that noise far too tightly. You can say that the k=1 model has high flexibility and high complexity, because it tunes itself very closely to the data. It also has low bias: if nothing else, the decision boundary certainly follows the trends in the observed data. But on slightly altered data, the fitted boundary would change dramatically, and that change would be significant. The k=1 model has high variance.
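If you want a quick numerical feel for this variance difference, one rough side experiment (not part of the original analysis; the boundary_disagreement helper is made up for illustration) is to refit each model on two different random halves of the training data, simulating two "slightly different datasets", and count how often the two fits disagree on a grid of points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def boundary_disagreement(k, X, y, n_points=2000, seed=0):
    rng = np.random.RandomState(seed)
    # two different random halves of the data stand in for "slightly different datasets"
    idx = rng.permutation(len(X))
    half1, half2 = idx[:len(X) // 2], idx[len(X) // 2:]
    # a random grid of points covering the data range
    grid = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_points, 2))
    preds = []
    for half in (half1, half2):
        model = KNeighborsClassifier(n_neighbors=k).fit(X[half], y[half])
        preds.append(model.predict(grid))
    # fraction of grid points where the two fits disagree
    return (preds[0] != preds[1]).mean()

print("Disagreement between fits, k=1: ", boundary_disagreement(1, XTrain, yTrain))
print("Disagreement between fits, k=99:", boundary_disagreement(99, XTrain, yTrain))
```

With k=1 the two half-datasets typically produce noticeably different boundaries, while with k=99 they barely differ, which is exactly the high- versus low-variance behavior described above.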
But how well do these models generalize, that is, how do they behave on new data?
So far you have only looked at the training data, and quantifying the training error is not very useful: summarizing how well a model performs on the very data it just learned from is not that interesting. Let's look instead at how it performs on the test set, since that gives you a much better impression of the model's quality. Try using different values of k:
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn99 = KNeighborsClassifier(n_neighbors=99)
knn99.fit(XTrain, yTrain)
yPredK99 = knn99.predict(XTest)
print("Overall Error of k=99 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK99), 2))

knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(XTrain, yTrain)
yPredK1 = knn1.predict(XTest)
print("Overall Error of k=1 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK1), 2))
```
```
Overall Error of k=99 Model: 0.15
Overall Error of k=1 Model: 0.15
```
In fact, it looks like these models perform roughly equally well on the test set. Below are the decision boundaries learned from the training set, applied to the test set. See if you can figure out where the two models' errors are coming from.
The two models make errors for different reasons. It seems that the k=99 model is not very good at capturing the crescent shape of the data (this is underfitting), whereas the k=1 model severely overfits the noise. Remember that overfitting is characterized by good training performance and poor test performance, which is exactly what you can observe here.
Maybe a value of k somewhere between 1 and 99 is what you want?
```python
knn50 = KNeighborsClassifier(n_neighbors=50)
knn50.fit(XTrain, yTrain)
yPredK50 = knn50.predict(XTest)
print("Overall Error of k=50 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK50), 2))
```
```
Overall Error of k=50 Model: 0.11
```
That looks a bit better. Let's examine the decision boundary of the model with k=50.
Nice! The model captures the actual trends in the dataset much more faithfully, and this improvement is reflected in the lower test error.
The bias-variance tradeoff: concluding observations
Hopefully you now have a good intuition for underfitting and overfitting. See if you understand all the terms from the beginning of this article. Essentially, finding the right balance between overfitting and underfitting amounts to navigating the bias-variance tradeoff.
In general, when you train a machine learning algorithm on a dataset, what matters is how the model performs on an independent dataset. Classifying the training set well is not enough. Essentially, the point is to build models that generalize: getting 100% accuracy on the training set is not impressive, it is simply an indicator of overfitting. Overfitting is what happens when you fit a model too tightly and tune it to the noise rather than the signal.
To be clear, you are not trying to model the trends of one particular dataset. Instead, you are trying to model the real-world process that generated the data. The exact dataset you happen to be working with is just a small sample of the underlying truth, and it comes with noise and idiosyncrasies of its own.
The following summary image shows how underfitted (high bias, low variance), well-fitted, and overfitted (low bias, high variance) models behave on the training set and the test set:
The goal of building generalizable models is the motivation behind splitting your dataset into a training set and a test set (the test set provides an accurate measure of model performance at the very end of your analysis).
However, it is also possible to overfit the test data. If you try many different models on the test set and keep tweaking them to improve test accuracy, information from the test set can inadvertently leak into the model-building phase. You need a way around this.
Evaluating model performance using K-fold cross-validation
Enter k-fold cross-validation, a handy technique for measuring model performance using only the training set. The procedure is as follows: you randomly split the training set into k equal-sized parts; you then train the model on k-1 of those parts and evaluate it on the remaining part. This gives you one measure of model performance (e.g. overall accuracy). Next you train the algorithm on a different combination of k-1 parts and evaluate it on the part that was held out this time. You repeat this process k times, obtain k different measures of model performance, and average them to get an overall performance estimate. Continuing with our example, 10-fold cross-validation looks like this:
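If you prefer code to a diagram, here is a minimal sketch of that loop using the KFold splitter from the same (old-style) sklearn.cross_validation module used elsewhere in this article; in current scikit-learn versions the splitter lives in sklearn.model_selection and takes slightly different arguments. The k=50 classifier is just an arbitrary example model:

```python
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

n_folds = 10
fold_accuracies = []

# split the indices of the training set into 10 folds
for train_idx, val_idx in KFold(len(XTrain), n_folds=n_folds, shuffle=True, random_state=1):
    model = KNeighborsClassifier(n_neighbors=50)
    model.fit(XTrain[train_idx], yTrain[train_idx])   # train on 9/10 of the training data
    preds = model.predict(XTrain[val_idx])            # evaluate on the held-out 1/10
    fold_accuracies.append(metrics.accuracy_score(yTrain[val_idx], preds))

print("Mean cross-validated accuracy:", np.mean(fold_accuracies))
```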
You can use k-fold cross-validation to obtain an estimate of model accuracy, and you can use these estimates to tune your model until you are satisfied. This lets you avoid the pitfall of overfitting to the final test data. In other words, cross-validation provides a way to simulate having more data than you actually do, so that you do not have to touch the test set until model building is finished. k-fold cross-validation and its variants are extremely popular and very useful, especially if you are experimenting with many different models (for example, if you want to compare the performance of models with different parameter values).
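As a side note, this tune-by-cross-validation loop is common enough that scikit-learn packages it up. The sketch below uses GridSearchCV (from sklearn.grid_search in the old API used in this article, sklearn.model_selection in newer versions) to pick k using only the training set; it is an optional shortcut, not something the analysis below relies on:

```python
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer scikit-learn
from sklearn.neighbors import KNeighborsClassifier

# candidate values of k, evaluated purely by 10-fold cross-validation on the training set
param_grid = {"n_neighbors": list(range(1, 141, 2))}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
search.fit(XTrain, yTrain)

print("Best k according to cross-validation:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy at that k:", search.best_score_)
```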
Comparing training error, cross-validation error, and test error
So, which k is best? Build models with different values of k on the training data and see how each resulting model performs when predicting the classes of both the training set itself and the test set. Finally, see which k looks best according to k-fold cross-validation.
Note: in practice, testing models on the training set while scanning over a parameter like this is a bad idea. Likewise, you should not probe the test set over and over again, once per parameter value. The calculations below are for illustration only; in practice, only k-fold cross-validation is a safe approach!
```python
import numpy as np
from sklearn.cross_validation import train_test_split, cross_val_score

knn = KNeighborsClassifier()

# The range of number of neighbors you want to test
n_neighbors = np.arange(1, 141, 2)

# Here you store the errors for each dataset used
train_scores = list()
test_scores = list()
cv_scores = list()

# Loop through possible n_neighbors and try them out
for n in n_neighbors:
    knn.n_neighbors = n
    knn.fit(XTrain, yTrain)
    train_scores.append(1 - metrics.accuracy_score(yTrain, knn.predict(XTrain)))  # this will over-estimate the accuracy
    test_scores.append(1 - metrics.accuracy_score(yTest, knn.predict(XTest)))
    cv_scores.append(1 - cross_val_score(knn, XTrain, yTrain, cv=10).mean())  # take the mean of the CV scores
```
So what is the optimal k? Whenever several values of k tie for the lowest error, you simply take the smallest of them as the optimal k.
```python
# What do these different datasets think is the best value of k?
print('The best values of k are:\n'
      '{} according to the Training Set\n'
      '{} according to the Test Set and\n'
      '{} according to Cross-validation'.format(
          min(n_neighbors[np.array(train_scores) == min(train_scores)]),
          min(n_neighbors[np.array(test_scores) == min(test_scores)]),
          min(n_neighbors[np.array(cv_scores) == min(cv_scores)])
      ))
```
```
The best values of k are:
1 according to the Training Set
  according to the Test Set and
  according to Cross-validation
```
Rather than just extracting the best k, let's also look at the prediction error across the whole range of k values we tested.
```python
# Let's plot the error you get with different values of k
plt.figure(figsize=(10, 7.5))
plt.plot(n_neighbors, train_scores, c="black", label="Training Set")
plt.plot(n_neighbors, test_scores, c="black", linestyle="--", label="Test Set")
plt.plot(n_neighbors, cv_scores, c="green", label="Cross-validation")
plt.xlabel('Number of K Nearest Neighbors')
plt.ylabel('Classification Error')
plt.gca().invert_xaxis()
plt.legend(loc="lower left")
plt.show()
```
Let's walk through the classification error on the training set first. When the model considers only a small number of neighbors, the training error is low. This makes sense, because each training point is then classified by looking only at its immediate surroundings, and the nearest neighbor of a training point is the point itself. The test error follows a similar trajectory, but beyond a certain point it grows because of overfitting: a model built on the training set no longer describes the held-out test samples well.
As you can see in this plot, cross-validation highlights the region of parameter space (that is, very low values of k) that is especially prone to overfitting. Although cross-validation and evaluation on the test set lead to somewhat different optima, they are both fairly good and broadly in agreement. As you can see, cross-validation provides a reasonable estimate of the test error. This kind of plot is useful for getting a feel for how a parameter affects model performance and for building intuition about the dataset you are learning from.
The code
Here is the code that generated all of the figures above, training and testing different KNN classifiers along the way. The code is adapted from a scikit-learn example and mostly deals with computing the decision boundaries and making the plots look nice.
It also contains the machine learning parts: splitting the dataset, fitting the algorithm, and testing it.
```python
def detect_plot_dimension(X, h=0.02, b=0.05):
    x_min, x_max = X[:, 0].min() - b, X[:, 0].max() + b
    y_min, y_max = X[:, 1].min() - b, X[:, 1].max() + b
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    dimension = xx, yy
    return dimension

def detect_decision_boundary(dimension, model):
    xx, yy = dimension  # unpack the dimensions
    boundary = model.predict(np.c_[xx.ravel(), yy.ravel()])
    boundary = boundary.reshape(xx.shape)  # put the result into a color plot
    return boundary

def plot_decision_boundary(panel, dimension, boundary, colors=['#DADDED', '#FBD8D8']):
    xx, yy = dimension  # unpack the dimensions
    panel.contourf(xx, yy, boundary, cmap=ListedColormap(colors), alpha=1)
    panel.contour(xx, yy, boundary, colors="g", alpha=1, linewidths=0.5)  # the decision boundary in green

def plot_dataset(panel, X, y, colors=["#EE3D34", "#4458A7"], markers=["x", "o"]):
    panel.scatter(X[y == 1, 0], X[y == 1, 1], color=colors[0], marker=markers[0])
    panel.scatter(X[y == 0, 0], X[y == 0, 1], color=colors[1], marker=markers[1])

def calculate_prediction_error(model, X, y):
    yPred = model.predict(X)
    score = 1 - round(metrics.accuracy_score(y, yPred), 2)
    return score

def plot_prediction_error(panel, dimension, score, b=.3):
    xx, yy = dimension  # unpack the dimensions
    panel.text(xx.max() - b, yy.min() + b, ('%.2f' % score).lstrip('0'),
               size=15, horizontalalignment='right')

def explore_fitting_boundaries(model, n_neighbors, datasets, width):
    # determine the height of the plot given that the aspect ratio of each panel should be equal
    height = float(width) / len(n_neighbors) * len(datasets.keys())

    nrows = len(datasets.keys())
    ncols = len(n_neighbors)

    # set up the plot
    figure, axes = plt.subplots(
        nrows,
        ncols,
        figsize=(width, height),
        sharex=True,
        sharey=True
    )

    dimension = detect_plot_dimension(X, h=0.02)  # the dimension of each subplot, based on the data

    # plotting the dataset and decision boundaries
    i = 0
    for n in n_neighbors:
        model.n_neighbors = n
        model.fit(datasets["Training Set"][0], datasets["Training Set"][1])
        boundary = detect_decision_boundary(dimension, model)
        j = 0
        for d in datasets.keys():
            try:
                panel = axes[j, i]
            except (TypeError, IndexError):
                if (nrows * ncols) == 1:
                    panel = axes
                elif nrows == 1:  # if you are only using one dataset
                    panel = axes[i]
                elif ncols == 1:  # if you only try one number of neighbors
                    panel = axes[j]

            plot_decision_boundary(panel, dimension, boundary)  # plot the decision boundary
            plot_dataset(panel, X=datasets[d][0], y=datasets[d][1])  # plot the observations
            score = calculate_prediction_error(model, X=datasets[d][0], y=datasets[d][1])
            plot_prediction_error(panel, dimension, score, b=0.2)  # plot the score

            # make compacted layout
            panel.set_frame_on(False)
            panel.set_xticks([])
            panel.set_yticks([])

            # format the axis labels
            if i == 0:
                panel.set_ylabel(d)
            if j == 0:
                panel.set_title('k={}'.format(n))
            j += 1
        i += 1

    plt.subplots_adjust(hspace=0, wspace=0)  # make compacted layout
```
You can then run the code like this:
```python
# Specify the model and settings
model = KNeighborsClassifier()
n_neighbors = [200, 99, 50, 23, 11, 1]
datasets = {
    "Training Set": [XTrain, yTrain],
    "Test Set": [XTest, yTest]
}
width = 20

# explore_fitting_boundaries(model, n_neighbors, datasets, width)
explore_fitting_boundaries(model=model, n_neighbors=n_neighbors, datasets=datasets, width=width)
```
Conclusion
The bias-variance tradeoff crops up in many different areas of machine learning. All algorithms, not just KNN, can be thought of as having some degree of flexibility. The goal of finding the sweet spot of flexibility that both describes the patterns in the data well and is able to generalize to new data applies to essentially all algorithms.