How to evaluate the hypothesis produced by a learning algorithm, and how to detect overfitting and underfitting problems.
When we fit the parameters of a learning algorithm, we choose the parameters that minimize the training error. Some people think that a very small training error must be a good thing. But in fact, a hypothesis can have a very small training error and still be a poor hypothesis: when it is applied to new samples, its error turns out to be much larger, which shows that it does not generalize. For example, an overfit hypothesis fits the training set very well, but as soon as it is evaluated on data it has not seen, the error increases.
So how do we tell whether a hypothesis is overfitting? For a problem with one or two features we can simply plot the hypothesis h(x) and inspect it. But in the more general situation, where there are many features, it becomes difficult or impossible to judge the hypothesis by plotting it. Therefore, we need another way to evaluate a hypothesis.
A standard method for evaluating a hypothesis is the following. Suppose we have a data set like the one below, where only 10 training samples are shown; in practice there may be hundreds or thousands. To be able to evaluate our hypothesis, we divide the data into two parts: the first part becomes our training set and the second part becomes our test set. A typical split is to use 70% of the data as the training set and the remaining 30% as the test set, a 7:3 ratio. Here m denotes the total number of training samples, and the remaining data forms the test set. The subscript "test" indicates that a sample comes from the test set, so (x_test^(1), y_test^(1)) is the first test sample, and m_test denotes the number of test samples. Note that here the first 70% of the data was taken as the training set and the last 30% as the test set. But if the data has some regularity or ordering, it is better to choose 70% at random as the training set and the remaining 30% as the test set. If the data is already randomly ordered, simply taking the first 70% and the last 30% is fine; if it is not, it is best to shuffle the order first, and then take the first 70% as the training set and the last 30% as the test set.
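As a minimal sketch of this split in Python (assuming NumPy; the function name, the seed, and the synthetic data are illustrative, not from the lecture):

```python
import numpy as np

def train_test_split(X, y, train_fraction=0.7, seed=0):
    """Shuffle the data, then take the first 70% as the training set
    and the remaining 30% as the test set."""
    m = X.shape[0]                       # total number of samples
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)             # scramble the order first
    split = int(train_fraction * m)
    train, test = idx[:split], idx[split:]
    return X[train], y[train], X[test], y[test]

# Example with 10 synthetic samples, as in the illustration above.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0
X_train, y_train, X_test, y_test = train_test_split(X, y)
print(len(X_train), "training samples,", len(X_test), "test samples")
```

Shuffling before slicing is what protects against an ordered data set; with already-random data the permutation is harmless.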
A typical procedure for training and testing a linear regression learning algorithm is shown below. First, we learn from the training set to obtain the parameters θ; specifically, we minimize the training error J(θ), where J(θ) is defined using only the 70% of the data that forms the training set. Next, we compute the test error, written J_test(θ) with the subscript "test". Taking the θ obtained from the training set, we plug it in and compute

J_test(θ) = (1/(2·m_test)) · Σ_{i=1}^{m_test} (h_θ(x_test^(i)) − y_test^(i))²,

which, up to the conventional factor of 1/2, is the average squared error over the test set: for each of the m_test test samples we evaluate the hypothesis with the learned parameters θ and measure its squared error against the label. This is the definition of the test error when we use linear regression and the squared-error measure.
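As a sketch, this test error could be computed as follows (NumPy assumed; θ is taken as already fit on the training set, fitting code omitted, and the 1/(2·m_test) factor follows the same convention as the training objective):

```python
import numpy as np

def j_test_linear(theta, X_test, y_test):
    """J_test(theta) = (1 / (2 * m_test)) * sum((h(x) - y)^2)
    over the held-out test samples."""
    m_test = X_test.shape[0]
    h = X_test @ theta                 # hypothesis h_theta(x) = theta^T x
    return np.sum((h - y_test) ** 2) / (2 * m_test)
```

A low J(θ) on the training set together with a high J_test(θ) is exactly the overfitting symptom described earlier.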
What if it is a classification problem, for example one solved with logistic regression?
The steps for training and testing logistic regression are very similar to the ones above. First we learn the parameters θ from the training data, that is, from 70% of all the data, and then we compute the test error in the following way; the objective function is the same one we normally use for logistic regression, except that now it is evaluated on the m_test test samples. Here the test error J_test(θ) is the misclassification rate, also known as the 0/1 misclassification error, where 0/1 indicates that each prediction is either correct or incorrect. For a single sample, the error equals 1 when the hypothesis h(x) is greater than or equal to 0.5 while y equals 0, or when h(x) is less than 0.5 while y equals 1. Both cases mean the hypothesis has misjudged the sample, with the threshold set at 0.5: the hypothesis says the label is more likely to be 1 when it is actually 0, or leans toward 0 when the actual label is 1. Otherwise we define the error value as 0, meaning the hypothesis classified the sample y correctly.
Then we can use this per-sample error, err, to define the test error:

J_test(θ) = (1/m_test) · Σ_{i=1}^{m_test} err(h_θ(x_test^(i)), y_test^(i)).
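A minimal sketch of this 0/1 test error (NumPy assumed; the helper names sigmoid, err01, and j_test_logistic are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def err01(h, y):
    """err(h, y) = 1 when the prediction falls on the wrong side of the
    0.5 threshold relative to the label, 0 otherwise."""
    return np.where(((h >= 0.5) & (y == 0)) | ((h < 0.5) & (y == 1)), 1.0, 0.0)

def j_test_logistic(theta, X_test, y_test):
    """Fraction of misclassified test samples:
    (1/m_test) * sum of err(h_theta(x_test^(i)), y_test^(i))."""
    h = sigmoid(X_test @ theta)        # h_theta(x) = g(theta^T x)
    return np.mean(err01(h, y_test))
```

Averaging err over the test set gives exactly the misclassification rate, which is often easier to interpret for classification than the logistic-regression objective itself.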
Stanford University open course, Machine Learning: Advice for Applying Machine Learning - Evaluating a Hypothesis (how to evaluate the hypothesis produced by a learning algorithm, and how to prevent overfitting or underfitting)