Hypothesis space and generalization ability. The meaning of generalization ability was already explained above; for the sake of emphasis, here it is again: the generalization ability of a learning method measures how well the model learned by that method performs over the entire sample space.
This is of course very important, because the data we use to train the model is only a small sample of the sample space; if we focus too much on it, the so-called overfitting scenario arises. Conversely, if the training data is ignored too recklessly, we get underfitting. A single pair of plots gives an intuitive feel for the two (as shown in Figure 1: the left plot is underfitting, the right plot is overfitting).
Figure 1 Under-fitting and over-fitting
So we need "relaxation degree" to find the best balance point. The minimization of structural risk in statistical learning (Structural Risk MINIMIZATION,SRM) is the study of this, compared to the traditional risk minimization (empirical Risk minimization,erm) Focus on minimizing the upper bounds of risk, rather than simply minimizing the risk of experience. It has a principle: Select the function that minimizes the risk of experience in the subset of functions that minimize the upper bounds of risk. And this subset of functions is exactly the hypothetical space we mentioned earlier.
Note: The so-called empirical risk can be understood as the risk on the training data set. Correspondingly, ERM can be understood as a learning method that focuses only on the training data set; its theoretical basis is that, in a certain mathematical sense, the empirical risk converges to the expected risk, i.e. the so-called "real" risk.
A detailed discussion of SRM and ERM would involve concepts such as the VC dimension and regularization, which are not covered in detail here, but the following intuition is needed: for the models produced by our learning method to generalize well enough, we need to place certain "constraints" on the model, and these constraints show up in the choice of hypothesis space. A very common practice is to impose a certain penalty on the complexity of the model, which biases learning toward simpler models. This coincides with the so-called Occam's Razor principle: "Entities should not be multiplied beyond necessity"; do not use more to accomplish what can be done just as well with less.
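Purely as an illustration of the idea of penalizing model complexity (this is not code from the book's example), the sketch below adds an L2 penalty, weighted by a hypothetical coefficient lambda_, to an ordinary squared loss over the coefficients p of a polynomial model:

import numpy as np

def penalized_loss(p, x, y, lambda_=0.1):
    # p, x and y are assumed to be NumPy arrays; lambda_ is a made-up hyperparameter
    # Empirical risk: half of the sum of squared errors of the polynomial with coefficients p
    empirical_risk = 0.5 * ((np.polyval(p, x) - y) ** 2).sum()
    # Complexity penalty: an L2 penalty on the coefficients; a larger lambda_ favours simpler models
    penalty = lambda_ * (p ** 2).sum()
    return empirical_risk + penalty

Minimizing such a penalized loss instead of the raw empirical risk is one concrete way in which the "constraint" on model complexity can be realized.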
Cross-validation (Cross Validation) lets us gauge how well we have fit the data, so that by choosing an appropriate hypothesis space we can avoid overfitting; it also helps us choose an appropriate model. The following three kinds of cross-validation are common.
S-fold cross-validation (S-fold Cross Validation): the most widely used method; it works as follows (a code sketch of the procedure is given right after this overview). Divide the data into S parts, D = {D_1, D_2, ..., D_S}, and run S experiments in total. In the i-th experiment, use D - D_i as the training set and D_i as the test set to train and evaluate the model. Finally, choose the model with the smallest average test error.
Leave-one-out cross-validation (Leave-One-Out Cross Validation): a special case of S-fold cross-validation in which S = N, the number of samples.
Simple cross-validation: the simplest implementation, and the one used in this book (when cross-validation is performed at all). It simply splits the data randomly so that the training set accounts for 70% of the original data (this ratio can vary with the situation), and the test error is used as the criterion when selecting the model.
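As a minimal sketch (not code from the book) of what S-fold cross-validation could look like with NumPy, assuming hypothetical train and evaluate functions and a data array of samples:

import numpy as np

def s_fold_cross_validation(data, s, train, evaluate):
    # Shuffle the sample indices and split them into s roughly equal folds
    folds = np.array_split(np.random.permutation(len(data)), s)
    errors = []
    for i in range(s):
        # Train on D - D_i and evaluate on D_i
        train_idx = np.concatenate([folds[j] for j in range(s) if j != i])
        model = train(data[train_idx])
        errors.append(evaluate(model, data[folds[i]]))
    # The average test error is what is compared across candidate models
    return np.mean(errors)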
The problem comes from the Machine Learning course offered by Stanford University on Coursera and is described as follows: given the areas and prices of 47 existing houses, build a model that predicts the prices of new houses. Translating the problem a little, we can see that: the input data has only one dimension, namely the area of a house; the target data also has only one dimension, namely the price of a house; and what needs to be done is machine learning based on the known relationship between house area and house price.
Let's move on to the next step.
Getting and Processing the Data
The original data set, giving the house area and house price for each of the 47 houses, is shown in Table 1.1; the units of the house area and house price can be chosen arbitrarily, as they do not affect the results.
Table 1.1 House price data set (house area, house price)
2104,399900
1600,329900
2400,369000
1416,232000
3000,539900
1985,299900
1534,314900
1427,198999
1380,212000
1494,242500
1940,239999
2000,347000
1890,329999
4478,699900
1268,259900
2300,449900
1320,299900
1236,199900
2609,499998
3031,599000
1767,252900
1888,255000
1604,242900
1962,259900
3890,573900
1100,249900
1458,464500
2526,469000
2200,475000
2637,299900
1839,349900
1000,169900
2040,314900
3137,579900
1811,285900
1437,249900
1239,229900
2132,345000
4215,549000
2162,287000
1664,368500
2238,329900
2567,314000
1200,299000
852,179900
1852,299900
1203,239500
But in general we should do some simple processing in the hope of reducing the complexity of the problem. In this example, a common practice is to standardize the input data using the following formula (subtract the mean, then divide by the standard deviation):
X_norm = (X - mean(X)) / std(X)
# Import the libraries we need
import numpy as np
import matplotlib.pyplot as plt

# Define the lists x and y that store the input data (x) and the target data (y)
x, y = [], []

# Iterate over the data set; the variable sample corresponds to one sample
for sample in open("./11.txt", "r"):
    # The data are separated by commas, so call Python's split method with a comma as the argument
    _x, _y = sample.split(",")
    # Convert the string data to floating-point numbers
    x.append(float(_x))
    y.append(float(_y))

# After reading the data, convert the lists into NumPy arrays to facilitate further processing
x, y = np.array(x), np.array(y)
# Standardization
x = (x - x.mean()) / x.std()

# Draw the original data in the form of a scatter plot
plt.figure()
plt.scatter(x, y, c="g", s=6)
plt.show()
Figure 2 Scatter plot of the pre-processed data
Here the horizontal axis is the standardized house area and the vertical axis is the house price. With this, we have completed the first step of machine learning reasonably well: data preprocessing.
Selecting and Training the Model
After the data is prepared, the next step is to choose an appropriate learning method and model. Fortunately, by visualizing the original data it is very intuitive to feel that a good result can probably be obtained through polynomial fitting within linear regression (Linear Regression). The mathematical expression of the model is as follows:
f(x|p; n) = p_0 * x^n + p_1 * x^(n-1) + ... + p_(n-1) * x + p_n
L(p; n) = (1/2) * sum_(i=1)^(47) ( f(x_i|p; n) - y_i )^2
Note: Fitting a scatter of points with a polynomial is only a small part of linear regression, but its intuitive meaning is obvious. Considering that the problem is relatively simple, we have chosen polynomial fitting here. A detailed discussion of linear regression is beyond the scope of this book and is not covered here.
Here f(x|p; n) is our model, and p and n are its parameters: p is the vector of coefficients of the polynomial f, and n is the degree of the polynomial. L(p; n) is the loss function of the model; here we use the common squared loss, i.e. the Euclidean distance (or the 2-norm of the difference vector). x and y are the input vector and the target vector; in our example both are 47-dimensional vectors, consisting respectively of the 47 house areas and the 47 house prices.
Once the model is determined, we can begin to write code to train it. For most machine learning algorithms, so-called training is simply the process of minimizing some loss function, and this polynomial-fitting model is no exception: our goal is to make the L(p; n) defined above as small as possible. In mathematical statistics there is a dedicated body of theory for this kind of regression problem, in which the well-known normal equation directly gives a simple formula for the solution. However, thanks to the existence of NumPy, the training process becomes even simpler.
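As an aside, and only as a rough sketch rather than the approach used here, the closed-form solution p = (X^T X)^(-1) X^T y given by the normal equation could be computed directly with NumPy, where X is the design matrix built from powers of the standardized inputs x defined above and the degree is an arbitrary example value:

# Rough sketch of the normal-equation solution (np.polyfit, used below, is the more robust way)
n = 4                                  # an arbitrary example degree
X = np.vander(x, n + 1)                # design matrix whose columns are x^n, x^(n-1), ..., 1
p = np.linalg.solve(X.T @ X, X.T @ y)  # coefficients that minimize the squared loss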
# Take 100 points in the interval (-2, 4) as the basis for plotting
x0 = np.linspace(-2, 4, 100)

# Use NumPy's functions to define a function that trains and returns a polynomial regression model
# The deg parameter corresponds to n in the model parameters, i.e. the degree of the polynomial
# The returned model returns the corresponding predicted y for the input x (x0 by default)
def get_model(deg):
    return lambda input_x=x0: np.polyval(np.polyfit(x, y, deg), input_x)
Here we need to explain the usage of two NumPy functions: polyfit and polyval. polyfit(x, y, deg): this function returns the parameter p that minimizes the loss L(p; n = deg) defined above (note: the x and y in the formula are the input x and y), i.e. the coefficients of the fitted polynomial; in other words, this function is the training function of the model. polyval(p, x): this function returns the value y of the polynomial with coefficients p evaluated at x.
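As a quick toy illustration of these two functions (separate from the house-price code), three points lying exactly on the line y = 2x + 1 are fitted with a degree-1 polynomial and the result is evaluated at a new point:

coeffs = np.polyfit([0, 1, 2], [1, 3, 5], 1)  # fit a degree-1 polynomial; gives coefficients close to [2., 1.]
print(np.polyval(coeffs, 3))                  # evaluate the fitted polynomial at x = 3; prints roughly 7.0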
Evaluating and Visualizing the Results
Once the model has been built, we should try to judge how it behaves under various parameters. For the sake of brevity, we evaluate only three parameter settings here: n = 1, 4, 10. Since the goal of training was to minimize the loss function, it seems reasonable to use the loss function to measure the quality of a model.
# Return the corresponding loss for the parameter n (deg) and the inputs x and y
def get_cost(deg, input_x, input_y):
    return 0.5 * ((get_model(deg)(input_x) - input_y) ** 2).sum()

# Define the set of test parameters and run an experiment for each of them
test_set = (1, 4, 10)
for d in test_set:
    # Output the corresponding loss
    print(get_cost(d, x, y))
The results are as follows: for n = 1, 4, 10, the first two digits of the losses are 96, 94 and 75 respectively. By this measure n = 10 seems better than n = 4, and n = 1 seems the worst; yet, as can be seen from Figure 3, choosing n = 1 as the model parameter actually seems to be the best choice. The source of this contradiction is precisely the overfitting mentioned earlier.
Figure 3 Visualization of linear regression
So, what is the most intuitive way to see whether overfitting has occurred? Drawing the models, of course.
# Draw the corresponding images
plt.scatter(x, y, c="g", s=20)
for d in test_set:
    plt.plot(x0, get_model(d)(), label="degree = {}".format(d))
# Limit the ranges of the horizontal and vertical axes to (-2, 4) and (10^5, 8 x 10^5)
plt.xlim(-2, 4)
plt.ylim(1e5, 8e5)
# Call the legend method so that the labels of the curves are displayed correctly
plt.legend()
plt.show()
The result of the above code is shown in Figure 3.
Here the three curves represent n = 1, n = 4 and n = 10 respectively (as the legend in the upper right corner of Figure 3 also indicates). It can be seen that the model already begins to overfit at n = 4, and by n = 10 it has become thoroughly unreasonable.
At this point, the problem can be said to be basically solved. In this example, apart from cross-validation, we have covered most of the major steps of machine learning (the reason cross-validation was skipped is simply that there is too little data). The code adds up to only 40~50 lines in total, which should be a fairly appropriate length. I hope this example gives everyone a general idea of machine learning, and that it also arouses everyone's interest in machine learning.
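For readers who would like to see what the skipped cross-validation step could have looked like, below is a minimal sketch of simple cross-validation for this example. It reuses the x and y arrays defined above; the 70/30 split and the random seed are arbitrary choices, and with only 47 samples the resulting test errors should not be read too literally.

# A sketch of simple cross-validation on this example (reusing the x and y arrays defined above)
np.random.seed(0)                        # arbitrary seed, only for reproducibility
indices = np.random.permutation(len(x))
split = int(0.7 * len(x))                # 70% of the samples for training, the rest for testing
train_idx, test_idx = indices[:split], indices[split:]

def get_test_cost(deg):
    # Train on the training split only, then measure the squared loss on the held-out split
    p = np.polyfit(x[train_idx], y[train_idx], deg)
    return 0.5 * ((np.polyval(p, x[test_idx]) - y[test_idx]) ** 2).sum()

for d in (1, 4, 10):
    print(d, get_test_cost(d))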