Anyone with a little knowledge of supervised machine learning knows the workflow: first train a model on the training set, then evaluate it on the test set, and finally deploy the algorithm on unseen data. Our real goal, however, is for the algorithm to classify unseen data well (that is, to achieve the lowest generalization error). So why should the model with the smallest training error also control the generalization error? Studying the learning theory in this section lets us know not only that this works, but why.

Learning Theory

1. Empirical Risk Minimization

Suppose we have a training set of m samples, each drawn independently from a probability distribution D. For a hypothesis h, define the training error (also called the empirical risk) as the fraction of training samples that h misclassifies:
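In symbols, following the notation of the CS229 notes:

$$\hat{\varepsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} 1\{h(x^{(i)}) \neq y^{(i)}\}$$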

In addition, define the generalization error as the probability that h misclassifies a new sample drawn from the same distribution D that generated the training set:
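That is:

$$\varepsilon(h) = P_{(x,y)\sim \mathcal{D}}\big(h(x) \neq y\big)$$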

It is worth noting that the assumption that the training samples and new samples are drawn independently from the same distribution D (the i.i.d. assumption) is an important foundation of learning theory. When we select model parameters, we use the following method:
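The selection rule referred to here is to pick the parameters that minimize the training error:

$$\hat{\theta} = \arg\min_{\theta}\ \hat{\varepsilon}(h_{\theta})$$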

This is the so-called empirical risk minimization (ERM). ERM itself is a non-convex optimization problem that is hard to solve; logistic regression and SVM can both be viewed as convex approximations of it.

To make the definition more general (not limited to selecting the parameters θ of a linear classifier), we define the hypothesis class H as the set of all classifiers (functions h) considered by a learning algorithm, and empirical risk minimization becomes:
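In symbols:

$$\hat{h} = \arg\min_{h \in \mathcal{H}}\ \hat{\varepsilon}(h)$$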

2. The Relationship Between Minimizing Empirical Risk and Minimizing Generalization Error

2.1 For Finite H

This section gives the conclusions of the derivation directly; the detailed steps can be found in the CS229 lecture notes. First, from the union bound lemma and the Hoeffding inequality (also called the Chernoff bound) we can derive:
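Following the CS229 notes, the resulting statement is:

$$P\Big(\forall h \in \mathcal{H},\ |\varepsilon(h) - \hat{\varepsilon}(h)| \le \gamma\Big) \ge 1 - 2k\exp(-2\gamma^2 m)$$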

This property is called uniform convergence, where k is the number of hypotheses in H. Substituting the hypothesis that minimizes the empirical risk, ĥ, and the hypothesis that minimizes the generalization error, h*, we obtain:
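The chain of inequalities (uniform convergence, then the fact that ĥ minimizes the empirical risk, then uniform convergence again) is:

$$\varepsilon(\hat{h}) \le \hat{\varepsilon}(\hat{h}) + \gamma \le \hat{\varepsilon}(h^*) + \gamma \le \varepsilon(h^*) + 2\gamma$$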

In other words, under the premise of uniform convergence, the generalization error of the model selected by minimizing the empirical risk is within 2γ of that of the model minimizing the generalization error. This gives a **theorem:** let H contain k hypotheses and let the sample size be m; then with probability at least 1−δ we have:
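As stated in the CS229 notes:

$$\varepsilon(\hat{h}) \le \Big(\min_{h\in\mathcal{H}} \varepsilon(h)\Big) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$$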

With this theorem, you can get a fuller understanding of what Andrew says about bias and variance (note that this "variance" is not the variance of probability theory):

The leftmost graph corresponds to high bias: because there are few candidate hypotheses in the hypothesis class, even a large increase in the sample size does not reduce the minimum generalization error min ε(h); this corresponds to underfitting. The rightmost graph corresponds to high variance: because the hypothesis class is enlarged (parameters are added), min ε(h) will fall or stay the same, but the increase in k causes γ to increase as well, especially when m is small. This can lead to a situation where the empirical risk is very low but the generalization error is still large, which corresponds to overfitting. The best solution is a compromise between high bias and high variance, which corresponds to the middle graph. Finally, there is a **corollary** on sample complexity: let H contain k hypotheses; then, for the gap between the generalization error of the empirical-risk minimizer and the minimum generalization error to be at most 2γ with probability at least 1−δ, the sample size must satisfy:
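The bound (obtained by solving the uniform-convergence probability for m) is:

$$m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta} = O\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big)$$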

2.2 For Infinite H

In most cases the hypotheses contained in H are infinite in number, so the conclusions of the previous section need to be generalized to infinite H. The proof is quite involved, and Andrew skipped it as well. Before introducing the theorem, we need the Vapnik-Chervonenkis dimension (VC dimension). First define shattering: given a sample set S, "H shatters S" means that H can realize every possible labeling of the points in S. The VC dimension VC(H) of H is then defined as the size of the largest set that H can shatter. For example, for a set of three points:

h(x) = 1{θ0 + θ1x1 + θ2x2 ≥ 0} can shatter these three points:

And this H cannot shatter any set of 4 points, so VC(H) = 3. **theorem:** for a given H, let d = VC(H); then with probability at least 1−δ, for all h ∈ H we have:
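As stated in the CS229 notes:

$$|\varepsilon(h) - \hat{\varepsilon}(h)| \le O\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right)$$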

Recall that an SVM can use a kernel whose feature space is infinite-dimensional; will it then overfit? Andrew's answer is no: one can prove that the VC dimension of a large-margin SVM is bounded above (even if the dimension of the feature vector is infinite), so the variance will not be too large. **Corollary:** with probability at least 1−δ, for the gap between the generalization error of the empirical-risk minimizer and the minimum generalization error to be at most 2γ, it suffices for the sample size m to be O_{γ,δ}(d). If the required sample size is computed strictly from the derived formula, it often turns out that such a large sample cannot be obtained at all. But Andrew mentions a rough rule of thumb: for the algorithm to work well, the sample size m should be linear in VC(H), and for most H, VC(H) is roughly linear in the number of model parameters; putting these together, the training-set size should generally be linear in the number of model parameters. Finally, Andrew mentions his own experience: for logistic regression, a training-set size about 10 times the number of parameters usually fits a good decision boundary, and even somewhat less than 10 times can be acceptable.

Model Selection

1. Cross-Validation

1.1 Hold-out Cross-Validation

- Randomly split the sample set into a training set (typically 70% of the samples) and a hold-out set (the remaining 30%)
- Train each candidate hypothesis on the training set to obtain a corresponding model
- Select the model that minimizes the empirical risk on the hold-out set as the optimal model
- (Optional) retrain the selected model on the entire sample set; this suits models that are not sensitive to perturbations of the initial conditions

The advantage of this approach is that each model only needs to be trained once, so the cost is low; the disadvantage is that only 70% of the samples are used for training. It is suitable when the sample set is large; when data are very scarce (for example, only 20 samples), other methods should be considered.

1.2 K-fold Cross Validation

- Randomly divide the sample set S into k subsets S1, ..., Sk, each containing m/k samples (k is usually 10)
- For each candidate model, train and validate k times: each time hold out one subset Sj, train on the other k−1 subsets, evaluate on Sj, and average the k errors as the model's estimated generalization error
- Select the model with the smallest estimated generalization error, then retrain it on the full sample set S

Compared with hold-out cross-validation this method is expensive, since each model is trained k times, but it makes fuller use of the samples for training. For extremely scarce samples there is an even more extreme approach:
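The k-fold procedure above can be sketched as follows; `train` and `error` are hypothetical callbacks standing in for a learning algorithm and an evaluation metric:

```python
import random

def k_fold_cv(samples, train, error, k=10, seed=0):
    """Estimate the generalization error of one model by k-fold cross-validation.

    train(subset) -> fitted model            (hypothetical signature)
    error(model, subset) -> error estimate   (hypothetical signature)
    """
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    # split into k roughly equal subsets S1, ..., Sk
    folds = [samples[i::k] for i in range(k)]
    errs = []
    for j in range(k):
        held_out = folds[j]
        train_part = [x for i, f in enumerate(folds) if i != j for x in f]
        model = train(train_part)        # train on the other k-1 subsets
        errs.append(error(model, held_out))  # validate on the held-out subset
    return sum(errs) / k                 # average the k error estimates
```

The returned average is the cross-validation estimate used to compare candidate models.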

1.3 Leave-one-out Cross Validation

The steps are the same as in k-fold cross-validation, but the number of folds equals the sample size m (each fold holds out a single sample).

2. Feature Selection

When the number of features n is very large, overfitting occurs easily, so feature selection is very necessary. For a model with n features there are 2^n possible feature subsets, and heuristic search procedures can be used to find a good subset:

2.1 Forward Search

- Initialize the feature set F = ∅
- Repeat { for i = 1, ..., n: if i ∉ F, let Fi = F ∪ {i} and evaluate Fi by cross-validation; set F to the best Fi found in this pass }
- Output the best-performing feature subset encountered during the loop

The termination condition of the outer loop can be a traversal of all feature positions, or that the dimension of the feature set |F| has reached a preset value (for example, selecting 100 out of 1000 features).
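The forward-search loop can be sketched as follows; `score` is a hypothetical callback that returns a cross-validation score (higher is better):

```python
def forward_search(n, score, max_features=None):
    """Greedy forward feature search over features indexed 0..n-1.

    score(feature_set) -> cross-validation score, higher is better
                          (hypothetical evaluation callback)
    """
    selected = set()
    best_overall = (float("-inf"), set())
    limit = max_features if max_features is not None else n
    while len(selected) < limit:
        candidates = [i for i in range(n) if i not in selected]
        if not candidates:
            break
        # try adding each remaining feature and keep the best addition
        best_i = max(candidates, key=lambda i: score(selected | {i}))
        selected = selected | {best_i}
        s = score(selected)
        if s > best_overall[0]:
            best_overall = (s, set(selected))
    return best_overall[1]  # best subset seen during the loop
```

Backward search is the mirror image: start from the full set and greedily delete one feature per pass.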

2.2 Backward Search

Basically the same as forward search, except that the initial feature set contains all features, F = {1, ..., n}, the termination condition becomes F = ∅, and each loop iteration deletes one feature from the feature set. These two methods are collectively known as **wrapper model feature selection**. Their computational cost is high: completely traversing the feature space in this way requires O(n²) calls to the learning algorithm.

2.3 Filter Feature Selection

- Compute a score S(i) for each feature xi that measures how much information xi contributes to predicting the class label y
- Select the k features with the highest scores S(i) (k can be chosen by cross-validation)

One option for the score S(i) is the correlation between xi and y. In practice, the most commonly used choice (especially for discrete features) is the mutual information:
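For binary features and labels, the mutual information is:

$$MI(x_i, y) = \sum_{x_i \in \{0,1\}} \sum_{y \in \{0,1\}} p(x_i, y)\,\log\frac{p(x_i, y)}{p(x_i)\,p(y)}$$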

This formula assumes that both the feature values and the class label are binary. It measures "how different p(xi, y) and p(xi)p(y) are": if xi and y are independent of each other, then p(xi, y) = p(xi)p(y) and MI(xi, y) = 0; in other words, xi contributes no information about y, so its score should be small. For text classification with Naive Bayes, the default feature space size is the vocabulary size n (usually very large), and applying feature selection often improves classification accuracy.
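A minimal sketch of the empirical mutual-information score computed from paired binary samples (natural-log units; the function name is my own):

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical MI between two binary sequences, in nats."""
    m = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x_i, y) pairs
    px = Counter(xs)               # marginal counts of x_i
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / m
        mi += pxy * log(pxy / ((px[x] / m) * (py[y] / m)))
    return mi
```

A feature identical to the label scores log 2 (the maximum for balanced binary data), while an independent feature scores 0, matching the discussion above.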

3. Bayesian Statistics and Regularization

To prevent overfitting, feature selection makes the model simpler, while regularization can prevent overfitting while keeping all the parameters. Take linear regression as an example: our previous approach was to obtain the optimal model by maximizing the likelihood function:
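That is, the maximum-likelihood estimate:

$$\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)$$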

The logic behind this is that there is one optimal parameter vector θ, and we can find it by maximizing the likelihood. This is the view of the **frequentist** school. The **Bayesian** school holds that θ is itself a random variable: to make predictions, we first assume a **prior distribution** p(θ) over θ; then, given a sample set S, we can obtain the posterior distribution of θ:
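By Bayes' rule:

$$p(\theta \mid S) = \frac{p(S \mid \theta)\,p(\theta)}{p(S)} = \frac{\Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\,p(\theta)}{\int_{\theta}\Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\,p(\theta)\,d\theta}$$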

When a new sample x is given and we want to predict its output, we can obtain the full distribution of the output:
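The posterior predictive distribution is:

$$p(y \mid x, S) = \int_{\theta} p(y \mid x, \theta)\,p(\theta \mid S)\,d\theta$$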

We can then give a prediction as the expectation of this distribution:
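That is:

$$E[y \mid x, S] = \int_{y} y\,p(y \mid x, S)\,dy$$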

However, the integral over θ is very high-dimensional and hard to evaluate, so the steps above become computationally intractable. Instead, we simply take the most probable θ under the posterior, the MAP (maximum a posteriori) estimate:
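In symbols:

$$\theta_{MAP} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\,p(\theta)$$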

Finally, predictions are made using θ_MAP. In practice, the prior distribution p(θ) is usually assumed to be θ ~ N(0, τ²**I**), and the difference between the MAP estimate of θ and the maximum-likelihood estimate is that the former's loss function has one extra penalty term compared to the latter:

$$+\,\lambda\,\lVert\theta\rVert^2 \qquad (\lambda = 1/\tau^2)$$

With this regularization, the model is much less prone to overfitting. For example, Bayesian logistic regression has proven to be an effective text classification algorithm, even though text classification usually has n ≫ m.
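As a small numerical illustration of the penalty's effect (closed-form linear regression on made-up data; this is a sketch, not the course's exact derivation):

```python
import numpy as np

# toy design matrix and targets (made-up data)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 1.0, 2.0])

def fit(X, y, lam=0.0):
    """Closed-form least squares with an L2 penalty lam * ||theta||^2."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

theta_ml = fit(X, y)            # maximum likelihood (no penalty)
theta_map = fit(X, y, lam=1.0)  # MAP-style estimate with L2 penalty
# the penalty shrinks the parameters toward zero
```

The penalized solution has a strictly smaller norm than the unpenalized one, which is exactly the shrinkage that makes overfitting less likely.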

4. Online Learning

For completeness of the course, this topic was left to the end and is only briefly introduced. The algorithms we studied before train a model on a batch of samples and then predict on new data; this is called **batch learning**. There is another class of algorithms that learn while predicting, i.e., each new data point is used for a prediction and then immediately for training; this is called **online learning**. Every model learned previously can in principle do online learning, but given the real-time requirement, not every model can be updated quickly enough before the next prediction. The perceptron algorithm is well suited to online learning:

The parameter update rule is: if hθ(x) = y (the prediction is correct), the parameters are not updated; otherwise θ := θ + yx. (This is in fact the same as the gradient-descent update, except that the class labels are changed to {1, −1} and the learning rate α, which does not affect the perceptron's performance, is dropped.) Finally, assume ‖x(i)‖ ≤ D, and that there exists a unit vector u (‖u‖ = 1) with y(i)(uᵀx(i)) ≥ γ for all i (in the language of SVMs, u separates all the data with geometric margin γ). Then it can be proven that the total number of prediction mistakes made by the online perceptron algorithm is at most (D/γ)².
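A minimal sketch of the online perceptron with labels in {1, −1}; the toy stream is made up to be separable by u = (1, 0) with margin γ = 1 and ‖x‖ ≤ D = √5, so the mistake bound is (D/γ)² = 5:

```python
def perceptron_online(stream):
    """Run the online perceptron over a stream of (x, y) pairs, y in {1, -1}.

    Predict sign(theta . x) before each update; return (theta, mistakes).
    """
    theta = None
    mistakes = 0
    for x, y in stream:
        if theta is None:
            theta = [0.0] * len(x)
        score = sum(t * xi for t, xi in zip(theta, x))
        pred = 1 if score > 0 else -1      # break ties toward -1
        if pred != y:
            mistakes += 1
            theta = [t + y * xi for t, xi in zip(theta, x)]  # theta := theta + y*x
    return theta, mistakes

# separable toy stream, presented twice to simulate an ongoing stream
stream = [((1.0, 0.0), 1), ((-1.0, 0.0), -1),
          ((2.0, 1.0), 1), ((-2.0, 1.0), -1)] * 2
theta, mistakes = perceptron_online(stream)
```

On this stream the algorithm makes a single mistake, well under the (D/γ)² = 5 bound.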

Stanford CS229 Machine Learning Course Notes Six: Learning Theory, Model Selection and Regularization