Let's continue the discussion of reading Vapnik's book Statistical Learning Theory. At the very beginning of the book, Vapnik describes the fundamental approaches in pattern recognition: the parametric estimation approach and the non-parametric estimation approach. Before introducing the non-parametric approach, to which the support vector machine belongs, Vapnik first addresses the following three beliefs that the philosophy of the parametric approach stands on (pages 4 to 5):
- There exists a function defined by a limited number of parameters that provides a good approximation to the desired function;
- The normal law holds for most real-life problems;
- The maximum likelihood method is a good tool for estimating parameters.
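To make the third belief concrete before commenting on it, here is a minimal maximum-likelihood sketch (my own illustration, not code from the book). For a Gaussian, maximising the likelihood gives closed-form estimates: the sample mean and the (biased) sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)  # samples from N(2, 1.5^2)

# For a Gaussian, the maximum likelihood estimates of (mu, sigma^2) are
# the sample mean and the (biased) sample variance.
mu_hat = data.mean()
sigma2_hat = ((data - mu_hat) ** 2).mean()

print(f"MLE mean: {mu_hat:.3f} (true 2.0)")
print(f"MLE variance: {sigma2_hat:.3f} (true {1.5 ** 2})")
```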
In my opinion, the first condition is required for all machine learning approaches, no matter whether they are parametric or non-parametric. The second point is based on the central limit theorem, which states that the distribution of the sum of a large number of independent random variables approximates a Gaussian distribution. If we first pre-process the dataset by normalising the data points so that the mean is centred at the origin (and the variance is scaled to one), the Gaussian distribution becomes the standard normal distribution, which is where the term 'normal law' comes from. In my opinion, it is better to highlight the assumption of independence, since independence is the more fundamental assumption; the Gaussian distribution is only a special consequence of this condition. Regarding the third point, the statement seems a little too absolute, as maximum likelihood estimation does not suit every problem.
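A quick simulation (again my own sketch, with made-up parameters) shows the central limit theorem behind the 'normal law': sums of independent, clearly non-Gaussian variables look increasingly Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum n independent uniform variables (non-Gaussian individually); by the
# CLT the standardised sum approaches a standard normal distribution.
n, trials = 50, 100_000
sums = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
standardised = (sums - n * 0.5) / np.sqrt(n / 12.0)  # uniform variance = 1/12

# Moments of the standardised sums should be close to those of N(0, 1).
print(f"mean ~ {standardised.mean():.3f}, var ~ {standardised.var():.3f}")
print(f"skewness ~ {(standardised ** 3).mean():.3f}")  # ~0 for a Gaussian
```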
Although we may see some limitations in these statements, keeping on reading the book makes it clear that the author simply uses these assumptions, which many methods follow, to highlight the limitations of parametric approaches. Experts in parametric learning would certainly be able to argue these points, as debate is common in the academic world.
Then Vapnik introduces the perceptron algorithm of 1958 and the empirical risk minimisation (ERM) criterion used in machine learning. It is of interest to note that ERM measures the error with respect to the training samples, while the real problem of machine learning is to estimate the unobserved behaviour on the test dataset. This gap leads to the problem of overfitting, which occurs when the training set is small: the model fits the training samples but lacks generalisation, achieving poor performance on the test dataset.
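As a concrete anchor for ERM, here is a minimal perceptron sketch (my own reconstruction of the 1958-style update rule, not code from the book): the training loop drives the empirical risk, i.e. the misclassification rate on the training samples, towards zero, while the risk we actually care about concerns unseen data.

```python
import numpy as np

def empirical_risk(w, b, X, y):
    """Fraction of training samples misclassified (the ERM objective)."""
    return np.mean(np.sign(X @ w + b) != y)

def perceptron(X, y, epochs=100):
    """Rosenblatt-style perceptron for labels y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # mistake: update towards the sample
                w += yi * xi
                b += yi
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # linearly separable labels
w, b = perceptron(X, y)
print("empirical risk:", empirical_risk(w, b, X, y))  # near 0 if separable
```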
Exactly as I expected, the next problem the author addresses is the generalisation of the algorithm, and then the very important VC dimension theory is introduced. The basic motivation of the VC dimension relates to density estimation. We know that, by the law of large numbers, the relative frequency of an event approaches its true probability as the number of samples approaches infinity. In reality, however, our training datasets are always finite. This drives the author to construct a more general theory of what can be estimated from a finite training dataset, leading to the so-called VC dimension, a measure of the capacity of a set of functions. The motivation of the support vector machine is that, among machines that explain the training data equally well, the one with the lowest VC dimension is the best.
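The finite-sample gap the author worries about is easy to see numerically. A small sketch (my own, with a made-up probability):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3  # true probability of the event

# The relative frequency converges to p_true as the sample size grows (law
# of large numbers), but at any finite size there is a non-trivial gap --
# exactly the regime VC theory is built to analyse.
for n in (10, 100, 10_000, 1_000_000):
    freq = (rng.random(n) < p_true).mean()
    print(f"n={n:>9}: relative frequency = {freq:.4f}, "
          f"gap = {abs(freq - p_true):.4f}")
```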
Then the author presents the main principle of designing a learning machine based on a dataset of limited size. The principle is this: when estimating densities, we should directly estimate the specific density we need, rather than deriving it by first estimating the more general densities on which it depends.
For example, if we can estimate the conditional probability directly, we do not need to estimate the probability of the condition and the probability of the event under all conditions. More importantly, with limited information, such as a small training dataset, we may only be able to estimate the more specific density. On the other hand, the problem we are going to solve is to predict the classes of unobserved samples, which requires that the machine be able to generalise beyond its training dataset or any specific test point. The machine should have the capability to make predictions for all samples in the feature space.
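As a toy numeric illustration of this principle (my own, not from the book): with discrete variables, the conditional probability P(y=1|x=1) can be estimated directly from counts, without first estimating P(x|y), P(y), and P(x) and combining them via Bayes' rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete data: binary condition x and binary event y.
x = rng.integers(0, 2, size=10_000)
y = (rng.random(10_000) < np.where(x == 1, 0.8, 0.2)).astype(int)

# Direct route: estimate P(y=1 | x=1) straight from the counts we need.
direct = y[x == 1].mean()

# Indirect route: estimate P(x=1 | y=1), P(y=1), P(x=1), then apply Bayes.
p_x_given_y = x[y == 1].mean()
p_y = y.mean()
p_x = x.mean()
indirect = p_x_given_y * p_y / p_x

print(f"direct P(y=1|x=1)  = {direct:.4f}")
print(f"indirect via Bayes = {indirect:.4f}")  # same value, three estimates
```

In this discrete case the two routes coincide by construction; the point is that the indirect route requires estimating three quantities instead of one, and with limited data (or continuous densities) each extra estimate is a source of error we did not need to incur.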