In the last chapter we studied parameter estimation methods for the PDF, mainly maximum likelihood estimation and Bayesian estimation. Both estimate the parameters of a PDF whose functional form is already known. In practice, however, we often do not know the exact form of the PDF and can only estimate the entire density from the samples themselves, and such an estimate can only be obtained numerically. In layman's terms, if parameter estimation means picking one member of a specified family of functions as the target estimate, then nonparametric estimation means finding a suitable choice from among all possible functions.
There are three common methods of nonparametric estimation: the histogram method, the k_N nearest neighbor method, and the kernel function method; the kernel function method is also called the Parzen window method or kernel density estimation.
1. Histogram method
This is one of the simplest and most intuitive nonparametric estimation methods; many people will have met it as early as junior high school. Here is a simple example:
A. Take the math scores x of a class (for simplicity, assume each sample has only one component) and divide the passing range from 60 to 100 into 4 equal small windows. Since x is one-dimensional, this gives 4^1 = 4 bins; the volume (here, the width) of each bin is denoted V;
B. Count the number of samples N_i that fall into each bin (we call this the frequency count);
C. Suppose there are 60 students in the class but only 40 passed the math test, so the total sample size is N = 40. Clearly the probability density within each bin is then a constant, which can be calculated by the following formula:
$$\hat{p}(x) = \frac{N_i}{N \cdot V}, \qquad x \in \text{bin } i$$
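To make the example concrete, here is a minimal Python sketch of the histogram estimate above; the score values are made up for illustration, and numpy's `histogram` is used to do the binning:

```python
import numpy as np

# Hypothetical scores of the 40 students who passed (values are made up).
rng = np.random.default_rng(0)
scores = rng.uniform(60, 100, size=40)

# 4 equal-width bins between 60 and 100, so each bin has volume V = 10.
counts, edges = np.histogram(scores, bins=4, range=(60, 100))
V = edges[1] - edges[0]          # bin volume (width, since x is 1-D)
N = len(scores)                  # total number of samples

# Histogram density estimate: constant N_i / (N * V) inside each bin.
p_hat = counts / (N * V)

for (lo, hi), p in zip(zip(edges[:-1], edges[1:]), p_hat):
    print(f"bin [{lo:.0f}, {hi:.0f}): estimated density = {p:.4f}")

# Sanity check: the estimated density integrates to 1 over [60, 100).
print("integral =", np.sum(p_hat * V))
```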
Based on the example above, we can analyze the basic idea of the histogram. Our goal is to find the distribution underlying the samples, i.e., an estimate of the probability density p(x). As before, we ignore the classification problem and assume all samples belong to the same class.
Step 1: Consider a small region R. The probability that a random sample falls into R is:
$$P_R = \int_R p(x)\,dx$$
Step 2: According to the binomial distribution, the probability that exactly k of the samples fall into R is:
$$P(k) = \binom{N}{k} P_R^{\,k}\,(1 - P_R)^{\,N-k}$$
where N is the total number of samples. When N is large, k concentrates around its expected value, $k \approx P_R \cdot N$, so we obtain the estimate of $P_R$:
$$\hat{P}_R = \frac{k}{N}$$
Step 3: When p(x) is continuous and the volume V of R is sufficiently small, p(x) can be regarded as constant inside R, so the probability of falling into R is approximately:
$$P_R \approx p(x) \cdot V$$
Substituting the estimate of $P_R$ into the formula above gives:
$$\hat{p}(x) = \frac{k}{N \cdot V} \qquad \text{(basic formula)}$$
Is this not exactly the same as the estimate in the math-score example above?
In histogram estimation, one issue directly affects the quality of the estimate: the choice of the bin volume V. It should be neither too large nor too small, and should be matched to the total number of samples. If V is too large, the resolution is low and the estimate is over-smoothed; if V is too small, the bins are too fine and the estimate fluctuates too much. The formal statement is that, as the number of samples grows, the bins should shrink, while each bin must still contain enough samples, and the number of samples in each bin must become a negligible fraction of the total. Expressed as formulas:
$$\lim_{N\to\infty} V_N = 0, \qquad \lim_{N\to\infty} k_N = \infty, \qquad \lim_{N\to\infty} \frac{k_N}{N} = 0 \tag{1}$$
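As an illustration of this trade-off (not part of the original notes), the following sketch compares histogram estimates of a standard normal sample for a very coarse and a very fine bin width; the sample size and bin counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=500)   # hypothetical 1-D sample

def histogram_density(samples, n_bins, lo=-4.0, hi=4.0):
    """Histogram density estimate: constant N_i / (N * V) inside each bin."""
    counts, edges = np.histogram(samples, bins=n_bins, range=(lo, hi))
    V = edges[1] - edges[0]
    return counts / (len(samples) * V), edges

# Too few bins: low resolution, over-smoothed; too many bins: noisy estimate.
for n_bins in (2, 20, 200):
    p_hat, _ = histogram_density(samples, n_bins)
    print(f"{n_bins:4d} bins -> max density {p_hat.max():.3f}, "
          f"{np.sum(p_hat == 0)} empty bins")
```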
2. k_N Nearest Neighbor Method
Although the histogram method is simple, it is not always adequate, for example when the number of samples is limited. After all, how many samples fall into a bin depends not only on the bin volume but also on the distribution of the samples. To get a better estimate, we need to adjust the bin size according to the sample distribution. The k_N nearest neighbor method is an estimation method for the finite-sample case and can be regarded as an adaptive histogram estimation.
Basic idea: within the range of values of the sample x, each value is taken as the center of a bin. A number k_N, determined from the total sample size, specifies how many samples must fall inside each bin. To estimate p(x), we find the k_N samples nearest to the current center point; during this search the volume of the bin keeps growing until it contains exactly k_N samples, at which point the bin volume V is determined and the estimate is:
$$\hat{p}(x) = \frac{k_N}{N \cdot V}$$
It is not hard to see from the formula above that the sample density is inversely proportional to the bin volume, so the estimate has better resolution in high-density regions, while in low-density regions the larger bins keep the estimate continuous. As with histogram estimation, to obtain a good estimate we need to choose k_N as a function of the sample size according to the principle of formula (1), for example:
$$k_N = k_1 \sqrt{N}$$
Besides the variable bin volume, the k_N nearest neighbor method differs from the histogram method in that it does not partition the possible values of x into a fixed set of bins in advance; instead, every point in the range of x serves as the center of its own bin, and the bin volume is determined only once the k_N samples closest to that point have been found. Although the k_N nearest neighbor method solves the problem of uneven histogram estimates under a finite sample, it is prone to another problem, the curse of dimensionality: when the dimension of x is high, the number of samples required for an accurate estimate cannot be reached in practice.
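Here is a minimal sketch of a 1-D k_N nearest neighbor density estimate under the assumptions above, with k_N chosen as roughly the square root of N; the data and evaluation points are arbitrary:

```python
import numpy as np

def knn_density(x, samples, k=None):
    """k_N nearest neighbor density estimate at point x (1-D case).

    The bin is grown around x until it contains exactly k samples;
    its volume is then twice the distance to the k-th nearest sample.
    """
    samples = np.asarray(samples)
    N = len(samples)
    if k is None:
        k = int(np.ceil(np.sqrt(N)))        # common choice: k_N ~ sqrt(N)
    dists = np.sort(np.abs(samples - x))
    V = 2.0 * dists[k - 1]                  # interval [x - r_k, x + r_k]
    return k / (N * V)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=400)    # hypothetical sample

for x in (-2.0, 0.0, 2.0):
    print(f"p_hat({x:+.1f}) = {knn_density(x, samples):.3f}")
```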
3. Parzen Window Method
This method estimates the probability density at the current point x by means of a kernel function, and it can be viewed as interpolating between the samples with kernel functions over the value space of x.
Going back to the (basic formula): when counting the number of samples that fall into a bin, we need to decide whether an observed sample x_i should be counted in the bin containing x. How do we judge this? Is it related to the distance between the two samples? Keep reading.
In the histogram example the samples were one-dimensional; here we return to the general case and assume x is a d-dimensional feature vector, so that each bin is also multidimensional, more precisely a hypercube. Let the edge length along each dimension be h; then the volume of a bin is $V = h^d$. To count the number of samples falling into the bin, we define a binary function:
$$\varphi(u) = \begin{cases} 1, & |u_j| \le \dfrac{1}{2}, \; j = 1, \dots, d \\ 0, & \text{otherwise} \end{cases}$$
With this binary function the judgment becomes very easy: we only need to evaluate it at $(x - x_i)/h$; if the value is 1, the sample is counted into the current bin, otherwise it is not. The total number of samples falling into the bin centered at x is therefore:
$$k = \sum_{i=1}^{N} \varphi\!\left(\frac{x - x_i}{h}\right)$$
Substituting this into the (basic formula) gives:
$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{V}\, \varphi\!\left(\frac{x - x_i}{h}\right) \tag{2}$$
In formula (2), the expression after the summation symbol Σ is called the kernel function, also called the window function, and is written as:
$$K(x, x_i) = \frac{1}{V}\, \varphi\!\left(\frac{x - x_i}{h}\right)$$
It reflects the contribution of the observed sample x_i to the probability density estimate at x, which depends on the distance between the two samples. The intuitive interpretation of formula (2) is therefore that the density estimate at x is obtained by averaging, over all observed samples, their distance-based contributions to the bin around x.
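The following is a minimal sketch of formula (2) with the hypercube window φ defined above; the data, bandwidth h, and evaluation points are arbitrary choices for illustration:

```python
import numpy as np

def hypercube_window(u):
    """phi(u): 1 if every component of u lies within [-1/2, 1/2], else 0."""
    u = np.atleast_2d(u)
    return np.all(np.abs(u) <= 0.5, axis=1).astype(float)

def parzen_estimate(x, samples, h):
    """Formula (2): p_hat(x) = (1/N) * sum_i (1/h^d) * phi((x - x_i) / h)."""
    samples = np.asarray(samples)
    N, d = samples.shape
    V = h ** d
    return np.mean(hypercube_window((x - samples) / h)) / V

rng = np.random.default_rng(3)
samples = rng.normal(0.0, 1.0, size=(500, 1))   # hypothetical 1-D sample

for x in (np.array([0.0]), np.array([1.0])):
    print(f"p_hat({x[0]:+.1f}) = {parzen_estimate(x, samples, h=0.5):.3f}")
```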
A kernel function must satisfy the conditions of a probability density, i.e., its values are non-negative and it integrates to 1:
$$K(x, x_i) \ge 0, \qquad \int K(x, x_i)\, dx = 1$$
Finally, several common kernel functions are introduced:
A. Square (hypercube) window, i.e., the window used above:
$$K(x, x_i) = \frac{1}{h^d}\, \varphi\!\left(\frac{x - x_i}{h}\right)$$
B. Gaussian window (multidimensional):
$$K(x, x_i) = \frac{1}{(2\pi)^{d/2}\, h^d\, |\Sigma|^{1/2}} \exp\!\left(-\frac{(x - x_i)^T \Sigma^{-1} (x - x_i)}{2 h^2}\right)$$
That is, a normal distribution with the observed sample x_i as its mean and covariance matrix $h^2\Sigma$, where Σ is a fixed positive-definite matrix (commonly the identity).
C. Hypersphere window:
$$K(x, x_i) = \begin{cases} \dfrac{1}{V}, & \|x - x_i\| \le \rho \\ 0, & \text{otherwise} \end{cases}$$
where V is the volume of the hypersphere and ρ is its radius.
Note that the three kernel functions above share a common parameter h (the radius ρ plays the same role in the hypersphere window): the smoothing parameter, which controls how far a single sample's influence extends over the estimate.
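As a sketch of how the smoothing parameter behaves (using a 1-D Gaussian window, with data and bandwidths chosen arbitrarily), the following compares Parzen estimates at a single point for several values of h:

```python
import numpy as np

def gaussian_window_estimate(x, samples, h):
    """1-D Parzen estimate with a Gaussian window of bandwidth h:
    p_hat(x) = (1/N) * sum_i N(x; x_i, h^2)."""
    samples = np.asarray(samples)
    z = (x - samples) / h
    return np.mean(np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi)))

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=300)        # hypothetical sample

# Small h: spiky, high-variance estimate; large h: over-smoothed estimate.
for h in (0.05, 0.3, 2.0):
    print(f"h = {h:4.2f}: p_hat(0) = {gaussian_window_estimate(0.0, samples, h):.3f}")
```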
Nonparametric estimation of a probability density function requires a sufficiently large number of samples; given enough samples, convergence to essentially any density function can be guaranteed, but the computational and storage costs are correspondingly large. The parameter estimation methods of the previous chapter are better suited to the small-sample case, and if there is sufficient prior knowledge about the form of the density, parameter estimation may achieve better results. In short, when we have sufficient prior knowledge about the prior probabilities and the class-conditional densities, or when we have enough samples, we can obtain a good probability density estimate.