A kernel function implicitly maps a data point into some high-dimensional feature space and substitutes for the inner product of feature vectors, so a non-linearly separable classification problem can be converted into a linearly separable one. This trick can be applied to many feature-vector-based models such as the SVM, which we introduced in previous articles.
To test the validity of a kernel function, we need Mercer's theorem: a function $k: \mathbb{R}^m\times\mathbb{R}^m \rightarrow\mathbb{R}$ is a Mercer kernel iff for every finite set $\{\vec{x}_1,\vec{x}_2,...,\vec{x}_N\}$ the corresponding kernel matrix is symmetric positive semi-definite.
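As a quick numerical check (an illustration, not a proof), one can build the kernel matrix on a finite set of random points and inspect its eigenvalues; a minimal NumPy sketch, assuming a simple polynomial candidate kernel:

```python
import numpy as np

def candidate_kernel(x, y, c=1.0, M=2):
    # candidate kernel to be checked on a finite point set: (x^T y + c)^M
    return (x @ y + c) ** M

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # a finite set {x_1, ..., x_N}
K = np.array([[candidate_kernel(a, b) for b in X] for a in X])

print(np.allclose(K, K.T))                        # symmetric?
print(np.linalg.eigvalsh(K).min() >= -1e-8)       # eigenvalues >= 0, up to round-off?
```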
One of the good kernel functions is the Gaussian kernel $k(\vec{x}_m,\vec{x}_n)=\exp\{-\frac{1}{2\sigma^2}\|\vec{x}_m-\vec{x}_n\|^2\}$, whose feature space has infinite dimensionality. Another one is the polynomial kernel $k(\vec{x}_m,\vec{x}_n)=(\vec{x}_m^T\vec{x}_n+c)^M$ with $c>0$. In practice, we can also construct new kernel functions from simple valid kernels according to a set of closure properties.
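For example, sums and products of valid kernels are again valid kernels. A small sketch of such a construction (the particular kernels, constants, and combination below are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(xm, xn, sigma=1.0):
    # k(x_m, x_n) = exp(-||x_m - x_n||^2 / (2 sigma^2))
    return np.exp(-np.sum((xm - xn) ** 2) / (2 * sigma ** 2))

def polynomial_kernel(xm, xn, c=1.0, M=3):
    # k(x_m, x_n) = (x_m^T x_n + c)^M with c > 0
    return (xm @ xn + c) ** M

def combined_kernel(xm, xn):
    # sums and products of valid kernels are valid kernels again
    return gaussian_kernel(xm, xn) + gaussian_kernel(xm, xn) * polynomial_kernel(xm, xn)

xm, xn = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(combined_kernel(xm, xn))
```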
We can also use a generative model to define kernel functions, such as:
(1) $k(\vec{x}_m,\vec{x}_n)=\int p(\vec{x}_m\,|\,\vec{z})\,p(\vec{x}_n\,|\,\vec{z})\,p(\vec{z})\,d\vec{z}$, where $\vec{z}$ is a latent variable;
(2) $k(\vec{x}_m,\vec{x}_n)=g(\vec{\theta},\vec{x}_m)^T F^{-1} g(\vec{\theta},\vec{x}_n)$, where $g(\vec{\theta},\vec{x})=\nabla_{\vec{\theta}}\ln p(\vec{x}\,|\,\vec{\theta})$ is the Fisher score, and $F=\frac{1}{N}\sum_{n=1}^{N} g(\vec{\theta},\vec{x}_n)\,g(\vec{\theta},\vec{x}_n)^T$ is the Fisher information matrix (a numerical sketch of this Fisher kernel is given below).
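As a concrete (assumed) example of the Fisher kernel, take a univariate Gaussian model $p(x\,|\,\mu)$ with unit variance, so the Fisher score is simply $g(\mu,x)=x-\mu$; a minimal sketch:

```python
import numpy as np

def fisher_score(mu, x):
    # g(theta, x) = d/d(mu) ln N(x | mu, 1) = x - mu
    return np.array([x - mu])

def fisher_kernel(mu, X, xm, xn):
    # F = (1/N) sum_n g(theta, x_n) g(theta, x_n)^T over the data set X
    G = np.stack([fisher_score(mu, x) for x in X])       # shape (N, 1)
    F = (G[:, :, None] * G[:, None, :]).mean(axis=0)     # Fisher information matrix
    gm, gn = fisher_score(mu, xm), fisher_score(mu, xn)
    return gm @ np.linalg.inv(F) @ gn

X = np.random.default_rng(1).normal(loc=0.3, size=100)
mu_hat = X.mean()                                        # plug-in estimate of theta
print(fisher_kernel(mu_hat, X, X[0], X[1]))
```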
A Gaussian process is a probabilistic discriminative model, built on the assumption that the set of values of $y(\vec{x})$ evaluated at an arbitrary set of points $\{\vec{x}_1,\vec{x}_2,...,\vec{x}_N\}$ is jointly Gaussian distributed, with the covariance determined by the kernel matrix.
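To make this concrete, the following sketch (assuming a squared-exponential kernel, an illustrative choice) draws jointly Gaussian function values whose covariance is the kernel matrix:

```python
import numpy as np

def kernel(xm, xn, theta1=10.0):
    # squared-exponential covariance between two scalar inputs
    return np.exp(-0.5 * theta1 * (xm - xn) ** 2)

x = np.linspace(0.0, 1.0, 100)
K = np.array([[kernel(a, b) for b in x] for a in x])   # kernel matrix = covariance

rng = np.random.default_rng(0)
# y(x_1), ..., y(x_N) are jointly Gaussian with mean 0 and covariance K
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
print(samples.shape)   # 3 sample functions evaluated at 100 points
```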
Gaussian Process for Regression:
Typically, we choose $k(\vec{x}_m,\vec{x}_n)=\theta_0 \exp\{-\frac{\theta_1}{2}\|\vec{x}_n-\vec{x}_m\|^2\}+\theta_2+\theta_3\,\vec{x}_m^T\vec{x}_n$, and assume that:
(1) Prior distribution $p(\vec{y}_N)=\mathrm{Gauss}(\vec{y}_N\,|\,\vec{0},K_N)$;
(2) Likelihood $p(\vec{t}_N\,|\,\vec{y}_N)=\mathrm{Gauss}(\vec{t}_N\,|\,\vec{y}_N,\beta^{-1}I_N)$.
Then we have $p(\vec{t}_N)=\int p(\vec{t}_N\,|\,\vec{y}_N)\,p(\vec{y}_N)\,d\vec{y}_N=\mathrm{Gauss}(\vec{t}_N\,|\,\vec{0},K_N+\beta^{-1}I_N)$. Here $p(\vec{t}_N)$ is the likelihood of the hyperparameters $\vec{\theta}$, so we can learn $\vec{\theta}$ by maximum likelihood estimation.
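A minimal sketch of this step, assuming the kernel above with hyperparameters $\vec{\theta}=(\theta_0,\theta_1,\theta_2,\theta_3)$ and a known noise precision $\beta$: the code evaluates $-\ln p(\vec{t}_N)$, which could then be handed to any optimizer to learn $\vec{\theta}$.

```python
import numpy as np

def kernel(xm, xn, th):
    th0, th1, th2, th3 = th
    return th0 * np.exp(-0.5 * th1 * np.sum((xm - xn) ** 2)) + th2 + th3 * (xm @ xn)

def neg_log_marginal_likelihood(th, X, t, beta=25.0):
    # -ln Gauss(t_N | 0, K_N + beta^{-1} I_N)
    N = len(X)
    K = np.array([[kernel(a, b, th) for b in X] for a in X])
    C = K + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + t @ np.linalg.solve(C, t) + N * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=20)
print(neg_log_marginal_likelihood([1.0, 4.0, 0.0, 0.0], X, t))
```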
Also, $p(\vec{t}_{N+1})=\mathrm{Gauss}(\vec{t}_{N+1}\,|\,\vec{0},K_{N+1}+\beta^{-1}I_{N+1})$. Hence, denoting $\vec{k}=[k(\vec{x}_1,\vec{x}_{N+1}),k(\vec{x}_2,\vec{x}_{N+1}),...,k(\vec{x}_N,\vec{x}_{N+1})]^T$, we get the conditional Gaussian $p(t_{N+1}\,|\,\vec{t}_N)=\mathrm{Gauss}(t_{N+1}\,|\,\vec{k}^T(K_N+\beta^{-1}I_N)^{-1}\vec{t}_N,\;k(\vec{x}_{N+1},\vec{x}_{N+1})-\vec{k}^T(K_N+\beta^{-1}I_N)^{-1}\vec{k}+\beta^{-1})$.
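A minimal sketch of these predictive equations, reusing the assumed kernel and noise precision from the previous sketch:

```python
import numpy as np

def kernel(xm, xn, th=(1.0, 4.0, 0.0, 0.0)):
    th0, th1, th2, th3 = th
    return th0 * np.exp(-0.5 * th1 * np.sum((xm - xn) ** 2)) + th2 + th3 * (xm @ xn)

def gp_predict(X, t, x_new, beta=25.0):
    N = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    C = K + np.eye(N) / beta                       # K_N + beta^{-1} I_N
    k = np.array([kernel(x, x_new) for x in X])    # vector k
    mean = k @ np.linalg.solve(C, t)               # k^T C^{-1} t_N
    var = kernel(x_new, x_new) - k @ np.linalg.solve(C, k) + 1.0 / beta
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=20)
print(gp_predict(X, t, np.array([0.5])))           # predictive mean and variance
```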
Gaussian Process for Classification:
We make the assumption that $p(t_n=1\,|\,a_n)=\sigma(a_n)$, where $a_n=a(\vec{x}_n)$ and $\sigma$ is the logistic sigmoid, and take the following steps (a numerical sketch follows the list):
(1) Calculate $p(\vec{a}_N\,|\,\vec{t}_N)$ by the Laplace approximation;
(2) Since the GP prior over $\vec{a}_{N+1}$ is jointly Gaussian, $p(a_{N+1}\,|\,\vec{a}_N)$ is a conditional Gaussian;
(3) $p(a_{N+1}\,|\,\vec{t}_N)=\int p(a_{N+1}\,|\,\vec{a}_N)\,p(\vec{a}_N\,|\,\vec{t}_N)\,d\vec{a}_N$;
(4) $p(t_{N+1}\,|\,\vec{t}_N)=\int \sigma(a_{N+1})\,p(a_{N+1}\,|\,\vec{t}_N)\,da_{N+1}$.
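A rough numerical sketch of these four steps, assuming a squared-exponential kernel, targets $t_n\in\{0,1\}$, and a plain Newton iteration for the Laplace mode (an illustrative implementation, not code from the text):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kernel(xm, xn):
    return np.exp(-0.5 * 10.0 * np.sum((xm - xn) ** 2))

def gp_classify(X, t, x_new, nu=1e-4, n_iter=50):
    N = len(X)
    C = np.array([[kernel(a, b) for b in X] for a in X]) + nu * np.eye(N)
    # (1) Laplace approximation: Newton iteration for the mode of p(a_N | t_N)
    a = np.zeros(N)
    for _ in range(n_iter):
        s = sigmoid(a)
        W = np.diag(s * (1 - s))
        a = C @ np.linalg.solve(np.eye(N) + W @ C, t - s + W @ a)
    s = sigmoid(a)
    W = np.diag(s * (1 - s))
    # (2)+(3) Gaussian approximation of p(a_{N+1} | t_N)
    k = np.array([kernel(x, x_new) for x in X])
    mean = k @ (t - s)
    var = kernel(x_new, x_new) + nu - k @ np.linalg.solve(np.linalg.inv(W) + C, k)
    # (4) integrate sigma(a_{N+1}) against that Gaussian, here on a simple grid
    grid = np.linspace(mean - 6 * np.sqrt(var), mean + 6 * np.sqrt(var), 400)
    gauss = np.exp(-0.5 * (grid - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.sum(sigmoid(grid) * gauss) * (grid[1] - grid[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)
print(gp_classify(X, t, np.array([1.0, 1.0])))   # probability that t_{N+1} = 1
```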
References:
1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Singapore: Springer, 2006.
PRML 5: Kernel Methods