1. Overview of machine learning
The information content of a random variable, as determined by its distribution, is measured by its entropy.
The maximum entropy configuration for a discrete variable is the uniform distribution, and for a continuous variable (with fixed mean and variance) it is the Gaussian distribution.
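As a small numerical illustration of the discrete case, the sketch below compares the entropy of a uniform distribution over four states with that of a skewed one. The helper name entropy and the two example distributions are chosen here for illustration only, and entropy is measured in nats.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a discrete distribution p (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# Two hypothetical distributions over four states.
uniform = np.full(4, 0.25)
skewed = np.array([0.7, 0.1, 0.1, 0.1])

h_uniform = entropy(uniform)  # equals log(4), the maximum for four states
h_skewed = entropy(skewed)    # strictly smaller than log(4)
```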
The additional amount of information required when we approximate the true distribution of a random variable with another distribution is called the relative entropy (KL divergence).
The KL divergence between the joint distribution and the product of the two marginals is called the mutual information.
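The two definitions can be sketched for discrete variables as follows. The function names and the example joint tables are assumptions made for illustration, and the code assumes q is nonzero wherever p is nonzero.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(joint):
    """KL divergence between a joint table p(x, y) and the product of its marginals."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal over y, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal over x, shape (1, ny)
    return kl_divergence(joint, px * py)

# Hypothetical joint tables: one dependent, one independent.
joint_dep = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
joint_ind = np.outer([0.5, 0.5], [0.3, 0.7])

mi_dep = mutual_information(joint_dep)  # positive, since x and y are dependent
mi_ind = mutual_information(joint_ind)  # zero up to rounding, since x and y are independent
```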
2. The Gaussian distribution
Partitioned Gaussians: Suppose x = [x1; x2] obeys the Gaussian distribution with the mean vector mu = [mu1; mu2]
and the covariance matrix SIG = [SIG11, SIG12; SIG21, SIG22], as well as the precision matrix
LAMB = [LAMB11, LAMB12; LAMB21, LAMB22] = inv(SIG), then we have:
(1) Marginal distribution: p(x1) = Gauss(mu1, SIG11), p(x2) = Gauss(mu2, SIG22);
(2) Conditional distribution: p(x1|x2) = Gauss(mu1 - inv(LAMB11)*LAMB12*(x2 - mu2), inv(LAMB11)).
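The conditional formula can be checked numerically. The sketch below uses a made-up 3-dimensional Gaussian, takes x1 as the first component and x2 as the remaining two, and compares the precision-based result with the equivalent covariance-based form (mean mu1 + SIG12*inv(SIG22)*(x2 - mu2), covariance SIG11 - SIG12*inv(SIG22)*SIG21). All numbers and index choices are assumptions for illustration.

```python
import numpy as np

# Hypothetical joint Gaussian over 3 dimensions.
mu = np.array([1.0, -1.0, 0.5])
SIG = np.array([[2.0, 0.6, 0.3],
                [0.6, 1.0, 0.2],
                [0.3, 0.2, 1.5]])
i1, i2 = [0], [1, 2]          # x1 = first component, x2 = the other two
x2 = np.array([0.0, 1.0])     # observed value of x2

# Precision-matrix form of the conditional p(x1 | x2).
LAMB = np.linalg.inv(SIG)
LAMB11 = LAMB[np.ix_(i1, i1)]
LAMB12 = LAMB[np.ix_(i1, i2)]
cond_cov = np.linalg.inv(LAMB11)
cond_mean = mu[i1] - cond_cov @ LAMB12 @ (x2 - mu[i2])

# Equivalent covariance-matrix form, used here only as a cross-check.
SIG11 = SIG[np.ix_(i1, i1)]
SIG12 = SIG[np.ix_(i1, i2)]
SIG22 = SIG[np.ix_(i2, i2)]
alt_mean = mu[i1] + SIG12 @ np.linalg.solve(SIG22, x2 - mu[i2])
alt_cov = SIG11 - SIG12 @ np.linalg.solve(SIG22, SIG12.T)
```

The marginals need no computation: p(x1) simply reads off mu[i1] and SIG11 from the joint parameters.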
Linear Gaussian model: Given p(x) = Gauss(mu, inv(LAMB)) and p(y|x) = Gauss(A*x + b, inv(L)), then we have:
(1) p(y) = Gauss(A*mu + b, inv(L) + A*inv(LAMB)*A')
(2) p(x|y) = Gauss(SIG*{A'*L*(y - b) + LAMB*mu}, SIG), where SIG = inv(LAMB + A'*L*A).
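A short numerical sketch of both results, under made-up values for mu, LAMB, A, b and L. As a sanity check, the posterior is also computed by conditioning the joint Gaussian of (x, y), whose cross-covariance is inv(LAMB)*A'; the two routes should agree.

```python
import numpy as np

# Hypothetical model parameters (2-dimensional x, 3-dimensional y).
mu = np.array([0.5, -1.0])
LAMB = np.array([[2.0, 0.3],
                 [0.3, 1.0]])          # precision of p(x)
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, -1.0]])
b = np.array([0.1, 0.2, 0.3])
L = np.diag([4.0, 2.0, 1.0])           # precision of p(y | x)
y = np.array([1.0, 0.0, -0.5])         # an observed y

# (1) Marginal p(y).
y_mean = A @ mu + b
y_cov = np.linalg.inv(L) + A @ np.linalg.inv(LAMB) @ A.T

# (2) Posterior p(x | y).
SIG = np.linalg.inv(LAMB + A.T @ L @ A)
post_mean = SIG @ (A.T @ L @ (y - b) + LAMB @ mu)

# Cross-check: condition the joint Gaussian of (x, y) on y.
Sx = np.linalg.inv(LAMB)
Sxy = Sx @ A.T
chk_mean = mu + Sxy @ np.linalg.solve(y_cov, y - y_mean)
chk_cov = Sx - Sxy @ np.linalg.solve(y_cov, Sxy.T)
```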
Maximum likelihood estimate: The mean vector can be estimated sequentially by mu_cnt = mu_(cnt-1) + (x_cnt - mu_(cnt-1))/cnt, whereas
the covariance matrix is obtained in batch form by SIG = sum_i((x_i - mu)*(x_i - mu)')/cnt.
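The two estimators can be sketched as below on synthetic data. The random seed, sample size and dimension are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # 200 synthetic samples, 3 dimensions

# Sequential update of the mean: one sample at a time, no need to store the data.
mu = np.zeros(3)
for cnt, x in enumerate(X, start=1):
    mu = mu + (x - mu) / cnt

# Batch maximum likelihood covariance (divides by cnt, not cnt - 1).
diff = X - mu
SIG = diff.T @ diff / len(X)
```

The sequential mean agrees with X.mean(axis=0), and SIG agrees with np.cov(X, rowvar=False, bias=True), up to rounding.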
The distribution of the parameters can be obtained by Bayesian inference so long as we maintain the sufficient statistics.
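As one concrete case, consider inferring the mean of a univariate Gaussian with known variance under a Gaussian prior. The sufficient statistics are then just the sample count and the sample sum, so the data can be discarded as they stream in. The sketch below assumes this specific setting; the function name, prior values and synthetic data are illustrative choices.

```python
import numpy as np

def posterior_mean_known_var(n, sum_x, sigma2, mu0, s02):
    """Posterior Gauss(mu_n, s_n2) over the mean, given count n and sum of observations.

    sigma2 is the known data variance; mu0 and s02 are the prior mean and variance.
    """
    prec = 1.0 / s02 + n / sigma2                # posterior precision
    s_n2 = 1.0 / prec
    mu_n = s_n2 * (mu0 / s02 + sum_x / sigma2)   # precision-weighted combination
    return mu_n, s_n2

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)   # synthetic observations

# Maintain only the sufficient statistics while streaming through the data.
n, sum_x = 0, 0.0
for x in data:
    n += 1
    sum_x += x

mu_n, s_n2 = posterior_mean_known_var(n, sum_x, sigma2=1.0, mu0=0.0, s02=10.0)
```

With a broad prior the posterior mean ends up close to the sample mean, and the posterior variance shrinks as the count grows.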
References:
1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Singapore: Springer, 2006.