**1. Linear Regression Model**

The **least squares method** is derived from maximum likelihood under the assumption of a Gaussian noise distribution, and hence can give rise to the problem of **over-fitting**, which is a general property of MLE.

**Regularized least squares**, the generalized form of the least squares method, stems from a simple Bayesian approach called **maximum a posteriori (MAP) estimation**. Given a certain MAP model, we can deduce the following closed-form solution to linear regression:

```matlab
function w = linRegress(X, t, lamb)
% closed-form solution of MAP linear regression
% precondition:  X is a set of data columns,
%                row vector t is the labels of X
% postcondition: w is the linear model parameter
%                such that y = w' * X
if (nargin < 3)
    lamb = 0;  % MLE, no regularizer (penalty term)
end
m = size(X, 1);  % m-1 features, one constant term
w = (X * X' + lamb * eye(m)) \ X * t';
end
```
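For readers without MATLAB, the same regularized normal equations (X*X' + lamb*I) * w = X*t' can be sketched in plain Python. The function names here (`ridge_fit`) are my own, and a tiny Gaussian-elimination solver stands in for MATLAB's backslash operator; this is a minimal sketch, not a production implementation.

```python
def ridge_fit(X, t, lamb=0.0):
    """MAP / ridge closed form: solve (X X' + lamb I) w = X t'.

    X: list of m feature rows, each of length n (data as columns);
    t: list of n target values.  Pure-Python sketch only.
    """
    m, n = len(X), len(X[0])
    # A = X X' + lamb I  (m x m),  b = X t'  (length m)
    A = [[sum(X[i][k] * X[j][k] for k in range(n)) + (lamb if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(X[i][k] * t[k] for k in range(n)) for i in range(m)]
    # solve A w = b by Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * m
    for r in range(m - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, m))) / A[r][r]
    return w
```

With `lamb = 0` this reduces to the ordinary least squares solution; for example, fitting a constant row plus one feature row to targets generated by y = 2x + 1 recovers the weights (1, 2) exactly.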

However, batch techniques, which run over the entire training set, can be computationally costly, so we need some effective on-line algorithms:

```matlab
function w = linRegress(X, t, err)
% batch gradient descent for linear regression
% by using the Newton-Raphson method
% precondition:  X is a set of data columns,
%                row vector t is the labels of X
% postcondition: w is the linear model parameter
%                such that y = w' * X
if (nargin < 3)
    err = 0.0001;
end
m = size(X, 1);
w = zeros(m, 1);
grad = calGrad(X, t, w);
while (norm(grad) >= err)
    w = w - calHess(X) \ grad;
    grad = calGrad(X, t, w);
end
end

function grad = calGrad(X, t, w)
% gradient of the error function
[m, n] = size(X);
grad = zeros(m, 1);
for i = 1:n
    grad = grad + (w' * X(:, i) - t(i)) * X(:, i);
end
end

function hess = calHess(X)
% Hessian matrix of the error function
m = size(X, 1);
hess = zeros(m);
for i = 1:m
    for j = 1:m
        hess(i, j) = X(i, :) * X(j, :)';
    end
end
end
```
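Because the sum-of-squares error is quadratic in **w**, its Hessian X*X' is constant and a single Newton-Raphson step from any starting point already lands on the closed-form solution. The Python sketch below illustrates one such step (the helper names are my own; Cramer's rule stands in for the matrix solve, so it only handles two weights):

```python
def solve2(A, b):
    # Cramer's rule for a 2x2 linear system (sketch only)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def newton_step(X, t, w):
    """One Newton-Raphson step for the sum-of-squares error.
    grad_i = sum_k (w' x_k - t_k) x_k[i], Hessian = X X'."""
    m, n = len(X), len(X[0])
    resid = [sum(w[i] * X[i][k] for i in range(m)) - t[k] for k in range(n)]
    grad = [sum(resid[k] * X[i][k] for k in range(n)) for i in range(m)]
    hess = [[sum(X[i][k] * X[j][k] for k in range(n)) for j in range(m)]
            for i in range(m)]
    step = solve2(hess, grad)
    return [w[i] - step[i] for i in range(m)]
```

Starting from w = (0, 0) on the same y = 2x + 1 toy data, one step returns (1, 2), matching the closed-form answer.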

In the frequentist viewpoint of **model complexity**, the expected squared loss can be decomposed into a squared bias (the difference between the average prediction and the desired one), a variance term (sensitivity to data sets) and a constant noise term: expected loss = (bias)^2 + variance + noise.

Bayesian model comparison would choose **model averaging** instead, which is also known as the fully Bayesian treatment. Given the prior distribution of models (hyper-parameters) p(Mi) and the **marginal likelihood** p(D|Mi) (a convolution of p(D|**w**, Mi) and p(**w**|Mi)), we can deduce the posterior distribution of models p(Mi|D). To make a prediction, we just marginalize with respect to both the parameters and the hyper-parameters.

However, the computations in the fully Bayesian treatment are usually intractable. To make an approximation, we can carry out **model selection** in light of the model posterior distribution (MAP estimate), and just marginalize over the parameters. Here we take linear regression as an example to illustrate how to implement such evidence approximation once we have the optimal hyper-parameters.

First of all, given p(**w**) = Gauss(**w**|**m0**, S0), we can infer the posterior distribution of the parameters:

p(**w**|X, **t**) = Gauss(**w**|**mN**, SN), where **mN** = SN * (inv(S0) * **m0** + beta * X * **t**'), inv(SN) = inv(S0) + beta * X * X'.

Then we shall calculate the convolution of the likelihood of t and the posterior of **w** to get the predictive distribution:

p(t|**x**, X, **t**) = Gauss(t|**mN**' * **x**, 1/beta + **x**' * SN * **x**).
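The two formulas above can be made concrete in the simplest case of a single weight and no bias, where S0, SN and **x** are all scalars. The following Python sketch (function name and default hyper-parameter values are my own choices) computes the posterior mean and variance and returns the predictive mean and variance for a new input:

```python
def bayes_linreg_1d(xs, ts, beta=25.0, s0=1.0, m0=0.0):
    """Scalar Bayesian linear regression (one weight, no bias) -- a sketch.

    Posterior:   inv(sN) = inv(s0) + beta * sum(x^2)
                 mN = sN * (m0 / s0 + beta * sum(x * t))
    Predictive:  mean = mN * x,  var = 1/beta + x^2 * sN
    """
    inv_sN = 1.0 / s0 + beta * sum(x * x for x in xs)
    sN = 1.0 / inv_sN
    mN = sN * (m0 / s0 + beta * sum(x * t for x, t in zip(xs, ts)))

    def predict(x):
        # predictive mean and variance at input x
        return mN * x, 1.0 / beta + x * x * sN

    return mN, sN, predict
```

On data drawn from t = 2x, the posterior mean is pulled from the prior mean 0 toward 2, and the predictive variance is always at least the noise floor 1/beta plus a term reflecting the remaining uncertainty in the weight.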

We can easily find that the predicted value (the mean) is a linear combination of the training set target variables.

**2. Logistic Regression Model**

Classification by the **generative approach** amounts to maximizing the likelihood, the product of all p(**xn**, Ck)^tnk, to get the prior class probabilities p(Ck) (usually Nk/N) and the class-conditional distributions p(**xn**|Ck) (usually in the form of a Gaussian or a multinomial). An alternative is the **discriminative approach**, in which we directly presume the form of the posterior class distribution p(Ck|**xn**) (usually a parametric softmax function) and estimate the parameters by maximizing the likelihood.

For example, in 2-class logistic regression, we presume the form of p(C1|**x**) is y(**x**) = sigmoid(**w**' * **x** + b), and by taking the negative logarithm of the likelihood we get the **cross-entropy error function**:

E(**w**) = -sum_n (tn * ln(yn) + (1 - tn) * ln(1 - yn))

where tn indicates whether **xn** belongs to C1, and yn = p(C1|**xn**). To minimize it, we can use the **Newton-Raphson method** to update the parameters iteratively:

**w**(new) = **w**(old) - inv(X * R * X') * X * (**y** - **t**)', where R is the diagonal matrix with Rnn = yn * (1 - yn).

```matlab
function y = logRegress(X, t, x)
% logistic regression for 2-class classification
% precondition:  X is a set of data columns for training,
%                row vector t is the labels of X (+1 or -1),
%                x is a data column for testing
% postcondition: y is the predicted label of data x
m = size(X, 1);
options = optimset('GradObj', 'on', 'MaxIter', 100);  % iteration cap; original value unreadable
w = fminunc(@logRegCost, rand(m, 1), options);

    function [val, grad] = logRegCost(w)
    % determine the value and the gradient
    % of the error function
    q = (t + 1) / 2;
    p = 1 ./ (1 + exp(-w' * X));
    val = -q * log(p') - (1 - q) * log(1 - p');
    grad = X * (p - q)' / size(q, 2);
    end

y = 2 * round(1 ./ (1 + exp(-w' * x))) - 1;
end
```
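Without MATLAB's `fminunc`, the same model can be trained by plain gradient descent on the cross-entropy, whose gradient is X * (**y** - **t**)'. The Python sketch below is my own minimal version (labels in {0, 1} rather than ±1, fixed learning rate and iteration count), not a drop-in replacement for the code above:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logreg_fit(X, t, lr=0.1, iters=2000):
    """2-class logistic regression by plain gradient descent -- a sketch.
    X: list of m feature rows, each of length n; t: n labels in {0, 1}.
    Minimizes E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)],
    whose gradient with respect to w is X (y - t)."""
    m, n = len(X), len(X[0])
    w = [0.0] * m
    for _ in range(iters):
        y = [sigmoid(sum(w[i] * X[i][k] for i in range(m))) for k in range(n)]
        grad = [sum((y[k] - t[k]) * X[i][k] for k in range(n)) for i in range(m)]
        w = [w[i] - lr * grad[i] for i in range(m)]
    return w

def logreg_predict(w, x):
    # threshold the predicted probability p(C1 | x) at 0.5
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0
```

On a linearly separable toy set (one constant row plus one feature row), the learned weights classify new points on either side of the boundary correctly.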

When it comes to Bayesian methods, we should first know the technique of **Laplace approximation**, which constructs a Gaussian distribution to approximate a function f(**z**) around a global maximum point **z0**:

f(**z**) ≈ f(**z0**) * exp(-(**z** - **z0**)' * A * (**z** - **z0**) / 2), i.e. q(**z**) = Gauss(**z**|**z0**, inv(A)),

where A is the Hessian matrix of the negative logarithm of f(**z**) at the point **z0**.

To tackle logistic regression in a Bayesian way, given a Gaussian prior p(**w**) = Gauss(**w**|**m0**, S0), we should first use the method above to approximate the posterior distribution p(**w**|X, **t**) as Gauss(**w**|**w**MAP, SN), and then approximate the predictive distribution (the convolution of p(C1|**x**, **w**) and p(**w**|X, **t**)) as σ(κ(σa²) * μa), where

κ(σa²) = 1 / sqrt(1 + π * σa² / 8), σa² = **x**' * SN * **x**, μa = **w**MAP' * **x**.
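This final correction is easy to compute once μa and σa² are known. A small Python sketch (function name is my own) shows the effect: when the posterior variance σa² is zero the prediction is just sigmoid(μa), and as the variance grows the probability is squashed toward the non-committal value 0.5:

```python
import math

def laplace_predictive(mu_a, sigma2_a):
    """Approximate Bayesian logistic predictive -- a sketch.
    p(C1 | x) ~= sigmoid(kappa(sigma_a^2) * mu_a),
    where kappa(s2) = 1 / sqrt(1 + pi * s2 / 8)."""
    kappa = 1.0 / math.sqrt(1.0 + math.pi * sigma2_a / 8.0)
    return 1.0 / (1.0 + math.exp(-kappa * mu_a))
```

For example, with μa = 0 the prediction is exactly 0.5 regardless of the variance, while for μa > 0 a larger σa² pulls the probability down from sigmoid(μa) toward 0.5.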

**References:**

1. Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Singapore: Springer, 2006.

PRML 2: Regression Models