Generalized Linear Model
We have seen examples of regression and classification in the preceding sections. In the regression example, $y \mid x;\theta \sim N(\mu,\sigma^{2})$; in the classification example, $y \mid x;\theta \sim \text{Bernoulli}(\phi)$.
The generalized linear model is built on the exponential family of distributions, whose canonical form is:
$p(y;\eta) = b(y)\exp(\eta^{T}T(y) - a(\eta))$
Here $\eta$ is the natural parameter and $T(y)$ is the sufficient statistic; usually $T(y) = y$. Fixing a choice of $T$, $a$, and $b$ defines a family of distributions parameterized by $\eta$.
For the Bernoulli distribution with mean $\phi$, we have:
$p(y=1;\phi) = \phi; \quad p(y=0;\phi) = 1-\phi$
$p(y;\phi) = \phi^{y}(1-\phi)^{1-y}$
$p(y;\phi) = \exp(y\log\phi + (1-y)\log(1-\phi))$
$p(y;\phi) = \exp\left(\left(\log\frac{\phi}{1-\phi}\right)y + \log(1-\phi)\right)$
So we have:
$\eta = \log\frac{\phi}{1-\phi}$ (equivalently, $\phi = \frac{1}{1+e^{-\eta}}$)
$T(y) = y$
$a(\eta) = -\log(1-\phi) = \log(1+e^{\eta})$
$b(y) = 1$
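As a quick sanity check, here is a minimal Python sketch (the function names are only illustrative) that verifies numerically that the Bernoulli mass $\phi^{y}(1-\phi)^{1-y}$ equals the exponential-family form $b(y)\exp(\eta T(y) - a(\eta))$ with $\eta = \log(\phi/(1-\phi))$:

```python
import numpy as np

def bernoulli_pmf(y, phi):
    """Standard Bernoulli form: phi^y * (1 - phi)^(1 - y)."""
    return phi**y * (1 - phi)**(1 - y)

def bernoulli_exp_family(y, phi):
    """Exponential-family form: b(y) * exp(eta * T(y) - a(eta))."""
    eta = np.log(phi / (1 - phi))   # natural parameter
    a = np.log(1 + np.exp(eta))     # log-partition function a(eta) = -log(1 - phi)
    b = 1.0                         # base measure b(y)
    return b * np.exp(eta * y - a)

phi = 0.3
for y in (0, 1):
    print(bernoulli_pmf(y, phi), bernoulli_exp_family(y, phi))  # identical values
```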
For the Gaussian distribution (taking $\sigma^{2} = 1$ for simplicity, since $\sigma^{2}$ does not affect the choice of $\theta$), we have:
$p(y;\mu) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(y-\mu)^{2}\right)$
$p(y;\mu) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)\cdot \exp\left(\mu y - \frac{1}{2}\mu^{2}\right)$
So we have:
$\eta = \mu$
$T(y) = y$
$a(\eta) = \frac{\mu^{2}}{2} = \frac{\eta^{2}}{2}$
$b(y) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)$
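Similarly, a short sketch (hypothetical helper names) confirms that $b(y)\exp(\eta y - \eta^{2}/2)$ with $\eta = \mu$ reproduces the $N(\mu, 1)$ density:

```python
import numpy as np

def gaussian_pdf(y, mu):
    """N(mu, 1) density in its usual form."""
    return np.exp(-0.5 * (y - mu)**2) / np.sqrt(2 * np.pi)

def gaussian_exp_family(y, mu):
    """Exponential-family form with eta = mu, T(y) = y, a(eta) = eta^2 / 2."""
    eta = mu
    b = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)  # b(y)
    return b * np.exp(eta * y - eta**2 / 2)

print(gaussian_pdf(1.2, 0.5), gaussian_exp_family(1.2, 0.5))  # identical values
```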
Constructing a GLM
1. $y \mid x;\theta \sim \text{ExponentialFamily}(\eta)$.
2. Given $x$, the goal is to predict $T(y)$; in most cases $T(y) = y$, so we choose the hypothesis to output $h(x) = E[y \mid x]$.
3. The natural parameter $\eta$ and the input $x$ are linearly related: $\eta = \theta^{T}x$ (a small sketch combining these three assumptions follows the list).
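The three assumptions combine into a single recipe: compute $\eta = \theta^{T}x$, then pass it through the response (inverse link) function of the chosen distribution to obtain $h_{\theta}(x) = E[y \mid x]$. A minimal sketch of that recipe, using the canonical response functions of the two distributions derived above (the helper names are only illustrative):

```python
import numpy as np

def glm_predict(theta, x, response):
    """Generic GLM hypothesis: h(x) = response(theta^T x) (assumptions 3 and 2)."""
    eta = theta @ x            # natural parameter, linear in x
    return response(eta)       # E[y | x] for the chosen distribution

identity = lambda eta: eta                     # Gaussian  -> ordinary least squares
sigmoid  = lambda eta: 1 / (1 + np.exp(-eta))  # Bernoulli -> logistic regression

theta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])       # first component is the intercept term
print(glm_predict(theta, x, identity), glm_predict(theta, x, sigmoid))
```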
Ordinary least squares
Ordinary least squares is a special case of the GLM: $y$ is continuous, and the conditional distribution of $y$ given $x$ is modeled as a Gaussian $N(\mu,\sigma^{2})$. We therefore take the Gaussian as the exponential family distribution. As shown above, writing the Gaussian in exponential family form gives $\mu = \eta$. So we have:
$h_{\theta}(x) = E\left[y \mid x;\theta\right] = \mu = \eta = \theta^{T}x$
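Under these assumptions, $\theta$ can be fit by maximizing the Gaussian likelihood, which for a fixed $\sigma^{2}$ reduces to least squares. A minimal sketch using the normal equations (the toy data is made up for illustration):

```python
import numpy as np

# Toy data: X already includes a column of ones for the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.9, 2.1, 2.9, 4.2])

# Maximum likelihood under y | x ~ N(theta^T x, sigma^2) = ordinary least squares.
theta = np.linalg.solve(X.T @ X, X.T @ y)

h = X @ theta   # h_theta(x) = E[y | x] = theta^T x for each row of X
print(theta, h)
```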
Logistic regression
In logistic regression, $y$ takes only the values 0 and 1, so the Bernoulli distribution is used as the exponential family distribution, which gives $\phi = \frac{1}{1+e^{-\eta}}$. Furthermore, since $y \mid x;\theta \sim \text{Bernoulli}(\phi)$, we have $E\left[y \mid x;\theta\right] = \phi$, and therefore:
$h_{\theta}(x) = E\left[y \mid x;\theta\right]$
$h_{\theta}(x) = \phi$
$h_{\theta}(x) = \frac{1}{1+e^{-\eta}}$
$h_{\theta}(x) = \frac{1}{1+e^{-\theta^{T}x}}$
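A minimal sketch of how this hypothesis is fit in practice: batch gradient ascent on the Bernoulli log-likelihood (the toy data, step size, and iteration count are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: X includes an intercept column, y in {0, 1}.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])
alpha = 0.1
for _ in range(500):
    h = sigmoid(X @ theta)            # h_theta(x) = 1 / (1 + exp(-theta^T x))
    theta += alpha * X.T @ (y - h)    # gradient of the log-likelihood

print(theta, sigmoid(X @ theta))      # predicted probabilities approach y
```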
Softmax regression
In logistic regression, $y$ takes only two discrete values; now consider the case where $y$ takes one of $k$ values, $y \in \{1, 2, \ldots, k\}$.
To parameterize a multinomial over $k$ possible outcomes, we could use $k$ parameters $\phi_{1},\ldots,\phi_{k}$ giving the probability of each outcome. However, these parameters would be redundant, because they must sum to 1. So we parameterize only $k-1$ of them: $\phi_{i} = p(y=i;\phi)$ for $i = 1,\ldots,k-1$, and $p(y=k;\phi) = 1-\sum_{i=1}^{k-1}\phi_{i}$. For convenience we write $\phi_{k} = 1-\sum_{i=1}^{k-1}\phi_{i}$, but remember that $\phi_{k}$ is not a parameter; it is determined by the other $k-1$ (for example, with $k=3$, $\phi_{1}=0.2$ and $\phi_{2}=0.5$ force $\phi_{3}=0.3$).
To show that the multinomial is a member of the exponential family, define $T(y) \in \mathbb{R}^{k-1}$ as follows:
$ T (1) =\begin{bmatrix} 1\\ 0\\ 0\\ \vdots \\0 \end{bmatrix}$
$ T (2) =\begin{bmatrix} 0\\ 1\\ 0\\ \vdots \\0 \end{bmatrix}$
$ T (k-1) =\begin{bmatrix} 0\\ 0\\ 0\\ \vdots \\1 \end{bmatrix}$
$T(k) =\begin{bmatrix} 0\\ 0\\ 0\\ \vdots \\0 \end{bmatrix}$
Unlike before, $T(y)$ is not equal to $y$; here $T(y)$ is a $(k-1)$-dimensional vector rather than a real number. We write $(T(y))_{i}$ for the $i$-th element of $T(y)$.
Next define the indicator function $1\{\cdot\}$: its value is 1 when its argument is true and 0 otherwise. For example, $1\{2=3\}=0$ and $1\{2=2\}=1$.
Therefore $(T(y))_{i} = 1\{y=i\}$, and furthermore $E[(T(y))_{i}] = P(y=i) = \phi_{i}$.
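A small sketch of this encoding (the helper name `T_vec` is hypothetical): $T(y)$ is a $(k-1)$-vector whose $i$-th entry is the indicator $1\{y=i\}$, so the class $y=k$ maps to the all-zeros vector:

```python
import numpy as np

def T_vec(y, k):
    """(k-1)-dimensional encoding of y in {1, ..., k}: (T(y))_i = 1{y == i}."""
    t = np.zeros(k - 1)
    if y < k:
        t[y - 1] = 1.0
    return t

k = 4
print(T_vec(2, k))  # [0. 1. 0.]
print(T_vec(4, k))  # [0. 0. 0.]  -> the class y = k has no 1 anywhere
```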
It follows that the multinomial also belongs to the exponential family:
$p(y;\phi) = \phi_{1}^{1\{y=1\}} \phi_{2}^{1\{y=2\}} \cdots \phi_{k}^{1\{y=k\}}$
$p(y;\phi) = \phi_{1}^{1\{y=1\}} \phi_{2}^{1\{y=2\}} \cdots \phi_{k}^{1-\sum_{i=1}^{k-1}(T(y))_{i}}$
$p(y;\phi) = \phi_{1}^{(T(y))_{1}} \phi_{2}^{(T(y))_{2}} \cdots \phi_{k}^{1-\sum_{i=1}^{k-1}(T(y))_{i}}$
$p(y;\phi) = \exp\left((T(y))_{1}\log(\phi_{1}) + (T(y))_{2}\log(\phi_{2}) + \cdots + \left(1-\sum_{i=1}^{k-1}(T(y))_{i}\right)\log(\phi_{k})\right)$
$p(y;\phi) = \exp\left((T(y))_{1}\log(\phi_{1}/\phi_{k}) + (T(y))_{2}\log(\phi_{2}/\phi_{k}) + \cdots + (T(y))_{k-1}\log(\phi_{k-1}/\phi_{k}) + \log(\phi_{k})\right)$
$p(y;\phi) = b(y)\exp(\eta^{T}T(y) - a(\eta))$
where
$\eta =\begin{bmatrix} \log(\phi_{1}/\phi_{k}) \\ \log(\phi_{2}/\phi_{k}) \\ \vdots \\ \log(\phi_{k-1}/\phi_{k}) \end{bmatrix}$
$a(\eta) = -\log(\phi_{k})$
$b (y) =1$
This gives the link function between $\eta$ and $\phi$:
$\eta_{i} = \log\frac{\phi_{i}}{\phi_{k}}$
For convenience, we also define:
$\eta_{k} = \log(\phi_{k}/\phi_{k}) = 0$
Inverting this relation, and using the fact that the $\phi_{i}$ sum to 1:
$e^{\eta_{i}} = \frac{\phi_{i}}{\phi_{k}}$
$\phi_{k}e^{\eta_{i}} = \phi_{i}$
$\phi_{k}\sum_{i=1}^{k}e^{\eta_{i}} = \sum_{i=1}^{k}\phi_{i} = 1$
So we get the following response function:
$\phi_{i}= \frac{e^{\eta_{i}}}{\sum_{j=1}^{k}e^{\eta_{j}}}$
This function mapping $\eta$ to $\phi$ is called the softmax function.
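A direct implementation of this response function (the max-subtraction step is a standard numerical-stability trick; it does not change the result because it cancels in the ratio):

```python
import numpy as np

def softmax(eta):
    """Map natural parameters eta (length k, with eta_k = 0 by convention) to phi."""
    z = eta - np.max(eta)           # stability shift; cancels in the ratio
    e = np.exp(z)
    return e / e.sum()

eta = np.array([2.0, 1.0, 0.0])    # eta_k = 0 as defined above
phi = softmax(eta)
print(phi, phi.sum())              # class probabilities summing to 1
```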
As in assumption 3, let $\eta_{i} = \theta_{i}^{T}x$ for $i = 1,2,\ldots,k-1$, where $\theta_{1},\ldots,\theta_{k-1} \in \mathbb{R}^{n+1}$ (and, consistent with $\eta_{k}=0$, define $\theta_{k}=0$ so that $\eta_{k} = \theta_{k}^{T}x = 0$).
Therefore, the conditional distribution of each class is:
$p(y=i \mid x;\theta) = \phi_{i}$
$p(y=i \mid x;\theta) = \frac{e^{\eta_{i}}}{\sum_{j=1}^{k}e^{\eta_{j}}}$
$p(y=i \mid x;\theta) = \frac{e^{\theta_{i}^{T}x}}{\sum_{j=1}^{k}e^{\theta_{j}^{T}x}}$
Loss function and maximum likelihood estimate: the parameters are fit by maximizing the log-likelihood over the training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$:
$\ell(\theta) = \sum_{i=1}^{m}\log p(y^{(i)} \mid x^{(i)};\theta) = \sum_{i=1}^{m}\log \prod_{l=1}^{k}\left(\frac{e^{\theta_{l}^{T}x^{(i)}}}{\sum_{j=1}^{k}e^{\theta_{j}^{T}x^{(i)}}}\right)^{1\{y^{(i)}=l\}}$
Maximizing $\ell(\theta)$ (for example by gradient ascent or Newton's method) is equivalent to minimizing the cross-entropy loss $-\ell(\theta)$.
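A minimal sketch of this loss for softmax regression (the array shapes, zero-based class labels, and toy data are assumptions for illustration): it computes the negative log-likelihood of the observed classes under the softmax probabilities:

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # stability trick, cancels in the ratio
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def neg_log_likelihood(Theta, X, y):
    """-l(theta): cross-entropy of the observed classes y in {0, ..., k-1}.
    Theta has shape (n+1, k); X has shape (m, n+1)."""
    P = softmax_rows(X @ Theta)            # p(y = l | x; theta) for every class l
    m = X.shape[0]
    return -np.log(P[np.arange(m), y]).sum()

# Toy example: m = 3 examples, n+1 = 2 features (with intercept), k = 3 classes.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([0, 2, 1])
Theta = np.zeros((2, 3))
print(neg_log_likelihood(Theta, X, y))     # = 3 * log(3) for uniform probabilities
```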