Supervised learning

Consider a house price prediction system. We are given the living area and the price of a number of houses, and we plot each house as a point, with area on one axis and price on the other.

**Defining symbols:**

\(x^{(i)}\) represents the input features (here, the living area).

\(y^{(i)}\) represents the output, or target, value (here, the price).

\((x^{(i)},y^{(i)})\) represents a training sample.

\(\left\{(x^{(i)},y^{(i)});\ i=1,\dots,m\right\}\) represents a set of \(m\) training samples, also known as the training set.

The superscript \((i)\) is the index of the sample in the training set; it is not an exponent.

\(\mathcal{X}\) represents the space of input values and \(\mathcal{Y}\) represents the space of output values. In this example, \(\mathcal{X}=\mathcal{Y}=\mathbb{R}\).

Given a training set, the goal of supervised learning is to learn a function \(h: \mathcal{X} \mapsto \mathcal{Y}\) such that \(h(x)\) is a "good" predictor of the corresponding value of \(y\). The function \(h\) is called a **hypothesis**.

If the target value we are predicting is continuous, the problem is called a **regression** problem. If \(y\) can take only a small number of discrete values, it is called a **classification** problem.

Linear regression

Suppose the price depends not only on the area but also on the number of bedrooms.

Then \(x\) is a **2-dimensional vector** in \(\mathbb{R}^2\), where \(x_1^{(i)}\) is the living area of the \(i\)-th sample and \(x_2^{(i)}\) is the number of bedrooms of the \(i\)-th sample.

We now decide to approximate \(y\) as a linear function of \(x\):

\[h_{\theta} (x) =\theta_0+\theta_1x_1+\theta_2x_2\]

The \(\theta_i\) are the parameters (weights) that parameterize the space of linear functions mapping from \(\mathcal{X}\) to \(\mathcal{Y}\). Writing the formula more compactly:

\[h(x) =\sum_{i=0}^n \theta_i x_i=\theta^T x\]

where \(x_0 = 1\), so that \(x_0\theta_0=\theta_0\) is the intercept term. \(\theta\) and \(x\) are both vectors, and \(n\) is the number of input features (not counting \(x_0\)).

To learn the parameters \(\theta\), we define the **loss function** (also called the cost function):

\[J(\theta) =\frac{1}{2}\sum_{i=1}^m \left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2\]

This is the loss function of the ordinary least squares regression model.
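As a rough illustration, the hypothesis and the loss function above might be written with NumPy as follows. The function names `hypothesis` and `cost` and the toy numbers are made up for this sketch, and each row of `X` is assumed to already start with \(x_0 = 1\).

```python
import numpy as np

def hypothesis(theta, X):
    """Return h_theta(x) = theta^T x for every row x of X."""
    return X @ theta

def cost(theta, X, y):
    """Return J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residual = hypothesis(theta, X) - y
    return 0.5 * np.sum(residual ** 2)

# Made-up toy data: the columns are x_0 = 1, area, number of bedrooms.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 2.0],
              [1.0, 3.0, 4.0]])
y = np.array([5.0, 3.0, 7.0])

print(cost(np.zeros(3), X, y))                # all-zero parameters
print(cost(np.array([0.0, 1.0, 1.0]), X, y))  # parameters that fit the toy data exactly
```

The second call returns zero because those parameters reproduce every toy target exactly; real data would rarely allow that.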

1 LMS (least mean squares) algorithm

To choose \(\theta\) so as to minimize \(J(\theta)\), we start with an initial guess for \(\theta\) (for example, random values) and then use a search algorithm that repeatedly updates \(\theta\) so that it converges to a value that minimizes \(J(\theta)\). Here we use the gradient descent algorithm: first initialize \(\theta\), then repeatedly perform the update:

\[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)\]

\(\alpha\) is the learning rate. The update is performed simultaneously for all values of \(j = 0,\dots,n\).

We first work out the partial derivative for the case of a single training sample \((x, y)\), so that the sum in the definition of \(J\) can be dropped:

\[\begin{align} \frac{\partial}{\partial \theta_j}J(\theta) &=\frac{\partial}{\partial \theta_j}\frac{1}{2}\left(h_{\theta}(x)-y\right)^2\\ &=2\cdot\frac{1}{2}\left(h_{\theta}(x)-y\right)\cdot\frac{\partial}{\partial \theta_j}\left(h_{\theta}(x)-y\right)\\ &=\left(h_{\theta}(x)-y\right)\cdot\frac{\partial}{\partial \theta_j}\left(\sum_{i=0}^n \theta_i x_i-y\right)\\ &=\left(h_{\theta}(x)-y\right)x_j \end{align}\]

For a single training sample, this gives the update rule (the **LMS update rule**, also known as the **Widrow-Hoff learning rule**):

\[\theta_j:=\theta_j+\alpha\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x_j^{(i)}\]

The magnitude of the update is proportional to the error term \(\left(y^{(i)}-h_{\theta}(x^{(i)})\right)\). If the prediction nearly matches the target value, \(\theta\) changes very little; if the prediction has a large error, \(\theta\) changes more.

**Batch gradient descent:**

Repeat until convergence \(\{\)

\[\theta_j:=\theta_j+\alpha \sum_{i=1}^m \left(y^{(i)}-h_{\theta}(x^{(i)})\right) x_j^{(i)} \qquad \text{(for every } j\text{)}\]

\(\}\)
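The batch update can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name, the learning rate `alpha=0.05`, the fixed iteration count (used here in place of a real convergence test) and the toy data are all made up for the example.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_iters=2000):
    """Minimize J(theta) with the batch update rule (illustrative settings)."""
    theta = np.zeros(X.shape[1])               # initial guess
    for _ in range(n_iters):
        error = y - X @ theta                  # y^(i) - h_theta(x^(i)) for every i
        theta = theta + alpha * (X.T @ error)  # sums over all samples, updates every j at once
    return theta

# Made-up data lying exactly on y = 1 + 2x; the first column is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = batch_gradient_descent(X, y)
print(theta)  # should be close to [1, 2]
```

Note that `X.T @ error` computes the sum over all \(m\) samples for every \(j\) in one step, which is why each update has to touch the whole training set. If `alpha` is chosen too large, the iteration diverges instead of converging.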

**The gradient descent process**

Once the parameters \(\theta\) have been computed, substitute them into \(h_{\theta}(x)\) and plot it as a function of \(x\):

**Figure: housing area vs. price**

**Stochastic gradient descent (incremental gradient descent):**

\[\begin{align} &\text{Loop } \{\\ &\qquad \text{for } i=1 \text{ to } m\ \{\\ &\qquad\qquad \theta_j:=\theta_j+\alpha \left(y^{(i)}-h_{\theta}(x^{(i)})\right) x_j^{(i)} \qquad \text{(for every } j\text{)}\\ &\qquad \}\\ &\} \end{align}\]

Each update uses only a single training sample.

Stochastic gradient descent is often preferred over batch gradient descent when the training set is large, because batch gradient descent has to scan the entire training set before it can take a single step.
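For comparison, here is a minimal sketch of the per-sample update. The function name, the learning rate, the number of passes over the data and the toy data are assumptions made for illustration; the samples are visited in order, from \(i = 1\) to \(m\), as in the loop above.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=2000):
    """Apply the LMS update one sample at a time (illustrative settings)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):              # the outer "Loop"
        for i in range(len(y)):            # for i = 1 to m
            error = y[i] - X[i] @ theta    # uses only sample i
            theta = theta + alpha * error * X[i]
    return theta

# Made-up data lying exactly on y = 1 + 2x; the first column is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = stochastic_gradient_descent(X, y)
print(theta)  # should be close to [1, 2]
```

Because the toy data is noise-free, the parameters settle on the exact line. With noisy data and a fixed `alpha`, the parameters typically keep oscillating around the minimum of \(J(\theta)\) rather than converging exactly.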

2 Normal equation

Gradient descent is one way of minimizing \(J\). The normal equation is another: we take the derivatives of \(J\) with respect to the \(\theta_j\) and set them to zero, which minimizes \(J\) explicitly, without an iterative algorithm.

2.1 Matrix derivative

For a function \(f: \mathbb{R}^{m\times n} \mapsto \mathbb{R}\) that maps \(m\times n\) matrices to real numbers, we define the derivative of \(f\) with respect to \(A\) as:

\[\nabla_A f(A) = \begin{bmatrix}\frac{\partial f}{\partial A_{11}} & \dots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \dots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}\]

\(\nabla_A f(A)\) is itself an \(m\times n\) matrix whose \((i,j)\) element is \(\frac{\partial f}{\partial A_{ij}}\). For example, suppose

\(A=\begin{bmatrix} A_{11}&A_{12}\\ A_{21}&A_{22} \end{bmatrix}\) is a \(2\times 2\) matrix and the function \(f:\mathbb{R}^{2 \times 2} \mapsto \mathbb{R}\) is:

\[f(A) = \frac{3}{2}A_{11}+5A_{12}^2+A_{21}A_{22}\]

Then the derivative of \(f\) with respect to \(A\) is:

\[\nabla_A f(A) = \begin{bmatrix}\frac{3}{2}&10A_{12}\\ A_{22}&A_{21}\end{bmatrix}\]
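One way to sanity-check a matrix derivative like this is to compare it with central finite differences. The sketch below does that for the example function; the helper name `numerical_gradient`, the step size and the test matrix are choices made for this illustration.

```python
import numpy as np

def f(A):
    """The example function: 3/2 * A11 + 5 * A12^2 + A21 * A22."""
    return 1.5 * A[0, 0] + 5.0 * A[0, 1] ** 2 + A[1, 0] * A[1, 1]

def grad_f(A):
    """The analytic derivative of f with respect to A from the text."""
    return np.array([[1.5, 10.0 * A[0, 1]],
                     [A[1, 1], A[1, 0]]])

def numerical_gradient(func, A, eps=1e-6):
    """Approximate each partial derivative df/dA_ij by a central difference."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        step = np.zeros_like(A)
        step[idx] = eps
        grad[idx] = (func(A + step) - func(A - step)) / (2 * eps)
    return grad

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(grad_f(A))
print(numerical_gradient(f, A))  # should agree with grad_f(A) up to rounding
```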

**Trace operator**

For an \(n\times n\) matrix \(A\), the trace is the sum of its diagonal entries:

\[\mathrm{tr}\,A=\sum_{i=1}^n A_{ii}\]

If \(a\) is a real number (that is, a \(1\times 1\) matrix), then \(\mathrm{tr}\,a = a\).

For two matrices \(A\) and \(B\) such that \(AB\) is square:

\[\mathrm{tr}\,AB=\mathrm{tr}\,BA\]

For products of more matrices:

\[\mathrm{tr}\,ABC=\mathrm{tr}\,CAB=\mathrm{tr}\,BCA\]

\[\mathrm{tr}\,ABCD=\mathrm{tr}\,DABC=\mathrm{tr}\,CDAB=\mathrm{tr}\,BCDA\]

For two square matrices \(A\) and \(B\) of the same size and a real number \(a\):

\[\mathrm{tr}\,A=\mathrm{tr}\,A^T\]

\[\mathrm{tr}(A+B) =\mathrm{tr}\,A+\mathrm{tr}\,B\]

\[\mathrm{tr}\,aA=a\,\mathrm{tr}\,A\]
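These trace identities are easy to check numerically. The snippet below is a small sketch using random matrices; the shapes, the seed and the names `P`, `Q` and `a` are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 3))

# Cyclic property: the three products have different sizes but the same trace.
t_abc = np.trace(A @ B @ C)  # trace of a 3x3 matrix
t_cab = np.trace(C @ A @ B)  # trace of a 2x2 matrix
t_bca = np.trace(B @ C @ A)  # trace of a 4x4 matrix
print(np.isclose(t_abc, t_cab), np.isclose(t_abc, t_bca))

# Properties for square matrices P, Q of the same size and a real number a.
P = rng.standard_normal((3, 3))
Q = rng.standard_normal((3, 3))
a = 2.5
print(np.isclose(np.trace(P), np.trace(P.T)))
print(np.isclose(np.trace(P + Q), np.trace(P) + np.trace(Q)))
print(np.isclose(np.trace(a * P), a * np.trace(P)))
```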

**Matrix derivative formulas:**

\[\nabla_A \mathrm{tr}\,AB = B^T \qquad \qquad \qquad \qquad (1)\]

\[\nabla_{A^T} f(A) = (\nabla_A f(A))^T \qquad \qquad \qquad (2)\]

\[\nabla_A \mathrm{tr}\,ABA^TC = CAB + C^TAB^T \qquad \qquad \qquad (3)\]

\[\nabla_A |A| = |A| (A^{-1})^T \qquad \qquad \qquad (4)\]

Let \(A\in \mathbb{R}^{n\times m}\) and \(B\in \mathbb{R}^{m\times n}\). We verify formula (1), \(\nabla_A \mathrm{tr}\,AB = B^T\):

\[A=\begin{bmatrix}A_{11} & \dots & A_{1m}\\ \vdots & \ddots & \vdots\\ A_{n1} & \dots & A_{nm}\end{bmatrix}\]

\[B=\begin{bmatrix}B_{11} & \dots & B_{1n}\\ \vdots & \ddots & \vdots\\ B_{m1} & \dots & B_{mn}\end{bmatrix}\]

\[\begin{align}\nabla_A \mathrm{tr}\,AB &= \nabla_A \mathrm{tr}\left(\begin{bmatrix}A_{11} & \dots & A_{1m}\\ \vdots & \ddots & \vdots\\ A_{n1} & \dots & A_{nm}\end{bmatrix} \begin{bmatrix}B_{11} & \dots & B_{1n}\\ \vdots & \ddots & \vdots\\ B_{m1} & \dots & B_{mn}\end{bmatrix}\right)\\ &=\nabla_A \mathrm{tr}\begin{bmatrix}A_{11}B_{11}+A_{12}B_{21}+\dots+A_{1m}B_{m1}&\dots&A_{11}B_{1k}+A_{12}B_{2k}+\dots+A_{1m}B_{mk}&\dots&A_{11}B_{1n}+A_{12}B_{2n}+\dots+A_{1m}B_{mn}\\ \vdots&&\vdots&&\vdots\\ A_{k1}B_{11}+A_{k2}B_{21}+\dots+A_{km}B_{m1}&\dots&A_{k1}B_{1k}+A_{k2}B_{2k}+\dots+A_{km}B_{mk}&\dots&A_{k1}B_{1n}+A_{k2}B_{2n}+\dots+A_{km}B_{mn}\\ \vdots&&\vdots&&\vdots\\ A_{n1}B_{11}+A_{n2}B_{21}+\dots+A_{nm}B_{m1}&\dots&A_{n1}B_{1k}+A_{n2}B_{2k}+\dots+A_{nm}B_{mk}&\dots&A_{n1}B_{1n}+A_{n2}B_{2n}+\dots+A_{nm}B_{mn}\end{bmatrix}\\ &=\nabla_A \left((A_{11}B_{11}+A_{12}B_{21}+\dots+A_{1m}B_{m1})+\dots+(A_{k1}B_{1k}+A_{k2}B_{2k}+\dots+A_{km}B_{mk})+\dots+(A_{n1}B_{1n}+A_{n2}B_{2n}+\dots+A_{nm}B_{mn})\right)\\ &=\begin{bmatrix}B_{11}&\dots&B_{m1}\\ \vdots&\ddots&\vdots\\ B_{1n}&\dots&B_{mn}\end{bmatrix}\\ &=B^T\end{align}\]

Formula (4) can be derived from the adjoint (adjugate) representation of the matrix inverse.
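Formulas (1), (3) and (4) can also be checked numerically against finite differences. The sketch below is only a spot check on particular matrices, not a proof; the helper `numerical_gradient`, the random seed and the matrices `B2`, `C` and `S` are assumptions made for this illustration.

```python
import numpy as np

def numerical_gradient(func, A, eps=1e-6):
    """Approximate each partial derivative d func / d A_ij by a central difference."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        step = np.zeros_like(A)
        step[idx] = eps
        grad[idx] = (func(A + step) - func(A - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)  # arbitrary seed
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# Formula (1): the gradient of tr(AB) with respect to A is B^T.
grad1 = numerical_gradient(lambda M: np.trace(M @ B), A)
print(np.allclose(grad1, B.T, atol=1e-6))

# Formula (3): the gradient of tr(A B A^T C) is C A B + C^T A B^T.
B2 = rng.standard_normal((4, 4))
C = rng.standard_normal((3, 3))
grad3 = numerical_gradient(lambda M: np.trace(M @ B2 @ M.T @ C), A)
print(np.allclose(grad3, C @ A @ B2 + C.T @ A @ B2.T, atol=1e-6))

# Formula (4): the gradient of |S| is |S| (S^{-1})^T for an invertible square S.
S = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])  # determinant 18, so S is invertible
grad4 = numerical_gradient(np.linalg.det, S)
print(np.allclose(grad4, np.linalg.det(S) * np.linalg.inv(S).T, atol=1e-6))
```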

2.2 Least squares regression

Given a training set, define the **design matrix** \(X\) as the \(m\times n\) matrix (or \(m\times (n+1)\) if it includes the intercept term) whose rows are the training inputs:

\[X=\begin{bmatrix}-(x^{(1)})^T-\\ -(x^{(2)})^T-\\ \vdots\\ -(x^{(m)})^T-\end{bmatrix}\]

Let \(\overrightarrow y\) be the \(m\)-dimensional vector that contains all the target values of the training set:

\[\overrightarrow y=\begin{bmatrix}y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)}\end{bmatrix}\]

Because \(h_{\theta}(x^{(i)}) = (x^{(i)})^T\theta\), where \(\theta\) is an \(n\times 1\) vector, we have:

\[\begin{align}X\theta-\overrightarrow y &= \begin{bmatrix}(x^{(1)})^T\theta\\ (x^{(2)})^T\theta\\ \vdots\\ (x^{(m)})^T\theta\end{bmatrix}-\begin{bmatrix}y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)}\end{bmatrix}\\ &=\begin{bmatrix}h_{\theta}(x^{(1)})-y^{(1)}\\ h_{\theta}(x^{(2)})-y^{(2)}\\ \vdots\\ h_{\theta}(x^{(m)})-y^{(m)}\end{bmatrix}\end{align}\]

For a vector \(z\), we have \(z^Tz=\sum_i z_i^2\), so:

\[\begin{align}\frac{1}{2}(X\theta-\overrightarrow y)^T(X\theta-\overrightarrow y) &=\frac{1}{2}\sum_{i=1}^m \left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2\\ &=J(\theta)\end{align}\]

Before differentiating with respect to \(\theta\), we combine equations (2) and (3) to obtain:

\[\nabla_{A^T}\mathrm{tr}\,ABA^TC = B^TA^TC^T + BA^TC\qquad \qquad \qquad \qquad (5)\]

So:

\[\begin{align} \nabla_{\theta}J(\theta) &=\nabla_{\theta}\frac{1}{2}(X\theta-\overrightarrow y)^T(X\theta-\overrightarrow y)\\ &=\frac{1}{2}\nabla_{\theta}\left(\theta^TX^TX\theta-\theta^TX^T\overrightarrow y-\overrightarrow y^TX\theta+\overrightarrow y^T\overrightarrow y\right)\\ &=\frac{1}{2}\nabla_{\theta}\mathrm{tr}\left(\theta^TX^TX\theta-\theta^TX^T\overrightarrow y-\overrightarrow y^TX\theta+\overrightarrow y^T\overrightarrow y\right)\\ &=\frac{1}{2}\nabla_{\theta}\left(\mathrm{tr}\,\theta^TX^TX\theta-2\,\mathrm{tr}\,\overrightarrow y^TX\theta\right)\\ &=\frac{1}{2}\left(X^TX\theta+X^TX\theta-2X^T\overrightarrow y\right)\\ &=X^TX\theta-X^T\overrightarrow y\end{align}\]

In the third step, because \(\theta\) is \(n\times 1\) and \(X\) is \(m\times n\), the expression \(\theta^TX^TX\theta-\theta^TX^T\overrightarrow y-\overrightarrow y^TX\theta+\overrightarrow y^T\overrightarrow y\) is a \(1\times 1\) matrix, that is, a real number, so by \(\mathrm{tr}\,a = a\) we can go from the second step to the third. The fourth step uses \(\mathrm{tr}\,A=\mathrm{tr}\,A^T\) to merge the two middle terms and drops \(\overrightarrow y^T\overrightarrow y\), which does not depend on \(\theta\). The fifth step uses equation (5) with \(A^T=\theta\), \(B=B^T=X^TX\) and \(C=I\), together with equation (1). To minimize \(J\), set \(\nabla_{\theta}J(\theta)\) to 0, which gives the **normal equation**:

\[X^TX\theta=X^T\overrightarrow y\]

So the value of \(\theta\) that minimizes \(J\) is given in closed form by:

\[\theta=(X^TX)^{-1}X^T\overrightarrow y\]
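In code, the closed-form solution might look like the sketch below. It solves the linear system \(X^TX\theta=X^T\overrightarrow y\) with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical practice rather than something the derivation requires. The function name and the toy data are made up, and \(X^TX\) is assumed to be invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up noisy data scattered around y = 1 + 2x; the first column is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

theta = normal_equation(X, y)
gradient = X.T @ X @ theta - X.T @ y  # gradient of J at theta, should vanish
print(theta)
print(gradient)
```

Unlike gradient descent, this needs no learning rate and no iteration, at the price of solving a linear system whose size grows with the number of features.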

3 Probability interpretation

Why is linear regression, and in particular the least squares loss function \(J\), a reasonable choice for a regression problem?

Assume that the output variable and the inputs are related by the following equation:

\[y^{(i)}=\theta^Tx^{(i)}+\epsilon^{(i)}\]

Here \(\epsilon^{(i)}\) is an error term that captures either factors not taken into account in the model or random noise. We assume that the \(\epsilon^{(i)}\) are independent and identically distributed (IID) according to a **Gaussian distribution** with mean \(\mu = 0\) and variance \(\sigma^2\), i.e. \(\epsilon^{(i)}\sim \mathcal{N}(0,\sigma^2)\).

(\(\epsilon^{(i)}\) is assumed to be Gaussian because, by the central limit theorem, the sum of a large number of independent random variables is approximately normally distributed.)

The density of \(\epsilon^{(i)}\) is therefore:

\[p(\epsilon^{(i)}) =\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)\]

That is, given \(x^{(i)}\) and the parameter \(\theta\), the target value \(y^{(i)}\) follows a Gaussian distribution:

\[p(y^{(i)}|x^{(i)};\theta) =\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)\]

\(p(y^{(i)}|x^{(i)};\theta)\) is the distribution of \(y^{(i)}\) given \(x^{(i)}\) and parameterized by \(\theta\). \(\theta\) is not conditioned on, because it is not a random variable.

The distribution of \(y^{(i)}\) can also be written as \(y^{(i)}|x^{(i)};\theta \sim \mathcal{N}(\theta^Tx^{(i)},\sigma^2)\).

The above is the distribution for a single sample. Now, given the design matrix \(X\) (which contains all the \(x^{(i)}\)) and \(\theta\), consider the distribution of the target vector \(\overrightarrow y\). Viewed as a function of \(\theta\) for fixed \(X\) and \(\overrightarrow y\), it is called the **likelihood function**:

\[L(\theta) =L(\theta; X,\overrightarrow y) =p(\overrightarrow y|X;\theta)\]

By the independence assumption on the \(\epsilon^{(i)}\), this can also be written as:

\[\begin{align}L(\theta) &=\prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta)\\ &=\prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)\end{align}\]

To find the best value of the parameter \(\theta\), the principle of **maximum likelihood** says to choose the \(\theta\) that maximizes \(L(\theta)\). It is easier to maximize the **log-likelihood function** \(\ell(\theta)\) instead of the **likelihood function**; both have the same maximizer because the logarithm is strictly increasing:

\[\begin{align}\ell(\theta) &=\log L(\theta)\\ &=\log\prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)\\ &=\sum_{i=1}^m \log\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)\\ &=m\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^m (y^{(i)}-\theta^Tx^{(i)})^2\end{align}\]

To maximize \(\ell(\theta)\), we therefore need to minimize

\[\frac{1}{2}\sum_{i=1}^m (y^{(i)}-\theta^Tx^{(i)})^2\]

This is exactly the least squares loss function \(J(\theta)\). Note that the resulting choice of \(\theta\) does not depend on the value of \(\sigma^2\).
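The relationship between the log-likelihood and \(J(\theta)\) can be checked on a small example. In the sketch below, the function names, the toy data, the values of `sigma` and the perturbed parameter vector `theta_other` are all made up for illustration.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """Closed form: m * log(1 / (sqrt(2 pi) sigma)) - J(theta) / sigma^2."""
    m = len(y)
    J = 0.5 * np.sum((y - X @ theta) ** 2)
    return m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - J / sigma ** 2

def log_likelihood_direct(theta, X, y, sigma):
    """Sum of the logs of the Gaussian densities p(y^(i) | x^(i); theta)."""
    residual = y - X @ theta
    density = np.exp(-residual ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.sum(np.log(density))

# Made-up noisy data; the first column is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)    # least squares solution
theta_other = theta_ls + np.array([0.3, -0.2])  # an arbitrary different choice

for sigma in (0.5, 2.0):
    print(sigma,
          log_likelihood(theta_ls, X, y, sigma),
          log_likelihood(theta_other, X, y, sigma))
```

For both values of `sigma`, the least squares parameters give the larger log-likelihood, in line with the observation that the maximizer does not depend on \(\sigma^2\).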

4 Locally weighted linear regression

Consider the problem of predicting \(y\) from \(x\in \mathbb{R}\). In the left figure, the line \(y=\theta_0+\theta_1x\) is fitted to the dataset. However, the data clearly do not lie on a straight line.

If we add one more feature \(x^2\) and fit \(y=\theta_0+\theta_1x+\theta_2x^2\), the fit to the data is better. However, if we add too many features, as in the figure on the right, the fitted curve passes through every point of the dataset and yet is not a good predictor.

In the case shown in the left figure, the function clearly fails to capture the structure of the data; this is called **underfitting**. The case on the right is called **overfitting**.

Therefore, the choice of features is important for ensuring good performance of a learning algorithm.

**Locally weighted linear regression (LWR)** makes the choice of features less critical for the algorithm.

To make a prediction at a query point \(x\), the ordinary linear regression algorithm:

1. Fits \(\theta\) to minimize \(\sum_{i=1}^m (y^{(i)}-\theta^Tx^{(i)})^2\)

2. Outputs \(\theta^Tx\)

The locally weighted linear regression algorithm:

1. Fits \(\theta\) to minimize \(\sum_{i=1}^m w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2\)

2. Outputs \(\theta^Tx\)

where the \(w^{(i)}\) are non-negative weights. A standard choice is:

\[w^{(i)}=\exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)\]

The parameter \(\tau\) controls how quickly the weight of a training sample falls off with the distance of \(x^{(i)}\) from the query point \(x\).

When \(|x^{(i)}-x|\) is small, \(w^{(i)}\) is close to 1; when \(|x^{(i)}-x|\) is large, \(w^{(i)}\) is close to 0. Here \(x\) is the point at which we want to make a prediction. Therefore, when choosing \(\theta\), training samples near \(x\) receive a high weight and training samples far from \(x\) receive a low weight. This is what makes the fit local.

Locally weighted linear regression is a **non-parametric** learning algorithm, whereas the ordinary linear regression algorithm described earlier is a **parametric** learning algorithm.

Ordinary linear regression has a fixed, finite set of parameters \(\theta\). Once \(\theta\) has been fitted, we no longer need to keep the training data in order to make predictions for new inputs.

Locally weighted linear regression needs to keep the entire training set, because the weights \(w^{(i)}\), and hence \(\theta\), have to be recomputed for every query point.
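A minimal sketch of a locally weighted prediction is given below. It rests on a few assumptions that go beyond the text: the weighted objective is minimized in closed form through the weighted normal equation \(X^TWX\theta=X^TW\overrightarrow y\), with \(W\) the diagonal matrix of the weights (derived in the same way as the unweighted normal equation); for vector inputs the squared difference in the weight formula is taken to be the squared Euclidean distance; and the function name, `tau=0.3` and the toy data are made up.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.3):
    """Locally weighted linear regression prediction at one query point.

    X and x_query are assumed to include the intercept feature x_0 = 1.
    """
    diff = X - x_query                         # the x_0 column contributes 0
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * tau ** 2))
    XtW = X.T * w                              # same as X^T @ diag(w)
    theta = np.linalg.solve(XtW @ X, XtW @ y)  # weighted normal equation
    return x_query @ theta

# Made-up nonlinear data: y = x^2 sampled on a grid.
x = np.linspace(-3.0, 3.0, 61)
X = np.column_stack([np.ones_like(x), x])
y = x ** 2

x_query = np.array([1.0, 1.0])                 # x_0 = 1 and x = 1
local_pred = lwr_predict(x_query, X, y)

# Ordinary (unweighted) linear regression on the same data, for comparison.
global_theta = np.linalg.solve(X.T @ X, X.T @ y)
global_pred = x_query @ global_theta

print(local_pred, global_pred)                 # the true value at x = 1 is 1
```

The local fit lands much closer to the true value than the single global line, but `theta` has to be refitted for each new query point, which is why the whole training set must be kept.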

"CS229 Note one" supervised learning, linear regression, LMS algorithm, normal equation, probabilistic interpretation and local weighted linear regression