Preliminary introduction
Supervised learning: Given a data set in which the correct output for each example is known (there is feedback). Divided into:
- Regression (regression): Map the input to a continuous output value.
- Classification (classification): Map the input to discrete output values.
Unsupervised learning: Given a data set, it is not known what the correct output should be (no feedback). Divided into:
- Clustering (clustering): Examples: Google News, organizing computer clusters, market segmentation.
- Association (associative): Example: Estimating a patient's condition based on the patient's characteristics.
Unary linear regression
Hypothesis (hypothesis): $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters (Parameters): $\theta_0, \theta_1$
Cost function (cost function): $J(\theta_0, \theta_1) = \frac{1}{2m}\sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ (least squares)
Objective (goal): $\min\limits_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
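The cost function above can be sketched in a few lines of plain Python; the data points and parameter values below are made up for illustration, not taken from the notes:

```python
# Squared-error cost J(theta0, theta1) for unary linear regression.
def cost(theta0, theta1, xs, ys):
    """J = (1/2m) * sum((h(x_i) - y_i)^2) with h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data lying exactly on y = 1 + 2x, so the cost at (1, 2) is zero.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
print(cost(1.0, 2.0, xs, ys))  # 0.0
```

Any other choice of parameters gives a strictly positive cost, which is what the minimization objective exploits.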
Gradient Descent algorithm (Gradient descent)
Basic idea:
- Initialize $\theta_0, \theta_1$
- Adjust $\theta_0, \theta_1$ until $J(\theta_0, \theta_1)$ reaches its minimum, using the update formula $\theta_j = \theta_j - \alpha\frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1)$
For the unary linear regression problem, taking the partial derivatives of $J(\theta_0, \theta_1)$ gives
$$\frac{\partial J}{\partial \theta_0} = \frac{1}{2m}\sum\limits_{i=1}^{m}2\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right) = \frac{1}{m}\sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\frac{\partial J}{\partial \theta_1} = \frac{1}{2m}\sum\limits_{i=1}^{m}2\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)x^{(i)} = \frac{1}{m}\sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
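As a sanity check, the analytic derivative with respect to $\theta_1$ can be compared against a central finite difference; the data and evaluation point below are illustrative, not from the notes:

```python
# Verify the derived formula dJ/dtheta1 = (1/m) * sum((h(x_i) - y_i) * x_i)
# numerically with a central difference on toy data.
def J(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def dJ_dtheta1(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m

xs, ys = [1.0, 2.0, 3.0], [2.0, 2.5, 4.0]
eps = 1e-6
numeric = (J(0.5, 1.0 + eps, xs, ys) - J(0.5, 1.0 - eps, xs, ys)) / (2 * eps)
print(abs(numeric - dJ_dtheta1(0.5, 1.0, xs, ys)) < 1e-6)  # True
```

Because $J$ is quadratic in the parameters, the central difference agrees with the analytic gradient up to floating-point rounding.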
Thus the update formulas for the parameters $\theta_0, \theta_1$ are
$$\theta_0 = \theta_0 - \alpha\frac{1}{m}\sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\theta_1 = \theta_1 - \alpha\frac{1}{m}\sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
Here $\alpha$ is called the learning rate (learning rate). If it is too small, the algorithm converges too slowly; conversely, if it is too large, the algorithm may overshoot the minimum, or even fail to converge. Another thing to note is that the update formulas for $\theta_0, \theta_1$ above use all the data in the data set (this is called "batch" gradient descent), which means every update has to scan the entire data set, making updates slow on large data sets.
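The batch update rule above can be sketched as follows; the learning rate, iteration count, and data set are illustrative choices, not values from the notes:

```python
# Batch gradient descent for unary linear regression: every step uses
# the full dataset, matching the update formulas above.
def gradient_descent(xs, ys, alpha=0.1, steps=1000):
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0  # update both parameters simultaneously
        theta1 -= alpha * grad1
    return theta0, theta1

# Data generated from y = 1 + 2x; the fit should approach (1.0, 2.0).
t0, t1 = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(t0, t1)
```

Note that both gradients are computed from the same `errors` before either parameter is updated; updating $\theta_0$ first and then reusing it in the $\theta_1$ gradient would be a subtly different algorithm.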
Review of linear algebra
- Matrix and Vector definitions
- Matrix addition and multiplication
- Matrix-Vector Product
- Matrix-matrix Product
- Properties of matrix multiplication: the associative law holds, but the commutative law does not
- Inverse and transpose of matrices: a matrix that has no inverse is called a "singular" (singular) matrix
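The multiplication properties above can be checked with plain 2x2 lists; `matmul` is a toy helper written here for illustration, not a library function:

```python
# Naive matrix product to demonstrate that matrix multiplication is
# associative, (AB)C = A(BC), but not commutative, AB != BA in general.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

print(matmul(matmul(A, B), C) == matmul(A, matmul(B, C)))  # True
print(matmul(A, B) == matmul(B, A))                        # False
```

A single counterexample like `A` and `B` above is enough to show commutativity fails, while associativity holds for all conformable matrices.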
Reference documents
[1] Andrew Ng, Machine Learning, Coursera public course, week 1.