These are notes from Professor Zhang Zhihua's open course "Introduction to Machine Learning" at Shanghai Jiao Tong University (course link: http://ocw.sjtu.edu.cn/G2S/OCW/cn/CourseDetails.htm?Id=397), taken over three days. OK, straight to the subject.
(i) Basic Concepts

Data mining and machine learning are essentially the same thing; ML is just closer to mathematics. (In my eyes ML is the lower layer: data mining, computer vision, and NLP all build on it.)

Machine learning definition (Michael Jordan): a field that bridges computation and statistics, with ties to information theory, signal processing, algorithms, control theory, and optimization theory. ML can be expressed by the formula: ML = matrix + statistics + optimization + algorithm.

1. Definition

The data $X=[x_1,\dots,x_n]^T_{(n\times p)}$ is an $n\times p$ matrix containing $n$ samples. A sample $x_i=(x_{1i},\dots,x_{pi})$ is a $p$-dimensional vector containing $p$ features. Each sample can be given a label $y_i$. For example, a person is a sample, height and weight are features, and sex is a label. Often we want to predict the label of a sample, that is, input sample -> output label.

Classification problem: the label takes finitely many values. If there are two label values (usually 0/1 or -1/+1), it is a binary classification problem; otherwise it is a multi-class classification problem.

Regression problem: the label takes infinitely many values, for example $y\in\mathbb{R}$.

Supervised learning: some samples (training samples) and their labels are given first, and labels are then predicted for new samples. Classification and regression belong to supervised learning.

2. Linear model

\[y=x^Ta\]

The linear model predicts the label by a linear combination of the features; in other words, each feature is given a weight, and the weighted sum of the features predicts $y$. To determine the weights $a$, the most straightforward approach is the least-squares estimate from statistics, that is, minimize
\begin{align*} L&=\frac{1}{2}\sum_{i=1}^n (y_i-x_i^Ta)^2 \\ &=\frac{1}{2}\|y-Xa\|_2^2. \end{align*}
Taking the derivative and setting it to zero,
\[\frac{\partial L}{\partial a}=-X^T(y-Xa)=0.\]
If $X^TX$ is invertible, we can solve
\[ a=(X^TX)^{-1}X^Ty. \]
When $n>p$, $X^TX$ is generally invertible.
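As a minimal sketch (not from the lecture; the data are synthetic), the closed-form least-squares solution $a=(X^TX)^{-1}X^Ty$ can be checked numerically with NumPy against its built-in solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                # n samples, p features
X = rng.normal(size=(n, p))                  # data matrix, one sample per row
a_true = np.array([2.0, -1.0, 0.5])          # made-up "true" weights
y = X @ a_true + 0.01 * rng.normal(size=n)   # labels with a little noise

# Normal-equation solution a = (X^T X)^{-1} X^T y
# (solve is preferred over explicitly inverting X^T X)
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with NumPy's built-in least-squares solver
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a_hat)   # close to [2.0, -1.0, 0.5]
```

Here $n>p$, so $X^TX$ is invertible with probability 1 and both routes give the same answer.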
But sometimes there are many features and not so many samples; then $X^TX$ is not invertible and there is no unique solution (the problem is underdetermined). We can then add a penalty $\lambda P(a)$ to the loss function $L$, where $\lambda>0$. Often we take $P(a)=a^Ta$, and the problem becomes ridge regression:
\[L(a)+\lambda P(a)=\frac{1}{2}\|y-Xa\|_2^2+\frac{1}{2}\lambda a^Ta.\]
Taking the derivative:
\[-X^T(y-Xa)+\lambda a=0.\]
Since $X^TX+\lambda I_p$ is positive definite and hence always invertible, we have
\[ a=(X^TX+\lambda I_p)^{-1}X^Ty. \]
So how should the value of $\lambda$ be chosen? For this we divide the data into three parts: training data, validation data, and test data. Training data is used to learn $a$, validation data is used to tune $\lambda$, and test data is used for the final prediction (or to verify the final result).

In addition, $P(a)=\|a\|_1=\sum_{i=1}^p|a_i|$ is also common; the problem then becomes the lasso, that is,
\[ \frac{1}{2}\|y-Xa\|_2^2+\frac{1}{2}\lambda\|a\|_1. \]
Using the 1-norm as a penalty has the property that it drives some components of $a$ to exactly 0, so it performs automatic feature selection.

3. Maximum likelihood estimation (MLE)

Note that in the discussion above, the $y$ obtained from the linear model is continuous, so how do we handle a classification problem? For a binary classification problem $y\in\{0,1\}$, one of the simplest methods is: given an $\alpha$ with $0<\alpha<1$, predict $y=0$ if $y<\alpha$ and $y=1$ otherwise. To put this on a more rigorous mathematical basis, we can assume $y$ obeys a Bernoulli distribution, $\{y_i\}\ i.i.d. \sim Ber(\alpha)$. From the Bernoulli distribution, the likelihood is
\[L=\prod_{i=1}^n p(y_i)=\prod_{i=1}^n \alpha^{y_i}(1-\alpha)^{1-y_i}.\]
We need to consider how to link $L$ with the data $X$, and how to set $\alpha$.
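The ridge solution $a=(X^TX+\lambda I_p)^{-1}X^Ty$ can be illustrated in the underdetermined regime. A small sketch (synthetic data; the value of $\lambda$ is ad hoc and would normally be tuned on validation data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                  # more features than samples: X^T X is singular
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X has rank at most n < p, so plain least squares has no unique solution
print(np.linalg.matrix_rank(X.T @ X))   # at most 20

# Ridge solution a = (X^T X + lambda * I_p)^{-1} X^T y
lam = 0.1                      # hypothetical lambda, tuned on validation data in practice
a_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(a_ridge.shape)           # (50,) -- a unique solution exists for any lambda > 0
```

Adding $\lambda I_p$ shifts every eigenvalue of $X^TX$ up by $\lambda$, which is why the regularized matrix is always invertible.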
Taking the negative log-likelihood,
\begin{align*} f&=-\ln L \\ &=-\sum_{i=1}^n [y_i\ln\alpha+(1-y_i)\ln(1-\alpha)]. \end{align*}
Setting
$$\alpha=\frac{1}{1+\exp(-x^Ta)},$$
$f$ becomes a function of $a$, and the problem becomes an optimization problem. As before, a penalty (regularization) term can also be added.

4. Unsupervised and semi-supervised learning

For the case mentioned earlier where $p$ is very large, besides adding a penalty we can also reduce the dimension: through some transformation, map $x\in\mathbb{R}^p$ to a new feature representation $z\in\mathbb{R}^q$ with $q<p$. Dimensionality reduction comes in two flavors: the first is a linear transformation, that is, $z=Bx$, $B\in\mathbb{R}^{q\times p}$, such as PCA; the second is a nonlinear one, $z=f(x)$.

Unsupervised learning: only the samples are considered. Besides dimensionality reduction, another typical unsupervised task is clustering: there are only samples without labels, and the samples are divided into several groups by their features. There is no test data, only training data.

Semi-supervised learning: a small number of samples have labels, and a large number of samples do not.
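The MLE setup above (the lecture stops at posing the optimization problem) can be sketched numerically: minimize $f(a)=-\sum_i [y_i\ln\alpha_i+(1-y_i)\ln(1-\alpha_i)]$ with $\alpha_i=1/(1+\exp(-x_i^Ta))$ by plain gradient descent, using the gradient $\nabla f = X^T(\alpha-y)$. The data, step size, and iteration count here are all made up for illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(2)
n, p = 200, 2
X = rng.normal(size=(n, p))
a_true = np.array([3.0, -2.0])
# Bernoulli labels: y_i ~ Ber(sigmoid(x_i^T a_true))
y = (rng.uniform(size=n) < sigmoid(X @ a_true)).astype(float)

# Gradient descent on the negative log-likelihood f(a)
a = np.zeros(p)
lr = 0.01                      # step size, chosen ad hoc for this sketch
for _ in range(2000):
    alpha = sigmoid(X @ a)     # alpha_i = 1 / (1 + exp(-x_i^T a))
    grad = X.T @ (alpha - y)   # gradient of f at the current a
    a -= lr * grad

pred = (sigmoid(X @ a) >= 0.5).astype(float)
print((pred == y).mean())      # training accuracy, well above chance
```

The fitted $a$ recovers the signs of `a_true`; with a penalty term added, only the gradient line would change.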