Notes organized from Week 4 of Andrew Ng's Machine Learning course.
Directory:
- Why use neural networks?
- Model representation of neural networks 1
- Model representation of neural networks 2
- Example 1
- Example 2
- Multi-class classification problems
1. Why use neural networks
When we have many features, say $x_1, x_2, x_3, \dots, x_{100}$:
Suppose we use a non-linear model that includes polynomial terms up to degree 2. Then for a non-linear classification problem, logistic regression looks like:
$g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1x_2+\theta_4x_1^2x_2+\dots)$
With $n$ original features there are approximately $\frac{n^2}{2}$ quadratic terms, i.e. $O(n^2)$ features; with degree-3 terms the count grows to $O(n^3)$.
Such a large number of features has two consequences: 1. the likelihood of overfitting increases; 2. the computation is expensive.
As a more extreme example, consider images, where each pixel is a feature. Even for a 50×50 image (already a very small picture), a grayscale image has 2,500 features and an RGB image has 7,500, with each feature taking values from 0 to 255.
For such an image, if we include all quadratic (pairwise) terms, we end up with about 3 million features; running logistic regression on that is quite expensive.
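As a quick sanity check on these counts, here is a minimal Python sketch (the helper name `num_quadratic_terms` is just for illustration); it counts the distinct quadratic terms $x_ix_j$ with $i \le j$:

```python
from math import comb

def num_quadratic_terms(n):
    # Distinct products x_i * x_j with i <= j: C(n, 2) cross terms plus n squares
    return comb(n, 2) + n

print(num_quadratic_terms(100))    # 5050, roughly 100^2 / 2
print(num_quadratic_terms(2500))   # 3126250, roughly 3 million for a 50x50 grayscale image
```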
This is where neural networks come in.
2. Model representation of neural networks 1
The basic structure of the neural network is as follows:
$x_0, x_1, x_2, x_3$ are the input units; $x_0$ is also known as the bias unit and can be set to 1;
$\theta$ are the weights (or simply the parameters) that connect one layer to the next;
$h_\theta(x)$ is the output;
For the following network structure, we have the following definitions and calculation formulas:
$a_i^{(j)}$: the activation (i.e. the value) of unit $i$ in layer $j$; the middle layers are called hidden layers
$s_j$: the number of units in layer $j$
$\theta^{(j)}$: the weight matrix controlling the mapping from layer $j$ to layer $j+1$; $\theta^{(j)}$ has dimension $s_{j+1} \times (s_j+1)$ (e.g. with $s_1=3$ and $s_2=3$, $\theta^{(1)}$ is $3 \times 4$)
The formula for $a^{(2)}$ is:
$a_1^{(2)}=g(\theta_{10}^{(1)}x_0+\theta_{11}^{(1)}x_1+\theta_{12}^{(1)}x_2+\theta_{13}^{(1)}x_3)$
$a_2^{(2)}=g(\theta_{20}^{(1)}x_0+\theta_{21}^{(1)}x_1+\theta_{22}^{(1)}x_2+\theta_{23}^{(1)}x_3)$
$a_3^{(2)}=g(\theta_{30}^{(1)}x_0+\theta_{31}^{(1)}x_1+\theta_{32}^{(1)}x_2+\theta_{33}^{(1)}x_3)$
So in the same vein,
$h_\theta(x)=a_1^{(3)}=g(\theta_{10}^{(2)}a_0^{(2)}+\theta_{11}^{(2)}a_1^{(2)}+\theta_{12}^{(2)}a_2^{(2)}+\theta_{13}^{(2)}a_3^{(2)})$
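A minimal numpy sketch of these unit-by-unit formulas (the layer sizes match the network above; the random weights are placeholders, not trained values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -1.2, 2.0])   # x_0 = 1 is the bias unit
theta1 = np.random.randn(3, 4)        # theta^(1): s_2 x (s_1 + 1) = 3 x 4
theta2 = np.random.randn(1, 4)        # theta^(2): 1 x (s_2 + 1) = 1 x 4

# a_i^(2) = g(theta_i0 x_0 + theta_i1 x_1 + theta_i2 x_2 + theta_i3 x_3)
a2 = np.array([sigmoid(theta1[i] @ x) for i in range(3)])

# h_theta(x) = a_1^(3), computed from a^(2) with the bias unit a_0^(2) = 1
a2 = np.concatenate(([1.0], a2))
h = sigmoid(theta2[0] @ a2)
```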
3. Model representation of neural networks 2
Forward propagation: vectorized implementation
To vectorize the formulas above, define:
$z_1^{(2)}=\theta_{10}^{(1)}x_0+\theta_{11}^{(1)}x_1+\theta_{12}^{(1)}x_2+\theta_{13}^{(1)}x_3$
$a_1^{(2)}=g(z_1^{(2)})$
In vector form:
$a^{(1)}=x=\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$, $z^{(2)}=\begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$, $\theta^{(1)}=\begin{bmatrix} \theta_{10}^{(1)} & \theta_{11}^{(1)} & \theta_{12}^{(1)} & \theta_{13}^{(1)} \\ \theta_{20}^{(1)} & \theta_{21}^{(1)} & \theta_{22}^{(1)} & \theta_{23}^{(1)} \\ \theta_{30}^{(1)} & \theta_{31}^{(1)} & \theta_{32}^{(1)} & \theta_{33}^{(1)} \end{bmatrix}$
So:
$z^{(2)}=\theta^{(1)}a^{(1)}$
$a^{(2)}=g(z^{(2)})$
Then, adding the bias unit $a^{(2)}_0=1$:
$z^{(3)}=\theta^{(2)}a^{(2)}$
$a^{(3)}=h_\theta(x)=g(z^{(3)})$
The above is the vectorized form of forward propagation.
Each layer's activations $a^{(j)}$ learn different features; a numpy sketch of the whole computation follows below.
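Putting the vectorized steps together, here is a minimal sketch of forward propagation (random placeholder weights; the function name `forward` is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """a^(1) = x with bias; then z^(j+1) = theta^(j) a^(j), a^(j+1) = g(z^(j+1))."""
    a = np.concatenate(([1.0], x))           # a^(1), bias unit prepended
    for j, theta in enumerate(thetas):
        z = theta @ a                        # z^(j+1) = theta^(j) a^(j)
        a = sigmoid(z)                       # a^(j+1) = g(z^(j+1))
        if j < len(thetas) - 1:
            a = np.concatenate(([1.0], a))   # add bias unit for the next layer
    return a                                 # h_theta(x) = a^(L)

theta1 = np.random.randn(3, 4)   # layer 1 -> 2: s_2 x (s_1 + 1)
theta2 = np.random.randn(1, 4)   # layer 2 -> 3: s_3 x (s_2 + 1)
h = forward(np.array([0.5, -1.2, 2.0]), [theta1, theta2])
```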
4. Example 1
First consider a classification problem, XOR/XNOR: for $x_1, x_2 \in \{0,1\}$, XOR gives $y=1$ when $x_1$ and $x_2$ differ ((0,1) or (1,0)) and $y=0$ when they are the same; XNOR is its negation, $y = x_1\ \text{XNOR}\ x_2$.
Start with the simple classification problem AND:
The following neural network structure can be used to obtain the correct classification results.
Similarly, for OR, we can design an analogous network and get the right result; a sketch of both units follows below.
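Both gates can be realized by a single sigmoid unit. A minimal sketch using the weights from the course slides ($-30, 20, 20$ for AND and $-10, 20, 20$ for OR):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(theta, x1, x2):
    # Single sigmoid unit: g(theta_0 + theta_1 x_1 + theta_2 x_2)
    return sigmoid(theta @ np.array([1.0, x1, x2]))

AND = np.array([-30.0, 20.0, 20.0])   # ~1 only when x1 = x2 = 1
OR  = np.array([-10.0, 20.0, 20.0])   # ~1 when either input is 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(unit(AND, x1, x2)), round(unit(OR, x1, x2)))
```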
5. Example 2
Continuing from the examples above, for NOT, the following network structure classifies correctly (sketch below):
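NOT needs only one input. A sketch with the course's weights ($10, -20$), so the unit fires exactly when $x = 0$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

NOT = np.array([10.0, -20.0])   # g(10 - 20x): ~1 for x = 0, ~0 for x = 1
for x in (0, 1):
    print(x, round(sigmoid(NOT @ np.array([1.0, x]))))
```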
Now let's return to the problem mentioned at the beginning: XNOR.
Combining these simple networks (AND, OR, NOT), we get a network structure that solves the XNOR problem; a sketch follows below:
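A minimal sketch of the combined network: the hidden layer computes $a_1 = x_1\,\text{AND}\,x_2$ and $a_2 = (\text{NOT}\,x_1)\,\text{AND}\,(\text{NOT}\,x_2)$, and the output unit ORs them, using the weights from the course slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta1 = np.array([[-30.0,  20.0,  20.0],    # a_1 = x1 AND x2
                   [ 10.0, -20.0, -20.0]])   # a_2 = (NOT x1) AND (NOT x2)
theta2 = np.array([-10.0, 20.0, 20.0])       # output = a_1 OR a_2

def xnor(x1, x2):
    a = sigmoid(theta1 @ np.array([1.0, x1, x2]))        # hidden activations
    return sigmoid(theta2 @ np.concatenate(([1.0], a)))  # output unit

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xnor(x1, x2)))   # 1, 0, 0, 1
```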
6. Multi-class classification problems
To solve a multi-class classification problem with a neural network, we again use the one-vs-all idea. In binary classification the output is either 0 or 1; in a multi-class problem the output is a one-hot vector, $h_\theta(x) \in \mathbb{R}^K$, where $K$ is the number of classes.
For example, for a 4-class problem, the output might be:
Category 1: $\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$, category 2: $\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$, category 3: $\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}$, etc.
That is, rather than having $h_\theta(x)$ output the class labels 1, 2, 3, 4 directly. A sketch of this encoding follows below:
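A minimal sketch of the one-hot targets and of reading off the predicted class (the helper name `one_hot` is illustrative):

```python
import numpy as np

def one_hot(label, num_classes):
    # Encode a 0-indexed class label as a one-hot vector in R^K
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

h = np.array([0.1, 0.05, 0.7, 0.15])   # example network output for K = 4
predicted = np.argmax(h)               # predicted class index: 2
target = one_hot(predicted, 4)         # [0., 0., 1., 0.]
```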