Andrew Ng's "Deep Learning Engineer" specialization includes the following five courses:
1. Neural Networks and Deep Learning;
2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization;
3. Structuring Machine Learning Projects;
4. Convolutional Neural Networks;
5. Sequence Models.
Today we introduce the second lecture of the first course, "Neural Networks and Deep Learning": Basics of Neural Networks (Part 1).
Main content:
1. Binary classification;
2. Logistic regression and the form of its cost function;
3. Using computation graphs to describe the forward and backward propagation of a neural network;
4. Applying the gradient descent algorithm to logistic regression.
1. Binary Classification
Binary classification means the output y takes only discrete values, {0, 1} or {-1, 1}.
Take an image recognition problem as an example: determine whether there is a cat in a picture, where 0 represents non-cat and 1 represents cat.
In general, a color picture contains three RGB channels. We first convert the image input x (of dimension (64, 64, 3)) into a one-dimensional feature vector by reading out each channel row by row and concatenating the results, so the transformed feature vector has dimension 64 × 64 × 3 = 12288. This feature vector x is a column vector, and its dimension is generally denoted n_x.
If the training set has m pictures, the whole training set X forms a matrix of dimension (n_x, m).
Note that the n_x rows of the matrix X correspond to the features of each sample x^(i), and the m columns correspond to the m samples.
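This reshaping step can be sketched with NumPy as follows (a minimal example, assuming a hypothetical batch of m = 10 random 64×64×3 images in place of real photos):

```python
import numpy as np

# Hypothetical batch of m = 10 color images, each 64x64 with 3 (RGB) channels.
m = 10
images = np.random.rand(m, 64, 64, 3)

# Flatten each image into a column vector of length n_x = 64*64*3 = 12288,
# then stack the columns so X has shape (n_x, m): one column per sample.
X = images.reshape(m, -1).T
print(X.shape)  # (12288, 10)
```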
The outputs y of all training samples likewise form a one-dimensional row vector; written as a matrix, its dimension is (1, m).
2. Logistic Regression
This section explains how to use logistic regression to solve binary classification problems.
In logistic regression, the predicted value ŷ = P(y = 1 | x) represents the probability that the output is 1; unlike the hard labels of binary classification, its value lies in [0, 1].
Using a linear model, we introduce a weight parameter w and a bias parameter b. The dimension of w is (n_x, 1), and b is a scalar. The linear prediction of logistic regression can then be written as:
$\hat{y} = w^T x + b$
The linear output above ranges over all real numbers, but logistic regression requires the output to lie in [0, 1], so we introduce the sigmoid function to process the output:
$\hat{y} = \mathrm{sigmoid}(w^T x + b) = \sigma(w^T x + b)$
where the sigmoid function is:
$\sigma(z) = \frac{1}{1 + e^{-z}}$
When z is very large, the sigmoid function value tends to 1; when z is very small (a large negative number), it tends to 0; and when z = 0, the function value is 0.5.
The first derivative of the sigmoid function can be expressed in terms of the function itself:
$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
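The sigmoid function and its self-referential derivative can be written as a short sketch (the function names `sigmoid` and `sigmoid_prime` are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative expressed via the function itself: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_prime(0.0))  # 0.25
```

Note that the derivative peaks at z = 0 and vanishes for large |z|, matching the saturation behavior described above.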
In logistic regression, the weight parameter w and bias parameter b are obtained by iterative training. We therefore need to define a cost function; the corresponding w and b are found by optimizing it.
For m training samples, we use a superscript to index the samples: (x^(i), y^(i)) denotes the i-th sample.
How do we define the cost function over all m samples?
Starting from a single sample, we want the predicted value ŷ to be close to the true value y. We call the cost on a single sample the loss function, and we can construct a convex loss function as follows:
$L(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right)$
When y = 1, L(ŷ, y) = −log ŷ: the closer ŷ is to 1, the closer L(ŷ, y) is to 0 and the better the prediction.
When y = 0, L(ŷ, y) = −log(1 − ŷ): the closer ŷ is to 0, the closer L(ŷ, y) is to 0 and the better the prediction.
Therefore, this loss function reflects well how close the predicted output ŷ is to the true sample output y.
For m samples, we define the cost function as the average of the loss function over the m samples:
$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right]$
The cost function is a function of the weight parameter w and the bias parameter b. Our goal is to iteratively compute the w and b that minimize the cost function.
3. Gradient Descent
We use the gradient descent algorithm to find suitable w and b that minimize the cost function J(w, b) over the m training samples.
Since J(w, b) is a convex function, gradient descent starts from some initial choice of parameters w and b (for example, chosen at random), then repeatedly takes a small step in the direction opposite to the gradient, continually revising w and b. In each iteration, w and b are updated with the following expressions:
$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
In the formulas above, α is the learning rate, which sets the step size of gradient descent: the larger α is, the larger each update to w and b.
Because J(w, b) is convex, gradient descent moves w and b toward the global minimum of J(w, b) at every iteration.
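As a minimal one-dimensional illustration of this idea (my own toy example, not from the course), consider gradient descent on the convex function J(w) = (w − 3)²:

```python
# Gradient descent on the convex function J(w) = (w - 3)^2.
# Each step moves opposite the gradient, so w converges to the global minimum w = 3.
w, alpha = 0.0, 0.1
for _ in range(100):
    dJ_dw = 2 * (w - 3)    # gradient of J at the current w
    w = w - alpha * dJ_dw  # step of size alpha opposite the gradient
print(round(w, 4))  # 3.0
```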
4. Computation Graph
The training process of a neural network includes forward propagation and back propagation.
We explain the two processes using a computation graph. For example, suppose the cost function is J(a, b, c) = 3(a + bc), which contains the three variables a, b, c. Let u = bc and v = a + u; then J = 3v. Its computation graph can be drawn as shown in the figure below:
Let a = 5, b = 3, c = 2.
Forward propagation process:
From left to right, u = bc = 6, v = a + u = 11, J = 3v = 33.
Back propagation process:
The partial derivative of J with respect to a: from right to left, J is a function of v, and v is a function of a. By the chain rule:
$\frac{\partial J}{\partial a} = \frac{\partial J}{\partial v} \cdot \frac{\partial v}{\partial a} = 3 \cdot 1 = 3$
The partial derivative of J with respect to b: from right to left, J is a function of v, v is a function of u, and u is a function of b. We can derive:
$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial v} \cdot \frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial b} = 3 \cdot 1 \cdot c = 3 \cdot 1 \cdot 2 = 6$
The partial derivative of J with respect to c: from right to left, J is a function of v, v is a function of u, and u is a function of c. We can derive:
$\frac{\partial J}{\partial c} = \frac{\partial J}{\partial v} \cdot \frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial c} = 3 \cdot 1 \cdot b = 3 \cdot 1 \cdot 3 = 9$
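The forward and backward passes through this computation graph can be written out directly, mirroring the worked example above:

```python
# Computation graph for J = 3(a + b*c), with a = 5, b = 3, c = 2.
a, b, c = 5.0, 3.0, 2.0

# Forward pass: left to right.
u = b * c        # u = 6
v = a + u        # v = 11
J = 3 * v        # J = 33

# Backward pass: right to left, applying the chain rule at each node.
dJ_dv = 3.0                # J = 3v
dJ_da = dJ_dv * 1.0        # dv/da = 1 -> 3
dJ_du = dJ_dv * 1.0        # dv/du = 1
dJ_db = dJ_du * c          # du/db = c -> 6
dJ_dc = dJ_du * b          # du/dc = b -> 9

print(J, dJ_da, dJ_db, dJ_dc)  # 33.0 3.0 6.0 9.0
```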
For a single sample, the logistic regression loss function is given by:
$z = w^T x + b$
$\hat{y} = a = \sigma(z)$
$L(a, y) = -\left(y \log a + (1 - y)\log(1 - a)\right)$
We now compute the back propagation process for this logistic regression:
$da = \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}$
$dz = \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) \cdot a(1 - a) = a - y$
$dw_1 = \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w_1} = x_1 \, dz = x_1(a - y)$
$dw_2 = \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w_2} = x_2 \, dz = x_2(a - y)$
$db = \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial b} = 1 \cdot dz = a - y$
One step of the gradient descent algorithm can then be expressed as:
$w_1 := w_1 - \alpha \, dw_1$
$w_2 := w_2 - \alpha \, dw_2$
$b := b - \alpha \, db$
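One forward pass, backward pass, and update for a single sample can be sketched as follows (the sample values x = (1, 2), y = 1 and the zero initialization are my own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One gradient step on a single hypothetical sample with two features.
x = np.array([1.0, 2.0])   # features x1, x2
y = 1.0                    # true label
w = np.array([0.0, 0.0])   # weights w1, w2
b = 0.0
alpha = 0.1

# Forward pass.
z = np.dot(w, x) + b
a = sigmoid(z)

# Backward pass, using the derivatives derived above.
dz = a - y
dw = x * dz                # dw1 = x1*dz, dw2 = x2*dz
db = dz

# Gradient descent update.
w = w - alpha * dw
b = b - alpha * db
print(w, b)  # [0.05 0.1 ] 0.05
```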
For m samples, the cost function is:
$z^{(i)} = w^T x^{(i)} + b$
$\hat{y}^{(i)} = a^{(i)} = \sigma(z^{(i)})$
$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right]$
$dw_1 = \frac{1}{m}\sum_{i=1}^{m} x_1^{(i)}\left(a^{(i)} - y^{(i)}\right)$
$dw_2 = \frac{1}{m}\sum_{i=1}^{m} x_2^{(i)}\left(a^{(i)} - y^{(i)}\right)$
$db = \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right)$
Thus, the gradients of w and b in each iteration are averages computed over the m training samples. After each gradient computation, w and b are updated according to the gradient descent algorithm:
$w_1 := w_1 - \alpha \, dw_1$
$w_2 := w_2 - \alpha \, dw_2$
$b := b - \alpha \, db$
After enough iterations, the gradient descent algorithm converges, completing the training.
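The whole procedure can be sketched as a vectorized training loop, averaging the gradients over all m samples each iteration. The synthetic data (features drawn from a normal distribution, labeled by the sign of their sum) is my own stand-in for a real dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic, linearly separable data: m samples with n_x features each,
# labeled 1 when the features sum to a positive number.
rng = np.random.default_rng(0)
n_x, m = 2, 200
X = rng.normal(size=(n_x, m))                          # shape (n_x, m)
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)   # shape (1, m)

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.5

for _ in range(1000):
    # Forward pass, vectorized over all m samples.
    A = sigmoid(w.T @ X + b)       # shape (1, m)
    # Backward pass: gradients averaged over the m samples.
    dZ = A - Y
    dw = (X @ dZ.T) / m            # shape (n_x, 1)
    db = dZ.sum() / m
    # Gradient descent update.
    w -= alpha * dw
    b -= alpha * db

accuracy = ((A > 0.5) == Y).mean()
print(accuracy)
```

Vectorizing over the m samples replaces an explicit loop over i with matrix operations, which is both shorter and much faster in NumPy.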