Wunda "Deep Learning Engineer" Learning Notes (II.) _ Two classification

The Wunda "Deep learning engineer" Special course includes the following five courses:

1. Neural Networks and Deep Learning;
2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization, and Optimization;
3. Structuring Machine Learning Projects;
4. Convolutional Neural Networks;
5. Sequence Models.

Today we cover the second lecture of the first course, "Neural Networks and Deep Learning": Basics of Neural Networks (Part 1).

Main content:

1. The binary classification problem;

2. Logistic regression and its cost function;

3. Using computation graphs to describe the forward and backward propagation of a neural network;

4. Applying gradient descent to logistic regression.

1. Binary Classification

In binary classification, the output y takes only discrete values, for example {0, 1} or {-1, 1}.

Take an image recognition problem as an example: determine whether a picture contains a cat, where 1 means "cat" and 0 means "not cat".

A color picture generally contains three RGB channels. We first convert the image input x (of dimension (64, 64, 3)) into a one-dimensional feature vector by unrolling each channel row by row and concatenating the results, which gives an input feature vector of dimension 64×64×3 = 12288. This feature vector x is a column vector, and its dimension is usually denoted n_x.

If the training set contains m pictures, the whole training set forms a matrix X of dimension (n_x, m).

Note that the number of rows n_x of matrix X is the number of features of each sample x^(i), and the number of columns m is the number of samples.
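
As a quick illustration, here is a minimal NumPy sketch of this reshaping (the batch of random 64×64×3 images and the variable names are assumptions made up for the example):

```python
import numpy as np

# Hypothetical batch of m = 10 RGB images, each of size 64 x 64 x 3.
m = 10
images = np.random.rand(m, 64, 64, 3)

# Flatten every image into a column of length n_x = 64*64*3 = 12288 and
# stack the columns, so that X has shape (n_x, m). The exact unrolling
# order does not matter as long as it is used consistently.
X = images.reshape(m, -1).T

print(X.shape)  # (12288, 10)
```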

The outputs y of all training samples likewise form a row vector, written as a matrix Y of dimension (1, m).

2. Logistic Regression

How do we use logistic regression to solve a binary classification problem?

In logistic regression, the prediction ŷ = P(y = 1 | x) is the probability that y equals 1; unlike the hard 0/1 labels of binary classification, its value lies in the interval [0, 1].

We start from a linear model, introducing a weight parameter w and a bias parameter b. The weight w has dimension (n_x, 1), and b is a scalar. The linear prediction of logistic regression can be written as:

ŷ = w^T x + b

This linear output ranges over all real numbers, but logistic regression requires the output to lie in [0, 1], so we introduce the sigmoid function to process the output:

ŷ = sigmoid(w^T x + b) = σ(w^T x + b)

where the sigmoid function is:

sigmoid(z) = 1 / (1 + e^(−z))

When z is very large, the sigmoid value tends to 1; when z is very small (large and negative), it tends to 0; and when z = 0, its value is 0.5.

The first derivative of the sigmoid function can be expressed in terms of the function itself:

σ′(z) = σ(z)(1 − σ(z))
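
Below is a minimal NumPy sketch of the sigmoid, its derivative, and the logistic regression prediction ŷ = σ(w^T x + b); the feature size, zero-initialized parameters, and random input are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Logistic regression prediction y_hat = sigmoid(w^T x + b).
n_x = 12288                    # number of input features
x = np.random.rand(n_x, 1)     # one example as a column vector
w = np.zeros((n_x, 1))         # weight vector of shape (n_x, 1)
b = 0.0                        # scalar bias

y_hat = sigmoid(w.T @ x + b)   # probability that y = 1; here 0.5
print(y_hat.item())
```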

In logistic regression, the weight parameter w and the bias parameter b must be learned by iterative training. We therefore need to define a cost function; the desired w and b are obtained by optimizing it.

For the m training samples, we use a parenthesized superscript to index a particular sample; for example, (x^(i), y^(i)) denotes the i-th sample.

How do we define the cost function over all m samples?

Starting from a single sample, we want the prediction ŷ to be as close as possible to the true value y. The cost on a single sample is called the loss function, and we can construct a convex loss function as follows:

L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ))

When y = 1, L(ŷ, y) = −log ŷ: the closer ŷ is to 1, the closer L(ŷ, y) is to 0, and the better the prediction;

When y = 0, L(ŷ, y) = −log(1 − ŷ): the closer ŷ is to 0, the closer L(ŷ, y) is to 0, and the better the prediction.

This loss function therefore reflects how close the predicted output ŷ is to the true output y.

For m samples, we define the cost function as the average of the loss over the m samples:

J(w, b) = (1/m) ∑_{i=1}^{m} L(ŷ^(i), y^(i)) = −(1/m) ∑_{i=1}^{m} [y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i))]

The cost function is a function of the weight parameter w and the bias parameter b. Our goal is to iteratively find the w and b that minimize it.
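
Here is a small sketch of these two definitions (the toy predictions and labels are made up for the example):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single sample (also works element-wise)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy predictions and labels for m = 4 samples.
y_hat = np.array([0.9, 0.2, 0.7, 0.1])
y = np.array([1.0, 0.0, 1.0, 0.0])

# Cost J(w, b): the average loss over the m samples.
J = np.mean(loss(y_hat, y))
print(J)  # about 0.20, small because every prediction is on the correct side
```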

3. Gradient Descent

We use the gradient descent algorithm to find w and b that minimize the cost function J(w, b) over the m training samples.

Since J(w, b) is convex, gradient descent starts from some initial parameters w and b and then repeatedly takes a small step in the direction opposite to the gradient, continually revising w and b. At each iteration, w and b are updated with the following expressions:

w := w − α ∂J(w, b)/∂w

b := b − α ∂J(w, b)/∂b

In the formulas above, α is the learning rate, which sets the step size of gradient descent: the larger α is, the bigger the "stride" of each update of w and b.

Because J(w, b) is convex, each iteration of gradient descent moves w and b toward the global minimum of J(w, b).
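
To make the update rule concrete, here is a toy sketch that minimizes a hypothetical one-dimensional convex function J(w) = (w − 3)^2 rather than the logistic regression cost; the learning rate and iteration count are arbitrary:

```python
# Minimize the toy function J(w) = (w - 3)^2, whose gradient is dJ/dw = 2*(w - 3).
w = 0.0        # initial parameter
alpha = 0.1    # learning rate

for _ in range(100):
    grad = 2 * (w - 3)      # gradient of J at the current w
    w = w - alpha * grad    # step opposite to the gradient

print(w)  # very close to 3, the global minimum of J
```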

4. Computation Graph

Training a neural network involves forward propagation and backward propagation.

We explain these two processes with a computation graph. Suppose the cost function is J(a, b, c) = 3(a + bc), which contains three variables a, b, c. Let u = bc and v = a + u, so that J = 3v. The computation graph chains these steps from left to right: (a, b, c) → u = bc → v = a + u → J = 3v.

Let a = 5, b = 3, c = 2.

Forward propagation:

Going through the graph from left to right: u = bc = 6, v = a + u = 11, J = 3v = 33.

Backward propagation:

Partial derivative of J with respect to a. Going from right to left, J is a function of v, and v is a function of a. By the chain rule:

∂J/∂a = (∂J/∂v)·(∂v/∂a) = 3·1 = 3

Partial derivative of J with respect to b. Going from right to left, J is a function of v, v is a function of u, and u is a function of b. Hence:

∂J/∂b = (∂J/∂v)·(∂v/∂u)·(∂u/∂b) = 3·1·c = 3·1·2 = 6

Partial derivative of J with respect to c. Going from right to left, J is a function of v, v is a function of u, and u is a function of c. Hence:

∂J/∂c = (∂J/∂v)·(∂v/∂u)·(∂u/∂c) = 3·1·b = 3·1·3 = 9
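
These forward and backward passes are easy to check in code (a plain Python sketch of the worked example above):

```python
# Forward propagation on the graph J = 3 * (a + b * c).
a, b, c = 5, 3, 2
u = b * c    # u = 6
v = a + u    # v = 11
J = 3 * v    # J = 33

# Backward propagation: apply the chain rule from right to left.
dJ_dv = 3
dJ_du = dJ_dv * 1    # v = a + u, so dv/du = 1
dJ_da = dJ_dv * 1    # dv/da = 1               -> 3
dJ_db = dJ_du * c    # u = b * c, so du/db = c -> 6
dJ_dc = dJ_du * b    # du/dc = b               -> 9

print(J, dJ_da, dJ_db, dJ_dc)  # 33 3 6 9
```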

For a single sample, the forward pass and loss of logistic regression are:

z = w^T x + b

ŷ = a = σ(z)

L(a, y) = −(y log(a) + (1 − y) log(1 − a))

The backward propagation for this logistic regression (assuming the input has two features x1 and x2, so that w = (w1, w2)) computes:

da = ∂L/∂a = −y/a + (1 − y)/(1 − a)

dz = ∂L/∂z = (∂L/∂a)·(∂a/∂z) = (−y/a + (1 − y)/(1 − a))·a(1 − a) = a − y

dw1 = ∂L/∂w1 = (∂L/∂z)·(∂z/∂w1) = x1·dz = x1(a − y)

dw2 = ∂L/∂w2 = (∂L/∂z)·(∂z/∂w2) = x2·dz = x2(a − y)

db = ∂L/∂b = (∂L/∂z)·(∂z/∂b) = 1·dz = a − y
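
Here is a sketch of this single-sample forward and backward pass with two features (the concrete feature values, label, and parameters are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training sample with two features, plus current parameter values.
x1, x2 = 1.0, 2.0
y = 1.0
w1, w2, b = 0.1, -0.2, 0.0

# Forward pass.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backward pass: gradients of the loss with respect to the parameters.
dz = a - y
dw1 = x1 * dz
dw2 = x2 * dz
db = dz
print(L, dw1, dw2, db)
```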

The gradient descent updates can then be expressed as:

w1 := w1 − α·dw1

w2 := w2 − α·dw2

b := b − α·db

For m samples, the forward pass and cost function are:

z^(i) = w^T x^(i) + b

ŷ^(i) = a^(i) = σ(z^(i))

J(w, b) = (1/m) ∑_{i=1}^{m} L(ŷ^(i), y^(i)) = −(1/m) ∑_{i=1}^{m} [y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i))]

Differentiating as in the single-sample case, the gradients of the cost are the averages of the per-sample gradients:

dw1 = (1/m) ∑_{i=1}^{m} x1^(i)(a^(i) − y^(i))

dw2 = (1/m) ∑_{i=1}^{m} x2^(i)(a^(i) − y^(i))

db = (1/m) ∑_{i=1}^{m} (a^(i) − y^(i))
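
In code, these averaged gradients can be computed with one pass over the training set. Below is a minimal sketch that follows the two-feature formulas above; the function name `gradients` and its argument layout are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(w1, w2, b, x1, x2, y):
    """Averaged logistic regression gradients, following the formulas above.

    x1, x2, y are length-m arrays; w1, w2, b are scalars.
    """
    m = len(y)
    dw1 = dw2 = db = 0.0
    for i in range(m):
        z = w1 * x1[i] + w2 * x2[i] + b
        a = sigmoid(z)
        dw1 += x1[i] * (a - y[i])
        dw2 += x2[i] * (a - y[i])
        db += (a - y[i])
    return dw1 / m, dw2 / m, db / m
```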

Thus, in each iteration the gradients of w and b are averaged over the m training samples. After computing them, w and b are updated by gradient descent:

w1 := w1 − α·dw1

w2 := w2 − α·dw2

b := b − α·db

Repeating these iterations completes the gradient descent algorithm.
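
Putting the pieces together, a toy training loop might look like the following usage sketch, which reuses the `sigmoid` and `gradients` helpers from the sketch above; the synthetic data, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

# Toy training set: m = 100 samples with two features, labels in {0, 1}.
np.random.seed(0)
x1 = np.random.randn(100)
x2 = np.random.randn(100)
y = (x1 + x2 > 0).astype(float)

# Run gradient descent using the gradients() helper from the sketch above.
w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.5
for _ in range(1000):
    dw1, dw2, db = gradients(w1, w2, b, x1, x2, y)
    w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db

print(w1, w2, b)  # w1 and w2 end up positive, matching the toy labeling rule
```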
