Wunda "Deep Learning Engineer" Learning Notes (II.) _ Two classification

The Wunda "Deep learning engineer" Special course includes the following five courses:

1. Neural Networks and Deep Learning;
2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization, and Optimization;
3. Structuring Machine Learning Projects;
4. Convolutional Neural Networks;
5. Sequence Models.

Today we cover the second lecture of the first course, "Neural Networks and Deep Learning": Basics of Neural Networks (Part 1).

Main content:

1. The binary classification problem;

2. Logistic regression and its cost function;

3. Using computation graphs to describe the forward and backward propagation of a neural network;

4. Applying gradient descent to logistic regression.

1. Binary Classification

In binary classification, the output y takes only discrete values, for example {0, 1} or {-1, 1}.

Take an image recognition problem as an example: determine whether a picture contains a cat, where 1 means "cat" and 0 means "not cat".

A color picture generally contains three RGB channels. We first convert the image input x (of dimension (64, 64, 3)) into a one-dimensional feature vector by unrolling each channel row by row and concatenating the results, which gives an input feature vector of dimension 64×64×3 = 12288. This feature vector x is a column vector, and its dimension is usually denoted n_x.

If the training set contains m pictures, the whole training set forms a matrix X of dimension (n_x, m).

Note that the number of rows n_x of matrix X is the number of features of each sample x^(i), and the number of columns m is the number of samples.
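
As a quick illustration, here is a minimal NumPy sketch of this reshaping (the batch of random 64×64×3 images and the variable names are assumptions made up for the example):

```python
import numpy as np

# Hypothetical batch of m = 10 RGB images, each of size 64 x 64 x 3.
m = 10
images = np.random.rand(m, 64, 64, 3)

# Flatten every image into a column of length n_x = 64*64*3 = 12288 and
# stack the columns, so that X has shape (n_x, m). The exact unrolling
# order does not matter as long as it is used consistently.
X = images.reshape(m, -1).T

print(X.shape)  # (12288, 10)
```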

The outputs y of all training samples likewise form a row vector, written as a matrix Y of dimension (1, m).

2. Logistic Regression

How do we use logistic regression to solve a binary classification problem?

In logistic regression, the prediction ŷ = P(y = 1 | x) is the probability that y equals 1; unlike the hard 0/1 labels of binary classification, its value lies in the interval [0, 1].

We start from a linear model, introducing a weight parameter w and a bias parameter b. The weight w has dimension (n_x, 1), and b is a scalar. The linear prediction of logistic regression can be written as:

ŷ = w^T x + b

This linear output ranges over all real numbers, but logistic regression requires the output to lie in [0, 1], so we introduce the sigmoid function to process the output:

ŷ = sigmoid(w^T x + b) = σ(w^T x + b)

where the sigmoid function is:

sigmoid(z) = 1 / (1 + e^(−z))

When z is very large, the sigmoid value tends to 1; when z is very small (large and negative), it tends to 0; and when z = 0, its value is 0.5.

The first derivative of the sigmoid function can be expressed in terms of the function itself:

σ′(z) = σ(z)(1 − σ(z))
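
Below is a minimal NumPy sketch of the sigmoid, its derivative, and the logistic regression prediction ŷ = σ(w^T x + b); the feature size, zero-initialized parameters, and random input are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Logistic regression prediction y_hat = sigmoid(w^T x + b).
n_x = 12288                    # number of input features
x = np.random.rand(n_x, 1)     # one example as a column vector
w = np.zeros((n_x, 1))         # weight vector of shape (n_x, 1)
b = 0.0                        # scalar bias

y_hat = sigmoid(w.T @ x + b)   # probability that y = 1; here 0.5
print(y_hat.item())
```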

In logistic regression, the weight parameter w and the bias parameter b must be learned by iterative training. We therefore need to define a cost function; the desired w and b are obtained by optimizing it.

For the m training samples, we use a parenthesized superscript to index a particular sample; for example, (x^(i), y^(i)) denotes the i-th sample.

How do we define the cost function over all m samples?

Starting from a single sample, we want the prediction ŷ to be as close as possible to the true value y. The cost on a single sample is called the loss function, and we can construct a convex loss function as follows:

L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ))

When y = 1, L(ŷ, y) = −log ŷ: the closer ŷ is to 1, the closer L(ŷ, y) is to 0, and the better the prediction;

When y = 0, L(ŷ, y) = −log(1 − ŷ): the closer ŷ is to 0, the closer L(ŷ, y) is to 0, and the better the prediction.

This loss function therefore reflects how close the predicted output ŷ is to the true output y.

For m samples, we define the cost function as the average of the loss over the m samples:

J(w, b) = (1/m) ∑_{i=1}^{m} L(ŷ^(i), y^(i)) = −(1/m) ∑_{i=1}^{m} [y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i))]

The cost function is a function of the weight parameter w and the bias parameter b. Our goal is to iteratively find the w and b that minimize it.
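
Here is a small sketch of these two definitions (the toy predictions and labels are made up for the example):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single sample (also works element-wise)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy predictions and labels for m = 4 samples.
y_hat = np.array([0.9, 0.2, 0.7, 0.1])
y = np.array([1.0, 0.0, 1.0, 0.0])

# Cost J(w, b): the average loss over the m samples.
J = np.mean(loss(y_hat, y))
print(J)  # about 0.20, small because every prediction is on the correct side
```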

3. Gradient Descent

We use the gradient descent algorithm to find w and b that minimize the cost function J(w, b) over the m training samples.

Since J(w, b) is convex, gradient descent starts from some initial parameters w and b and then repeatedly takes a small step in the direction opposite to the gradient, continually revising w and b. At each iteration, w and b are updated with the following expressions:

w := w − α ∂J(w, b)/∂w

b := b − α ∂J(w, b)/∂b

In the formulas above, α is the learning rate, which sets the step size of gradient descent: the larger α is, the bigger the "stride" of each update of w and b.

Because J(w, b) is convex, each iteration of gradient descent moves w and b toward the global minimum of J(w, b).
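
To make the update rule concrete, here is a toy sketch that minimizes a hypothetical one-dimensional convex function J(w) = (w − 3)^2 rather than the logistic regression cost; the learning rate and iteration count are arbitrary:

```python
# Minimize the toy function J(w) = (w - 3)^2, whose gradient is dJ/dw = 2*(w - 3).
w = 0.0        # initial parameter
alpha = 0.1    # learning rate

for _ in range(100):
    grad = 2 * (w - 3)      # gradient of J at the current w
    w = w - alpha * grad    # step opposite to the gradient

print(w)  # very close to 3, the global minimum of J
```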

4. Computation Graph

Training a neural network involves forward propagation and backward propagation.

We explain these two processes with a computation graph. Suppose the cost function is J(a, b, c) = 3(a + bc), which contains three variables a, b, c. Let u = bc and v = a + u, so that J = 3v. The computation graph chains these steps from left to right: (a, b, c) → u = bc → v = a + u → J = 3v.

Let a = 5, b = 3, c = 2.

Forward propagation:

Going through the graph from left to right: u = bc = 6, v = a + u = 11, J = 3v = 33.

Backward propagation:

Partial derivative of J with respect to a. Going from right to left, J is a function of v, and v is a function of a. By the chain rule:

∂J/∂a = (∂J/∂v)·(∂v/∂a) = 3·1 = 3

Partial derivative of J with respect to b. Going from right to left, J is a function of v, v is a function of u, and u is a function of b. Hence:

∂J/∂b = (∂J/∂v)·(∂v/∂u)·(∂u/∂b) = 3·1·c = 3·1·2 = 6

Partial derivative of J with respect to c. Going from right to left, J is a function of v, v is a function of u, and u is a function of c. Hence:

∂J/∂c = (∂J/∂v)·(∂v/∂u)·(∂u/∂c) = 3·1·b = 3·1·3 = 9
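
These forward and backward passes are easy to check in code (a plain Python sketch of the worked example above):

```python
# Forward propagation on the graph J = 3 * (a + b * c).
a, b, c = 5, 3, 2
u = b * c    # u = 6
v = a + u    # v = 11
J = 3 * v    # J = 33

# Backward propagation: apply the chain rule from right to left.
dJ_dv = 3
dJ_du = dJ_dv * 1    # v = a + u, so dv/du = 1
dJ_da = dJ_dv * 1    # dv/da = 1               -> 3
dJ_db = dJ_du * c    # u = b * c, so du/db = c -> 6
dJ_dc = dJ_du * b    # du/dc = b               -> 9

print(J, dJ_da, dJ_db, dJ_dc)  # 33 3 6 9
```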

For a single sample, the forward pass and loss of logistic regression are:

z = w^T x + b

ŷ = a = σ(z)

L(a, y) = −(y log(a) + (1 − y) log(1 − a))

The backward propagation for this logistic regression (assuming the input has two features x1 and x2, so that w = (w1, w2)) computes:

da = ∂L/∂a = −y/a + (1 − y)/(1 − a)

dz = ∂L/∂z = (∂L/∂a)·(∂a/∂z) = (−y/a + (1 − y)/(1 − a))·a(1 − a) = a − y

dw1 = ∂L/∂w1 = (∂L/∂z)·(∂z/∂w1) = x1·dz = x1(a − y)

dw2 = ∂L/∂w2 = (∂L/∂z)·(∂z/∂w2) = x2·dz = x2(a − y)

db = ∂L/∂b = (∂L/∂z)·(∂z/∂b) = 1·dz = a − y
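
Here is a sketch of this single-sample forward and backward pass with two features (the concrete feature values, label, and parameters are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training sample with two features, plus current parameter values.
x1, x2 = 1.0, 2.0
y = 1.0
w1, w2, b = 0.1, -0.2, 0.0

# Forward pass.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backward pass: gradients of the loss with respect to the parameters.
dz = a - y
dw1 = x1 * dz
dw2 = x2 * dz
db = dz
print(L, dw1, dw2, db)
```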

The gradient descent updates can then be expressed as:

w1 := w1 − α·dw1

w2 := w2 − α·dw2

b := b − α·db

For m samples, the forward pass and cost function are:

z^(i) = w^T x^(i) + b

ŷ^(i) = a^(i) = σ(z^(i))

J(w, b) = (1/m) ∑_{i=1}^{m} L(ŷ^(i), y^(i)) = −(1/m) ∑_{i=1}^{m} [y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i))]

Differentiating as in the single-sample case, the gradients of the cost are the averages of the per-sample gradients:

dw1 = (1/m) ∑_{i=1}^{m} x1^(i)(a^(i) − y^(i))

dw2 = (1/m) ∑_{i=1}^{m} x2^(i)(a^(i) − y^(i))

db = (1/m) ∑_{i=1}^{m} (a^(i) − y^(i))
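
In code, these averaged gradients can be computed with one pass over the training set. Below is a minimal sketch that follows the two-feature formulas above; the function name `gradients` and its argument layout are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(w1, w2, b, x1, x2, y):
    """Averaged logistic regression gradients, following the formulas above.

    x1, x2, y are length-m arrays; w1, w2, b are scalars.
    """
    m = len(y)
    dw1 = dw2 = db = 0.0
    for i in range(m):
        z = w1 * x1[i] + w2 * x2[i] + b
        a = sigmoid(z)
        dw1 += x1[i] * (a - y[i])
        dw2 += x2[i] * (a - y[i])
        db += (a - y[i])
    return dw1 / m, dw2 / m, db / m
```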

Thus, in each iteration the gradients of w and b are averaged over the m training samples. After computing them, w and b are updated by gradient descent:

w1 := w1 − α·dw1

w2 := w2 − α·dw2

b := b − α·db

Repeating these iterations completes the gradient descent algorithm.
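
Putting the pieces together, a toy training loop might look like the following usage sketch, which reuses the `sigmoid` and `gradients` helpers from the sketch above; the synthetic data, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

# Toy training set: m = 100 samples with two features, labels in {0, 1}.
np.random.seed(0)
x1 = np.random.randn(100)
x2 = np.random.randn(100)
y = (x1 + x2 > 0).astype(float)

# Run gradient descent using the gradients() helper from the sketch above.
w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.5
for _ in range(1000):
    dw1, dw2, db = gradients(w1, w2, b, x1, x2, y)
    w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db

print(w1, w2, b)  # w1 and w2 end up positive, matching the toy labeling rule
```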
