Coursera Machine Learning Chapter 5 (Neural Networks: Learning) Study Notes


Section 5.1: Cost Function
This section covers the cost function of a neural network.

First, review some of the notation used with neural networks:

L: the total number of layers in the network.

s_l: the number of units in layer l (not counting the bias unit).

Two kinds of classification problems: binary classification and multi-class classification.

The cost function of the neural network is shown below; note that this is the regularized form.
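
For reference, the regularized cost function (m training examples, K output units, L layers, s_l units in layer l) can be written as:

    J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
        \left[ y_k^{(i)} \log \big( h_\Theta(x^{(i)}) \big)_k
             + \big( 1 - y_k^{(i)} \big) \log \Big( 1 - \big( h_\Theta(x^{(i)}) \big)_k \Big) \right]
        + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( \Theta_{ji}^{(l)} \big)^2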

In the regularization term, the sums over i and j start from 1, so the weights that multiply the bias units are not regularized. Those weights could also be included (letting the index run from 0); the resulting cost function would not behave very differently, but excluding the bias weights is the more common convention.

Section 5.2: Backpropagation Algorithm
The algorithm for minimizing the cost function is backpropagation: find the parameters Θ that make J(Θ) smallest.

To minimize the cost function we can use gradient descent or some other advanced optimization algorithm, but first we need to be able to compute the cost function and the partial derivative for each parameter.

First, a quick review of forward propagation, using a 4-layer network as an example (a code sketch follows the definitions below).

Θ^(i): the weight matrix mapping layer i to layer i+1.

a^(i): the activation values of layer i.
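
As a concrete illustration, here is a minimal Octave sketch of forward propagation through such a 4-layer network; x, Theta1, Theta2, Theta3 and the sigmoid helper are assumed names for this example:

    % Forward propagation through a 4-layer network (sketch).
    % x is one example as a column vector; Theta1..Theta3 are the weight matrices.
    sigmoid = @(z) 1 ./ (1 + exp(-z));

    a1 = [1; x];              % input layer, with the bias unit added
    z2 = Theta1 * a1;
    a2 = [1; sigmoid(z2)];    % layer 2 activations, plus bias unit
    z3 = Theta2 * a2;
    a3 = [1; sigmoid(z3)];    % layer 3 activations, plus bias unit
    z4 = Theta3 * a3;
    a4 = sigmoid(z4);         % output layer: h_Theta(x)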


A simple introduction to the backpropagation algorithm (using only one training example).

1. The meaning of δ^(l)_j: it is the "error" of the activation value of the j-th neuron in layer l.

2. The derivative of the sigmoid function: g'(z) = g(z) * (1 - g(z)); the vectorized form is as shown in the figure.

3. The errors are computed from the output layer backwards. The implementation of the algorithm is shown in detail in the figure (and in the code sketch after this list).

4. There is no error term δ^(1) for layer 1 (the input layer).

5. It can be shown that, ignoring regularization (λ = 0), the partial derivative of J(Θ) with respect to Θ^(l)_ij is exactly a^(l)_j * δ^(l+1)_i.
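
A minimal Octave sketch of these steps for one training example, continuing the forward-propagation sketch above (the variable names and the sigmoidGradient helper are assumptions for illustration, not code from the course):

    % Derivative of the sigmoid: g'(z) = g(z) .* (1 - g(z))
    sigmoidGradient = @(z) sigmoid(z) .* (1 - sigmoid(z));

    % Errors, computed from the output layer backwards (no delta for layer 1).
    delta4 = a4 - y;                                              % y is the label vector
    delta3 = (Theta3(:, 2:end)' * delta4) .* sigmoidGradient(z3);
    delta2 = (Theta2(:, 2:end)' * delta3) .* sigmoidGradient(z2);

    % With lambda = 0, dJ/dTheta^(l)_ij = a^(l)_j * delta^(l+1)_i, i.e. in matrix form:
    grad3 = delta4 * a3';
    grad2 = delta3 * a2';
    grad1 = delta2 * a1';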

When there are many training examples, the backpropagation algorithm is implemented as follows. This process is essentially the computation of the partial derivatives:

Given a training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}; Δ^(l)_ij (initialized to 0 for all l, i, j) is an accumulator used to compute the partial derivative of J(Θ) with respect to Θ^(l)_ij.

Loop (for) over each example in the training set.

In each iteration:

1. Set a^(1) = x^(i), then use forward propagation to compute a^(l) for l = 2, 3, ..., L.

2. Compute the output-layer error δ^(L) = a^(L) - y^(i), then compute δ^(l) for l = L-1, ..., 2.

3. Accumulate: Δ^(l)_ij := Δ^(l)_ij + a^(l)_j * δ^(l+1)_i.

After the loop exits, D^(l)_ij is the partial derivative of J(Θ) with respect to Θ^(l)_ij; note the distinction between the j = 0 case (no regularization term) and the j ≠ 0 case (a regularization term is added). A code sketch follows.
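
A minimal Octave sketch of this loop for a 3-layer network (so only Theta1 and Theta2), assuming X holds one example per row, Y holds the corresponding label vectors one per row, and sigmoid/sigmoidGradient are defined as above; all names are illustrative:

    Delta1 = zeros(size(Theta1));     % gradient accumulators
    Delta2 = zeros(size(Theta2));

    for i = 1:m
        % 1. forward propagation for example i
        a1 = [1; X(i, :)'];
        z2 = Theta1 * a1;   a2 = [1; sigmoid(z2)];
        z3 = Theta2 * a2;   a3 = sigmoid(z3);

        % 2. errors, from the output layer backwards
        delta3 = a3 - Y(i, :)';
        delta2 = (Theta2(:, 2:end)' * delta3) .* sigmoidGradient(z2);

        % 3. accumulate
        Delta1 = Delta1 + delta2 * a1';
        Delta2 = Delta2 + delta3 * a2';
    end

    % D^(l)_ij: divide by m, and regularize every column except the first (j = 0)
    D1 = Delta1 / m;
    D2 = Delta2 / m;
    D1(:, 2:end) = D1(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
    D2(:, 2:end) = D2(:, 2:end) + (lambda / m) * Theta2(:, 2:end);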

Section 5.3: Backpropagation Intuition
This section walks through the steps of backpropagation in more detail, to build intuition about what they are doing.
The following illustrates forward propagation. For example, z^(3)_1 = Θ^(2)_10 * 1 + Θ^(2)_11 * a^(2)_1 + Θ^(2)_12 * a^(2)_2. Note that a^(i)_j = g(z^(i)_j).


The backpropagation equations are discussed for the case of a single output unit. (The cost(i) written on the slide is wrong.) For example, δ^(2)_2 = Θ^(2)_12 * δ^(3)_1 + Θ^(2)_22 * δ^(3)_2.

δ is the partial derivative of the cost function with respect to the intermediate quantity z (why?). δ measures how much we would like to change the intermediate values computed by the network, that is, how changing them affects the output h(x) of the whole network and the overall cost.
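
In formula form, for a single output unit and ignoring regularization, where cost(i) denotes the part of the cost contributed by example i:

    \delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \, \mathrm{cost}(i),
    \qquad
    \mathrm{cost}(i) = -\Big( y^{(i)} \log h_\Theta(x^{(i)})
                            + (1 - y^{(i)}) \log \big( 1 - h_\Theta(x^{(i)}) \big) \Big)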

The δ corresponding to the bias unit is not needed when computing the derivatives, so it is generally ignored.

Section 5.4: Implementation Note: Unrolling Parameters
The previous lesson discussed how to use backpropagation to compute the derivatives of the cost function. This lesson covers the conversion between the matrix form and the vector form of the parameters: how to unroll the parameter matrices into vectors so they can be used with advanced optimization routines.


The matrices D are the returned gradient values (the partial derivatives of the cost function with respect to the corresponding parameters Θ).

In logistic regression, we first write down the cost function and then call an advanced optimization routine (fminunc), passing in the cost function, the initial parameter vector theta, and the iteration options; after processing, it returns the optimized parameter vector θ. In a neural network, however, the parameters Θ are matrices rather than vectors, so a conversion between the two forms is needed.

To give a specific example:

The example is a 4-layer neural network (there is an error in the slide).

The parameter matrices Theta1, Theta2, Theta3 are unrolled into a single vector thetaVec: thetaVec = [Theta1(:); Theta2(:); Theta3(:)]

The first 110 elements of thetaVec are reshaped back into the 10x11 matrix Theta1: Theta1 = reshape(thetaVec(1:110), 10, 11)
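
In Octave, using the sizes from this example (Theta1: 10x11, Theta2: 10x11, Theta3: 1x11, and D1..D3 the corresponding gradient matrices; the element ranges below follow from those sizes):

    % unroll the matrices into long vectors
    thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
    DVec     = [D1(:); D2(:); D3(:)];

    % reshape back into matrices
    Theta1 = reshape(thetaVec(1:110),   10, 11);
    Theta2 = reshape(thetaVec(111:220), 10, 11);
    Theta3 = reshape(thetaVec(221:231), 1, 11);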

Overview of the basic algorithm:

1. Unroll the initial weight matrices Θ^(1), Θ^(2), Θ^(3) into a vector initialTheta, and pass initialTheta to the fminunc function as the initial parameter.

2. Inside the cost function, reshape the vector thetaVec back into the matrices Θ^(1), Θ^(2), Θ^(3); use forward/backward propagation to compute D^(1), D^(2), D^(3) and J(Θ); then unroll D^(1), D^(2), D^(3) into gradientVec and return it.

The advantage of the matrix form is that it is convenient for taking full advantage of vectorization during forward and backward propagation. The advantage of the vector form (thetaVec, DVec) is that advanced optimization algorithms usually require all the parameters to be unrolled into one long vector.
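
Putting the pieces together, a minimal sketch of the calling code; costFunction is an assumed function that takes an unrolled thetaVec and returns [J, gradientVec], and the sizes follow the example above:

    % ask fminunc to use our gradient and cap the number of iterations
    options = optimset('GradObj', 'on', 'MaxIter', 100);

    % initialTheta is the unrolled vector of initial weight matrices
    initialTheta = [initialTheta1(:); initialTheta2(:); initialTheta3(:)];
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

    % reshape the optimized vector back into matrices for prediction
    Theta1 = reshape(optTheta(1:110),   10, 11);
    Theta2 = reshape(optTheta(111:220), 10, 11);
    Theta3 = reshape(optTheta(221:231), 1, 11);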

Section 5.5: Gradient Checking
Backpropagation involves many details, and the implementation can get a bit complicated, so it is easy to introduce small bugs. If you run a buggy implementation together with gradient descent or another optimization algorithm, it may look as if it is working: the cost J(Θ) may still decrease with every iteration of gradient descent. But the resulting neural network may end up with a higher error than a bug-free implementation would, and you might never realize that these small bugs are the cause.

Here is a simple way to catch such bugs: gradient checking. This method is suitable for verifying that the implementation of a fairly complex model is free of this kind of error.

To restate: gradient checking is a method for telling whether the backpropagation implementation is correct.

First, look at how to compute a numerical approximation to a derivative:

One-sided difference versus two-sided difference; the two-sided difference is used here, since it is slightly more accurate.

A practice exercise:

The idea of gradient checking is to compute a numerical approximation gradApprox to the partial derivatives, and then compare it with the DVec obtained by backpropagation; if the two are very close, the backpropagation implementation is correct.

Mathematical form:
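
The two-sided difference approximation, applied to each parameter θ_i in turn (ε is a small value such as 10^-4):

    \frac{\partial}{\partial \theta_i} J(\theta) \approx
        \frac{ J(\theta_1, \ldots, \theta_i + \epsilon, \ldots, \theta_n)
             - J(\theta_1, \ldots, \theta_i - \epsilon, \ldots, \theta_n) }{ 2 \epsilon }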

Summary:

1. Compute DVec using backpropagation.

2. Compute gradApprox using numerical gradient checking.

3. Compare DVec and gradApprox and check that they are numerically close.

4. Turn off gradient checking and use the backpropagation code for learning.

Remember to turn gradient checking off before training, because gradient checking is much slower than backpropagation.
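
A minimal Octave sketch of the check, assuming theta is the unrolled parameter vector and J is a handle to the cost function (both names are illustrative):

    EPSILON = 1e-4;
    n = length(theta);
    gradApprox = zeros(n, 1);

    for i = 1:n
        thetaPlus  = theta;   thetaPlus(i)  = thetaPlus(i)  + EPSILON;
        thetaMinus = theta;   thetaMinus(i) = thetaMinus(i) - EPSILON;
        gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
    end

    % gradApprox should agree with DVec to several significant digits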

Section 5.6: Random Initialization
This section introduces random initialization.
In logistic regression it is fine to initialize theta to all zeros, but in a neural network it is not. If all the initial values are identical, then all the weights remain identical after every update, and training achieves nothing. More precisely, for each layer l the parameter Θ^(l) is a matrix, and the problem occurs when all the weights in that matrix are given the same initial value. This is known as the problem of symmetric weights.

For example:

If all the Θ^(l)_ij are initialized to the same value, then, as the diagram shows, in forward propagation a^(2)_1 = a^(2)_2, and in backpropagation δ^(2)_1 = δ^(2)_2, so ∂J(Θ)/∂Θ^(1)_11 = ∂J(Θ)/∂Θ^(1)_21. In the gradient descent update, Θ^(1)_11 := Θ^(1)_11 - α*∂J(Θ)/∂Θ^(1)_11 and Θ^(1)_21 := Θ^(1)_21 - α*∂J(Θ)/∂Θ^(1)_21, so after the update we still have Θ^(1)_11 = Θ^(1)_21; similarly Θ^(1)_10 = Θ^(1)_20 and Θ^(1)_12 = Θ^(1)_22. That is, even after running gradient descent we still have a^(2)_1 = a^(2)_2: all the units in the hidden layer compute exactly the same function of the input, and every output unit produces the same value. Therefore, when initializing the parameters Θ, the elements Θ^(l)_ij of each weight matrix Θ^(l) must not all be set to the same value.

Now a practice question:

Answer: BD

rand(a, b) creates an a×b matrix whose elements are drawn uniformly from (0, 1).
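
In Octave, the weights are typically initialized to small random values in (-INIT_EPSILON, INIT_EPSILON); a sketch for two weight matrices of illustrative sizes (the value 0.12 is just an example of a small constant):

    INIT_EPSILON = 0.12;    % any small constant works here

    % rand(a, b) is uniform in (0, 1); shift and scale to (-INIT_EPSILON, INIT_EPSILON)
    Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
    Theta2 = rand(1, 11)  * (2 * INIT_EPSILON) - INIT_EPSILON;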


In general: randomly initialize the weights, run backpropagation, perform gradient checking, and then train using gradient descent or an advanced optimization algorithm.

Section 5.7: Putting It Together
This section gives a general review of the whole neural network learning procedure.

The first step in using a neural network is to choose the overall architecture:

The leftmost network in the figure is the most basic structure: input layer, hidden layer, output layer.

Note that if there are multiple hidden layers, the default is to give every hidden layer the same number of units. The number of input units is the dimension of the features x^(i), and the number of output units is the number of classes. In general, more hidden units tend to work better, but the amount of computation grows correspondingly; the number of hidden units should also be matched to the other quantities of the problem (typically comparable to, or a few times, the dimension of x).

The training of a neural network is implemented in 6 steps (a code sketch follows the list):

1. Randomly initialize the weights; the initial values are generally small.

2. For each x^(i), run forward propagation to compute h_Θ(x^(i)).

3. Implement the code that computes the cost function J(Θ).

4. Run backpropagation to compute the partial derivatives ∂J(Θ)/∂Θ^(l)_jk. For the m training examples, use a for loop to compute the activations a^(l) and the errors δ^(l).

5. Use gradient checking to verify the partial derivatives ∂J(Θ)/∂Θ^(l)_jk. Once the derivative code is confirmed to be correct, turn the gradient checking code off.

6. Use gradient descent or another advanced optimization algorithm, together with backpropagation, to find the Θ that minimizes J(Θ).
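
A compact Octave sketch tying the six steps together; nnCostFunction, the layer sizes, and the data variables X, y are assumptions for illustration only:

    % illustrative layer sizes and hyperparameters
    inputLayerSize  = 400;
    hiddenLayerSize = 25;
    numLabels       = 10;
    lambda          = 1;

    % step 1: random initialization (symmetry breaking)
    INIT_EPSILON = 0.12;
    initTheta1 = rand(hiddenLayerSize, inputLayerSize + 1) * 2 * INIT_EPSILON - INIT_EPSILON;
    initTheta2 = rand(numLabels, hiddenLayerSize + 1) * 2 * INIT_EPSILON - INIT_EPSILON;
    initialTheta = [initTheta1(:); initTheta2(:)];

    % steps 2-4: nnCostFunction runs forward propagation, computes J(Theta),
    % and runs backpropagation to return the unrolled gradient
    costFunction = @(t) nnCostFunction(t, inputLayerSize, hiddenLayerSize, ...
                                       numLabels, X, y, lambda);

    % step 5: gradient checking would be run here, and then disabled

    % step 6: minimize J(Theta) with an advanced optimizer plus backpropagation
    options  = optimset('GradObj', 'on', 'MaxIter', 100);
    optTheta = fminunc(costFunction, initialTheta, options);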

This figure illustrates gradient descent in a neural network: starting from a random initial point, it goes downhill step by step until it reaches a local optimum. Gradient descent and similar algorithms can at least guarantee convergence to a local optimum. The job of backpropagation is to compute the direction of the gradient, so that the cost function can keep decreasing until it reaches a (local) optimum.

Section 5.8: Autonomous Driving
This section presents an important application of neural network learning: using a neural network for autonomous driving.

Practice:

Answer: BD

Answer: AB

Answer: BC

Do not forget the programming assignment that goes with this chapter; it helps to consolidate the material.

Two points to note:

1. The regularization term of the regularized cost function must not include the weights that multiply the bias units.
Regularization formula: see the cost function in Section 5.1 (a code sketch follows these notes).


2. In backpropagation, when computing the partial derivatives with respect to the weights, the δ corresponding to the bias unit does not need to be computed.
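
For note 1, a minimal Octave sketch of the regularization term for a 3-layer network (Theta1, Theta2 and J_unreg are assumed names); the first column of each weight matrix, which multiplies the bias unit, is left out:

    % skip the first column (bias weights) when regularizing
    reg = (lambda / (2 * m)) * ( sum(sum(Theta1(:, 2:end) .^ 2)) ...
                               + sum(sum(Theta2(:, 2:end) .^ 2)) );
    J = J_unreg + reg;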
