Convolutional Neural Networks (I)

Source: Internet
Author: User

Excerpt from the UFLDL tutorial, link: http://deeplearning.stanford.edu/wiki/index.php/UFLDL%E6%95%99%E7%A8%8B

I. Overview

Taking supervised learning as an example, suppose we have a training set of labeled examples $(x^{(i)}, y^{(i)})$. Neural networks provide a way of defining a complex, nonlinear hypothesis $h_{W,b}(x)$, with parameters $W, b$ that we can fit to our data.


To describe neural networks, we start with the simplest possible network, one that consists of a single neuron:


This "neuron" is an arithmetic unit with an input value for the Intercept, whose output is, where the function is called an "activation function". In this tutorial, we use the sigmoid function as the activation function

It can be seen that the input-output mapping of this single "neuron" is exactly that of logistic regression.


Although this series of tutorials uses the sigmoid function, you can also choose the hyperbolic tangent function (tanh) instead:

$f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$


Below are plots of the sigmoid and tanh functions, respectively:

The $\tanh(z)$ function is a rescaled version of the sigmoid; its output range is $[-1, 1]$ instead of $[0, 1]$.


Note that unlike some other venues, including the OpenClassroom public course and the Stanford CS229 course, we are not using the convention $x_0 = 1$ here. Instead, we use a separate parameter $b$ to represent the intercept.


Finally, there is one identity we will often use later: if $f(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, then its derivative is $f'(z) = f(z)\left(1 - f(z)\right)$. (If $f$ is the tanh function, then its derivative is $f'(z) = 1 - \left(f(z)\right)^2$.) You can derive these identities yourself from the definition of the sigmoid (or tanh) function.
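
As a quick illustration, here is a minimal NumPy sketch of these activations and of their derivatives written in terms of the activation value itself; the helper names are illustrative, not part of the tutorial:

    import numpy as np

    def sigmoid(z):
        # f(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime_from_activation(a):
        # f'(z) = f(z) * (1 - f(z)), expressed via the activation a = f(z)
        return a * (1.0 - a)

    def tanh_prime_from_activation(a):
        # f'(z) = 1 - f(z)^2 for f = tanh
        return 1.0 - a ** 2

    z = np.array([-1.0, 0.0, 1.0])
    a = sigmoid(z)
    print(a, sigmoid_prime_from_activation(a))

Writing the derivatives in terms of the activation value is convenient because, as we will see, the activations are already stored during the forward pass.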


Neural Network Model

A neural network is built by hooking together many of these simple "neurons", so that the output of one "neuron" can be the input of another. For example, here is a small neural network:

We use circles to denote the inputs to the network. The circles labeled "+1" are called bias units and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer is called the output layer (which, in this example, has only one node). The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. We also say that this example neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.


We let $n_l$ denote the number of layers in the network; in this example $n_l = 3$. We label layer $l$ as $L_l$, so layer $L_1$ is the input layer and layer $L_{n_l}$ is the output layer. This neural network has parameters $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where $W^{(l)}_{ij}$ (as used in the equations below) denotes the weight associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$ (in other words, the weight on the connecting line; note the order of the indices), and $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$. Thus, in this example, $W^{(1)} \in \mathbb{R}^{3 \times 3}$ and $W^{(2)} \in \mathbb{R}^{1 \times 3}$. Note that there are no connections going into the bias units (i.e., bias units have no inputs), since they always output the value $+1$. We also let $s_l$ denote the number of nodes in layer $l$ (not counting the bias unit).


We write $a^{(l)}_i$ to denote the activation (output value) of unit $i$ in layer $l$. For $l = 1$, we also use $a^{(1)}_i = x_i$ to denote the $i$-th input value (the $i$-th feature of the input). Given a fixed setting of the parameters $W, b$, our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real number. The computation this neural network performs is as follows:

$a^{(2)}_1 = f\left( W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1 \right)$
$a^{(2)}_2 = f\left( W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2 \right)$
$a^{(2)}_3 = f\left( W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3 \right)$
$h_{W,b}(x) = a^{(3)}_1 = f\left( W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1 \right)$



In what follows, we also let $z^{(l)}_i$ denote the total weighted sum of inputs to unit $i$ in layer $l$, including the bias term (for example, $z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i$), so that $a^{(l)}_i = f(z^{(l)}_i)$.


This gives us a more compact notation. Specifically, we extend the activation function $f(\cdot)$ to apply to vectors element-wise (i.e., $f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$), so that the equations above can be written more succinctly as:

$z^{(2)} = W^{(1)} x + b^{(1)}$
$a^{(2)} = f\left( z^{(2)} \right)$
$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}$
$h_{W,b}(x) = a^{(3)} = f\left( z^{(3)} \right)$



We call the calculation above forward propagation. More generally, recalling that we also use $a^{(1)} = x$ to denote the activations of the input layer, then given layer $l$'s activations $a^{(l)}$, layer $l+1$'s activations $a^{(l+1)}$ can be computed as follows:

$z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$
$a^{(l+1)} = f\left( z^{(l+1)} \right)$



By organizing our parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform the computations in the network.
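
To make the forward propagation step concrete, here is a minimal NumPy sketch of the vectorized recurrence $z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$, $a^{(l+1)} = f(z^{(l+1)})$; the function and variable names are illustrative, not from the tutorial:

    import numpy as np

    def sigmoid(z):
        # f(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def forward_propagation(x, weights, biases):
        # weights[l-1] corresponds to W^(l) and biases[l-1] to b^(l) in the text,
        # so the 3-3-1 example network has weights of shape (3, 3) and (1, 3).
        activations = [x]        # a^(1) = x
        pre_activations = []     # z^(2), ..., z^(n_l)
        a = x
        for W, b in zip(weights, biases):
            z = W @ a + b        # z^(l+1) = W^(l) a^(l) + b^(l)
            a = sigmoid(z)       # a^(l+1) = f(z^(l+1))
            pre_activations.append(z)
            activations.append(a)
        return activations, pre_activations

    # Example: the 3-3-1 network described above, with small random parameters.
    rng = np.random.default_rng(0)
    weights = [rng.normal(0, 0.01, (3, 3)), rng.normal(0, 0.01, (1, 3))]
    biases = [np.zeros(3), np.zeros(1)]
    activations, _ = forward_propagation(np.array([1.0, 2.0, 3.0]), weights, biases)
    print(activations[-1])       # h_{W,b}(x)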


So far, we have discussed one example neural network, but one can also build neural networks with other architectures (where the architecture refers to the pattern of connectivity between neurons), including ones with multiple hidden layers. The most common choice is an $n_l$-layered network where layer $L_1$ is the input layer, layer $L_{n_l}$ is the output layer, and each layer $l$ is densely connected to layer $l+1$. In this setting, to compute the output of the network we can use the forward propagation equations described earlier: successively compute all the activations in layer $L_2$, then in layer $L_3$, and so on, up to layer $L_{n_l}$. This is an example of a feedforward neural network, since the connectivity graph has no closed loops or cycles.


Neural networks can also have multiple output units. For example, the following network has two hidden layers, $L_2$ and $L_3$, and an output layer $L_4$ with two output units.



To train such a network, we would need training examples $(x^{(i)}, y^{(i)})$ where $y^{(i)} \in \mathbb{R}^2$. This sort of network is useful when there are multiple outputs we are interested in predicting. (For example, in a medical diagnosis application, the vector $x$ might give a patient's input features, and the different output values $y_i$ might indicate the presence or absence of different diseases.)

II. Backpropagation Algorithm

Suppose we have a fixed training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of $m$ training examples. We can train our neural network using batch gradient descent. In detail, for a single training example $(x, y)$, the cost function with respect to that single example is:

$J(W, b; x, y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2$

This is a (one-half) squared-error cost function. Given a training set of $m$ examples, we can then define the overall cost function as:

$J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J\left( W, b; x^{(i)}, y^{(i)} \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2$

The first term in the formula above is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights and helps prevent overfitting.


[Note: Weight decay is usually not applied to the bias terms $b^{(l)}_i$, as reflected in our definition of $J(W, b)$. Applying weight decay to the bias units usually makes only a small difference to the final network. If you took Stanford's CS229 (Machine Learning) course or watched the course videos on YouTube, you may recognize this weight decay as essentially a variant of the Bayesian regularization method seen there, in which we place a Gaussian prior on the parameters and compute the MAP (maximum a posteriori) estimate rather than the maximum likelihood estimate.]


The weight decay parameter $\lambda$ controls the relative importance of the two terms in the formula. Note also the slightly overloaded notation: $J(W, b; x, y)$ is the squared-error cost computed for a single example, while $J(W, b)$ is the overall cost function, which includes the weight decay term.
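
As a rough illustration of the overall cost, here is a self-contained NumPy sketch that combines the average squared-error term with the weight decay term; the helper names are illustrative, not from the tutorial, and lam stands for $\lambda$:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def overall_cost(weights, biases, X, Y, lam):
        # X, Y: lists of training inputs x^(i) and targets y^(i); lam: weight decay parameter.
        m = len(X)
        error = 0.0
        for x, y in zip(X, Y):
            a = x
            for W, b in zip(weights, biases):   # forward propagation
                a = sigmoid(W @ a + b)
            error += 0.5 * np.sum((a - y) ** 2) # (1/2) ||h_{W,b}(x) - y||^2
        decay = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)  # biases are excluded
        return error / m + decay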


The cost function above is often used both for classification and for regression problems. For classification, we let $y = 0$ or $1$ represent the two class labels (recall that the sigmoid activation function outputs values in $[0, 1]$; if we were using the hyperbolic tangent activation function, we would instead use $-1$ and $+1$ as the labels). For regression problems, we first scale the output values (translator's note: that is, the $y$ values) to make sure they lie in the $[0, 1]$ range (or, if we were using the hyperbolic tangent activation function, in the $[-1, 1]$ range).


Our goal is to minimize $J(W, b)$ as a function of the parameters $W$ and $b$. To train the neural network, we initialize each parameter $W^{(l)}_{ij}$ and each $b^{(l)}_i$ to a small random value near zero (for example, drawn from a $\mathrm{Normal}(0, \epsilon^2)$ distribution for some small $\epsilon$, say $0.01$), and then apply an optimization algorithm such as batch gradient descent to the objective function. Because $J(W, b)$ is a non-convex function, gradient descent is susceptible to local optima; in practice, however, gradient descent usually obtains satisfactory results. Finally, it must be emphasized again that the parameters should be initialized randomly, not all to the same value. If all parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input (more formally, $W^{(1)}_{ij}$ will be the same for all values of $i$, so that $a^{(2)}_1 = a^{(2)}_2 = a^{(2)}_3 = \ldots$ for any input $x$). The purpose of random initialization is symmetry breaking.
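
A minimal sketch of such an initialization for a small network, assuming NumPy and the Normal(0, ε²) choice mentioned above; the layer-size list and helper name are illustrative:

    import numpy as np

    def initialize_parameters(layer_sizes, epsilon=0.01, seed=0):
        # layer_sizes = [s_1, s_2, ..., s_{n_l}], e.g. [3, 3, 1] for the example network.
        rng = np.random.default_rng(seed)
        weights, biases = [], []
        for s_in, s_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            weights.append(rng.normal(0.0, epsilon, size=(s_out, s_in)))  # W^(l): small random values
            biases.append(rng.normal(0.0, epsilon, size=s_out))           # b^(l): small random values
        return weights, biases

    weights, biases = initialize_parameters([3, 3, 1])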


Each iteration of gradient descent updates the parameters $W, b$ according to the following formulas:

$W^{(l)}_{ij} := W^{(l)}_{ij} - \alpha \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b)$
$b^{(l)}_i := b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W, b)$

where $\alpha$ is the learning rate. The key step is computing the partial derivatives above. We now describe the backpropagation algorithm, which gives an efficient way to compute these partial derivatives.


We first describe how backpropagation can be used to compute $\frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x, y)$ and $\frac{\partial}{\partial b^{(l)}_i} J(W, b; x, y)$, the partial derivatives of the cost function with respect to a single example. Once we can compute these, the derivatives of the overall cost function follow:

$\frac{\partial}{\partial W^{(l)}_{ij}} J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial W^{(l)}_{ij}} J\left( W, b; x^{(i)}, y^{(i)} \right) \right] + \lambda W^{(l)}_{ij}$
$\frac{\partial}{\partial b^{(l)}_i} J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial b^{(l)}_i} J\left( W, b; x^{(i)}, y^{(i)} \right)$


The two lines above differ slightly: the first has one extra term because weight decay is applied to $W$ but not to $b$.


The idea behind the backpropagation algorithm is as follows. Given a training example $(x, y)$, we first perform a "forward pass" to compute all the activations throughout the network, including the output value $h_{W,b}(x)$. Then, for each node $i$ in layer $l$, we compute a "residual" (error term) $\delta^{(l)}_i$ that measures how much that node was responsible for the error in the final output. For an output node, we can directly compute the difference between the activation produced by the network and the true target value, and we use that difference to define $\delta^{(n_l)}_i$ (where layer $n_l$ is the output layer). What about hidden units? For those, we compute $\delta^{(l)}_i$ as a weighted average of the residuals of the nodes (translator's note: the nodes in layer $l+1$) that use $a^{(l)}_i$ as an input. The details of the backpropagation algorithm are given below:


    1. Perform a feedforward pass, computing the activations for layers $L_2$, $L_3$, and so on, up to the output layer $L_{n_l}$, using the forward propagation equations.
    2. For each output unit $i$ in layer $n_l$ (the output layer), compute the residual according to the
      following formula:
      $\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \frac{1}{2} \left\| y - h_{W,b}(x) \right\|^2 = -\left( y_i - a^{(n_l)}_i \right) \cdot f'\left( z^{(n_l)}_i \right)$
      [Translator's note: this follows by writing the cost as $\frac{1}{2} \sum_j \left( y_j - f(z^{(n_l)}_j) \right)^2$ and differentiating with respect to $z^{(n_l)}_i$.]
    3. For each layer $l = n_l - 1, n_l - 2, \ldots, 2$, the residual of node $i$ in layer $l$ is computed as follows:
      $\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'\left( z^{(l)}_i \right)$
      [Translator's note: replacing the relation between layers $n_l - 1$ and $n_l$ above with the relation between layers $l$ and $l+1$ gives this formula; this successive differentiation backwards through the forward pass is the original meaning of "backpropagation".]
    4. Compute the partial derivatives we need, which are given by:
      $\frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x, y) = a^{(l)}_j \delta^{(l+1)}_i$
      $\frac{\partial}{\partial b^{(l)}_i} J(W, b; x, y) = \delta^{(l+1)}_i$

Finally, we rewrite the algorithm above using matrix-vector notation. We use "$\bullet$" to denote the element-wise product operator (written ".*" in MATLAB or Octave, and also known as the Hadamard product), so that if $a = b \bullet c$, then $a_i = b_i c_i$. Just as we extended the definition of $f(\cdot)$ to apply element-wise to vectors, we do the same here for $f'(\cdot)$ (so that $f'([z_1, z_2, z_3]) = [f'(z_1), f'(z_2), f'(z_3)]$).
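
As a quick note on how this notation maps to code: in NumPy, the `*` operator already performs this element-wise (Hadamard) product, while `@` performs matrix multiplication.

    import numpy as np

    b = np.array([1.0, 2.0, 3.0])
    c = np.array([4.0, 5.0, 6.0])

    a = b * c          # element-wise (Hadamard) product: a_i = b_i * c_i
    print(a)           # [ 4. 10. 18.]

    W = np.eye(3)
    print(W @ b)       # matrix-vector product, a different operation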


The backpropagation algorithm can then be written as the following steps:

    1. Perform a feedforward pass, computing the activations for layers $L_2$, $L_3$, and so on, up to the output layer $L_{n_l}$, using the forward propagation equations.
    2. For the output layer (layer $n_l$), compute: $\delta^{(n_l)} = -\left( y - a^{(n_l)} \right) \bullet f'\left( z^{(n_l)} \right)$
    3. For each layer $l = n_l - 1, n_l - 2, \ldots, 2$, compute: $\delta^{(l)} = \left( \left( W^{(l)} \right)^T \delta^{(l+1)} \right) \bullet f'\left( z^{(l)} \right)$
    4. Compute the values of the partial derivatives that are ultimately required: $\nabla_{W^{(l)}} J(W, b; x, y) = \delta^{(l+1)} \left( a^{(l)} \right)^T$ and $\nabla_{b^{(l)}} J(W, b; x, y) = \delta^{(l+1)}$


Implementation note: in steps 2 and 3 above, we need to compute $f'(z^{(l)}_i)$ for each value of $i$. Assuming $f(z)$ is the sigmoid function, we already have $a^{(l)}_i$ stored from the forward pass through the network. Thus, using the expression we derived earlier for $f'(z)$, we can compute this as $f'(z^{(l)}_i) = a^{(l)}_i \left( 1 - a^{(l)}_i \right)$.
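
To make steps 1 through 4 concrete, here is a minimal NumPy sketch of backpropagation for a single example, following the matrix-vector equations above and using the stored activations to evaluate $f'$; the helper names are illustrative, not from the tutorial:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_single_example(weights, biases, x, y):
        # Step 1: feedforward pass, storing a^(1), ..., a^(n_l).
        activations = [x]
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
            activations.append(a)

        # Step 2: residual for the output layer, using f'(z) = a (1 - a).
        delta = -(y - activations[-1]) * activations[-1] * (1.0 - activations[-1])

        grads_W = [None] * len(weights)
        grads_b = [None] * len(biases)
        # Steps 3 and 4: propagate residuals backwards and form the gradients.
        for l in range(len(weights) - 1, -1, -1):
            a_prev = activations[l]
            grads_W[l] = np.outer(delta, a_prev)   # delta^(l+1) (a^(l))^T
            grads_b[l] = delta                     # delta^(l+1)
            if l > 0:
                # delta^(l) = ((W^(l))^T delta^(l+1)) . f'(z^(l))
                delta = (weights[l].T @ delta) * a_prev * (1.0 - a_prev)
        return grads_W, grads_b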


Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code below, $\Delta W^{(l)}$ is a matrix of the same dimension as $W^{(l)}$, and $\Delta b^{(l)}$ is a vector of the same dimension as $b^{(l)}$. Note that "$\Delta W^{(l)}$" here is a matrix in its own right; it does not mean "$\Delta$ multiplied by $W^{(l)}$". Below, we implement one iteration of the batch gradient descent method:


    1. For all $l$, set $\Delta W^{(l)} := 0$ and $\Delta b^{(l)} := 0$ (an all-zero matrix and an all-zero vector, respectively).
    2. For $i = 1$ to $m$:
      1. Use the backpropagation algorithm to compute $\nabla_{W^{(l)}} J\left( W, b; x^{(i)}, y^{(i)} \right)$ and $\nabla_{b^{(l)}} J\left( W, b; x^{(i)}, y^{(i)} \right)$.
      2. Set $\Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J\left( W, b; x^{(i)}, y^{(i)} \right)$.
      3. Set $\Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J\left( W, b; x^{(i)}, y^{(i)} \right)$.
    3. Update the weight parameters:
      $W^{(l)} := W^{(l)} - \alpha \left[ \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \right]$
      $b^{(l)} := b^{(l)} - \alpha \left[ \frac{1}{m} \Delta b^{(l)} \right]$

Now, we can repeatedly take these gradient descent steps to reduce the value of the cost function $J(W, b)$, and thereby train our neural network.
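
A rough NumPy sketch of one such batch gradient descent iteration, reusing the hypothetical backprop_single_example helper from the earlier sketch (alpha stands for $\alpha$, lam for $\lambda$; all names are illustrative):

    import numpy as np

    def batch_gradient_descent_step(weights, biases, X, Y, alpha, lam):
        # Step 1: set Delta W^(l) := 0 and Delta b^(l) := 0 for all l.
        delta_W = [np.zeros_like(W) for W in weights]
        delta_b = [np.zeros_like(b) for b in biases]
        m = len(X)

        # Step 2: accumulate per-example gradients computed by backpropagation
        # (backprop_single_example is the helper sketched above).
        for x, y in zip(X, Y):
            grads_W, grads_b = backprop_single_example(weights, biases, x, y)
            for l in range(len(weights)):
                delta_W[l] += grads_W[l]
                delta_b[l] += grads_b[l]

        # Step 3: update the parameters, applying weight decay to W but not to b.
        for l in range(len(weights)):
            weights[l] -= alpha * (delta_W[l] / m + lam * weights[l])
            biases[l] -= alpha * (delta_b[l] / m)
        return weights, biases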

