MXNET: Multilayer Neural Networks


The multilayer perceptron (MLP) is the most basic deep learning model.
A multilayer perceptron adds one or more hidden layers to a single-layer neural network. The hidden layer sits between the input layer and the output layer. Each neuron in the hidden layer is fully connected to every unit in the input layer, and each neuron in the output layer is fully connected to every neuron in the hidden layer. The hidden layer and the output layer of a multilayer perceptron are therefore both fully connected layers.
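
For concreteness, here is a minimal sketch of such a network in Gluon; the layer sizes (256 hidden units, 10 outputs) are arbitrary choices for illustration rather than values from the text.

from mxnet.gluon import nn

# One hidden layer and one output layer, both fully connected (Dense).
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'),  # hidden layer; the activation is discussed below
        nn.Dense(10))                      # output layer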

Affine transformations

Before describing how the hidden layer is computed, let us look at how the output layer of the multilayer perceptron is computed. The input to the output layer is the output of the hidden layer, often referred to as the hidden-layer variable or hidden variable.

Given a minibatch of samples with batch size n, x inputs, and y outputs, suppose the multilayer perceptron has a single hidden layer with h hidden units, and denote the hidden variable by \(\boldsymbol{H} \in \mathbb{R}^{n \times h}\). Assuming the output layer's weight and bias parameters are \(\boldsymbol{W}_o \in \mathbb{R}^{h \times y}\) and \(\boldsymbol{b}_o \in \mathbb{R}^{1 \times y}\), the multilayer perceptron's output is
\[\boldsymbol{O} = \boldsymbol{H} \boldsymbol{W}_o + \boldsymbol{b}_o.\]

In fact, the output \(\boldsymbol{O}\) of the multilayer perceptron is an affine transformation of the previous layer's output \(\boldsymbol{H}\): a linear transformation by multiplication with the weight parameter, followed by a translation by addition of the bias parameter.

So what happens if the hidden layer also performs only an affine transformation of its input? Let the features of a single sample be \(\boldsymbol{x} \in \mathbb{R}^{1 \times x}\), and let the hidden layer's weight and bias parameters be \(\boldsymbol{W}_h \in \mathbb{R}^{x \times h}\) and \(\boldsymbol{b}_h \in \mathbb{R}^{1 \times h}\). Assume

\[\boldsymbol{h} = \boldsymbol{x} \boldsymbol{W}_h + \boldsymbol{b}_h, \qquad \boldsymbol{o} = \boldsymbol{h} \boldsymbol{W}_o + \boldsymbol{b}_o.\]

Substituting the first equation into the second gives \(\boldsymbol{o} = \boldsymbol{x} \boldsymbol{W}_h \boldsymbol{W}_o + \boldsymbol{b}_h \boldsymbol{W}_o + \boldsymbol{b}_o\), which is equivalent to the output of a single-layer neural network \(\boldsymbol{o} = \boldsymbol{x} \boldsymbol{W}^\prime + \boldsymbol{b}^\prime\) with \(\boldsymbol{W}^\prime = \boldsymbol{W}_h \boldsymbol{W}_o\) and \(\boldsymbol{b}^\prime = \boldsymbol{b}_h \boldsymbol{W}_o + \boldsymbol{b}_o\). Therefore, a hidden layer that only applies affine transformations leaves the multilayer perceptron no different from the single-layer neural network described earlier.
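
A quick numerical check of this collapse, as a minimal sketch with arbitrarily chosen (hypothetical) dimensions:

from mxnet import nd

# Two stacked affine layers (no activation) versus the single collapsed layer.
X = nd.random.normal(shape=(4, 3))                                     # n=4 samples, x=3 inputs
W_h, b_h = nd.random.normal(shape=(3, 5)), nd.random.normal(shape=(1, 5))  # hidden layer, h=5
W_o, b_o = nd.random.normal(shape=(5, 2)), nd.random.normal(shape=(1, 2))  # output layer, y=2

O_stacked = nd.dot(nd.dot(X, W_h) + b_h, W_o) + b_o
W_prime = nd.dot(W_h, W_o)
b_prime = nd.dot(b_h, W_o) + b_o
O_single = nd.dot(X, W_prime) + b_prime

print((O_stacked - O_single).abs().max())  # essentially zero, up to floating-point error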

Activation function

As the example above shows, we have to apply some other transformation in the hidden layer, in particular a nonlinear transformation, for the multilayer perceptron to become meaningful. We call such a nonlinear transformation an activation function. An activation function operates element-wise on an input of any shape and does not change the shape of the input.

ReLU function

The ReLU (rectified linear unit) function provides a very simple nonlinear transformation. Given an element x, the output of the function is \(\text{relu}(x) = \max(x, 0)\): the ReLU function retains only the positive elements and zeros out the negative elements.

Sigmoid function

The sigmoid function transforms the value of an element into the range between 0 and 1: \(\text{sigmoid}(x) = \frac{1}{1 + \exp(-x)}\). In the later chapter on recurrent neural networks we will describe how this range between 0 and 1 is used to control the flow of information in a neural network.

Tanh function

The tanh (hyperbolic tangent) function transforms the value of an element into the range between -1 and 1: \(\text{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}\). When the element value approaches 0, the tanh function approaches a linear transformation. Its shape is very similar to that of the sigmoid function, but when the input is evenly distributed over the real line, the mean of the tanh function's values is 0.

from mxnet import ndarray as nd

X = nd.array([[[0, 1], [-2, 3], [4, -5]], [[6, -7], [8, -9], [10, -11]]])
print(X.relu(), X.sigmoid(), X.tanh())

Multilayer perceptron

Now we can give the vector computation expression of the multilayer perceptron.

\[\begin{split}\boldsymbol{H} = \phi(\boldsymbol{X} \boldsymbol{W}_h + \boldsymbol{b}_h), \\ \boldsymbol{O} = \boldsymbol{H} \boldsymbol{W}_o + \boldsymbol{b}_o,\end{split}\]

where \(\phi\) denotes the activation function applied element-wise.

For a classification problem, we can apply the softmax operation to the output \(\boldsymbol{O}\) and use the cross-entropy loss function from softmax regression. For a regression problem, we set the number of outputs of the output layer to 1 and feed the output directly into the squared loss function used in linear regression.
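
As a minimal sketch, the forward computation above can be written out directly with NDArray operations, and Gluon provides ready-made loss functions for both cases; the parameter names and shapes here are illustrative assumptions.

from mxnet import nd
from mxnet.gluon import loss as gloss

def mlp_forward(X, W_h, b_h, W_o, b_o):
    # H = phi(X W_h + b_h) with phi = ReLU, then O = H W_o + b_o.
    H = nd.relu(nd.dot(X, W_h) + b_h)
    return nd.dot(H, W_o) + b_o

# Classification: softmax and cross-entropy combined in a single, numerically stable loss.
classification_loss = gloss.SoftmaxCrossEntropyLoss()
# Regression with a single output: squared (L2) loss.
regression_loss = gloss.L2Loss()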

We can add more hidden layers to construct a deeper model. Note that the number of layers of a multilayer perceptron and the number of hidden units in each hidden layer are hyperparameters.

Random initialization of model parameters

In a neural network, we need to initialize the model parameters randomly.
MXNet provides default random initialization. Calling net.initialize(init.Normal(sigma=0.01)) initializes the weight parameters of the model net with values drawn from a normal distribution. If no initialization method is specified, as in net.initialize(), MXNet uses its default random initialization: each element of the weight parameters is randomly sampled from a uniform distribution between -0.07 and 0.07, and all elements of the bias parameters are set to zero.
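
As a sketch, assuming a Gluon model net such as the one defined earlier:

from mxnet import init

# Weights drawn from a normal distribution with standard deviation 0.01; biases default to zero.
net.initialize(init.Normal(sigma=0.01))

# Alternatively, rely on MXNet's default: weights drawn uniformly from (-0.07, 0.07).
# net.initialize()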

Xavier Random Initialization

There is another commonly used random initialization method called Xavier random initialization. Suppose the number of inputs of a fully connected layer is \(a\) and the number of outputs is \(b\). Xavier random initialization samples each element of the layer's weight parameters from the uniform distribution
\[\left(-\sqrt{\frac{6}{a+b}}, \sqrt{\frac{6}{a+b}}\right).\]

Its design is mainly based on the idea that, after the model parameters are initialized, the variance of each layer's output should not be affected by that layer's number of inputs, and the variance of each layer's gradient should not be affected by that layer's number of outputs. These two points relate to forward propagation and backpropagation, which we will introduce later.
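
Gluon exposes this scheme as the init.Xavier initializer, which can be passed to initialize in the same way; force_reinit is shown here only because the model may already have been initialized above.

from mxnet import init

# Sample each weight element from U(-sqrt(6/(a+b)), sqrt(6/(a+b))).
net.initialize(init.Xavier(), force_reinit=True)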
