Deeplearning.net 0.1 Documentation - Multilayer Perceptron


Multilayer Perceptron

Below we use Theano to introduce a multilayer perceptron (MLP) with a single hidden layer. An MLP can be seen as a logistic regression classifier whose input is first processed by a learned nonlinear transformation. This transformation projects the input into a space where it becomes linearly separable. The intermediate layer is called the hidden layer. A single hidden layer is enough to make an MLP a universal approximator, but we will see later that using multiple hidden layers can be very effective.

The Model

An MLP with a single hidden layer can be represented as follows.

Formally, a one-hidden-layer MLP is a function $f: \mathbb{R}^D \rightarrow \mathbb{R}^L$, where $D$ is the size of the input vector $x$ and $L$ is the size of the output vector $f(x)$, such that, in matrix notation:

$$f(x) = G\big(b^{(2)} + W^{(2)} \, s(b^{(1)} + W^{(1)} x)\big)$$


with bias vectors $b^{(1)}$, $b^{(2)}$, weight matrices $W^{(1)}$, $W^{(2)}$, and activation functions $G$ and $s$.

The vector $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$ constitutes the hidden layer. $W^{(1)} \in \mathbb{R}^{D \times D_h}$ is the weight matrix connecting the input layer to the hidden layer; each column $W^{(1)}_{\cdot i}$ holds the weights from the input units to the $i$-th hidden unit. Traditional choices for $s$ include the tanh function and the sigmoid function. In this tutorial we use tanh because it typically trains faster (and sometimes reaches a better local minimum). Both tanh and sigmoid are scalar-to-scalar functions, but they extend naturally to vectors and tensors by applying them element-wise.

The output vector is then obtained as $o(x) = G(b^{(2)} + W^{(2)} h(x))$. The reader should recognize this as the form already used for classification in the previous tutorial: as before, class-membership probabilities can be obtained by choosing $G$ to be the softmax function.
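To make the composition above concrete, the following is a minimal NumPy sketch of the forward pass (the sizes and random parameters are illustrative only; in the tutorial the parameters are learned, not drawn at random):

import numpy as np

rng = np.random.RandomState(0)
D, D_h, L = 4, 5, 3              # illustrative sizes: input, hidden, output

# illustrative parameters (in practice these are learned)
W1 = rng.uniform(-0.1, 0.1, size=(D, D_h))
b1 = np.zeros(D_h)
W2 = rng.uniform(-0.1, 0.1, size=(D_h, L))
b2 = np.zeros(L)

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def mlp_forward(x):
    h = np.tanh(b1 + np.dot(x, W1))      # hidden layer: s(b1 + W1 x)
    return softmax(b2 + np.dot(h, W2))   # output layer: G(b2 + W2 h)

x = rng.rand(D)
print(mlp_forward(x))                # class-membership probabilities, sums to 1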

To train an MLP, we learn all of its parameters, here using minibatch stochastic gradient descent. The set of parameters to learn is $\theta = \{W^{(2)}, b^{(2)}, W^{(1)}, b^{(1)}\}$. The required gradients can be obtained with the backpropagation algorithm, a special case of the chain rule of derivation. Thankfully, Theano performs this computation automatically, so it is not detailed here.
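As a small aside, the snippet below is a sketch of Theano's automatic differentiation on a toy scalar expression; it is not part of the tutorial code, but it shows the T.grad mechanism that the MLP training code relies on:

import theano
import theano.tensor as T

x = T.dscalar('x')
y = x ** 2                       # a simple scalar expression
gy = T.grad(y, x)                # Theano derives dy/dx symbolically
f = theano.function([x], gy)
print(f(3.0))                    # prints 6.0, i.e. 2 * x at x = 3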

Going from logistic regression to MLP

This tutorial focuses on a single-hidden-layer MLP. We start by implementing a class that represents one hidden layer. To construct the MLP, we will then put a logistic regression layer on top of it.

class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """Typical hidden layer of an MLP: units are fully connected and
        have a nonlinear activation function applied to them."""
        self.input = input

The initial values of the hidden-layer weights should be sampled uniformly from a symmetric interval that depends on the activation function. For tanh, the interval is $[-\sqrt{6/(fan_{in}+fan_{out})},\ \sqrt{6/(fan_{in}+fan_{out})}]$, where $fan_{in}$ is the number of units in the $(i-1)$-th layer and $fan_{out}$ is the number of units in the $i$-th layer. For the sigmoid function the interval is $[-4\sqrt{6/(fan_{in}+fan_{out})},\ 4\sqrt{6/(fan_{in}+fan_{out})}]$. This initialization ensures that, early in training, the activation function operates in a regime where information is easily propagated both forward (activations) and backward (gradients).

        # `W` is initialized with `W_values`, sampled uniformly from the
        # interval [-sqrt(6./(n_in+n_out)), sqrt(6./(n_in+n_out))] for tanh.
        # Converting to dtype theano.config.floatX makes the code runnable
        # on a GPU.
        # Note: the best weight initialization depends on the activation
        # function. For example, [Xavier10] suggests using weights 4 times
        # larger for sigmoid than for tanh. We have no information for other
        # functions, so we use the same scheme as for tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

We use a given nonlinear function as the activation function of the hidden layer. tanh is used by default, but in some situations we may want to use a different function.

        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )
        # parameters of the layer
        self.params = [self.W, self.b]

In terms of the theory, this class implements the graph that computes the hidden-layer value $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$. If you give this graph as input to the LogisticRegression class implemented previously, you get the outputs of the MLP.

class MLP(object):
    """Multi-Layer Perceptron class

    A multilayer perceptron is a feedforward artificial neural network that
    has one or more hidden layers and nonlinear activations. Intermediate
    layers use tanh or the sigmoid function as activation (the HiddenLayer
    class), while the top layer is a softmax layer (the LogisticRegression
    class).
    """

    def __init__(self, rng, input, n_in, n_hidden, n_out):
        # Since we are dealing with a single hidden layer, this translates
        # into a HiddenLayer with a tanh activation function connected to a
        # logistic regression layer; the activation function can be replaced
        # by sigmoid or any other nonlinear function.
        self.hiddenLayer = HiddenLayer(
            rng=rng,
            input=input,
            n_in=n_in,
            n_out=n_hidden,
            activation=T.tanh
        )

        # the logistic regression layer gets the output of the hidden layer
        # as its input
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output,
            n_in=n_hidden,
            n_out=n_out
        )

In this tutorial we will also use L1 and L2 regularization.

        # L1 regularization term
        self.L1 = (
            abs(self.hiddenLayer.W).sum()
            + abs(self.logRegressionLayer.W).sum()
        )

        # L2 regularization term (squared L2 norm)
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum()
            + (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood, computed in the logistic regression layer
        self.negative_log_likelihood = (
            self.logRegressionLayer.negative_log_likelihood
        )

        # error rate, also computed in the logistic regression layer
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layers
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params

As before, we train this model with minibatch stochastic gradient descent. The difference is that the loss function now includes the L1 and L2 regularization terms; L1_reg and L2_reg are hyperparameters that control the weight of these terms in the total cost.

    cost = (        classifier.negative_log_likelihood(y)        + L1_reg * classifier.L1        + L2_reg * classifier.L2_sqr    )

We then update the parameters of the model using the gradients. This code is almost identical to the one for logistic regression; only the number of parameters differs. To keep the code able to run with any number of parameters, we build a list of gradients, one per parameter, and iterate over it.

    # compute the gradient of the cost with respect to each parameter in
    # the list; the result is stored in the list `gparams`
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters as a list of
    # (variable, update expression) pairs.
    # Given two lists of the same length A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list
    # C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]

    # compile a Theano function `train_model` that returns the cost and,
    # at the same time, updates the parameters according to `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
Putting it all Together
    • See source code.
    • You can see the results of different models here.
Tips and Tricks for training MLPs

There are several hyperparameters in the code above that cannot be optimized with gradient descent. Strictly speaking, finding an optimal set of values for them is not a feasible problem. First, we cannot simply optimize each of them independently. Second, we cannot readily apply the gradient techniques described earlier (partly because some of these parameters are discrete and others are real-valued). Third, the optimization problem is not convex, and finding a (local) minimum would involve a non-trivial amount of work.

The good news is that over the last 25 years researchers have devised various rules of thumb for choosing hyperparameters in neural networks. To learn more, read Efficient BackProp by Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Müller. Here we summarize the methods that matter most for the parameters and techniques used in our code.

Nonlinearity

Two of the most common choices are the sigmoid and the tanh function. For reasons explained in Section 4.4 of Efficient BackProp, nonlinearities that are symmetric around the origin are preferred, because they tend to produce zero-mean inputs to the next layer, which is a desirable property. Empirically, tanh has better convergence properties.
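As a quick illustration of the zero-mean argument (the data here is synthetic and purely illustrative), one can compare the mean output of tanh and sigmoid on zero-mean inputs:

import numpy as np

rng = np.random.RandomState(0)
z = rng.normal(0.0, 1.0, size=100000)   # zero-mean pre-activations

sigmoid = 1.0 / (1.0 + np.exp(-z))

# tanh outputs stay roughly centered at 0, while sigmoid outputs center
# near 0.5, so the next layer receives non-zero-mean inputs with sigmoid
print(np.tanh(z).mean())    # close to 0
print(sigmoid.mean())       # close to 0.5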

Weight initialization

At initialization we want the weights to be small enough that the activation function operates in its near-linear regime, where its derivative is largest. Another desirable property, especially for deep networks, is to preserve the variance of the activations and of the backpropagated gradients from layer to layer, so that information flows well both upward and downward. Under certain assumptions, a compromise between these two constraints leads to the following initialization:
For tanh: weights sampled uniformly from $[-\sqrt{6/(fan_{in}+fan_{out})},\ \sqrt{6/(fan_{in}+fan_{out})}]$.
For sigmoid: weights sampled uniformly from $[-4\sqrt{6/(fan_{in}+fan_{out})},\ 4\sqrt{6/(fan_{in}+fan_{out})}]$.
See [Xavier10] for the mathematical derivation.
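A minimal NumPy sketch of these two rules, with illustrative fan-in and fan-out values (this mirrors the initialization already used in the HiddenLayer code above):

import numpy as np

def init_weights(rng, fan_in, fan_out, activation='tanh'):
    # symmetric uniform interval from [Xavier10]
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    if activation == 'sigmoid':
        bound *= 4.0                      # 4x larger interval for sigmoid
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

rng = np.random.RandomState(1234)
W_tanh = init_weights(rng, fan_in=784, fan_out=500, activation='tanh')
W_sigm = init_weights(rng, fan_in=784, fan_out=500, activation='sigmoid')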

Learning Rate

There is a good amount of literature on how to choose the learning rate. The simplest approach is a constant learning rate. A rule of thumb: try several values on a logarithmic scale ($10^{-1}$, $10^{-2}$, ...), then narrow the (logarithmic) grid down to the region where you obtain the lowest validation error.

Decreasing the learning rate over time is sometimes a good idea. One simple rule is $\mu_t = \mu_0 / (1 + d \cdot t)$, where $\mu_0$ is the initial learning rate, $d$ is a decrease constant controlling how quickly the rate decays (usually a small positive number, $10^{-3}$ or smaller), and $t$ is the epoch/stage.
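A small Python sketch of this schedule, with illustrative values for the initial rate and the decrease constant:

# decreasing learning rate schedule: mu_t = mu_0 / (1 + d * t)
mu_0 = 0.1        # initial learning rate (illustrative)
d = 1e-3          # decrease constant (illustrative)

for t in range(0, 50, 10):                 # t is the epoch number
    mu_t = mu_0 / (1.0 + d * t)
    print("epoch %d: learning rate %.6f" % (t, mu_t))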

See Section 4.7 of Efficient BackProp for more information.

Number of hidden units

This hyperparameter depends very much on the dataset. Vaguely speaking, the more complicated the input distribution, the more capacity the network needs to model it, and the larger the number of hidden units required.

Unless we employ some regularization scheme, a typical plot of the number of hidden units vs. generalization performance is U-shaped.

Regularization parameter

A classic approach is to try L1/L2 regularization parameter values of $10^{-2}$, $10^{-3}$, and so on. In the framework we have described so far, optimizing this parameter will not lead to significantly better solutions, but it is sometimes worth exploring.
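As a sketch of that search, one could loop over log-spaced values and keep the one with the lowest validation error; the train_and_validate helper below is hypothetical (a stand-in for building, training, and validating the MLP with the given regularization weights):

def train_and_validate(L1_reg, L2_reg):
    # placeholder: in the real tutorial code this would build the MLP with
    # the given regularization weights, train it, and return the validation
    # error; here it returns a dummy value so the sketch runs standalone
    return abs(L2_reg - 1e-3)

candidates = [1e-2, 1e-3, 1e-4]   # log-spaced regularization strengths

best_l2, best_err = None, float('inf')
for l2_reg in candidates:
    err = train_and_validate(L1_reg=0.0, L2_reg=l2_reg)
    if err < best_err:
        best_l2, best_err = l2_reg, err
print("best L2_reg:", best_l2)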
