**Tricks for Efficient BP (Backpropagation) in Neural Network Training**

http://blog.csdn.net/zouxy09

Tricks! The word is full of mystery and curiosity, especially for those of us who try to solve problems with machine learning. Remember how we racked our brains, wrung our hands and stamped our feet, shouting: "Why doesn't my model work?" "Why are my results so poor?" "Why can't I reproduce the results he reports in his paper?" Someone will tell you, "You don't know how to tune the parameters! There are a lot of tricks in there!" or "Perhaps the author did not fully describe his implementation; send him an email and ask!" My goodness, tricks! Why are you so mysterious, and so far away from me? How can I get close to you? These tricks were accumulated by those who came before; how could a fledgling young scholar like me know them? I search, with no one to guide me, to the ends of the earth, not knowing where to look.

Have tricks become the secret martial-arts manuals of machine learning? Are the masters hoarding them, afraid that the rest of the world will find out? No! The real masters stay close to us common folk, and they have made remarkable contributions to the machine learning community. So here is a thank-you to the masters, the "soul" mentors of beginners.

Haha, that may have been a bit of grandstanding. What follows is based on "Efficient BackProp" by LeCun et al., the first chapter of the second edition of the book "Neural Networks: Tricks of the Trade". Thanks to the authors. This post is not a direct translation of the original: I have added and cut material, and in most places mixed in my own understanding (chatty asides, a bad habit). If I have misunderstood anything, I hope you will correct me.

Everyone can see what artificial neural networks are capable of, and they hold an important place in machine learning; there should be no doubt about that. They can model arbitrarily complex functions. Too much capacity is not always good, because the network then overfits easily. With too little capacity, though, it cannot model complex functions; you feed it data and it cannot digest it. I will not introduce neural networks here: after so many years of development there are more than enough books and materials. Remember what we came for? We want the tricks of training neural networks! As everyone knows (if you don't, perhaps don't read on just yet), the classic way to train a neural network is the BP algorithm. It is important to understand how BP works, because some of the phenomena you meet in practice can be explained by analyzing its convergence. BP also has weaknesses and some undesirable behaviors, and knowing how to avoid them matters a great deal for a model's success. This is where the tricks make their debut. They are rarely spelled out in papers, so let's introduce them and explain how they work.

**I. Introduction**

BP is a very popular algorithm for training neural networks because it is conceptually simple, easy to implement and, of course, effective. Using it well, however, is as much an art as a science. Designing and training a network with BP looks easy, yet it involves many seemingly simple choices: the number and type of nodes, the number of layers, the learning rate, the training and test sets, and so on. These choices are in fact critical. Regrettably, there is no foolproof recipe for making them, because the problem is large and depends on the specific task and data. Happily, there are many heuristics and some underlying theory that can guide practitioners toward better choices. Let's first cover the relevant basics and then get to the tricks.

**II. Learning and generalization**

There are many machine learning methods, but most of the successful ones can be grouped together as gradient-based learning methods. The learning framework is as follows. The model learns a function from input to output, written $M(Z^p, W)$, where $Z^p$ is the $p$-th input sample and $W$ is the set of parameters the model can learn; in a neural network these are the connection weights between layers. By what principle do we adjust the model, that is, learn $W$? We want the model to fit our training data, so we need a measure of how good the fit is. That measure is the cost function $E^p = C(D^p, M(Z^p, W))$, which measures the discrepancy between the network output $M(Z^p, W)$ and the "correct" output $D^p$ (what we usually call the label of the training sample) when sample $p$ is fed into the network. If the training set contains $P$ samples $\{(Z^1, D^1), \ldots, (Z^P, D^P)\}$, the cost on the training set is the average over all samples:

$$E_{train}(W) = \frac{1}{P}\sum_{p=1}^{P} E^p(W)$$

In general, the machine learning problem is to adjust the model parameters $W$ so as to minimize the cost function, that is, the error with which the model fits the data. In practice nobody cares much about the error on the training set. What matters is the error on the task itself, because the model is meant to be used: we train it so that it predicts new samples correctly. This performance is estimated on a test set that does not overlap the training set. The most common cost function is the mean squared error:

$$E^p = \frac{1}{2}\left(D^p - M(Z^p, W)\right)^2$$

How do we adjust the model parameters $W$ to minimize the cost function? That is the subject of this chapter. Several strategies are involved, but they must be combined with methods that maximize the network's ability to generalize, so that the trained model predicts unseen samples well.

To understand generalization better, consider how BP works. Training samples are collected through a noisy process. If you collected several sets of samples, each set would look a little different, both because of the noise and because different points were sampled. Each set, when used for training, pulls the network toward itself, so the network learned from one set differs slightly from the networks learned from the others.

Much theory is devoted to minimizing the error on the training set, which is called empirical risk minimization. Some of it decomposes the generalization error into two parts: bias and variance. Bias measures how much the network output, averaged over all possible data sets, differs from the target function. Variance measures how much the network output varies from one data set to another. Early in training the bias is large, because the network has learned nothing yet and its output is far from the target, while the variance is small, because the data have had little influence on the network. As training progresses the bias shrinks, because the network begins to learn the underlying function, that is, to fit the data. If training goes on too long, however, the network also learns the noise specific to its data set; this is overtraining. The variance then becomes large, because the noise differs from one data set to another. The total error is smallest when the sum of bias and variance is smallest.

Many techniques are available to maximize the generalization ability of a network, such as the commonly used early stopping and regularization.

This chapter is mainly about how to carry out the minimization for a given cost function, and how to improve the quality and speed of that minimization, that is, of training. It is worth stressing, however, that the choice of model, architecture and cost function is critical to obtaining a network that generalizes well. So remember: if you pick the wrong model and do not use a proper model selection strategy, even an exceptionally good minimization strategy cannot save you. In fact, the existence of overtraining has led several authors to suggest that a less accurate minimization algorithm may give better results.

**III. Standard BP**

The tricks and analyses in this post are presented in the context of multilayer feedforward neural networks, but most of them also apply to other gradient-based learning algorithms.

The simplest multilayer network trained by gradient learning is a stack of modules, each of which is one layer of the model. Each module can be written as a function $X_n = F_n(W_n, X_{n-1})$. This is the familiar forward propagation of a neural network: the vector $X_{n-1}$ is fed into module $F_n$, which outputs the vector $X_n$, and $W_n$ is the vector of tunable parameters of that module. Stack several modules so that the output of one layer is the input of the next. The input to the first layer, $X_0$, is our input data $Z^p$.

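If the derivative of the network error $E^p$ with respect to $X_n$ is known, then the derivatives of $E^p$ with respect to $W_n$ and $X_{n-1}$ can be obtained by backward propagation:

$$\frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W}(W_n, X_{n-1})\,\frac{\partial E^p}{\partial X_n}, \qquad \frac{\partial E^p}{\partial X_{n-1}} = \frac{\partial F}{\partial X}(W_n, X_{n-1})\,\frac{\partial E^p}{\partial X_n}$$

where $\frac{\partial F}{\partial W}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $W$ evaluated at the point $(W_n, X_{n-1})$, and $\frac{\partial F}{\partial X}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $X$. The Jacobian of a vector function is a matrix whose elements are the partial derivatives of all outputs with respect to all inputs. Applying these formulas from layer $N$ down to layer 1 gives the derivative of the cost function with respect to every parameter of the network. This way of computing gradients is BP.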

The traditional multilayer neural network is a special case of the system above, in which the modules alternate between matrix multiplication (the parameters) and an element-wise sigmoid function (the neurons):

$$Y_n = W_n X_{n-1}, \qquad X_n = F(Y_n)$$

$W_n$ is a matrix whose number of columns equals the dimension of $X_{n-1}$ and whose number of rows equals the dimension of $X_n$. $F$ is a vector function that applies the sigmoid to each element of its input. $Y_n$ is a vector whose elements are the weighted sums of all inputs to layer $n$.

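Applying the chain rule to the equations above gives the classical BP equations:

$$\frac{\partial E^p}{\partial y_n^i} = f'(y_n^i)\,\frac{\partial E^p}{\partial x_n^i}, \qquad \frac{\partial E^p}{\partial w_n^{ij}} = x_{n-1}^j\,\frac{\partial E^p}{\partial y_n^i}, \qquad \frac{\partial E^p}{\partial x_{n-1}^k} = \sum_i w_n^{ik}\,\frac{\partial E^p}{\partial y_n^i}$$

These can be written in matrix form, where $\odot$ denotes the element-wise product:

$$\frac{\partial E^p}{\partial Y_n} = F'(Y_n) \odot \frac{\partial E^p}{\partial X_n}, \qquad \frac{\partial E^p}{\partial W_n} = \frac{\partial E^p}{\partial Y_n}\,X_{n-1}^T, \qquad \frac{\partial E^p}{\partial X_{n-1}} = W_n^T\,\frac{\partial E^p}{\partial Y_n}$$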

The simplest minimization procedure is gradient descent, in which $W$ is adjusted iteratively:

$$W(t) = W(t-1) - \eta\,\frac{\partial E}{\partial W}$$

Here $\eta$ is the learning rate. In the simplest case it is a scalar constant. More sophisticated procedures let $\eta$ vary as the iterations proceed. In other methods $\eta$ is a diagonal matrix, or an estimate of the inverse Hessian (the matrix of second derivatives) of the cost function, as in Newton and quasi-Newton methods. The choice of learning rate is very important, as discussed later in this post.
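
To make the recursion concrete, here is a minimal pure-Python sketch of one forward and one backward pass through a small two-layer network, followed by a finite-difference check of a single gradient entry. The layer sizes, the plain `tanh` nonlinearity, the squared-error cost and all function names are illustrative assumptions, not code from the original chapter.

```python
import math
import random

def matvec(W, x):
    """Return the matrix-vector product W x (W is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(weights, x0):
    """Forward pass: Y_n = W_n X_{n-1}, X_n = tanh(Y_n). Returns every X_n."""
    xs = [x0]
    for W in weights:
        y = matvec(W, xs[-1])
        xs.append([math.tanh(v) for v in y])
    return xs

def cost(weights, x0, d):
    """Squared-error cost E = 1/2 * sum((d - x_N)^2) for one sample."""
    out = forward(weights, x0)[-1]
    return 0.5 * sum((di - oi) ** 2 for di, oi in zip(d, out))

def backward(weights, xs, d):
    """Backward pass: returns dE/dW_n for every layer."""
    # dE/dX_N for the squared-error cost.
    dE_dx = [oi - di for oi, di in zip(xs[-1], d)]
    grads = [None] * len(weights)
    for n in reversed(range(len(weights))):
        x_out, x_in, W = xs[n + 1], xs[n], weights[n]
        # dE/dY = f'(Y) * dE/dX, using tanh'(y) = 1 - tanh(y)^2.
        dE_dy = [(1.0 - xo * xo) * g for xo, g in zip(x_out, dE_dx)]
        # dE/dW[i][j] = dE/dY[i] * X_{n-1}[j].
        grads[n] = [[gy * xj for xj in x_in] for gy in dE_dy]
        # dE/dX_{n-1}[k] = sum_i W[i][k] * dE/dY[i].
        dE_dx = [sum(W[i][k] * dE_dy[i] for i in range(len(W)))
                 for k in range(len(x_in))]
    return grads

random.seed(0)
sizes = [3, 4, 2]  # input, hidden and output dimensions (arbitrary)
weights = [[[random.uniform(-0.5, 0.5) for _ in range(sizes[n])]
            for _ in range(sizes[n + 1])] for n in range(len(sizes) - 1)]
x0, d = [0.2, -0.4, 0.7], [1.0, -1.0]

grads = backward(weights, forward(weights, x0), d)

# Central finite-difference check on one weight of the first layer.
eps = 1e-6
weights[0][1][2] += eps
e_plus = cost(weights, x0, d)
weights[0][1][2] -= 2 * eps
e_minus = cost(weights, x0, d)
weights[0][1][2] += eps  # restore the original value
numeric = (e_plus - e_minus) / (2 * eps)
print(abs(numeric - grads[0][1][2]) < 1e-6)
```

A gradient descent step would then subtract `eta * grads[n][i][j]` from each `weights[n][i][j]`, with `eta` playing the role of the learning rate.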

**IV. Some practical tricks**

At last, the real substance. As mentioned above, BP is a gradient descent method, so it can be very slow. In multilayer networks especially, the cost surface is typically non-quadratic, non-convex and high-dimensional, with many local minima and/or flat regions. BP therefore cannot guarantee that: 1) the network will converge to a good solution; 2) convergence will be fast; 3) convergence will occur at all. In this section we discuss a series of tricks that can greatly improve the chances of finding a good solution while also cutting convergence time, often by an order of magnitude. An order of magnitude, mind you. Looking forward to it?

**1) Stochastic Learning vs Batch Learning**

Because our cost function is an average over the entire training set, each update of the weights requires a full pass over all samples in the data set to compute the average, or true, gradient. This is called batch learning, because the whole batch of data must be considered for every parameter update. Alternatively, we can use stochastic (online) learning, in which a single sample $\{Z^t, D^t\}$ is chosen (for example, at random) from the training set at each iteration. The gradient is then estimated from this one sample alone, and the parameter update at time $t$ is:

$$W(t+1) = W(t) - \eta\,\frac{\partial E^t}{\partial W}$$

Because this estimate of the gradient is noisy, the parameters may not move exactly along the true gradient at each iteration. As we will see, however, the "noise" introduced at each iteration is actually an advantage. Stochastic learning is popular for three reasons:

A. Stochastic learning usually converges much faster than batch learning;

B. Stochastic learning often finds a better solution;

C. Stochastic learning can be used to track changes.

Let's look at the first point. Stochastic learning is much faster than batch learning in most cases, especially on large, redundant data sets. The reason is simple. Suppose we have a data set of 1000 samples which, through our own carelessness, actually consists of 10 copies of the same 100 samples. The gradient averaged over all 1000 samples is then identical to the gradient averaged over the 100 distinct samples. Batch gradient descent is wasteful here, because it computes the same quantity 10 times for a single parameter update; a lot of useless work. Stochastic gradient descent, by contrast, treats one pass over the 1000 samples as 10 passes over a 100-sample training set. In real data sets a sample rarely appears twice, but there are usually many very similar samples. In phoneme classification, for example, all patterns of the phoneme /æ/ carry much the same information, and it is this redundancy that makes batch learning much slower than online learning.
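
The redundancy argument can be checked numerically. The sketch below assumes a one-parameter linear model with squared error; the data, the learning rate and the function names are made up for illustration. It shows that the batch gradient over the 1000 duplicated samples equals the gradient over the 100 distinct ones, and that one pass of stochastic updates gets much closer to the true slope than one batch update.

```python
import random

def grad(w, sample):
    """Gradient of 1/2 * (d - w*z)^2 with respect to the scalar weight w."""
    z, d = sample
    return -(d - w * z) * z

def mean_grad(w, data):
    """Batch gradient: the average over every sample in `data`."""
    return sum(grad(w, s) for s in data) / len(data)

random.seed(0)
# 100 distinct noisy samples of the line d = 3 * z.
base = [(z, 3.0 * z + random.gauss(0.0, 0.1))
        for z in (random.uniform(-1.0, 1.0) for _ in range(100))]
data = base * 10  # 1000 samples: 10 copies of the same 100

w0 = 0.0
print(abs(mean_grad(w0, data) - mean_grad(w0, base)) < 1e-12)

eta = 0.1
# Batch learning: one pass over the 1000 samples yields ONE update.
w_batch = w0 - eta * mean_grad(w0, data)

# Stochastic learning: the same pass yields 1000 updates.
w_sgd = w0
order = data[:]
random.shuffle(order)
for s in order:
    w_sgd -= eta * grad(w_sgd, s)

print(w_batch, w_sgd)  # compare both against the true slope 3.0
```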

Second, stochastic learning can find a better solution precisely because of the noise in its gradient updates. Noise that helps can be given the nicer name "perturbation". Nonlinear networks usually have many local minima of different depths, and the goal of training is to find one of them. Batch learning finds the minimum of whichever basin of the cost surface the initial parameters happen to sit in. Once the parameters are initialized, gradient descent simply walks downhill, so the destination is essentially fixed: it will end up at the bottom of that basin. If life were scripted like that, there would be little point in believing in it. In stochastic learning, however, the noise can sometimes make the parameters jump into another basin, possibly one with a deeper local minimum. A deeper local minimum means a smaller cost, that is, a model that fits the data better. Such a life is more colorful, and brings different rewards.

Stochastic learning is also useful when the function being modeled changes over time. This is common in industrial applications, where the distribution of the data drifts gradually (for example, because of machine wear). If our model cannot detect and follow the change, it cannot learn the data properly, and the generalization error becomes very large. Times change, and thinking must keep up. Batch learning has trouble detecting such changes, because it effectively averages over several different rules, and the results can be poor. Online learning, done properly, can track the change and yield good approximate results.

Although stochastic learning has much to recommend it, batch learning is not useless; there are still reasons to consider it. The advantages of batch learning are:

A. The conditions for convergence are well understood;

B. Many acceleration techniques, such as conjugate gradient, only work in batch learning;

C. Theoretical analysis of the weight dynamics and convergence rate is simpler.

To be fair, let's analyze these advantages of batch learning too. Everything has two sides, and this "noise" has its pros and cons: its benefits are what make stochastic learning successful, and its drawbacks are what drive some of stochastic learning's admirers back to batch learning. In other words, batch learning has these advantages simply because it is free of that noise. The noise is essential for finding a better local minimum, but it also prevents full convergence to that minimum. The cost wanders around the minimum, wanting to go lower but unable to, held back by the noise. The parameters keep fluctuating even in the neighborhood of a local minimum and never settle down. The size of the fluctuation depends on the amount of noise in the stochastic updates, and the variance of the fluctuation around a local minimum is proportional to the learning rate. There are two ways to reduce it: 1) decrease the learning rate (annealing); 2) use an adaptive batch size. For the first, theory shows that the optimal annealing schedule has the form $\eta = c/t$, where $c$ is a constant and $t$ is the number of samples presented so far. In practice, this schedule may decay too fast.
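
The effect of annealing can be seen on a toy problem. The sketch below is an illustration under assumed settings: a one-dimensional quadratic cost with noisy targets, a constant rate of 0.1, and $c = 1$ in the $\eta = c/t$ schedule. It compares how much the parameter still fluctuates in the second half of training.

```python
import random
import statistics

def run(schedule, steps=20000, seed=0):
    """SGD on E(w) = 1/2 * (w - d)^2 with noisy targets d ~ N(1, 1)."""
    rng = random.Random(seed)
    w, tail = 0.0, []
    for t in range(1, steps + 1):
        d = rng.gauss(1.0, 1.0)
        w -= schedule(t) * (w - d)  # the stochastic gradient is (w - d)
        if t > steps // 2:
            tail.append(w)          # record w during the second half only
    return w, statistics.pstdev(tail)

# Constant learning rate: w keeps fluctuating around the minimum at w = 1.
w_const, jitter_const = run(lambda t: 0.1)
# Annealed learning rate eta = c / t with c = 1: the fluctuation dies out.
w_anneal, jitter_anneal = run(lambda t: 1.0 / t)

print(jitter_const, jitter_anneal)
```

With $\eta = 1/t$ on this particular cost, the update reduces to a running average of the targets, which is why the fluctuation shrinks as training proceeds.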

The other method is just as natural: whoever tied the bell must untie it, and whoever made the mess must clean it up. That is, try to remove the noise. But wait, won't the advantages of stochastic learning disappear along with the noise? There is a middle way in all things; extremes are undesirable, so we settle for a compromise. This is where mini-batches step onto the stage, balancing the pressures from all sides. The idea is simple. At the start of training the parameters have just been initialized and are far from the minimum, so we want speed. Borrowing the fast convergence of stochastic learning, we use very small mini-batches, each containing only a few training samples. As training progresses and we approach the minimum, we need to slow down, or we will overshoot and start fluctuating again. So we increase the mini-batch size, which reduces the noise. Every new method, however, introduces new parameters to worry about. Here the question is: at what rate should the mini-batch size grow? This is as hard as choosing a learning rate, and effectively adjusting the growth rate of the mini-batch size has much the same effect as effectively adjusting the learning rate.

Note, however, that once generalization is taken into account, removing this noise may be less critical than one might think: overtraining (overfitting) may set in long before the noise regime is even reached.

Another advantage of batch training is that second-order optimization methods can be used to speed up learning. A second-order method estimates not only the gradient of the cost surface at a point (first-order information) but also its curvature (second-order information). Given the curvature, it can estimate the approximate location of the actual minimum, which speeds things up considerably.

Despite these advantages of batch learning, stochastic learning remains the preferred method, especially when the training set is very large, where it is simply faster. As the saying goes: of all the martial arts under heaven, only speed cannot be beaten!

**2) Shuffling: randomizing the order in which samples are presented**

One principle is that networks learn fastest from the most unexpected samples. Think about it: if on the first day of class you stand out and misbehave, the teacher will certainly remember you rather than the other "ordinary" students. So the idea is simple. To speed up learning, at each iteration we pick the sample that is most unfamiliar to the system and let the network learn from it. Shoot the bird that sticks its head out; to catch the bandits, first catch their king. Obviously this only works for stochastic learning, because batch learning ignores order: no first come, first served, everyone waits until all have arrived before the rations are handed out (the total error over all samples is computed before a gradient update is made). Of course, there is no easy way to know exactly which input carries the most information for the system, but a simple trick is to pick successive samples from different classes. In other words, if at iteration $t$ the network learns from a sample of class $i$, then at iteration $t+1$ we pick a sample from a class other than $i$. Training samples of the same class are likely to carry similar information, so having just seen you, I would rather not see someone who looks just like you next time: no new information, only aesthetic fatigue.

Another heuristic for judging how much new information a training sample carries is to look at the error between the network output and the target output when the sample is presented. The larger the error, the less the network has learned about this sample, and so the more new information it contains. It is like something new suddenly appearing in your world, that "oh!" feeling. It therefore makes sense to present such a sample to the network more often. Of course, "large" here is relative to the other training samples. As the network is trained, the error of each input sample changes, so the number of times each sample is presented changes too. A method that modifies the probability or frequency with which each sample is presented is called an emphasizing scheme:

A. Shuffle the training set so that successive samples rarely belong to the same class;

B. Present samples that produce a large error more often than samples that produce a small error.
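
Both parts of the scheme can be sketched in a few lines. The helpers below are hypothetical illustrations: `interleave_classes` deals shuffled samples round-robin by class so that neighbors rarely share a label, and `emphasized_sample` draws a sample index with probability proportional to its current error. The `floor` parameter is an added assumption that keeps low-error samples from being starved completely.

```python
import random

def interleave_classes(samples, rng):
    """Shuffle within each class, then deal samples round-robin by class.

    `samples` is a list of (x, label) pairs."""
    by_class = {}
    for s in samples:
        by_class.setdefault(s[1], []).append(s)
    for group in by_class.values():
        rng.shuffle(group)
    ordered = []
    while any(by_class.values()):
        for label in list(by_class):
            if by_class[label]:
                ordered.append(by_class[label].pop())
    return ordered

def emphasized_sample(errors, rng, floor=1e-3):
    """Pick an index with probability proportional to its current error."""
    weights = [max(e, floor) for e in errors]  # floor keeps every sample reachable
    return rng.choices(range(len(errors)), weights=weights, k=1)[0]

rng = random.Random(0)
samples = [(i, i % 3) for i in range(12)]  # 12 samples, 3 balanced classes
ordered = interleave_classes(samples, rng)
same_neighbours = sum(a[1] == b[1] for a, b in zip(ordered, ordered[1:]))

errors = [0.01, 0.01, 2.0, 0.01]  # sample 2 is currently poorly learned
counts = [0] * len(errors)
for _ in range(1000):
    counts[emphasized_sample(errors, rng)] += 1
print(same_neighbours, counts)
```

In a real training loop the `errors` list would be refreshed as the network learns, so the sampling probabilities change over time.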

Be careful, though: distorting the normal frequency with which samples are presented changes the relative importance of each sample to the network, and that may not be a good thing. Let some get rich first, and once they are rich they stop caring about the rest, and the gap between rich and poor widens. All the favor is lavished on a few; behind the vermilion gates wine and meat go to waste, while out on the road lie samples that were neglected to death. A policy so harmful to social harmony should not be pushed too far. As an extreme example, if training emphasizes outliers, the consequences can be disastrous. Outliers produce large errors, yet they clearly should not be fed to the network again and again, because that disrupts normal learning. The network spends ages adjusting its parameters for such a sample, only to find that it was an anomaly all along; what can one even say. The trick is very useful in one situation, however: it can boost performance on input patterns that are normal but occur rarely, such as the phoneme /z/ in phoneme recognition. If a sample belongs to a legitimate minority, it pays to let the network learn it many times. Care for the disadvantaged and build a harmonious society.

**3) Normalizing the inputs**

Convergence is usually faster if the average of each input variable (feature dimension) over the training set is close to 0. Consider an extreme case: all inputs to the network are positive. The update to the weights of a neuron in the first hidden layer is proportional to $\delta x$, where $\delta$ is the (scalar) error at that neuron and $x$ is the input vector. When all components of $x$ are positive, the updates to that neuron's weights all have the same sign, namely the sign of $\delta$. As a result, for a given input sample these weights can only all increase together or all decrease together, depending on the sign of $\delta$. If the weight vector has to change direction to reach its optimum, it can only do so by zigzagging, which is very inefficient and makes convergence very slow.
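
A short sketch makes the sign argument visible. The data, the random `deltas` and the helper name below are assumptions for illustration only. With all-positive inputs every sample forces all of a neuron's weight updates to share one sign; after shifting each input variable to zero mean over the training set, most samples produce updates of mixed sign.

```python
import random

def update_signs_agree(delta, x):
    """True if every component of the update -delta * x has the same sign."""
    signs = {(-delta * xi) > 0 for xi in x if xi != 0.0}
    return len(signs) <= 1

rng = random.Random(0)
# 200 training samples with 4 strictly positive input variables each.
raw = [[rng.uniform(0.5, 2.0) for _ in range(4)] for _ in range(200)]

# Shift every input variable to zero mean over the training set.
means = [sum(col) / len(raw) for col in zip(*raw)]
centered = [[xi - m for xi, m in zip(x, means)] for x in raw]

# One (arbitrary) scalar error value per sample, positive or negative.
deltas = [rng.choice((-1.0, 1.0)) * rng.random() for _ in raw]

frac_raw = sum(update_signs_agree(d, x)
               for d, x in zip(deltas, raw)) / len(raw)
frac_centered = sum(update_signs_agree(d, x)
                    for d, x in zip(deltas, centered)) / len(raw)
print(frac_raw, frac_centered)
```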

In the example above all inputs were positive. More generally, whenever the mean of an input variable over the training set is far from 0, the updates are biased in a particular direction and learning slows down. It is therefore a good idea to shift each input variable so that its mean over the whole training set is 0. Moreover, this heuristic should be applied at every layer of the network: we want the average output of each node to be close to 0, because that output is the input of the next layer. This can be achieved by coordinating the transformation of the inputs with the choice of sigmoid activation function. Here we discuss the transformation of the inputs; the sigmoid is discussed later.

Besides shifting the inputs, convergence can be accelerated by scaling them so that every feature dimension has the same variance. Why does scaling speed up learning? Because it balances the rates at which the weights connected to the input nodes learn. What does that mean? The updates to the weights of a first-hidden-layer neuron are proportional to $\delta x$. If some components of $x$ are large and others small, the large ones produce large weight updates and the small ones produce small updates. And there you have it, the gap between rich and poor again. We have said only that the variances should be equal; what value should they take? The value should be matched to the choice of sigmoid. For the sigmoid recommended later in this post (yes, you do have to read on to see it, sorry), a variance of 1 is a good choice.

There is an exception: when you know in advance that some input variables matter less than others. In that case you can scale the less important variables down so that the learning algorithm pays slightly less attention to them. Human prior knowledge is a powerful thing.

The tricks of shifting and scaling the inputs are easy to implement. Another trick that is very effective but harder to implement is to decorrelate the inputs. Consider a simple network with two inputs and weights $w_1$ and $w_2$. If the inputs are uncorrelated, $w_1$ can be found by minimizing the error with respect to $w_1$ alone, without regard to $w_2$, and vice versa. If the inputs are correlated, both weights must be solved for simultaneously, which is clearly harder. So how do we decorrelate the input variables? Enter the famous PCA. Its power is limited, though: it removes only linear correlations between the inputs (second-order statistics; higher orders are beyond it).

In fact, if the inputs are linearly dependent (the extreme case of correlation), this also causes a degeneracy that slows learning. Consider the case where one input is always twice the other, $z_2 = 2z_1$. The network output is then constant along the lines $w_2 = v - \frac{1}{2}w_1$, where $v$ is a constant. The gradient along these directions is therefore 0, and moving along these lines has no effect on learning. We are trying to solve a two-dimensional problem that is really only one-dimensional. A thankless effort. Ideally, we would remove one of the two inputs and shrink the network.
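
The degeneracy is easy to verify for a single linear unit. In this sketch the unit computes $w_1 z_1 + w_2 z_2$ with $z_2 = 2z_1$, so the output depends only on $w_1 + 2w_2$. The constant `v` and the test inputs are arbitrary illustrative values.

```python
def output(w1, w2, z1):
    """Linear unit whose second input is always twice the first: z2 = 2 * z1."""
    z2 = 2.0 * z1
    return w1 * z1 + w2 * z2

v = 0.75                   # arbitrary constant selecting one line in weight space
inputs = [-1.0, 0.5, 2.0]
for w1 in (-2.0, 0.0, 1.0, 4.0):
    w2 = v - 0.5 * w1      # move along the line w2 = v - w1 / 2
    # Each row shows the same outputs, 2 * v * z1, whatever w1 is.
    print(w1, w2, [output(w1, w2, z) for z in inputs])
```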

In summary, the transformation of the input is as follows:

A. The average of each input variable over the training set should be close to 0;

B. The input variables should be scaled so that their variances are about the same;

C. The input variables should preferably be uncorrelated.

The procedure can be summarized in three steps: 1) shift the inputs so that they have zero mean; 2) decorrelate the inputs; 3) equalize their variances.
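
For two input variables the three steps fit in a short pure-Python sketch. This is an illustration under simplifying assumptions: decorrelation is done by rotating onto the principal axes of the 2x2 covariance matrix, which is PCA in closed form for two dimensions, and the synthetic data and function names are made up. For more dimensions one would use a linear algebra library instead.

```python
import math
import random

def moments(points):
    """Return (mean_x, mean_y, var_x, var_y, cov_xy) of a list of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    vx = sum((x - mx) ** 2 for x, _ in points) / n
    vy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return mx, my, vx, vy, cxy

def normalize_2d(points):
    """Center, decorrelate and rescale two input variables."""
    mx, my, vx, vy, cxy = moments(points)
    # 1) Shift both variables to zero mean.
    centered = [(x - mx, y - my) for x, y in points]
    # 2) Decorrelate: rotate onto the principal axes of the covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, vx - vy)
    cs, sn = math.cos(theta), math.sin(theta)
    rotated = [(cs * x + sn * y, -sn * x + cs * y) for x, y in centered]
    # 3) Equalize: divide each rotated variable by its standard deviation.
    _, _, vu, vv, _ = moments(rotated)
    su, sv = math.sqrt(vu), math.sqrt(vv)
    return [(u / su, v / sv) for u, v in rotated]

rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(500)]
raw = [(x, 3.0 * x + 5.0 + rng.gauss(0.0, 1.0)) for x in xs]  # strongly correlated

print(moments(raw))                # non-zero means, unequal variances, large covariance
print(moments(normalize_2d(raw)))  # means and covariance near 0, variances near 1
```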

**4) The sigmoid function**

Nonlinear activation functions are what give neural networks the ability to model nonlinear functions. Without them, a network with any number of hidden layers is still just a linear network. The best-known activation function is none other than the sigmoid: a monotonically increasing function that approaches a finite value as its input goes to plus or minus infinity. The usual choices are the standard logistic function $f(x) = 1/(1 + e^{-x})$ and the hyperbolic tangent $f(x) = \tanh(x)$. A sigmoid that is symmetric about the origin (such as the hyperbolic tangent) is generally preferred. As mentioned above, inputs should be normalized, and the output of a symmetric sigmoid is more likely to give the next layer an input whose mean is close to 0. The output of the logistic function, by contrast, is always positive, so its mean is always positive.

Figure: (a) the standard logistic function; (b) the hyperbolic tangent $f(x) = 1.7159\tanh(2x/3)$.

The tricks for sigmoids are as follows:

A. Symmetric sigmoids such as the hyperbolic tangent often converge faster than the standard logistic function.

B. A recommended activation function is $f(x) = 1.7159\tanh(2x/3)$. Because tanh can be computationally expensive, it can be approximated by a ratio of polynomials.

C. It is sometimes helpful to add a small linear term, for example $f(x) = \tanh(x) + ax$, to avoid flat spots on the cost surface.

The activation function recommended above even comes with its constants chosen for you. They are chosen so that, with normalized inputs, the variance of the output is also close to 1, because the effective gain of the sigmoid is roughly 1 over its useful range. This particular sigmoid has the following properties: a) $f(\pm 1) = \pm 1$; b) the second derivative is largest at $x = 1$; c) the effective gain is close to 1.
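
Properties a) and b) can be checked numerically. The sketch below evaluates the recommended sigmoid at $\pm 1$ and searches a grid for the point where the magnitude of the second derivative peaks. The grid range and resolution are arbitrary choices, and the second derivative is written out by hand from the derivatives of tanh.

```python
import math

def f(x):
    """Recommended sigmoid: f(x) = 1.7159 * tanh(2x / 3)."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)

def f_second(x):
    """Second derivative of f, using d2/du2 tanh(u) = -2 tanh(u) (1 - tanh(u)^2)."""
    t = math.tanh(2.0 * x / 3.0)
    return 1.7159 * (2.0 / 3.0) ** 2 * (-2.0 * t * (1.0 - t * t))

# a) f(1) and f(-1) come out very close to +1 and -1.
print(f(1.0), f(-1.0))

# b) Locate the largest |f''| on a grid over [0, 3]; it lies close to x = 1.
grid = [i / 1000.0 for i in range(3001)]
x_star = max(grid, key=lambda x: abs(f_second(x)))
print(x_star)
```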

Of course, everything has two sides, and symmetric sigmoids have a drawback too: the error surface can be very flat near the origin. For this reason it is best to avoid initializing the network weights to very small values. Because the sigmoid saturates, the error surface is also flat far from the origin. Adding a small linear term to the sigmoid can sometimes help avoid these flat regions.

**5) Selection of target values**

In classification problems the target values are usually binary, for example $\{-1, +1\}$. Common wisdom suggests setting the targets at the asymptotes of the sigmoid. This approach has several drawbacks, however:

First, it leads to instability. Training does its best to drive the network output as close as possible to the target values, which of course can only be approached asymptotically. As a result, the weights (of the output layer, and even of the hidden layers) are driven to larger and larger values, where the derivative of the sigmoid is close to 0. The very large weights increase the gradients, but these gradients are then multiplied by an extremely small sigmoid derivative (unless a twisting term has been added to the sigmoid, that is, the linear term $ax$ mentioned earlier), so the final weight update is close to 0. The end result is that the weights get stuck and nothing moves any more.

Second, when the outputs saturate, the network gives no indication of its confidence. What is confidence? Put plainly: you hand me, the neural network, a sample to classify. I grind through a round of hard computation and hand you a class label. But should you believe the answer I gave? As a responsible network, shouldn't I also tell you how sure I am of that judgment? That gives you something to go on when deciding whether to trust it. For example, when an input sample falls near a decision boundary, the output class is really uncertain. Ideally the network's output should reflect this, for instance by taking a value somewhere between the two possible target values rather than out at the asymptotes. Large weights, however, force all outputs into the tails of the sigmoid regardless of the uncertainty. Typical arrogance. The network may thus output a wrong class without any hint that its confidence in the result is low; isn't that a swindle? Large weights saturate the neurons and blind them, so that they lose their basic ability to tell samples apart.

One way to solve this problem is to set the target values within the effective range of the sigmoid rather than in the region of its asymptotes. Care is also needed to make sure the nodes are not confined to the linear part of the sigmoid. Setting the target values at the point of maximum second derivative of the sigmoid takes advantage of the nonlinearity while still avoiding saturation. This is another reason why the sigmoid shown in (b) is a good choice: its second derivative is largest at plus and minus 1, and plus and minus 1 are the typical binary target values of a classification problem.

Trick: Target values: choose target values at the point of maximum second derivative of the sigmoid function, so that the output nodes do not saturate.
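The claim about the second derivative can be checked numerically. The following is only a sketch, assuming the sigmoid f(x) = 1.7159 tanh(2x/3) used in this article; the grid range and resolution are arbitrary choices of mine.

```python
import math

def f_second(x):
    """Second derivative of f(x) = 1.7159 * tanh(2x/3)."""
    t = math.tanh(2.0 * x / 3.0)
    return 1.7159 * (2.0 / 3.0) ** 2 * (-2.0 * t * (1.0 - t * t))

# Scan x in [0, 4] on a fine grid (arbitrary range) and find where |f''| peaks.
grid = [i / 1000.0 for i in range(4001)]
x_peak = max(grid, key=lambda x: abs(f_second(x)))
f_at_peak = 1.7159 * math.tanh(2.0 * x_peak / 3.0)

print(f"|f''| peaks near x = {x_peak:.3f}, where f(x) = {f_at_peak:.3f}")
```

Both the location of the peak and the output value there come out close to 1 (and, by symmetry, close to -1 on the negative side), which is why targets of plus and minus 1 sit comfortably inside the useful range of this sigmoid.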

**6) Initialization of parameters**

The initial values of the parameters have a significant impact on the training process. The principle is that the parameters should be randomly initialized to values that put the sigmoid mainly in its linear region. If the parameters are all very large, the sigmoid saturates from the start, the gradients are very small, and the parameter updates, and hence training, are slow. If the parameters are too small, the gradients are also small, which again makes training slow. The middle way it is! Keeping the parameters in a range where the sigmoid operates in its linear region has several advantages: 1) the gradients are large enough for learning to proceed normally; 2) the network learns the linear part of the mapping before the much more difficult nonlinear part.

This goal cannot be achieved by the sigmoid alone. It requires coordination between the data normalization, the choice of sigmoid, and the choice of parameter initialization. First, we require that the standard deviation of each node's output be close to 1; for the input layer this is obtained with the data normalization described earlier. For the output of the first hidden layer to also have a standard deviation close to 1, we only need to use the sigmoid suggested above and require that the input to the sigmoid also has a standard deviation of 1. Assuming that the inputs to a node are uncorrelated and have variance 1, the standard deviation of the node's weighted sum $y_i$ is determined by its weights:

$$\sigma_{y_i} = \Big(\sum_j w_{ij}^2\Big)^{1/2}$$

Therefore, to keep this standard deviation close to 1, the parameters should be randomly drawn from a distribution with mean 0 and standard deviation $\sigma_w = m^{-1/2}$, where m is the fan-in, that is, the number of connections feeding into the node (the number of nodes in the previous layer, if the network is fully connected).

Trick: Initialization of parameters:

Assume: 1) the training set has been normalized; 2) the sigmoid is chosen as f(x) = 1.7159 tanh(2x/3).

Then the parameters should be sampled from a distribution (for example, a normal distribution) with mean 0 and standard deviation $\sigma_w = m^{-1/2}$.
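The initialization rule can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the helper name `init_weights`, the layer sizes, and the use of a normal distribution are my own choices, and the inputs are synthetic unit-variance noise standing in for a normalized training set.

```python
import math
import random

def init_weights(fan_in, fan_out, rng):
    """Hypothetical helper: fan_out x fan_in weights drawn from N(0, 1/fan_in)."""
    sigma_w = fan_in ** -0.5  # standard deviation m^(-1/2), with m the fan-in
    return [[rng.gauss(0.0, sigma_w) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)
fan_in, fan_out = 256, 64  # arbitrary layer sizes for the demo
W = init_weights(fan_in, fan_out, rng)

# Feed uncorrelated, unit-variance inputs and measure the spread of the weighted sums.
sums = []
for _ in range(200):
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    sums.extend(sum(w * xi for w, xi in zip(row, x)) for row in W)

mean = sum(sums) / len(sums)
std = math.sqrt(sum((s - mean) ** 2 for s in sums) / len(sums))
print(f"standard deviation of the weighted sums: {std:.3f}")
```

With this scaling the weighted sums have a standard deviation close to 1, so the sigmoid starts out in the range where it is neither saturated nor purely linear.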

**7) Choice of learning rate**

Choosing the learning rate is also a headache, just like the batch size discussed above. The learning rate determines how far each parameter update goes, and it is a critical setting. If it is too large, the parameters tend to hover around the optimum without settling, because the steps are too big. Suppose you want to get from Guangzhou to Shanghai, but each of your strides is as long as the distance from Guangzhou to Beijing, and there is no such thing as half a step. Is being able to take such a huge stride lucky or unlucky? Everything has two sides: the advantage is that it gets you quickly from far away to the neighborhood of the optimum, but once you are near the optimum it is powerless. If the learning rate is too small, convergence is too slow, like a snail: it will eventually land on the optimum, but if that takes forever, we do not have the patience. So some improvements operate on the learning rate itself: start iterating with a large learning rate, and make it smaller as you slowly approach the optimum. The best of both worlds!
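The trade-off can be illustrated on a toy problem. The sketch below runs gradient descent on the one-dimensional quadratic E(w) = 0.5 * h * w^2, which is my own choice of example; the three learning-rate settings and the decay formula are arbitrary illustrative values, not recommendations from the original chapter.

```python
def descend(eta_schedule, steps=50, w0=1.0, h=1.0):
    """Gradient descent on E(w) = 0.5*h*w**2; returns |w| after `steps` updates."""
    w = w0
    for t in range(steps):
        grad = h * w                  # dE/dw
        w -= eta_schedule(t) * grad   # step size may depend on the iteration t
    return abs(w)

too_small = descend(lambda t: 0.01)                     # crawls toward the optimum
too_large = descend(lambda t: 2.1)                      # overshoots further each step
decaying = descend(lambda t: 1.7 / (1.0 + 0.13 * t))    # big steps first, then smaller

print(f"too small: {too_small:.3g}  too large: {too_large:.3g}  decaying: {decaying:.3g}")
```

On this toy surface the tiny rate has barely moved after 50 steps, the oversized rate ends up farther from the optimum than where it started, and the decaying schedule ends closest to the optimum of the three. Real error surfaces are of course far less tidy.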

In fact, after such a long development of learning methods, research on the learning rate has accumulated quite a few results and a lot of experience. There is at least one well-principled way to estimate the ideal learning rate, and there are many other schemes for adjusting the learning rate automatically, but most of them are empirical.

Most of these methods decrease the learning rate when the parameters are oscillating and increase it when the parameters move steadily in one direction. The main problem with this approach is that it is not suitable for stochastic gradient or online learning, because there the parameters fluctuate throughout training. From start to finish, everything shakes. Life is so bumpy; where is the hope?

Instead of one global learning rate shared by all parameters, you can choose a different learning rate for each parameter, which generally speeds up convergence. A good way to do this is to compute second derivatives, which is covered later in the original chapter. The main thing this method has to ensure is that all the parameters in the network converge at roughly the same speed. Depending on the curvature of the error surface, some parameters need a small learning rate to avoid divergence, while others need a large learning rate to converge at a reasonable speed. For this reason, the learning rates in the lower layers should generally be larger than those in the higher layers, because in most neural network architectures the second derivative of the cost function with respect to the lower-layer parameters is smaller than that with respect to the higher-layer parameters.

If the network uses shared parameters, as in a TDNN or a CNN, the learning rate should be proportional to the square root of the number of connections sharing that parameter, because the gradient is then a sum of more or less independent terms.

Tricks:

A, give each parameter its own learning rate;

B, the learning rate should be proportional to the square root of the number of inputs to the node;

C, the learning rate of the lower-layer parameters should be larger than that of the higher layers.

Other tricks for accelerating convergence include:

Momentum:

When the cost surface is highly non-spherical, momentum can improve the convergence speed. It damps the step size along directions of high curvature, which gives a larger effective learning rate along directions of low curvature. (In the update rule, μ measures the strength of the momentum term.) It is often claimed that momentum helps more in batch learning than in stochastic mode, but no systematic study of this claim is known.
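A minimal sketch of the idea, assuming the common formulation in which each update is the negative gradient step plus μ times the previous update. The two-dimensional quadratic cost with curvatures 1 and 100, the learning rate, and μ = 0.8 are illustrative values I picked, not taken from the original chapter.

```python
import math

def run(eta, mu, steps=100):
    """Minimise E(w) = 0.5*(h1*w1^2 + h2*w2^2) with gradient descent plus momentum."""
    h = (1.0, 100.0)      # low and high curvature directions (non-spherical surface)
    w = [1.0, 1.0]
    delta = [0.0, 0.0]    # previous update, initially zero
    for _ in range(steps):
        for k in range(2):
            grad = h[k] * w[k]
            delta[k] = mu * delta[k] - eta * grad   # mu = 0 gives plain gradient descent
            w[k] += delta[k]
    return math.hypot(w[0], w[1])   # distance from the optimum at the origin

err_plain = run(eta=0.018, mu=0.0)
err_momentum = run(eta=0.018, mu=0.8)
print(f"plain: {err_plain:.3g}  with momentum: {err_momentum:.3g}")
```

With the same learning rate, the run with momentum ends much closer to the optimum, because progress along the shallow direction is no longer held back by the small step size that the steep direction demands.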

Adaptive learning rate:

The idea is to adjust the learning rate on the fly according to the error during training. (This is a larger topic, so it is omitted here; if you are interested, see the original text.)

**8) RBF vs. sigmoid nodes**

Although most systems use neurons based on a dot product followed by a sigmoid, there are other options. A typical one is the RBF (radial basis function) network. In an RBF network, the dot product of the parameter vector and the input vector is replaced by the Euclidean distance between the two, and the sigmoid function is replaced by an exponential. For example, for an input x, the output is:

$$g(x) = \sum_{i=1}^{N} w_i \exp\left(-\frac{\lVert x - \nu_i \rVert^2}{2\sigma_i^2}\right)$$

where $\nu_i$ ($\sigma_i$) is the mean (standard deviation) of the i-th Gaussian and $w_i$ is its output weight. These nodes can replace standard nodes or coexist with them. They are generally trained by a combination of gradient descent (for the output layer) and an unsupervised clustering algorithm that determines the means and variances of the RBF nodes. See my earlier blog post on this topic.
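A minimal sketch of such an RBF layer, assuming the usual Gaussian form in which node i responds with exp(-||x - v_i||^2 / (2 s_i^2)) and the responses are combined with weights w_i. The function name and the example centers, widths, and weights are made up for illustration.

```python
import math

def rbf_output(x, centers, widths, weights):
    """Weighted sum of Gaussian bumps: sum_i w_i * exp(-||x - v_i||^2 / (2 * s_i^2))."""
    total = 0.0
    for v, s, w in zip(centers, widths, weights):
        sq_dist = sum((xj - vj) ** 2 for xj, vj in zip(x, v))  # squared Euclidean distance
        total += w * math.exp(-sq_dist / (2.0 * s * s))
    return total

# Hypothetical two-node layer in a 2-D input space.
centers = [[0.0, 0.0], [3.0, 3.0]]
widths = [1.0, 0.5]
weights = [1.0, -2.0]

near_first = rbf_output([0.0, 0.0], centers, widths, weights)   # sits on the first center
far_away = rbf_output([10.0, -10.0], centers, widths, weights)  # far from both centers
print(near_first, far_away)
```

An input sitting on a center gets essentially that node's full weight, while an input far from every center produces an output that is practically zero. This is the local behaviour that distinguishes RBF nodes from sigmoid nodes.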

Unlike a sigmoid node, which can cover the entire input space, an RBF node covers only a small local region of it. This is beneficial for fast learning. RBF nodes can also form a better set of basis functions for modeling the input space, although this is problem dependent. RBF nodes still have shortcomings: their locality can hurt performance, especially in high-dimensional spaces, where a great many nodes are needed to cover the whole space. Still, some people have drawn a practical rule for network design from this: use sigmoid nodes in the lower (high-dimensional) layers and RBF nodes in the upper (lower-dimensional) layers.

The remaining sections of the chapter cover the analysis of gradient descent convergence, second-order optimization methods, and so on. I will not go into them here; if you are interested, please refer to the original work. Thanks again to the masters for their sharing and dedication. My highest respect to you!
