Chapter 5 (1.6) Deep Learning: Eight Common Neural Network Performance Tuning Techniques

Source: Internet
Author: User

First, the main methods of neural network performance tuning: data augmentation, image preprocessing, network initialization, tricks during training, choice of activation function, different regularization methods, diagnosing from the training curves, and ensembling multiple deep networks.

1. Data augmentation

The generalization ability of the model can be improved by increasing the amount of data without changing the image category.
There are many data augmentation methods for natural images, such as the commonly used horizontal flipping, small shifts or crops, and color jittering. You can also try combinations of operations, for example applying rotation and random scaling at the same time. In addition, you can raise the saturation and value of all pixels (the S and V components of the HSV color space) in each patch to a power between 0.25 and 4, multiply them by a factor between 0.7 and 1.4, and add a value between -0.1 and 0.1. Similarly, you can add a value between -0.1 and 0.1 to the hue channel (H) of each image or patch.
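As an illustration only, here is a minimal numpy sketch of the HSV color jitter just described; the clipping to [0, 1] and the hue wrap-around are implementation assumptions, not part of the original recipe.

import numpy as np

def hsv_color_jitter(img_hsv, rng=np.random):
    # img_hsv: an image or patch in HSV with all channels scaled to [0, 1].
    out = img_hsv.astype(np.float64)
    for c in (1, 2):                                 # channel 1 = S, channel 2 = V
        power = rng.uniform(0.25, 4.0)
        scale = rng.uniform(0.7, 1.4)
        shift = rng.uniform(-0.1, 0.1)
        out[..., c] = np.clip(out[..., c] ** power * scale + shift, 0.0, 1.0)
    out[..., 0] = (out[..., 0] + rng.uniform(-0.1, 0.1)) % 1.0   # shift hue, wrapping around
    return out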

2. Preprocessing

2.1 The simplest preprocessing: zero-mean centering and normalization

2.1.1 Why zero-mean? Data with too large a mean may produce overly large parameter gradients, and subsequent processing steps such as PCA may require zero-mean data anyway. Subtracting the mean does not remove the relative differences between pixels, and the information people take from an image usually comes from the relative color differences between pixels rather than from the absolute pixel values.

2.1.2 Why normalize?

Normalization makes the different dimensions of the data have comparable distributions.

Suppose two-dimensional data (x1, x2) has both dimensions following a zero-mean normal distribution, with the variance of x1 equal to 100 and the variance of x2 equal to 1. Plotting random samples of (x1, x2) in a two-dimensional coordinate system gives a long, narrow ellipse.

Feature extraction on such data uses an expression of the form:
s = w1*x1 + w2*x2 + b
The gradients with respect to the parameters w1 and w2 are then:
ds/dw1 = x1,  ds/dw2 = x2

Because of the large difference in scale between x1 and x2, the derivatives with respect to w1 and w2 also differ enormously. Drawing the surface of the objective function (not s) at this point gives something like a deep canyon: along the canyon is the w2 direction, where the slope is very gentle, and perpendicular to the canyon is the w1 direction, where the slope is very steep, as in Figure 1. The objective function we would prefer is the one in Figure 2.

Such an objective function is very hard to optimize, because the gradients of w1 and w2 differ so much that the two dimensions would need different step sizes. In practice, for convenience, we usually set the same step size for all dimensions, and as iterations proceed the step size is reduced in the same way for every dimension. This requires the scales of the different dimensions to be roughly the same, and that starts with normalizing the data.

2.1.3 Python Implementation Code

X -= numpy.mean(X, axis=0)  # subtract from each column of X the mean of that column

X /= numpy.std(X, axis=0)   # normalize each column by its standard deviation

Note: For grayscale images the mean of the whole image can also be subtracted: X -= numpy.mean(X), where X is the input data (number of images x image dimensions).

Another way to standardize (normalize) is to scale each dimension so that its maximum and minimum values are 1 and -1. This preprocessing only makes sense when the input features have different scales or units. Taking image pixels as input, all pixel values already lie on the same 0-255 scale, so it is not strictly necessary to perform this step. When training on natural images, normalization is often skipped, because the statistical properties of any part of an image should be the same as those of any other part; this property of images is called stationarity.
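A minimal sketch of this min-max scaling, using randomly generated data in place of a real input matrix X:

import numpy as np

X = np.random.rand(100, 3072) * 255.0            # e.g. 100 flattened 32x32x3 images in [0, 255]
X_min, X_max = X.min(axis=0), X.max(axis=0)
X = 2.0 * (X - X_min) / (X_max - X_min) - 1.0    # each column now spans [-1, 1]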

2.2 Whitening

Whitening is equivalent to inserting a rotation between the zero-mean and normalization steps, projecting the data onto its principal axes. After an image has been whitened, each pixel can be regarded as statistically independent. However, whitening is rarely used in convolutional neural networks, possibly because image information is carried by the relative differences between pixels; whitening removes the correlation between pixels, making those differences uncertain and losing information.

First zero-mean the data, then compute the covariance matrix to observe the correlation structure in the data.
X -= np.mean(X, axis=0)
cov = np.dot(X.T, X) / X.shape[0]  # compute the covariance matrix

Then decorrelate the data by projecting the original (zero-mean) data onto the eigenbasis.

U, S, V = np.linalg.svd(cov)  # singular value decomposition (SVD) of the data covariance matrix
Xrot = np.dot(X, U)  # decorrelate the data
The final step is whitening: normalize the scale by dividing the data in the eigenbasis by the square root of the eigenvalue of each dimension.
Xwhite = Xrot / np.sqrt(S + 1e-5)  # divide by the square root of the singular values; the 1e-5 prevents division by zero

One disadvantage of PCA whitening is that it amplifies noise in the data, because it stretches all dimensions of the input to the same scale, including the noise dimensions (which are usually irrelevant and have small variance). In practice this can be mitigated by increasing the 1e-5 to a larger value, which introduces stronger smoothing.

3. Initialization

3.1 Do not initialize all parameters to zero

Almost all CNN architectures are stacked structures. Initializing the parameters to zero makes the data flowing through the network symmetric (all zeros), and there is no perturbation to break that symmetry, so the network cannot learn.

When the parameters are initialized to zero, the activations of the intermediate neurons are identical regardless of the input (the activation of any neuron is a = f(w^T x); when the weight w is the zero vector, w^T x is also zero, so the values after the activation function are all the same). The gradients in the backward pass are then also identical and every weight receives the same update, so the network never breaks this symmetry.

3.2 Initialize with small random numbers

The initial parameters should be very close to zero (but not all equal to zero) in order to break the symmetry of the network. They should also be random and independent so that each parameter is updated differently. Give each parameter a random value close to zero:
W = 0.01 * numpy.random.randn(D, H)

The randn method draws random numbers from a standard normal distribution (zero mean, unit variance). Small numbers drawn from a uniform distribution could be used instead to produce the initial parameters, but in practice the final results of the two methods differ very little.

3.3 Normalization of variance

Initializing the parameters with random values causes the variance of the output s to grow with the number of inputs (the dimension of the x or w vector).
The variance of independent random variables has the following property:
var(a + b + c) = var(a) + var(b) + var(c)
s = w1*x1 + w2*x2 + ... + wi*xi + b

s is a weighted sum of many random variables. Assuming the dimensions of w are independent of each other, the variance of s accumulates linearly as the data dimension grows. Since the data dimension is determined by the task and cannot be controlled, we want to normalize the variance of s, which only requires scaling w:
W = numpy.random.randn(n) / sqrt(n)  # n is the data dimension

The derivation behind this operation is:
n * var(w) = 1, which gives std(w) = 1/sqrt(n)
Note: most recent papers use ReLU activations in practice; for ReLU the recommended initialization is:
W = numpy.random.randn(n) * sqrt(2.0/n)
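A minimal numpy check of this calibration: with unit-variance inputs, dividing by sqrt(n) keeps the variance of s close to 1, whereas the naive small-random initialization from section 3.2 does not.

import numpy as np

rng = np.random.default_rng(0)
n, batch = 512, 10000
x = rng.standard_normal((batch, n))                  # unit-variance inputs
w_naive = 0.01 * rng.standard_normal((n, n))         # small random init (section 3.2)
w_calib = rng.standard_normal((n, n)) / np.sqrt(n)   # variance-normalized init (section 3.3)
print(np.var(x @ w_naive))   # about 0.01**2 * n = 0.05: shrinks further as layers stack
print(np.var(x @ w_calib))   # about 1.0: output variance matches input variance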

4. During training

4.1 Convolution filters and pooling sizes. The input size is best chosen as a power of two, such as 32 (the CIFAR-10 image size), 64, or 224 (a common ImageNet size). In addition, small filters (e.g. 3x3), a small stride (e.g. 1) and zero padding not only reduce the number of parameters but also improve the accuracy of the whole network. A 3x3 filter with stride 1 and padding of 1 keeps the spatial dimensions of the image or feature map unchanged. The pooling layer typically uses a 2x2 pooling size.
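A quick check of the standard output-size formula (W - F + 2P)/S + 1, which is why the 3x3/stride-1/pad-1 combination preserves the spatial size; the stride of 2 for the 2x2 pooling layer is the usual assumption.

def conv_output_size(w, f, p, s):
    # Spatial output size of a convolution or pooling layer: (W - F + 2P) / S + 1.
    return (w - f + 2 * p) // s + 1

print(conv_output_size(32, 3, 1, 1))   # 3x3 filter, stride 1, pad 1: 32 -> 32 (size preserved)
print(conv_output_size(32, 2, 0, 2))   # 2x2 pooling, stride 2:       32 -> 16 (size halved)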

4.2 Learning rate. Using a validation set is an effective way to find an appropriate LR (learning rate). When training starts, the LR is usually set to 0.1. In practice, when you observe that the loss or accuracy on the validation set has stopped improving, divide the LR by 2 or 5 and keep training.
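A minimal sketch of such a schedule, keyed off the validation loss; the starting LR of 0.1 and the division by 2 follow the text, while the 5-epoch patience window is an assumption.

def maybe_decay_lr(lr, val_losses, patience=5, factor=0.5):
    # Divide the LR by 2 when the validation loss has not improved for `patience` epochs.
    if len(val_losses) > patience and min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        return lr * factor
    return lr

lr = 0.1
history = [2.3, 1.8, 1.5, 1.4, 1.4, 1.41, 1.40, 1.42, 1.43, 1.40]
lr = maybe_decay_lr(lr, history)   # -> 0.05: the last 5 epochs show no improvement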

4.3 Fine-tuning. Many state-of-the-art deep network models are released as open-source pre-trained models, and these pre-trained models have strong generalization abilities, so you can fine-tune them for your own task. Fine-tuning involves two important factors: the size of the new dataset and its similarity to the original dataset. Keep in mind that the top layers of the network contain more dataset-specific features.
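As an illustration only (the article itself later points to the Caffe Model Zoo, not PyTorch), a minimal fine-tuning sketch assuming PyTorch and torchvision are available: freeze the generic lower layers and retrain only a new top layer. The choice of resnet18, the 10 classes and the SGD settings are arbitrary.

import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(pretrained=True)    # any pre-trained model-zoo network would do

for param in model.parameters():                         # freeze the pre-trained, generic layers
    param.requires_grad = False

num_classes = 10                                         # e.g. a 10-class target dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new, dataset-specific top layer

optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)  # train only the new head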

5. Activation function

The activation function introduces nonlinearity into the network. Sigmoid and tanh used to be very popular but are now rarely used in vision models, mainly because when the absolute value of their input is large, their gradient (derivative) approaches zero; the parameters are then almost never updated, the gradient flowing backward is cut off, and we get the phenomenon of vanishing gradients.

[Figure: the sigmoid, tanh and ReLU activation functions]

Advantages of ReLU: it is very simple to implement and speeds up computation; it accelerates convergence, has no saturation problem, and greatly alleviates vanishing gradients.

Disadvantage of ReLU: a unit may "die" permanently. Suppose a set of two-dimensional data x = (x1, x2) is distributed in the region x1 in [0, 1], x2 in [0, 1], and a set of parameters w = (w1, w2) applies a linear transformation to x whose result is fed into a ReLU: f = w1*x1 + w2*x2. If w1 = w2 = -1, then f is less than or equal to zero no matter what value x takes, so the derivative of the ReLU with respect to f is always zero and this ReLU unit never participates in the learning of the model.

To solve the problem that the derivative of ReLU is zero on the negative interval, variants such as Leaky ReLU, Parametric ReLU and Randomized ReLU were invented. Their common idea is to give the ReLU a non-zero slope on the negative interval so that its derivative is not zero (call this slope alpha; a small sketch follows the list of variants below).

Leaky ReLU fixes alpha to a constant value, and the whole model uses this slope.

Parametric ReLU treats alpha as a parameter and learns its optimal value from the data.

Randomized ReLU samples alpha at random from a specified interval during training and fixes it at test time.
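As an illustration, a minimal numpy sketch of Leaky ReLU and its derivative; the default alpha = 0.01 is an assumed value, not one taken from the experiments described below.

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Slope 1 on the positive side, a small fixed slope alpha on the negative side.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # The derivative is alpha rather than 0 for x <= 0, so the unit cannot "die".
    return np.where(x > 0, 1.0, alpha)

Parametric ReLU would treat alpha as a learned parameter, and Randomized ReLU would draw it at random for each training pass.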

Some researchers have taken two of the best-performing CNN architectures, combined them with the different activation functions, and run experiments on the CIFAR-10, CIFAR-100 and NDSB datasets to evaluate the pros and cons of the four activation functions.

The experimental results show that Leaky ReLU with a larger alpha achieves better accuracy. Parametric ReLU overfits easily on small datasets (it has the lowest error rate on the training set but is not ideal on the test set), yet it still outperforms plain ReLU. Randomized ReLU works well, and the experiments show it can counter overfitting, which may be due to the randomness in choosing alpha. In practice, both Parametric ReLU and Randomized ReLU are worth using.

6. Regularization

The following are several commonly used ways of preventing overfitting in neural networks by controlling the capacity of the model.

6.1 L1 regularization. L1 regularization is another common regularization method. Here, for each weight w in the network, we add the term λ|w| to the objective function. A very interesting property of L1 regularization is that it makes the weight vectors become sparse during optimization (i.e. many entries very close to zero). Neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the noisy inputs. In contrast, the final weight vectors from L2 regularization are usually diffuse, with many small values. In practice, if you do not care about explicit feature selection, L2 regularization can be expected to perform better than L1.

6.2 L2 regularization. L2 regularization may be the most commonly used form of regularization. It is implemented by adding the squares of all the parameters of the model to the objective function as a penalty term. In other words, for each weight w in the network we add the term (1/2)λw^2 to the objective function, where λ is the regularization strength. The factor of 1/2 is commonly included so that the derivative of (1/2)λw^2 with respect to w is simply λw rather than 2λw. The intuitive interpretation of L2 regularization is that it strongly penalizes peaky weight vectors and prefers diffuse weight vectors.
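A minimal numpy sketch of how the (1/2)λw^2 term enters the loss and, thanks to the 1/2 factor, contributes simply λW to the weight gradient; λ = 1e-4 is an assumed value for illustration.

import numpy as np

lam = 1e-4   # regularization strength λ (assumed value)

def l2_penalty(W):
    # Contribution of one weight matrix: (1/2)*λ*sum(w^2) to the loss, λ*W to the gradient.
    loss = 0.5 * lam * np.sum(W ** 2)
    grad = lam * W
    return loss, grad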

6.3 Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of each neuron's weight vector, using projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as usual and then clamping the weight vector w of every neuron so that it satisfies ||w||_2 < c. Typical values of c are on the order of 3 or 4. Some people report improvements with this form of regularization. One of its appealing properties is that the network cannot "explode" even when the learning rate is set too high, because the updates are always bounded.

6.4 Dropout. Dropout is an extremely effective, simple and relatively recently proposed regularization technique that complements the three methods above (L1, L2, max norm constraints). During training, dropout can be understood as sampling a sub-network from the full network and updating only the parameters of that sampled sub-network based on the input data. The exponentially many possible sub-networks are not independent, however, because they share parameters. During testing, dropout is not applied; the prediction can be interpreted as averaging over the exponentially large ensemble of all sub-networks. In practice, a dropout ratio of p = 0.5 is a reasonable default, but this value can be fine-tuned on validation data.

Dropout is currently the most widely used regularization technique.
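As an illustration, a minimal numpy sketch of the common "inverted" dropout formulation, which rescales the surviving activations at training time so that nothing needs to change at test time; p = 0.5 follows the default mentioned above.

import numpy as np

p = 0.5   # probability of keeping each unit active

def dropout(h, train=True):
    # Inverted dropout on activations h: drop units during training, no-op at test time.
    if not train:
        return h                                   # test time: full network, no extra scaling
    mask = (np.random.rand(*h.shape) < p) / p      # drop with probability 1-p, rescale survivors by 1/p
    return h * mask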

7. Observing the training curves

7.1 Observing the learning rate. With too high a learning rate, the loss curve looks very strange and the parameters can easily explode; with too low a learning rate, the loss decreases very slowly; with a somewhat high learning rate, the loss falls quickly at first but then easily gets stuck near a local minimum. A good learning rate makes the loss fall smoothly.

7.2 Zooming in on the loss curve. In Figure 2 the horizontal axis is the epoch (one complete pass of the network over the training set, so each epoch contains many mini-batches) and the vertical axis is the classification loss of each training batch. If the loss curve looks linear, the learning rate is too low; if the loss stops decreasing, the learning rate is too high and training has fallen into a local minimum. The width of the curve is related to the batch size: if the curve is too wide (noisy), the variation between adjacent batches is too large and the batch size should be increased.

7.3 Observing the accuracy curves. In Figure 3 the red line is the accuracy on the training set and the green line is the accuracy on the validation set. If, when the validation accuracy has converged, the gap between the red and green lines is large, there is obvious overfitting on the training set. If the gap between the two lines is small but the accuracy of both is low, the model's learning ability is insufficient and its capacity should be increased.

8. Ensembles

In machine learning, training several learners and combining them is a well-established approach. It is well known that ensemble methods can usually obtain higher accuracy than individual learners, and they have achieved great success in real-world tasks. In practical applications, and especially in challenges and competitions, almost all first- and second-place winners use ensembles.

Ensemble techniques in deep learning (a prediction-averaging sketch follows the list):

Same model, different initializations
Use cross-validation to determine the best hyperparameters, then train several models with that best set of hyperparameters but different random initializations. The danger of this approach is that the diversity of the models depends solely on initialization.

Top models discovered during cross-validation
Use cross-validation to determine the best hyperparameters, then pick the several best models for the ensemble. This improves the diversity of the ensemble, but it carries a risk: some of those models may be local optima. In practice this is easy to do because it requires no additional training after cross-validation. In fact, you can directly take several state-of-the-art deep models from the Caffe Model Zoo and ensemble them.

Different checkpoints of a single model
If training is very expensive, some people have had limited success by taking different checkpoints of a single network over time (for example, after each epoch) and using these to form an ensemble. Clearly this suffers from a lack of diversity, but it can still work quite well in practice, and its advantage is that it is very cheap and simple.
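As an illustration of how such an ensemble combines its members at prediction time, a minimal sketch that simply averages the class probabilities of several models or checkpoints (the three hard-coded outputs are hypothetical):

import numpy as np

def ensemble_predict(prob_list):
    # Average the class probabilities produced by several models or checkpoints.
    return np.mean(np.stack(prob_list, axis=0), axis=0)

probs = [np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]),    # model 1: 2 samples, 3 classes
         np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]),    # model 2
         np.array([[0.8, 0.1, 0.1], [0.1, 0.6, 0.3]])]    # model 3
print(ensemble_predict(probs).argmax(axis=1))             # predicted classes after averaging: [0 1]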
