Deep Learning Scattered Knowledge Points (continued) - Machine Learning


1. Steps of the gradient descent algorithm (a minimal sketch follows the list):

A. Initialize the weights and biases with random values

B. Feed the input through the network to obtain the output value

C. Compute the error between the predicted and true values

D. Adjust the weights of each neuron that contributed to the error so as to reduce it

E. Repeat the iteration until the best weights are obtained
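A minimal NumPy sketch of steps A-E on a toy linear-regression task (all names, data, and hyperparameters here are illustrative, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy inputs
y_true = X @ np.array([1.0, -2.0, 0.5])  # toy targets

# A. Initialize weights and bias with random values
w = rng.normal(size=3)
b = 0.0

lr = 0.1
for epoch in range(200):                 # E. repeat the iteration
    y_pred = X @ w + b                   # B. forward pass: compute the output
    error = y_pred - y_true              # C. error between prediction and truth
    # D. adjust weights/bias along the negative gradient of the MSE loss
    w -= lr * (X.T @ error) / len(X)
    b -= lr * error.mean()
```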


2. Before data is fed into a neural network, a series of preprocessing steps (rotation, translation, scaling) is needed; the network itself cannot perform these transformations (a sketch follows).
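A sketch of such preprocessing using scipy.ndimage (the image and parameter values are arbitrary examples):

```python
import numpy as np
from scipy.ndimage import rotate, shift, zoom

img = np.random.rand(28, 28)                       # stand-in grayscale image

rotated    = rotate(img, angle=15, reshape=False)  # rotation by 15 degrees
translated = shift(img, (2, -3))                   # translate 2 px down, 3 px left
scaled     = zoom(img, 1.2)                        # scale up by 20%
```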


3. The bagging operation is similar to dropout in neural networks. Bagging (bootstrap aggregating, usually mentioned alongside boosting) is a technique that repeatedly samples from the data with replacement under a uniform probability distribution; see the sketch below.
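A sketch of bagging's resampling step in NumPy (the dataset and bag count are illustrative); each bag would train one base model, much as each dropout mask trains one sub-network:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)          # stand-in dataset of 10 samples

n_bags = 3
for _ in range(n_bags):
    idx = rng.choice(len(X), size=len(X), replace=True)  # sampling with replacement
    bag = X[idx]           # train one base model on each bag
    print(bag)
```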


4. When training a neural network, if the loss does not decrease during the first few epochs, possible causes are: the learning rate is too low, the regularization parameter is too high, or the optimization is stuck in a local optimum.


5. For many high-dimensional non-convex functions, local minima (and maxima) are in fact far rarer than another kind of point with zero gradient: the saddle point. Near a saddle point, some points have a higher cost than the saddle point while others have a lower cost.
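A worked example: f(x, y) = x^2 - y^2 has zero gradient at the origin, yet the Hessian eigenvalues have mixed signs, so the origin is neither a minimum nor a maximum but a saddle:

```python
import numpy as np

# f(x, y) = x^2 - y^2: the classic saddle point at the origin
def grad(x, y):
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

print(grad(0.0, 0.0))              # [0. 0.] -> a critical point
print(np.linalg.eigvals(hessian))  # [ 2. -2.] -> mixed signs: a saddle
```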



6. PCA (principal component analysis) extracts the directions along which the variance of the data distribution is largest, and thereby also reduces dimensionality. In a neural network, if a hidden layer performs dimensionality reduction, it extracts features with predictive power.
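A minimal PCA sketch via SVD (the function name and k are illustrative), projecting centered data onto its largest-variance directions:

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)              # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                 # project onto the top-k directions

X = np.random.rand(100, 5)
X_reduced = pca(X, k=2)                  # 5 dimensions -> 2
```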


7. Both CNNs and RNNs share weights (a CNN across spatial positions, an RNN across time steps).


8. Why batch normalization (BN) in a neural network counteracts overfitting: the same sample gets different normalized values depending on which mini-batch it lands in, which is equivalent to a form of data augmentation, as sketched below.
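A sketch of this batch dependence: the same sample, normalized inside two different mini-batches, comes out with two different values (all data here is synthetic):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # normalize each feature using statistics of the current mini-batch
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
sample = np.array([1.0, 2.0])
batch_a = np.vstack([sample, rng.normal(size=(7, 2))])
batch_b = np.vstack([sample, rng.normal(size=(7, 2))])

# the first row is the same sample, yet its normalized value differs:
print(batch_norm(batch_a)[0])
print(batch_norm(batch_b)[0])
```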


9. The input image size is 200x200. It passes through a convolution layer (kernel size 5x5, padding 1, stride 2), then a pooling layer (kernel size 3x3, padding 0, stride 1), then another convolution layer (kernel size 3x3, padding 1, stride 1). What is the size of the output feature map? Answer: 97.

Analysis:

Padding is the width of the border added around the input; stride is the step length of each move.

Output size = (input size + 2 * padding - kernel size) / stride + 1

First layer (convolution): output = (200 + 2 - 5)/2 + 1 = 99.5, rounded down to 99

Second layer (pooling): output = (99 - 3)/1 + 1 = 97

Third layer (convolution): output = (97 + 2 - 3)/1 + 1 = 97

Convolution layers round down while pooling layers round up: for a convolution layer, the division result is truncated (floor); for a pooling layer, the Ceil function rounds it up. A helper that encodes this rule follows.
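A small helper reproducing the rule above (the function name is an assumption, not a standard API):

```python
import math

def out_size(n, kernel, padding, stride, pooling=False):
    x = (n + 2 * padding - kernel) / stride + 1
    return math.ceil(x) if pooling else math.floor(x)   # pool: ceil, conv: floor

n = 200
n = out_size(n, kernel=5, padding=1, stride=2)                # conv -> 99
n = out_size(n, kernel=3, padding=0, stride=1, pooling=True)  # pool -> 97
n = out_size(n, kernel=3, padding=1, stride=1)                # conv -> 97
print(n)  # 97
```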


10. The Ho-Kashyap (H-K) algorithm finds the weight vector under the minimum mean squared error criterion. It applies to both linearly separable and non-separable cases: for linearly separable data it yields the optimal weight vector, and for non-separable data it can detect the non-separability and exit the iteration. A rough sketch follows.
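A rough Python sketch of the Ho-Kashyap iteration under common textbook conventions (Y is the sign-normalized, augmented sample matrix; rho, the tolerance, and the stopping tests are simplified for illustration):

```python
import numpy as np

def ho_kashyap(Y, rho=0.5, max_iter=1000, tol=1e-6):
    # seek a weight vector a with Y @ a = b for some margin vector b > 0
    b = np.ones(Y.shape[0])
    a = np.linalg.pinv(Y) @ b
    for _ in range(max_iter):
        e = Y @ a - b
        if np.all(np.abs(e) < tol):
            return a, b, True            # linearly separable: solution found
        if np.all(e <= 0) and np.any(e < 0):
            return a, b, False           # non-separability detected: exit
        b = b + rho * (e + np.abs(e))    # only positive parts of e grow b
        a = np.linalg.pinv(Y) @ b        # MSE solution for the updated b
    return a, b, False
```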


11. Given three dense matrices A (m*n), B (n*p), C (p*q) with m < n < p < q, what is the most efficient way to compute ABC? Answer: (AB)C.

(AB)C: m*n*p + m*p*q multiplication operations

A(BC): n*p*q + m*n*q multiplication operations

Dividing both counts by m*n*p*q gives 1/q + 1/n versus 1/p + 1/m; since n > m and q > p, the (AB)C count is smaller.
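The counts for concrete (arbitrary) dimensions confirm this:

```python
# multiplication counts for an example with m < n < p < q
m, n, p, q = 10, 20, 30, 40

cost_ab_c = m * n * p + m * p * q   # (AB)C: 6000 + 12000 = 18000
cost_a_bc = n * p * q + m * n * q   # A(BC): 24000 + 8000 = 32000
print(cost_ab_c < cost_a_bc)        # True: (AB)C is cheaper
```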


12. Consider a plot of per-layer learning speed during gradient descent (figure not reproduced here) for a neural network with four hidden layers using the sigmoid activation function. This network suffers from vanishing gradients: the first hidden layer corresponds to curve D, the second to C, the third to B, and the fourth to A (A is the first layer reached by backpropagation, so it learns fastest).
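A sketch of the mechanism, ignoring the weight terms: each layer of backpropagation multiplies the gradient by sigmoid'(z) <= 0.25, so earlier layers receive exponentially smaller gradients:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=4)                        # pre-activations of 4 hidden layers
local_grads = sigmoid(z) * (1 - sigmoid(z))   # sigmoid'(z), each <= 0.25

grad = 1.0
for i, g in enumerate(local_grads[::-1], 1):  # walk backward from the output
    grad *= g
    print(f"{i} layer(s) back from the output: gradient factor {grad:.5f}")
```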


13. Suppose we suddenly encounter a problem during training: after several epochs, the error drops instantaneously. We suspect the data is problematic, so we plot it and find that the deviation in the data may be too large. How can this be solved?


Solution: apply PCA and normalization to the data. An instantaneous drop in the error usually means that several strongly correlated samples, or samples with large deviations, were suddenly fitted. Principal component analysis (PCA) and normalization of the data can alleviate the problem; a sketch follows.
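A sketch of the suggested fix with scikit-learn (assuming it is available; the data and component count are arbitrary examples):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 10)             # stand-in training data

# normalize, then decorrelate with PCA, before training the network
preprocess = make_pipeline(StandardScaler(), PCA(n_components=5))
X_clean = preprocess.fit_transform(X)   # normalized, decorrelated features
```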


14. We can observe many small "fluctuations" in the error curve. Should we be worried about this?


No, as long as the error drops cumulatively on both the training set and the cross-validation set.

To reduce these "fluctuations", you can try increasing the batch size. Specifically, when the overall trend of the curve is downward, increasing the batch size narrows the spread of the per-batch gradient directions. When the overall curve is flat but still shows considerable "fluctuation", reducing the learning rate helps it converge further. The "fluctuations" by themselves are not a reason to stop training early to avoid overfitting.
