background:
Introduction to hyper-parameter tuning and processing
1-Hyper-parameter tuning
In earlier machine learning practice, we could lay hyper-parameter values out on a grid and traverse the grid to find the best combination. In deep learning, however, we generally prefer to sample hyper-parameter values at random.
The grid in the figure above fixes each hyper-parameter to only 5 values, which is unwise when we do not yet know which hyper-parameters matter more. If instead we sample at random as in the right-hand figure, then with 25 trials we obtain 25 distinct values of parameter 1 and 25 distinct values of parameter 2. For example, if one parameter is the learning rate $\alpha$ and the other is $\epsilon$, the left figure tries only 5 values of $\alpha$ while the right figure tries 25 values of $\alpha$, so it is more likely to find the most appropriate $\alpha$.
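As a minimal sketch of the two strategies (both value ranges below are illustrative assumptions, chosen only to make the example concrete):

import numpy as np

np.random.seed(0)

# Grid search: each hyper-parameter is fixed to 5 values (5 x 5 = 25 trials),
# so only 5 distinct values of alpha are ever tried.
alpha_grid = np.logspace(-4, 0, 5)       # 5 candidate learning rates (assumed range)
eps_grid = np.logspace(-8, -4, 5)        # 5 candidate epsilon values (assumed range)
grid_pairs = [(a, e) for a in alpha_grid for e in eps_grid]

# Random search: 25 trials give 25 distinct values of each hyper-parameter.
alpha_rand = 10 ** (-4 * np.random.rand(25))     # alpha in [1e-4, 1], log scale
eps_rand = 10 ** (-8 + 4 * np.random.rand(25))   # epsilon in [1e-8, 1e-4], log scale
random_pairs = list(zip(alpha_rand, eps_rand))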
With many more hyper-parameters, the search space becomes high-dimensional, and the same random sampling method applies, again improving search efficiency.
Another technique is coarse-to-fine search. After the random sampling above, we find that some regions of the search space give better results; we then sample more densely within those regions (see the sketch below).
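A minimal coarse-to-fine sketch, assuming the hyper-parameter is a learning rate searched over [1e-4, 1]; the score function is a placeholder introduced here for illustration, not part of the original text:

import numpy as np

def score(alpha):
    # Dummy stand-in for "train a model with learning rate alpha and return its
    # validation accuracy"; replace with a real training/evaluation loop.
    return -abs(np.log10(alpha) + 2.5)

# Coarse pass: 25 random samples of alpha on a log scale over [1e-4, 1].
coarse = 10 ** (-4 * np.random.rand(25))
best = max(coarse, key=score)

# Fine pass: sample more densely in a narrower log-range around the best coarse value.
lo, hi = np.log10(best) - 0.5, np.log10(best) + 0.5
fine = 10 ** (lo + (hi - lo) * np.random.rand(25))
best = max(fine, key=score)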
2-Select the appropriate range of hyper-parameters
The random sampling mentioned above does not mean sampling uniformly over the raw range of valid values; it means sampling uniformly after choosing an appropriate scale.
For the number of neurons in a layer of a neural network, we can search uniformly within a range such as 20~40, and for the number of layers we can likewise search within a range such as 2~5 (a sketch follows below). For some hyper-parameters, however, uniform sampling over the raw range is not appropriate.
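Uniform sampling of such integer-valued hyper-parameters is straightforward, as in this sketch (the ranges mirror the examples above):

import numpy as np

n_hidden = np.random.randint(20, 41)   # number of neurons in a layer, uniform over 20..40
n_layers = np.random.randint(2, 6)     # number of layers, uniform over 2..5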
For example, consider the learning rate $\alpha$, with a minimum value of 0.0001 and a maximum of 1, i.e. a search range of (0.0001, 1). If we sample uniformly along this axis, about 90% of the samples fall in (0.1, 1), and only 10% of the search budget is spent on (0.0001, 0.1). It is more reasonable to search such a hyper-parameter on a logarithmic scale: mark the points 0.0001, 0.001, 0.01, 0.1 and 1 on the axis, and then sample uniformly on this logarithmic axis.
Python implementation:
import numpy as np
r = -4 * np.random.rand()   # r is uniformly distributed in [-4, 0]
alpha = np.power(10, r)     # alpha = 10^r, so alpha ranges over [10^-4, 10^0]
More generally, to sample a value between $10^a$ and $10^b$ (in the example above, $a=\log_{10}(0.0001)=-4$ and $b=\log_{10}(1)=0$), we draw $r$ uniformly at random from $[a, b]$ and set $\alpha=10^r$. In other words, sampling over the interval $[10^a, 10^b]$ reduces to sampling $r$ uniformly between $a$ and $b$ on the logarithmic axis.
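A small helper capturing this recipe (a sketch; sample_log_uniform is a name chosen here for illustration, not a standard library function):

import numpy as np

def sample_log_uniform(low, high):
    # Sample uniformly on a log scale between low and high (both > 0).
    a, b = np.log10(low), np.log10(high)
    r = a + (b - a) * np.random.rand()   # r uniform in [a, b]
    return 10 ** r

alpha = sample_log_uniform(0.0001, 1)    # reproduces the learning-rate example above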
For the hyper-parameter $\beta$ used when computing exponentially weighted averages, suppose $\beta$ lies in [0.9, 0.999]. We can work with $1-\beta$ instead, which lies in [0.001, 0.1], and apply the method above: sample $r$ uniformly at random in $[-3, -1]$, set $1-\beta=10^r$, and recover $\beta = 1 - 10^r$.
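In code, the same idea applied to $\beta$ might look like this sketch:

import numpy as np

r = -3 + 2 * np.random.rand()   # r uniform in [-3, -1]
beta = 1 - 10 ** r              # 1 - beta in [0.001, 0.1], so beta in [0.9, 0.999]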
When $\beta$ approaches 1, the result becomes very sensitive to even tiny changes in $\beta$. For example, if $\beta$ changes from 0.9000 to 0.9005, the result barely changes, but if $\beta$ changes from 0.999 to 0.9995, the impact on the algorithm is huge. In terms of how many values the exponentially weighted average effectively averages over, the former stays at roughly 10 values in both cases, while the latter goes from averaging about 1000 values (for 0.999) to about 2000 values (for 0.9995). The formula behind this is $1/(1-\beta)$. Therefore, when $\beta$ is close to 1 the result changes very sensitively, and we need to sample more densely in that region. Sampling $1-\beta$ on the logarithmic scale, as above, does exactly this: it places more samples where $\beta$ is close to 1.
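A quick check of the effective averaging window $1/(1-\beta)$ for the values mentioned above:

for beta in (0.9000, 0.9005, 0.9990, 0.9995):
    print(beta, 1 / (1 - beta))   # about 10, 10.05, 1000, 2000 values averaged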