background:
Introduction to hyper-parameter tuning and processing
1-Hyper-parameter tuning
In earlier machine learning practice, we could lay hyper-parameter values out on a grid and traverse the grid to find the best combination. In deep learning, however, we generally prefer to sample hyper-parameter values at random.
The grid in the figure above fixes each hyper-parameter to only 5 values, which is unwise when we do not yet know which hyper-parameters matter more. If instead we sample at random as in the right-hand figure, then with 25 trials we obtain 25 distinct values of parameter 1 and 25 distinct values of parameter 2. For example, if one parameter is the learning rate $\alpha$ and the other is $\epsilon$, the left figure tries only 5 values of $\alpha$ while the right figure tries 25 values of $\alpha$, so it is more likely to find the most appropriate $\alpha$.
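As a minimal sketch of the two strategies (both value ranges below are illustrative assumptions, chosen only to make the example concrete):

import numpy as np

np.random.seed(0)

# Grid search: each hyper-parameter is fixed to 5 values (5 x 5 = 25 trials),
# so only 5 distinct values of alpha are ever tried.
alpha_grid = np.logspace(-4, 0, 5)       # 5 candidate learning rates (assumed range)
eps_grid = np.logspace(-8, -4, 5)        # 5 candidate epsilon values (assumed range)
grid_pairs = [(a, e) for a in alpha_grid for e in eps_grid]

# Random search: 25 trials give 25 distinct values of each hyper-parameter.
alpha_rand = 10 ** (-4 * np.random.rand(25))     # alpha in [1e-4, 1], log scale
eps_rand = 10 ** (-8 + 4 * np.random.rand(25))   # epsilon in [1e-8, 1e-4], log scale
random_pairs = list(zip(alpha_rand, eps_rand))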
With many more hyper-parameters, the search space becomes high-dimensional, and the same random sampling method applies, again improving search efficiency.
Another technique is coarse-to-fine search. After the random sampling above, we find that some regions of the search space give better results; we then sample more densely within those regions (see the sketch below).
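A minimal coarse-to-fine sketch, assuming the hyper-parameter is a learning rate searched over [1e-4, 1]; the score function is a placeholder introduced here for illustration, not part of the original text:

import numpy as np

def score(alpha):
    # Dummy stand-in for "train a model with learning rate alpha and return its
    # validation accuracy"; replace with a real training/evaluation loop.
    return -abs(np.log10(alpha) + 2.5)

# Coarse pass: 25 random samples of alpha on a log scale over [1e-4, 1].
coarse = 10 ** (-4 * np.random.rand(25))
best = max(coarse, key=score)

# Fine pass: sample more densely in a narrower log-range around the best coarse value.
lo, hi = np.log10(best) - 0.5, np.log10(best) + 0.5
fine = 10 ** (lo + (hi - lo) * np.random.rand(25))
best = max(fine, key=score)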
2-Select the appropriate range of hyper-parameters
The random sampling mentioned above does not mean sampling uniformly over the raw range of valid values; it means sampling uniformly after choosing an appropriate scale.
For the number of neurons in a layer of a neural network, we can search uniformly within a range such as 20~40, and for the number of layers we can likewise search within a range such as 2~5 (a sketch follows below). For some hyper-parameters, however, uniform sampling over the raw range is not appropriate.
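Uniform sampling of such integer-valued hyper-parameters is straightforward, as in this sketch (the ranges mirror the examples above):

import numpy as np

n_hidden = np.random.randint(20, 41)   # number of neurons in a layer, uniform over 20..40
n_layers = np.random.randint(2, 6)     # number of layers, uniform over 2..5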
For example, consider the learning rate $\alpha$, with a minimum value of 0.0001 and a maximum of 1, i.e. a search range of (0.0001, 1). If we sample uniformly along this axis, about 90% of the samples fall in (0.1, 1), and only 10% of the search budget is spent on (0.0001, 0.1). It is more reasonable to search such a hyper-parameter on a logarithmic scale: mark the points 0.0001, 0.001, 0.01, 0.1 and 1 on the axis, and then sample uniformly on this logarithmic axis.
Python implementation:
import numpy as np
r = -4 * np.random.rand()   # r is uniformly distributed in [-4, 0]
alpha = np.power(10, r)     # alpha = 10^r, so alpha ranges over [10^-4, 10^0]
More generally, to sample a value between $10^a$ and $10^b$ (in the example above, $a=\log_{10}(0.0001)=-4$ and $b=\log_{10}(1)=0$), we draw $r$ uniformly at random from $[a, b]$ and set $\alpha=10^r$. In other words, sampling over the interval $[10^a, 10^b]$ reduces to sampling $r$ uniformly between $a$ and $b$ on the logarithmic axis.
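A small helper capturing this recipe (a sketch; sample_log_uniform is a name chosen here for illustration, not a standard library function):

import numpy as np

def sample_log_uniform(low, high):
    # Sample uniformly on a log scale between low and high (both > 0).
    a, b = np.log10(low), np.log10(high)
    r = a + (b - a) * np.random.rand()   # r uniform in [a, b]
    return 10 ** r

alpha = sample_log_uniform(0.0001, 1)    # reproduces the learning-rate example above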
For the hyper-parameter $\beta$ used when computing exponentially weighted averages, suppose $\beta$ lies in [0.9, 0.999]. We can work with $1-\beta$ instead, which lies in [0.001, 0.1], and apply the method above: sample $r$ uniformly at random in $[-3, -1]$, set $1-\beta=10^r$, and recover $\beta = 1 - 10^r$.
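In code, the same idea applied to $\beta$ might look like this sketch:

import numpy as np

r = -3 + 2 * np.random.rand()   # r uniform in [-3, -1]
beta = 1 - 10 ** r              # 1 - beta in [0.001, 0.1], so beta in [0.9, 0.999]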
When $\beta$ approaches 1, the result becomes very sensitive to even tiny changes in $\beta$. For example, if $\beta$ changes from 0.9000 to 0.9005, the result barely changes, but if $\beta$ changes from 0.999 to 0.9995, the impact on the algorithm is huge. In terms of how many values the exponentially weighted average effectively averages over, the former stays at roughly 10 values in both cases, while the latter goes from averaging about 1000 values (for 0.999) to about 2000 values (for 0.9995). The formula behind this is $1/(1-\beta)$. Therefore, when $\beta$ is close to 1 the result changes very sensitively, and we need to sample more densely in that region. Sampling $1-\beta$ on the logarithmic scale, as above, does exactly this: it places more samples where $\beta$ is close to 1.
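A quick check of the effective averaging window $1/(1-\beta)$ for the values mentioned above:

for beta in (0.9000, 0.9005, 0.9990, 0.9995):
    print(beta, 1 / (1 - beta))   # about 10, 10.05, 1000, 2000 values averaged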