Unlike classical machine learning models, deep learning models are literally full of hyper-parameters, and not all of these variables contribute equally to the model's learning process.
Given this extra complexity, finding the optimal configuration of these variables in such a high-dimensional space is not easy.
Every scientist and researcher wants the best possible model given the resources at hand: compute, money, and time.
Typically, researchers and hobbyists only try a search strategy in the final stages of development, hoping to squeeze some extra performance out of a model they have already trained with great effort.
Hyper-parameter search is also a crucial stage in semi-automated and fully automated deep learning pipelines.
What exactly is a hyper-parameter?
Let's start with the simplest definition:
Hyper-parameters are the knobs you can turn when building a machine learning / deep learning model.
Or, put another way:
Hyper-parameters are all the training variables that you set manually, to a predetermined value, before training starts.
We can all agree that the learning rate and the dropout rate are hyper-parameters, but what about model-design variables such as the embedding size, the number of layers, or the activation functions? Should we treat these as hyper-parameters too?
For simplicity, yes: we also treat model-design variables as part of the hyper-parameter set.
What about the variables that come out of the training process, i.e. that are learned from the data? These are called model parameters, and we exclude them from the hyper-parameter set.
Let's take an example that shows how the different variables of a deep learning model are classified (see the sketch below).
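As a minimal, framework-agnostic illustration (all names and values here are illustrative, not taken from the article), the split looks like this:

```python
# Everything in `hyperparams` is set by hand, before training starts.
hyperparams = {
    # training variables
    "learning_rate": 1e-3,
    "dropout_rate": 0.3,
    "batch_size": 32,
    # model-design variables (we also treat these as hyper-parameters)
    "num_layers": 4,
    "embedding_dim": 128,
    "activation": "relu",
}
# The weights and biases the network learns from the data are the model
# parameters: they are learned, not set, so they are not in this set.
```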
The next problem: search is expensive
The challenge in finding the best hyper-parameter configuration is that the search is an iterative process constrained by compute, money, and time.
We start by guessing a promising configuration (step 1), wait for a full training run to finish (step 2), and get a realistic assessment of the metric we care about (step 3). We then record the result to keep track of the search (step 4) and, following our search strategy, pick a new guess (back to step 1).
We keep going like this until we run out of something, usually money or time.
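The following minimal sketch shows what this loop looks like in code; `sample_config` and `train_and_evaluate` are hypothetical helpers standing in for your own search strategy and training code, and the random sampling and dummy score are only there so the sketch runs end to end:

```python
import random

def sample_config():
    # step 1: guess a configuration (here: picked at random)
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "dropout_rate": random.uniform(0.1, 0.5),
        "batch_size": random.choice([16, 32, 64, 128]),
    }

def train_and_evaluate(config):
    # steps 2-3: run a full training and return the metric we care about
    # (a dummy score is returned here in place of a real training run)
    return random.random()

history = []      # step 4: keep track of the search process
budget = 20       # we stop when the budget (money/time) runs out
for _ in range(budget):
    config = sample_config()                # step 1
    score = train_and_evaluate(config)      # steps 2-3
    history.append((config, score))         # step 4

best_config, best_score = max(history, key=lambda item: item[1])
print(best_config, best_score)
```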
Let's talk about strategy.
There are four main strategies we can use to search for the best configuration:
Babysitting, a.k.a. Trial & Error
Grid Search
Random Search
Bayesian Optimization
Babysitting
In academia, babysitting is also known as "Trial and Error" or "Grad Student Descent". The approach is 100% manual and is typically used by researchers, students, and hobbyists.
The process is very simple: a student designs a new experiment, follows every step of the learning process, from data collection to visualizing feature maps, and then iterates over the hyper-parameters sequentially until she hits a deadline or runs out of steam.
This is sometimes called the "panda" workflow: nursing one model at a time, the way a panda raises a single cub.
This approach is very instructive. However, it does not work within a team or a company, where a data scientist's time is too precious.
This raises a question for us:
"Is there a better way to take advantage of our time?" ”
Of course, we can use your time by defining an automatic hyper-parametric search strategy.
Grid Search
Grid search is a simple way to try all possible configurations.
Here is the workflow:
Define an n-dimensional grid, where each dimension corresponds to one hyper-parameter. For example, n = (learning_rate, dropout_rate, batch_size).
For each dimension, define the range of possible values: for example, batch_size = [4, 8, 16, 32, 64, 128, 256].
Try every possible configuration and wait for the results to pick the best one: for example, C1 = (0.1, 0.3, 4) → acc = 92%, C2 = (0.1, 0.35, 4) → acc = 92.3%, and so on (a minimal sketch follows this list).
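Here is a minimal grid-search sketch of that workflow, assuming a hypothetical `train_and_evaluate(config)` wrapper around your training run (the fixed return value is only there so the sketch runs):

```python
from itertools import product

grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout_rate": [0.3, 0.35, 0.4],
    "batch_size": [4, 8, 16, 32, 64, 128, 256],
}

def train_and_evaluate(config):
    # stand-in metric; replace with a full training run
    return 0.9

results = []
for values in product(*grid.values()):       # every possible configuration
    config = dict(zip(grid.keys(), values))
    results.append((config, train_and_evaluate(config)))

best_config, best_acc = max(results, key=lambda item: item[1])
print(best_config, best_acc)
```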
(Figure: a simple two-dimensional grid search over dropout rate and learning rate.)
This strategy is embarrassingly parallel, because it does not take the history of the computations into account: with grid search, the more computing resources you have, the more guesses you can try at the same time.
The real pain point of this approach is known as the curse of dimensionality: every dimension we add makes the search dramatically more expensive, and eventually the strategy becomes infeasible.
This method is commonly used when there are four or fewer dimensions. Even though it guarantees that the best configuration on the grid will eventually be found, it is still not desirable in practice; it is better to use random search instead.
Random Search
A few years ago, Bergstra and Bengio published a paper demonstrating the inefficiency of grid search.
The only real difference between grid search and random search lies in step 1: random search samples points from the configuration space at random.
The images below illustrate the researchers' argument.
The comparison searches for the best configuration in a space of two hyper-parameters, under the assumption that one parameter matters more than the other.
This is a safe assumption: as mentioned at the beginning, deep learning models really are full of hyper-parameters, and usually the researcher / scientist / student knows which ones affect training the most.
With grid search, it is easy to notice that even though we trained 9 models, each variable took only 3 distinct values.
With random search, it is very unlikely that the same value is picked more than once for a given variable: with this second approach, we train 9 models using 9 different values for each variable.
Key point: if your search space has more than 3 or 4 dimensions, do not use grid search. Use random search instead; it provides a really good baseline for every search task.
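For comparison, here is a minimal random-search sketch over a similar space, again with a hypothetical `train_and_evaluate` stub in place of a real training run:

```python
import random

# each dimension is described by a sampling function instead of a fixed list
space = {
    "learning_rate": lambda: 10 ** random.uniform(-4, -1),   # log-uniform
    "dropout_rate": lambda: random.uniform(0.2, 0.5),
    "batch_size": lambda: random.choice([4, 8, 16, 32, 64, 128, 256]),
}

def train_and_evaluate(config):
    return random.random()   # dummy metric; replace with real training

results = []
for _ in range(9):           # same budget of 9 models as the grid example
    config = {name: sample() for name, sample in space.items()}
    results.append((config, train_and_evaluate(config)))

best_config, best_acc = max(results, key=lambda item: item[1])
print(best_config, best_acc)
```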
One step back, two steps forward.
As a side note, when you define the space for each dimension, it is important to pick the right scale for each variable.
For example, it is common to use powers of 2 for the batch size and to sample the learning rate on a log scale.
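A tiny sketch of those scale choices (the ranges are illustrative only):

```python
import random

batch_sizes = [2 ** k for k in range(2, 9)]      # 4, 8, 16, ..., 256
learning_rate = 10 ** random.uniform(-5, -1)     # sampled uniformly in log space
```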
Another common practice is to run one of the above layouts for a number of iterations and then zoom in on a promising subspace by sampling each variable's range more densely, possibly starting a new search there with the same or a different search strategy.
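A small sketch of that zoom-in step, assuming a coarse stage already produced a list of (config, score) pairs (the two entries below are made-up placeholders):

```python
import random

# pretend these came from the coarse search stage
history = [
    ({"learning_rate": 3e-3, "dropout_rate": 0.30, "batch_size": 64}, 0.91),
    ({"learning_rate": 1e-2, "dropout_rate": 0.40, "batch_size": 32}, 0.88),
]
best_config, _ = max(history, key=lambda item: item[1])

# sample the next round of guesses from tighter ranges around the best result
fine_config = {
    "learning_rate": best_config["learning_rate"] * 10 ** random.uniform(-0.5, 0.5),
    "dropout_rate": best_config["dropout_rate"] + random.uniform(-0.05, 0.05),
    "batch_size": best_config["batch_size"],
}
print(fine_config)
```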
One more problem: the guesses are independent
Unfortunately, grid search and random search have a common disadvantage:
"Every new guess is independent of the previous run!" ”
This is where babysitting shows its advantage: it works because the scientist can draw on past guesses and use them as a resource to improve the next one, effectively driving the search and the experiments forward.
Wait a minute, this sounds familiar... what if we modelled the hyper-parameter search itself as a machine learning task? What would happen?
Well, allow me to introduce Bayesian optimization.
Bayesian optimization
This search strategy builds a surrogate model that tries to predict, from a hyper-parameter configuration, the metric we care about.
At each iteration, the surrogate becomes more and more confident about which new guesses are likely to bring an improvement. Like the other search strategies, it stops when the budget is exhausted.
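One common way to realise this idea is a Gaussian-process surrogate with an expected-improvement acquisition function. The sketch below assumes scikit-learn and SciPy are available and uses a made-up one-dimensional objective (log10 of the learning rate) in place of a real training run; it is an illustration of the technique, not the article's implementation:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(log_lr):
    # stand-in for a full training run returning validation accuracy
    return float(0.95 - (log_lr + 3.0) ** 2 / 10.0)

# configurations tried so far: just log10(learning_rate) in this toy example
X = np.array([[-5.0], [-1.0]])
y = np.array([objective(x[0]) for x in X])

candidates = np.linspace(-6.0, 0.0, 200).reshape(-1, 1)
for _ in range(10):
    # surrogate model: predicts the metric (and its uncertainty) per config
    surrogate = GaussianProcessRegressor(alpha=1e-6).fit(X, y)
    mu, sigma = surrogate.predict(candidates, return_std=True)

    # expected improvement over the best observation so far
    best = y.max()
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    next_x = candidates[np.argmax(ei)]       # new, informed guess
    X = np.vstack([X, next_x])
    y = np.append(y, objective(next_x[0]))

print("best log10(learning_rate) found:", X[np.argmax(y), 0])
```

Because each new guess exploits everything learned from the previous runs, the search homes in on promising regions much faster than independent guessing.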
Source: "A practical guide to deep learning model hyper-parameter search" — https://mp.weixin.qq.com/s/wG_N6hKlKXcrXrAhzj3QwQ