Post
- Hyperparameter Optimization for Neural Networks
Paper
- Algorithms for Hyper-parameter optimization
Note
Introduction
Sometimes it can be difficult to choose the right architecture for a neural network. Usually, this process requires a lot of experience, because networks include many parameters. Let's check some of the most important parameters that we can optimize for a neural network:
- Number of layers
- Different parameters for each layer (number of hidden units, filter size for convolutional layers and so on)
- Type of activation functions
- Parameter initialization method
- Learning rate
- Loss function
Even though this list of parameters is not even close to being complete, it's still impressive how many parameters influence a network's accuracy.
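To make this concrete, such a search space can be written down as a plain dictionary; the parameter names and value ranges below are illustrative assumptions, not values recommended by the article.

```python
# A hypothetical search space covering the parameters listed above.
# All names and candidate values here are illustrative assumptions.
search_space = {
    "n_layers": [1, 2, 3],
    "n_hidden_units": [32, 64, 128, 256],
    "activation": ["relu", "tanh", "sigmoid"],
    "init_method": ["xavier", "he", "normal"],
    "learning_rate": [0.1, 0.01, 0.001],
    "loss": ["cross_entropy", "mse"],
}
```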
Hyperparameter optimization
In this article, I would like to show a few different hyperparameter selection methods.
- Grid Search
- Random Search
- Hand-tuning
- Gaussian Process with expected improvement
- Tree-structured Parzen Estimators (TPE)
Grid Search
The simplest algorithm that we can use for hyperparameter optimization is Grid Search. The idea is simple and straightforward: define a set of values for each parameter, train a model for every possible parameter combination and select the best one. This method is a good choice if the model trains quickly, which isn't the case for typical neural networks.
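As a rough sketch, Grid Search is just a loop over the Cartesian product of the candidate values. Here `train_and_evaluate` is a hypothetical stand-in for whatever training and validation routine you use; it is assumed to return a validation score to maximize.

```python
from itertools import product

def grid_search(param_grid, train_and_evaluate):
    """Try every parameter combination and keep the best one."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_evaluate(**params)  # hypothetical training routine
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Example grid with two hyperparameters (illustrative values).
param_grid = {
    "n_hidden_units": [10, 50, 100],
    "learning_rate": [0.1, 0.01, 0.001],
}
```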
Imagine that we need to optimize 5 parameters. Let's assume, for simplicity, that we want to try 10 different values per parameter. Therefore, we need to make 100,000 (10^5) evaluations. Assuming that the network trains 10 minutes on average, we'll have finished hyperparameter tuning in almost 2 years. Seems crazy, right? Typically, a network trains much longer and we need to tune more hyperparameters, which means it can take forever to run Grid Search for a typical neural network. A better solution is Random Search.
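A quick sanity check of that arithmetic:

```python
# 5 parameters, 10 candidate values each, ~10 minutes per training run.
evaluations = 10 ** 5                 # 100,000 combinations
total_minutes = evaluations * 10
print(total_minutes / 60 / 24 / 365)  # ~1.9 years
```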
Random Search
The idea is similar to Grid Search, but instead of trying all possible combinations we'll just use a randomly selected subset of the parameter combinations. Instead of checking all 100,000 samples, we can check only 1,000 of them. Now it should take a week to run hyperparameter optimization instead of 2 years.
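A minimal Random Search sketch under the same assumptions as the Grid Search example above (`train_and_evaluate` is again a hypothetical stand-in):

```python
import random

def random_search(param_grid, train_and_evaluate, n_iter=1000, seed=0):
    """Evaluate a random subset of parameter combinations instead of all of them."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = train_and_evaluate(**params)  # hypothetical training routine
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```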
Let's sample two-dimensional data points from a uniform distribution.
If there are not enough data points, random sampling doesn't fully cover the parameter space. This can be seen in the figure above: there are some regions that don't have any data points. In addition, it samples some points very close to each other, which is redundant for our purposes. We can solve this problem with low-discrepancy sequences (also called quasi-random sequences).
There are many different techniques for generating quasi-random sequences:
- Sobol sequence
- Hammersley Set
- Halton sequence
- Poisson Disk Sampling
Let's compare some of the mentioned methods with the previously sampled random data points.
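One way to reproduce this kind of comparison is with NumPy for the uniform samples and `scipy.stats.qmc` for the quasi-random sequences. This is only a sketch: the Hammersley set and Poisson disk sampling are left out, and the discrepancy values simply quantify how evenly each sample covers the unit square.

```python
import numpy as np
from scipy.stats import qmc

n_points = 128

# Plain uniform random samples in the unit square.
uniform = np.random.default_rng(0).uniform(size=(n_points, 2))

# Low-discrepancy (quasi-random) sequences over the same space.
sobol = qmc.Sobol(d=2, scramble=True, seed=0).random(n_points)
halton = qmc.Halton(d=2, seed=0).random(n_points)

# Lower discrepancy means more even coverage of the parameter space.
for name, sample in [("uniform", uniform), ("sobol", sobol), ("halton", halton)]:
    print(name, qmc.discrepancy(sample))
```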
As we can see, the sampled points now spread out through the parameter space more uniformly. One disadvantage of these methods is that not all of them provide good results in higher dimensions. For instance, the Halton sequence and the Hammersley set do not work well for dimensions bigger than 10 [7].
Even though we improved the hyperparameter optimization algorithm, it is still not suitable for large neural networks.
But before we move on to more complicated methods, I want to focus on parameter hand-tuning.
Hand-tuning
Let's start with an example. Imagine that we want to select the best number of units in the hidden layer (we set up just one hyperparameter for simplicity). The simplest thing is to try different values and select the best one. Let's say we set up ten units for the hidden layer and train the network. After the training, we check the accuracy on the validation dataset and it turns out that we classified 65% of the samples correctly.
The accuracy is low, so it's intuitive to think that we need more units in the hidden layer. Let's increase the number of units and check the improvement. But by how much should we increase the number of units? Would small changes make a significant effect on the prediction accuracy? Would it be a good step to set the number of hidden units to 12? Probably not. So let's go further and explore parameters from the next order of magnitude. We can set the number of hidden units to 100.
For 100 hidden units, we got prediction accuracy equal to 82%, which is a great improvement compared to 65%. Points in the figure above show us that by increasing the number of hidden units we increase the accuracy. We can proceed using the same strategy and train the network with 200 hidden units.
After the third iteration, our prediction accuracy is 84%. We've increased the number of units by a factor of two and got only a 2% improvement.
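The manual procedure boils down to a short loop that a person runs one step at a time, inspecting each result before choosing the next value. Here is a sketch, with `train_and_evaluate` once more standing in for the real training code; the unit counts are the ones tried in this example.

```python
def hand_tune(train_and_evaluate, candidates=(10, 100, 200)):
    """Try a few hand-picked values, looking at each result before the next one."""
    results = {}
    for n_units in candidates:
        # In the example above these runs gave roughly 65%, 82% and 84% accuracy,
        # so doubling the units from 100 to 200 bought only ~2 percentage points.
        results[n_units] = train_and_evaluate(n_hidden_units=n_units)
    return results
```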
We can keep going, but I think judging by this example it's clear that a human can select parameters better than Grid Search or Random Search algorithms. The main reason why is that we are able to learn from our previous mistakes. After each iteration, we memorize and analyze our previous results. This information gives us a much better way of selecting the next set of parameters. And even more than that: the more you work with neural networks, the better intuition you develop for what to use and when.
Nevertheless, let's get back to our optimization problem. How can we automate the process described above? One way of doing this is to apply Bayesian optimization.
Other Reference
[Topic Discussion] Hyperparameter Optimization for Neural Networks