State of Hyperparameter Selection

By Daniel Saltiel

Historically, hyperparameter determination has been a woefully neglected aspect of machine learning. With the rise of neural nets, which require more hyperparameters, more precisely tuned, than many other models, there has been a recent surge of interest in intelligent methods for selection; however, the average practitioner still seems to commonly use either default hyperparameters, grid search, random search, or (believe it or not) manual search.

For the readers who don't know, hyperparameter selection boils down to a conceptually simple problem: you have a set of variables (your hyperparameters) and an objective function (a measure of how good your model is). As you add hyperparameters, the search space of this problem explodes.
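To make this concrete, here is a minimal sketch of what such an objective function might look like. The model, the hyperparameter names, and the toy data below are illustrative choices of mine, not something taken from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for whatever problem you actually care about.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def objective(learning_rate, max_depth):
    """Train a model with the given hyperparameters and return a value to minimize."""
    model = GradientBoostingClassifier(
        learning_rate=learning_rate, max_depth=max_depth, random_state=0
    )
    # Negate accuracy so that "lower is better" matches the minimization framing.
    return -cross_val_score(model, X, y, cv=3).mean()
```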

Grid Search is (often) Stupid

One method for finding optimal hyperparameters is grid search: divide the space into even increments and test them exhaustively.
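Using the hypothetical objective function sketched above, a bare-bones grid search could look like this (the grid values are arbitrary):

```python
import itertools

learning_rates = [0.01, 0.05, 0.1, 0.2]
max_depths = [2, 3, 4, 5]

# Exhaustively evaluate every combination, keeping the best one.
best_params, best_score = None, float("inf")
for lr, depth in itertools.product(learning_rates, max_depths):
    score = objective(learning_rate=lr, max_depth=depth)
    if score < best_score:
        best_params, best_score = (lr, depth), score

print(best_params, best_score)
```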

When presented with the above plot, a human would instinctively detect the pattern present and, if looking for the lowest point, would be able to make an intelligent guess about where to begin. Most would not choose to evenly divide the space into a grid and test every point, yet this is precisely what grid search does. Humans have a fundamental intuition, just from looking at that image, that there are areas where the minimum is more likely to be. By exhaustively searching the space, you're wasting your time on the areas where the function is obviously (barring an improbable fluke) not going to be at its minimum, and ignoring any information you have from the points you already know.

Random Search Isn't Much Better

The next most common method is random search, which is exactly what it sounds like. Given the same plot, I doubt anybody would decide to pick random points. Random search isn't quite stupid: anybody who has studied statistics knows the power of randomness from techniques like bootstrapping or Monte Carlo. In fact, randomly picking parameters often outperforms grid search. Random hyperparameters can find points that grid search would either skip over (if the granularity of the grid were too coarse) or cost a tremendous amount to find (as the grid becomes finer). It can similarly outperform manual search in some situations: a human would generally focus on the domain in which they have seen the lowest points, whereas random search finds new domains neglected by intuition.
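Under the same assumptions as above, a random search with the same evaluation budget might be sketched as:

```python
import random

random.seed(0)

# Sample hyperparameters from ranges instead of a fixed grid.
best_params, best_score = None, float("inf")
for _ in range(16):  # the same budget as the 4x4 grid above
    lr = 10 ** random.uniform(-2, -0.7)  # log-uniform over roughly [0.01, 0.2]
    depth = random.randint(2, 5)
    score = objective(learning_rate=lr, max_depth=depth)
    if score < best_score:
        best_params, best_score = (lr, depth), score

print(best_params, best_score)
```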

A Different Kind of Optimization

Outside of cases where finding the absolute global minimum is required and an exhaustive search is necessary, grid search can only really reasonably be used in situations where the cost of evaluating the objective function is so low it can be considered non-existent, and even in those cases the only excuse for implementing it is laziness (i.e. it costs more to implement and run a sophisticated method than to perform grid search). In these cases grid search is preferred to random search because, with a fine enough grid, grid search is guaranteed to find a near-optimal point, whereas random search offers no such guarantee.

But what if you are in a situation where evaluating every point is too costly? Training models is often expensive, both in time and computational power, and this expense skyrockets with increased data and model complexity. What we need is an intelligent way to traverse our parameter space while searching for a minimum.

Upon first impression, this might seem like an easy problem. Finding the minimum of an objective function is pretty much all we ever do in machine learning, right? Well, here's the rub: you don't know this function. Fitting a regression, you choose your cost function. Training a neural net, you choose your activation function. If you know what these functions are, then you know their derivatives (assuming you picked differentiable functions); this means you know which direction points "down". This knowledge is fundamental to most optimization techniques, like momentum or stochastic gradient descent.
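For contrast, here is what optimization looks like when you do know the function analytically; nothing like this is available when the "function" is validation error as a function of hyperparameters:

```python
# Gradient descent on a function we know in closed form: f(x) = (x - 3)^2.
# Because we know f, we also know f'(x) = 2 * (x - 3), which tells us which way is "down".
def f_prime(x):
    return 2 * (x - 3)

x = 0.0
for _ in range(100):
    x -= 0.1 * f_prime(x)  # step downhill

print(x)  # converges toward 3, the true minimizer
```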

There are other techniques, like binary search or golden-ratio search, which don't require the derivative directly but require the knowledge that your objective is unimodal: any local, non-global minima have the potential to make this search entirely ineffective. Yet other optimization methods don't depend upon any knowledge of the underlying function (simulated annealing, coordinate descent) but require a large number of samples from the objective function to find an approximate minimum.
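As a reference point, golden-ratio (golden-section) search can be sketched as follows; note that if the objective is not unimodal, the bracketing logic can confidently shrink toward the wrong valley, which is exactly the failure mode described above.

```python
import math

def golden_section_minimize(f, a, b, tol=1e-5):
    """Minimize a unimodal f on [a, b] by repeatedly shrinking the bracket."""
    inv_phi = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, ~0.618
    while abs(b - a) > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2

print(golden_section_minimize(lambda x: (x - 3) ** 2, 0, 10))  # ~3.0
```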

So the question is: what do we do when the cost of evaluation is high? How do we intelligently guess where the minimum is likely to be in a high-dimensional space? We need a method which doesn't waste time on points where there isn't expected return, but also won't get caught in local minima.

Being Smart with Hyperparameters

So now that we know why grid search and random search are bad, the question is: what can we do better? Here we discuss one technique for intelligent hyperparameter search, known as Bayesian optimization. We'll now cover the concept of how this technique can be used to traverse hyperparameter space; there is an associated IPython Notebook which further illustrates the practice.

Bayesian Optimization on a Conceptual Level

The basic idea is this: you assume there is a smooth response surface (i.e. the curve of your objective function) connecting all your known hyperparameter-objective function points, and then you model all of these possible surfaces. Averaging over these surfaces, we acquire an expected objective function value for any given point and an associated variance (in more mathematical terms, we are modeling our response as a Gaussian process). Where this surface, accounting for its variance, dips the lowest is the point of highest "expected improvement". This is a black-box model; we need no actual knowledge of the underlying process producing the response curve.
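To make the mechanics concrete, here is a minimal sketch of the fit-then-score loop using scikit-learn's Gaussian process regressor and the standard expected-improvement formula. This is my own illustration, not the implementation used by MOE or Spearmint; the kernel, the observed points, and the candidate grid are all placeholders.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Hyperparameter values already evaluated and their objective values (to be minimized).
X_obs = np.array([[0.1], [0.5], [0.9]])
y_obs = np.array([0.8, 0.3, 0.6])

gp = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(0.2), alpha=1e-6)
gp.fit(X_obs, y_obs)

# Candidate hyperparameter values to score.
X_cand = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)

# Expected improvement over the best observation so far (minimization form).
best = y_obs.min()
z = (best - mu) / np.maximum(sigma, 1e-12)
ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

next_point = X_cand[np.argmax(ei)]  # the hyperparameter value to try next
print(next_point)
```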

This concept is illustrated clearly in Yelp's MOE documentation for the case of a single hyperparameter. On top we have the objective function response surface before we have any data points. It is flat, as we don't yet know anything. As we can see, the variance (the shaded region) is also flat. In the bottom plot we see the expected improvement (highest where the variance dips the lowest). This is also flat, so the point of highest expected improvement is effectively random.

Next we acquire a single data value, effectively pinning our expectation and collapsing its variance around a given point. Everywhere else the objective function remains unknown, but we are modeling the response surface as smooth. Additionally, you can see how the variance of our sample point can easily be incorporated into this model: the variance simply isn't pinched quite as tightly.

We can see that the objective function value for our acquired point is high (which for this example we'll say is undesirable). We pick our next point to sample as far from here as possible.

We've now "pinched" the response curve on both sides, and begin to explore the middle. However, since the lower hyperparameter value had a lower objective value, we'll favor lower hyperparameter values. The red line above shows the point of maximum expected improvement, i.e. our next point to sample.

Now that we've pinched the middle of the curve, we have a choice to make: exploration or exploitation. You can see this trade-off is made automatically in our model: where the modeled variance dips the lowest is where our highest expected improvement point lies (the one-dimensional example isn't ideal for illustrating this, but you can imagine that in more dimensions there are large unexplored domains, and a need to balance exploiting better points near the low points you have against exploring those unknown domains).

If you have the capability to carry out multiple evaluations of the response curve in parallel (i.e. can train multiple models at once), a simple approach for sampling multiple points would be to assume the expected response curve value for your current point and sample a new point based upon it. When you get the actual values back, you update your model and keep sampling.
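This trick is often called the "Kriging believer" heuristic (or, with a fixed fake value, the "constant liar"). A rough sketch, reusing the Gaussian process and expected-improvement code from the previous example:

```python
# Pick several points to evaluate in parallel by temporarily believing the GP's
# own predictions, then replacing them with real results once training finishes.
pending = []
X_fake, y_fake = X_obs.copy(), y_obs.copy()
for _ in range(3):
    gp.fit(X_fake, y_fake)
    mu, sigma = gp.predict(X_cand, return_std=True)
    z = (y_fake.min() - mu) / np.maximum(sigma, 1e-12)
    ei = (y_fake.min() - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = X_cand[np.argmax(ei)]
    pending.append(x_next)
    # Pretend the predicted mean is the observed value until the real one arrives.
    X_fake = np.vstack([X_fake, x_next.reshape(1, -1)])
    y_fake = np.append(y_fake, gp.predict(x_next.reshape(1, -1))[0])
```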

Our hyper-hyperparameter, the variance of the Gaussian process, is actually very important in determining exploration vs. exploitation. Below you can see two examples with identical expected response surfaces but different variance magnitudes (1 on the left, 5 on the right), which give different next points to sample (note that the scale of the y-axis has changed). The greater the variance is set, the more the model favors exploration; the lower it is set, the more the model favors exploitation.
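In the scikit-learn sketch above, this knob corresponds to the constant (amplitude) factor in the kernel; again, this is an illustration of mine, not MOE's interface:

```python
# The same observations under two prior variance magnitudes; optimizer=None keeps
# the kernel fixed so the chosen magnitude is not refit to the data.
gp_low = GaussianProcessRegressor(
    kernel=ConstantKernel(1.0) * RBF(0.2), optimizer=None, alpha=1e-6
).fit(X_obs, y_obs)
gp_high = GaussianProcessRegressor(
    kernel=ConstantKernel(5.0) * RBF(0.2), optimizer=None, alpha=1e-6
).fit(X_obs, y_obs)
# gp_high's larger prior variance inflates uncertainty away from the observed points,
# so expected improvement pulls the next sample toward unexplored regions (exploration);
# gp_low concentrates it near the current best points (exploitation).
```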

With Bayesian optimization, in the worst case (if you have no information yet) you get random search. As you gain information, it becomes less random and more intelligent, picking points where the maximum improvement is expected, trading off between refining minima around previously sampled points and exploring new domains.

There are prominent open-source packages which implement Bayesian optimization: the above-mentioned MOE (Metric Optimization Engine, produced by Yelp and the source of many of the pretty pictures featured above) and Spearmint (from the Harvard group HIPS). These packages are so easy to use (see the attached IPython Notebook) that there's practically no reason not to employ them on every hyperparameter search you perform (the argument that they take computing power to run themselves is valid; however, the computing cost of either is often negligible compared to that of training almost any non-toy model).

So don't waste your time looking in places which won't yield results, and don't search randomly when you can search intelligently.

A Note on Overfitting

As always, when tweaking based on a function of your data, there is a danger of overfitting. The easiest ways to avoid overfitting your hyperparameters are to either tune your hyperparameters to an in-sample metric or to keep a third data split for final validation. Additionally, regularization can always be used to bound the potential complexity of your model.
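A common way to set up that third split, with illustrative names and proportions:

```python
from sklearn.model_selection import train_test_split

# First carve off a final test set that is never touched while tuning...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into data for training and data for hyperparameter selection.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)
```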

Footnote: Animated plots of MOE exploring various objective functions to find the minimum

Acknowledgement: Scott Clark, SigOpt
