An introduction to Bayesian optimization for hyperparameter tuning
Introduction
Using scikit-optimize [1], [2], this post describes the steps required to set up a Bayesian optimization of the hyperparameters of a classification model.
Optimization
In broad terms, optimization consists of finding a set of parameters that minimizes a cost function. This cost function is often too expensive to evaluate repeatedly, hence the need to reach the optimum in as few evaluations as possible.
Bayesian optimization is one of the approaches available to keep the number of evaluations of the cost function low. Since the cost function is expensive to evaluate, the idea is to build an estimate of the cost function (with a much smaller evaluation time) and rely on this estimate to find the global minimum.
To summarize, the cost function f has the following properties:
- It is a black box over a complex process
- It is costly to evaluate
- Evaluation of the function might suffer from noise
We are looking for the set of parameters X’ that minimizes f.
Testing every possible combination of parameters against the cost function is not feasible, so we will use Bayesian optimization to find X’.
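To make these properties concrete, here is a minimal Python sketch of such a cost function; the function body, the sleep standing in for the expensive computation, and the noise level are illustrative placeholders (the shape is the one used in the scikit-optimize tutorial [1]):

```python
import time
import numpy as np

def f(x):
    """A stand-in for an expensive, noisy black-box cost function."""
    time.sleep(1)  # placeholder for a costly process (simulation, training run, ...)
    return np.sin(5 * x) * (1 - np.tanh(x ** 2)) + np.random.normal(0, 0.1)
```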
Bayesian Optimization
Bayesian optimization can be described by the following steps:
1. Fitting the surrogate
Given a set of observations (evaluations of the cost function at certain points), a model is built to predict the cost function f at any new point. This model is often called the “surrogate” in the literature.
One of the most common surrogate models is the Gaussian process regressor. This model has the great advantage that its prediction is probabilistic, which helps the acquisition function (step 2) select the region of interest. Other options are available but will not be discussed here.
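As a minimal sketch of this step, the surrogate below is scikit-learn’s GaussianProcessRegressor fitted on a few hypothetical observations; the kernel choice, noise level and observation values are assumptions for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical observations: evaluations of the cost function at three points.
X_obs = np.array([[-1.0], [0.2], [0.9]])
y_obs = np.array([0.5, -0.3, 0.8])

# Fit the surrogate; alpha accounts for the noise in the observations.
surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.1)
surrogate.fit(X_obs, y_obs)

# The prediction is probabilistic: a mean and a standard deviation
# (the uncertainty) at any new point.
X_new = np.linspace(-2, 2, 100).reshape(-1, 1)
mu, sigma = surrogate.predict(X_new, return_std=True)
```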
2. Selecting the next point to evaluate
The surrogate is then used to select the next set of parameters to evaluate on the true cost function f. To do so, we use a so-called “acquisition function”, whose role is to pick the next point through a compromise between exploitation and exploration (we want to minimize f, but we might not yet know enough about it).
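A common choice of acquisition function (and the one used in the full example below) is expected improvement. Here is a small NumPy sketch of it for a minimization problem, reusing the surrogate predictions mu and sigma from the previous step; the exploration parameter xi is an illustrative default:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.01):
    """Expected improvement over the best observation so far (minimization).

    Large where the surrogate mean is low (exploitation) and where the
    surrogate is uncertain (exploration); xi shifts the trade-off.
    """
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (y_best - mu - xi) / sigma
        ei = (y_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)
        ei[sigma == 0.0] = 0.0  # no uncertainty, no expected improvement
    return ei

# The next point to evaluate is the one maximizing the acquisition function:
ei = expected_improvement(mu, sigma, y_best=y_obs.min())
x_next = X_new[np.argmax(ei)]
```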
3. Evaluating the cost function
The cost function f is then evaluated at the point selected by the acquisition function, and the result is appended to the list of observations.
This whole loop is repeated until the desired number of iterations is reached or a stopping criterion is met.
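With scikit-optimize this whole loop is available as a single call, gp_minimize [1]. A minimal sketch, assuming the toy cost function f defined earlier and a search interval of [-2, 2]:

```python
from skopt import gp_minimize

res = gp_minimize(
    lambda params: f(params[0]),  # skopt passes the parameters as a list
    [(-2.0, 2.0)],                # bounds of the (here 1-D) search space
    acq_func="EI",                # expected improvement as acquisition function
    n_calls=25,                   # total number of evaluations of f
    n_initial_points=5,           # random observations used to fit the first surrogate
    noise=0.1 ** 2,               # assumed variance of the evaluation noise
    random_state=42,
)
print(res.x, res.fun)             # best parameters found and their cost
```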
Example
Let’s assume a cost function as described by the following illustration.
At first, the optimization starts with a random set of observations, which will help fit the surrogate.
Here, 3 points have been sampled. The green dotted line represents the surrogate function, and the green area the uncertainty of the surrogate at a given point. We can see that the noise in the evaluations of f already has an impact.
Based on the current state of the surrogate, the optimization now selects a point to evaluate on the actual cost function (step 2). Once evaluated, this point is added to the observations.
The point selected (around x = 0.2) makes sense, as it is close to the minimum observed so far. We can see that this is only a local minimum, but we have to take into account the lack of observations and the (quite large) noise each observation suffers from.
After a couple more iterations, the optimization has fully discovered the local minimum mentioned in the last paragraph. Going further, more exploration will be required to find the global minimum. We can also see that the noise makes it hard to model the shape of the actual function.
At last, the optimization manages to sample an observation close to the actual global minimum. This is already a good result given the level of noise.
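Plots like the ones walked through above can be reproduced with scikit-optimize’s plotting helpers, assuming the res object from the gp_minimize sketch earlier:

```python
import matplotlib.pyplot as plt
from skopt.plots import plot_convergence, plot_gaussian_process

plot_gaussian_process(res)  # surrogate, uncertainty band and observations (1-D spaces only)
plot_convergence(res)       # best cost found so far vs. number of evaluations
plt.show()
```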
Hyperparameter tuning
Often, when building a machine learning model, it is unclear how to select the hyperparameters that would optimize the model’s performance.
There are well-known and easy-to-set-up solutions to this problem, such as grid search [3] or random search [4], but they come with their own disadvantages (respectively, computation time and lack of consistency). Through Bayesian optimization, scikit-optimize proposes a search approach meant to explore the parameter space within a controllable number of iterations [5].
Model training and evaluation become the cost function. The role of the optimization is to find the set of hyperparameters that leads to the best model performance. The surrogate is then a function linking hyperparameters to model performance.
Here we are able to search through a large parameter space. This would be infeasible with a classic grid search, as the total number of combinations is far too high.
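A minimal sketch with BayesSearchCV [5], tuning an SVM classifier on the digits dataset; the estimator, the search-space bounds and the budget are illustrative choices adapted from the scikit-optimize documentation [2]:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Categorical, Real

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The search space mixes continuous (log-uniform) and categorical
# dimensions; a grid search would have to discretize the continuous ones.
opt = BayesSearchCV(
    SVC(),
    {
        "C": Real(1e-6, 1e6, prior="log-uniform"),
        "gamma": Real(1e-6, 1e1, prior="log-uniform"),
        "kernel": Categorical(["linear", "poly", "rbf"]),
    },
    n_iter=32,  # total number of models trained, i.e. cost-function evaluations
    cv=3,
    random_state=0,
)
opt.fit(X_train, y_train)
print(opt.best_params_, opt.score(X_test, y_test))
```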
Conclusion
Bayesian optimization comes in handy to optimize complex, costly processes in as few evaluations as possible. This can be applied, for example, to machine learning hyperparameter tuning through scikit-optimize.
References
[1] “Bayesian optimization with skopt — scikit-optimize 0.8.1 documentation.” https://scikit-optimize.github.io/stable/auto_examples/bayesian-optimization.html (accessed Jul. 08, 2022).
[2] “Scikit-learn hyperparameter search wrapper — scikit-optimize 0.8.1 documentation.” https://scikit-optimize.github.io/stable/auto_examples/sklearn-gridsearchcv-replacement.html (accessed Jul. 08, 2022).
[3] “sklearn.model_selection.GridSearchCV,” scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html (accessed Jul. 08, 2022).
[4] “sklearn.model_selection.RandomizedSearchCV,” scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html (accessed Jul. 08, 2022).
[5] “skopt.BayesSearchCV — scikit-optimize 0.8.1 documentation.” https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV (accessed Jul. 08, 2022).