Reinforcement Learning (RL) research is still finding its footing when it comes to HPO: many papers rely on grid search even though there are specific tools to assist in HPO for RL [Parker-Holder et al. 2022]. We give an overview of how to integrate existing HPO best practices [Bischl et al. 2022] with the scientific standards of the RL community in our recipe for reproducible RL research with HPO:
1. Define Training and Test Settings
Some RL environments have explicit train and test distributions, but “setting” here encompasses more: seeds for the agent, the initial state distribution, and non-determinism within the environment should all be chosen explicitly for both the train and the test setting. This way we can develop and tune our method on the train setting and still get valid results from the test setting without overfitting to it. Some guidance on the selection of seeds can be found in Henderson et al. 2018 and Agarwal et al. 2021.
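As a minimal sketch of what fixing both settings up front could look like, the snippet below stores them in a small data structure and checks that train and test seeds never overlap. The class name, field names, seed ranges, and seed counts are all assumptions made for illustration and are not taken from any particular library or paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Setting:
    """Everything that defines one setting (all field names are illustrative)."""
    name: str
    agent_seeds: tuple  # seeds for network initialization, exploration noise, ...
    env_seeds: tuple    # seeds controlling initial states and environment randomness
    env_kwargs: dict = field(default_factory=dict)  # e.g. which task variant to use


def make_settings(n_train_seeds=5, n_test_seeds=10):
    """Build a train and a test setting whose seeds are disjoint."""
    train = Setting(
        name="train",
        agent_seeds=tuple(range(0, n_train_seeds)),
        env_seeds=tuple(range(100, 100 + n_train_seeds)),
    )
    test = Setting(
        name="test",
        agent_seeds=tuple(range(1000, 1000 + n_test_seeds)),
        env_seeds=tuple(range(2000, 2000 + n_test_seeds)),
    )
    # Fail loudly if tuning and reporting would share any seed.
    if set(train.agent_seeds) & set(test.agent_seeds):
        raise ValueError("train and test agent seeds overlap")
    if set(train.env_seeds) & set(test.env_seeds):
        raise ValueError("train and test environment seeds overlap")
    return train, test
```

Writing the settings down as data before any tuning starts makes it harder to accidentally reuse a tuning seed when reporting final results.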
2. Define an HPO Configuration Space
The hyperparameter configuration space should contain all hyperparameters that are likely to contribute to the success of the training run as well as ranges that likely include the optimal hyperparameter value. As many hyperparameters tend to be impactful in RL algorithms and their importance varies across environments [Eimer et al. 2023], the default choice without domain knowledge should be to optimize all hyperparameters with wide ranges.
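To make this concrete, here is a hedged sketch of a wide configuration space written as a plain dictionary, together with a sampler that draws scale-like hyperparameters on a log scale. The hyperparameter names and ranges are illustrative assumptions for a generic policy-gradient agent, not recommended values; in practice you would express the same space in the format your HPO tool expects.

```python
import math
import random

# Illustrative space: (type, lower, upper) or ("choice", options).
# Names and ranges are assumptions, not tuned recommendations.
CONFIG_SPACE = {
    "learning_rate": ("log_float", 1e-6, 1e-1),
    "gamma": ("float", 0.8, 0.9999),
    "entropy_coef": ("float", 0.0, 0.5),
    "n_steps": ("log_int", 16, 4096),
    "batch_size": ("choice", [32, 64, 128, 256, 512]),
}


def sample_configuration(space, rng):
    """Draw one configuration uniformly (or log-uniformly) from the space."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "choice":
            config[name] = rng.choice(spec[1])
            continue
        low, high = spec[1], spec[2]
        if kind == "float":
            value = rng.uniform(low, high)
        elif kind in ("log_float", "log_int"):
            # Sample uniformly in log space so every order of magnitude is covered.
            value = math.exp(rng.uniform(math.log(low), math.log(high)))
            if kind == "log_int":
                value = round(value)
        else:
            raise ValueError(f"unknown hyperparameter type: {kind}")
        # Clamp to guard against floating-point rounding at the boundaries.
        config[name] = min(max(value, low), high)
    return config
```

Calling `sample_configuration(CONFIG_SPACE, random.Random(0))` returns one dictionary with a value for every hyperparameter, which is also a quick sanity check that the ranges are what you intended.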
3. Select an HPO Method
There is a wide range of methods available, including gradient-based methods, RL-specific optimizers and classic black-box approaches. The choice depends on your configuration space as well as your compute budget. When in doubt, it is best to choose an established optimizer with a small number of optimizer settings that need to be configured. Configuring an optimizer usually requires knowledge of both the optimizer and the problem domain, so an optimizer whose settings are well tested on a variety of domains and need little adjustment is a good idea if you are unsure where to start. Good choices with robust default settings include our own optimizers DEHB and SMAC.
4. Define the HPO Constraints
Do you only have a certain compute budget available? Do you want to tune until a certain performance threshold is reached? Maybe you have a time limit but can parallelize heavily? You should define all of these limits beforehand and apply them to all your tuning runs.
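One way to keep these limits identical across tuning runs is to write them down once and check them in a single place. The sketch below assumes three kinds of limits (a trial count, a wall-clock limit, and an optional target cost); the class and function names are hypothetical, and costs are assumed to be minimized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HPOBudget:
    """Limits shared by every tuning run (field names are illustrative)."""
    max_trials: int
    max_seconds: float
    target_cost: float = float("-inf")  # stop early once the cost is this low


def should_stop(budget, n_trials, elapsed_seconds, best_cost):
    """Return True as soon as any of the predefined limits is reached."""
    return (
        n_trials >= budget.max_trials
        or elapsed_seconds >= budget.max_seconds
        or best_cost <= budget.target_cost
    )
```

Passing the same `HPOBudget` instance to the tuning loop of every method makes it explicit that all of them were given the same resources.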
5. Choose a Cost Metric
You will need to judge how well a given configuration is performing, so you should define a metric that captures your use case well. Perhaps you care mostly about transfer performance to a validation environment and will use the evaluation reward over a few episodes on this environment as your cost. Maybe you’re interested in sample efficiency, in which case the area under the curve (AUC) of the training return would be a better choice. Whatever fits your setting, your cost metric should be as consistent as possible in order to make comparisons easy. You can improve the reliability of your cost metric by using multiple seeds or episodes whenever possible.
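As an illustration of the sample-efficiency case, the following sketch computes the AUC of a learning curve with the trapezoidal rule, normalizes it by the number of environment steps, and averages it over several seeds. The function names are hypothetical, and the sign is flipped under the assumption that the HPO tool minimizes its cost.

```python
from statistics import mean


def return_auc(steps, returns):
    """Trapezoidal area under one learning curve, normalized by the step range."""
    if len(steps) != len(returns) or len(steps) < 2:
        raise ValueError("need at least two (step, return) pairs of equal length")
    area = sum(
        (steps[i + 1] - steps[i]) * (returns[i] + returns[i + 1]) / 2
        for i in range(len(steps) - 1)
    )
    return area / (steps[-1] - steps[0])


def cost_from_curves(steps, curves_per_seed):
    """Average the AUC over seeds and negate it so that lower cost is better."""
    return -mean(return_auc(steps, curve) for curve in curves_per_seed)
```

Normalizing by the step range keeps the cost on the same scale as the return itself, which makes values easier to compare across runs of different length.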
6. Run the HPO for All Reported Methods
It is important to tune not only for a single method but for all of them wherever possible – this way we can make our comparisons as fair as possible. Therefore we tune all baselines and other methods we want to report on using the training setting and HPO procedure outlined so far.
7. Evaluate the Result
After the HPO is finished, we are almost done, but the most important part comes now. Using the results from HPO, we run all baselines and methods on the test setting; usually this means re-training a policy using the test seeds, environments and so on. These are the results we report in our findings. Since we gave all methods the same budget, a comparable configuration space, and the same settings for tuning and testing, we should have reliable results and fair comparisons at the end.
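The final evaluation can be sketched as a loop over methods and test seeds. Here `train_and_evaluate` is a hypothetical callable that you would supply: it is assumed to re-train a policy for the given method, configuration, and seed on the test setting and return a single score. The mean and standard deviation are used only as a simple summary; the seed-selection references above discuss more robust aggregate statistics.

```python
from statistics import mean, stdev


def evaluate_on_test(best_configs, test_seeds, train_and_evaluate):
    """Re-train every method with its tuned configuration on each test seed.

    best_configs maps a method name to the configuration found by HPO.
    train_and_evaluate(method, config, seed) is user-supplied (hypothetical).
    """
    report = {}
    for method, config in best_configs.items():
        scores = [train_and_evaluate(method, config, seed) for seed in test_seeds]
        report[method] = {
            "mean": mean(scores),
            "std": stdev(scores) if len(scores) > 1 else 0.0,
            "n_seeds": len(scores),
            "scores": scores,  # keep raw scores so others can re-analyze them
        }
    return report
```

Keeping the raw per-seed scores alongside the summary makes it possible to recompute other statistics later without re-running any training.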
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D.: Deep reinforcement learning that matters. Proceedings of the Conference on Artificial Intelligence, AAAI’18.
Agarwal, R., Schwarzer, M., Castro, P., Courville, A., and Bellemare, M.: Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, NeurIPS’21.
Eimer, T., Lindauer, M., and Raileanu, R.: Hyperparameters in reinforcement learning and how to tune them. Proceedings of the Fortieth International Conference on Machine Learning, ICML’23.