Call for Datasets: OpenML 2023 Benchmark Suites

Algorithm benchmarks serve as a beacon for machine learning research. They allow us, as a community, to track progress over time, identify challenging problems, raise the bar, and learn how to do better. The OpenML.org platform already serves thousands of datasets together with tasks (the combination of a dataset with a target attribute, a performance metric, and an evaluation procedure such as holdout or cross-validation [see here]) in a machine-readable way. OpenML is also integrated into many machine learning libraries, so that fine details about machine learning models (or pipelines) and their performance evaluations can be collected automatically. This integration allows experiments to be automatically shared and organized on the platform, linked to the underlying datasets and tasks. A first OpenML benchmark suite that has been used extensively in scientific papers is the OpenML-CC18, consisting of 72 classification datasets.
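
As a side note, benchmark suites such as the OpenML-CC18 can be retrieved programmatically. A minimal sketch using the openml Python package (assuming it is installed, e.g. via pip install openml):

    # Minimal sketch: retrieve the OpenML-CC18 suite and inspect its tasks.
    import openml

    suite = openml.study.get_suite("OpenML-CC18")  # alias of the suite on OpenML.org
    print(len(suite.tasks), "tasks in the suite")

    for task_id in suite.tasks[:3]:  # look at the first few tasks
        task = openml.tasks.get_task(task_id)
        dataset = task.get_dataset()
        print(task_id, dataset.name, task.target_name)

Because each task bundles the dataset with its target attribute and evaluation procedure, experiments run on a suite's tasks are directly comparable across studies.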

To further advance benchmarking in machine learning research, we are opening a call for additional classification and regression datasets, to be included in a new benchmark suite: the OpenML curated classification and regression 2023 benchmark suite.

Guidelines:

  • The dataset should comply with the FAIR principles; in particular, license information should be clear and should ensure that the dataset can be used for scientific purposes.
  • The dataset should facilitate a supervised classification or regression task, and have a (preferably peer-reviewed) source where the collection procedure, the dataset and its features, and the exact task are explained.
  • The dataset should be in a machine-learning compliant format, i.e., a single table with a clearly annotated target attribute (see the sketch after this list).
  • The data should be independent and identically distributed (i.i.d.). Specifically, the dataset should not have time dependencies or grouping, and should not require a specific train-test split.
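
To make the last two guidelines concrete, here is a hypothetical example of a compliant single-table dataset (all column names and values are illustrative):

    # Hypothetical single-table dataset: each row is one i.i.d. example and the
    # target attribute ("class") is an ordinary, clearly named column.
    import pandas as pd

    df = pd.DataFrame(
        {
            "sepal_length": [5.1, 4.9, 6.3],
            "sepal_width": [3.5, 3.0, 3.3],
            "class": ["setosa", "setosa", "virginica"],  # target attribute
        }
    )
    # No timestamp, group identifier, or predefined train/test split column:
    # the rows are assumed to be independent and identically distributed.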

In return for submitting a dataset to be considered for the OpenML benchmark suite:

  • The dataset will be reviewed according to the aforementioned guidelines, and a clear decision will be communicated as to why the dataset is or is not suitable for the benchmark suite.
  • The citation information on OpenML will be made programmatically available to everyone using the benchmark suite, likely resulting in many citations.
  • If accepted, the dataset will likely be used across many machine learning studies, resulting in a wide range of advanced models being run on it.

Submission format: Please upload your dataset in the right format to OpenML [how to upload a dataset] or provide a link to the dataset (in which case we will try to upload it), and please register it in the following form (so we can assess it and get back to you with feedback); a sketch of a programmatic upload follows below. If you have any questions, please contact: jvrijn@liacs.nl
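
For datasets already available as a single table, the upload can also be done programmatically with the openml Python package. A minimal sketch, in which every metadata value is an illustrative placeholder (publishing additionally requires an OpenML account and API key):

    # Sketch of a programmatic upload; all metadata values are placeholders.
    import openml
    import pandas as pd

    openml.config.apikey = "..."  # API key of your OpenML account
    df = pd.read_csv("my_dataset.csv")  # hypothetical single-table dataset

    dataset = openml.datasets.create_dataset(
        name="my-dataset",
        description="Collection procedure, features, and task as described in <source>.",
        creator="Jane Doe",
        contributor=None,
        collection_date="2023",
        language="English",
        licence="CC BY 4.0",  # clear license, usable for scientific purposes
        attributes="auto",    # infer attribute types from the DataFrame
        data=df,
        default_target_attribute="class",
        ignore_attribute=None,
        citation="<citation of the peer-reviewed source>",
    )
    dataset.publish()
    print(dataset.openml_url)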
