

Updated: May 13, 2022


The field of Artificial Intelligence (AI) and Machine Learning (ML) has seen immense growth over the last 20 years. At the same time, interest in Deep Learning (DL) has increased substantially as well, as demonstrated by Google Trends. While such progress is remarkable, rapid growth comes at a cost. Akin to concerns in other disciplines, several authors have noted issues with reproducibility, a lack of significance testing, and published results not carrying over to different experimental setups, for instance in NLP, Reinforcement Learning, and optimization. Others have questioned commonly accepted procedures. These problems have not gone unnoticed—many of the works mentioned here have proposed a cornucopia of solutions.

In a quickly moving publication environment, however, keeping track of and implementing these proposals becomes challenging. In this work, we weave many of them together into a cohesive methodology for gathering stronger experimental evidence that can be implemented with reasonable effort. Based on the scientific method, we divide the empirical research process—obtaining evidence from data via modeling—into four steps: Data, including dataset creation and usage; Codebase & Models; Experiments & Analysis; and Publication. For each step, we survey contemporary findings and summarize them into actionable practices for empirical research. While written mostly from the perspective of NLP researchers, we expect many of these insights to be useful for practitioners of adjacent sub-fields of ML and DL.
An experiment, then, is reproducible if enough information is provided to find the original evidence, even without the tooling for replicating a metric's exact value. We assume that the practitioner aims to follow these principles in order to answer a well-motivated research question by gathering the strongest possible evidence for or against their hypotheses. The following methods therefore aim to reduce uncertainty in each step of the experimental pipeline in order to ensure reproducibility and/or replicability.


Choice of Dataset: The choice of dataset arises from the need to answer a specific research question within the limits of the available resources. Such answers typically come in the form of comparisons between different experimental setups that use equivalent data and evaluation metrics. Using a publicly available, well-documented dataset will likely yield more comparable work, and thus stronger evidence. In the absence of public data, creating a new dataset according to guidelines that closely follow prior work can also allow for useful comparisons.

Simple baseline methods, such as regression analyses, or simply manually verifying random samples of the data, may provide indications regarding the suitability and difficulty of the task and the associated dataset. Documenting such checks is important, as biases can be introduced at all levels.
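As a concrete illustration, the manual checks above can be scripted. The helpers below are a minimal sketch assuming a simple classification dataset; the function names are illustrative, not taken from the original work:

```python
import random
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label.

    A model that cannot beat this number has likely learned nothing
    beyond the label distribution."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

def inspect_random_sample(examples, k=5, seed=0):
    """Draw a reproducible random sample for manual verification.

    A fixed seed means collaborators inspect the same examples."""
    rng = random.Random(seed)
    return rng.sample(examples, k)

labels = ["pos", "pos", "neg", "pos", "neg", "pos"]
print(majority_baseline_accuracy(labels))  # 4/6 ≈ 0.667
```

If the majority baseline already sits close to the reported model scores, the dataset may be too easy (or too imbalanced) to support the intended comparison.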

The results a model achieves on a given data setup should first and foremost be taken as just that. Appropriate, broader conclusions can be drawn using this evidence provided that biases or incompleteness of the data are addressed. Even with statistical tests for the significance of comparisons, properties such as the size of the dataset and the distributional characteristics of the evaluation metric may influence the statistical power of any evidence gained from experiments. Communicating the limits of the data helps future work in reproducing prior findings more accurately.
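To make the point about statistical power concrete, one can estimate by simulation how likely a given test is to detect a true effect at a given sample size. The sketch below uses a two-sample z-test on normally distributed scores as a stand-in; all names and the chosen test are illustrative assumptions, not part of the original work:

```python
import math
import random
import statistics

def estimated_power(effect, stdev, n, alpha=0.05, trials=1000, seed=0):
    """Monte-Carlo estimate of statistical power: the fraction of
    simulated experiments in which a two-sample z-test detects a true
    mean difference `effect`, with n normally distributed scores per
    group."""
    rng = random.Random(seed)
    detections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, stdev) for _ in range(n)]
        b = [rng.gauss(effect, stdev) for _ in range(n)]
        se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
        z = (statistics.mean(b) - statistics.mean(a)) / se
        # two-sided p-value under the normal approximation
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p < alpha:
            detections += 1
    return detections / trials
```

Running such a simulation before committing to an evaluation setup reveals whether the dataset size can plausibly support the planned comparisons.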


A common practice has been to open-source all components of the experimental procedure in a repository. We as a community expect such a repository to contain model implementations, pre-processing code, evaluation scripts, and detailed documentation on how to obtain claimed results using these components.

In DL, such data can be large and impractical to share. However, because results rely heavily on data, it is essential to carefully consider how one can share the data with researchers in the future. Repositories for long-term data storage backed by public institutions should be preferred.
In such cases, practitioners must instead carefully consider how to distribute data and tools so that future research can produce accurate replications of the original data. As tuning of hyperparameters is typically performed on specific parts of the dataset, it is important to note that any modeling decisions based on those parts automatically invalidate their use as test data, since the reported results would no longer reflect unseen data.
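A deterministic split helper makes this discipline easy to uphold: the test portion is carved out once, with a fixed seed, and never consulted during tuning. A minimal sketch (function name and split fractions are illustrative):

```python
import random

def split_dataset(examples, dev_frac=0.1, test_frac=0.1, seed=42):
    """Deterministically shuffle and split data into train/dev/test.

    Hyperparameters must be tuned on the dev portion only; the test
    portion stays unseen until the final evaluation."""
    rng = random.Random(seed)
    shuffled = examples[:]          # keep the caller's list intact
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test
```

Publishing the seed and fractions alongside the code lets others reconstruct the exact same partition.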

It should be emphasized that distributing model weights should always complement, not replace, a well-documented repository, as libraries and hosting sites might not be supported in the future.

Model Evaluation

With respect to models and tasks, the exact evaluation procedure can differ greatly. It is important to either reference the exact evaluation script used or, at a minimum, include the evaluation script in the codebase.
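One lightweight way to tie reported results to the exact evaluation script is to record a cryptographic hash of the script alongside the scores. A minimal sketch, assuming a local file path (the helper name is illustrative):

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    """Return the SHA-256 hex digest of a file, so reported results
    can be tied to the exact version of the evaluation script used."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

Including this digest in a results table or an experiment log makes it immediately visible when two papers evaluated with different script versions.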


We outlined in the introduction how issues with replicability and significance of results have been raised in the ML literature. For model training, it is therefore advisable to set a random seed for replicability, and to train multiple initializations per model in order to obtain a sufficient sample size for later statistical tests. We further recommend varying as many sources of randomness in the training procedure as possible to obtain a closer approximation of the true model performance.
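A minimal sketch of this practice: run the same configuration under several seeds and report the mean and standard deviation rather than a single score. Here `train_and_evaluate` is a stand-in for a real training run, not a function from the original work:

```python
import random
import statistics

def train_and_evaluate(seed):
    """Stand-in for a real training run: in practice this would seed
    the framework, train the model, and return a dev-set score."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.01)

# Train the same configuration under several seeds ...
scores = [train_and_evaluate(seed) for seed in range(5)]

# ... and report aggregate statistics, not a single best run.
print(f"mean={statistics.mean(scores):.3f} "
      f"stdev={statistics.stdev(scores):.3f} (n={len(scores)})")
```

The resulting list of per-seed scores is exactly the sample that later significance tests operate on.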

Significance Testing

Especially with deep neural networks, performance can be influenced by a number of factors even with a fixed set of hyperparameters. First, the size of the dataset should support sufficiently powered statistical analyses. Second, an appropriate significance test should be chosen: parametric tests are designed with a specific distribution for the test statistic in mind and have strong statistical power, but their assumptions must hold for the data at hand. With the necessary tools in place, we can then return to carefully answering the original research questions.
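When parametric assumptions are doubtful, a nonparametric alternative such as a permutation test over per-seed scores can be used instead. The sketch below is one such illustration, not a procedure prescribed by the original work:

```python
import random
import statistics

def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Repeatedly reshuffles the pooled scores into two groups and counts
    how often the shuffled difference is at least as large as the
    observed one; the returned fraction approximates the p-value."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(scores_a) - statistics.mean(scores_b))
    pooled = scores_a + scores_b
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) -
                   statistics.mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations
```

Because it makes no distributional assumptions about the score metric, such a test pairs naturally with the small per-seed samples typical of DL experiments, at the cost of some statistical power.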


To reduce the risks of a reproducibility crisis and unreliable research findings (Ioannidis, 2005), experimental rigor is imperative. While necessarily incomplete, this paper aims to provide a rich toolbox of actionable recommendations for each research step, along with a reflection on and summary of the ongoing broader discussion.


Following all the prior considerations, the publication step of a research project allows the findings to spread across the scientific community.

The widespread adoption of DL and its increasing use of human-produced data mean that the outcomes of experiments and applications have direct effects on the lives of individuals. From the NLP perspective, this concerns in particular the social impact research may have beyond the more widely explored privacy issues.


