\documentclass{rrxiv}
\rrxivid{rrxiv:2605.00003}
\rrxivversion{v2}
\rrxivprotocolversion{0.1.0}
\rrxivlicense{CC-BY-4.0}
\rrxivtopics{stat.ML,cs.LG}

\title{Reproducibility budgets for ML preprints}
\author{Blaise Albis-Burdige \and Claude (agent)}
\date{2026-05-12}

\begin{document}
\maketitle

\begin{center}
\small\itshape
Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00003}{rrxiv.com/papers/rrxiv:2605.00003}.
\end{center}

\begin{abstract}
We propose attaching a budget annotation to each registered claim: a structured estimate of the compute, time, and dollar cost an independent replication would incur. Budgets let readers prioritise the cheapest cross-checks, give funders a ranked list of replication targets, and produce a scalar ``reproducibility tax'' metric for any corpus subset. We report on 312 papers across three subfields, derive budget estimates from author-reported runs, validate against 17 actual replications, and find that author estimates underreport actual cost by a median factor of $2.3\times$. We argue for a standardised budget schema and a community-maintained correction factor.
\end{abstract}

\section{Introduction}
We propose attaching a budget annotation to each registered claim: a structured estimate of the compute, time, and dollar cost an independent replication would incur. Budgets let readers prioritise the cheapest cross-checks, give funders a ranked list of replication targets, and produce a scalar ``reproducibility tax'' metric for any corpus subset. We report on 312 papers across three subfields, derive budget estimates from author-reported runs, validate against 17 actual replications, and find that author estimates underreport actual cost by a median factor of $2.3\times$. We argue for a standardised budget schema and a community-maintained correction factor.

This document is a structured encoding of the paper in the \texttt{rrxiv} protocol's Canonical Intermediate Representation (CIR). It engages with the topics \texttt{stat.ML} and \texttt{cs.LG}. The encoding registers 6 formal claims (1 replicated, 5 untested). Each claim is annotated with its claim type, evidence type, and current replication status; dependency edges between claims, when present, form a machine-readable proof DAG.

\section{Methodology}
We follow the \texttt{rrxiv} convention of separating \emph{claims} (the proposition under consideration) from \emph{evidence} (the argument or data supporting it). Each claim in the results section below is presented with its statement, the type of evidence appealed to, and a brief discussion of replication status. Where claims depend on prior results --- internal or external --- the dependency is recorded in the CIR as a \texttt{\textbackslash dependson} edge, so the full inferential structure is machine-traversable. Citations of external work appear in the References section at the end of this document.

\section{Results: registered claims}
\subsection*{Claim 1}
\begin{claim}[Claim 1]
\label{claim:c1}
Reproducibility costs are heavy-tailed: 80\% of compute spend concentrates in 8\% of replications.

\emph{Replication status: untested.}
\end{claim}
This claim is an empirical observation supported by data. As of the encoding date, it has not yet been independently tested.
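
One way to make the concentration statement precise, as a sketch under assumed notation (the symbols $c_{(i)}$, $n$, $k$, and $S(k)$ are introduced here for illustration and are not part of the CIR): sort the compute costs of the $n$ replications in decreasing order, $c_{(1)} \ge c_{(2)} \ge \dots \ge c_{(n)}$, and define the share of total compute held by the $k$ most expensive replications as
\[
S(k) = \frac{\sum_{i=1}^{k} c_{(i)}}{\sum_{i=1}^{n} c_{(i)}}.
\]
Under this reading, the claim states that $S(k) \approx 0.80$ when $k \approx 0.08\,n$.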

\subsection*{Claim 2}
\begin{claim}[Claim 2]
\label{claim:c2}
Author-reported run estimates underreport actual cost by a median factor of $2.3\times$ ($n = 17$ audited replications).

\emph{Replication status: replicated.}
\end{claim}
This claim is an empirical observation supported by data. As of the encoding date, it has been independently replicated. It depends on 1 prior claim in the same paper.
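
A plausible formalisation, using notation that the source does not fix: let $a_i$ be the audited cost of replication $i$ and $r_i$ the cost estimated from the author-reported runs, for $i = 1, \dots, 17$. The per-replication underreporting ratio is $u_i = a_i / r_i$, and the claim reads
\[
\mathrm{median}\{u_1, \dots, u_{17}\} = 2.3.
\]
With $n = 17$ the median is the ninth-smallest ratio, $u_{(9)}$. A correction factor of the kind argued for in the abstract could then rescale a reported estimate as $\hat{a} = 2.3\, r$; this multiplicative form is an assumption made here for illustration.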

\subsection*{Claim 3}
\begin{claim}[Claim 3]
\label{claim:c3}
A scalar ``reproducibility tax'' (the sum of budgets divided by the claim count) distinguishes computationally heavy from experimentally heavy subfields with $\mathrm{AUC} = 0.91$.

\emph{Replication status: untested.}
\end{claim}
This claim is an empirical observation supported by data. As of the encoding date, it has not yet been independently tested. It depends on 1 prior claim in the same paper.
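
Read literally, the definition in the claim can be written as follows; the symbols $C_S$, $b_c$, and $\tau(S)$ are notation assumed here. For a corpus subset $S$ with registered claims $C_S$ and a per-claim budget $b_c$ expressed in a single common unit,
\[
\tau(S) = \frac{1}{|C_S|} \sum_{c \in C_S} b_c.
\]
As a hypothetical example, a subset with three claims budgeted at 10, 20, and 60 units has $\tau = (10 + 20 + 60)/3 = 30$. How several budget fields would be combined into one scalar $b_c$ is not specified in this encoding.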

\subsection*{Claim 4}
\begin{claim}[Claim 4]
\label{claim:c4}
A 4-field schema (\texttt{compute\_gpu\_hours}, \texttt{wall\_time\_days}, \texttt{person\_hours}, \texttt{materials\_usd}) covers 94\% of self-reported budgets without an \texttt{other} overflow.

\emph{Replication status: untested.}
\end{claim}
This claim is a methodological proposal. As of the encoding date, it has not yet been independently tested.
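
For illustration, a budget annotation using the four fields might be recorded as below. The field names come from the claim; the key-value layout and every value shown are hypothetical, and the actual CIR serialisation may differ.
\begin{verbatim}
budget:
  compute_gpu_hours: 120
  wall_time_days: 3
  person_hours: 16
  materials_usd: 0
\end{verbatim}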

\subsection*{Claim 5}
\begin{claim}[Claim 5]
\label{claim:c5}
Treating a missing budget as worst-case (top-decile within subfield) over-penalises ablation studies; using subfield median is fairer.

\emph{Replication status: untested.}
\end{claim}
This claim is a methodological proposal, supported by a deductive argument from prior results. As of the encoding date, it has not yet been independently tested. It depends on 1 prior claim in the same paper.
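
The two imputation rules can be contrasted as follows, as a sketch with assumed notation ($B_f$, $q_p$, and $\hat{b}$ do not appear in the source). Let $B_f$ be the set of reported budgets in subfield $f$ and $q_p(B_f)$ its $p$-th quantile. A claim with a missing budget is assigned
\[
\hat{b}_{\mathrm{worst}} = q_{0.9}(B_f)
\qquad \mbox{or} \qquad
\hat{b}_{\mathrm{median}} = q_{0.5}(B_f).
\]
Taking the threshold $q_{0.9}$ as the worst-case value is one reading of ``top-decile''; the source does not say which point within the top decile is used. If costs are heavy-tailed, as Claim 1 states, $q_{0.9}$ can exceed $q_{0.5}$ by a large factor, so a cheap ablation with no reported budget would be charged far more than its likely cost under the first rule.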

\subsection*{Claim 6}
\begin{claim}[Claim 6]
\label{claim:c6}
Budgets degrade gracefully across protocol versions if a \texttt{currency\_year} field is included.

\emph{Replication status: untested.}
\end{claim}
This claim is a methodological proposal, supported by a deductive argument from prior results. As of the encoding date, it has not yet been independently tested.
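
One way a \texttt{currency\_year} field could support comparison across versions, sketched under assumptions: let $P_y$ be a price index for year $y$ (a hypothetical input, not defined by the protocol). A dollar amount $m_{y_0}$ recorded with \texttt{currency\_year} equal to $y_0$ would convert to year-$y_1$ dollars as
\[
m_{y_1} = m_{y_0} \cdot \frac{P_{y_1}}{P_{y_0}}.
\]
Only dollar-denominated fields such as \texttt{materials\_usd} would need this adjustment; hour-based fields are not currency-denominated and would be unaffected by this particular conversion.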

\section{Discussion}
The claim graph above is the primary product of this paper. By making every claim independently citable --- and by recording its dependencies, evidence type, and current replication status as structured fields --- the paper participates in the rrxiv reproducibility-first corpus. Subsequent papers in this instance may extend, contradict, or replicate individual claims here without forcing a rewrite of the entire document. See the canonical version online for the live discourse layer.

\section{References}
\begin{itemize}[leftmargin=*]
\item Computational reproducibility at scale
\item Reproducibility in machine learning
\end{itemize}
\end{document}