Data Availability and Reproducibility Requirements in Scientific Journals
Data availability statements and reproducibility standards have moved from optional courtesy to editorial requirement at most major journals, a shift driven by a reproducibility crisis that shook confidence in biomedicine, psychology, and the social sciences. This page covers what journals actually demand when they say "make your data available," how those requirements are enforced, where researchers most commonly run into trouble, and how editorial teams decide when exceptions apply.
Definition and scope
A data availability statement is a short, formal declaration in a published article specifying where the data underlying the findings can be accessed — whether in a public repository, as supplementary material, or on request. Reproducibility requirements extend that idea: the journal asks that any reader with appropriate expertise could, in principle, re-run the analysis and arrive at the same conclusions.
These two concepts are related but distinct. Data availability is about access — can someone get the files? Reproducibility is about completeness — are the methods, code, and materials documented thoroughly enough that re-running the work is actually possible? A paper can technically comply with a data availability policy by depositing a dataset while still being effectively irreproducible if the analysis scripts, preprocessing steps, or model parameters are missing.
The scope of these requirements varies substantially by discipline. Genomics journals have required raw sequence deposition in NCBI databases such as GenBank for decades. Medical journals following ICMJE policy require clinical trials to be registered, for example on ClinicalTrials.gov, before patient enrollment begins. In contrast, social science journals adopted formal data policies more recently, and in fields like qualitative sociology, what counts as "data" remains genuinely contested.
For a broader orientation to how journals are structured and evaluated, the Scientific Journal Authority covers the landscape across disciplines.
How it works
When a manuscript is submitted to a journal with a data availability policy, the workflow typically unfolds in four stages (the editorial screening step is sketched in code after the list):
- Statement at submission — Authors declare, during the submission process, whether data will be deposited in a public repository, available as supplementary files, available on request, or restricted due to legal or ethical constraints.
- Editorial screening — Editors check whether the stated approach is consistent with journal policy. Some journals, like those published by PLOS, require data to be deposited before peer review begins (PLOS Data Policy).
- Reviewer access — Peer reviewers may be given access to data files during the review period to verify that the analyses are reproducible. Not all journals do this, but the practice is growing.
- Post-acceptance verification — High-rigor journals, particularly in computational fields, may run independent code checks before publication. Nature journals introduced a statistics checklist in 2013 that requires explicit reporting of sample sizes, randomization, and blinding methods (Nature Reporting Standards).
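The screening step lends itself to a simple rules check. The sketch below is hypothetical: the statement categories mirror the four options listed above, but the policy table, function names, and follow-up actions are illustrative and not drawn from any particular journal's submission system.

```python
from enum import Enum

class Availability(Enum):
    PUBLIC_REPOSITORY = "public repository"
    SUPPLEMENTARY_FILES = "supplementary files"
    ON_REQUEST = "available on request"
    RESTRICTED = "legal or ethical restriction"

# Hypothetical policy table: which statement categories the journal accepts,
# and whether the statement must be accompanied by a written justification.
POLICY = {
    Availability.PUBLIC_REPOSITORY: {"accepted": True, "needs_justification": False},
    Availability.SUPPLEMENTARY_FILES: {"accepted": True, "needs_justification": False},
    Availability.ON_REQUEST: {"accepted": False, "needs_justification": True},
    Availability.RESTRICTED: {"accepted": True, "needs_justification": True},
}

def screen_statement(category: Availability, justification: str = "") -> list[str]:
    """Return the editorial follow-up actions for a submitted statement."""
    rule = POLICY[category]
    actions = []
    if not rule["accepted"]:
        actions.append("Ask authors to deposit data before peer review begins.")
    if rule["needs_justification"] and not justification.strip():
        actions.append("Request a written justification for the restriction.")
    return actions or ["Statement passes screening; proceed to peer review."]

print(screen_statement(Availability.ON_REQUEST))
```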
Repositories play a central role here. Generalist platforms like Zenodo (run by CERN) and Figshare accept almost any file type. Discipline-specific repositories, such as the Inter-university Consortium for Political and Social Research (ICPSR) for social science data and Dryad for ecology and evolutionary biology, offer metadata standards tailored to their fields. UK Research and Innovation's open research data policy and the US NIH Data Management and Sharing Policy (effective January 2023) both mandate data management plans for funded research, pushing the requirement upstream from publication into the grant stage.
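Deposits to generalist repositories can also be scripted as part of the analysis pipeline. The sketch below follows Zenodo's published REST deposition API (create a deposition, upload a file to its bucket, attach metadata); the access token, file name, and metadata values are placeholders, and individual journals or funders may require additional fields.

```python
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = "YOUR_ZENODO_ACCESS_TOKEN"  # placeholder: personal token with deposit scope

# 1. Create an empty deposition.
resp = requests.post(ZENODO_API, params={"access_token": TOKEN}, json={})
resp.raise_for_status()
deposition = resp.json()

# 2. Upload the data file to the deposition's file bucket.
bucket_url = deposition["links"]["bucket"]
with open("analysis_dataset_v1.csv", "rb") as fp:  # placeholder file name
    requests.put(f"{bucket_url}/analysis_dataset_v1.csv",
                 data=fp, params={"access_token": TOKEN}).raise_for_status()

# 3. Attach minimal metadata (title, creators, license) before publishing.
metadata = {
    "metadata": {
        "title": "Replication data for: Example Study",  # placeholder
        "upload_type": "dataset",
        "creators": [{"name": "Doe, Jane", "affiliation": "Example University"}],
        "license": "cc-by-4.0",
    }
}
requests.put(f"{ZENODO_API}/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()
# A final POST to .../actions/publish makes the record public and mints a DOI.
```

Figshare, Dryad, and ICPSR expose their own deposit interfaces; the common outcome is a persistent identifier, typically a DOI, that the data availability statement can cite.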
This connects directly to questions around open access publishing in science and the broader federal open access mandate in the US, which reshaped how publicly funded research must be shared.
Common scenarios
Three patterns cover most of the cases researchers encounter:
Scenario 1: Quantitative, computational research. This is where data availability requirements are most mature and most enforceable. A paper running a regression on a public dataset like the American Community Survey should, in most journals' view, provide the exact dataset version, all analysis code, and a reproducible environment specification (such as a containerized environment via Docker or a session info file). Journals like eLife and PLOS Computational Biology expect this level of documentation.
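The environment-specification part of that expectation can be generated automatically. Below is one minimal way to produce a session-info record in Python, capturing the interpreter, operating system, and installed package versions; the output file name is arbitrary, and a container image or lockfile would carry the same information more completely.

```python
import json
import platform
import sys
from importlib import metadata

# Record the interpreter, OS, and installed package versions so a reader can
# rebuild, or at least diagnose differences from, the original environment.
session_info = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    ),
}

with open("session_info.json", "w") as fh:
    json.dump(session_info, fh, indent=2)
```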
Scenario 2: Human subjects data. Identifiable patient or participant data cannot be deposited publicly, and journals acknowledge this. The standard workaround is depositing a de-identified or synthetic version of the dataset, specifying the conditions under which the original data can be accessed (often through a data use agreement with the originating institution), and documenting the code separately. The NIH's Genomic Data Sharing Policy offers a formal framework for controlled access with an approval process.
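De-identification is a substantive methodological step that normally follows an institution-approved protocol, so the fragment below only illustrates the mechanical part: dropping direct identifiers and replacing a participant ID with a salted hash before deposit. The column names and salt are placeholders, and removing a few columns is rarely sufficient on its own to eliminate re-identification risk.

```python
import csv
import hashlib

SALT = "project-specific-secret"  # placeholder; never included in the deposit

def pseudonymize(participant_id: str) -> str:
    """Replace a direct identifier with a stable, non-reversible code."""
    return hashlib.sha256((SALT + participant_id).encode()).hexdigest()[:12]

# Hypothetical column names: "name" and "email" are dropped entirely,
# "participant_id" is replaced with a pseudonymous code.
with open("raw_responses.csv", newline="") as src, \
     open("deposit_responses.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    fields = [f for f in reader.fieldnames if f not in ("name", "email")]
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in reader:
        row["participant_id"] = pseudonymize(row["participant_id"])
        writer.writerow({k: row[k] for k in fields})
```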
Scenario 3: Proprietary or commercially licensed data. Research using licensed financial data, proprietary satellite imagery, or third-party survey panels may be unable to share the underlying files. In these cases, journals typically accept a detailed methodological description, a data availability statement that names the source and explains the restriction, and sometimes a commitment to provide summary statistics.
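When the licensed files cannot leave the institution, the promised summary statistics can still be computed locally and deposited. A minimal sketch follows, assuming the restricted data are readable as a CSV file; the file paths and variable name are illustrative, and whether even summary figures may be shared depends on the license terms.

```python
import csv
import statistics

def summarize(path: str, column: str) -> dict:
    """Compute shareable summary statistics without exposing row-level data."""
    with open(path, newline="") as fh:
        values = [float(row[column]) for row in csv.DictReader(fh) if row[column]]
    return {
        "variable": column,
        "n": len(values),
        "mean": round(statistics.mean(values), 3),
        "sd": round(statistics.stdev(values), 3),
        "min": min(values),
        "max": max(values),
    }

# Summarize a licensed panel variable and write only the summary to disk.
summary = summarize("licensed_panel.csv", "household_income")
with open("shared_summary.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=list(summary))
    writer.writeheader()
    writer.writerow(summary)
```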
Decision boundaries
The phrase "data available on reasonable request" appears in the statements of a striking number of papers — and has attracted scrutiny. A 2022 analysis published in BMJ Open found that when authors who used the phrase were actually contacted, the majority either did not respond or declined to share (BMJ Open, "Availability of study data…", 2022). This gap between stated and actual availability is the central enforcement problem facing journals.
Editors draw the line — or try to — along two axes:
- Policy type: Is the journal's data policy a recommendation, a requirement, or a condition of publication? Journals like Scientific Data (Nature Portfolio) and PLOS ONE treat compliance as mandatory; many others still treat it as aspirational.
- Data sensitivity: Ethical restrictions (human subjects, endangered species location data, national security-adjacent research) are recognized exceptions in essentially all policies. Commercial restrictions are handled inconsistently and remain an active area of editorial debate.
The connection to research integrity is direct — see research ethics and publication standards and retractions and corrections in science for how reproducibility failures feed into the retraction pipeline.
References
- NIH Data Management and Sharing Policy — U.S. National Institutes of Health
- PLOS Data Availability Policy — Public Library of Science
- Nature Portfolio Reporting Standards — Springer Nature
- GenBank / NCBI — National Center for Biotechnology Information
- NIH Genomic Data Sharing Policy — U.S. National Institutes of Health
- ClinicalTrials.gov — U.S. National Library of Medicine
- Zenodo — CERN Open Science Repository
- Dryad — Dryad Digital Repository
- ICPSR — Inter-university Consortium for Political and Social Research
- UKRI Open Research Data Policy — UK Research and Innovation
- BMJ Open — "Availability of study data claimed to be available on request" — BMJ Publishing Group, 2022