Which Leakage Types Matter?
A Quantitative Landscape Across 2,047 Benchmark Datasets
The question
Everyone agrees data leakage is bad. But its severity had not been comparatively measured: not across classes, not at scale, not with pre-registered predictions. Textbooks emphasize fitting scalers on the full dataset. Competition winners worry about seed selection. Reviewers flag target encoding. Are these the same severity? The same mechanism? The answers are not what the conventional emphasis implies. Two caveats up front: predictions were registered internally, not on a public registry, and dz is standardized on paired differences, so cross-class comparisons should be read as directional, not exact.
This paper tests 29 experiments (28 core + 1 boundary study on 129 temporal datasets) across 2,047 tabular benchmark datasets from OpenML and PMLB, covering 17 scientific fields. The design is within-subject counterfactual: each dataset and model serves as its own control. Every finding replicates on a held-out confirmation split.
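The within-subject design reduces every comparison to a paired difference: the same model on the same dataset, scored once with the leak and once without. A minimal sketch of the resulting effect-size computation (the AUC pairs below are made up for illustration, not taken from the paper):

```python
import statistics

def cohens_dz(deltas):
    """Cohen's d_z for paired differences: mean(delta) / sd(delta)."""
    return statistics.mean(deltas) / statistics.stdev(deltas)

# Hypothetical per-dataset AUC pairs: (leaky run, clean run),
# same model, same splits -- each dataset is its own control.
pairs = [(0.81, 0.77), (0.74, 0.73), (0.90, 0.85), (0.68, 0.67), (0.79, 0.75)]
deltas = [leaky - clean for leaky, clean in pairs]
print(round(cohens_dz(deltas), 2))  # positive d_z means the leak inflates scores
```

Because each dataset serves as its own control, between-dataset variation cancels out of the deltas, which is what makes effects this small detectable at all.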
Four classes
Leakage effects vary by two orders of magnitude. The variation is not random — it is organized by causal mechanism. Four classes emerge:
| Class | Mechanism | Effect | Practical impact |
|---|---|---|---|
| I — Estimation | Fitting scalers, encoders, or imputers on the full dataset before splitting | dz ≈ 0 | Negligible. ΔAUC ≤ 0.005 across all 9 experiments. Vanishes at n ≥ 200. |
| II — Selection | Peeking at test performance, seed cherry-picking, target encoding, early stopping on validation | dz = 0.27–0.93 | Dominant. Persistent even at large n. Peeking inflates AUC by +0.040. Seed cherry-picking (best of 10) inflates by +0.045. |
| III — Memorization | Duplicate rows spanning train and test, SMOTE before split, random oversampling | dz = 0.37–1.11 | Scales with model capacity. Naïve Bayes barely affected (dz = 0.37). Decision trees fully memorize (dz = 1.11). |
| IV — Boundary | Random CV applied to temporal, group, or spatially structured data | dz ≈ 0† | † Undetectable under random CV by design. +0.023 AUC on 14 genuinely temporal datasets; near-zero on benchmarks without real drift. Structural mismatch, not parameter bias. |
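The Class I null has a simple mechanism: for a mean/variance scaler, all that leaks is the gap between full-sample and train-only statistics, and that gap shrinks as O(1/√n). A stdlib sketch with a synthetic Gaussian feature (sample sizes chosen to bracket the paper's n ≥ 200 threshold; the data are made up):

```python
import random
import statistics

def mean_gap(n, reps=500, rng=random.Random(1)):
    """Average |full-data mean - train-only mean| for an 80/20 split."""
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        train = xs[: int(0.8 * n)]
        total += abs(statistics.mean(xs) - statistics.mean(train))
    return total / reps

for n in (50, 200, 1000):
    print(n, round(mean_gap(n), 4))  # the leaked quantity shrinks with n
```

This is why Class I vanishes with sample size while Class II does not: selection operates on the score itself, not on a vanishing statistic.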
Effect sizes at a glance
Cohen’s dz for each leakage type, measured across 2,047 datasets.
The uncomfortable finding
The leakage that textbooks emphasize most — fitting a scaler on the full dataset — has negligible effect. The leakage that competition platforms structurally incentivize — peeking at test performance and selecting the best seed — produces the largest and most persistent inflation.
Peeking at a 10-fold cross-validated test set inflates AUC by +0.040 (dz = 0.93). This does not shrink with more data: the fitted asymptotic floor is ΔAUC∞ ≈ 0.046.
Seed cherry-picking (best of 10 random seeds) inflates AUC by +0.045, with a logarithmic dose-response curve (R² = 0.994, fitted on 5 aggregated dose levels, not individual datasets). Extrapolating that curve to 100 seeds yields roughly +0.08.
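Peeking and seed cherry-picking share one mechanism: reporting the maximum of k noisy evaluations of the same underlying score. A stdlib simulation of that mechanism (the true AUC and noise level below are illustrative choices, not the paper's fitted values):

```python
import random
import statistics

def reported_auc(true_auc, noise_sd, k, rng):
    """Report the best of k noisy evaluations (seeds, peeks, configs...)."""
    return max(rng.gauss(true_auc, noise_sd) for _ in range(k))

rng = random.Random(0)
true_auc, noise_sd = 0.75, 0.02
for k in (1, 10, 100):
    mean_reported = statistics.mean(
        reported_auc(true_auc, noise_sd, k, rng) for _ in range(2000)
    )
    print(k, round(mean_reported - true_auc, 3))  # inflation grows roughly with log(k)
```

The expected maximum of k Gaussian draws grows approximately logarithmically in k, which is consistent with the dose-response shape the paper reports, even though this toy model is not the paper's estimator.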
Key experiments
| Experiment | Class | dz | ΔAUC | Finding |
|---|---|---|---|---|
| Normalize before split | I | −0.02 | ≤ 0.001 | Not detectable above noise |
| Peeking (k = 10) | II | 0.93 | +0.040 | 92% prevalence, persistent at all n |
| Seed cherry-picking (best of 10) | II | 0.89 | +0.045 | Dose-response R² = 0.994 |
| Early stopping on validation | II | 0.46 | +0.013 | 76% of datasets affected |
| Screen selection bias | II | 0.27 | +0.013 | k-invariant: present at every fold count tested |
| 10% duplication (Decision Tree) | III | 1.11 | +0.024 | Full memorization; capacity-dependent |
| 10% duplication (Naïve Bayes) | III | 0.37 | +0.004 | Low-capacity model barely affected |
| SMOTE before split | III | 0.55 | +0.07–+0.25 (n-dep.) | Equal to random oversampling |
Class IV (boundary leakage) is not in the table above — all rows measure effects under random CV, and random CV cannot detect boundary leakage by design. The boundary experiment covered 129 temporal datasets: 14 with verified genuine timestamps (effect +0.023 AUC), 92 FOREX null controls without real drift (effect near zero), and 23 with spurious time columns (excluded from effect estimates).
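The capacity dependence of Class III can be reproduced with a deliberately extreme model: a lookup table that memorizes training rows exactly, standing in for a fully-grown decision tree. On labels that carry no signal, duplicating training rows into the test set is the only way such a model scores above chance (every number below is synthetic):

```python
import random

class Memorizer:
    """Maximal-capacity stand-in for an unpruned decision tree."""
    def fit(self, X, y):
        self.table = {tuple(x): yi for x, yi in zip(X, y)}
    def predict(self, X):
        return [self.table.get(tuple(x), 0) for x in X]  # class 0 for unseen rows

rng = random.Random(42)
X = [[rng.random() for _ in range(5)] for _ in range(1000)]
y = [rng.randint(0, 1) for _ in range(1000)]  # labels are pure noise

X_tr, y_tr, X_te, y_te = X[:800], y[:800], X[800:], y[800:]
model = Memorizer()
model.fit(X_tr, y_tr)

def acc(model, X, y):
    return sum(p == t for p, t in zip(model.predict(X), y)) / len(y)

clean = acc(model, X_te, y_te)
# Leak: overwrite 10% of the test set with rows copied from training.
for i in range(20):
    X_te[i], y_te[i] = X_tr[i], y_tr[i]
leaky = acc(model, X_te, y_te)
print(round(clean, 2), round(leaky, 2))  # leaky > clean, from memorization alone
```

A low-capacity model (e.g. Naïve Bayes, which averages over rows rather than storing them) gains almost nothing from the same duplicates, matching the dz = 0.37 vs 1.11 contrast in the table.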
Cross-validation coverage gap
Across these 2,047 datasets, standard k-fold cross-validation confidence intervals achieve only 55.1% actual coverage at nominal 95%. The Student-t correction improves this to 70.4%. The best method tested (Conservative-Z) reaches 89%. None of the methods tested achieves the stated 95%.
Actual coverage at nominal 95% for six CI methods, measured on Logistic Regression (LR) and Decision Tree (DT). Bootstrap collapses for DT. Reference line at 95% nominal.
This means published error bars from cross-validation are systematically too narrow. Reported differences between models may not be differences at all.
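The undercoverage has a mechanical explanation: k-fold scores are computed on overlapping training data, so their errors are positively correlated, while the naive interval treats them as independent. A toy simulation of that mechanism (the noise magnitudes are made up; the paper's 55.1% figure comes from real datasets, not from this model):

```python
import math
import random
import statistics

T95_DF9 = 2.262  # two-sided 95% t critical value for k - 1 = 9 df

def naive_cv_coverage(n_sims=4000, k=10, shared_sd=0.02, fold_sd=0.02):
    """Coverage of the naive k-fold t-interval under correlated fold scores."""
    rng = random.Random(0)
    true_score, hits = 0.75, 0
    for _ in range(n_sims):
        shared = rng.gauss(0.0, shared_sd)  # error common to all folds (overlap)
        folds = [true_score + shared + rng.gauss(0.0, fold_sd) for _ in range(k)]
        m = statistics.mean(folds)
        half = T95_DF9 * statistics.stdev(folds) / math.sqrt(k)  # ignores correlation
        hits += (m - half <= true_score <= m + half)
    return hits / n_sims

print(naive_cv_coverage())  # well below the nominal 0.95
```

The shared component never averages out across folds, so the fold-to-fold spread understates the true uncertainty of the mean, and the interval is systematically too narrow.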
Mechanism matters more than data
A Bayesian hierarchical meta-regression across 12,103 observations confirms: between-experiment variance (τ_exp = 0.013) exceeds between-dataset variance (τ_ds = 0.005) by 2.6×. All dataset moderators — log(n), log(p), class imbalance — classify as NULL after hierarchical correction.
The practical implication: leakage prevention should be unconditional. Dataset characteristics do not reliably predict which datasets are vulnerable. You either prevent the mechanism or you don’t.
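The paper's decomposition is a Bayesian hierarchical model; a crude moment-based analogue conveys what the comparison means. With made-up effect sizes arranged by experiment (rows) and dataset (columns), the spread of experiment means dwarfs the spread of dataset means:

```python
import statistics

# Hypothetical d_z values: rows are experiments, columns are datasets.
effects = [
    [0.00, 0.01, 0.00, 0.01],  # Class I:  normalize before split
    [0.90, 0.95, 0.88, 0.93],  # Class II: peeking
    [0.85, 0.92, 0.87, 0.90],  # Class II: seed cherry-picking
    [0.40, 0.50, 0.45, 0.48],  # Class II: early stopping
]
exp_means = [statistics.mean(row) for row in effects]
ds_means = [statistics.mean(col) for col in zip(*effects)]
tau_exp = statistics.stdev(exp_means)  # between-experiment spread
tau_ds = statistics.stdev(ds_means)    # between-dataset spread
print(round(tau_exp, 3), round(tau_ds, 3))  # between-experiment spread dominates
```

Knowing which experiment (mechanism) you are looking at tells you far more about the effect than knowing which dataset it ran on, which is the statistical content of "prevent the mechanism unconditionally."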
Validation
Internal replication: All findings confirmed on held-out confirmation split. Prediction scorecard: 10/13 directional predictions confirmed (internal validation protocol, not formal pre-registration on an external platform); 3 failures documented in the paper. N-scaling: 493 datasets tested from n = 50 to 10,000 — Class I vanishes at n ≥ 200; Class II persists across all tested sample sizes.
Why this matters
If you teach ML, you should spend less time on scaler leakage and more time on peeking and seed selection. If you build ML tools, you should enforce constraints on test-set access, not just preprocessing order. If you review ML papers, you should ask how many seeds were tried and whether the test set was touched before the final report.
These findings are the empirical foundation for the grammar of machine learning workflows. The grammar’s four hard constraints — terminal assess, partition isolation, seed commitment, and CV-coverage correction — target the two classes that matter (II and III), not the one that doesn’t (I). Without this landscape, those constraints would be opinion. With it, they are calibrated to measured effect sizes.
Citation
Roth, S. (2026). Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets. EPAGOGY. arXiv:2604.04199.
BibTeX
@misc{roth2026leakage,
  title  = {Which Leakage Types Matter? A Quantitative Landscape
            Across 2,047 Benchmark Datasets},
  author = {Roth, Simon},
  year   = {2026},
  note   = {arXiv:2604.04199},
  url    = {https://arxiv.org/abs/2604.04199}
}
Questions or critique
Pre-prints are falsifiable by design. If you find a structural error, get in touch.