Which Leakage Types Matter?
A Quantitative Landscape Across 2,047 Benchmark Datasets
The question
Everyone agrees data leakage is bad. But its severity had not been comparatively measured: not across classes, not at scale, not with pre-registered predictions. Textbooks emphasize fitting scalers on the full dataset. Competition winners worry about seed selection. Reviewers flag target encoding. Are these the same severity? The same mechanism? The answers are not what the conventional emphasis implies. Two caveats up front: predictions were registered internally, not on a public registry, and dz is standardized on paired differences, so cross-class comparisons should be read as directional, not exact.
This paper tests 29 experiments (28 core + 1 boundary study on 129 temporal datasets) across 2,047 tabular benchmark datasets from OpenML and PMLB, covering 17 scientific fields. The design is within-subject counterfactual: each dataset and model serves as its own control. Every finding replicates on a held-out confirmation split.
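The within-subject design reduces every comparison to a paired difference: the same model on the same dataset, scored once with the leak and once without. A minimal sketch of the resulting effect-size computation (the AUC pairs below are made up for illustration, not taken from the paper):

```python
import statistics

def cohens_dz(deltas):
    """Cohen's d_z for paired differences: mean(delta) / sd(delta)."""
    return statistics.mean(deltas) / statistics.stdev(deltas)

# Hypothetical per-dataset AUC pairs: (leaky run, clean run),
# same model, same splits -- each dataset is its own control.
pairs = [(0.81, 0.77), (0.74, 0.73), (0.90, 0.85), (0.68, 0.67), (0.79, 0.75)]
deltas = [leaky - clean for leaky, clean in pairs]
print(round(cohens_dz(deltas), 2))  # positive d_z means the leak inflates scores
```

Because each dataset serves as its own control, between-dataset variation cancels out of the deltas, which is what makes effects this small detectable at all.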
Four classes
Leakage effects vary by two orders of magnitude. The variation is not random — it is organized by causal mechanism. Four classes emerge:
| Class | Mechanism | Effect | Practical impact |
|---|---|---|---|
| I — Estimation | Fitting scalers, encoders, or imputers on the full dataset before splitting | dz ≈ 0 | Negligible. ΔAUC ≤ 0.005 across all 9 experiments. Vanishes at n ≥ 200. |
| II — Selection | Peeking at test performance, seed cherry-picking, target encoding, early stopping on validation | dz = 0.27–0.93 | Dominant. Persistent even at large n. Peeking inflates AUC by +0.040. Seed cherry-picking (best of 10) inflates by +0.045. |
| III — Memorization | Duplicate rows spanning train and test, SMOTE before split, random oversampling | dz = 0.37–1.11 | Scales with model capacity. Naïve Bayes barely affected (dz = 0.37). Decision trees fully memorize (dz = 1.11). |
| IV — Boundary | Random CV applied to temporal, group, or spatially structured data | dz ≈ 0† | † Undetectable under random CV by design. +0.023 AUC on 14 genuinely temporal datasets; near-zero on benchmarks without real drift. Structural mismatch, not parameter bias. |
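The Class I null has a simple mechanism: for a mean/variance scaler, all that leaks is the gap between full-sample and train-only statistics, and that gap shrinks as O(1/√n). A stdlib sketch with a synthetic Gaussian feature (sample sizes chosen to bracket the paper's n ≥ 200 threshold; the data are made up):

```python
import random
import statistics

def mean_gap(n, reps=500, rng=random.Random(1)):
    """Average |full-data mean - train-only mean| for an 80/20 split."""
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        train = xs[: int(0.8 * n)]
        total += abs(statistics.mean(xs) - statistics.mean(train))
    return total / reps

for n in (50, 200, 1000):
    print(n, round(mean_gap(n), 4))  # the leaked quantity shrinks with n
```

This is why Class I vanishes with sample size while Class II does not: selection operates on the score itself, not on a vanishing statistic.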
Effect sizes at a glance
Cohen’s dz for each leakage type, measured across 2,047 datasets.
The uncomfortable finding
The leakage that textbooks emphasize most — fitting a scaler on the full dataset — has negligible effect. The leakage that competition platforms structurally incentivize — peeking at test performance and selecting the best seed — produces the largest and most persistent inflation.
Peeking at a 10-fold cross-validated test set inflates AUC by +0.040 (dz = 0.93). This does not shrink with more data: the fitted asymptotic floor is ΔAUC∞ ≈ 0.046.
Seed cherry-picking (best of 10 random seeds) inflates AUC by +0.045, with a logarithmic dose-response curve (R² = 0.994, fitted on 5 aggregated dose levels, not individual datasets). Extrapolating that curve to 100 seeds yields roughly +0.08.
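Peeking and seed cherry-picking share one mechanism: reporting the maximum of k noisy evaluations of the same underlying score. A stdlib simulation of that mechanism (the true AUC and noise level below are illustrative choices, not the paper's fitted values):

```python
import random
import statistics

def reported_auc(true_auc, noise_sd, k, rng):
    """Report the best of k noisy evaluations (seeds, peeks, configs...)."""
    return max(rng.gauss(true_auc, noise_sd) for _ in range(k))

rng = random.Random(0)
true_auc, noise_sd = 0.75, 0.02
for k in (1, 10, 100):
    mean_reported = statistics.mean(
        reported_auc(true_auc, noise_sd, k, rng) for _ in range(2000)
    )
    print(k, round(mean_reported - true_auc, 3))  # inflation grows roughly with log(k)
```

The expected maximum of k Gaussian draws grows approximately logarithmically in k, which is consistent with the dose-response shape the paper reports, even though this toy model is not the paper's estimator.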
Key experiments
| Experiment | Class | dz | ΔAUC | Finding |
|---|---|---|---|---|
| Normalize before split | I | −0.02 | ≤ 0.001 | Not detectable above noise |
| Peeking (k = 10) | II | 0.93 | +0.040 | 92% prevalence, persistent at all n |
| Seed cherry-picking (best of 10) | II | 0.89 | +0.045 | Dose-response R² = 0.994 |
| Early stopping on validation | II | 0.46 | +0.013 | 76% of datasets affected |
| Screen selection bias | II | 0.27 | +0.013 | k-invariant: present at every fold count tested |
| 10% duplication (Decision Tree) | III | 1.11 | +0.024 | Full memorization; capacity-dependent |
| 10% duplication (Naïve Bayes) | III | 0.37 | +0.004 | Low-capacity model barely affected |
| SMOTE before split | III | 0.55 | +0.07–+0.25 (n-dep.) | Equal to random oversampling |
Class IV (boundary leakage) is not in the table above — all rows measure effects under random CV, and random CV cannot detect boundary leakage by design. The boundary experiment covered 129 temporal datasets: 14 with verified genuine timestamps (effect +0.023 AUC), 92 FOREX null controls without real drift (effect near zero), and 23 with spurious time columns (excluded from effect estimates).
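The capacity dependence of Class III can be reproduced with a deliberately extreme model: a lookup table that memorizes training rows exactly, standing in for a fully-grown decision tree. On labels that carry no signal, duplicating training rows into the test set is the only way such a model scores above chance (every number below is synthetic):

```python
import random

class Memorizer:
    """Maximal-capacity stand-in for an unpruned decision tree."""
    def fit(self, X, y):
        self.table = {tuple(x): yi for x, yi in zip(X, y)}
    def predict(self, X):
        return [self.table.get(tuple(x), 0) for x in X]  # class 0 for unseen rows

rng = random.Random(42)
X = [[rng.random() for _ in range(5)] for _ in range(1000)]
y = [rng.randint(0, 1) for _ in range(1000)]  # labels are pure noise

X_tr, y_tr, X_te, y_te = X[:800], y[:800], X[800:], y[800:]
model = Memorizer()
model.fit(X_tr, y_tr)

def acc(model, X, y):
    return sum(p == t for p, t in zip(model.predict(X), y)) / len(y)

clean = acc(model, X_te, y_te)
# Leak: overwrite 10% of the test set with rows copied from training.
for i in range(20):
    X_te[i], y_te[i] = X_tr[i], y_tr[i]
leaky = acc(model, X_te, y_te)
print(round(clean, 2), round(leaky, 2))  # leaky > clean, from memorization alone
```

A low-capacity model (e.g. Naïve Bayes, which averages over rows rather than storing them) gains almost nothing from the same duplicates, matching the dz = 0.37 vs 1.11 contrast in the table.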
Cross-validation coverage gap
Across these 2,047 datasets, standard k-fold cross-validation confidence intervals achieve only 55.1% actual coverage at nominal 95%. The Student-t correction improves this to 70.4%. The best method tested (Conservative-Z) reaches 89%. None of the methods tested achieves the stated 95%.
Actual coverage at nominal 95% for six CI methods, measured on Logistic Regression (LR) and Decision Tree (DT). Bootstrap collapses for DT. Reference line at 95% nominal.
This means published error bars from cross-validation are systematically too narrow. Reported differences between models may not be differences at all.
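The undercoverage has a mechanical explanation: k-fold scores are computed on overlapping training data, so their errors are positively correlated, while the naive interval treats them as independent. A toy simulation of that mechanism (the noise magnitudes are made up; the paper's 55.1% figure comes from real datasets, not from this model):

```python
import math
import random
import statistics

T95_DF9 = 2.262  # two-sided 95% t critical value for k - 1 = 9 df

def naive_cv_coverage(n_sims=4000, k=10, shared_sd=0.02, fold_sd=0.02):
    """Coverage of the naive k-fold t-interval under correlated fold scores."""
    rng = random.Random(0)
    true_score, hits = 0.75, 0
    for _ in range(n_sims):
        shared = rng.gauss(0.0, shared_sd)  # error common to all folds (overlap)
        folds = [true_score + shared + rng.gauss(0.0, fold_sd) for _ in range(k)]
        m = statistics.mean(folds)
        half = T95_DF9 * statistics.stdev(folds) / math.sqrt(k)  # ignores correlation
        hits += (m - half <= true_score <= m + half)
    return hits / n_sims

print(naive_cv_coverage())  # well below the nominal 0.95
```

The shared component never averages out across folds, so the fold-to-fold spread understates the true uncertainty of the mean, and the interval is systematically too narrow.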
Mechanism matters more than data
A Bayesian hierarchical meta-regression across 12,103 observations confirms: between-experiment variance (τ_exp = 0.013) exceeds between-dataset variance (τ_ds = 0.005) by 2.6×. All dataset moderators — log(n), log(p), class imbalance — classify as NULL after hierarchical correction.
The practical implication: leakage prevention should be unconditional. Dataset characteristics do not reliably predict which datasets are vulnerable. You either prevent the mechanism or you don’t.
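The paper's decomposition is a Bayesian hierarchical model; a crude moment-based analogue conveys what the comparison means. With made-up effect sizes arranged by experiment (rows) and dataset (columns), the spread of experiment means dwarfs the spread of dataset means:

```python
import statistics

# Hypothetical d_z values: rows are experiments, columns are datasets.
effects = [
    [0.00, 0.01, 0.00, 0.01],  # Class I:  normalize before split
    [0.90, 0.95, 0.88, 0.93],  # Class II: peeking
    [0.85, 0.92, 0.87, 0.90],  # Class II: seed cherry-picking
    [0.40, 0.50, 0.45, 0.48],  # Class II: early stopping
]
exp_means = [statistics.mean(row) for row in effects]
ds_means = [statistics.mean(col) for col in zip(*effects)]
tau_exp = statistics.stdev(exp_means)  # between-experiment spread
tau_ds = statistics.stdev(ds_means)    # between-dataset spread
print(round(tau_exp, 3), round(tau_ds, 3))  # between-experiment spread dominates
```

Knowing which experiment (mechanism) you are looking at tells you far more about the effect than knowing which dataset it ran on, which is the statistical content of "prevent the mechanism unconditionally."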
Validation
Internal replication: All findings confirmed on held-out confirmation split. Prediction scorecard: 10/13 directional predictions confirmed (internal validation protocol, not formal pre-registration on an external platform); 3 failures documented in the paper. N-scaling: 493 datasets tested from n = 50 to 10,000 — Class I vanishes at n ≥ 200; Class II persists across all tested sample sizes.
Why this matters
If you teach ML, you should spend less time on scaler leakage and more time on peeking and seed selection. If you build ML tools, you should enforce constraints on test-set access, not just preprocessing order. If you review ML papers, you should ask how many seeds were tried and whether the test set was touched before the final report.
These findings are the empirical foundation for the grammar of machine learning workflows. The grammar’s four hard constraints — terminal assess, partition isolation, seed commitment, and CV-coverage correction — target the two classes that matter (II and III), not the one that doesn’t (I). Without this landscape, those constraints would be opinion. With it, they are calibrated to measured effect sizes.
Citation
Roth, S. (2026). Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets. EPAGOGY. arXiv:2604.04199.
BibTeX
@misc{roth2026leakage,
  title  = {Which Leakage Types Matter? A Quantitative Landscape
            Across 2,047 Benchmark Datasets},
  author = {Roth, Simon},
  year   = {2026},
  note   = {arXiv:2604.04199},
  url    = {https://arxiv.org/abs/2604.04199}
}
Questions or critique
Pre-prints are falsifiable by design. If you find a structural error, get in touch.