Which leakage types actually matter?

Not all leakage is equal. Effects vary by two orders of magnitude, and the variation is organized by causal mechanism.

29 experiments · 2,047 datasets · 4 classes · 55.1% mean actual CI coverage at nominal 95%

Textbooks emphasize the leakage class that barely matters on average. Competitions reward the one that inflates scores most. Neither is calibrated to measured effect sizes.

Which Leakage Types Matter?

A Quantitative Landscape Across 2,047 Benchmark Datasets

Roth, S. (2026). EPAGOGY. · arXiv:2604.04199

The question

Everyone agrees data leakage is bad. But the severity had not been comparatively measured — not across classes, not at scale, not with pre-registered predictions. Textbooks emphasize fitting scalers on the full dataset. Competition winners worry about seed selection. Reviewers flag target encoding. Are these equally severe? Do they share a mechanism? The answers are not what the conventional emphasis implies. Two caveats apply throughout: predictions were registered internally, not on a public registry, and dz is standardised on paired differences, so cross-class comparisons should be read as directional, not exact.

This paper runs 29 experiments (28 core + 1 boundary study on 129 temporal datasets) across 2,047 tabular benchmark datasets from OpenML and PMLB, covering 17 scientific fields. The design is within-subject counterfactual: each dataset and model serves as its own control. Every finding replicates on a held-out confirmation split.
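
A minimal sketch of what the within-subject design means in practice, assuming scikit-learn and a synthetic stand-in for the OpenML/PMLB corpus (all names here are illustrative, not the paper's code): the same model is scored twice per dataset, once through a clean pipeline and once through a leaky variant (here Class I scaler leakage), and Cohen's dz is computed on the paired differences.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def auc_clean(X, y, seed):
    """Control arm: scaler fitted on the training split only."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    scaler = StandardScaler().fit(X_tr)
    model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    return roc_auc_score(y_te, model.predict_proba(scaler.transform(X_te))[:, 1])


def auc_leaky(X, y, seed):
    """Leaky arm: scaler fitted on the full dataset before splitting (Class I)."""
    X_all = StandardScaler().fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X_all, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


# Within-subject design: each synthetic "dataset" contributes one paired delta
# (leaky minus clean); Cohen's dz is the mean delta over its standard deviation.
deltas = []
for seed in range(20):
    X, y = make_classification(n_samples=300, n_features=20, random_state=seed)
    deltas.append(auc_leaky(X, y, seed) - auc_clean(X, y, seed))
deltas = np.array(deltas)
dz = deltas.mean() / deltas.std(ddof=1)
print(f"mean ΔAUC = {deltas.mean():+.4f}, Cohen's dz = {dz:.2f}")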

Four classes

Leakage effects vary by two orders of magnitude. The variation is not random — it is organized by causal mechanism. Four classes emerge:

Class | Mechanism | Effect | Practical impact
I — Estimation | Fitting scalers, encoders, or imputers on the full dataset before splitting | dz ≈ 0 | Negligible. ΔAUC ≤ 0.005 across all 9 experiments. Vanishes at n ≥ 200.
II — Selection | Peeking at test performance, seed cherry-picking, target encoding, early stopping on validation | dz = 0.27–0.93 | Dominant. Persistent even at large n. Peeking inflates AUC by +0.040; seed cherry-picking (best of 10) inflates by +0.045.
III — Memorization | Duplicate rows in train+test, SMOTE before split, random oversampling | dz = 0.37–1.11 | Scales with model capacity. Naïve Bayes barely affected (dz = 0.37); decision trees fully memorize (dz = 1.11).
IV — Boundary | Random CV applied to temporal, group, or spatially structured data | dz ≈ 0† | Structural mismatch, not parameter bias. +0.023 AUC on 14 genuinely temporal datasets; near zero on benchmarks without real drift.
† Undetectable under random CV by design.
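
The capacity dependence in the Class III row can be reproduced in miniature. A hedged sketch on synthetic data (illustrative setup, not the paper's protocol): copies of training rows are appended to the test set, and a decision tree, which can reproduce memorized rows exactly, inflates far more than Gaussian Naïve Bayes.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Class III leakage: training rows (10% of the test-set size) duplicated into the test set.
n_dup = int(0.1 * len(X_te))
dup_idx = rng.choice(len(X_tr), size=n_dup, replace=False)
X_te_dup = np.vstack([X_te, X_tr[dup_idx]])
y_te_dup = np.concatenate([y_te, y_tr[dup_idx]])

for name, model in [("Decision tree", DecisionTreeClassifier(random_state=0)),
                    ("Gaussian NB", GaussianNB())]:
    model.fit(X_tr, y_tr)
    clean = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    leaky = roc_auc_score(y_te_dup, model.predict_proba(X_te_dup)[:, 1])
    print(f"{name:14s} clean AUC {clean:.3f} | with duplicates {leaky:.3f} | Δ {leaky - clean:+.3f}")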

Effect sizes at a glance

Figure: Cohen’s dz for each leakage type, measured across 2,047 datasets, with the corresponding raw ΔAUC per type.

The uncomfortable finding

The leakage that textbooks emphasize most — fitting a scaler on the full dataset — has negligible effect. The leakage that competition platforms structurally incentivize — peeking at test performance and selecting the best seed — produces the largest and most persistent inflation.

Peeking at a 10-fold cross-validated test set inflates AUC by +0.040 (dz = 0.93). The inflation does not shrink with more data; it converges to a non-zero floor of ΔAUC∞ ≈ 0.046.
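
A minimal illustration of the peeking mechanism, assuming scikit-learn and synthetic data rather than the paper's benchmark suite: several candidates are scored on the same 10-fold CV folds, and reporting the maximum of those scores is the leak; the clean protocol selects on a development split and scores the winner once on untouched data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)
candidates = [LogisticRegression(C=c, max_iter=2000) for c in (0.01, 0.1, 1.0, 10.0, 100.0)]

# Peeking: every candidate is scored on the same 10-fold CV folds, and the
# best of those scores is reported as the final result.
cv_scores = [cross_val_score(m, X, y, cv=10, scoring="roc_auc").mean() for m in candidates]
peeked_auc = max(cv_scores)

# Clean protocol: model selection happens on a development split; the chosen
# model is then scored exactly once on data that never informed the selection.
X_dev, X_final, y_dev, y_final = train_test_split(X, y, test_size=0.3, random_state=0)
dev_scores = [cross_val_score(m, X_dev, y_dev, cv=10, scoring="roc_auc").mean() for m in candidates]
best = candidates[int(np.argmax(dev_scores))].fit(X_dev, y_dev)
clean_auc = roc_auc_score(y_final, best.predict_proba(X_final)[:, 1])

print(f"peeked AUC {peeked_auc:.3f} vs clean AUC {clean_auc:.3f}")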

Seed cherry-picking (best of 10 random seeds) inflates AUC by +0.045 with a logarithmic dose-response curve (R² = 0.994 on 5 aggregated dose levels, not individual datasets). Try 100 seeds and you get +0.08.
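
The dose-response is easy to simulate. A hedged sketch on synthetic data (illustrative setup): each seed drives one complete run, and "best of k" reports the maximum over k such runs instead of committing to a single seed in advance.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=25, n_informative=5, random_state=1)

def auc_for_seed(seed):
    """One complete run (split, fit, evaluate) driven by a single seed."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

aucs = np.array([auc_for_seed(s) for s in range(100)])
print(f"mean over seeds (what an honest single run estimates): {aucs.mean():.3f}")
for k in (1, 10, 100):   # "dose": how many seeds were tried before reporting
    print(f"best of {k:>3d} seeds: {aucs[:k].max():.3f}")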

Key experiments

Experiment | Class | dz | ΔAUC | Finding
Normalize before split | I | −0.02 | ≤ 0.001 | Not detectable above noise
Peeking (k = 10) | II | 0.93 | +0.040 | 92% prevalence; persistent at all n
Seed cherry-picking (best of 10) | II | 0.89 | +0.045 | Dose-response R² = 0.994
Early stopping on validation | II | 0.46 | +0.013 | 76% of datasets affected
Screen selection bias | II | 0.27 | +0.013 | k-invariant; present regardless of k
10% duplication (Decision Tree) | III | 1.11 | +0.024 | Full memorization; capacity-dependent
10% duplication (Naïve Bayes) | III | 0.37 | +0.004 | Low-capacity model barely affected
SMOTE before split | III | 0.55 | +0.07 to +0.25 (n-dependent) | Equal to random oversampling
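
The SMOTE row is about ordering, not the oversampler itself. A sketch assuming the imbalanced-learn package (imblearn) is available: resampling before cross-validation interpolates synthetic minority points from rows that later land in test folds, whereas a sampler placed inside an imblearn pipeline is refit on each training fold only.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)

# Leaky: oversample the full dataset, then cross-validate. Synthetic minority
# points derived from (future) test rows end up in the training folds.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
leaky = cross_val_score(DecisionTreeClassifier(random_state=0), X_res, y_res,
                        cv=10, scoring="roc_auc").mean()

# Clean: SMOTE sits inside the pipeline, so it is fit on each training fold
# and never sees the corresponding test fold.
pipe = ImbPipeline([("smote", SMOTE(random_state=0)),
                    ("tree", DecisionTreeClassifier(random_state=0))])
clean = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc").mean()

print(f"SMOTE before split: {leaky:.3f} | SMOTE inside each fold: {clean:.3f}")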

Class IV (boundary leakage) is not in the table above — all rows measure effects under random CV, and random CV cannot detect boundary leakage by design. The boundary experiment covered 129 temporal datasets: 14 with verified genuine timestamps (effect +0.023 AUC), 92 FOREX null controls without real drift (effect near zero), and 23 with spurious time columns (excluded from effect estimates).
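
A minimal sketch of that structural mismatch, assuming scikit-learn and a toy series with genuine drift (all names illustrative): random k-fold CV mixes future and past rows in every training fold, while TimeSeriesSplit trains only on the past, so the random-CV estimate is optimistic exactly when drift is real.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, TimeSeriesSplit

# Toy temporal data: the relationship between features and label drifts over time.
rng = np.random.default_rng(0)
n = 2000
t = np.linspace(0.0, 1.0, n)
X = rng.normal(size=(n, 5))
logits = X[:, 0] * (1.0 + 3.0 * t) + X[:, 1] - 3.0 * t
y = (logits + rng.normal(scale=1.0, size=n) > 0).astype(int)

def cv_auc(splitter):
    scores = []
    for tr, te in splitter.split(X):
        model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        scores.append(roc_auc_score(y[te], model.predict_proba(X[te])[:, 1]))
    return float(np.mean(scores))

print(f"random 5-fold CV (rows shuffled across time):   {cv_auc(KFold(n_splits=5, shuffle=True, random_state=0)):.3f}")
print(f"TimeSeriesSplit (train on past, test on future): {cv_auc(TimeSeriesSplit(n_splits=5)):.3f}")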

Cross-validation coverage gap

Across these 2,047 datasets, standard k-fold cross-validation confidence intervals achieve only 55.1% actual coverage at nominal 95%. The Student-t correction improves this to 70.4%. The best method tested (Conservative-Z) reaches 89%. None of the methods tested achieves the stated 95%.
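
For concreteness, a hedged sketch of the two simplest intervals on the fold scores from one 10-fold CV run (the Conservative-Z method is the paper's own and is not reproduced here); both treat the k fold scores as independent, which they are not, and that is the root of the undercoverage.

import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="roc_auc")

k = len(scores)
mean = scores.mean()
se = scores.std(ddof=1) / np.sqrt(k)

z = stats.norm.ppf(0.975)            # naive normal quantile
t = stats.t.ppf(0.975, df=k - 1)     # Student-t correction, wider for small k
print(f"normal 95% CI : [{mean - z * se:.3f}, {mean + z * se:.3f}]")
print(f"t-based 95% CI: [{mean - t * se:.3f}, {mean + t * se:.3f}]")
# Fold scores share training data, so the independence assumption behind both
# intervals fails; the paper reports 55.1% actual coverage for standard CV
# intervals and 70.4% with the Student-t correction, at nominal 95%.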

Figure: Actual coverage at nominal 95% for six CI methods, measured on Logistic Regression (LR) and Decision Tree (DT). Bootstrap collapses for DT. Reference line at 95% nominal.

This means published error bars from cross-validation are systematically too narrow. Reported differences between models may not be differences at all.

Mechanism matters more than data

A Bayesian hierarchical meta-regression across 12,103 observations confirms it: between-experiment variance (τ_exp = 0.013) exceeds between-dataset variance (τ_ds = 0.005) by a factor of 2.6. All dataset moderators — log(n), log(p), class imbalance — are classified as NULL after hierarchical correction.

The practical implication: leakage prevention should be unconditional. Dataset characteristics do not reliably predict which datasets are vulnerable. You either prevent the mechanism or you don’t.

Validation

Internal replication: all findings confirmed on the held-out confirmation split.
Prediction scorecard: 10/13 directional predictions confirmed (internal validation protocol, not formal pre-registration on an external platform); the 3 failures are documented in the paper.
N-scaling: 493 datasets tested from n = 50 to 10,000 — Class I vanishes at n ≥ 200; Class II persists across all tested sample sizes.

Why this matters

If you teach ML, you should spend less time on scaler leakage and more time on peeking and seed selection. If you build ML tools, you should enforce constraints on test-set access, not just preprocessing order. If you review ML papers, you should ask how many seeds were tried and whether the test set was touched before the final report.

These findings are the empirical foundation for the grammar of machine learning workflows. The grammar’s four hard constraints — terminal assess, partition isolation, seed commitment, and CV-coverage correction — target the two classes that matter (II and III), not the one that doesn’t (I). Without this landscape, those constraints would be opinion. With it, they are calibrated to measured effect sizes.

Citation

Roth, S. (2026). Which Leakage Types Matter? A Quantitative
Landscape Across 2,047 Benchmark Datasets. EPAGOGY.
BibTeX
@misc{roth2026leakage,
  title = {Which Leakage Types Matter? A Quantitative Landscape
            Across 2,047 Benchmark Datasets},
  author = {Roth, Simon},
  year = {2026},
  note = {arXiv:2604.04199},
  url = {https://arxiv.org/abs/2604.04199}
}

Questions or critique

Pre-prints are falsifiable by design. If you find a structural error, get in touch.