Epagogy Research

Research Papers

Empirical and formal work on structural correctness for machine learning. Independent scientific research.

All papers are preprints. Not yet peer-reviewed.
Preprint arXiv:2604.04199 April 2026 · last updated 2026-04-03

Which Leakage Types Matter?

A Quantitative Landscape Across 2,047 Benchmark Datasets

Simon Roth · Independent researcher

Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation — fitting scalers on full data) is negligible: all nine conditions produce |ΔAUC| ≤ 0.005. Class II (selection — peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: d𝑧 = 0.37 (Naive Bayes) to 1.11 (Decision Tree). Class IV (boundary) is invisible under random CV. A diagnostic experiment finds that standard CV confidence intervals achieve only 55% actual coverage at nominal 95%. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
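To make the Class I (estimation) pattern concrete, here is a minimal, self-contained sketch — not code from the paper — of the leaky versus clean way to fit scaler statistics. All names are illustrative:

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
train, test = data[:80], data[80:]

# Leaky (Class I): mean/std estimated on ALL rows, test rows included.
mu_full, sd_full = statistics.mean(data), statistics.stdev(data)
train_leaky = [(x - mu_full) / sd_full for x in train]

# Clean: mean/std estimated on the training split only.
mu_tr, sd_tr = statistics.mean(train), statistics.stdev(train)
train_clean = [(x - mu_tr) / sd_tr for x in train]

# The two feature sets differ, so test-set information has entered
# the training features — though, per the Class I finding above, the
# induced score shift is typically negligible.
max_gap = max(abs(a - b) for a, b in zip(train_leaky, train_clean))
print(max_gap > 0.0)  # True: the scaled features differ
```

The gap is real but small, which matches the paper's claim that estimation leakage is the least damaging of the four classes.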

Preprint arXiv:2603.10742 April 2026 · last updated 2026-04-03

A Grammar of Machine Learning Workflows

Rejecting Data Leakage at Call Time

Simon Roth · Independent researcher

Data leakage has been identified in 648 published machine learning papers across 30 scientific fields. The knowledge to prevent it exists; the tools do not enforce it. This paper presents a grammar — eight typed primitives, a directed acyclic graph, and four hard constraints — that makes the most damaging leakage types structurally unrepresentable. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary in an ML framework (as of April 2026), backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.
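A terminal assessment gate can be sketched in a few lines. The following is a hypothetical illustration of the call-time-enforcement idea only — the class names, methods, and model are invented here, not taken from the paper's specification or its reference implementations:

```python
class LeakageError(RuntimeError):
    """Raised when a workflow step would violate the assessment boundary."""

class AssessmentGate:
    """Holds the held-out data; assess() is terminal for the workflow."""

    def __init__(self, holdout):
        self._holdout = holdout
        self._terminal = False  # flips to True once assess() runs

    def fit(self, model, train):
        if self._terminal:
            raise LeakageError("fit() after assess() is rejected at call time")
        model.update(train)
        return model

    def assess(self, model):
        self._terminal = True  # the gate closes: no further fitting allowed
        hits = sum(model.predict(x) == y for x, y in self._holdout)
        return hits / len(self._holdout)

class MajorityModel:
    """Trivial classifier: always predicts the most frequent training label."""

    def __init__(self):
        self.label = None

    def update(self, train):
        labels = [y for _, y in train]
        self.label = max(set(labels), key=labels.count)

    def predict(self, x):
        return self.label

train = [(0, "a"), (1, "a"), (2, "b")]
holdout = [(3, "a"), (4, "b")]

gate = AssessmentGate(holdout)
model = gate.fit(MajorityModel(), train)
acc = gate.assess(model)     # scores the held-out data exactly once
try:
    gate.fit(model, train)   # any post-assessment tuning is rejected
    reopened = True
except LeakageError:
    reopened = False
print(acc, reopened)  # 0.5 False
```

The point of enforcing this at call time rather than by convention is that re-fitting after seeing the held-out score — the Class II selection leakage quantified in the companion landscape study — becomes unrepresentable rather than merely discouraged.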

In preparation · forthcoming as preprint

The Shortest Path Leaks

Silent Data Leakage in LLM-Generated ML Pipelines

Simon Roth · Independent researcher

Large language models have become the default interface for writing ML code. If the path they take is the shortest one — and the shortest path leaks — the problem documented in the landscape and grammar studies does not stay contained. It propagates silently, at scale, through every generated pipeline. This companion study to the landscape and grammar papers collects and analyses data across a large corpus of models and prompts. Structural enforcement is the only intervention that works by design.

Manuscript in preparation. Results not yet public.