648 papers. Leaked.

Most ML pipelines leak. Most teams don't know.

Thirty fields. Published and cited before anyone noticed. Not sloppy code. Structural errors the tools made invisible.

648: Kapoor & Narayanan, living survey (2025). 329: arXiv preprint (2022). 294: published in Patterns (2023).

The numbers looked right. The predictions didn't replicate. The tools never flagged it because they weren't designed to. Leakage isn't a bug. It's a missing type.

A Grammar of Machine Learning Workflows

Roth, S. (2026). EPAGOGY. arXiv:2603.10742

The problem

648 published papers across 30 scientific fields contain data leakage (Kapoor & Narayanan, 2025). The predictions looked right. They didn't replicate. The dominant response has been documentation: checklists, linters, best-practice guides. Documentation reduces errors but does not eliminate structural failure modes. You can follow every checklist and still leak — because the tools never made the wrong thing impossible.

Why a grammar

Every domain that needed to compose complex artifacts eventually developed a grammar. Chomsky decomposed language into phrase rules (1957). Bertin decomposed cartography into seven retinal variables (1967). Codd decomposed databases into eight relational operators (1970). Wickham decomposed statistical graphics into layers, scales, and coordinates (2010). A grammar is not aspirational — it is an artifact of maturity.

Machine learning has libraries, frameworks, and AutoML systems. It does not have a grammar. This paper proposes one: eight kernel primitives connected by a typed directed acyclic graph (DAG), with four hard constraints that reject the two most damaging leakage classes at call time.

The core contribution

Every ML workflow has two fundamentally different operations: evaluate (cheap, repeatable, drives iteration) and assess (terminal, once, seals commitment). Every practitioner knows this distinction. No framework enforced it — until now.

The grammar's central mechanism is the terminal assess constraint: assess() returns a nominally distinct Evidence type, not Metrics. A second call on the same hold-out test set raises. This is not a convention. It is a guard that fires at call time. The wrong workflow does not produce a wrong answer — it does not run.
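A minimal sketch of how such a guard can behave. The names Evidence, Metrics, evaluate, and assess mirror the paper's terminology, but every implementation detail below is an illustrative assumption, not the published API:

```python
# Illustrative sketch only: the types and names follow the paper's
# terminology, but this is not the published EPAGOGY API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Metrics:
    """Returned by evaluate(): cheap, repeatable, drives iteration."""
    accuracy: float

@dataclass(frozen=True)
class Evidence:
    """Returned by assess(): nominally distinct from Metrics, terminal."""
    accuracy: float

class Partition:
    """Toy train/valid/test partition that remembers whether it was assessed."""
    def __init__(self, train, valid, test):
        self.train, self.valid, self.test = train, valid, test
        self._assessed = False

def evaluate(model, partition) -> Metrics:
    """Repeatable measurement on the validation set (iterate zone)."""
    return Metrics(accuracy=model.score(partition.valid))

def assess(model, partition) -> Evidence:
    """Terminal measurement. A second call on the same hold-out raises."""
    if partition._assessed:
        raise RuntimeError("assess() was already called on this hold-out; "
                           "the wrong workflow does not run.")
    partition._assessed = True
    return Evidence(accuracy=model.score(partition.test))

class ToyModel:
    """Stand-in model so the sketch runs end to end."""
    def score(self, rows):
        return 0.87
```

The nominal distinction is the point: a second assess() on the same Partition raises at call time, and the Evidence from the first call cannot be regenerated by further peeking.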

What exists today

sklearn prevents some preprocessing leakage inside Pipeline. It has no evaluate/assess distinction and no structural rejection mechanism. tidymodels enforces per-fold preprocessing through recipes — the closest prior work and an important predecessor. It does not enforce terminal assess. mlr3 provides formally typed PipeOps, a composition model but not a grammar. NBLyzer tracks partition labels through abstract interpretation with 93% precision — but detects leakage after the workflow is written, not at call time.

No existing framework enforces the evaluate/assess boundary. That is the gap this grammar fills.

What the grammar does not do

Chomsky observed that “colorless green ideas sleep furiously” is grammatically valid but semantically nonsense. The ML analogue: logistic regression on 1M rows when XGBoost dominates, accuracy on a 99/1 class split, k-fold CV on time series. All structurally valid. All poor decisions. A grammar prevents accidents. It does not prevent bad judgment.

The grammar covers tabular supervised learning. It assumes a stationary data-generating process. Under concept drift, type guarantees hold but Evidence no longer measures generalization. Extending to other paradigms requires analogous empirical baselines first.

Implementations

Two maintained implementations — Python and R — demonstrate the claims with identical type signatures. A third implementation in Julia (7/8 primitives, 245 tests) validates that the specification is independently implementable. The appendix specification lets anyone build a conforming version.

Eight primitives. One DAG.

Invalid compositions have no derivation. They are rejected at call time, not caught after the fact.

data zone
  split()    : DataFrame → Partition. Partition into train, valid, test; establishes the assessment boundary.
  cv()       : Partition → CVResult. Create a k-fold rotation from a partition strategy.
  prepare()  : DataFrame → PreparedData. Normalize, encode, impute. Per fold, after split; never on the full dataset.

iterate zone
  fit()      : DataFrame × target → Model. Train a model. Handles preparation internally; seed required.
  predict()  : Model × DataFrame → Predictions. Generate predictions. No partition constraint; works on any data.
  evaluate() : Model × DataFrame → Metrics. Measure on validation. Repeatable; drives the iterate zone.
  explain()  : Model → Explanation. Feature importance, partial dependence. Diagnostic; outside the validity chain.

commit zone
  assess()   : Model × DataFrame → Evidence. Terminal measurement, once per hold-out test partition. Returns Evidence, not Metrics.
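A toy composition of the zones, data → iterate → commit, using the paper's primitive names as stand-ins. The bodies are illustrative assumptions; the real signatures live in the appendix specification:

```python
# Toy end-to-end composition: split() establishes the boundary, fit()
# never sees valid or test, evaluate() is freely repeatable. All bodies
# here are illustrative stand-ins, not the published implementation.
import random

def split(rows, seed):
    """data zone: deterministic 60/20/20 partition into train/valid/test."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    return {"train": shuffled[: int(0.6 * n)],
            "valid": shuffled[int(0.6 * n): int(0.8 * n)],
            "test":  shuffled[int(0.8 * n):]}

def fit(train_rows, seed):
    """iterate zone: 'train' a mean predictor. The seed is required even
    for a deterministic toy, matching the grammar's rule for fit()."""
    ys = [y for _, y in train_rows]
    return sum(ys) / len(ys)

def evaluate(model, rows):
    """iterate zone: repeatable mean-absolute-error measurement."""
    return sum(abs(model - y) for _, y in rows) / len(rows)

rows = [(x, float(x % 5)) for x in range(100)]
parts = split(rows, seed=42)              # establishes the assessment boundary
model = fit(parts["train"], seed=42)      # never touches valid or test
val_mae = evaluate(model, parts["valid"])  # call as often as iteration needs
```

Note what is absent: nothing in the iterate zone can reach parts["test"], because only assess() consumes the hold-out, and only once.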

Constraints

Every framework prevents some leakage. None prevented all of it.

Four rules. Not one. Not forty. That's the whole game. For now.

C1 (dz = 0.93): Assess once per hold-out. Repeated test-set peeking inflates results; a second assess() on the same hold-out test set raises.
C2 (dz ≈ 0): Prepare after split, per fold. The measured effect is negligible, but the constraint is costless and principled.
C3 (dz = 0.53–1.11): Type-safe transitions. Fitting on test data, evaluating without a model: invalid compositions have no derivation.
C4 (dz = 0.93): No label access before split. Feature selection using test labels inflates results; the guard on fit() rejects untagged data.
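C2 is the easiest rule to violate silently: fitting a scaler on the full dataset leaks hold-out statistics into preparation. A sketch of the compliant pattern, with a toy standardizer standing in for the grammar's prepare() (which also covers encoding and imputation):

```python
# Sketch of constraint C2: preparation statistics are fit on the
# training folds only, then applied to the held-out fold. The scaler
# is a toy stand-in for prepare(), not the published implementation.
def fit_scaler(train_xs):
    """Learn standardization parameters from training data only."""
    mu = sum(train_xs) / len(train_xs)
    var = sum((x - mu) ** 2 for x in train_xs) / len(train_xs)
    sd = var ** 0.5 or 1.0          # guard against zero variance
    return mu, sd

def apply_scaler(params, xs):
    mu, sd = params
    return [(x - mu) / sd for x in xs]

def per_fold_prepare(folds):
    """For each rotation, the scaler sees only the other folds' rows,
    so no statistic of the held-out rows leaks into preparation."""
    out = []
    for held_out in range(len(folds)):
        train = [x for i, f in enumerate(folds) if i != held_out for x in f]
        params = fit_scaler(train)          # after split, per fold
        out.append(apply_scaler(params, folds[held_out]))
    return out
```

The leaky variant, fit_scaler over all folds once, differs by one line and produces numbers that look just as plausible, which is why C2 is a structural rule rather than a code-review item.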

Links

Zenodo preprint · arXiv preprint · GitHub implementation

Citation

Roth, S. (2026). A Grammar of Machine Learning Workflows.
EPAGOGY. arXiv:2603.10742.
https://zenodo.org/records/19023838