Preprint · arXiv:2603.10742
Most ML pipelines leak. Most teams don’t know.
Thirty fields. Published and cited before anyone noticed. Not sloppy code. Structural errors the tools made invisible.
648 cases: Kapoor & Narayanan's living survey (2025). 329: their arXiv preprint (2022). 294: the version published in Patterns (2023).
The numbers looked right. The predictions didn’t replicate. The tools never flagged it because they weren’t designed to. Leakage isn’t a bug. It’s a missing type.
A Grammar of Machine Learning Workflows
The problem
648 published papers across 30 scientific fields contain data leakage (Kapoor & Narayanan, 2025). The predictions looked right. They didn’t replicate. The dominant response has been documentation: checklists, linters, best-practice guides. Documentation reduces errors but does not close structural failures. You can follow every checklist and still leak — because the tools never made the wrong thing impossible.
Why a grammar
Every domain that needed to compose complex artifacts eventually developed a grammar. Chomsky decomposed language into phrase rules (1957). Bertin decomposed visual representation into seven retinal variables (1967). Codd decomposed databases into eight relational operators (1970). Wickham decomposed statistical graphics into layers, scales, and coordinates (2010). A grammar is not aspirational — it is an artifact of maturity.
Machine learning has libraries, frameworks, and AutoML systems. It does not have a grammar. This paper proposes one: eight kernel primitives connected by a typed directed acyclic graph (DAG), with four hard constraints that reject the two most damaging leakage classes at call time.
The core contribution
Every ML workflow has two fundamentally different operations: evaluate (cheap, repeatable, drives iteration) and assess (terminal, once, seals commitment). Every practitioner knows this distinction. No framework enforced it — until now.
The grammar’s central mechanism is the terminal assess constraint: assess() returns a nominally distinct Evidence type, not Metrics. A second call on the same hold-out test set raises. This is not a convention. It is a guard that fires at call time. The wrong workflow does not produce a wrong answer — it does not run.
```python
# Correct: evaluate freely, assess once
model = ml.fit(train, target="y", seed=42)
metrics = ml.evaluate(model, valid)  # repeatable
result = ml.assess(model, test)      # Evidence, not Metrics

# Wrong: second assess raises immediately
result2 = ml.assess(model, test)
# AssessmentError: hold-out already consumed
```

What exists today
- sklearn prevents some preprocessing leakage inside Pipeline. It has no evaluate/assess distinction: nothing prevents repeated scoring on the test set or marks it as spent.
- tidymodels enforces per-fold preprocessing through recipes, the closest prior work and an important predecessor. It does not enforce terminal assess.
- mlr3 provides formally typed PipeOps: a composition model, but not a grammar.
- NBLyzer tracks partition labels through abstract interpretation with 93% precision, but detects leakage after the workflow is written, not at call time.
No existing framework enforces the evaluate/assess boundary. That is the gap this grammar fills.
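The one-shot boundary is simple enough to sketch in plain Python. This is a minimal illustration of the mechanism, not the paper's implementation: the `Session` class, the identity-based registry, and the toy `score()` interface are our assumptions; only the names `Evidence`, `Metrics`, and `AssessmentError` come from the text above.

```python
# Sketch of the terminal-assess guard: evaluate() is repeatable, assess()
# consumes its hold-out partition and raises on a second call.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Metrics:        # returned by evaluate(); repeatable
    accuracy: float

@dataclass(frozen=True)
class Evidence:       # returned by assess(); nominally distinct, terminal
    accuracy: float

class AssessmentError(RuntimeError):
    pass

@dataclass
class Session:
    _consumed: set = field(default_factory=set)

    def evaluate(self, model, data) -> Metrics:
        # no guard: evaluation drives the development loop
        return Metrics(accuracy=model.score(data))

    def assess(self, model, test) -> Evidence:
        # identify the hold-out by object identity; a real system
        # would use a stable fingerprint of the partition
        key = id(test)
        if key in self._consumed:
            raise AssessmentError("hold-out already consumed")
        self._consumed.add(key)
        return Evidence(accuracy=model.score(test))
```

Returning a nominally distinct `Evidence` type means downstream code that expects `Metrics` cannot silently accept a terminal assessment in its place.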
What the grammar does not do
Chomsky observed that “colorless green ideas sleep furiously” is grammatically valid but semantically nonsense. The ML analogue: logistic regression on 1M rows when XGBoost dominates, accuracy on a 99/1 class split, k-fold CV on time series. All accepted by today’s ML libraries. None scientifically sound. A grammar prevents structural accidents. It does not prevent bad judgment.
Eight primitives. One DAG.
Invalid compositions have no derivation. They are rejected at call time, not caught after the fact.
| Zone | Primitive | Type signature | What it does |
|---|---|---|---|
| data | split() | DataFrame → Partition | Partition into train, valid, test. Establishes the assessment boundary. |
| data | cv() | DataFrame × Strategy → CVResult | Run k-fold cross-validation. Applies prepare and fit per fold. |
| data | prepare() | DataFrame → PreparedData | Normalize, encode, impute. Per fold, after split. Never on the full dataset. |
| iterate | fit() | DataFrame × str → Model | Train a model on the training partition. Target is a column name. Seed required. |
| iterate | predict() | Model × DataFrame → Predictions | Generate predictions. No partition constraint; works on any data. |
| iterate | evaluate() | Model × DataFrame → Metrics | Measure performance on validation data. Repeatable. Drives the development loop. |
| iterate | explain() | Model → Explanation | Feature importance, partial dependence. Diagnostic, outside the validity chain. |
| commit | assess() | Model × TestPartition → Evidence | Terminal measurement. Once per hold-out test partition. Returns Evidence, not Metrics. A second call raises. |
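The type signatures above suggest how C3-style rejection could work in plain Python: give partitions nominally distinct types, so a composition like fit-on-test has no valid derivation under a type checker, with a runtime backstop for untyped callers. All class and function names here are an illustrative sketch under that assumption, not the grammar's actual API.

```python
# Nominally distinct partition types: fit() accepts only TrainPartition,
# so fit(test_part, ...) is flagged by mypy and rejected at runtime.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainPartition:
    rows: list

@dataclass(frozen=True)
class TestPartition:
    rows: list

@dataclass(frozen=True)
class Model:
    target: str

def fit(train: TrainPartition, target: str, seed: int) -> Model:
    # runtime backstop; a static checker catches this before the code runs
    if not isinstance(train, TrainPartition):
        raise TypeError("fit() requires the training partition")
    return Model(target=target)   # seed accepted per the table; unused in this toy

train_part = TrainPartition(rows=[{"x": 1, "y": 0}])
test_part = TestPartition(rows=[{"x": 2, "y": 1}])

model = fit(train_part, target="y", seed=0)   # well-typed
# fit(test_part, target="y", seed=0)          # no derivation: rejected
```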
Four constraints
Every framework prevents some leakage. None prevented all of it. Four rules. Not one. Not forty.
| ID | Constraint | Effect size (dz) | What it prevents |
|---|---|---|---|
| C1 | Assess once per hold-out | 0.93 | Repeated test-set peeking inflates results. A second assess() on the same hold-out test set raises. |
| C2 | Prepare after split, per fold | ≈ 0 | Effect is negligible, but the constraint is costless and principled. |
| C3 | Type-safe transitions | 0.53–1.11 | Fitting on test data, evaluating without a model: invalid compositions have no derivation. |
| C4 | No label access before split | 0.93 | Feature selection using test labels inflates. The guard on fit() rejects untagged data. |
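What C2 demands can be illustrated with the standard library alone: preparation statistics (here, mean and standard deviation for standardization) are computed on each fold's training split only and merely applied, never refit, to the held-out split. The fold logic and names below are ours, a stand-in for the grammar's cv() and prepare(), not their implementation.

```python
# Per-fold preparation: the scaler is fitted inside each fold, on the
# training split only, so no statistic ever sees held-out rows.
from statistics import mean, stdev

def kfold(xs, k):
    """Yield (train_indices, valid_indices) for k contiguous folds."""
    n = len(xs)
    for i in range(k):
        valid = list(range(i * n // k, (i + 1) * n // k))
        train = [j for j in range(n) if j not in valid]
        yield train, valid

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
for train_idx, valid_idx in kfold(data, k=3):
    train = [data[j] for j in train_idx]
    mu, sigma = mean(train), stdev(train)              # fitted on this fold's train split only
    scaled_valid = [(data[j] - mu) / sigma for j in valid_idx]   # applied, never refit
```

Fitting the scaler on the full dataset instead would leak each validation row's value into the statistics used to transform it, which is exactly the pattern C2 forbids.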
Citation
Roth, Simon (2026). A Grammar of Machine Learning Workflows. EPAGOGY. arXiv:2603.10742.
https://zenodo.org/records/19023838
BibTeX
@misc{roth2026grammar,
  title = {A Grammar of Machine Learning Workflows},
  author = {Roth, Simon},
  year = {2026},
  publisher = {EPAGOGY},
  note = {arXiv:2603.10742},
  url = {https://zenodo.org/records/19023838}
}
Questions or critique
Preprints are falsifiable by design. If you find a structural error, get in touch.