Preprint · arXiv:2603.10742
Most ML pipelines leak. Most teams don’t know.
Thirty fields. Published and cited before anyone noticed. Not sloppy code. Structural errors the tools made invisible.
648 cases: Kapoor & Narayanan's living survey (2025). 329: their arXiv preprint (2022). 294: the version published in Patterns (2023).
The numbers looked right. The predictions didn’t replicate. The tools never flagged it because they weren’t designed to. Leakage isn’t a bug. It’s a missing type.
A Grammar of Machine Learning Workflows
The problem
648 published papers across 30 scientific fields contain data leakage (Kapoor & Narayanan, 2025). The predictions looked right. They didn’t replicate. The dominant response has been documentation: checklists, linters, best-practice guides. Documentation reduces errors but does not close structural failures. You can follow every checklist and still leak — because the tools never made the wrong thing impossible.
Why a grammar
Every domain that needed to compose complex artifacts eventually developed a grammar. Chomsky decomposed language into phrase rules (1957). Bertin decomposed visual representation into seven retinal variables (1967). Codd decomposed databases into eight relational operators (1970). Wickham decomposed statistical graphics into layers, scales, and coordinates (2010). A grammar is not aspirational — it is an artifact of maturity.
Machine learning has libraries, frameworks, and AutoML systems. It does not have a grammar. This paper proposes one: eight kernel primitives connected by a typed directed acyclic graph (DAG), with four hard constraints that reject the two most damaging leakage classes at call time.
The core contribution
Every ML workflow has two fundamentally different operations: evaluate (cheap, repeatable, drives iteration) and assess (terminal, once, seals commitment). Every practitioner knows this distinction. No framework enforced it — until now.
The grammar’s central mechanism is the terminal assess constraint: assess() returns a nominally distinct Evidence type, not Metrics. A second call on the same hold-out test set raises. This is not a convention. It is a guard that fires at call time. The wrong workflow does not produce a wrong answer — it does not run.
```python
# Correct: evaluate freely, assess once
model = ml.fit(train, target="y", seed=42)
metrics = ml.evaluate(model, valid)  # repeatable
result = ml.assess(model, test)      # Evidence, not Metrics

# Wrong: second assess raises immediately
result2 = ml.assess(model, test)
# AssessmentError: hold-out already consumed
```

What exists today
- sklearn prevents some preprocessing leakage inside Pipeline. It has no evaluate/assess distinction: nothing prevents repeated scoring on the test set or marks it as spent.
- tidymodels enforces per-fold preprocessing through recipes, the closest prior work and an important predecessor. It does not enforce terminal assess.
- mlr3 provides formally typed PipeOps: a composition model, but not a grammar.
- NBLyzer tracks partition labels through abstract interpretation with 93% precision, but detects leakage after the workflow is written, not at call time.
No existing framework enforces the evaluate/assess boundary. That is the gap this grammar fills.
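The one-shot boundary is simple enough to sketch in plain Python. This is a minimal illustration of the mechanism, not the paper's implementation: the `Session` class, the identity-based registry, and the toy `score()` interface are our assumptions; only the names `Evidence`, `Metrics`, and `AssessmentError` come from the text above.

```python
# Sketch of the terminal-assess guard: evaluate() is repeatable, assess()
# consumes its hold-out partition and raises on a second call.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Metrics:        # returned by evaluate(); repeatable
    accuracy: float

@dataclass(frozen=True)
class Evidence:       # returned by assess(); nominally distinct, terminal
    accuracy: float

class AssessmentError(RuntimeError):
    pass

@dataclass
class Session:
    _consumed: set = field(default_factory=set)

    def evaluate(self, model, data) -> Metrics:
        # no guard: evaluation drives the development loop
        return Metrics(accuracy=model.score(data))

    def assess(self, model, test) -> Evidence:
        # identify the hold-out by object identity; a real system
        # would use a stable fingerprint of the partition
        key = id(test)
        if key in self._consumed:
            raise AssessmentError("hold-out already consumed")
        self._consumed.add(key)
        return Evidence(accuracy=model.score(test))
```

Returning a nominally distinct `Evidence` type means downstream code that expects `Metrics` cannot silently accept a terminal assessment in its place.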
What the grammar does not do
Chomsky observed that “colorless green ideas sleep furiously” is grammatically valid but semantically nonsense. The ML analogue: logistic regression on 1M rows when XGBoost dominates, accuracy on a 99/1 class split, k-fold CV on time series. All accepted by today’s ML libraries. None scientifically sound. A grammar prevents structural accidents. It does not prevent bad judgment.
Eight primitives. One DAG.
Invalid compositions have no derivation. They are rejected at call time, not caught after the fact.
| Zone | Primitive | Type signature | What it does |
|---|---|---|---|
| data | split() | DataFrame → Partition | Partition into train, valid, test. Establishes the assessment boundary. |
| data | cv() | DataFrame × Strategy → CVResult | Run k-fold cross-validation. Applies prepare and fit per fold. |
| data | prepare() | DataFrame → PreparedData | Normalize, encode, impute. Per fold, after split. Never on the full dataset. |
| iterate | fit() | DataFrame × str → Model | Train a model on the training partition. Target is a column name. Seed required. |
| iterate | predict() | Model × DataFrame → Predictions | Generate predictions. No partition constraint; works on any data. |
| iterate | evaluate() | Model × DataFrame → Metrics | Measure performance on validation data. Repeatable. Drives the development loop. |
| iterate | explain() | Model → Explanation | Feature importance, partial dependence. Diagnostic, outside the validity chain. |
| commit | assess() | Model × TestPartition → Evidence | Terminal measurement. Once per hold-out test partition. Returns Evidence, not Metrics. A second call raises. |
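The type signatures above suggest how C3-style rejection could work in plain Python: give partitions nominally distinct types, so a composition like fit-on-test has no valid derivation under a type checker, with a runtime backstop for untyped callers. All class and function names here are an illustrative sketch under that assumption, not the grammar's actual API.

```python
# Nominally distinct partition types: fit() accepts only TrainPartition,
# so fit(test_part, ...) is flagged by mypy and rejected at runtime.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainPartition:
    rows: list

@dataclass(frozen=True)
class TestPartition:
    rows: list

@dataclass(frozen=True)
class Model:
    target: str

def fit(train: TrainPartition, target: str, seed: int) -> Model:
    # runtime backstop; a static checker catches this before the code runs
    if not isinstance(train, TrainPartition):
        raise TypeError("fit() requires the training partition")
    return Model(target=target)   # seed accepted per the table; unused in this toy

train_part = TrainPartition(rows=[{"x": 1, "y": 0}])
test_part = TestPartition(rows=[{"x": 2, "y": 1}])

model = fit(train_part, target="y", seed=0)   # well-typed
# fit(test_part, target="y", seed=0)          # no derivation: rejected
```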
Four constraints
Every framework prevents some leakage. None prevented all of it. Four rules. Not one. Not forty.
| ID | Constraint | Effect size (dz) | What it prevents |
|---|---|---|---|
| C1 | Assess once per hold-out | 0.93 | Repeated test-set peeking inflates results. A second assess() on the same hold-out test set raises. |
| C2 | Prepare after split, per fold | ≈ 0 | Effect is negligible, but the constraint is costless and principled. |
| C3 | Type-safe transitions | 0.53–1.11 | Fitting on test data, evaluating without a model: invalid compositions have no derivation. |
| C4 | No label access before split | 0.93 | Feature selection using test labels inflates. The guard on fit() rejects untagged data. |
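What C2 demands can be illustrated with the standard library alone: preparation statistics (here, mean and standard deviation for standardization) are computed on each fold's training split only and merely applied, never refit, to the held-out split. The fold logic and names below are ours, a stand-in for the grammar's cv() and prepare(), not their implementation.

```python
# Per-fold preparation: the scaler is fitted inside each fold, on the
# training split only, so no statistic ever sees held-out rows.
from statistics import mean, stdev

def kfold(xs, k):
    """Yield (train_indices, valid_indices) for k contiguous folds."""
    n = len(xs)
    for i in range(k):
        valid = list(range(i * n // k, (i + 1) * n // k))
        train = [j for j in range(n) if j not in valid]
        yield train, valid

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
for train_idx, valid_idx in kfold(data, k=3):
    train = [data[j] for j in train_idx]
    mu, sigma = mean(train), stdev(train)              # fitted on this fold's train split only
    scaled_valid = [(data[j] - mu) / sigma for j in valid_idx]   # applied, never refit
```

Fitting the scaler on the full dataset instead would leak each validation row's value into the statistics used to transform it, which is exactly the pattern C2 forbids.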
Citation
Roth, Simon (2026). A Grammar of Machine Learning Workflows. EPAGOGY. arXiv:2603.10742.
https://zenodo.org/records/19023838
BibTeX
@misc{roth2026grammar,
  title = {A Grammar of Machine Learning Workflows},
  author = {Roth, Simon},
  year = {2026},
  publisher = {EPAGOGY},
  note = {arXiv:2603.10742},
  url = {https://zenodo.org/records/19023838}
}
Questions or critique
Preprints are falsifiable by design. If you find a structural error, get in touch.