A research programme on structural correctness for reasoning workflows. Independent scientific research.
The missing third layer
You can pass every benchmark. Every lint check. Every code review. The pipeline can still be structurally broken — and nothing warns you. The error is not in the data. Not in the model. It is in how the operations compose. This is what the existing tools were not designed to catch.
The AI ecosystem has two well-developed layers. One that builds models. One that tests them — instance by instance, benchmark by benchmark. The third layer is still forming: defining structural correctness for the reasoning workflows and the enforcement mechanisms that connect models to scientific conclusions at call time.
Structural correctness
✱ Enforced by default. Types reject invalid workflow compositions at call time. No configuration. No discipline required. No way to forget.
Instance testing asks: did the model pass? Structural correctness asks: is the workflow's reasoning valid by construction? A pipeline can clear every benchmark, lint check, and code review and still be structurally broken, because the error lives not in the data or the model but in how the operations compose epistemically.
This is not a new problem. It is an under-formalised one. The knowledge to prevent it has existed for years; what is missing is a typed infrastructure that enforces what the textbooks teach.
A grammar for ML workflows
A grammar — in the Wilkinson–Wickham sense — is a small set of typed primitives with composition rules and a rejection criterion. In sklearn, tidymodels, and every major AutoML framework, structurally invalid workflows still produce output. In the ML grammar, invalid compositions are type errors — caught before any result is produced.
Eight typed primitives. Four hard constraints. A directed acyclic graph that makes the most common structural errors in supervised ML unrepresentable. The core mechanism is a terminal assessment gate: the boundary between formative evaluation and summative judgment, enforced via type system rather than in documentation.
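As an illustration of what a terminal assessment gate can look like, here is a minimal sketch in Python. The names (`TestSet`, `assess`, `AssessmentConsumedError`) are invented for this example and are not the project's actual API; the real library may encode the gate differently.

```python
from dataclasses import dataclass, field

class AssessmentConsumedError(TypeError):
    """Raised when the terminal assessment gate is crossed twice."""

@dataclass
class TestSet:
    """Held-out test data behind a single-use gate (hypothetical API)."""
    X: list
    y: list
    _consumed: bool = field(default=False, repr=False)

    def assess(self, model) -> float:
        """Summative judgment: permitted exactly once per test set."""
        if self._consumed:
            raise AssessmentConsumedError(
                "test set already assessed; summative judgment is terminal"
            )
        self._consumed = True
        predictions = [model.predict(x) for x in self.X]
        return sum(p == t for p, t in zip(predictions, self.y)) / len(self.y)
```

A second `assess()` call on the same `TestSet` is rejected at call time, with no configuration and no reliance on caller discipline. Python can only enforce the gate dynamically; a stricter encoding would make the second call a static type error, in the spirit of affine types.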
The implementation is open source. The test suite runs. Three directional predictions were recorded before observing results — two confirmed, one falsified.
Going deeper into ML
The supervised learning grammar is the base case — the simplest valid ML workflow. Split the data. Fit the model. Evaluate. Assess once. Every other subfield of ML takes that skeleton and adds structural complexity: reinforcement learning adds time and live interaction, active learning adds the labeling loop, machine reasoning adds multi-step inference. The base case is not the most important; it is the irreducible minimum from which the others derive.
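The base-case skeleton fits in a few lines. The sketch below is illustrative only, with a toy mean-predicting "model" and invented function names, not the project's API; it shows the four stages and their only valid order, and deliberately omits the assess-once enforcement.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Split:
    """Disjoint partitions, produced once, before any fitting."""
    train: list
    val: list
    test: list

def split(data: list) -> Split:
    n = len(data)
    return Split(data[: n // 2], data[n // 2 : 3 * n // 4], data[3 * n // 4 :])

@dataclass(frozen=True)
class Model:
    loc: float  # toy model: always predicts the training mean

def fit(s: Split) -> Model:
    return Model(mean(s.train))

def evaluate(m: Model, s: Split) -> float:
    """Formative: scored on validation data, freely repeatable."""
    return mean((x - m.loc) ** 2 for x in s.val)

def assess(m: Model, s: Split) -> float:
    """Summative: scored on test data, meant to run exactly once."""
    return mean((x - m.loc) ** 2 for x in s.test)

s = split([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
m = fit(s)
val_mse = evaluate(m, s)   # 9.25
test_mse = assess(m, s)    # 25.25
```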
The programme goes deeper by asking what each subfield adds to that base, and whether the same typed approach — rejection at call time — extends to the new constraints. Each extension inherits the base grammar and adds what the simpler case cannot express.
The deeper question is whether the approach transfers. A typed grammar with rejection criteria is a formal object — its properties do not depend on the domain. Whether the structural move that prevents test-set reuse in ML holds under translation to other reasoning workflows, and whether the constraints survive that translation, is the active research question.
What I don’t know yet
01
Intent versus structure
The grammar prevents accidents. But run evaluate() fifty times, pick the best result, then call assess(). Every rule was followed, and the final estimate still leaked. Structure prevents negligence. Can it prevent intent?
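The leak is easy to demonstrate. In the sketch below (illustrative names, not the project's API), fifty candidate "models" with no real skill are scored on the same validation set; selecting the maximum yields an optimistic score without violating any single-call rule.

```python
import random

def evaluate(model_seed: int) -> float:
    # Formative check. The "models" here are pure noise, so any
    # apparent skill in the selected one is selection bias alone.
    return random.Random(model_seed).random()

# Fifty perfectly legal evaluate() calls, then pick the winner.
best = max(range(50), key=evaluate)

# The winner's validation score is the maximum of fifty noise draws
# (expected value roughly 0.98 for uniform noise), while every
# candidate has the same true skill. A subsequent assess() inherits
# a model chosen by peeking, and no structural rule was broken.
best_score = evaluate(best)
```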
02
The self-reference problem
The ML grammar cannot verify its own correctness. You cannot be both learner and judge — the grammar cannot assess() itself. This is not a flaw in the ML grammar; it is a structural property of all grammars. Does every grammar require an external verifier? And if so, what verifies the verifier?
03
Where structure ends
The grammar knows when a workflow is structurally valid. It does not know whether the question was worth asking. Thresholds, costs, fairness, and the right to act on a result — these sit between a measurement and the decision it justifies. Where does structure end and judgment begin?
Method
επαγωγή — Aristotle’s term for induction: reasoning from particular observations to general constraints. Prior Analytics, II.23.
What I expect is written down before I look at the data. Across both published papers, four of sixteen pre-registered directional predictions were not confirmed. All four are reported. The failures were more informative than the confirmations.
Every formal claim ships as working code. Theory without implementation is speculation. Code without theory is a library. The implementation is the proof — anyone can run it.
The implementation has been through hundreds of waves of stress testing: edge cases, boundary violations, adversarial inputs designed to slip past the type constraints. Each wave targets a specific failure mode: does the library reject what it claims to reject? Breaking it is the point.
Every hypothesis has a stated failure condition, recorded before the experiment runs. If the structural guarantee does not hold after all implementation errors are ruled out, the result is published as a falsification. The programme can fail. That is what makes it scientific.
Code is MIT licensed. Papers go on Zenodo and arXiv. Paper, data, and analysis scripts are public on GitHub. Replication by default — it is the condition of the claim.
Where the question started
The question started with bias in ML pipelines. Not in any single model, but across the full computational workflow — data collection, transformation, training, validation, execution. The doctoral work (Graduate School of Decision Sciences, Konstanz, 2022) proposed a typology: discrimination bias, inductive bias, evaluation bias. Three categories, each with a different cause and a different remedy.
The first study built the typology by running it against real ML pipelines — including my own. Evaluation bias turned up where I hadn’t expected it: results that had looked plausible didn’t hold. The second study applied it outward — auditing Twitter’s recommendation algorithm for amplification bias at scale, using bot simulations to show that filter bubbles are driven by engagement, not by who users follow. The third took it further: predicting voter turnout and party affinity from social media communication. The same biases. Much larger data. Different stakes.
Inductive bias was expected. Every model family makes structural assumptions, and the bias–variance decomposition quantifies the cost. Evaluation bias was different. The experiments found that violations of the validation–test boundary — how sample splitting interacts with final performance estimates, how the test set is touched during model selection — produced inflation that was not random. It followed structurally from how the operations composed.
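For reference, the decomposition mentioned above, in its standard form for squared error, with $y = f(x) + \varepsilon$ and noise variance $\sigma^2$:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The bias term is the structural cost of the model family's assumptions; the variance and noise terms are not reducible by choosing a different inductive bias alone.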
That observation was the pivot. A bias you detect after the fact is an auditing problem. A bias that follows structurally from how the workflow is assembled is a type problem — and type problems have formal remedies.
Leakage and bias are distinct failures, but they share a root cause: pipeline stages that should be isolated are not. A model whose training process is contaminated by test data cannot be trusted to measure bias either. The grammar does not detect bias — but it closes the methodological precondition. Honest measurement requires honest evaluation. Honest assessment is what the type system enforces.
The ML grammar is that remedy. The evaluate / assess boundary — validation data reusable, test data terminal — is the call-time enforcement of the assessment gate the typology identified. As of April 2026, no existing ML framework enforces this distinction as a type-level constraint at call time. Where the dissertation described the bias, the grammar rejects it.
Biased Machines in the Realm of Politics — Roth, S. (2022), Universität Konstanz
Questions. Critique. Collaboration.
If your field has a correctness problem that existing tools weren’t designed to catch, get in touch. Independent research, Stuttgart.