ml
ML pipelines leak. What about yours?
A Python & R library where the most common structural mistakes are type errors, not silent wrong answers.
pip install mlw · 2,300+ tests

A student who grades their own homework learns nothing new. Most ML libraries allow exactly that: picking the best model on the test set, reporting the highest AUC across seeds, training on data that mirrors the evaluation rows. None of this raises an error in any major ML library by default. It produces plausible numbers that don't hold up.
The textbook rule (“use the test set only once”) is documentation, not enforcement. ml makes it a type error.
vs. scikit-learn
Same algorithms. Same data. The difference is what happens when you make a mistake — in sklearn, nothing. In ml, a type error.
Every algorithm. Same data. Real numbers.
3 seeds · 21 datasets · single core · ml Rust kernel vs sklearn, XGBoost, LightGBM, tidymodels and ml R. All results reported.
Script and raw data on systats/ml-bench.
3 seeds · 60/20/20 split · single core · identical preprocessing and hyperparameters. AUC = median over seeds. Green = ml faster, grey = sklearn faster — both are reported honestly. — script · raw JSON.
| Algorithm | ml AUC | sklearn AUC | Δ AUC | ml time | sklearn time | speed |
|---|---|---|---|---|---|---|
| Loading live results… | | | | | | |
Tier 1: same algorithm, same data split, single core (n_jobs=1). AUC = median of 3 independent seeds. Speedup = sklearn time ÷ ml time. Tree-based models (RF, ET, GBT, AdaBoost): ml Rust kernel is significantly faster. Linear models (Logistic, SVM, KNN, NB): sklearn’s LAPACK/liblinear is faster — grey cells, documented. Script and raw JSON on systats/ml-bench. Updates hourly.
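The aggregation rule above is simple enough to sketch. The numbers below are illustrative placeholders, not results from the benchmark:

```python
from statistics import median

# Per-seed results for one algorithm (made-up numbers, not the
# published benchmark data).
ml_auc = [0.913, 0.917, 0.915]
sklearn_auc = [0.914, 0.916, 0.915]
ml_time_s = 0.42
sklearn_time_s = 1.26

# Reported AUC is the median over the 3 independent seeds.
delta_auc = median(ml_auc) - median(sklearn_auc)

# Speedup = sklearn time / ml time; values above 1 mean ml is faster.
speedup = sklearn_time_s / ml_time_s

print(f"Δ AUC = {delta_auc:+.3f}, speedup = {speedup:.1f}×")
```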
A typed DAG, not a pipeline.
Every ML workflow is a directed acyclic graph with exactly one terminal boundary. Above it: iterate freely on validation data. Below it: commit once to test.
assess() is terminal. One measurement on held-out data. A second call on the same split raises. Tested, not argued.
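The terminal-boundary behaviour can be sketched in a few lines of plain Python. This is an illustrative guard written for this page, not ml's actual implementation; the class and error names are invented:

```python
class TerminalAssessError(RuntimeError):
    """Raised when the same held-out split is measured twice."""

class OneShotAssessment:
    # Sketch of a terminal boundary: the first call measures,
    # every later call on the same split raises.
    def __init__(self, metric):
        self.metric = metric
        self._spent = False

    def assess(self, y_true, y_pred):
        if self._spent:
            raise TerminalAssessError(
                "assess() already consumed this test split")
        self._spent = True
        return self.metric(y_true, y_pred)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

gate = OneShotAssessment(accuracy)
print(gate.assess([1, 0, 1], [1, 1, 1]))  # first call: measures
try:
    gate.assess([1, 0, 1], [1, 0, 1])     # second call: raises
except TerminalAssessError as e:
    print("blocked:", e)
```

The point of the sketch is that the second measurement is rejected structurally, at call time, rather than discouraged in documentation.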
The underlying leakage surface was measured across 2,047 datasets from 17 scientific fields — lower-bound estimates, not upper bounds. The grammar was validated through 100+ structured adversarial reviews and 2,300+ automated regression tests.
Across 2,047 datasets, the result inverts conventional wisdom. Fitting a normaliser before splitting — the textbook warning — produces near-zero inflation: all nine conditions stay below 0.005 AUC. The damaging classes are selection leakage (peeking at test performance, cherry-picking seeds, early stopping on the test set) and memorisation leakage (duplicates, oversampling) — both large effects, both invisible to the standard pipeline. The most-taught leakage class has near-zero effect; the classes that go undetected are the ones the data shows matter.
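Memorisation leakage is easy to reproduce in miniature. A self-contained sketch, independent of any ML library: the labels are coin flips, so honest accuracy on unseen rows is 50%, yet duplicating rows before the split lets a pure memoriser score far higher:

```python
import random

random.seed(0)

# Tiny dataset: 100 rows, label is a coin flip, so no model can
# genuinely beat 50% accuracy on rows it has never seen.
rows = [(i, random.randint(0, 1)) for i in range(100)]

# Memorisation leakage: duplicate the data BEFORE splitting, so
# copies of the same row land on both sides of the split.
duplicated = rows + rows
random.shuffle(duplicated)
train, test = duplicated[:100], duplicated[100:]

# A "model" that does nothing but memorise training labels.
memo = {x: y for x, y in train}
hits = sum(memo.get(x, 0) == y for x, y in test)
print(f"accuracy with duplicate leakage: {hits / len(test):.0%}")
```

The memoriser lands well above chance on labels that are pure noise, and nothing in the pipeline raised an error.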
arXiv:2604.04199 — methods, data, replication code.
What happens when the data isn’t clean?
ml handles mixed types automatically. sklearn requires manual LabelEncoder + OrdinalEncoder.
Tree-based models handle NaN natively. Raw sklearn raises ValueError. Linear and distance-based models require explicit imputation in both.
ml auto-scales for SVM, KNN, and logistic. sklearn scales correctly too — if you remember to add a StandardScaler. Forgetting it raises no error; the model simply performs near-randomly.
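Why a forgotten scaler matters is easy to see with a toy nearest-neighbour query; the data below is made up for illustration:

```python
from math import dist
from statistics import mean, stdev

# Two features on wildly different scales: income (~tens of
# thousands) and age (~tens). Euclidean distance is dominated by
# income unless the features are standardised.
points = [(42_000.0, 25.0), (43_000.0, 60.0), (90_000.0, 26.0)]
query = (42_900.0, 27.0)

# Without scaling: the income axis swamps the age axis.
unscaled = min(points, key=lambda p: dist(p, query))

# Standardise each feature to zero mean, unit variance.
cols = list(zip(*points))
mu = [mean(c) for c in cols]
sd = [stdev(c) for c in cols]

def scale(p):
    return tuple((v - m) / s for v, m, s in zip(p, mu, sd))

scaled = min(points, key=lambda p: dist(scale(p), scale(query)))
print("unscaled nearest:", unscaled)  # the 60-year-old, by income alone
print("scaled nearest:  ", scaled)    # the genuinely similar row
```

Unscaled, the query's neighbour is the row with a similar income and a wildly different age; standardised, it is the row that is actually similar. Neither run raises an error, which is exactly the failure mode described above.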
assess() raises on any call after the first. sklearn permits repeated test-set evaluation; ml rejects it at call time.
Install in 30 seconds.
Python 3.10+ or R 4.1+. Native Rust backend included.
# core
$ pip install mlw
# + XGBoost (recommended)
$ pip install "mlw[xgboost]"
# everything
$ pip install "mlw[all]"

# R: requires the remotes package and a Rust toolchain (rustup.rs)
> install.packages("remotes")
> remotes::install_github("epagogy/ml", subdir = "r")
# Also available on CRAN: install.packages("ml")

When to stop using ml: when your framework of choice enforces all four constraints natively. I look forward to that day.