ml
ML pipelines leak. What about yours?
A Python & R library where the most common structural mistakes are type errors, not silent wrong answers.
pip install mlw · 2,300+ tests

A student who grades their own homework learns nothing new. Most ML libraries allow exactly that: picking the best model on the test set, reporting the highest AUC across seeds, training on data that mirrors the evaluation rows. None of this raises an error in any major ML library by default. It produces plausible numbers that don't hold up.
The textbook rule (“use the test set only once”) is documentation, not enforcement. ml makes it a type error.
vs. scikit-learn
Same algorithms. Same data. The difference is what happens when you make a mistake — in sklearn, nothing. In ml, a type error.
Every algorithm. Same data. Real numbers.
3 seeds · 21 datasets · single core · ml Rust kernel vs sklearn, XGBoost, LightGBM, tidymodels and ml R. All results reported.
Script and raw data on systats/ml-bench.
3 seeds · 60/20/20 split · single core · identical preprocessing and hyperparameters. AUC = median over seeds. Green = ml faster, grey = sklearn faster — both are reported honestly. — script · raw JSON.
| Algorithm | ml AUC | sklearn AUC | Δ AUC | ml time | sklearn time | speed |
|---|---|---|---|---|---|---|
| Loading live results… | | | | | | |
Tier 1: same algorithm, same data split, single core (n_jobs=1). AUC = median of 3 independent seeds. Speedup = sklearn time ÷ ml time. Tree-based models (RF, ET, GBT, AdaBoost): ml Rust kernel is significantly faster. Linear models (Logistic, SVM, KNN, NB): sklearn’s LAPACK/liblinear is faster — grey cells, documented. Script and raw JSON on systats/ml-bench. Updates hourly.
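The aggregation rule above is simple enough to sketch. The numbers below are illustrative placeholders, not results from the benchmark:

```python
from statistics import median

# Per-seed results for one algorithm (made-up numbers, not the
# published benchmark data).
ml_auc = [0.913, 0.917, 0.915]
sklearn_auc = [0.914, 0.916, 0.915]
ml_time_s = 0.42
sklearn_time_s = 1.26

# Reported AUC is the median over the 3 independent seeds.
delta_auc = median(ml_auc) - median(sklearn_auc)

# Speedup = sklearn time / ml time; values above 1 mean ml is faster.
speedup = sklearn_time_s / ml_time_s

print(f"Δ AUC = {delta_auc:+.3f}, speedup = {speedup:.1f}×")
```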
A typed DAG, not a pipeline.
Every ML workflow is a directed acyclic graph with exactly one terminal boundary. Above it: iterate freely on validation data. Below it: commit once to test.
assess() is terminal. One measurement on held-out data. A second call on the same split raises. Tested, not argued.
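The terminal-boundary behaviour can be sketched in a few lines of plain Python. This is an illustrative guard written for this page, not ml's actual implementation; the class and error names are invented:

```python
class TerminalAssessError(RuntimeError):
    """Raised when the same held-out split is measured twice."""

class OneShotAssessment:
    # Sketch of a terminal boundary: the first call measures,
    # every later call on the same split raises.
    def __init__(self, metric):
        self.metric = metric
        self._spent = False

    def assess(self, y_true, y_pred):
        if self._spent:
            raise TerminalAssessError(
                "assess() already consumed this test split")
        self._spent = True
        return self.metric(y_true, y_pred)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

gate = OneShotAssessment(accuracy)
print(gate.assess([1, 0, 1], [1, 1, 1]))  # first call: measures
try:
    gate.assess([1, 0, 1], [1, 0, 1])     # second call: raises
except TerminalAssessError as e:
    print("blocked:", e)
```

The point of the sketch is that the second measurement is rejected structurally, at call time, rather than discouraged in documentation.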
The underlying leakage surface was measured across 2,047 datasets from 17 scientific fields — lower-bound estimates, not upper bounds. The grammar was validated through 100+ structured adversarial reviews and 2,300+ automated regression tests.
Across 2,047 datasets, the result inverts conventional wisdom. Fitting a normaliser before splitting — the textbook warning — produces near-zero inflation: all nine conditions stay below 0.005 AUC. The damaging classes are selection leakage (peeking at test performance, cherry-picking seeds, early stopping on the test set) and memorisation leakage (duplicates, oversampling) — both large effects, both invisible to the standard pipeline. The most-taught leakage class has near-zero effect; the classes that go undetected are the ones the data shows matter.
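Memorisation leakage is easy to reproduce in miniature. A self-contained sketch, independent of any ML library: the labels are coin flips, so honest accuracy on unseen rows is 50%, yet duplicating rows before the split lets a pure memoriser score far higher:

```python
import random

random.seed(0)

# Tiny dataset: 100 rows, label is a coin flip, so no model can
# genuinely beat 50% accuracy on rows it has never seen.
rows = [(i, random.randint(0, 1)) for i in range(100)]

# Memorisation leakage: duplicate the data BEFORE splitting, so
# copies of the same row land on both sides of the split.
duplicated = rows + rows
random.shuffle(duplicated)
train, test = duplicated[:100], duplicated[100:]

# A "model" that does nothing but memorise training labels.
memo = {x: y for x, y in train}
hits = sum(memo.get(x, 0) == y for x, y in test)
print(f"accuracy with duplicate leakage: {hits / len(test):.0%}")
```

The memoriser lands well above chance on labels that are pure noise, and nothing in the pipeline raised an error.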
arXiv:2604.04199 — methods, data, replication code.
What happens when the data isn’t clean?
ml handles mixed types automatically. sklearn requires manual LabelEncoder + OrdinalEncoder.
Tree-based models handle NaN natively. Raw sklearn raises ValueError. Linear and distance-based models require explicit imputation in both.
ml auto-scales for SVM, KNN, and logistic. sklearn scales correctly too — if you remember to add a StandardScaler. Forgetting it raises no error; the model simply performs near-randomly.
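Why a forgotten scaler matters is easy to see with a toy nearest-neighbour query; the data below is made up for illustration:

```python
from math import dist
from statistics import mean, stdev

# Two features on wildly different scales: income (~tens of
# thousands) and age (~tens). Euclidean distance is dominated by
# income unless the features are standardised.
points = [(42_000.0, 25.0), (43_000.0, 60.0), (90_000.0, 26.0)]
query = (42_900.0, 27.0)

# Without scaling: the income axis swamps the age axis.
unscaled = min(points, key=lambda p: dist(p, query))

# Standardise each feature to zero mean, unit variance.
cols = list(zip(*points))
mu = [mean(c) for c in cols]
sd = [stdev(c) for c in cols]

def scale(p):
    return tuple((v - m) / s for v, m, s in zip(p, mu, sd))

scaled = min(points, key=lambda p: dist(scale(p), scale(query)))
print("unscaled nearest:", unscaled)  # the 60-year-old, by income alone
print("scaled nearest:  ", scaled)    # the genuinely similar row
```

Unscaled, the query's neighbour is the row with a similar income and a wildly different age; standardised, it is the row that is actually similar. Neither run raises an error, which is exactly the failure mode described above.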
assess() raises on any call after the first. sklearn permits repeated test-set evaluation; ml rejects it at call time.
Install in 30 seconds.
Python 3.10+ or R 4.1+. Native Rust backend included.
# core
$ pip install mlw
# + XGBoost (recommended)
$ pip install "mlw[xgboost]"
# everything
$ pip install "mlw[all]"

# R: requires the remotes package and a Rust toolchain (rustup.rs)
> install.packages("remotes")
> remotes::install_github("epagogy/ml", subdir = "r")
# Also available on CRAN: install.packages("ml")

When to stop using ml: when your framework of choice enforces all four constraints natively. I look forward to that day.