assess
Terminal measurement on held-out test data. Returns Evidence, not Metrics. A second assess on the same hold-out test set raises an error. This is constraint C1 — the grammar rejects it at call time.
Signature
ml.assess(model, *, test, metrics=None, intervals=False)   # Python
ml_assess(model, test)                                     # R
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | Model | — | A fitted model that has not yet been assessed |
| `test` | DataFrame | — | Test partition from `split` |
| `metrics` | dict or None | None | Custom metrics (Python only) |
| `intervals` | bool | False | Bootstrap confidence intervals (Python only) |
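The `metrics` mapping is presumably name → callable over `(y_true, y_pred)` pairs. A minimal stand-in (plain Python, not the library's code) showing how such a dict of scorers could be applied:

```python
def apply_metrics(y_true, y_pred, metrics):
    """Apply each named scorer to the (y_true, y_pred) pair."""
    return {name: fn(y_true, y_pred) for name, fn in metrics.items()}

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    return tp / sum(t == 1 for t in y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
scores = apply_metrics(y_true, y_pred, {"accuracy": accuracy, "recall": recall})
print(scores)
```

The names `apply_metrics`, `accuracy`, and `recall` here are illustrative; only the shape of the `metrics` argument is taken from the signature above.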
Returns
Evidence — a sealed result type. Same metrics as evaluate, but the type is different and the model is now locked.
—— Evidence [classification] ———————
accuracy: 0.7863
f1: 0.7021
precision: 0.7500
recall: 0.6600
roc_auc: 0.8315
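With `intervals=True`, a percentile bootstrap over the test predictions is one plausible way such intervals are produced. A self-contained sketch under that assumption (hypothetical helper, not the library's implementation):

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample (truth, prediction) pairs with
    replacement, recompute the metric, take the alpha/2 tails."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int((alpha / 2) * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
lo, hi = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy 95% CI: [{lo:.2f}, {hi:.2f}]")
```

The point estimate always lies inside the percentile interval here because the interval is built from resamples of the same pairs the estimate uses.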
⚠ Final. A second assess() on the same hold-out test set raises.
Examples
Standard protocol
# Python
# Refit on dev (train + valid) for maximum data
final = ml.fit(s.dev, "target", seed=42)
evidence = ml.assess(final, test=s.test)

# R
final <- ml_fit(s$dev, "target", seed = 42)
evidence <- ml_assess(final, test = s$test)

A second call raises

# Python
ml.assess(final, test=s.test)
# AssessmentError: model already assessed. assess() is terminal.

# R
ml_assess(final, test = s$test)
# Error: model already assessed. ml_assess() is terminal.

Why terminal?
Repeated test-set evaluation inflates reported performance by dz = 0.93 (Roth, 2026). The grammar makes this structurally impossible: a model moves one way, from FITTED to ASSESSED, and cannot move back. The type system, not documentation, enforces the rule.
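The one-way FITTED → ASSESSED transition plus the sealed Evidence result can be sketched in a few lines of Python. All names below are hypothetical stand-ins for illustration; the library's internals are not shown in this reference:

```python
from dataclasses import dataclass

class AssessmentError(RuntimeError):
    """Raised on a second assess() of the same model."""

@dataclass
class Model:
    state: str = "FITTED"  # FITTED -> ASSESSED, never back

@dataclass(frozen=True)
class Evidence:
    """Sealed result: frozen, so metrics cannot be edited after assess."""
    metrics: tuple  # (name, value) pairs

def assess(model, test_metrics):
    if model.state == "ASSESSED":
        raise AssessmentError("model already assessed. assess() is terminal.")
    model.state = "ASSESSED"  # one-way transition
    return Evidence(metrics=tuple(test_metrics.items()))

m = Model()
ev = assess(m, {"accuracy": 0.7863})
try:
    assess(m, {"accuracy": 0.7863})  # second call raises
except AssessmentError as e:
    print(e)
```

Freezing the dataclass is one way to get a "sealed" result type; attempting to set a field on `Evidence` raises at runtime, mirroring the locked-model guarantee described above.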