assess

Terminal measurement on held-out test data. Returns Evidence, not Metrics. A second assess on the same held-out test set raises an error. This is constraint C1 — the grammar rejects it at call time.

Signature

ml.assess(model, *, test, metrics=None, intervals=False)   # Python
ml_assess(model, test)                                     # R

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | `Model` | required | A fitted model that has not been assessed |
| `test` | `DataFrame` | required | Test partition from `split` |
| `metrics` | `dict \| None` | `None` | Custom metrics (Python only) |
| `intervals` | `bool` | `False` | Bootstrap confidence intervals (Python only) |
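The `metrics` parameter presumably maps metric names to callables scored against the test predictions. A minimal sketch of that pattern, assuming each callable takes `(y_true, y_pred)` and returns a float (the helper names here are illustrative, not part of the library):

```python
# Hypothetical sketch of dict-of-callables custom metrics.
def apply_metrics(y_true, y_pred, metrics):
    """Evaluate each named metric on the same prediction pair."""
    return {name: fn(y_true, y_pred) for name, fn in metrics.items()}

def accuracy(y_true, y_pred):
    # fraction of positions where prediction matches the label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

scores = apply_metrics([1, 0, 1, 1], [1, 0, 0, 1], {"accuracy": accuracy})
# scores["accuracy"] == 0.75
```

In the real call this would look like `ml.assess(final, test=s.test, metrics={"accuracy": accuracy})`, with each score surfacing in the returned Evidence.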

Returns

Evidence — a sealed result type. Same metrics as evaluate, but the type is different and the model is now locked.

—— Evidence [classification] ———————
  accuracy:     0.7863
  f1:           0.7021
  precision:    0.7500
  recall:       0.6600
  roc_auc:      0.8315
  ⚠ Final. A second assess() on the same hold-out test set raises.
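A "sealed" result type like Evidence can be pictured as an immutable record: once produced, its fields cannot be rewritten. A minimal sketch using a frozen dataclass (this is an illustration of the concept, not the library's actual implementation):

```python
from dataclasses import dataclass, FrozenInstanceError

# Sketch: a sealed result type rejects mutation after construction.
@dataclass(frozen=True)
class Evidence:
    accuracy: float
    f1: float

ev = Evidence(accuracy=0.7863, f1=0.7021)
try:
    ev.accuracy = 1.0  # any attempt to edit a sealed result raises
except FrozenInstanceError:
    pass
```

Sealing means reported numbers cannot be massaged after the fact; the only way to a different Evidence is a new model and a fresh test set.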

Examples

Standard protocol

# Python: refit on dev (train + valid) for maximum data
final = ml.fit(s.dev, "target", seed=42)
evidence = ml.assess(final, test=s.test)

# R equivalent
final <- ml_fit(s$dev, "target", seed = 42)
evidence <- ml_assess(final, test = s$test)

A second call raises

ml.assess(final, test=s.test)
# AssessmentError: model already assessed. assess() is terminal.

ml_assess(final, test = s$test)
# Error: model already assessed. ml_assess() is terminal.

Why terminal?

Repeated test-set evaluation inflates reported performance by dz = 0.93 (Roth, 2026). The grammar makes this structurally impossible: a model moves from FITTED to ASSESSED and cannot move back. The type system enforces it, not documentation.
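The one-way FITTED to ASSESSED transition can be sketched as a state flag that `assess` flips exactly once. The class and error names below mirror this page, but the implementation is illustrative, not the library's source:

```python
# Sketch: a one-way state machine enforcing terminal assessment.
class AssessmentError(RuntimeError):
    pass

class Model:
    def __init__(self):
        self.state = "FITTED"

    def assess(self, test):
        if self.state == "ASSESSED":
            raise AssessmentError(
                "model already assessed. assess() is terminal.")
        self.state = "ASSESSED"      # one-way transition; no path back
        return {"accuracy": 0.7863}  # placeholder metrics

m = Model()
m.assess(test=None)  # first call succeeds and locks the model
# any second call raises AssessmentError
```

Because the check lives in the call itself, no amount of user-side discipline (or indiscipline) can re-score the same held-out set through this interface.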