assess

Terminal measurement on held-out test data. Returns Evidence, not Metrics. A second assess on the same held-out test set raises an error. This is constraint C1 — the grammar rejects it at call time.

Signature

ml.assess(model, *, test, metrics=None, intervals=False)   # Python
ml_assess(model, test)                                     # R

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | `Model` | required | A fitted model that has not been assessed |
| `test` | `DataFrame` | required | Test partition from `split` |
| `metrics` | `dict \| None` | `None` | Custom metrics (Python only) |
| `intervals` | `bool` | `False` | Bootstrap confidence intervals (Python only) |
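The `metrics` parameter presumably maps metric names to callables scored against the test predictions. A minimal sketch of that pattern, assuming each callable takes `(y_true, y_pred)` and returns a float (the helper names here are illustrative, not part of the library):

```python
# Hypothetical sketch of dict-of-callables custom metrics.
def apply_metrics(y_true, y_pred, metrics):
    """Evaluate each named metric on the same prediction pair."""
    return {name: fn(y_true, y_pred) for name, fn in metrics.items()}

def accuracy(y_true, y_pred):
    # fraction of positions where prediction matches the label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

scores = apply_metrics([1, 0, 1, 1], [1, 0, 0, 1], {"accuracy": accuracy})
# scores["accuracy"] == 0.75
```

In the real call this would look like `ml.assess(final, test=s.test, metrics={"accuracy": accuracy})`, with each score surfacing in the returned Evidence.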

Returns

Evidence — a sealed result type. Same metrics as evaluate, but the type is different and the model is now locked.

—— Evidence [classification] ———————
  accuracy:     0.7863
  f1:           0.7021
  precision:    0.7500
  recall:       0.6600
  roc_auc:      0.8315
  ⚠ Final. A second assess() on the same hold-out test set raises.
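A "sealed" result type like Evidence can be pictured as an immutable record: once produced, its fields cannot be rewritten. A minimal sketch using a frozen dataclass (this is an illustration of the concept, not the library's actual implementation):

```python
from dataclasses import dataclass, FrozenInstanceError

# Sketch: a sealed result type rejects mutation after construction.
@dataclass(frozen=True)
class Evidence:
    accuracy: float
    f1: float

ev = Evidence(accuracy=0.7863, f1=0.7021)
try:
    ev.accuracy = 1.0  # any attempt to edit a sealed result raises
except FrozenInstanceError:
    pass
```

Sealing means reported numbers cannot be massaged after the fact; the only way to a different Evidence is a new model and a fresh test set.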

Examples

Standard protocol

# Python: refit on dev (train + valid) for maximum data
final = ml.fit(s.dev, "target", seed=42)
evidence = ml.assess(final, test=s.test)

# R equivalent
final <- ml_fit(s$dev, "target", seed = 42)
evidence <- ml_assess(final, test = s$test)

A second call raises

ml.assess(final, test=s.test)
# AssessmentError: model already assessed. assess() is terminal.

ml_assess(final, test = s$test)
# Error: model already assessed. ml_assess() is terminal.

Why terminal?

Repeated test-set evaluation inflates reported performance by dz = 0.93 (Roth, 2026). The grammar makes this structurally impossible: a model moves from FITTED to ASSESSED and cannot move back. The type system enforces it, not documentation.
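The one-way FITTED to ASSESSED transition can be sketched as a state flag that `assess` flips exactly once. The class and error names below mirror this page, but the implementation is illustrative, not the library's source:

```python
# Sketch: a one-way state machine enforcing terminal assessment.
class AssessmentError(RuntimeError):
    pass

class Model:
    def __init__(self):
        self.state = "FITTED"

    def assess(self, test):
        if self.state == "ASSESSED":
            raise AssessmentError(
                "model already assessed. assess() is terminal.")
        self.state = "ASSESSED"      # one-way transition; no path back
        return {"accuracy": 0.7863}  # placeholder metrics

m = Model()
m.assess(test=None)  # first call succeeds and locks the model
# any second call raises AssessmentError
```

Because the check lives in the call itself, no amount of user-side discipline (or indiscipline) can re-score the same held-out set through this interface.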