split

Splits data into train, valid, and test partitions, stratifying on the target by default. Returns a SplitResult with a .dev accessor (train + valid combined) for the final refit.

Signature

Python:
ml.split(data, target=None, *, ratio=(0.6, 0.2, 0.2), seed=42, stratify=True, groups=None)

R:
ml_split(data, target = NULL, ratio = c(0.6, 0.2, 0.2), seed = NULL, stratify = TRUE, groups = NULL)

Parameters

Parameter  Type        Default          Description
data       DataFrame   (required)       Input data.
target     str         None             Name of the target column.
ratio      tuple       (0.6, 0.2, 0.2)  Train/valid/test proportions. Must sum to 1.
seed       int         42 (R: NULL)     Random seed for reproducibility.
stratify   bool        True             Stratify on target class distribution (classification only).
groups     str | None  None             Column name for group-aware splitting. All rows with the same group value stay in the same partition.
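
To make the stratify option concrete, here is a minimal pure-Python sketch of a stratified three-way split. It is an illustration under stated assumptions, not the library's implementation: the function name stratified_three_way, the list-of-dicts input, and the per-class rounding rule are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_three_way(rows, target, ratio=(0.6, 0.2, 0.2), seed=42):
    """Illustrative stratified split (hypothetical helper, not ml.split itself)."""
    rng = random.Random(seed)
    # Bucket row indices by class label so each class is split separately.
    by_class = defaultdict(list)
    for i, row in enumerate(rows):
        by_class[row[target]].append(i)
    train, valid, test = [], [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_train = round(len(idx) * ratio[0])
        n_valid = round(len(idx) * ratio[1])
        train += idx[:n_train]
        valid += idx[n_train:n_train + n_valid]
        test += idx[n_train + n_valid:]  # remainder goes to test
    return train, valid, test

def positive_rate(rows, idx, target):
    """Share of rows in idx whose target equals 1."""
    return sum(rows[i][target] for i in idx) / len(idx)

# Toy data: 100 rows, 20% positive class.
rows = [{"churn": 1 if i % 5 == 0 else 0} for i in range(100)]
train, valid, test = stratified_three_way(rows, "churn")
rates = [positive_rate(rows, part, "churn") for part in (train, valid, test)]
```

Because each class is split on its own, every partition in this toy example keeps the same 20% positive rate as the full dataset.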

Returns

A SplitResult with four accessors:

Accessor  Description
.train    Training partition (60% by default)
.valid    Validation partition (20%)
.test     Test partition (20%), held out, used only by assess
.dev      Train + valid combined; use for the final refit before assessment
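
The relationship between the four accessors can be sketched with a small dataclass. SplitSketch is a hypothetical stand-in, assuming list-like partitions; the real SplitResult may be implemented differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SplitSketch:
    """Hypothetical stand-in for the documented SplitResult."""
    train: list
    valid: list
    test: list

    @property
    def dev(self):
        # Train and valid concatenated; the test partition is never included.
        return self.train + self.valid

s = SplitSketch(train=[1, 2, 3], valid=[4], test=[5])
final_fit_rows = s.dev
```

Deriving dev from train and valid, rather than storing it separately, keeps the test rows out of the refit by construction.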

Examples

Basic split

Python:
s = ml.split(data, "churn", seed=42)
print(len(s.train), len(s.valid), len(s.test))

R:
s <- ml_split(data, "churn", seed = 42)
c(nrow(s$train), nrow(s$valid), nrow(s$test))

Custom ratio

Python:
s = ml.split(data, "target", ratio=(0.8, 0.1, 0.1), seed=42)

R:
s <- ml_split(data, "target", ratio = c(0.8, 0.1, 0.1), seed = 42)
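
As a rough guide to what a ratio means in rows, the sketch below converts proportions into partition sizes and rejects ratios that do not sum to 1. The helper partition_sizes and its rounding rule (test receives the remainder) are assumptions for illustration; the library may round differently.

```python
def partition_sizes(n_rows, ratio=(0.6, 0.2, 0.2)):
    """Hypothetical helper: expected row counts for a train/valid/test ratio."""
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError(f"ratio must sum to 1, got {sum(ratio)}")
    n_train = round(n_rows * ratio[0])
    n_valid = round(n_rows * ratio[1])
    n_test = n_rows - n_train - n_valid  # remainder, so sizes always add up
    return n_train, n_valid, n_test

sizes_default = partition_sizes(1000)                  # (600, 200, 200)
sizes_custom = partition_sizes(1000, (0.8, 0.1, 0.1))  # (800, 100, 100)
```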

Grouped split

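When rows belong to groups (e.g., multiple measurements per patient), set groups to keep all rows from the same group in the same partition. This prevents leakage across group boundaries: without it, measurements from one patient could land in both the training and test partitions and inflate test scores.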

Python:
s = ml.split(data, "outcome", groups="patient_id", seed=42)

R:
s <- ml_split(data, "outcome", groups = "patient_id", seed = 42)
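
The idea behind group-aware splitting can be sketched in pure Python: assign whole groups, not individual rows, to partitions. This is an illustration only. The function grouped_three_way is hypothetical, and it applies the ratio to the number of groups, which is an assumption; the library may instead balance by row count.

```python
import random

def grouped_three_way(rows, group_key, ratio=(0.6, 0.2, 0.2), seed=42):
    """Illustrative group-aware split (hypothetical helper, not ml.split itself)."""
    rng = random.Random(seed)
    # Shuffle the distinct group values, then cut that list by ratio.
    groups = sorted({row[group_key] for row in rows})
    rng.shuffle(groups)
    n_train = round(len(groups) * ratio[0])
    n_valid = round(len(groups) * ratio[1])
    assignment = {}
    for pos, g in enumerate(groups):
        if pos < n_train:
            assignment[g] = "train"
        elif pos < n_train + n_valid:
            assignment[g] = "valid"
        else:
            assignment[g] = "test"
    # Every row follows its group, so no group spans two partitions.
    parts = {"train": [], "valid": [], "test": []}
    for row in rows:
        parts[assignment[row[group_key]]].append(row)
    return parts

# Toy data: 10 patients with 3 visits each.
rows = [{"patient_id": p, "visit": v} for p in range(10) for v in range(3)]
parts = grouped_three_way(rows, "patient_id")
ids = {name: {r["patient_id"] for r in part} for name, part in parts.items()}
```

In this toy run each patient's three visits end up together, so the patient ID sets of the three partitions do not overlap.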