Benchmarks model performance under feature missingness. The function:
- Filters to complete cases for target_col and feature_cols (baseline complete data),
- Splits into training/validation,
- Masks feature values at each rate using Bernoulli (cell-wise) missingness,
- Imputes missing features using MICE on training data and applies the fitted imputation model to validation data via mice::mice.mids(newdata = ...) (reduces leakage),
- Trains Random Forest (ranger) and kNN regression (FNN::knn.reg),
- Returns MAPE and R-squared for each model and mask rate.
Feature columns must be numeric (or coercible to numeric without introducing new missing values). This mirrors workflows where features are treated as numeric arrays.
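The first three steps (complete-case filtering, splitting, and cell-wise masking) can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the toy data and the helper mask_features() are hypothetical, and the split is assumed to be a simple random draw of rows.

```r
# Hypothetical illustration; names below are not part of the package.
set.seed(42)

# Toy data: one outcome and two numeric features, with two missing cells.
toy <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = runif(100))
toy$x1[c(3, 17)] <- NA

target_col   <- "y"
feature_cols <- c("x1", "x2")

# 1. Keep only rows that are complete in the target and feature columns.
cols     <- c(target_col, feature_cols)
baseline <- toy[complete.cases(toy[, cols]), cols]

# 2. Random train/validation split (assumed equivalent of test_size = 0.2).
n_valid   <- floor(0.2 * nrow(baseline))
valid_idx <- sample(nrow(baseline), n_valid)
train     <- baseline[-valid_idx, ]
valid     <- baseline[valid_idx, ]

# 3. Bernoulli (cell-wise) masking: each feature cell is set to NA
#    independently with probability `rate`; the target is left untouched.
mask_features <- function(df, feature_cols, rate) {
  feats <- as.matrix(df[, feature_cols, drop = FALSE])
  hit   <- matrix(runif(length(feats)) < rate, nrow = nrow(feats))
  feats[hit] <- NA
  df[, feature_cols] <- feats
  df
}

train_masked <- mask_features(train, feature_cols, rate = 0.2)

# Realised share of masked cells; close to 0.2 but varies from draw to draw.
mean(is.na(train_masked[, feature_cols]))
```

Because every cell is masked independently, the realised missingness fluctuates around the nominal rate rather than matching it exactly.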
Usage
run_missingness_benchmark(
  data,
  target_col,
  feature_cols = NULL,
  mask_rates = c(0.05, 0.1, 0.2, 0.3),
  rf_n_estimators = 200,
  knn_k = 5,
  test_size = 0.2,
  seed = 42
)
Arguments
- data
A data.frame (or object coercible to data.frame) containing the dataset.
- target_col
Single character string: name of the outcome column.
- feature_cols
Character vector of feature column names. If NULL, uses all columns except target_col.
- mask_rates
Numeric vector of values in (0, 1): each value is the probability with which a feature entry is masked, so it is the expected proportion of masked entries at that rate.
- rf_n_estimators
Integer: number of trees for the random forest.
- knn_k
Integer: number of neighbors for kNN regression.
- test_size
Numeric in (0, 1): fraction of rows assigned to validation split.
- seed
Integer: seed for data split and model reproducibility.
Details
Validation imputation is performed using mice::mice.mids(newdata = ...), which generates imputations
for new data according to the model stored in the training mids object.
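A minimal sketch of this train-then-apply pattern is shown here. It assumes the mice package is installed and uses simulated data with hypothetical column names; the settings (m = 1, maxit = 5) are illustrative choices, not necessarily those used by run_missingness_benchmark().

```r
library(mice)

set.seed(1)

# Simulated numeric data with correlated columns (hypothetical example).
n   <- 200
x1  <- rnorm(n)
x2  <- 0.5 * x1 + rnorm(n)
x3  <- x1 - x2 + rnorm(n)
dat <- data.frame(x1, x2, x3)

train <- dat[1:160, ]
valid <- dat[161:200, ]

# Introduce missing cells in both splits.
train$x2[sample(160, 30)] <- NA
train$x3[sample(160, 30)] <- NA
valid$x2[sample(40, 8)]   <- NA
valid$x3[sample(40, 8)]   <- NA

# Fit the imputation model on the training data only.
imp_train     <- mice(train, m = 1, maxit = 5, printFlag = FALSE, seed = 1)
train_imputed <- complete(imp_train, 1)

# Impute the validation rows using the model stored in the training mids
# object; the validation rows are not used to build the imputation model.
imp_valid     <- mice.mids(imp_train, newdata = valid, printFlag = FALSE)
valid_imputed <- complete(imp_valid, 1)

anyNA(valid_imputed)
```

Keeping the validation rows out of model fitting is what limits leakage: their observed values only serve as predictors when their own missing cells are filled in.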
MAPE is computed using Metrics::mape() on non-zero targets only to avoid instability when actual values are zero.
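The zero-target filter can be illustrated in base R. The helper mape_nonzero() below is hypothetical and written without Metrics so that it is self-contained; it follows the usual definition of MAPE as the mean of |actual - predicted| / |actual|, returned as a fraction. Whether the benchmark rescales the value to percent is not stated here.

```r
# Hypothetical helper: MAPE restricted to rows whose actual value is non-zero.
mape_nonzero <- function(actual, predicted) {
  keep <- actual != 0
  mean(abs(actual[keep] - predicted[keep]) / abs(actual[keep]))
}

actual    <- c(0, 10, 20, 40)
predicted <- c(1, 12, 18, 30)

# The first observation is dropped because its actual value is zero;
# without the filter, its term would divide by zero and the mean would be Inf.
mape_nonzero(actual, predicted)
```

For these toy values the retained relative errors are 0.2, 0.1 and 0.25, so the result is their mean, roughly 0.183.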
Examples
data("CGMExampleData")
run_missingness_benchmark(
  CGMExampleData,
  target_col = "LBORRES",
  feature_cols = c("TimeDifferenceMinutes", "TimeSeries", "USUBJID"),
  mask_rates = c(0.05, 0.10)
)
#> Warning: Number of logged events: 1
#> Warning: Number of logged events: 1
#>   MaskRate         Model     MAPE        R2
#> 1       5% Random Forest 7.497932 0.7418421
#> 2       5%           kNN 7.898898 0.7276014
#> 3      10% Random Forest 8.510749 0.6683246
#> 4      10%           kNN 9.143478 0.6315460