How To Use CGMissingDataR
Source: vignettes/How-To-Use-CGMissingDataR.Rmd
Overview
CGMissingDataR imputes missing glucose values in continuous glucose monitoring (CGM) data. The main user-facing function is run_missing_glucose_imputation().
The function is designed for real missing glucose values. It handles two common forms of CGM missingness:
- explicit missing glucose values, where a row exists but the glucose value is NA; and
- implicit missing readings, where expected timestamps are absent from the data.
Before imputation, the function regularizes each subject to an equal
interval_minutes timestamp grid. Missing timestamp gaps are
converted into explicit rows with target_col = NA, then
imputed by the same workflow used for explicit missing glucose
values.
The returned data frame is intentionally minimal. It contains the
original user-supplied columns plus a completed glucose column named
imputed_glucose_value. Internal columns used for timestamp
regularization, time features, lag features, rolling means, model
fitting, and missingness tracking are not returned.
The core workflow is:
- read a data frame or CSV file;
- parse and sort timestamps by subject;
- regularize each subject to an equal interval_minutes timestamp grid;
- insert missing timestamp rows with target_col = NA;
- create internal time, lag, and rolling-mean features;
- impute the target and feature matrix;
- choose MICE+ARIMA or MICE+XGBoost from the post-regularization missing rate;
- return the original columns plus imputed_glucose_value.
Installation
Install the CRAN release with:
install.packages("CGMissingDataR")
Install the development version with:
install.packages("devtools")
devtools::install_github("ZhangLabUKY/CGMissingDataR")
Load the package:
library(CGMissingDataR)
Example data
CGMExmplDat10Pct is a small multi-subject CGM data set
included with the package. It contains a subject identifier, raw
timestamp column, glucose column, age, and HbA1c.
data("CGMExmplDat10Pct")
summary_table <- data.frame(
Rows = nrow(CGMExmplDat10Pct),
Columns = ncol(CGMExmplDat10Pct),
Subjects = length(unique(CGMExmplDat10Pct$USUBJID)),
MissingGlucose = sum(is.na(CGMExmplDat10Pct$LBORRES)),
MissingPercent = round(mean(is.na(CGMExmplDat10Pct$LBORRES)) * 100, 1)
)
summary_table
#> Rows Columns Subjects MissingGlucose MissingPercent
#> 1 500 5 5 50 10
head(CGMExmplDat10Pct)
#> USUBJID LBORRES Time AGE hba1c
#> <int> <num> <char> <int> <num>
#> 1: 11 150 2020:01:16:00:00 34 6.4
#> 2: 11 134 2020:01:16:00:05 34 6.4
#> 3: 11 125 2020:01:16:00:10 34 6.4
#> 4: 11 132 2020:01:16:00:15 34 6.4
#> 5: 11 132 2020:01:16:00:20 34 6.4
#> 6: 11 132 2020:01:16:00:25 34 6.4
The example data intentionally does not include TimeSeries. The imputation function creates required time features internally from the raw Time column.
Required input columns
At minimum, the imputation function needs:
| Role | Argument | Example column |
|---|---|---|
| Glucose value to impute | target_col | LBORRES |
| Subject identifier | id_col | USUBJID |
| Raw timestamp | time_col | Time |
| Additional predictors | feature_cols | AGE, hba1c |
The target column may contain missing values. Predictor columns
should be numeric or coercible to numeric. The SEX column,
when present, is internally encoded as M = 1 and
F = 0.
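A quick pre-flight check can catch missing columns or non-numeric predictors before the function is called. The sketch below uses base R only; check_cgm_input() and the SITE column are hypothetical names introduced for illustration and are not part of CGMissingDataR. The last line shows the documented M = 1, F = 0 mapping on a toy vector, not the package's internal code.

```r
# Hypothetical pre-flight check; not part of CGMissingDataR.
check_cgm_input <- function(data, target_col, id_col, time_col, feature_cols) {
  required <- c(target_col, id_col, time_col, feature_cols)
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  # Flag predictors whose non-missing values would become NA under numeric coercion.
  bad <- vapply(feature_cols, function(col) {
    x <- data[[col]]
    converted <- suppressWarnings(as.numeric(as.character(x)))
    any(is.na(converted) & !is.na(x))
  }, logical(1))
  names(bad)[bad]
}

toy <- data.frame(
  USUBJID = c(11, 11, 12),
  Time = c("2020-01-16 00:00:00", "2020-01-16 00:05:00", "2020-01-16 00:00:00"),
  LBORRES = c(150, NA, 140),
  AGE = c("34", "34", "51"),   # character, but coercible to numeric
  SITE = c("A", "B", "A"),     # hypothetical column that is not coercible
  SEX = c("M", "M", "F"),
  stringsAsFactors = FALSE
)

problem_cols <- check_cgm_input(toy, "LBORRES", "USUBJID", "Time", c("AGE", "SITE"))
problem_cols

# The documented SEX encoding (M = 1, F = 0), reproduced on the toy data.
sex_encoded <- unname(c(M = 1, F = 0)[toy$SEX])
```

Columns reported in problem_cols would need recoding or removal from feature_cols before imputation.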
What counts as missing?
CGM exports can represent missingness in two ways.
Explicit missing glucose values
A row exists, but the glucose value is missing:
| Time | LBORRES |
|---|---|
| 00:00 | 120 |
| 00:05 | NA |
| 00:10 | 125 |
The row with LBORRES = NA is imputed.
Timestamp gaps
A row is absent entirely, producing a jump in the timestamp sequence:
| Time | LBORRES |
|---|---|
| 00:00 | 120 |
| 00:05 | 122 |
| 00:30 | 130 |
With interval_minutes = 5, the function internally
regularizes this to:
| Time | LBORRES |
|---|---|
| 00:00 | 120 |
| 00:05 | 122 |
| 00:10 | NA |
| 00:15 | NA |
| 00:20 | NA |
| 00:25 | NA |
| 00:30 | 130 |
The inserted rows are then imputed using the same workflow as
explicit NA values. Because of this, the returned data
frame may have more rows than the input data when timestamp gaps are
present.
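The regularization step shown in the tables above can be reproduced for a single subject in base R. This is only a sketch of the idea (build the expected grid, then left-join the observations onto it); the package's internal implementation may differ.

```r
# Base-R illustration only; CGMissingDataR's internal implementation may differ.
obs <- data.frame(
  Time = as.POSIXct(c("2020-01-16 00:00:00",
                      "2020-01-16 00:05:00",
                      "2020-01-16 00:30:00"), tz = "UTC"),
  LBORRES = c(120, 122, 130)
)

interval_minutes <- 5

# Full expected grid between the first and last observed timestamps (step in seconds).
grid <- data.frame(
  Time = seq(min(obs$Time), max(obs$Time), by = interval_minutes * 60)
)

# Left-join observations onto the grid; absent readings become NA.
regular <- merge(grid, obs, by = "Time", all.x = TRUE)
regular <- regular[order(regular$Time), ]

nrow(regular)                 # 7 grid positions
sum(is.na(regular$LBORRES))   # 4 inserted NA readings
```

The three observed rows expand to seven, which is why the returned data frame can be longer than the input.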
Basic real-imputation workflow
For the CRAN-safe R-native path, use
imputer_backend = "mice".
impute_out <- suppressWarnings(
run_missing_glucose_imputation(
CGMExmplDat10Pct,
target_col = "LBORRES",
feature_cols = c("AGE", "hba1c"),
id_col = "USUBJID",
time_col = "Time",
imputer_backend = "mice",
xgb_nrounds = 5
)
)
The result is a data frame:
class(impute_out)
#> [1] "data.frame"
nrow(impute_out)
#> [1] 500
names(impute_out)
#> [1] "USUBJID" "LBORRES" "Time"
#> [4] "AGE" "hba1c" "imputed_glucose_value"
The returned columns are the original user-supplied columns plus imputed_glucose_value.
| Column | Meaning |
|---|---|
| Original columns | The user’s input columns, including the original glucose column. |
| Original target column, e.g. LBORRES | The original glucose column. Values originally missing or inserted from timestamp gaps remain NA. |
| imputed_glucose_value | Completed glucose values after imputation. |
head(impute_out[c(
"USUBJID",
"Time",
"LBORRES",
"AGE",
"hba1c",
"imputed_glucose_value"
)])
#> USUBJID Time LBORRES AGE hba1c imputed_glucose_value
#> 1 11 2020-01-16 00:00:00 150 34 6.4 150
#> 2 11 2020-01-16 00:05:00 134 34 6.4 134
#> 3 11 2020-01-16 00:10:00 125 34 6.4 125
#> 4 11 2020-01-16 00:15:00 132 34 6.4 132
#> 5 11 2020-01-16 00:20:00 132 34 6.4 132
#> 6 11 2020-01-16 00:25:00 132 34 6.4 132
The original target column is not overwritten:
sum(is.na(CGMExmplDat10Pct$LBORRES))
#> [1] 50
sum(is.na(impute_out$LBORRES))
#> [1] 50
sum(is.na(impute_out$imputed_glucose_value))
#> [1] 0
Inspect rows where the original target column is missing. These include explicit missing glucose values and, when timestamp gaps are present, rows inserted during timestamp regularization.
missing_rows <- is.na(impute_out$LBORRES)
head(impute_out[missing_rows, c(
"USUBJID",
"Time",
"LBORRES",
"imputed_glucose_value"
)])
#> USUBJID Time LBORRES imputed_glucose_value
#> 10 11 2020-01-16 00:45:00 NA 159.3147
#> 31 11 2020-01-16 02:30:00 NA 148.3185
#> 32 11 2020-01-16 02:35:00 NA 154.4872
#> 33 11 2020-01-16 02:40:00 NA 147.4133
#> 34 11 2020-01-16 02:45:00 NA 153.1695
#> 55 11 2020-01-16 04:30:00 NA 147.4133
How the method is selected
The function automatically chooses the final imputation model from the target missing rate after timestamp-gap regularization:
- if the missing rate is less than or equal to use_arima_if_missing_leq, the final method is MICE+ARIMA;
- otherwise, the final method is MICE+XGBoost.
The default threshold is 0.05.
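The selection rule can be written as a one-line comparison. The helper below, choose_final_method(), is a hypothetical name used only to mirror the documented rule; it is not the package's code.

```r
# Hypothetical helper mirroring the documented rule; not the package's code.
choose_final_method <- function(target, use_arima_if_missing_leq = 0.05) {
  # Missing rate of the target after any timestamp gaps have been filled with NA.
  missing_rate <- mean(is.na(target))
  if (missing_rate <= use_arima_if_missing_leq) "MICE+ARIMA" else "MICE+XGBoost"
}

low  <- c(rep(100, 96), rep(NA, 4))    # 4% missing
high <- c(rep(100, 90), rep(NA, 10))   # 10% missing

method_low  <- choose_final_method(low)
method_high <- choose_final_method(high)
```

Under this rule, a target with a 10% missing rate, like the example data, exceeds the default threshold and would be routed to MICE+XGBoost.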
Method labels and missingness-tracking columns are internal implementation details and are not included in the minimal user-facing output. The returned data frame keeps only the original input columns plus imputed_glucose_value.
Time handling and timestamp regularization
The function accepts common timestamp formats, including
colon-separated, hyphen-separated, slash-separated, ISO-style, and
POSIXct inputs.
Examples of accepted character formats include:
"2020:01:16:00:00"
"2020-01-16 00:00:00"
"2020/01/16 00:00:00"
"01/16/2020 00:00"
"2020-01-16T00:00:00"
The function uses the timestamp column and interval_minutes to regularize each subject’s data to an expected CGM interval. The default is:
interval_minutes = 5
Observed timestamps are aligned to the subject-level interval grid, missing grid positions are inserted, and the inserted target values are set to NA before imputation.
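To check in advance whether a timestamp column falls into one of the listed character formats, a small multi-format parser can be sketched in base R. parse_cgm_time() is a hypothetical helper written for this illustration; the package's own parsing rules may accept more or fewer formats.

```r
# Hypothetical parser for illustration; the package's own parsing rules may differ.
parse_cgm_time <- function(x, tz = "UTC") {
  formats <- c(
    "%Y:%m:%d:%H:%M",       # 2020:01:16:00:00
    "%Y-%m-%d %H:%M:%S",    # 2020-01-16 00:00:00
    "%Y/%m/%d %H:%M:%S",    # 2020/01/16 00:00:00
    "%m/%d/%Y %H:%M",       # 01/16/2020 00:00
    "%Y-%m-%dT%H:%M:%S"     # 2020-01-16T00:00:00
  )
  out <- .POSIXct(rep(NA_real_, length(x)), tz = tz)
  for (fmt in formats) {
    # Only retry values that no earlier format has parsed.
    todo <- is.na(out) & !is.na(x)
    if (!any(todo)) break
    out[todo] <- as.POSIXct(strptime(x[todo], format = fmt, tz = tz))
  }
  out
}

raw <- c("2020:01:16:00:00",
         "2020-01-16 00:00:00",
         "2020/01/16 00:00:00",
         "01/16/2020 00:00",
         "2020-01-16T00:00:00")

parsed <- parse_cgm_time(raw)

# Values that stay NA did not match any of the formats tried.
unparsed <- parse_cgm_time("not a time")
```

All five strings resolve to the same instant, and anything left as NA points to a value worth inspecting by hand.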
Internal engineered features
The workflow creates TimeSeries,
TimeDifferenceMinutes, lag features, and a rolling mean
before imputation. These features help the model use temporal order,
time spacing, and recent glucose history.
For example, after timestamp regularization, lag features are created on the expanded grid:
| Time | LBORRES | lag1 | lag2 | lag3 |
|---|---|---|---|---|
| 00:00 | 120 | NA | NA | NA |
| 00:05 | 122 | 120 | NA | NA |
| 00:10 | NA | 122 | 120 | NA |
| 00:15 | NA | NA | 122 | 120 |
| 00:20 | NA | NA | NA | 122 |
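The lag columns in the table above can be reproduced with a short base-R helper. This is an illustration only: lag_vec() is a hypothetical name, and computing lags separately within each subject is an assumption about the workflow rather than something taken from the package source.

```r
# Illustration only; lag_vec() is a hypothetical helper, not a package function.
lag_vec <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

glucose <- c(120, 122, NA, NA, NA)   # regularized series from the table

features <- data.frame(
  LBORRES = glucose,
  lag1 = lag_vec(glucose, 1),
  lag2 = lag_vec(glucose, 2),
  lag3 = lag_vec(glucose, 3)
)
features

# Assumed per-subject variant: lags do not carry over from one subject to the next.
toy <- data.frame(
  USUBJID = c(11, 11, 11, 12, 12),
  LBORRES = c(120, 122, 125, 140, 141)
)
toy$lag1 <- ave(toy$LBORRES, toy$USUBJID, FUN = function(x) lag_vec(x, 1))
```

Because lags are taken on the expanded grid, a run of inserted NA rows pushes NA values into the lag columns as well, which is part of why the feature matrix itself is imputed.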
These engineered columns are used internally by the imputer and final model but are removed from the returned data frame.
grep("^lag[0-9]+$|^rollmean$|^TimeSeries$|^TimeDifferenceMinutes$", names(impute_out), value = TRUE)
#> character(0)
This should return an empty character vector because those features are internal implementation details.
Continuous imputed values
imputed_glucose_value is returned as a continuous
numeric model estimate. It is not rounded to the nearest whole number by
default because downstream analyses may benefit from retaining the
model-estimated precision.
Users who need whole-number glucose values for reporting can round after imputation:
impute_out$imputed_glucose_value_rounded <- round(impute_out$imputed_glucose_value)
Optional Python-compatible backend
For closest agreement with the Python reference workflow, use:
imputer_backend = "sklearn"
In that mode, the function sends the input data frame to Python through reticulate. Python then performs preprocessing and imputation with:
- pandas for data-frame operations;
- scikit-learn for IterativeImputer;
- statsmodels for ARIMA;
- Python xgboost for XGBoost regression.
The completed pandas data frame is then converted back to R.
Installing optional Python dependencies
Install reticulate in R:
install.packages("reticulate")
Declare the Python dependencies before running the Python backend:
reticulate::py_require(c(
"numpy",
"pandas",
"scikit-learn",
"statsmodels",
"xgboost"
))
Then call the function with imputer_backend = "sklearn":
out_py <- run_missing_glucose_imputation(
CGMExmplDat10Pct,
target_col = "LBORRES",
feature_cols = c("AGE", "hba1c"),
id_col = "USUBJID",
time_col = "Time",
imputer_backend = "sklearn",
xgb_nrounds = 5
)
head(out_py[c(
"USUBJID",
"Time",
"LBORRES",
"imputed_glucose_value"
)])
The Python backend is optional. It is not required for package installation or for building this vignette.
Choosing a backend
| Backend | Use case | Notes |
|---|---|---|
| mice | Default R-native workflow | CRAN-safe and does not require Python. |
| sklearn | Closest Python-compatible workflow | Requires reticulate and Python packages. |
Use mice for simple installation and CRAN-safe examples.
Use sklearn when comparing with the Python reference
workflow or when you want Python libraries to perform the full strict
path.
Exporting results
Set export = TRUE to write the returned imputed data
frame to a timestamped CSV file in the current working directory.
out <- run_missing_glucose_imputation(
CGMExmplDat10Pct,
target_col = "LBORRES",
feature_cols = c("AGE", "hba1c"),
id_col = "USUBJID",
time_col = "Time",
imputer_backend = "mice",
export = TRUE
)
The exported CSV contains the original input columns plus imputed_glucose_value.
Troubleshooting
Timestamp parsing errors
If the function reports that timestamps could not be parsed, check the values in your timestamp column.
Use a standard format such as YYYY-mm-dd HH:MM:SS,
YYYY:mm:dd:HH:MM, or a POSIXct column.
Unexpected row counts
If the returned data frame has more rows than the input data, this is expected when timestamp gaps are present. The function creates rows for missing expected CGM readings before imputation.
If the increase is larger than expected, inspect whether the timestamp column contains off-grid times such as seconds, irregular minutes, or mixed timestamp formats.
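One way to look for off-grid times is to measure each reading's offset from the subject's first timestamp and test whether it is a whole multiple of the interval. The snippet below is a hypothetical diagnostic on toy data, assuming the timestamps have already been converted to POSIXct; it is not a package function.

```r
# Hypothetical diagnostic on toy data; assumes timestamps are already POSIXct.
interval_minutes <- 5

times <- as.POSIXct(c("2020-01-16 00:00:00",
                      "2020-01-16 00:05:00",
                      "2020-01-16 00:10:30",    # 30 seconds off the grid
                      "2020-01-16 00:17:00"),   # 2 minutes off the grid
                    tz = "UTC")

# Seconds elapsed since the first reading.
elapsed <- as.numeric(difftime(times, min(times), units = "secs"))

# A reading is on the grid when its offset is a whole multiple of the interval.
off_grid <- elapsed %% (interval_minutes * 60) != 0
times[off_grid]

# Spacing between consecutive readings, in minutes, to spot gaps and irregular steps.
step_minutes <- diff(elapsed) / 60
```

For multi-subject data, the same check would be run separately for each subject identifier.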
Python module errors
If the Python backend reports a missing module such as
sklearn, remember that the package is installed as
scikit-learn but imported as sklearn.
reticulate::py_require(c("scikit-learn", "pandas", "statsmodels", "xgboost"))
If Python was already initialized before declaring requirements, restart R and run the call again.
Session information
utils::sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] CGMissingDataR_0.0.2
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 generics_0.1.4 tidyr_1.3.2 CGManalyzer_1.3.1
#> [5] shape_1.4.6.1 lattice_0.22-9 lme4_2.0-1 digest_0.6.39
#> [9] magrittr_2.0.5 mitml_0.4-5 evaluate_1.0.5 grid_4.6.0
#> [13] iterators_1.0.14 mice_3.19.0 fastmap_1.2.0 xgboost_3.2.1.1
#> [17] foreach_1.5.2 jomo_2.7-6 jsonlite_2.0.0 glmnet_5.0
#> [21] Matrix_1.7-5 nnet_7.3-20 backports_1.5.1 survival_3.8-6
#> [25] purrr_1.2.2 codetools_0.2-20 textshaping_1.0.5 jquerylib_0.1.4
#> [29] reformulas_0.4.4 Rdpack_2.6.6 cli_3.6.6 rlang_1.2.0
#> [33] rbibutils_2.4.1 splines_4.6.0 cachem_1.1.0 yaml_2.3.12
#> [37] pan_1.9 otel_0.2.0 FNN_1.1.4.1 tools_4.6.0
#> [41] nloptr_2.2.1 minqa_1.2.8 dplyr_1.2.1 ranger_0.18.0
#> [45] boot_1.3-32 broom_1.0.12 rpart_4.1.27 vctrs_0.7.3
#> [49] R6_2.6.1 lifecycle_1.0.5 fs_2.1.0 MASS_7.3-65
#> [53] ragg_1.5.2 pkgconfig_2.0.3 desc_1.4.3 pkgdown_2.2.0
#> [57] pillar_1.11.1 bslib_0.10.0 data.table_1.18.4 glue_1.8.1
#> [61] Rcpp_1.1.1-1.1 systemfonts_1.3.2 xfun_0.57 tibble_3.3.1
#> [65] tidyselect_1.2.1 knitr_1.51 nlme_3.1-169 htmltools_0.5.9
#> [69] rmarkdown_2.31 compiler_4.6.0