Perform Sensitivity Analysis for Environmental Exposure — perform_sensitivity

This function performs a sensitivity analysis to assess the impact of environmental exposure on specified health outcomes. It integrates modeled probability data and measured concentration data to create a probabilistic exposure model, weighted by the proportion of the population using each source. It then uses multiple imputation to generate a series of datasets with imputed exposure levels.
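
As a rough illustration of the weighting idea described above, the sketch below mixes modeled category probabilities with the category probabilities implied by a measured lognormal concentration, using the private-well share as the weight. Several things here are assumptions made for illustration: that the modeled probabilities apply to private-well users and the measured concentrations to everyone else, the helper name mix_exposure_probs, and all input numbers. The function's internal calculation may differ.

```r
# Hypothetical helper: not part of the package.
mix_exposure_probs <- function(model_probs, meanlog, sdlog, pwell_pct,
                               cutoffs = c(5, 10)) {
  # Category probabilities implied by the measured lognormal distribution
  cdf <- plnorm(cutoffs, meanlog = meanlog, sdlog = sdlog)
  measured_probs <- diff(c(0, cdf, 1))
  # Assumed weighting: the private-well share gets the modeled probabilities
  w <- pwell_pct / 100
  mixed <- w * model_probs + (1 - w) * measured_probs
  mixed / sum(mixed)
}

set.seed(12345)
p <- mix_exposure_probs(
  model_probs = c(0.6, 0.3, 0.1),
  meanlog = log(3), sdlog = 1, pwell_pct = 40
)

# Draw 10 exposure levels for one area, e.g. one per imputed dataset
draws <- sample(c("<5", "5-10", "10+"), size = 10, replace = TRUE, prob = p)
```

Each draw is a category label, so repeating the draw for every record and every dataset yields a set of datasets that differ only in their imputed exposure levels.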

Usage

perform_sensitivity_analysis(
  ndraws = 10,
  model_prob_csv = NULL,
  conc_data_csv = NULL,
  birth_data_txt = NULL,
  regression_formula = NULL,
  output_dir = "sensitivity_results/",
  model_type = "mixed",
  model_family = NULL,
  targets = c("OEGEST", "BWT"),
  impute_vars = NULL,
  cat_label = c("<5", "5-10", "10+"),
  drop_cat_label_ref = c("<5"),
  columns_to_select = NULL,
  rucc_col = NULL,
  id_col = "GEOID10",
  prob_cols = c("prob_C1", "prob_C2", "prob_C3"),
  conc_cutoffs = c(5, 10),
  pop_well_col = "Wells_2010",
  record_id_col = "FIPS",
  exposure_level_col = "ExposureLevel",
  conc_mean_col = "conc_meanlog",
  conc_sd_col = "conc_sdlog",
  conc_raw_col = NULL,
  default_sdlog = 1,
  pwell_col = "PWELL_private_pct",
  seed = 12345,
  mice_m = 1,
  mice_maxit = 5,
  mice_method = "pmm",
  mice_covs = c("pm"),
  apply_imputation_fallback = TRUE,
  format_ids = TRUE
)

Arguments

ndraws

An integer specifying the number of imputed datasets to generate.

model_prob_csv

A file path to the CSV containing modeled multinomial probabilities of exposure levels (e.g., from predictive models).

conc_data_csv

A file path to the CSV containing measured concentration lognormal parameters and the percentage of private well users.

birth_data_txt

A file path to the text file containing the birth data.

regression_formula

A string or formula object for the regression model. For fixed-effects models (model_type = "fixed") the formula must not contain random-effects terms; for random- and mixed-effects models it must contain at least one random-effects term such as (1 | group).
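
The rule above can be checked before fitting. The following is a minimal base-R sketch, assuming the formula is supplied one-sided (as in the examples on this page) and that each target is prepended as the response; has_random_terms and build_formula are hypothetical helpers, not functions exported by the package.

```r
# Hypothetical helpers; the package may validate formulas differently.
has_random_terms <- function(formula) {
  txt <- paste(deparse(as.formula(formula)), collapse = " ")
  # A random-effects term looks like "( ... | group )"
  grepl("\\([^()]*\\|[^()]*\\)", txt)
}

build_formula <- function(target, rhs) {
  # rhs is one-sided, e.g. "~ as.factor(ExposureLevel) + MAGE_R"
  as.formula(paste(target, rhs))
}

rhs_fixed <- "~ as.factor(ExposureLevel) + MAGE_R"
rhs_mixed <- "~ as.factor(ExposureLevel) + MAGE_R + (1 | FIPS)"

fixed_has_random <- has_random_terms(rhs_fixed)  # no "(... | ...)" term
mixed_has_random <- has_random_terms(rhs_mixed)  # contains (1 | FIPS)

# Full two-sided formula for one target
f <- build_formula("BWT", rhs_mixed)
```

A formula for model_type = "fixed" should give FALSE from such a check, and a formula for "random" or "mixed" should give TRUE.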

output_dir

A file path to the directory where output files will be saved.

model_type

A character string selecting the effects structure of the regression model. One of:

"mixed" (default) - fixed effects plus random effects, fitted with lme4::lmer() (linear) or lme4::glmer() (non-linear).
"random" - random-intercept-only models, fitted with lme4::lmer() / lme4::glmer(). Every random term must be an intercept, e.g. (1 | group).
"fixed" - no random effects, fitted with stats::lm() (linear) or stats::glm() (non-linear).

model_family

The error distribution and link for non-linear (generalized) models. Accepts NULL (default; a linear/Gaussian model), a family name such as "binomial" or "poisson", the special value "multinomial", a stats::family() object, or a family-generating function. When non-NULL, generalized models are fitted (stats::glm() for model_type = "fixed", otherwise lme4::glmer()). Use "multinomial" for multi-level categorical outcomes (fitted with nnet::multinom() for model_type = "fixed", or mclogit::mblogit() for "random"/"mixed"); the pooled results gain a y.level column identifying the outcome category. Categorical outcomes supplied as character columns are coerced to factors automatically for "binomial" and "multinomial".
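
The combinations of model_type and model_family described above can be summarized as a lookup. This sketch only returns the name of the fitting function as a string, so it runs without lme4, nnet, or mclogit installed; choose_fitter is a hypothetical helper and does not reflect how the package dispatches internally.

```r
# Hypothetical lookup that mirrors the documented combinations.
choose_fitter <- function(model_type = c("mixed", "random", "fixed"),
                          model_family = NULL) {
  model_type <- match.arg(model_type)
  is_multinom <- is.character(model_family) &&
    identical(model_family, "multinomial")
  if (is_multinom) {
    return(if (model_type == "fixed") "nnet::multinom" else "mclogit::mblogit")
  }
  if (model_type == "fixed") {
    # NULL family means a linear (Gaussian) model
    if (is.null(model_family)) "stats::lm" else "stats::glm"
  } else {
    if (is.null(model_family)) "lme4::lmer" else "lme4::glmer"
  }
}

choose_fitter("mixed")                   # linear mixed model
choose_fitter("fixed", "binomial")       # logistic regression
choose_fitter("random", "multinomial")   # multinomial with random intercepts
```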

targets

A character vector of the dependent variables (health outcomes) for the regression analysis. Defaults to c("OEGEST", "BWT").

impute_vars

A character vector of column names to be imputed using MICE. If NULL, no additional imputation is performed. Defaults to NULL.

cat_label

A character vector of labels for the exposure concentration categories. Defaults to c("<5", "5-10", "10+").

drop_cat_label_ref

A character vector of exposure categories to be used as the reference level in the regression. Defaults to c("<5").

columns_to_select

A character vector of column names to be selected from the birth data. If NULL, all columns are used.

rucc_col

A character string specifying the column name for the Rural-Urban Continuum Code. Defaults to NULL.

id_col

A character string specifying the column name for the geographic identifier in the exposure data. Supports any identifier format, including county FIPS codes, census tract IDs, ZIP codes, or participant-specific IDs. The only requirement is that identifiers match between datasets. Defaults to "GEOID10".

prob_cols

A character vector of column names from the probability data for the exposure probability categories. Defaults to c("prob_C1", "prob_C2", "prob_C3").

conc_cutoffs

A numeric vector of cutoffs for categorizing exposure levels based on concentration data. If NULL, the cutoffs are determined automatically from the number of probability columns. Defaults to c(5, 10).
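
To show how the default cutoffs relate to the default cat_label values, here is a small sketch using base::cut(). Treating each interval as closed on the left (so a concentration of exactly 5 falls in "5-10") is an assumption; the package may handle boundary values differently. The concentrations are made up.

```r
conc <- c(1.2, 5, 9.9, 10, 37)          # hypothetical concentrations
cat_label <- c("<5", "5-10", "10+")
conc_cutoffs <- c(5, 10)

# Two cutoffs define three categories: one more label than cutoffs.
# Assumption: intervals are closed on the left, so 5 falls in "5-10".
exposure <- cut(conc,
                breaks = c(-Inf, conc_cutoffs, Inf),
                labels = cat_label,
                right = FALSE)

# Make "<5" the reference level, as drop_cat_label_ref = c("<5") would imply
exposure <- relevel(exposure, ref = "<5")
```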

pop_well_col

A character string specifying the column name from the probability data for the population of well users. The values must be reported at the same unit of aggregation (for example, county or census tract) as id_col and record_id_col. Defaults to "Wells_2010".

record_id_col

A character string specifying the column name for the geographic/participant identifier in the health data. Supports any identifier format (FIPS codes, census tracts, ZIP codes, participant IDs). Must have matching values with id_col in the exposure data. Defaults to "FIPS".
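
Because the join depends on matching values between id_col and record_id_col, it can help to check the overlap before running the analysis. The vectors below are made-up stand-ins for those two columns.

```r
# Hypothetical identifier columns
exposure_ids <- c("37001", "37003", "37005")          # id_col values
health_ids   <- c("37001", "37005", "37007", "37001") # record_id_col values

# Which health records have a matching exposure row?
matched <- health_ids %in% exposure_ids
share_matched <- mean(matched)

# Identifiers present in the health data but absent from the exposure data
unmatched <- setdiff(health_ids, exposure_ids)
```

A low matched share usually points to a formatting difference, such as dropped leading zeros, rather than genuinely missing areas.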

exposure_level_col

A character string for the name of the imputed exposure level column. Defaults to "ExposureLevel".

conc_mean_col

A character string for the column name of the concentration lognormal meanlog parameter. Defaults to "conc_meanlog". If this column doesn't exist and conc_raw_col is provided, it will be calculated automatically.

conc_sd_col

A character string for the column name of the concentration lognormal sdlog parameter. Defaults to "conc_sdlog". If this column doesn't exist, default_sdlog will be used.

conc_raw_col

A character string for the column name of raw concentration values (in measurement units). If provided and the meanlog/sdlog columns don't exist, lognormal parameters will be calculated automatically as: meanlog = log(concentration + 0.1), sdlog = default_sdlog. Defaults to NULL.
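
The conversion stated above can be reproduced directly. In this sketch the raw concentrations are invented; the 0.1 offset keeps a raw value of zero finite on the log scale.

```r
default_sdlog <- 1
conc_raw <- c(0, 2.5, 12)   # hypothetical raw concentrations

# Formula stated in the argument description
meanlog <- log(conc_raw + 0.1)
sdlog <- rep(default_sdlog, length(conc_raw))

# Probability that a lognormal draw with these parameters exceeds 10
p_above_10 <- plnorm(10, meanlog = meanlog, sdlog = sdlog, lower.tail = FALSE)
```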

default_sdlog

A numeric value for the default lognormal sdlog when the sdlog column is not provided. Defaults to 1.0, which is typical for environmental concentration data.

pwell_col

A character string for the column name of the percentage of private well users. Defaults to "PWELL_private_pct".

seed

An integer for setting the random number generator seed for reproducibility. Defaults to 12345.

mice_m

An integer for the number of multiple imputations for covariates. Defaults to 1.

mice_maxit

An integer for the maximum number of iterations for mice. Defaults to 5.

mice_method

A character string specifying the imputation method for covariates. Defaults to "pmm" (predictive mean matching).

mice_covs

A character vector of column names to be used as covariates in the mice imputation process. Defaults to c("pm"). Make sure these columns exist in the birth data.

apply_imputation_fallback

Logical indicating whether to apply fallback imputation when exposure levels are still missing after the initial imputation step. If TRUE (default), missing values are filled with the most common category in the respective dataset. If FALSE, rows with missing exposure levels are removed.
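
The two behaviors can be sketched in base R as follows. fill_with_mode is a hypothetical helper, and breaking ties by taking the first category in level order is an assumption.

```r
# Hypothetical helper: fill missing values with the most common category
fill_with_mode <- function(x) {
  counts <- table(x)                           # NA values are not counted
  most_common <- names(counts)[which.max(counts)]
  x[is.na(x)] <- most_common
  x
}

lv <- c("<5", "5-10", "10+")
x <- factor(c("<5", NA, "5-10", "<5", NA), levels = lv)

filled  <- fill_with_mode(x)   # fallback enabled: NAs become "<5"
dropped <- x[!is.na(x)]        # fallback disabled: rows with NA are removed
```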

format_ids

Logical indicating whether to apply smart identifier formatting. If TRUE (default), numeric FIPS-like codes (1-5 digits) are zero-padded to 5 digits for consistency, while alphanumeric identifiers (census tracts, participant IDs, etc.) are preserved unchanged. Set to FALSE to skip formatting and use identifiers as-is (converted to character).
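
The formatting rule described above might look like the following sketch. format_ids_sketch is a hypothetical stand-in written for illustration, not the package's implementation.

```r
# Hypothetical stand-in for the identifier formatting step
format_ids_sketch <- function(ids) {
  ids <- as.character(ids)
  # Purely numeric codes of 1 to 5 digits are treated as FIPS-like
  short_numeric <- grepl("^[0-9]{1,5}$", ids)
  ids[short_numeric] <- sprintf("%05d", as.integer(ids[short_numeric]))
  ids
}

formatted <- format_ids_sketch(c("1001", "37183", "37183020100", "P-001"))
```

Here "1001" is padded to "01001", while the 11-digit tract ID and the alphanumeric participant ID are left unchanged.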

Value

A list of data frames, each containing the pooled regression results for a target health outcome.
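
This page does not state how estimates from the imputed datasets are pooled. Rubin's rules are the conventional choice for multiply imputed analyses, so the sketch below shows them for a single coefficient as an assumption only; pool_rubin and the output column names are hypothetical.

```r
# Hypothetical illustration of Rubin's rules for one coefficient
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  q_bar <- mean(estimates)        # pooled point estimate
  u_bar <- mean(std_errors^2)     # within-imputation variance
  b <- var(estimates)             # between-imputation variance
  total_var <- u_bar + (1 + 1 / m) * b
  data.frame(estimate = q_bar, std.error = sqrt(total_var))
}

# Estimates and standard errors from three imputed datasets (made up)
pooled <- pool_rubin(estimates = c(1, 2, 3), std_errors = c(1, 1, 1))
```

The pooled standard error is larger than any single-dataset standard error because it also reflects how much the estimate varies across imputed datasets.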

Examples

if (FALSE) { # \dontrun{
# Basic usage with minimal parameters
results <- perform_sensitivity_analysis(
  ndraws = 10,
  model_prob_csv = "prob_model_data.csv",
  conc_data_csv = "conc_measured_data.csv",
  birth_data_txt = "birth_outcomes.txt",
  regression_formula = "~ as.factor(ExposureLevel) + maternal_age + (1|county)",
  output_dir = "results/",
  targets = c("birth_weight", "gestational_age")
)

# Advanced usage with custom parameters and MICE imputation
results <- perform_sensitivity_analysis(
  ndraws = 100,
  model_prob_csv = "data/prob_data.csv",
  conc_data_csv = "data/conc_data.csv",
  birth_data_txt = "data/births.txt",
  regression_formula = "~ as.factor(ExposureLevel) + MAGE_R + rural + (1|FIPS)",
  output_dir = "sensitivity_results/",
  targets = c("OEGEST", "BWT"),
  impute_vars = c("MAGE_R", "education"),
  mice_m = 5,
  mice_maxit = 10,
  seed = 42
)

# Linear fixed-effects model (ordinary least squares, no random effects)
results <- perform_sensitivity_analysis(
  ndraws = 10,
  model_prob_csv = "data/prob_data.csv",
  conc_data_csv = "data/conc_data.csv",
  birth_data_txt = "data/births.txt",
  regression_formula = "~ as.factor(ExposureLevel) + MAGE_R",
  output_dir = "results_fixed/",
  model_type = "fixed"
)

# Linear random-intercept model
results <- perform_sensitivity_analysis(
  ndraws = 10,
  model_prob_csv = "data/prob_data.csv",
  conc_data_csv = "data/conc_data.csv",
  birth_data_txt = "data/births.txt",
  regression_formula = "~ as.factor(ExposureLevel) + (1 | FIPS)",
  output_dir = "results_random/",
  model_type = "random"
)

# Non-linear (logistic) mixed-effects model for a binary outcome
results <- perform_sensitivity_analysis(
  ndraws = 10,
  model_prob_csv = "data/prob_data.csv",
  conc_data_csv = "data/conc_data.csv",
  birth_data_txt = "data/births.txt",
  regression_formula = "~ as.factor(ExposureLevel) + MAGE_R + (1 | FIPS)",
  output_dir = "results_logistic/",
  targets = "preterm",
  model_type = "mixed",
  model_family = "binomial"
)
} # }