Using geoExposeR to Model Health Effects of Environmental Exposures

Introduction

Welcome to the geoExposeR package vignette. This document provides a comprehensive guide to using the package for estimating environmental exposures and assessing their association with health outcomes.

While originally developed for groundwater arsenic exposure assessment, the geoExposeR framework is generalizable to any environmental exposure where probabilistic models and measured concentration data are available. Applications include other drinking water contaminants (nitrate, uranium, PFAS, manganese), air quality (PM2.5, ozone), and other spatially-varying exposures.

The core methodology implemented in this package is based on the approaches used in foundational epidemiological studies by Bulka and others (2022), Lombard and others (2021), Angley and others (2026), and Nigra and others (2025).

Package Features

geoExposeR provides two main categories of functionality:

1. Data Ingestion Functions - Prepare raw data for analysis:

Function	Purpose
`prepare_prob_data()`	Extract arsenic probabilities from GeoTIFF rasters
`prepare_conc_data()`	Convert concentration measurements to lognormal parameters
`format_health_data()`	Standardize health outcome data formats
`add_well_percentages()`	Merge private well population percentages
`subset_by_geography()`	Filter data by state, region, or custom area
`validate_prepared_inputs()`	Validate all input files before analysis

2. Core Analysis Function - Run the sensitivity analysis:

Function	Purpose
`perform_sensitivity_analysis()`	Complete workflow from data loading to pooled results

This vignette will cover:

The conceptual background of the statistical model.
The required data formats for the analysis.
Data preparation workflow using data ingestion functions.
A complete, step-by-step example from data preparation to results interpretation.
Working with GeoTIFF rasters and other data formats.
Broader applications beyond arsenic (nitrate, uranium, PM2.5, and more).

Conceptual Framework

The primary challenge in studying the health effects of arsenic in private wells is the uncertainty in exposure assessment. Private wells are not federally regulated, and direct measurements for large populations are often unavailable. geoExposeR addresses this by implementing a multi-stage approach:

Probabilistic Exposure Modeling: The package combines two sources of arsenic prediction data:
- USGS Models: Machine learning models (Lombard and others, 2021) that provide a multinomial probability for arsenic concentration falling into one of three categories (<5, 5-10, >10 µg/L) for grid cells across the U.S.
- EPA Data: County-level estimates of arsenic concentrations from public supply wells aggregated by Ravalli and others 2022. These are combined, weighted by the proportion of the population using private vs. public wells, to create a county-specific probability distribution of arsenic exposure.
Multiple Imputation: Because we don’t know the true arsenic exposure for each individual, we treat it as missing data. The package uses a multiple imputation framework to create n complete datasets (ndraws). In each dataset, a plausible arsenic exposure category is assigned to each county based on the probability distribution derived in the previous step. The framework also imputes missing values in any user-specified covariates.
Regression Analysis and Pooling: A regression model is fitted to each of the n imputed datasets. The results (coefficients and standard errors) from these models are then pooled using Rubin’s Rules to produce a single set of estimates that properly accounts for the uncertainty introduced by the imputation process. The model defaults to a linear mixed-effects model, but the effects structure and response distribution are configurable (see Choosing a Regression Model below).

Choosing a Regression Model

Two arguments control the regression model fitted in step 3:

model_type selects the effects structure: "fixed" (fixed effects only, no random effects), "random" (random intercepts only), or "mixed" (the default; fixed plus random effects).
model_family selects the response distribution: NULL (the default) gives a linear Gaussian model for continuous outcomes; a family such as "binomial" or "poisson" gives a non-linear generalized model for binary or count outcomes; and "multinomial" gives a multinomial logistic model for multi-level categorical outcomes.

The fitting engine follows automatically from the combination:

`model_type`	linear (`model_family = NULL`)	non-linear (`model_family` set)	multinomial
`"fixed"`	`stats::lm()`	`stats::glm()`	`nnet::multinom()`
`"random"`	`lme4::lmer()`	`lme4::glmer()`	`mclogit::mblogit()`
`"mixed"`	`lme4::lmer()`	`lme4::glmer()`	`mclogit::mblogit()`

Fixed-effects models must not include random-effects terms in regression_formula, while random- and mixed-effects models require at least one (e.g. (1 | county)). For non-linear models the pooled estimate is on the link scale (for example, a log-odds ratio when model_family = "binomial"). Multinomial models (3+ outcome categories) add a y.level column to the pooled results identifying the outcome category. The examples/run_geoExposeR_model_variants.R and examples/run_geoExposeR_fixed_multi_target.R scripts demonstrate these options end to end.

Data Requirements

The perform_sensitivity_analysis() function requires three main data inputs.

1. Modeled Probability Data (`model_prob_csv`)

A CSV file containing the multinomial probability of arsenic levels from the modeled probability data for each geographic unit.

Default Column Names (can be customized via function parameters):
- prob_C1: Probability of arsenic < 5 µg/L (Category 1 - low exposure)
- prob_C2: Probability of 5 ≤ arsenic < 10 µg/L (Category 2 - moderate exposure)
- prob_C3: Probability of arsenic ≥ 10 µg/L (Category 3 - above EPA MCL)
- Wells_2010: Numeric population of well users for each geographic unit
- GEOID10: A geographic or participant identifier (e.g., FIPS code, census tract, ZIP code, or custom ID)

Note on column naming: The prob_C1, prob_C2, prob_C3 names match the USGS ScienceBase data release. If your data uses different column names, specify them via the prob_cols parameter:

# Using custom probability column names
results <- perform_sensitivity_analysis(
  prob_cols = c("prob_low", "prob_med", "prob_high"),  # Your column names
  # ... other parameters
)

The probabilities must sum to 1.0 for each row.

2. Measured Concentration Data (`conc_data_csv`)

A CSV file containing the concentration data compiled by Ravalli and others, 2022.

Default Column Names (can be customized via function parameters):
- conc_meanlog: Numeric mean (on a log scale) of the arsenic concentration
- PWELL_private_pct: Numeric percentage of the population on private wells (from 0 to 100)
- GEOID10: A geographic or participant identifier matching the probability data

Automatic Lognormal Conversion

Alternatively, if your concentration data contains raw arsenic concentrations instead of pre-calculated lognormal parameters, you can use the conc_raw_col parameter to have geoExposeR automatically convert them:

Alternative Required Columns (when using conc_raw_col):
- A column with raw arsenic concentration values (specified by conc_raw_col)
- PWELL_private_pct: Numeric percentage of the population on private wells
- A geographic or participant identifier matching the probability data

The conversion uses the formula: * meanlog = log(concentration + 0.1) (offset avoids log(0)) * sdlog = default_sdlog (defaults to 1.0 if not specified)

# With automatic conversion from raw concentrations
results <- perform_sensitivity_analysis(
  conc_data_csv = "conc_raw.csv",
  conc_raw_col = "arsenic_ppb",     # Column with raw concentrations
  default_sdlog = 1.0,              # Optional, defaults to 1.0
  # ... other parameters ...
)

3. Health Outcome Data (`birth_data`)

An R data.frame containing individual-level health outcome and covariate data.

Column names in the health data are fully user-defined. The only requirement is that column names match those referenced in your parameters and regression formula.

Configurable Columns:
- Identifier column (default "FIPS", set via record_id_col): A geographic or participant identifier matching the arsenic data
- Outcome columns (default c("OEGEST", "BWT"), set via targets): Columns for health outcomes
- Covariate columns: Any columns referenced in regression_formula

Example with custom column names:

# Your health data has: participant_id, birthweight_g, gest_weeks, mom_age, smoking
results <- perform_sensitivity_analysis(
  record_id_col = "participant_id",           # Your ID column name
  targets = c("birthweight_g", "gest_weeks"), # Your outcome column names
  regression_formula = "~ as.factor(ExposureLevel) + mom_age + smoking + (1|participant_id)",
  # ... other parameters
)

Data Preparation Workflow

The geoExposeR package provides a suite of data ingestion functions to help prepare raw data for analysis. These functions are particularly useful when working with GeoTIFF rasters, raw concentration measurements, or health data in various formats.

Data Ingestion Functions Overview

Function	Purpose
`prepare_prob_data()`	Extract arsenic probabilities from GeoTIFF rasters
`prepare_conc_data()`	Convert concentration measurements to lognormal parameters
`format_health_data()`	Standardize health outcome data formats
`add_well_percentages()`	Merge private well population percentages
`subset_by_geography()`	Filter data by state or geographic region
`validate_prepared_inputs()`	Validate all input files before analysis

Preparing Probability Data from GeoTIFF Rasters

If you have arsenic probability surfaces as GeoTIFF rasters, use prepare_prob_data() to extract and aggregate probabilities by geographic unit.

Data Citation: > Lombard, M.A., 2021, Data used to model and map arsenic concentration exceedances in private wells throughout the conterminous United States for human health studies: U.S. Geological Survey data release, https://doi.org/10.5066/P90RBJXS.

library(terra)
library(sf)

# Define raster paths (exceedance probabilities)
prob_rasters <- list(
  gt5 = "path/to/prob_gt5.tif",    # P(As > 5 µg/L)
  gt10 = "path/to/prob_gt10.tif"   # P(As > 10 µg/L)
)

# Load county boundaries (e.g., from Census TIGER files)
county_boundaries <- sf::st_read("path/to/county_boundaries.shp")

# Extract and convert to multinomial probabilities
prob_processed <- prepare_prob_data(
  prob_rasters = prob_rasters,
  boundaries = county_boundaries,
  id_col = "GEOID",
  extraction_method = "mean",  # or "centroid", "weighted"
  cutoffs = c(5, 10),
  output_file = "prob_arsenic.csv"
)

The function automatically converts exceedance probabilities to multinomial category probabilities (P(As < 5), P(5 ≤ As < 10), P(As ≥ 10)).

Preparing Concentration Data

Use prepare_conc_data() to convert raw arsenic measurements to lognormal distribution parameters:

# From raw measurements
conc_processed <- prepare_conc_data(
 conc_data = raw_conc_data,
  id_col = "county_fips",
  concentration_col = "arsenic_ug_L",
  population_col = "pop_served",
  private_well_pct_col = "pwell_pct",
  input_type = "raw",
  output_file = "conc_lognormal.csv"
)

# From pre-aggregated summary statistics
conc_processed <- prepare_conc_data(
  conc_data = summary_data,
  id_col = "county_fips",
  mean_col = "mean_arsenic",
  sd_col = "sd_arsenic",
  input_type = "summary"
)

Formatting Health Data

Use format_health_data() to standardize health outcome data. The package uses smart identifier formatting (enabled by default via format_ids = TRUE) that automatically detects ID types:

Numeric FIPS codes (1-5 digits): Zero-padded to 5 digits (e.g., 1 → "00001")
Longer numeric codes (census tracts, ZIP+4): Converted to character without modification
Alphanumeric IDs (participant IDs, custom codes): Preserved unchanged

Set format_ids = FALSE in perform_sensitivity_analysis() to skip formatting and use identifiers as-is.

health_formatted <- format_health_data(
  health_data = raw_health_data,
  id_col = "county_id",           # Works with any identifier format
  outcome_cols = c("BWT", "OEGEST", "SBP"),
  covariate_cols = c("MAGE_R", "smoke", "MRACEHISP_F"),
  geoid_digits = 5,                  # For numeric IDs, pads to specified digits
  validate = TRUE,
  output_file = "health_outcomes.txt"
)

Subsetting by Geography

Use subset_by_geography() to filter data to specific states or regions:

# Subset to California and Nevada (state FIPS codes)
western_data <- subset_by_geography(
  data = prob_processed,
  id_col = "GEOID10",
  state_fips = c("06", "32")
)

# Subset to specific counties
selected_counties <- subset_by_geography(
  data = prob_processed,
  id_col = "GEOID10",
  geoid_list = c("06001", "06003", "06005", "32003")
)

Validating Prepared Inputs

Before running the analysis, validate all input files using validate_prepared_inputs():

validation <- validate_prepared_inputs(
  prob_file = "prob_arsenic.csv",
  conc_file = "conc_lognormal.csv",
  health_file = "health_outcomes.txt"
)

if (validation$valid) {
  message("All files validated successfully!")
  message("Summary:")
  message("  Probability records: ", validation$summary$prob$n_rows)
  message("  Concentration records: ", validation$summary$conc$n_rows)
  message("  Health records: ", validation$summary$health$n_rows)
} else {
  message("Validation issues found:")
  for (issue in validation$issues) {
    message("  - ", issue)
  }
}

A Complete Walkthrough

Let’s walk through a full example.

Step 1: Setup

First, load the necessary libraries.

library(geoExposeR)
library(lme4) # Loaded by geoExposeR, but good practice for clarity
#> Loading required package: Matrix

Step 2: Load Example Data

The geoExposeR package includes example data files in the examples/input_data/ folder. These files demonstrate the expected data formats:

prob_dom.csv: Modeled probability data with multinomial probabilities of arsenic levels
conc_dom.csv: Measured concentration data with raw concentration values
Demographics_dom.txt: Demographics data with health outcomes (SBP)

# Define paths to the example data files included with the package
# When running from the package source directory:
example_dir <- file.path("examples", "input_data")

# Define file paths
example_files <- list(
  prob = file.path(example_dir, "prob_dom.csv"),
  conc = file.path(example_dir, "conc_dom.csv"),
  health = file.path(example_dir, "Demographics_dom.txt")
)

# Verify files exist
cat("Example data files:\n")
cat("  Probability data:", example_files$prob, "\n")
cat("  Concentration data:", example_files$conc, "\n")
cat("  Health data:", example_files$health, "\n")

The example data files have the following structure:

# Preview probability data
prob_preview <- data.table::fread(example_files$prob, nrows = 5)
cat("\nProbability data columns:", paste(names(prob_preview), collapse = ", "), "\n")
# Columns: id, pub_water_pop, RFCprobC1, RFCprobC2, RFCprobC3, RFC_maxprob, RFC_maxprob_num, Dom_Pop_Percent

# Preview concentration data
conc_preview <- data.table::fread(example_files$conc, nrows = 5)
cat("Concentration data columns:", paste(names(conc_preview), collapse = ", "), "\n")
# Columns: id, weighted_as20062008, county_pop_served20062008, well, Dom_Pop_Percent

# Preview health/demographics data
health_preview <- data.table::fread(example_files$health, nrows = 5)
cat("Health data columns:", paste(names(health_preview), collapse = ", "), "\n")
# Columns: id, SBP, age_decade

Step 3: Define Analysis Parameters

Next, define all the parameters for the analysis, including the regression formula. Note that we use column names matching the example data files.

# Use a small number for the vignette; recommend >= 20 for real analysis
ndraws <- 5
targets <- c("SBP") # Systolic Blood Pressure (from Demographics_dom.txt)
output_dir <- "analysis_results"
exposure_level_col <- "ExposureLevel"

# Arsenic category labels (matching USGS 3-category model)
cat_labels <- c("<5", "5-10", "10+")

# Reference category for regression (lowest exposure)
drop_cat_label_ref <- c("<5")

# Define the regression formula.
# We use a geographic random INTERCEPT by county, (1 | id): arsenic exposure is
# assigned at the county level while SBP is measured on individuals nested
# within counties (the demographics file holds multiple participants per
# county). Each county gets its own baseline SBP, separating between-county from
# within-county variation. There is no random slope, so the arsenic effect is
# assumed constant across counties.
regression_formula <- paste0(
  "~ as.factor(", exposure_level_col, ") + ",
  "(1 | id)"
)

A random intercept by age_decade, (1 | age_decade), is also possible, but it has a different meaning: it treats age as a contextual grouping variable (a separate baseline per age band, no random slope) rather than modelling individual age, and it estimates the between-group variance from only a handful of levels. If you simply want to adjust for age, prefer a fixed effect such as + as.factor(age_decade). See examples/run_geoExposeR_model_variants.R for these alternatives.

Step 4: Run the Analysis

With all parameters defined, call the main function perform_sensitivity_analysis(). Note that we specify custom column names to match the example data file structure.

# This chunk may take a moment to run
results <- perform_sensitivity_analysis(
  birth_data = example_files$health,
  model_prob_csv = example_files$prob,
  conc_data_csv = example_files$conc,
  ndraws = ndraws,
  regression_formula = regression_formula,
  output_dir = output_dir,
  targets = targets,
  cat_label = cat_labels,
  drop_cat_label_ref = drop_cat_label_ref,
  # Custom column names matching the example data files
  id_col = "id",                                           # ID column in prob/conc data
  prob_cols = c("RFCprobC1", "RFCprobC2", "RFCprobC3"), # Probability columns
  pop_well_col = "pub_water_pop",                          # Population column in prob data
  record_id_col = "id",                                    # ID column in health data
  exposure_level_col = exposure_level_col,
  # Use raw concentration column for automatic lognormal conversion
  conc_raw_col = "weighted_as20062008",                 # Raw arsenic concentration
  default_sdlog = 1.0,                                 # Default sdlog value
  pwell_col = "Dom_Pop_Percent",                       # Private well percentage
  conc_cutoffs = c(5, 10),
  # Imputation settings
  seed = 12345,
  mice_m = 1,
  mice_maxit = 5,
  mice_method = "pmm",
  mice_covs = NULL,
  impute_vars = NULL,
  apply_imputation_fallback = TRUE,
  rucc_col = NULL
)

Step 5: Interpret the Results

The results object is a list where each element is a mipo object from the mice package. To view a publication-ready table of the pooled results, use summary().

Let’s examine the results for Systolic Blood Pressure (SBP).

sbp_summary <- results$SBP
knitr::kable(
  sbp_summary,
  caption = "Pooled Regression Results for Systolic Blood Pressure (SBP)"
)

The summary table includes: * term: The regression model term. * q.mi: The pooled point estimate (e.g., the mean difference in SBP). * se.mi: The pooled standard error. * statistic: The t-statistic. * conf.low: The lower bound of the 95% confidence interval. * conf.high: The upper bound of the 95% confidence interval. * p.value: The p-value for the term.

From this table, you can assess the magnitude, direction, and statistical significance of the association between different levels of arsenic exposure and the health outcome, while accounting for uncertainty from both exposure and covariate imputation.

Step 6: Cleanup (Optional)

Remove the output directory created during the analysis if desired.

unlink(output_dir, recursive = TRUE)

Additional Examples

The geoExposeR package includes several complete working examples in the examples/ folder demonstrating both the core arsenic analysis workflow and broader applications to other environmental exposures.

1. Basic CSV Workflow (`run_geoExposeR_example.R`)

The standard example using pre-formatted CSV data:

# Run from package directory
source("examples/run_geoExposeR_example.R")

This example demonstrates: - Loading data with custom column names - Preprocessing concentration data to lognormal parameters - Running the complete sensitivity analysis - Interpreting pooled regression results

2. GeoTIFF Workflow (`run_geoExposeR_from_geotiff.R`)

A complete workflow starting from GeoTIFF raster data:

# Run from package directory
source("examples/run_geoExposeR_from_geotiff.R")

This example demonstrates: - Creating/loading GeoTIFF probability surfaces - Building geographic boundaries for extraction - Using prepare_prob_data() to extract probabilities - Using prepare_conc_data() to process concentration measurements - Using format_health_data() to standardize health data - Using validate_prepared_inputs() for quality control - Running the complete analysis pipeline

3. Data Pipeline Demo (`run_geoExposeR_data_pipeline.R`)

A comprehensive demonstration of all data ingestion functions:

# Run from package directory
source("examples/run_geoExposeR_data_pipeline.R")

This example demonstrates: - prepare_conc_data() with both raw and summary input types - format_health_data() with different data formats - add_well_percentages() for merging Census well data

Domestic Well Data Citation: > Johnson, T.D., and Belitz, K., 2019, Domestic well locations and populations served in the contiguous U.S.: datasets for decadal years 2000 and 2010: U.S. Geological Survey data release, https://doi.org/10.5066/P9FSLU3B.

subset_by_geography() for filtering by state or GEOID
validate_prepared_inputs() for comprehensive validation

Broader Applications

While geoExposeR was originally developed for groundwater arsenic exposure assessment, its framework is generalizable to any environmental exposure where:

You have probabilistic exposure estimates from models or spatial predictions
You have measured concentration data from monitoring systems
You want to assess health outcomes while accounting for exposure uncertainty

The package includes three additional examples demonstrating these broader applications:

4. Nitrate and Infant Methemoglobinemia (`run_geoExposeR_nitrate.R`)

Demonstrates analyzing nitrate contamination in drinking water and its association with infant methemoglobinemia (blue baby syndrome):

# Run from package directory
source("examples/run_geoExposeR_nitrate.R")

Key adaptations:

Exposure cutoffs: c(5, 10) mg/L NO₃-N (EPA MCL = 10 mg/L)
Health outcome: Methemoglobin percentage in infants
Data sources:
- prob_nitrate.csv: Modeled nitrate probability distributions
- conc_nitrate.csv: Public water system nitrate measurements
- Health_nitrate.txt: Infant health records with methemoglobin levels

# Example parameter configuration for nitrate
nitrate_cutoffs <- c(5, 10)  # mg/L NO3-N
cat_labels <- c("NO3<5", "NO3_5-10", "NO3_10+")
targets <- c("methemoglobin_pct")

results <- perform_sensitivity_analysis(
  birth_data_txt = "path/to/Health_nitrate.txt",
  model_prob_csv = "path/to/prob_nitrate.csv",
  conc_data_csv = "path/to/conc_nitrate.csv",
  conc_cutoffs = nitrate_cutoffs,
  cat_label = cat_labels,
  targets = targets,
  conc_raw_col = "weighted_nitrate",
  default_sdlog = 0.8,
  # ... other parameters
)

5. Uranium and Kidney Function (`run_geoExposeR_uranium.R`)

Demonstrates analyzing uranium exposure and its nephrotoxic effects on kidney function (eGFR):

# Run from package directory
source("examples/run_geoExposeR_uranium.R")

Key adaptations:

Exposure cutoffs: c(15, 30) µg/L (EPA MCL = 30 µg/L)
Health outcome: Estimated Glomerular Filtration Rate (eGFR)
Expected relationship: Negative coefficients (higher uranium → lower eGFR)

# Example parameter configuration for uranium
uranium_cutoffs <- c(15, 30)  # µg/L
cat_labels <- c("U<15", "U_15-30", "U_30+")
targets <- c("eGFR")

results <- perform_sensitivity_analysis(
  birth_data_txt = "path/to/Health_uranium.txt",
  model_prob_csv = "path/to/prob_uranium.csv",
  conc_data_csv = "path/to/conc_uranium.csv",
  conc_cutoffs = uranium_cutoffs,
  cat_label = cat_labels,
  targets = targets,
  conc_raw_col = "weighted_uranium",
  default_sdlog = 0.9,
  # ... other parameters
)

6. PM2.5 Air Quality and Respiratory Function (`run_geoExposeR_pm25.R`)

Demonstrates extending the framework beyond groundwater to air quality exposures:

# Run from package directory
source("examples/run_geoExposeR_pm25.R")

Key adaptations:

Exposure cutoffs: c(9, 35) µg/m³ (EPA annual standard = 9 µg/m³; 24-hour standard = 35 µg/m³)
Health outcome: FEV1 % predicted (lung function)
Data sources reinterpreted:
- Probability data parameter → Satellite/hybrid model predictions
- Concentration data parameter → Ground monitoring station measurements

# Example parameter configuration for PM2.5
pm25_cutoffs <- c(9, 35)  # µg/m³
cat_labels <- c("PM25<9", "PM25_9-35", "PM25_35+")
targets <- c("FEV1_pct_predicted")

results <- perform_sensitivity_analysis(
  birth_data_txt = "path/to/Health_pm25.txt",
  model_prob_csv = "path/to/prob_pm25.csv",
  conc_data_csv = "path/to/conc_pm25.csv",
  conc_cutoffs = pm25_cutoffs,
  cat_label = cat_labels,
  targets = targets,
  conc_raw_col = "weighted_pm25",
  default_sdlog = 0.5,  # Lower variance for air quality
  # ... other parameters
)

Adapting to Other Exposures

The framework can be adapted to any exposure with the following data structure:

geoExposeR Parameter	Original Use	Generalized Use
`model_prob_csv`	Modeled arsenic probabilities	Any probabilistic exposure model
`conc_data_csv`	Measured drinking water concentrations	Any measured concentration data
`conc_cutoffs`	Arsenic thresholds (5, 10 µg/L)	Any health-relevant thresholds
`cat_label`	Arsenic category names	Custom exposure category labels
`targets`	Birth outcomes	Any health outcome variables

Potential applications include:

Other groundwater contaminants: Manganese, fluoride, radium, PFAS
Air quality: Ozone, NO₂, SO₂
Soil contamination: Lead, heavy metals
Noise exposure: Traffic noise and cardiovascular outcomes
Heat exposure: Temperature and mortality/morbidity

References

Foundational Studies

Bulka, C.M., Bryan, M.S., Lombard, M.A., Bartell, S.M., Jones, D.K., Bradley, P.M., Vieira, V.M., Silverman, D.T., Focazio, M., Toccalino, P.L., Daniel, J., Backer, L.C., Ayotte, J.D., Gribble, M.O., and Argos, M., 2022, Arsenic in private well water and birth outcomes in the United States: Environment International, v. 163, article 107176, https://doi.org/10.1016/j.envint.2022.107176.

Lombard, M.A., Bryan, M.S., Jones, D.K., Bulka, C., Bradley, P.M., Backer, L.C., Focazio, M.J., Silverman, D.T., Toccalino, P., Argos, M., Gribble, M.O., and Ayotte, J.D., 2021, Machine learning models of arsenic in private wells throughout the conterminous United States as a tool for exposure assessment in human health studies: Environmental Science & Technology, v. 55, no. 8, p. 5012–5023, https://doi.org/10.1021/acs.est.0c05239.

Nigra, A. E., Bloomquist, T. R., Rajeev, T., Burjak, M., Casey, J. A., Goin, D. E., Herbstman, J. B., Ornelas Van Horne, Y., Wylie, B. J., Cerna-Turoff, I., Braun, J. M., McArthur, K. L., Karagas, M. R., Ames, J. L., Sherris, A. R., Bulka, C. M., Padula, A. M., Howe, C. G., Fry, R. C., … Consortium, E. C., 2025, Public Water Arsenic and Birth Outcomes in the Environmental Influences on Child Health Outcomes Cohort, JAMA Network Open, v. 8, no. 6, p. e2514084–e2514084. https://doi.org/10.1001/jamanetworkopen.2025.14084

Angley, M., Zhang, Y., Nigra, A. E., Lombard, M. A., Gribble, M. O., Lu, L., Unverzagt, F. W., McClure, L. A., Judd, S. E., Cushman, M., Brockman, J., and Kahe, K., 2026, Drinking water arsenic, urinary arsenic biomarkers, and cognitive impairment in the REGARDS Study, Environmental Research, 123768, https://doi.org/10.1016/j.envres.2026.123768.

Statistical Methods

Rubin, D.B., 1987, Multiple imputation for nonresponse in surveys: New York, John Wiley & Sons, https://doi.org/10.1002/9780470316696.

van Buuren, S., and Groothuis-Oudshoorn, K., 2011, mice—Multivariate imputation by chained equations in R: Journal of Statistical Software, v. 45, no. 3, p. 1–67, https://doi.org/10.18637/jss.v045.i03.

Honaker, J., King, G., and Blackwell, M., 2011, Amelia II—A program for missing data: Journal of Statistical Software, v. 45, no. 7, p. 1–47, https://doi.org/10.18637/jss.v045.i07.

Bates, D., Mächler, M., Bolker, B., and Walker, S., 2015, Fitting linear mixed-effects models using lme4: Journal of Statistical Software, v. 67, no. 1, p. 1–48, https://doi.org/10.18637/jss.v067.i01.

Venables, W.N., and Ripley, B.D., 2002, Modern applied statistics with S (4th ed.): New York, Springer, https://doi.org/10.1007/978-0-387-21706-2. [multinomial logistic regression via the nnet package]

Elff, M., 2024, mclogit—Multinomial logit models, with or without random effects or overdispersion: R package, https://doi.org/10.32614/CRAN.package.mclogit.

Data Sources

Lombard, M.A., 2021, Data used to model and map arsenic concentration exceedances in private wells throughout the conterminous United States for human health studies: U.S. Geological Survey data release, https://doi.org/10.5066/P90RBJXS.

Johnson, T.D., and Belitz, K., 2019, Domestic well locations and populations served in the contiguous U.S.—Datasets for decadal years 2000 and 2010: U.S. Geological Survey data release, https://doi.org/10.5066/P9FSLU3B.

DeSimone, L.A., McMahon, P.B., and Rosen, M.R., 2014, The quality of our Nation’s waters—Water quality in principal aquifers of the United States, 1991–2010: U.S. Geological Survey Circular 1360, 151 p., https://pubs.usgs.gov/circ/1360/.

Ravalli, F., Yu, Y., Bostick, B. C., Chillrud, S. N., Schilling, K., Basu, A., Navas-Acien, A., Nigra, A. E., 2022, Sociodemographic inequalities in uranium and other metals in community water systems across the USA, 2006-11: a cross-sectional study: a cross-sectional study. The Lancet Planetary Health, 6(4), e320–e330, https://doi.org/10.1016/S2542-5196(22)00043-2

R Packages and Resources

R Core Team, 2024, R: A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, https://www.R-project.org/.

Wickham, H., and Bryan, J., 2023, R packages (2nd ed.): Sebastopol, California, O’Reilly Media, https://r-pkgs.org/.

Barrett, T., Dowle, M., Srinivasan, A., Gorecki, J., Chirico, M., Hocking, T., Schwendinger, B., and Krylov, I., 2025, data.table—Extension of data.frame: R package version 1.17.99, https://r-datatable.com.

Hijmans, R.J., 2024, terra—Spatial data analysis: R package version 1.7-78, https://CRAN.R-project.org/package=terra.

Pebesma, E., and Bivand, R., 2024, sf—Simple features for R: R package version 1.0-16, https://CRAN.R-project.org/package=sf.

Baston, D., 2024, exactextractr—Fast extraction from raster datasets using polygons: R package version 0.10.0, https://CRAN.R-project.org/package=exactextractr.

Sayantan Majumdar, Scott M. Bartell, Melissa A. Lombard, Ryan G. Smith, and Matthew O. Gribble

2026-06-17

Introduction

Package Features

Conceptual Framework

Choosing a Regression Model

Data Requirements

1. Modeled Probability Data (`model_prob_csv`)

2. Measured Concentration Data (`conc_data_csv`)

Automatic Lognormal Conversion

3. Health Outcome Data (`birth_data`)

Data Preparation Workflow

Data Ingestion Functions Overview

Preparing Probability Data from GeoTIFF Rasters

Preparing Concentration Data

Formatting Health Data

Subsetting by Geography

Validating Prepared Inputs

A Complete Walkthrough

Step 1: Setup

Step 2: Load Example Data

Step 3: Define Analysis Parameters

Step 4: Run the Analysis

Step 5: Interpret the Results

Step 6: Cleanup (Optional)

Additional Examples

1. Basic CSV Workflow (`run_geoExposeR_example.R`)

2. GeoTIFF Workflow (`run_geoExposeR_from_geotiff.R`)

3. Data Pipeline Demo (`run_geoExposeR_data_pipeline.R`)

Broader Applications

4. Nitrate and Infant Methemoglobinemia (`run_geoExposeR_nitrate.R`)

5. Uranium and Kidney Function (`run_geoExposeR_uranium.R`)

6. PM2.5 Air Quality and Respiratory Function (`run_geoExposeR_pm25.R`)

Adapting to Other Exposures

References

Foundational Studies

Statistical Methods

Data Sources

R Packages and Resources

Using geoExposeR to Model Health Effects of Environmental Exposures

Sayantan Majumdar, Scott M. Bartell, Melissa A. Lombard, Ryan G. Smith, and Matthew O. Gribble

2026-06-17

Introduction

Package Features

Conceptual Framework

Choosing a Regression Model

Data Requirements

1. Modeled Probability Data (model_prob_csv)

2. Measured Concentration Data (conc_data_csv)

Automatic Lognormal Conversion

3. Health Outcome Data (birth_data)

Data Preparation Workflow

Data Ingestion Functions Overview

Preparing Probability Data from GeoTIFF Rasters

Preparing Concentration Data

Formatting Health Data

Subsetting by Geography

Validating Prepared Inputs

A Complete Walkthrough

Step 1: Setup

Step 2: Load Example Data

Step 3: Define Analysis Parameters

Step 4: Run the Analysis

Step 5: Interpret the Results

Step 6: Cleanup (Optional)

Additional Examples

1. Basic CSV Workflow (run_geoExposeR_example.R)

2. GeoTIFF Workflow (run_geoExposeR_from_geotiff.R)

3. Data Pipeline Demo (run_geoExposeR_data_pipeline.R)

Broader Applications

4. Nitrate and Infant Methemoglobinemia (run_geoExposeR_nitrate.R)

5. Uranium and Kidney Function (run_geoExposeR_uranium.R)

6. PM2.5 Air Quality and Respiratory Function (run_geoExposeR_pm25.R)

Adapting to Other Exposures

References

Foundational Studies

Statistical Methods

Data Sources

R Packages and Resources

1. Modeled Probability Data (`model_prob_csv`)

2. Measured Concentration Data (`conc_data_csv`)

3. Health Outcome Data (`birth_data`)

1. Basic CSV Workflow (`run_geoExposeR_example.R`)

2. GeoTIFF Workflow (`run_geoExposeR_from_geotiff.R`)

3. Data Pipeline Demo (`run_geoExposeR_data_pipeline.R`)

4. Nitrate and Infant Methemoglobinemia (`run_geoExposeR_nitrate.R`)

5. Uranium and Kidney Function (`run_geoExposeR_uranium.R`)

6. PM2.5 Air Quality and Respiratory Function (`run_geoExposeR_pm25.R`)