Using geoExposeR to Model Health Effects of Environmental Exposures
Sayantan Majumdar, Scott M. Bartell, Melissa A. Lombard, Ryan G. Smith, and Matthew O. Gribble
2026-06-17
geoExposeR-vignette.RmdIntroduction
Welcome to the geoExposeR package vignette. This
document provides a comprehensive guide to using the package for
estimating environmental exposures and assessing their association with
health outcomes.
While originally developed for groundwater arsenic exposure
assessment, the geoExposeR framework is
generalizable to any environmental exposure where probabilistic models
and measured concentration data are available. Applications include
other drinking water contaminants (nitrate, uranium, PFAS, manganese),
air quality (PM2.5, ozone), and other spatially-varying exposures.
The core methodology implemented in this package is based on the approaches used in foundational epidemiological studies by Bulka and others (2022), Lombard and others (2021), Angley and others (2026), and Nigra and others (2025).
Package Features
geoExposeR provides two main categories of
functionality:
1. Data Ingestion Functions - Prepare raw data for analysis:
| Function | Purpose |
|---|---|
prepare_prob_data() |
Extract arsenic probabilities from GeoTIFF rasters |
prepare_conc_data() |
Convert concentration measurements to lognormal parameters |
format_health_data() |
Standardize health outcome data formats |
add_well_percentages() |
Merge private well population percentages |
subset_by_geography() |
Filter data by state, region, or custom area |
validate_prepared_inputs() |
Validate all input files before analysis |
2. Core Analysis Function - Run the sensitivity analysis:
| Function | Purpose |
|---|---|
perform_sensitivity_analysis() |
Complete workflow from data loading to pooled results |
This vignette will cover:
- The conceptual background of the statistical model.
- The required data formats for the analysis.
- Data preparation workflow using data ingestion functions.
- A complete, step-by-step example from data preparation to results interpretation.
- Working with GeoTIFF rasters and other data formats.
- Broader applications beyond arsenic (nitrate, uranium, PM2.5, and more).
Conceptual Framework
The primary challenge in studying the health effects of arsenic in
private wells is the uncertainty in exposure assessment. Private wells
are not federally regulated, and direct measurements for large
populations are often unavailable. geoExposeR addresses
this by implementing a multi-stage approach:
-
Probabilistic Exposure Modeling: The package combines two sources of arsenic prediction data:
-
USGS Models: Machine learning models (Lombard and
others, 2021) that provide a multinomial probability for arsenic
concentration falling into one of three categories (
<5,5-10,>10µg/L) for grid cells across the U.S. - EPA Data: County-level estimates of arsenic concentrations from public supply wells aggregated by Ravalli and others 2022. These are combined, weighted by the proportion of the population using private vs. public wells, to create a county-specific probability distribution of arsenic exposure.
-
USGS Models: Machine learning models (Lombard and
others, 2021) that provide a multinomial probability for arsenic
concentration falling into one of three categories (
Multiple Imputation: Because we don’t know the true arsenic exposure for each individual, we treat it as missing data. The package uses a multiple imputation framework to create
ncomplete datasets (ndraws). In each dataset, a plausible arsenic exposure category is assigned to each county based on the probability distribution derived in the previous step. The framework also imputes missing values in any user-specified covariates.Regression Analysis and Pooling: A regression model is fitted to each of the
nimputed datasets. The results (coefficients and standard errors) from these models are then pooled using Rubin’s Rules to produce a single set of estimates that properly accounts for the uncertainty introduced by the imputation process. The model defaults to a linear mixed-effects model, but the effects structure and response distribution are configurable (see Choosing a Regression Model below).
Choosing a Regression Model
Two arguments control the regression model fitted in step 3:
-
model_typeselects the effects structure:"fixed"(fixed effects only, no random effects),"random"(random intercepts only), or"mixed"(the default; fixed plus random effects). -
model_familyselects the response distribution:NULL(the default) gives a linear Gaussian model for continuous outcomes; a family such as"binomial"or"poisson"gives a non-linear generalized model for binary or count outcomes; and"multinomial"gives a multinomial logistic model for multi-level categorical outcomes.
The fitting engine follows automatically from the combination:
model_type |
linear (model_family = NULL) |
non-linear (model_family set) |
multinomial |
|---|---|---|---|
"fixed" |
stats::lm() |
stats::glm() |
nnet::multinom() |
"random" |
lme4::lmer() |
lme4::glmer() |
mclogit::mblogit() |
"mixed" |
lme4::lmer() |
lme4::glmer() |
mclogit::mblogit() |
Fixed-effects models must not include random-effects terms in
regression_formula, while random- and mixed-effects models
require at least one (e.g. (1 | county)). For non-linear
models the pooled estimate is on the link scale (for example, a log-odds
ratio when model_family = "binomial"). Multinomial models
(3+ outcome categories) add a y.level column to the pooled
results identifying the outcome category. The
examples/run_geoExposeR_model_variants.R and
examples/run_geoExposeR_fixed_multi_target.R scripts
demonstrate these options end to end.
Data Requirements
The perform_sensitivity_analysis() function requires
three main data inputs.
1. Modeled Probability Data (model_prob_csv)
A CSV file containing the multinomial probability of arsenic levels from the modeled probability data for each geographic unit.
-
Default Column Names (can be customized via
function parameters):
-
prob_C1: Probability of arsenic < 5 µg/L (Category 1 - low exposure) -
prob_C2: Probability of 5 ≤ arsenic < 10 µg/L (Category 2 - moderate exposure) -
prob_C3: Probability of arsenic ≥ 10 µg/L (Category 3 - above EPA MCL) -
Wells_2010: Numeric population of well users for each geographic unit -
GEOID10: A geographic or participant identifier (e.g., FIPS code, census tract, ZIP code, or custom ID)
-
Note on column naming: The prob_C1,
prob_C2, prob_C3 names match the USGS
ScienceBase data release. If your data uses different column names,
specify them via the prob_cols parameter:
# Using custom probability column names
results <- perform_sensitivity_analysis(
prob_cols = c("prob_low", "prob_med", "prob_high"), # Your column names
# ... other parameters
)The probabilities must sum to 1.0 for each row.
2. Measured Concentration Data (conc_data_csv)
A CSV file containing the concentration data compiled by Ravalli and others, 2022.
-
Default Column Names (can be customized via
function parameters):
-
conc_meanlog: Numeric mean (on a log scale) of the arsenic concentration -
PWELL_private_pct: Numeric percentage of the population on private wells (from 0 to 100) -
GEOID10: A geographic or participant identifier matching the probability data
-
Automatic Lognormal Conversion
Alternatively, if your concentration data contains raw arsenic
concentrations instead of pre-calculated lognormal parameters, you can
use the conc_raw_col parameter to have geoExposeR
automatically convert them:
-
Alternative Required Columns (when using
conc_raw_col):- A column with raw arsenic concentration values (specified by
conc_raw_col) -
PWELL_private_pct: Numeric percentage of the population on private wells - A geographic or participant identifier matching the probability data
- A column with raw arsenic concentration values (specified by
The conversion uses the formula: *
meanlog = log(concentration + 0.1) (offset avoids log(0)) *
sdlog = default_sdlog (defaults to 1.0 if not
specified)
# With automatic conversion from raw concentrations
results <- perform_sensitivity_analysis(
conc_data_csv = "conc_raw.csv",
conc_raw_col = "arsenic_ppb", # Column with raw concentrations
default_sdlog = 1.0, # Optional, defaults to 1.0
# ... other parameters ...
)3. Health Outcome Data (birth_data)
An R data.frame containing individual-level health
outcome and covariate data.
Column names in the health data are fully user-defined. The only requirement is that column names match those referenced in your parameters and regression formula.
-
Configurable Columns:
-
Identifier column (default
"FIPS", set viarecord_id_col): A geographic or participant identifier matching the arsenic data -
Outcome columns (default
c("OEGEST", "BWT"), set viatargets): Columns for health outcomes -
Covariate columns: Any columns referenced in
regression_formula
-
Identifier column (default
Example with custom column names:
# Your health data has: participant_id, birthweight_g, gest_weeks, mom_age, smoking
results <- perform_sensitivity_analysis(
record_id_col = "participant_id", # Your ID column name
targets = c("birthweight_g", "gest_weeks"), # Your outcome column names
regression_formula = "~ as.factor(ExposureLevel) + mom_age + smoking + (1|participant_id)",
# ... other parameters
)Data Preparation Workflow
The geoExposeR package provides a suite of data
ingestion functions to help prepare raw data for analysis. These
functions are particularly useful when working with GeoTIFF rasters, raw
concentration measurements, or health data in various formats.
Data Ingestion Functions Overview
| Function | Purpose |
|---|---|
prepare_prob_data() |
Extract arsenic probabilities from GeoTIFF rasters |
prepare_conc_data() |
Convert concentration measurements to lognormal parameters |
format_health_data() |
Standardize health outcome data formats |
add_well_percentages() |
Merge private well population percentages |
subset_by_geography() |
Filter data by state or geographic region |
validate_prepared_inputs() |
Validate all input files before analysis |
Preparing Probability Data from GeoTIFF Rasters
If you have arsenic probability surfaces as GeoTIFF rasters, use
prepare_prob_data() to extract and aggregate probabilities
by geographic unit.
Data Citation: > Lombard, M.A., 2021, Data used to model and map arsenic concentration exceedances in private wells throughout the conterminous United States for human health studies: U.S. Geological Survey data release, https://doi.org/10.5066/P90RBJXS.
library(terra)
library(sf)
# Define raster paths (exceedance probabilities)
prob_rasters <- list(
gt5 = "path/to/prob_gt5.tif", # P(As > 5 µg/L)
gt10 = "path/to/prob_gt10.tif" # P(As > 10 µg/L)
)
# Load county boundaries (e.g., from Census TIGER files)
county_boundaries <- sf::st_read("path/to/county_boundaries.shp")
# Extract and convert to multinomial probabilities
prob_processed <- prepare_prob_data(
prob_rasters = prob_rasters,
boundaries = county_boundaries,
id_col = "GEOID",
extraction_method = "mean", # or "centroid", "weighted"
cutoffs = c(5, 10),
output_file = "prob_arsenic.csv"
)The function automatically converts exceedance probabilities to multinomial category probabilities (P(As < 5), P(5 ≤ As < 10), P(As ≥ 10)).
Preparing Concentration Data
Use prepare_conc_data() to convert raw arsenic
measurements to lognormal distribution parameters:
# From raw measurements
conc_processed <- prepare_conc_data(
conc_data = raw_conc_data,
id_col = "county_fips",
concentration_col = "arsenic_ug_L",
population_col = "pop_served",
private_well_pct_col = "pwell_pct",
input_type = "raw",
output_file = "conc_lognormal.csv"
)
# From pre-aggregated summary statistics
conc_processed <- prepare_conc_data(
conc_data = summary_data,
id_col = "county_fips",
mean_col = "mean_arsenic",
sd_col = "sd_arsenic",
input_type = "summary"
)Formatting Health Data
Use format_health_data() to standardize health outcome
data. The package uses smart identifier formatting
(enabled by default via format_ids = TRUE) that
automatically detects ID types:
-
Numeric FIPS codes (1-5 digits): Zero-padded to 5
digits (e.g.,
1→"00001") - Longer numeric codes (census tracts, ZIP+4): Converted to character without modification
- Alphanumeric IDs (participant IDs, custom codes): Preserved unchanged
Set format_ids = FALSE in
perform_sensitivity_analysis() to skip formatting and use
identifiers as-is.
health_formatted <- format_health_data(
health_data = raw_health_data,
id_col = "county_id", # Works with any identifier format
outcome_cols = c("BWT", "OEGEST", "SBP"),
covariate_cols = c("MAGE_R", "smoke", "MRACEHISP_F"),
geoid_digits = 5, # For numeric IDs, pads to specified digits
validate = TRUE,
output_file = "health_outcomes.txt"
)Subsetting by Geography
Use subset_by_geography() to filter data to specific
states or regions:
# Subset to California and Nevada (state FIPS codes)
western_data <- subset_by_geography(
data = prob_processed,
id_col = "GEOID10",
state_fips = c("06", "32")
)
# Subset to specific counties
selected_counties <- subset_by_geography(
data = prob_processed,
id_col = "GEOID10",
geoid_list = c("06001", "06003", "06005", "32003")
)Validating Prepared Inputs
Before running the analysis, validate all input files using
validate_prepared_inputs():
validation <- validate_prepared_inputs(
prob_file = "prob_arsenic.csv",
conc_file = "conc_lognormal.csv",
health_file = "health_outcomes.txt"
)
if (validation$valid) {
message("All files validated successfully!")
message("Summary:")
message(" Probability records: ", validation$summary$prob$n_rows)
message(" Concentration records: ", validation$summary$conc$n_rows)
message(" Health records: ", validation$summary$health$n_rows)
} else {
message("Validation issues found:")
for (issue in validation$issues) {
message(" - ", issue)
}
}A Complete Walkthrough
Let’s walk through a full example.
Step 1: Setup
First, load the necessary libraries.
library(geoExposeR)
library(lme4) # Loaded by geoExposeR, but good practice for clarity
#> Loading required package: MatrixStep 2: Load Example Data
The geoExposeR package includes example data files in
the examples/input_data/ folder. These files demonstrate
the expected data formats:
- prob_dom.csv: Modeled probability data with multinomial probabilities of arsenic levels
- conc_dom.csv: Measured concentration data with raw concentration values
- Demographics_dom.txt: Demographics data with health outcomes (SBP)
# Define paths to the example data files included with the package
# When running from the package source directory:
example_dir <- file.path("examples", "input_data")
# Define file paths
example_files <- list(
prob = file.path(example_dir, "prob_dom.csv"),
conc = file.path(example_dir, "conc_dom.csv"),
health = file.path(example_dir, "Demographics_dom.txt")
)
# Verify files exist
cat("Example data files:\n")
cat(" Probability data:", example_files$prob, "\n")
cat(" Concentration data:", example_files$conc, "\n")
cat(" Health data:", example_files$health, "\n")The example data files have the following structure:
# Preview probability data
prob_preview <- data.table::fread(example_files$prob, nrows = 5)
cat("\nProbability data columns:", paste(names(prob_preview), collapse = ", "), "\n")
# Columns: id, pub_water_pop, RFCprobC1, RFCprobC2, RFCprobC3, RFC_maxprob, RFC_maxprob_num, Dom_Pop_Percent
# Preview concentration data
conc_preview <- data.table::fread(example_files$conc, nrows = 5)
cat("Concentration data columns:", paste(names(conc_preview), collapse = ", "), "\n")
# Columns: id, weighted_as20062008, county_pop_served20062008, well, Dom_Pop_Percent
# Preview health/demographics data
health_preview <- data.table::fread(example_files$health, nrows = 5)
cat("Health data columns:", paste(names(health_preview), collapse = ", "), "\n")
# Columns: id, SBP, age_decadeStep 3: Define Analysis Parameters
Next, define all the parameters for the analysis, including the regression formula. Note that we use column names matching the example data files.
# Use a small number for the vignette; recommend >= 20 for real analysis
ndraws <- 5
targets <- c("SBP") # Systolic Blood Pressure (from Demographics_dom.txt)
output_dir <- "analysis_results"
exposure_level_col <- "ExposureLevel"
# Arsenic category labels (matching USGS 3-category model)
cat_labels <- c("<5", "5-10", "10+")
# Reference category for regression (lowest exposure)
drop_cat_label_ref <- c("<5")
# Define the regression formula.
# We use a geographic random INTERCEPT by county, (1 | id): arsenic exposure is
# assigned at the county level while SBP is measured on individuals nested
# within counties (the demographics file holds multiple participants per
# county). Each county gets its own baseline SBP, separating between-county from
# within-county variation. There is no random slope, so the arsenic effect is
# assumed constant across counties.
regression_formula <- paste0(
"~ as.factor(", exposure_level_col, ") + ",
"(1 | id)"
)A random intercept by age_decade,
(1 | age_decade), is also possible, but it has a different
meaning: it treats age as a contextual grouping variable (a
separate baseline per age band, no random slope) rather than modelling
individual age, and it estimates the between-group variance from only a
handful of levels. If you simply want to adjust for age, prefer a fixed
effect such as + as.factor(age_decade). See
examples/run_geoExposeR_model_variants.R for these
alternatives.
Step 4: Run the Analysis
With all parameters defined, call the main function
perform_sensitivity_analysis(). Note that we specify custom
column names to match the example data file structure.
# This chunk may take a moment to run
results <- perform_sensitivity_analysis(
birth_data = example_files$health,
model_prob_csv = example_files$prob,
conc_data_csv = example_files$conc,
ndraws = ndraws,
regression_formula = regression_formula,
output_dir = output_dir,
targets = targets,
cat_label = cat_labels,
drop_cat_label_ref = drop_cat_label_ref,
# Custom column names matching the example data files
id_col = "id", # ID column in prob/conc data
prob_cols = c("RFCprobC1", "RFCprobC2", "RFCprobC3"), # Probability columns
pop_well_col = "pub_water_pop", # Population column in prob data
record_id_col = "id", # ID column in health data
exposure_level_col = exposure_level_col,
# Use raw concentration column for automatic lognormal conversion
conc_raw_col = "weighted_as20062008", # Raw arsenic concentration
default_sdlog = 1.0, # Default sdlog value
pwell_col = "Dom_Pop_Percent", # Private well percentage
conc_cutoffs = c(5, 10),
# Imputation settings
seed = 12345,
mice_m = 1,
mice_maxit = 5,
mice_method = "pmm",
mice_covs = NULL,
impute_vars = NULL,
apply_imputation_fallback = TRUE,
rucc_col = NULL
)Step 5: Interpret the Results
The results object is a list where each element is a
mipo object from the mice package. To view a
publication-ready table of the pooled results, use
summary().
Let’s examine the results for Systolic Blood Pressure
(SBP).
sbp_summary <- results$SBP
knitr::kable(
sbp_summary,
caption = "Pooled Regression Results for Systolic Blood Pressure (SBP)"
)The summary table includes: * term: The regression model term. * q.mi: The pooled point estimate (e.g., the mean difference in SBP). * se.mi: The pooled standard error. * statistic: The t-statistic. * conf.low: The lower bound of the 95% confidence interval. * conf.high: The upper bound of the 95% confidence interval. * p.value: The p-value for the term.
From this table, you can assess the magnitude, direction, and statistical significance of the association between different levels of arsenic exposure and the health outcome, while accounting for uncertainty from both exposure and covariate imputation.
Step 6: Cleanup (Optional)
Remove the output directory created during the analysis if desired.
unlink(output_dir, recursive = TRUE)Additional Examples
The geoExposeR package includes several complete working
examples in the examples/ folder demonstrating both the
core arsenic analysis workflow and broader applications to other
environmental exposures.
1. Basic CSV Workflow (run_geoExposeR_example.R)
The standard example using pre-formatted CSV data:
# Run from package directory
source("examples/run_geoExposeR_example.R")This example demonstrates: - Loading data with custom column names - Preprocessing concentration data to lognormal parameters - Running the complete sensitivity analysis - Interpreting pooled regression results
2. GeoTIFF Workflow
(run_geoExposeR_from_geotiff.R)
A complete workflow starting from GeoTIFF raster data:
# Run from package directory
source("examples/run_geoExposeR_from_geotiff.R")This example demonstrates: - Creating/loading GeoTIFF probability
surfaces - Building geographic boundaries for extraction - Using
prepare_prob_data() to extract probabilities - Using
prepare_conc_data() to process concentration measurements -
Using format_health_data() to standardize health data -
Using validate_prepared_inputs() for quality control -
Running the complete analysis pipeline
3. Data Pipeline Demo
(run_geoExposeR_data_pipeline.R)
A comprehensive demonstration of all data ingestion functions:
# Run from package directory
source("examples/run_geoExposeR_data_pipeline.R")This example demonstrates: - prepare_conc_data() with
both raw and summary input types - format_health_data()
with different data formats - add_well_percentages() for
merging Census well data
Domestic Well Data Citation: > Johnson, T.D., and Belitz, K., 2019, Domestic well locations and populations served in the contiguous U.S.: datasets for decadal years 2000 and 2010: U.S. Geological Survey data release, https://doi.org/10.5066/P9FSLU3B.
-
subset_by_geography()for filtering by state or GEOID -
validate_prepared_inputs()for comprehensive validation
Broader Applications
While geoExposeR was originally developed for
groundwater arsenic exposure assessment, its framework is generalizable
to any environmental exposure where:
- You have probabilistic exposure estimates from models or spatial predictions
- You have measured concentration data from monitoring systems
- You want to assess health outcomes while accounting for exposure uncertainty
The package includes three additional examples demonstrating these broader applications:
4. Nitrate and Infant Methemoglobinemia
(run_geoExposeR_nitrate.R)
Demonstrates analyzing nitrate contamination in drinking water and its association with infant methemoglobinemia (blue baby syndrome):
# Run from package directory
source("examples/run_geoExposeR_nitrate.R")Key adaptations:
-
Exposure cutoffs:
c(5, 10)mg/L NO₃-N (EPA MCL = 10 mg/L) - Health outcome: Methemoglobin percentage in infants
-
Data sources:
-
prob_nitrate.csv: Modeled nitrate probability distributions -
conc_nitrate.csv: Public water system nitrate measurements -
Health_nitrate.txt: Infant health records with methemoglobin levels
-
# Example parameter configuration for nitrate
nitrate_cutoffs <- c(5, 10) # mg/L NO3-N
cat_labels <- c("NO3<5", "NO3_5-10", "NO3_10+")
targets <- c("methemoglobin_pct")
results <- perform_sensitivity_analysis(
birth_data_txt = "path/to/Health_nitrate.txt",
model_prob_csv = "path/to/prob_nitrate.csv",
conc_data_csv = "path/to/conc_nitrate.csv",
conc_cutoffs = nitrate_cutoffs,
cat_label = cat_labels,
targets = targets,
conc_raw_col = "weighted_nitrate",
default_sdlog = 0.8,
# ... other parameters
)5. Uranium and Kidney Function
(run_geoExposeR_uranium.R)
Demonstrates analyzing uranium exposure and its nephrotoxic effects on kidney function (eGFR):
# Run from package directory
source("examples/run_geoExposeR_uranium.R")Key adaptations:
-
Exposure cutoffs:
c(15, 30)µg/L (EPA MCL = 30 µg/L) - Health outcome: Estimated Glomerular Filtration Rate (eGFR)
- Expected relationship: Negative coefficients (higher uranium → lower eGFR)
# Example parameter configuration for uranium
uranium_cutoffs <- c(15, 30) # µg/L
cat_labels <- c("U<15", "U_15-30", "U_30+")
targets <- c("eGFR")
results <- perform_sensitivity_analysis(
birth_data_txt = "path/to/Health_uranium.txt",
model_prob_csv = "path/to/prob_uranium.csv",
conc_data_csv = "path/to/conc_uranium.csv",
conc_cutoffs = uranium_cutoffs,
cat_label = cat_labels,
targets = targets,
conc_raw_col = "weighted_uranium",
default_sdlog = 0.9,
# ... other parameters
)6. PM2.5 Air Quality and Respiratory Function
(run_geoExposeR_pm25.R)
Demonstrates extending the framework beyond groundwater to air quality exposures:
# Run from package directory
source("examples/run_geoExposeR_pm25.R")Key adaptations:
-
Exposure cutoffs:
c(9, 35)µg/m³ (EPA annual standard = 9 µg/m³; 24-hour standard = 35 µg/m³) - Health outcome: FEV1 % predicted (lung function)
-
Data sources reinterpreted:
- Probability data parameter → Satellite/hybrid model predictions
- Concentration data parameter → Ground monitoring station measurements
# Example parameter configuration for PM2.5
pm25_cutoffs <- c(9, 35) # µg/m³
cat_labels <- c("PM25<9", "PM25_9-35", "PM25_35+")
targets <- c("FEV1_pct_predicted")
results <- perform_sensitivity_analysis(
birth_data_txt = "path/to/Health_pm25.txt",
model_prob_csv = "path/to/prob_pm25.csv",
conc_data_csv = "path/to/conc_pm25.csv",
conc_cutoffs = pm25_cutoffs,
cat_label = cat_labels,
targets = targets,
conc_raw_col = "weighted_pm25",
default_sdlog = 0.5, # Lower variance for air quality
# ... other parameters
)Adapting to Other Exposures
The framework can be adapted to any exposure with the following data structure:
| geoExposeR Parameter | Original Use | Generalized Use |
|---|---|---|
model_prob_csv |
Modeled arsenic probabilities | Any probabilistic exposure model |
conc_data_csv |
Measured drinking water concentrations | Any measured concentration data |
conc_cutoffs |
Arsenic thresholds (5, 10 µg/L) | Any health-relevant thresholds |
cat_label |
Arsenic category names | Custom exposure category labels |
targets |
Birth outcomes | Any health outcome variables |
Potential applications include:
- Other groundwater contaminants: Manganese, fluoride, radium, PFAS
- Air quality: Ozone, NO₂, SO₂
- Soil contamination: Lead, heavy metals
- Noise exposure: Traffic noise and cardiovascular outcomes
- Heat exposure: Temperature and mortality/morbidity
References
Foundational Studies
Bulka, C.M., Bryan, M.S., Lombard, M.A., Bartell, S.M., Jones, D.K., Bradley, P.M., Vieira, V.M., Silverman, D.T., Focazio, M., Toccalino, P.L., Daniel, J., Backer, L.C., Ayotte, J.D., Gribble, M.O., and Argos, M., 2022, Arsenic in private well water and birth outcomes in the United States: Environment International, v. 163, article 107176, https://doi.org/10.1016/j.envint.2022.107176.
Lombard, M.A., Bryan, M.S., Jones, D.K., Bulka, C., Bradley, P.M., Backer, L.C., Focazio, M.J., Silverman, D.T., Toccalino, P., Argos, M., Gribble, M.O., and Ayotte, J.D., 2021, Machine learning models of arsenic in private wells throughout the conterminous United States as a tool for exposure assessment in human health studies: Environmental Science & Technology, v. 55, no. 8, p. 5012–5023, https://doi.org/10.1021/acs.est.0c05239.
Nigra, A. E., Bloomquist, T. R., Rajeev, T., Burjak, M., Casey, J. A., Goin, D. E., Herbstman, J. B., Ornelas Van Horne, Y., Wylie, B. J., Cerna-Turoff, I., Braun, J. M., McArthur, K. L., Karagas, M. R., Ames, J. L., Sherris, A. R., Bulka, C. M., Padula, A. M., Howe, C. G., Fry, R. C., … Consortium, E. C., 2025, Public Water Arsenic and Birth Outcomes in the Environmental Influences on Child Health Outcomes Cohort, JAMA Network Open, v. 8, no. 6, p. e2514084–e2514084. https://doi.org/10.1001/jamanetworkopen.2025.14084
Angley, M., Zhang, Y., Nigra, A. E., Lombard, M. A., Gribble, M. O., Lu, L., Unverzagt, F. W., McClure, L. A., Judd, S. E., Cushman, M., Brockman, J., and Kahe, K., 2026, Drinking water arsenic, urinary arsenic biomarkers, and cognitive impairment in the REGARDS Study, Environmental Research, 123768, https://doi.org/10.1016/j.envres.2026.123768.
Statistical Methods
Rubin, D.B., 1987, Multiple imputation for nonresponse in surveys: New York, John Wiley & Sons, https://doi.org/10.1002/9780470316696.
van Buuren, S., and Groothuis-Oudshoorn, K., 2011, mice—Multivariate imputation by chained equations in R: Journal of Statistical Software, v. 45, no. 3, p. 1–67, https://doi.org/10.18637/jss.v045.i03.
Honaker, J., King, G., and Blackwell, M., 2011, Amelia II—A program for missing data: Journal of Statistical Software, v. 45, no. 7, p. 1–47, https://doi.org/10.18637/jss.v045.i07.
Bates, D., Mächler, M., Bolker, B., and Walker, S., 2015, Fitting linear mixed-effects models using lme4: Journal of Statistical Software, v. 67, no. 1, p. 1–48, https://doi.org/10.18637/jss.v067.i01.
Venables, W.N., and Ripley, B.D., 2002, Modern applied statistics
with S (4th ed.): New York, Springer, https://doi.org/10.1007/978-0-387-21706-2. [multinomial
logistic regression via the nnet package]
Elff, M., 2024, mclogit—Multinomial logit models, with or without random effects or overdispersion: R package, https://doi.org/10.32614/CRAN.package.mclogit.
Data Sources
Lombard, M.A., 2021, Data used to model and map arsenic concentration exceedances in private wells throughout the conterminous United States for human health studies: U.S. Geological Survey data release, https://doi.org/10.5066/P90RBJXS.
Johnson, T.D., and Belitz, K., 2019, Domestic well locations and populations served in the contiguous U.S.—Datasets for decadal years 2000 and 2010: U.S. Geological Survey data release, https://doi.org/10.5066/P9FSLU3B.
DeSimone, L.A., McMahon, P.B., and Rosen, M.R., 2014, The quality of our Nation’s waters—Water quality in principal aquifers of the United States, 1991–2010: U.S. Geological Survey Circular 1360, 151 p., https://pubs.usgs.gov/circ/1360/.
Ravalli, F., Yu, Y., Bostick, B. C., Chillrud, S. N., Schilling, K., Basu, A., Navas-Acien, A., Nigra, A. E., 2022, Sociodemographic inequalities in uranium and other metals in community water systems across the USA, 2006-11: a cross-sectional study: a cross-sectional study. The Lancet Planetary Health, 6(4), e320–e330, https://doi.org/10.1016/S2542-5196(22)00043-2
R Packages and Resources
R Core Team, 2024, R: A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, https://www.R-project.org/.
Wickham, H., and Bryan, J., 2023, R packages (2nd ed.): Sebastopol, California, O’Reilly Media, https://r-pkgs.org/.
Barrett, T., Dowle, M., Srinivasan, A., Gorecki, J., Chirico, M., Hocking, T., Schwendinger, B., and Krylov, I., 2025, data.table—Extension of data.frame: R package version 1.17.99, https://r-datatable.com.
Hijmans, R.J., 2024, terra—Spatial data analysis: R package version 1.7-78, https://CRAN.R-project.org/package=terra.
Pebesma, E., and Bivand, R., 2024, sf—Simple features for R: R package version 1.0-16, https://CRAN.R-project.org/package=sf.
Baston, D., 2024, exactextractr—Fast extraction from raster datasets using polygons: R package version 0.10.0, https://CRAN.R-project.org/package=exactextractr.