Pool downstream model fits across multiple imputations (Rubin's rules)

Combine regression coefficients from M model fits – one per imputed dataset – into a single pooled table using Rubin's rules. The pooled standard errors properly account for both within-imputation sampling variance and between-imputation variance, so downstream inference propagates the uncertainty introduced by imputation.

Usage

pool_mi(
  fits,
  conf.level = 0.95,
  coef_fun = stats::coef,
  vcov_fun = stats::vcov,
  df_fun = NULL
)

Arguments

fits: A list of model fits of length M >= 2. Any model class implementing coef() and vcov() works (e.g. stats::lm, nlme::gls, lme4::lmer, glmmTMB::glmmTMB, phylolm::phylolm, phylolm::phyloglm). The output of with_imputations() is accepted directly. MCMCglmm fits are rejected – see Details.
conf.level: Confidence level for the pooled interval (default 0.95).
coef_fun: Function extracting a named numeric coefficient vector from one fit. Defaults to stats::coef(). Supply a custom extractor for models where coef() returns a list (e.g. glmmTMB with multiple components) – see Examples.
vcov_fun: Function extracting the variance-covariance matrix of the fixed-effect coefficients. Defaults to stats::vcov(). Must return a square matrix whose row/column names match coef_fun(fit).
df_fun: Optional function returning the complete-data residual degrees of freedom nu_com from one fit. When supplied, pooled degrees of freedom use the Barnard & Rubin (1999) small-sample correction, which is less biased for short series. When NULL (the default) the classical Rubin (1987) formula is used.

Value

A data.frame with one row per coefficient and columns:

term: Coefficient name.
estimate: Pooled point estimate (mean across fits).
std.error: Pooled standard error sqrt(T) where T = W + (1 + 1/M) * B.
df: Pooled degrees of freedom (Barnard-Rubin if df_fun supplied, else classical Rubin).
statistic: estimate / std.error.
p.value: Two-sided p-value from a t distribution on df.
conf.low, conf.high: Pooled conf.level interval.
fmi: Fraction of missing information.
riv: Relative increase in variance due to non-response.

Details

Let $\hat\theta_i$ be the coefficient vector from fit i and $U_i = \mathrm{vcov}(\mathrm{fit}_i)$, for $i = 1, \ldots, M$. Rubin's rules (Rubin 1987) give $$\bar\theta = M^{-1} \sum_i \hat\theta_i$$ $$W = M^{-1} \sum_i \mathrm{diag}(U_i)$$ $$B = (M-1)^{-1} \sum_i (\hat\theta_i - \bar\theta)^2$$ $$T = W + (1 + 1/M) B$$ with pooled standard error $\sqrt{T}$. The relative increase in variance is $r = (1 + 1/M) B / W$, the classical pooled df is $\nu_{\text{old}} = (M-1)(1 + 1/r)^2$, and the fraction of missing information is $$\mathrm{fmi} = (r + 2/(\nu + 3)) / (r + 1).$$ When df_fun returns finite complete-data df nu_com, the Barnard-Rubin (1999) correction combines $\nu_{\text{obs}} = ((\nu_{\text{com}}+1)/(\nu_{\text{com}}+3)) \nu_{\text{com}} (1 - \lambda)$ with nu_old via $\nu_{\text{BR}} = 1/(1/\nu_{\text{old}} + 1/\nu_{\text{obs}})$.

MCMCglmm fits are rejected because Rubin's rules are not the right tool for posterior samples: variance decomposition does not generalise cleanly to posterior distributions. For a Bayesian pigauto workflow (pigauto as imputer, MCMCglmm as inference engine), concatenate the posterior samples across imputations manually – stack fit$Sol and fit$VCV row-wise with do.call(rbind, ...) and wrap the result in coda::as.mcmc(). For the frequentist Rubin's-rules path, see vignette("mixed-types", package = "pigauto"). For an integrated chained-equation MCMC workflow where imputation and inference happen in a single engine, use the companion BACE package end-to-end (BACE::bace() + BACE::pool_posteriors()).

References

Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.

Barnard J, Rubin DB (1999). "Small-sample degrees of freedom with multiple imputation." Biometrika 86(4): 948-955.

Nakagawa S, Freckleton RP (2008). "Missing inaction: the dangers of ignoring missing data." Trends in Ecology & Evolution 23(11): 592-596.

Nakagawa S, Freckleton RP (2011). "Model averaging, missing data and multiple imputation: a case study for behavioural ecology." Behavioral Ecology and Sociobiology 65(1): 103-116.

Examples

if (FALSE) { # \dontrun{
# Typical workflow
mi   <- multi_impute(traits, tree, m = 100)
fits <- with_imputations(mi, function(d) lm(y ~ x1 + x2, data = d))
pool_mi(fits)

# glmmTMB with custom extractors (conditional component only)
pool_mi(
  fits,
  coef_fun = function(f) glmmTMB::fixef(f)$cond,
  vcov_fun = function(f) vcov(f)$cond
)
} # }