
Run pigauto's full imputation pipeline and return M stochastic completions of the trait matrix instead of a single point estimate. The M datasets are the input needed for the classical multiple imputation workflow: fit a downstream model on each dataset, then pool the results with Rubin's rules via pool_mi(). This is the standard way to propagate imputation uncertainty into phylogenetic comparative analyses (PGLS, PGLMM, etc.) rather than treating imputed cells as if they were observed.

Usage

multi_impute(
  traits,
  tree,
  m = 100L,
  draws_method = c("conformal", "mc_dropout"),
  species_col = NULL,
  trait_types = NULL,
  multi_proportion_groups = NULL,
  log_transform = TRUE,
  missing_frac = 0.25,
  covariates = NULL,
  epochs = 2000L,
  verbose = TRUE,
  seed = 1L,
  ...
)

Arguments

traits

data.frame with species as rownames and trait columns. Same input format as impute(). Supported column types are numeric, integer, factor, ordered factor, and logical.

tree

object of class phylo aligned with traits.

m

integer. Number of imputation datasets to generate (default 100). Observed cells are identical across all M datasets; only originally-missing cells vary.

draws_method

character. How stochastic draws are generated for missing cells. One of:

"conformal"

(default) Run the model once, then sample each originally-missing cell from a Normal distribution centred on the point estimate with SD = conformal_score / 1.96. The conformal score is the empirical 97.5th percentile of held-out absolute residuals, so the draw width is calibrated against actual prediction error — not a model assumption. Falls back to BM-SE-based Normal sampling when conformal scores are unavailable, and to Bernoulli / Categorical draws for discrete traits. Preferred default for pigauto because MC dropout gives zero variance whenever the calibrated gate is zero (i.e. whenever the BM baseline already fits well), which is common for continuous traits with strong phylogenetic signal.

"mc_dropout"

Run M stochastic GNN forward passes in training mode (dropout active). Useful when the calibrated gate is open (r_cal > 0, i.e. GNN meaningfully corrects the BM baseline). When all gates are zero — as is typical for continuous traits on datasets with strong phylogenetic signal — MC dropout is deterministic and falls back silently to the BM-only point estimate for every draw. Check mi$fit$calibrated_gates before using this method.

species_col

character or NULL. If set, marks the column in traits containing species identifiers and enables multiple observations per species. See impute() for details.

trait_types

named character vector overriding auto-detected trait types for specific columns. Required for "proportion" and "zi_count". See impute() and preprocess_traits(). Default NULL (auto-detect).

multi_proportion_groups

named list declaring compositional trait groups (rows summing to 1), e.g. list(diet = c("plant", "invert", "vert")). Forwarded to impute() / preprocess_traits(). Default NULL.

log_transform

logical. Auto-log positive continuous columns (default TRUE).

missing_frac

numeric. Fraction of observed cells held out for validation/test during training (default 0.25). Passed through to impute().

covariates

data.frame or matrix of environmental covariates (fully observed, numeric). Passed through to impute(). Default NULL (no covariates).

epochs

integer. Maximum GNN training epochs (default 2000).

verbose

logical. Print progress (default TRUE).

seed

integer. Random seed (default 1).

...

additional arguments forwarded to fit_pigauto() via impute(). See fit_pigauto() for the full list; the "Safety floor" section below describes the safety_floor argument introduced in v0.9.1.9002.
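A minimal sketch of how the trait_types and multi_proportion_groups arguments might be assembled, using base R only. The toy data frame, its column names, and the choice of "proportion" for the cover column are illustrative assumptions; the multi_impute() call is left commented out because it needs pigauto and a matching tree.

```r
# Toy trait table (hypothetical columns) with species as rownames.
traits <- data.frame(
  cover  = c(0.20, NA, 0.75, 0.40),   # a proportion in [0, 1]
  plant  = c(0.5, 0.1, NA, 0.3),      # three diet shares ...
  invert = c(0.3, 0.6, NA, 0.3),
  vert   = c(0.2, 0.3, NA, 0.4),      # ... that sum to 1 in each row
  row.names = c("sp_a", "sp_b", "sp_c", "sp_d")
)

# Named character vector: names are column names, values are trait types.
trait_types <- c(cover = "proportion")

# Named list: each element lists the columns of one compositional group.
multi_proportion_groups <- list(diet = c("plant", "invert", "vert"))

# Sanity check before fitting: complete rows of each group should sum to 1.
group_sums_ok <- vapply(multi_proportion_groups, function(cols) {
  rs <- rowSums(traits[, cols, drop = FALSE])
  all(abs(rs[!is.na(rs)] - 1) < 1e-8)
}, logical(1))
print(group_sums_ok)

# mi <- multi_impute(traits, tree, trait_types = trait_types,
#                    multi_proportion_groups = multi_proportion_groups)
```

Checking the row sums up front is a design choice of this sketch, not something the documentation says multi_impute() requires.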

Value

An object of class "pigauto_mi" with components:

datasets

A list of length m. Each element is a data.frame with the same shape and column types as the input traits; observed cells are preserved and missing cells are filled with the corresponding imputation draw. Pass this list to with_imputations() to fit downstream models.

m

Number of imputations.

pooled_point

A single data.frame whose missing cells are replaced by the MC-averaged point estimate. Convenient for reporting but does not propagate imputation uncertainty – use datasets + pool_mi() for inference.

se

Matrix of per-cell standard errors combining the baseline SE and the between-imputation standard deviation.

imputed_mask

Logical matrix; TRUE where a cell was originally missing.

fit

The underlying pigauto_fit object, retained for diagnostics and for calls to predict() on new data.

data

The pigauto_data object.

tree

The input phylogeny.

species_col

Passed-through species-column name or NULL.
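The relationship between datasets and imputed_mask can be checked directly. The sketch below builds a mock list with the same component names (a stand-in with made-up values, not a real pigauto_mi object) and confirms that observed cells are constant across the M datasets while originally-missing cells vary.

```r
set.seed(1)

# Mock stand-in for a pigauto_mi object (hypothetical toy values).
observed <- data.frame(mass = c(10, NA, 30), wing = c(NA, 5, 6),
                       row.names = c("sp_a", "sp_b", "sp_c"))
imputed_mask <- is.na(as.matrix(observed))   # TRUE where originally missing

datasets <- lapply(1:5, function(k) {
  d <- observed
  d$mass[is.na(d$mass)] <- rnorm(1, mean = 20, sd = 2)
  d$wing[is.na(d$wing)] <- rnorm(1, mean = 5.5, sd = 0.5)
  d
})
mi <- list(datasets = datasets, m = length(datasets),
           imputed_mask = imputed_mask)

# Stack the M numeric matrices into a rows x cols x M array.
arr <- simplify2array(lapply(mi$datasets, as.matrix))

# Between-imputation SD per cell.
cell_sd <- apply(arr, c(1, 2), sd)

observed_fixed <- all(cell_sd[!mi$imputed_mask] == 0)  # observed cells never change
missing_vary   <- all(cell_sd[mi$imputed_mask] > 0)    # imputed cells differ by draw
print(c(observed_fixed = observed_fixed, missing_vary = missing_vary))
```

On a real result, missing_vary coming out FALSE for a trait would be a hint that the draws have collapsed to a single value for that trait.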

Details

Multiple imputation is a method for doing downstream analysis under missing data, not an end in itself. Plugging a single point-estimate imputation into a regression underestimates standard errors because it treats imputed cells as if they were observed. The standard remedy, due to Rubin (1987), is to generate M stochastic completions, fit the downstream model on each, and pool the results. multi_impute() + with_imputations() + pool_mi() implement this workflow end to end.
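To make the pooling step concrete, here is a minimal base-R sketch of Rubin's rules for a single scalar coefficient. The helper rubin_pool() and the input numbers are hypothetical; pool_mi() may differ in details such as degrees-of-freedom corrections.

```r
# Rubin's rules for one scalar coefficient.
# est: length-M vector of point estimates; se: their standard errors.
rubin_pool <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)              # pooled point estimate
  w     <- mean(se^2)             # within-imputation variance
  b     <- var(est)               # between-imputation variance
  t_var <- w + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, se = sqrt(t_var), within = w, between = b)
}

# Hypothetical slope estimates from M = 5 downstream fits.
est <- c(0.50, 0.54, 0.47, 0.52, 0.49)
se  <- c(0.10, 0.11, 0.10, 0.09, 0.10)

pooled <- rubin_pool(est, se)
print(round(pooled, 4))
```

The pooled standard error is never smaller than the root of the average within-imputation variance; the excess is the contribution of imputation uncertainty that a single point-estimate imputation would ignore.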

draws_method = "conformal" (default): Run the model once; missing cells are sampled from \(x_{ij}^{(k)} \sim \mathrm{N}(\hat\mu_{ij},\; \sigma_j^2)\) with standard deviation \(\sigma_j = q_j/1.96\), where \(q_j\) is the trait-level conformal score (the empirical 97.5th percentile of held-out absolute residuals, in latent z-score units back-transformed to the original scale). The draw width is therefore calibrated against actual prediction error regardless of whether the BM or GNN term dominates. Discrete traits (binary, categorical) use Bernoulli / categorical draws from the estimated probability vector. For multi_proportion groups, the K CLR latent columns are drawn with their BM latent SEs, projected back to sum-zero CLR space, and decoded to the simplex. This is the preferred default for pigauto.
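The conformal draw for a continuous trait can be sketched in a few lines of base R. The residuals and point estimates below are simulated stand-ins, and the sketch works on a single scale throughout, ignoring the latent-scale back-transformation; it illustrates the sampling rule described above rather than pigauto's internal code.

```r
set.seed(42)

# Hypothetical held-out residuals for one trait j (truth minus prediction).
holdout_resid <- rnorm(200, mean = 0, sd = 0.3)

# Conformal score: empirical 97.5th percentile of absolute residuals.
q_j <- unname(quantile(abs(holdout_resid), probs = 0.975))
draw_sd <- q_j / 1.96   # SD of the Normal used for sampling

# Hypothetical point estimates for three originally-missing cells of trait j.
mu_hat <- c(sp_a = 1.2, sp_b = 0.4, sp_c = 2.1)
m <- 100

# One row per missing cell, one column per imputation draw.
draws <- vapply(seq_len(m),
                function(k) rnorm(length(mu_hat), mean = mu_hat, sd = draw_sd),
                numeric(length(mu_hat)))
rownames(draws) <- names(mu_hat)

# Between-imputation SD per cell should be close to draw_sd.
print(round(c(draw_sd = draw_sd, apply(draws, 1, sd)), 3))
```

Because the width comes from held-out residuals, the between-imputation spread stays positive even when the model's own stochasticity (for example dropout) would contribute none.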

draws_method = "mc_dropout": Run M GNN forward passes in training mode (dropout active). Caution: when the per-trait calibrated gate r_cal = 0 (which happens whenever the BM baseline already fits well, typically for continuous traits with strong phylogenetic signal), every MC pass is identical to the BM point estimate and draws have zero between-imputation variance. Check mi$fit$calibrated_gates after fitting — if all gates for the traits of interest are zero, use draws_method = "conformal" instead.
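The gate check can be wrapped in a small helper. The structure of mi$fit$calibrated_gates is not documented on this page, so the sketch assumes a named numeric vector with one gate per trait; the gate values, the Trophic.Level name, and the helper mc_dropout_is_degenerate() are all hypothetical.

```r
# Hypothetical gate values; on a real fit, inspect mi$fit$calibrated_gates.
# ASSUMPTION: one numeric gate per trait, named by trait.
gates <- c(Mass = 0, Wing.Length = 0, Trophic.Level = 0.35)

# TRUE when every trait of interest has a closed gate, i.e. MC dropout
# would return identical draws for those traits.
mc_dropout_is_degenerate <- function(gates, traits_of_interest) {
  g <- gates[traits_of_interest]
  if (anyNA(g)) stop("no gate found for some traits of interest")
  all(g == 0)
}

# Traits entering the downstream model log(Mass) ~ log(Wing.Length):
degenerate_pgls <- mc_dropout_is_degenerate(gates, c("Mass", "Wing.Length"))

# A model that also uses a trait with an open gate:
degenerate_other <- mc_dropout_is_degenerate(gates, c("Mass", "Trophic.Level"))

print(c(pgls = degenerate_pgls, other = degenerate_other))
```

When the helper returns TRUE for the traits in the downstream model, draws_method = "conformal" is the option that still yields non-zero between-imputation variance.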

Nakagawa & Freckleton (2008, 2011) review the consequences of ignoring missing data in ecological and comparative analyses and argue for multiple imputation as the default.

When to use this

pigauto provides two multiple-imputation functions. Pick based on how many trees you have:

  • One tree (single published phylogeny, single time-calibrated tree): use multi_impute(). The m stochastic imputations (conformal draws by default, or MC-dropout passes) capture imputation uncertainty conditional on that tree.

  • Multiple posterior trees (BirdTree samples, BEAST posterior, etc.): use multi_impute_trees(). Between-tree variation is added to the pooled SEs via Rubin's rules (Nakagawa & de Villemereuil 2019).

The two functions share the same downstream API — both return objects compatible with with_imputations() and pool_mi().

Safety floor (v0.9.1.9002+)

When fit_pigauto() was called with safety_floor = TRUE (the default since v0.9.1.9002), the 3-way blend r_BM * BM + r_GNN * GNN + r_MEAN * MEAN propagates through every imputation draw automatically via the updated predict.pigauto_fit(). For draws_method = "mc_dropout" the mean term contributes no between-draw variance (it is a deterministic scalar per column), so Rubin-pooled SE stays correctly calibrated: variance comes from the BM-draw and GNN-dropout terms only. For draws_method = "conformal" the blend centre is the 3-way prediction and conformal scores remain calibrated on the blended residuals.
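The claim that the mean term adds no between-draw variance follows from the fact that adding a constant does not change a variance. A minimal numerical sketch, with made-up draws and weights (the weights are assumed to sum to one here, which this page does not state):

```r
set.seed(7)
m <- 50

# Hypothetical per-draw components for one originally-missing cell.
bm_draws  <- rnorm(m, mean = 2.0, sd = 0.20)  # stochastic BM term
gnn_draws <- rnorm(m, mean = 2.3, sd = 0.10)  # stochastic GNN (dropout) term
col_mean  <- 1.8                              # deterministic column mean

# 3-way blend: r_BM * BM + r_GNN * GNN + r_MEAN * MEAN
blend <- function(r_bm, r_gnn, r_mean) {
  r_bm * bm_draws + r_gnn * gnn_draws + r_mean * col_mean
}

full_blend      <- blend(0.5, 0.2, 0.3)
stochastic_part <- 0.5 * bm_draws + 0.2 * gnn_draws

# The mean term shifts every draw by the same constant, so the
# between-draw variance is that of the BM and GNN terms alone.
v_full       <- var(full_blend)
v_stochastic <- var(stochastic_part)
v_mean_only  <- var(blend(0, 0, 1))   # all weight on the mean: no spread

print(c(full = v_full, stochastic = v_stochastic, mean_only = v_mean_only))
```

A column whose blend puts all weight on the mean would therefore show zero between-draw variance under "mc_dropout", which is the same degenerate case as a closed gate.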

References

Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.

Nakagawa S, Freckleton RP (2008). "Missing inaction: the dangers of ignoring missing data." Trends in Ecology & Evolution 23(11): 592-596.

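Nakagawa S, Freckleton RP (2011). "Model averaging, missing data and multiple imputation: a case study for behavioural ecology." Behavioral Ecology and Sociobiology 65(1): 103-116.

Nakagawa S, de Villemereuil P (2019). "A general method for simultaneously accounting for phylogenetic and species sampling uncertainty via Rubin's rules in comparative analysis." Systematic Biology 68(4): 632-641.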

See also

impute() for single-point imputation, with_imputations() for applying a model-fitting function across the M datasets, pool_mi() for Rubin's rules pooling of the resulting fits.

Examples

if (FALSE) { # \dontrun{
library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL

# Generate 100 complete datasets
mi <- multi_impute(df, tree300, m = 100)
print(mi)

# Downstream analysis: phylogenetic GLS via nlme, pooled with Rubin's rules
fits <- with_imputations(mi, function(d) {
  d$species <- rownames(d)
  nlme::gls(
    log(Mass) ~ log(Wing.Length),
    correlation = ape::corBrownian(phy = tree300, form = ~species),
    data = d, method = "ML"
  )
})
pool_mi(fits)
} # }