Run pigauto's full imputation pipeline and return M stochastic
completions of the trait matrix instead of a single point estimate.
The M datasets are the input needed for the classical multiple
imputation workflow: fit a downstream model on each dataset, then
pool the results with Rubin's rules via pool_mi(). This is the
standard way to propagate imputation uncertainty into phylogenetic
comparative analyses (PGLS, PGLMM, etc.) rather than treating
imputed cells as if they were observed.
Usage
multi_impute(
  traits,
  tree,
  m = 100L,
  draws_method = c("conformal", "mc_dropout"),
  species_col = NULL,
  trait_types = NULL,
  multi_proportion_groups = NULL,
  log_transform = TRUE,
  missing_frac = 0.25,
  covariates = NULL,
  epochs = 2000L,
  verbose = TRUE,
  seed = 1L,
  ...
)
Arguments
- traits
data.frame with species as rownames and trait columns. Same input format as impute(). Supported column types are numeric, integer, factor, ordered factor, and logical.
- tree
object of class phylo aligned with traits.
- m
integer. Number of imputation datasets to generate (default 100). Observed cells are identical across all M datasets; only originally-missing cells vary.
- draws_method
character. How stochastic draws are generated for missing cells. One of:
"conformal" (default): Run the model once, then sample each originally-missing cell from a Normal distribution centred on the point estimate with SD = conformal_score / 1.96. The conformal score is the empirical 97.5th percentile of held-out absolute residuals, so the draw width is calibrated against actual prediction error rather than a model assumption. Falls back to BM-SE-based Normal sampling when conformal scores are unavailable, and to Bernoulli / Categorical draws for discrete traits. Preferred default for pigauto because MC dropout gives zero variance whenever the calibrated gate is zero (i.e. whenever the BM baseline already fits well), which is common for continuous traits with strong phylogenetic signal.
"mc_dropout": Run M stochastic GNN forward passes in training mode (dropout active). Useful when the calibrated gate is open (r_cal > 0, i.e. the GNN meaningfully corrects the BM baseline). When all gates are zero, as is typical for continuous traits on datasets with strong phylogenetic signal, MC dropout is deterministic and falls back silently to the BM-only point estimate for every draw. Check mi$fit$calibrated_gates before using this method.
- species_col
character or NULL. If set, marks the column in traits containing species identifiers and enables multiple observations per species. See impute() for details.
- trait_types
named character vector overriding auto-detected trait types for specific columns. Required for "proportion" and "zi_count". See impute() and preprocess_traits(). Default NULL (auto-detect).
- multi_proportion_groups
named list declaring compositional trait groups (rows summing to 1), e.g. list(diet = c("plant", "invert", "vert")). Forwarded to impute() / preprocess_traits(). Default NULL.
- log_transform
logical. Auto-log positive continuous columns (default TRUE).
- missing_frac
numeric. Fraction of observed cells held out for validation/test during training (default 0.25). Passed through to impute().
- covariates
data.frame or matrix of environmental covariates (fully observed, numeric). Passed through to impute(). Default NULL (no covariates).
- epochs
integer. Maximum GNN training epochs (default 2000).
- verbose
logical. Print progress (default TRUE).
- seed
integer. Random seed (default 1).
- ...
additional arguments forwarded to fit_pigauto() via impute(). See fit_pigauto() for the full list; the "Safety floor" section below describes the relevant new v0.9.1.9002 argument.
Value
An object of class "pigauto_mi" with components:
- datasets
A list of length m. Each element is a data.frame with the same shape and column types as the input traits; observed cells are preserved and missing cells are filled with the corresponding imputation draw. Pass this list to with_imputations() to fit downstream models.
- m
Number of imputations.
- pooled_point
A single data.frame whose missing cells are replaced by the MC-averaged point estimate. Convenient for reporting but does not propagate imputation uncertainty; use datasets + pool_mi() for inference.
- se
Matrix of per-cell standard errors combining the baseline SE and the between-imputation standard deviation.
- imputed_mask
Logical matrix; TRUE where a cell was originally missing.
- fit
The underlying pigauto_fit object, retained for diagnostics and for calls to predict() on new data.
- data
The pigauto_data object.
- tree
The input phylogeny.
- species_col
Passed-through species-column name or NULL.
Details
Multiple imputation is a method for doing downstream analysis
under missing data, not an end in itself. Plugging a single
point-estimate imputation into a regression underestimates standard
errors because it treats imputed cells as if they were observed.
The standard remedy, due to Rubin (1987), is to generate M
stochastic completions, fit the downstream model on each, and pool
the results. multi_impute() + with_imputations() + pool_mi()
implement this workflow end to end.
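The pooling step can be illustrated in base R. The sketch below applies Rubin's rules to a single coefficient; `rubin_pool()` and the toy numbers are hypothetical and only show the arithmetic, and pool_mi() may compute further quantities (for example degrees of freedom) that are not shown here.

```r
# Hypothetical helper (not part of pigauto): pool one coefficient
# across m imputed-data fits using Rubin's rules.
rubin_pool <- function(estimates, std_errors) {
  m <- length(estimates)
  stopifnot(m >= 2, length(std_errors) == m)
  q_bar <- mean(estimates)        # pooled point estimate
  w     <- mean(std_errors^2)     # within-imputation variance
  b     <- stats::var(estimates)  # between-imputation variance
  total <- w + (1 + 1 / m) * b    # total variance
  list(estimate = q_bar, se = sqrt(total), within = w, between = b)
}

# Toy inputs: a slope and its SE from five imputed datasets (made-up values)
est <- c(0.50, 0.54, 0.47, 0.52, 0.49)
se  <- c(0.10, 0.11, 0.10, 0.09, 0.10)
pooled <- rubin_pool(est, se)
print(pooled)
```

The pooled SE exceeds the average within-imputation SE whenever the estimates differ across datasets, which is how imputation uncertainty reaches the final inference.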
draws_method = "conformal" (default): Run the model once; missing
cells are sampled from
\(x_{ij}^{(k)} \sim \mathrm{N}\bigl(\hat\mu_{ij},\; (q_{j}/1.96)^{2}\bigr)\), i.e. a Normal with SD \(q_j/1.96\),
where \(q_j\) is the trait-level conformal score (the empirical
97.5th percentile of held-out absolute residuals, in latent z-score
units back-transformed to the original scale). The draw width is
therefore calibrated against actual prediction error regardless of
whether the BM or GNN term dominates. For discrete traits (binary,
categorical) it uses Bernoulli / categorical draws from the estimated
probability vector. For multi_proportion groups it draws the
K CLR latent columns with their BM latent SEs, projects back to
sum-zero CLR space, and decodes to the simplex. This is the preferred
default for pigauto.
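For a continuous trait, the conformal draw described above can be sketched in base R. The residuals, point estimates, and species labels below are simulated stand-ins, not pigauto output, and the internal implementation may differ in detail (for example in how the back-transformation is handled).

```r
set.seed(1)

# Simulated held-out absolute residuals for one trait (stand-in data)
abs_resid <- abs(rnorm(200, mean = 0, sd = 0.3))

# Conformal score: empirical 97.5th percentile of held-out absolute residuals
q_j <- unname(quantile(abs_resid, probs = 0.975))

# Point estimates for three originally-missing cells (illustrative values)
mu_hat <- c(sp_a = 2.1, sp_b = 3.4, sp_c = 1.8)

# m draws per missing cell: Normal centred on mu_hat with SD = q_j / 1.96
m <- 100L
draws <- sapply(mu_hat, function(mu) rnorm(m, mean = mu, sd = q_j / 1.96))

dim(draws)  # m rows (imputations) by 3 columns (missing cells)
```

Each row of `draws` corresponds to one imputed dataset; observed cells would be copied unchanged into every dataset.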
draws_method = "mc_dropout": Run M GNN forward passes in
training mode (dropout active). Caution: when the per-trait
calibrated gate r_cal = 0 (which happens whenever the BM baseline
already fits well, typically for continuous traits with strong
phylogenetic signal), every MC pass is identical to the BM point
estimate and draws have zero between-imputation variance. Check
mi$fit$calibrated_gates after fitting — if all gates for the traits
of interest are zero, use draws_method = "conformal" instead.
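The gate check can be sketched as follows. This assumes, purely for illustration, that the calibrated gates form a named numeric vector with one entry per trait; the actual structure of mi$fit$calibrated_gates may differ, and the values below are made up.

```r
# Hypothetical stand-in for mi$fit$calibrated_gates (structure assumed)
gates <- c(Mass = 0, Wing.Length = 0, Tarsus.Length = 0.15)

# Traits used in the downstream model
traits_of_interest <- c("Mass", "Wing.Length")

# A gate of zero means MC dropout would yield identical draws for that trait
closed <- gates[traits_of_interest] == 0

# Use conformal draws when every gate of interest is closed
draws_method <- if (all(closed)) "conformal" else "mc_dropout"
print(draws_method)
```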
Nakagawa & Freckleton (2008, 2011) review the consequences of ignoring missing data in ecological and comparative analyses and argue for multiple imputation as the default.
When to use this
pigauto provides two multiple-imputation functions. Pick based on how many trees you have:
- One tree (single published phylogeny, single time-calibrated tree): use multi_impute(). The m stochastic imputations (conformal draws by default) capture model uncertainty.
- Multiple posterior trees (BirdTree samples, BEAST posterior, etc.): use multi_impute_trees(). Between-tree variation is added to the pooled SEs via Rubin's rules (Nakagawa & de Villemereuil 2019).
The two functions share the same downstream API — both return objects
compatible with with_imputations() and pool_mi().
Safety floor (v0.9.1.9002+)
When fit_pigauto() was called with safety_floor = TRUE
(the default since v0.9.1.9002), the 3-way blend
r_BM * BM + r_GNN * GNN + r_MEAN * MEAN propagates through
every imputation draw automatically via the updated
predict.pigauto_fit(). For draws_method = "mc_dropout"
the mean term contributes no between-draw variance (it is a
deterministic scalar per column), so Rubin-pooled SE stays correctly
calibrated: variance comes from the BM-draw and GNN-dropout terms
only. For draws_method = "conformal" the blend centre is the
3-way prediction and conformal scores remain calibrated on the
blended residuals.
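The claim that the mean term adds no between-draw variance can be checked numerically. The weights, draws, and column mean below are made-up illustrative values, not fitted output; the sketch only demonstrates that adding a deterministic term leaves the variance of the blend unchanged.

```r
set.seed(1)
m <- 1000L

# Illustrative blend weights (made-up values)
r_bm   <- 0.6
r_gnn  <- 0.3
r_mean <- 0.1

bm_draws  <- rnorm(m, mean = 2.0, sd = 0.20)  # stochastic BM term
gnn_draws <- rnorm(m, mean = 2.2, sd = 0.10)  # stochastic GNN-dropout term
col_mean  <- 1.9                              # deterministic per-column scalar

# Three-way blend versus the stochastic part alone
blend      <- r_bm * bm_draws + r_gnn * gnn_draws + r_mean * col_mean
stochastic <- r_bm * bm_draws + r_gnn * gnn_draws

# The constant shifts the centre but not the spread
same_variance <- isTRUE(all.equal(var(blend), var(stochastic)))
print(same_variance)
```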
References
Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
Nakagawa S, Freckleton RP (2008). "Missing inaction: the dangers of ignoring missing data." Trends in Ecology & Evolution 23(11): 592-596.
Nakagawa S, Freckleton RP (2011). "Model averaging, missing data and multiple imputation: a case study for behavioural ecology." Behavioral Ecology and Sociobiology 65(1): 103-116.
See also
impute() for single-point imputation, with_imputations()
for applying a model-fitting function across the M datasets,
pool_mi() for Rubin's rules pooling of the resulting fits.
Examples
if (FALSE) { # \dontrun{
library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL
# Generate 100 complete datasets
mi <- multi_impute(df, tree300, m = 100)
print(mi)
# Downstream analysis: phylogenetic GLS via nlme, pooled with Rubin's rules
fits <- with_imputations(mi, function(d) {
  d$species <- rownames(d)
  nlme::gls(
    log(Mass) ~ log(Wing.Length),
    correlation = ape::corBrownian(phy = tree300, form = ~species),
    data = d, method = "ML"
  )
})
pool_mi(fits)
} # }
