Generate experimental stochastic completion datasets

Run pigauto's full imputation pipeline and return M stochastic completions of the trait matrix instead of a single point estimate. The conformal-width and Brownian/MC-dropout draws returned here are experimental prediction-diagnostic draws. A 3,000-fit known-DGP campaign found that neither method passed any of the 12 downstream fixed-effect gate cells. Do not use these datasets for downstream inference or Rubin pooling. A separate analysis-aware backend, multi_impute_analysis(), has passed its package-level fixed-effect gate for a narrow set of analyses.

Usage

multi_impute(
  traits,
  tree,
  m = 100L,
  draws_method = c("conformal", "mc_dropout"),
  species_col = NULL,
  trait_types = NULL,
  multi_proportion_groups = NULL,
  log_transform = TRUE,
  missing_frac = 0.25,
  covariates = NULL,
  epochs = 2000L,
  verbose = TRUE,
  seed = 1L,
  ...
)

Arguments

traits

data.frame with species as rownames and trait columns. Same input format as impute(). Supported column types are numeric, integer, factor, ordered factor, and logical.

tree

object of class phylo aligned with traits.

m

integer. Number of stochastic completion datasets to generate (default 100). Observed cells are identical across all M datasets; only originally-missing cells vary.

draws_method

character. How stochastic draws are generated for missing cells. One of:

"conformal": (default) Run the model once, then sample each originally-missing cell from a Normal distribution centred on the point estimate with SD = conformal_score / 1.96. Converting a split-conformal residual quantile to a Normal scale is a heuristic; the conformal coverage guarantee does not establish that these draws are proper multiple imputations. Falls back to BM-SE-based Normal sampling when conformal scores are unavailable, and to Bernoulli / Categorical draws for discrete traits.
"mc_dropout": Run M stochastic GNN forward passes in training mode (dropout active) on top of stochastic Brownian-motion baseline draws. Brownian draws still contribute between-draw variation when a calibrated GNN gate is zero.

species_col

character or NULL. If set, marks the column in traits containing species identifiers and enables multiple observations per species. See impute() for details.

trait_types

named character vector overriding auto-detected trait types for specific columns. Required for "proportion" and "zi_count". See impute() and preprocess_traits(). Default NULL (auto-detect).

multi_proportion_groups

named list declaring compositional trait groups (rows summing to 1), e.g. list(diet = c("plant", "invert", "vert")). Forwarded to impute() / preprocess_traits(). Default NULL.

log_transform

logical. Auto-log positive continuous columns (default TRUE).

missing_frac

numeric. Fraction of observed cells held out for validation/test during training (default 0.25). Passed through to impute().

covariates

data.frame or matrix of environmental covariates (fully observed, numeric). Passed through to impute(). Default NULL (no covariates).

epochs

integer. Maximum GNN training epochs (default 2000).

verbose

logical. Print progress (default TRUE).

seed

integer. Random seed (default 1).

...

additional arguments forwarded to fit_pigauto() via impute(). See fit_pigauto() for the full list; the "Safety floor" section below describes the safety_floor argument, which defaults to TRUE from v0.9.1.9002.

Value

An object of class "pigauto_mi" with components:

datasets: A list of length m. Each element is a data.frame with the same shape and column types as the input traits; observed cells are preserved and missing cells are filled with the corresponding stochastic draw. These datasets are for prediction diagnostics, not downstream inference.
m: Number of stochastic completion datasets.
pooled_point: A single data.frame whose missing cells are replaced by the MC-averaged point estimate. Convenient for reporting but does not provide a valid downstream MI analysis.
se: Matrix of per-cell uncertainty summaries combining the baseline SE and the between-draw standard deviation.
imputed_mask: Logical matrix; TRUE where a cell was originally missing.
fit: The underlying pigauto_fit object, retained for diagnostics and for calls to predict() on new data.
data: The pigauto_data object.
tree: The input phylogeny.
species_col: Passed-through species-column name or NULL.
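
As an illustration of the structure described above, the following base-R sketch builds a toy list with the same datasets, m, and imputed_mask components and checks that only originally-missing cells vary between datasets. The object mi_toy and all of its values are hypothetical stand-ins; nothing here is produced by pigauto, and the fill-in draws are arbitrary.

```r
# Toy stand-in for a "pigauto_mi" object (hypothetical values, not pigauto output).
set.seed(1)
traits <- data.frame(mass = c(10, NA, 30, NA),
                     wing = c(1.5, 2.5, NA, 4.5),
                     row.names = paste0("sp", 1:4))
imputed_mask <- is.na(as.matrix(traits))  # TRUE where a cell was originally missing
m <- 5L

# Fill only the missing cells with an arbitrary random draw, once per dataset.
datasets <- lapply(seq_len(m), function(k) {
  d <- traits
  for (j in names(d)) {
    miss <- imputed_mask[, j]
    d[miss, j] <- rnorm(sum(miss), mean = 20, sd = 2)
  }
  d
})
mi_toy <- list(datasets = datasets, m = m, imputed_mask = imputed_mask)

# Stack the completions into a species x trait x draw array.
arr <- simplify2array(lapply(mi_toy$datasets, as.matrix))

# Between-draw spread (max minus min) for every cell.
spread <- apply(arr, c(1, 2), function(x) max(x) - min(x))

observed_fixed <- all(spread[!mi_toy$imputed_mask] == 0)  # observed cells never change
imputed_vary   <- all(spread[mi_toy$imputed_mask] > 0)    # imputed cells differ by draw
```

The same two checks can be run on a real return value by replacing mi_toy with the object returned by multi_impute(), provided every trait column is numeric.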

Details

These draws do not condition on a declared substantive analysis model. Consequently, stochastic variation alone does not make them proper or congenial multiple imputations. The analysis-aware backend requires the analysis model before generating draws and dispatches only across its documented supported model classes.

draws_method = "conformal" (default): Run the model once; each originally-missing continuous cell is sampled as \(x_{ij}^{(k)} \sim \mathrm{N}\bigl(\hat\mu_{ij},\; (q_{j}/1.96)^2\bigr)\), that is, a Normal distribution with mean \(\hat\mu_{ij}\) and standard deviation \(q_j/1.96\), where \(q_j\) is the trait-level split-conformal residual quantile and \(k = 1, \dots, M\) indexes the draw. Dividing this quantile by 1.96 is a pragmatic Normal-scale construction, not a consequence of the conformal coverage guarantee. Discrete traits (binary, categorical) are drawn from Bernoulli / categorical distributions using the estimated probability vector. For multi_proportion groups the method draws the K CLR latent columns with their BM latent SEs, projects back to sum-zero CLR space, and decodes to the simplex.
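
The per-type draw rules above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not pigauto's internal code: every numeric value (mu_hat, q_j, p_hat, probs, clr_mu, clr_se) is hypothetical, and the simplex decode is assumed to be the standard inverse CLR (exponentiate, then normalise each row).

```r
set.seed(1)
m <- 4L  # number of draws per missing cell

# Continuous trait: Normal draws with SD = q_j / 1.96 (heuristic scale).
mu_hat <- c(sp2 = 2.1, sp5 = 3.4)  # hypothetical point estimates, two missing cells
q_j    <- 0.8                      # hypothetical split-conformal residual quantile
sd_j   <- q_j / 1.96
cont_draws <- sapply(mu_hat, function(mu) rnorm(m, mean = mu, sd = sd_j))  # m x 2

# Binary trait: Bernoulli draws from the estimated probability.
p_hat <- 0.7
bin_draws <- rbinom(m, size = 1, prob = p_hat)

# Categorical trait: one category per draw from the probability vector.
probs <- c(plant = 0.5, invert = 0.3, vert = 0.2)
cat_draws <- sample(names(probs), size = m, replace = TRUE, prob = probs)

# Compositional group: draw K = 3 CLR latents, re-centre, decode to the simplex.
clr_mu <- c(plant = 0.4, invert = 0.1, vert = -0.5)  # hypothetical CLR means
clr_se <- c(0.2, 0.3, 0.25)                          # hypothetical BM latent SEs
z <- matrix(rnorm(m * 3,
                  mean = rep(clr_mu, each = m),
                  sd   = rep(clr_se, each = m)),
            nrow = m, dimnames = list(NULL, names(clr_mu)))
z <- z - rowMeans(z)                     # project back to the sum-zero CLR subspace
comp_draws <- exp(z) / rowSums(exp(z))   # assumed inverse CLR: each row sums to 1
```

Re-centring the perturbed latents before decoding keeps each draw inside the CLR subspace, so the decoded rows are valid compositions regardless of how large the latent SEs are.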

draws_method = "mc_dropout": Run M GNN forward passes in training mode (dropout active) on top of stochastic BM baseline draws. When the calibrated GNN gate is zero (r_cal = 0), the GNN-dropout term disappears but the BM draw still contributes between-draw variance.

Nakagawa & Freckleton (2008, 2011) review the consequences of ignoring missing data in ecological and comparative analyses and argue for multiple imputation as the default.

When to use this

This function is useful for comparing stochastic prediction behavior from one tree. It is not the analysis-aware inferential backend.

multi_impute_trees() provides an experimental posterior-tree sensitivity path, but tree uncertainty is not supported by multi_impute_analysis().

Safety floor (v0.9.1.9002+)

When fit_pigauto() is called with safety_floor = TRUE (the default since v0.9.1.9002), the 3-way blend r_BM * BM + r_GNN * GNN + r_MEAN * MEAN propagates through every imputation draw automatically via the updated predict.pigauto_fit(). For draws_method = "mc_dropout" the mean term contributes no between-draw variance because it is a deterministic scalar per column; between-draw variance comes from the BM-draw and GNN-dropout terms only. For draws_method = "conformal" the blend centre is the 3-way prediction and conformal scores remain calibrated on the blended residuals.
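
The variance statement for "mc_dropout" can be checked numerically with a short base-R sketch. The weights and the distributions of the BM and GNN draws below are hypothetical and chosen only for illustration; the point is that adding a constant mean term shifts the blend without changing its between-draw variance.

```r
set.seed(1)
m <- 2000L

# Hypothetical blend weights for one trait column.
r_bm <- 0.6; r_gnn <- 0.3; r_mean <- 0.1

bm_draw  <- rnorm(m, mean = 5.0, sd = 0.40)  # stand-in for stochastic BM baseline draws
gnn_draw <- rnorm(m, mean = 5.3, sd = 0.20)  # stand-in for dropout forward passes
col_mean <- 4.8                              # deterministic scalar for the column

# 3-way blend, one value per draw.
blend <- r_bm * bm_draw + r_gnn * gnn_draw + r_mean * col_mean

# Variance with and without the constant mean term.
var_blend        <- var(blend)
var_without_mean <- var(r_bm * bm_draw + r_gnn * gnn_draw)

# With the GNN weight set to zero, the BM draw alone still varies between draws.
blend_gate0 <- r_bm * bm_draw + r_mean * col_mean
var_gate0   <- var(blend_gate0)
```

Here var_blend and var_without_mean agree up to floating-point error, and var_gate0 stays positive, matching the description that between-draw variance comes from the BM-draw and GNN-dropout terms only.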

References

Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.

Nakagawa S, Freckleton RP (2008). "Missing inaction: the dangers of ignoring missing data." Trends in Ecology & Evolution 23(11): 592-596.

Nakagawa S, Freckleton RP (2011). "Model averaging, missing data and multiple imputation: a case study for behavioural ecology." Behavioral Ecology and Sociobiology 65(1): 103-116.

Examples

if (FALSE) { # \dontrun{
library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL

# Generate 100 stochastic completion datasets (prediction diagnostics only)
mi <- multi_impute(df, tree300, m = 100)
print(mi)

# Inspect stochastic prediction sensitivity only.
lapply(mi$datasets, head)
} # }