Posterior-tree prediction sensitivity

Run pigauto's full imputation pipeline on each of T posterior phylogenies, generating m_per_tree stochastic completions per tree for a total of T * m_per_tree completed datasets. Each completed dataset is conditional on a specific posterior tree, recorded in the tree_index component of the returned object (e.g. mi$tree_index).

Usage

multi_impute_trees(
  traits,
  trees,
  m_per_tree = 1L,
  species_col = NULL,
  trait_types = NULL,
  multi_proportion_groups = NULL,
  log_transform = TRUE,
  missing_frac = 0.25,
  covariates = NULL,
  epochs = 2000L,
  verbose = TRUE,
  seed = 1L,
  share_gnn = TRUE,
  reference_tree = NULL,
  ...
)

Arguments

traits: data.frame. Same format as multi_impute() and impute().
trees: list of phylo objects (class multiPhylo or plain list). Each tree must contain the species in traits as tips. Posterior samples from BirdTree.org (Jetz et al. 2012) are ideal; the bundled trees300 dataset provides 50 posterior trees for avonet300.
m_per_tree: integer. Number of stochastic completion draws per tree (default 1). Total datasets = length(trees) * m_per_tree.
species_col: character or NULL. See impute().
trait_types: named character vector overriding auto-detected trait types. Required for "proportion" and "zi_count". See impute() / preprocess_traits(). Default NULL (auto-detect).
multi_proportion_groups: named list declaring compositional trait groups (rows summing to 1), forwarded to impute() / preprocess_traits(). Default NULL.
log_transform: logical. Auto-log positive continuous columns (default TRUE).
missing_frac: numeric. Fraction held out for validation/test during training (default 0.25).
covariates: data.frame or matrix of environmental covariates (fully observed, numeric). Passed through to impute(). Default NULL (no covariates).
epochs: integer. Maximum GNN training epochs per tree (default 2000).
verbose: logical. Print progress (default TRUE).
seed: integer. Base random seed (default 1). Tree t (t = 1, ..., T) uses seed + t - 1, so results are reproducible.
share_gnn: logical. If TRUE (default), fit the GNN once on a reference tree and reuse it across all posterior trees, recomputing only the BM baseline per tree. Gives a ~10-15x speedup at n=10k. See the "Share-GNN" section below for tree-uncertainty propagation details. Set FALSE to fit from scratch on every tree (the pre-v0.9.1 behaviour) when you need exact tree-by-tree model independence.
reference_tree: optional phylo used as the training tree when share_gnn = TRUE. Default NULL selects the maximum-clade-credibility tree via phangorn::maxCladeCred(trees). If phangorn is not installed, falls back to trees[[1]] with a warning.
...: additional arguments forwarded to fit_pigauto() via impute().
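
The bookkeeping implied by seed, m_per_tree and reference_tree can be sketched as below. This is an illustration of the documented defaults, not pigauto's internal code. T_trees and pick_reference_tree() are hypothetical names introduced here. phangorn::maxCladeCred() is the only external call, and the sketch assumes trees is a multiPhylo object.

```r
# Sketch only: mirrors the documented defaults, not pigauto internals.
T_trees    <- 50L  # hypothetical number of posterior trees, i.e. length(trees)
m_per_tree <- 2L
seed       <- 1L

# Tree t uses seed + t - 1, so the first tree uses the base seed itself.
tree_seeds <- seed + seq_len(T_trees) - 1L

# Total number of completed datasets returned.
n_datasets <- T_trees * m_per_tree

# Hypothetical helper: choose the training tree when share_gnn = TRUE.
pick_reference_tree <- function(trees) {
  if (requireNamespace("phangorn", quietly = TRUE)) {
    # Maximum-clade-credibility tree of the posterior sample.
    phangorn::maxCladeCred(trees)
  } else {
    warning("phangorn not installed; using trees[[1]] as reference tree")
    trees[[1]]
  }
}
```

Passing an explicit reference_tree skips this selection, which may be useful when a published consensus tree is preferred over the MCC tree of the sample.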

Value

An object of class "pigauto_mi_trees", inheriting from "pigauto_mi", with components:

datasets: List of T * m_per_tree completed data.frames. Observed cells are preserved; missing cells are filled with stochastic draws for prediction-sensitivity diagnostics.
m: Total number of datasets (T * m_per_tree).
n_trees: Number of posterior trees used.
m_per_tree: Imputations per tree.
tree_index: Integer vector of length m; element i gives the tree index (1..T) for dataset i.
pooled_point: Single data.frame averaging across all T * m_per_tree datasets. For reporting, not inference.
se: Matrix of per-cell pooled SEs (NA if not available).
imputed_mask: Logical matrix; TRUE where a cell was originally missing.
share_gnn: Logical; TRUE if the shared-GNN path was used.
fit: Single pigauto_fit trained on the reference tree when share_gnn = TRUE; NULL otherwise.
fits: List of T pigauto_fit objects (one per tree) when share_gnn = FALSE; NULL when share_gnn = TRUE.
reference_tree: The reference phylo used for GNN training when share_gnn = TRUE; NULL otherwise.
trees: The input posterior trees.
species_col: Passed-through species column name.
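
A minimal sketch of how these components fit together, using a mock list in place of a real "pigauto_mi_trees" object. All values are made up, the tree_index ordering shown is an assumption, and the averaging covers a numeric column only. How pigauto pools non-numeric traits into pooled_point is not described here.

```r
# Mock object carrying a few of the documented components (values are made up).
set.seed(1)
T_trees    <- 3L
m_per_tree <- 2L
tree_index <- rep(seq_len(T_trees), each = m_per_tree)  # assumed ordering

datasets <- lapply(seq_along(tree_index), function(i) {
  # sp_a is observed (always 10); sp_b is a stochastic completion.
  data.frame(Mass = c(10, 20 + rnorm(1)), row.names = c("sp_a", "sp_b"))
})

mi <- list(datasets = datasets, m = length(datasets), n_trees = T_trees,
           m_per_tree = m_per_tree, tree_index = tree_index)

# Group completed datasets by the tree they were conditioned on.
by_tree <- split(mi$datasets, mi$tree_index)
n_per_tree <- lengths(by_tree)

# Cell-wise average across all datasets, analogous to pooled_point
# for a numeric trait.
pooled_mass <- rowMeans(vapply(mi$datasets, function(d) d$Mass, numeric(2)))
```

The observed cell is identical in every dataset, so its pooled value equals the observed value. Only the originally missing cell is an average over draws.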

Details

This is an experimental prediction-sensitivity workflow. It compares stochastic completions across a posterior tree sample; it does not provide a validated downstream inferential workflow. See vignette("tree-uncertainty").

Each completed dataset carries the signal of the posterior tree it was generated on, so between-tree variation enters the returned stochastic completions. This path is not supported by multi_impute_analysis() and has not passed the analysis-aware inferential gate. Do not use it for downstream inference.

For each tree the function runs the full pigauto pipeline (preprocess -> baseline -> GNN -> predict) when share_gnn = FALSE. With the default share_gnn = TRUE, the GNN is trained once and only the baseline is recomputed per tree. Topologies and branch lengths vary across trees, so the phylogenetic baseline covariance differs for each tree.

The returned datasets may be compared descriptively to assess sensitivity of point imputations to the tree sample. No calibrated downstream standard error or Rubin-pooling claim is made for this path.
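
One way to make such a descriptive comparison is to restrict the summary to cells that were originally missing, since observed cells are identical in every dataset. The sketch below uses mock data for a single trait. In a real object imputed_mask is a matrix, and the numbers here are invented for illustration.

```r
# Mock completions for one trait across m datasets (values are made up).
set.seed(42)
n_sp <- 4L
m    <- 5L
observed     <- c(12, NA, 30, NA)  # NA marks originally missing cells
imputed_mask <- is.na(observed)    # one column of the documented mask

datasets <- lapply(seq_len(m), function(i) {
  mass <- observed
  mass[imputed_mask] <- rnorm(sum(imputed_mask), mean = 20, sd = 2)
  data.frame(Mass = mass)
})

# Species x dataset matrix of completed values.
mass_mat <- vapply(datasets, function(d) d$Mass, numeric(n_sp))

# Per-species spread across datasets: descriptive only, not a standard error.
spread <- apply(mass_mat, 1L, stats::sd)

# Min and max of the completions for the imputed species only.
range_imputed <- t(apply(mass_mat[imputed_mask, , drop = FALSE], 1L, range))
```

Observed species have zero spread by construction, so reporting spread only where imputed_mask is TRUE avoids diluting the summary.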

Computation time. With share_gnn = TRUE (default), the cost is one GNN fit plus T cheap baseline passes. Rough budget on a modern CPU laptop:

Species n	1 fit	T = 50, share_gnn = TRUE	T = 50, share_gnn = FALSE
300	~30-60 s	~3-5 min	25-50 min
5,000	~5-10 min	~10-20 min	4-8 hr
10,000	~20-40 min	~30-60 min	17-33 hr
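
The share_gnn = FALSE column appears consistent with simply multiplying the single-fit cost by T, since every tree is fit from scratch. The arithmetic below reproduces that column from the "1 fit" column. The variable names are illustrative, and the timings are the rough figures from the table, not new measurements.

```r
# Rough single-fit budgets from the table, in seconds (low and high ends).
T_trees <- 50
one_fit_sec <- rbind(
  "300"   = c(lo = 30,      hi = 60),
  "5000"  = c(lo = 5 * 60,  hi = 10 * 60),
  "10000" = c(lo = 20 * 60, hi = 40 * 60)
)

# share_gnn = FALSE refits the full model on every tree.
full_refit_sec <- T_trees * one_fit_sec

full_refit_min <- round(full_refit_sec / 60)       # minutes
full_refit_hr  <- round(full_refit_sec / 3600, 1)  # hours
```

The share_gnn = TRUE column cannot be derived the same way, because the per-tree baseline cost is not listed separately.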

When to use this

Choose the prediction-diagnostic function based on how many trees you have:

One tree (single published phylogeny, single time-calibrated tree): use multi_impute() and select a draw method explicitly when needed.
Multiple posterior trees (BirdTree samples, BEAST posterior, etc.): use multi_impute_trees() only for prediction sensitivity. Tree uncertainty is unsupported by the analysis-aware backend.

Share-GNN

Under share_gnn = TRUE the GNN weights and spectral features are trained once on the reference tree (MCC by default). For each posterior tree the BM / joint-MVN baseline is recomputed, and the prediction is the blend (1 - r_cal) * baseline_t + r_cal * gnn_shared. Because r_cal is calibrated once on held-out data at the reference tree and applied uniformly, the tree-uncertainty contribution is:

Fully preserved when the gate is closed (r_cal near 0): the GNN contributes nothing, and the baseline varies per tree.
Partially preserved when the gate is partially open (0 < r_cal < 1): the baseline portion still varies, but the GNN portion is a tree-invariant constant, so tree variance in the GNN channel is slightly underestimated.
Lost in the GNN channel when the gate is fully open (rare on real data; the baseline channel still carries tree variation).
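
The three cases follow directly from the blend formula: with gnn_shared fixed, the between-tree standard deviation of a predicted cell scales by (1 - r_cal). The sketch below simulates this for a single cell. The baseline values and the GNN prediction are invented, and the sketch covers the point prediction only, not the stochastic draws.

```r
# Hypothetical per-tree baseline predictions for one cell (made-up values).
set.seed(123)
T_trees    <- 200L
baseline_t <- rnorm(T_trees, mean = 3.0, sd = 0.4)
gnn_shared <- 3.5  # hypothetical tree-invariant GNN prediction

# Blend from the text: (1 - r_cal) * baseline_t + r_cal * gnn_shared.
blend <- function(r_cal) (1 - r_cal) * baseline_t + r_cal * gnn_shared

sd_closed <- sd(blend(0))    # gate closed: full baseline variation
sd_half   <- sd(blend(0.5))  # gate half open: half the baseline SD
sd_open   <- sd(blend(1))    # gate fully open: no between-tree variation

# Ratio to the raw between-tree SD of the baseline.
ratio <- c(closed = sd_closed, half = sd_half, open = sd_open) / sd(baseline_t)
```

In this setup the ratios come out at approximately 1, 0.5 and 0, which is the (1 - r_cal) scaling described in the list above.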

On every real dataset benchmarked in the v0.9.0 campaign the gate closed partially or fully. This evidence is specific to those benchmark regimes and does not guarantee tree-variance calibration elsewhere. Set share_gnn = FALSE if you need exact per-tree model independence.

When share_gnn = TRUE with safety_floor = TRUE, the grand-mean baseline mean_baseline_per_col and the three calibrated weights (r_cal_bm, r_cal_gnn, r_cal_mean) are computed ONCE on the reference tree and reused across all posterior trees. They are properties of the observed training traits, not of the tree topology. Each posterior tree only recomputes the BM baseline; the GNN delta and the three weights stay fixed. This preserves prediction sensitivity to the per-tree baseline without re-calibrating the safety floor; it does not establish downstream inferential validity.

References

Nakagawa S, de Villemereuil P (2019). "A general method for simultaneously accounting for phylogenetic and species sampling uncertainty via Rubin's rules in comparative analysis." Systematic Biology 68(4): 632-641.

Jetz W, Thomas GH, Joy JB, Hartmann K, Mooers AO (2012). "The global diversity of birds in space and time." Nature 491(7424): 444-448.

Examples

if (FALSE) { # \dontrun{
library(pigauto)
data(avonet300, trees300)
# Move species names into row names, then drop the key column
df <- avonet300
rownames(df) <- df$Species_Key
df$Species_Key <- NULL

# Posterior-tree prediction sensitivity
mi <- multi_impute_trees(df, trees300, m_per_tree = 1L)
print(mi)

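# Species x dataset matrix of completed Mass values (one column per dataset)
mass_by_tree <- vapply(mi$datasets, function(dat) dat$Mass,
                       numeric(nrow(df)))
# Per-species SD across datasets: descriptive, not an MI standard error
apply(mass_by_tree, 1L, stats::sd)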
} # }