Posterior-Tree Prediction Sensitivity

Scope

This article covers descriptive prediction sensitivity only. Tree uncertainty was not part of the analysis-aware MI validation campaign. The datasets returned by multi_impute_trees() are not supported for downstream inference, and the function cannot currently be combined with multi_impute_analysis().

Use this article only when you have a posterior sample of trees (from BEAST, MrBayes, BirdTree.org, etc.) and want to see how point imputations change across that sample.

Prediction-sensitivity workflow

With share_gnn = TRUE (the default), the GNN is trained once on a reference tree and the baseline is recomputed for every posterior tree. The returned datasets can be compared descriptively; do not pass them to pool_mi().

library(pigauto)
data(avonet300, trees300)
df <- avonet300
rownames(df) <- df$Species_Key
df$Species_Key <- NULL

mi <- multi_impute_trees(df, trees = trees300, m_per_tree = 1L)
# share_gnn = TRUE and reference_tree = MCC via phangorn are both defaults

mass_by_tree <- vapply(mi$datasets, function(dat) dat$Mass, numeric(nrow(df)))
apply(mass_by_tree, 1L, stats::sd)  # descriptive sensitivity, not an MI SE

The code above is illustrative and left unevaluated because the tree loop is computationally expensive.
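
Because the call above is left unevaluated, the summary step can be sketched on simulated numbers using only base R. The matrix sim_by_tree below is a hypothetical stand-in for mass_by_tree (species in rows, posterior trees in columns). The species names, tree labels, and values are made up for illustration and are not pigauto output.

```r
# Hypothetical stand-in for mass_by_tree: 5 species x 4 posterior trees.
# Values are simulated for illustration; they are not pigauto output.
set.seed(1)
n_species <- 5L
n_trees <- 4L
sim_by_tree <- matrix(
  rnorm(n_species * n_trees, mean = 50, sd = 5),
  nrow = n_species,
  dimnames = list(paste0("sp", seq_len(n_species)),
                  paste0("tree", seq_len(n_trees)))
)

# Per-species spread of the point imputations across trees.
sens <- data.frame(
  sd_across_trees = apply(sim_by_tree, 1L, stats::sd),
  range_across_trees = apply(sim_by_tree, 1L, function(x) diff(range(x))),
  row.names = rownames(sim_by_tree)
)

# Species whose imputations move most with the tree come first.
sens <- sens[order(sens$sd_across_trees, decreasing = TRUE), ]
print(sens)
```

As with the real output, these columns describe how much the point imputations move across trees; they are not standard errors.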

Why share_gnn = TRUE preserves tree signal

The calibrated gate r_cal controls how much of each prediction comes from the baseline versus the GNN. In high-phylogenetic-signal regimes the gate often closes or nearly closes, so pred ≈ baseline(tree_t) and the per-tree baseline carries the tree-uncertainty signal. When the gate is partly open, the GNN component is shared across trees, but the per-tree baseline still varies with tree_t. See ?multi_impute_trees under “Share-GNN (tree-sharing) mode” for the fully-open and partially-open cases.
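
The argument can be illustrated with a toy calculation in base R. The sketch assumes the prediction is a convex blend, pred_t = (1 - r_cal) * baseline_t + r_cal * gnn. That blend form is an assumption made here for illustration; this article only states that the gate controls the mix, and ?multi_impute_trees is the place to check the actual definition. The names baseline_t, gnn_shared, and blend() are hypothetical, and all values are simulated.

```r
# Assumed blend (illustrative only, not taken from the pigauto source):
#   pred_t = (1 - r_cal) * baseline_t + r_cal * gnn
# baseline_t varies by posterior tree; the GNN output is shared across trees.
set.seed(42)
n_trees <- 200L
baseline_t <- rnorm(n_trees, mean = 3.0, sd = 0.2)  # hypothetical per-tree baseline, one species
gnn_shared <- 3.4                                    # hypothetical shared GNN output

blend <- function(r_cal) (1 - r_cal) * baseline_t + r_cal * gnn_shared

sd_closed <- stats::sd(blend(0))     # gate closed: prediction equals the baseline
sd_partial <- stats::sd(blend(0.3))  # gate partly open

c(baseline = stats::sd(baseline_t), closed = sd_closed, partial = sd_partial)
```

Under this assumed form, a closed gate reproduces the between-tree spread of the baseline exactly, and a gate at r_cal = 0.3 scales that spread by 0.7 rather than removing it.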

If you need exact per-tree model independence (e.g. for methodological comparison), set share_gnn = FALSE:

mi_slow <- multi_impute_trees(df, trees = trees300, m_per_tree = 1L,
                              share_gnn = FALSE)
# fits T = length(trees300) full pigauto models -- ~10-15x slower.

What this does not establish

Variation across these datasets is not a calibrated standard error, confidence interval, or fraction of missing information. It does not validate Rubin pooling, variance components, correlations, BLUPs, conditional modes, or latent loadings. A future tree-aware analysis backend requires a separate known-DGP campaign.

References

Jetz W et al. (2012). The global diversity of birds in space and time. Nature 491: 444-448.
