Run pigauto's full imputation pipeline on each of T posterior
phylogenies, generating m_per_tree stochastic completions per tree
for a total of T * m_per_tree completed datasets. Each completed
dataset is conditional on a specific posterior tree
(recorded in mi$tree_index).
Usage
multi_impute_trees(
traits,
trees,
m_per_tree = 1L,
species_col = NULL,
trait_types = NULL,
multi_proportion_groups = NULL,
log_transform = TRUE,
missing_frac = 0.25,
covariates = NULL,
epochs = 2000L,
verbose = TRUE,
seed = 1L,
share_gnn = TRUE,
reference_tree = NULL,
...
)Arguments
- traits
data.frame. Same format as
multi_impute()andimpute().- trees
list of
phyloobjects (classmultiPhyloor plain list). Each tree must contain the species intraitsas tips. Posterior samples from BirdTree.org (Jetz et al. 2012) are ideal; the bundled trees300 dataset provides 50 posterior trees for avonet300.- m_per_tree
integer. Number of MC-dropout imputations per tree (default
1). Total datasets =length(trees) * m_per_tree. The canonical Nakagawa & de Villemereuil (2019) workflow uses T = 50 posterior trees × m_per_tree = 1 = M = 50 total datasets.- species_col
character or
NULL. Seeimpute().- trait_types
named character vector overriding auto-detected trait types. Required for
"proportion"and"zi_count". Seeimpute()/preprocess_traits(). DefaultNULL(auto-detect).- multi_proportion_groups
named list declaring compositional trait groups (rows summing to 1), forwarded to
impute()/preprocess_traits(). DefaultNULL.- log_transform
logical. Auto-log positive continuous columns (default
TRUE).- missing_frac
numeric. Fraction held out for validation/test during training (default
0.25).- covariates
data.frame or matrix of environmental covariates (fully observed, numeric). Passed through to
impute(). DefaultNULL(no covariates).- epochs
integer. Maximum GNN training epochs per tree (default
2000).- verbose
logical. Print progress (default
TRUE).- seed
integer. Base random seed; each tree uses
seed + t - 1so results are reproducible (default1).logical. If
TRUE(default), fit the GNN once on a reference tree and reuse it across all posterior trees, recomputing only the BM baseline per tree. Gives a ~10-15x speedup at n=10k. See the "Share-GNN" section below for tree-uncertainty propagation details. SetFALSEto fit from scratch on every tree (the pre-v0.9.1 behaviour) when you need exact tree-by-tree model independence.- reference_tree
optional
phyloused as the training tree whenshare_gnn = TRUE. DefaultNULLselects the maximum-clade-credibility tree viaphangorn::maxCladeCred(trees). Ifphangornis not installed, falls back totrees[[1]]with a warning.- ...
additional arguments forwarded to
fit_pigauto()viaimpute().
Value
An object of class "pigauto_mi_trees", inheriting from
"pigauto_mi", with components:
datasetsList of
T * m_per_treecompleted data.frames. Observed cells are preserved; missing cells are filled with imputation draws. Compatible withwith_imputations().mTotal number of datasets (
T * m_per_tree).n_treesNumber of posterior trees used.
m_per_treeImputations per tree.
tree_indexInteger vector of length
m; elementigives the tree index (1..T) for dataseti.pooled_pointSingle data.frame averaging across all
T * m_per_treedatasets. For reporting, not inference.seMatrix of per-cell pooled SEs (NA if not available).
imputed_maskLogical matrix;
TRUEwhere a cell was originally missing.share_gnnLogical;
TRUEif the shared-GNN path was used.fitSingle
pigauto_fittrained on the reference tree whenshare_gnn = TRUE;NULLotherwise.fitsList of
Tpigauto_fitobjects (one per tree) whenshare_gnn = FALSE;NULLwhenshare_gnn = TRUE.reference_treeThe reference
phyloused for GNN training whenshare_gnn = TRUE;NULLotherwise.treesThe input posterior trees.
species_colPassed-through species column name.
Details
This is step 1 of the two-step workflow for propagating tree
uncertainty. Step 2 — varying the downstream analysis tree in
lockstep with the imputation tree — is the user's responsibility
and is described in the Examples section and in
vignette("tree-uncertainty").
multi_impute_trees() handles the imputation half (step 1) cleanly:
every completed dataset carries a different tree's signal so that
between-tree variation propagates into the pooled standard errors.
Step 2 is where Nakagawa & de Villemereuil (2019) enters: for each
completed dataset, fit the downstream model (e.g. nlme::gls() with
a corBrownian on trees[[ mi$tree_index[i] ]]), then pool the
T × M fits with pool_mi().
For each tree the function runs the full pigauto pipeline
(preprocess -> baseline -> GNN -> predict) when share_gnn = FALSE.
With the default share_gnn = TRUE, the GNN is trained once and only
the baseline is recomputed per tree. Topologies and branch lengths vary
across trees, so the phylogenetic baseline covariance differs for each
tree.
Downstream usage is identical to multi_impute(): pass the result
to with_imputations() to fit a model on each dataset, then to
pool_mi() for Rubin's-rules pooling. The pooled standard errors
will be wider than those from a single tree because they incorporate
the extra between-tree variance.
Variance decomposition. The between-imputation variance from
Rubin's rules has two sources: (1) within-tree sampling variance
(MC-dropout noise), and (2) between-tree variance (phylogenetic
uncertainty at the imputation step). The fraction of missing
information (FMI) reported by pool_mi() reflects both. To decompose
them, compare FMI from multi_impute() (single tree) with FMI from
multi_impute_trees().
Computation time. With share_gnn = TRUE (default): one GNN fit
T cheap baseline passes. Rough budget on a modern CPU laptop:
| Species n | 1 fit | T = 50 share_gnn=TRUE | T = 50 share_gnn=FALSE |
| 300 | ~30-60 s | ~3-5 min | 25-50 min |
| 5,000 | ~5-10 min | ~10-20 min | 4-8 hr |
| 10,000 | ~20-40 min | ~30-60 min | 17-33 hr |
When to use this
pigauto provides two multiple-imputation functions. Pick based on how many trees you have:
One tree (single published phylogeny, single time-calibrated tree): use
multi_impute(). ThemMC-dropout imputations capture model uncertainty.Multiple posterior trees (BirdTree samples, BEAST posterior, etc.): use
multi_impute_trees(). Between-tree variation is added to the pooled SEs via Rubin's rules (Nakagawa & de Villemereuil 2019).
The two functions share the same downstream API — both return objects
compatible with with_imputations() and pool_mi().
Share-GNN (tree-sharing) mode
Under share_gnn = TRUE the GNN weights and spectral features are
trained once on the reference tree (MCC by default). For each
posterior tree the BM / joint-MVN baseline is recomputed, and the
prediction is the blend (1 - r_cal) * baseline_t + r_cal * gnn_shared.
Because r_cal is calibrated once on held-out data at the reference
tree and applied uniformly, the tree-uncertainty contribution is:
Fully preserved when the gate is closed (r_cal near 0): the GNN contributes nothing, and the baseline varies per tree.
Partially preserved when the gate is open: the baseline portion still varies, but the GNN portion is a tree-invariant constant — this slightly under-estimates tree variance in the GNN channel.
Lost in the GNN channel when the gate is fully open (rare on real data; the baseline channel still carries tree variation).
On every real dataset benchmarked in the v0.9.0 campaign the gate
closed partially or fully, so share_gnn = TRUE is cheap AND honest.
Set share_gnn = FALSE if you need exact per-tree model independence.
Safety floor + share_gnn (v0.9.1.9002+)
When share_gnn = TRUE with safety_floor = TRUE, the
grand-mean baseline mean_baseline_per_col and the three
calibrated weights (r_cal_bm, r_cal_gnn,
r_cal_mean) are computed ONCE on the reference tree and
reused across all posterior trees. They are properties of the
observed training traits, not of the tree topology. Each posterior
tree only recomputes the BM baseline; the GNN delta and the three
weights stay fixed. This preserves the Nakagawa & de Villemereuil
(2019) tree-uncertainty integration story without re-calibrating
the safety floor per tree, and keeps the shared-GNN speedup intact.
References
Nakagawa S, de Villemereuil P (2019). "A general method for simultaneously accounting for phylogenetic and species sampling uncertainty via Rubin's rules in comparative analysis." Systematic Biology 68(4): 632-641.
Jetz W, Thomas GH, Joy JB, Hartmann K, Mooers AO (2012). "The global diversity of birds in space and time." Nature 491(7424): 444-448.
See also
multi_impute() for single-tree MI, with_imputations(),
pool_mi(), trees300
Examples
if (FALSE) { # \dontrun{
library(pigauto)
data(avonet300, trees300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL
# ---- Step 1: tree-aware imputation (canonical N&dV 2019 workflow) --
# 50 trees x 1 imputation = 50 completed datasets (fast with share_gnn=TRUE)
mi <- multi_impute_trees(df, trees300, m_per_tree = 1L)
print(mi)
# ---- Step 2: tree-aware analysis (Nakagawa & de Villemereuil 2019)
# For each completed dataset, fit the downstream model using the SAME
# tree that produced that dataset. `mi$tree_index[i]` gives the tree
# index (1..T) for dataset `i`.
fits <- with_imputations(mi, function(dat, tree) {
dat$species <- rownames(dat)
nlme::gls(
log(Mass) ~ log(Wing.Length),
correlation = ape::corBrownian(
phy = tree, form = ~species),
data = dat, method = "ML"
)
})
# Rubin's rules: pooled SEs include both trait-imputation and
# phylogenetic-tree uncertainty.
pool_mi(fits)
} # }
