Do I need this article?

Short answer: only if you have a posterior sample of trees (from BEAST, MrBayes, BirdTree.org, etc.) and you want the tree-topology uncertainty to show up in your pooled standard errors and p-values.

Situation                                              Use this article?
One tree (published phylogeny, time-calibrated tree)   No. Use multi_impute() and the mixed-types vignette.
Posterior sample (2 or more trees)                     Yes.

The two-step workflow

Tree uncertainty enters the analysis in two places. pigauto handles step 1. Step 2 is your responsibility because the downstream model is your choice.

+--------------------------------------+
| Step 1 -- imputation                 |
|                                      |
| multi_impute_trees(traits, trees)    |
|   -> T x m_per_tree completed        |
|      data.frames, each tagged with   |
|      the tree that produced it       |
+------------------+-------------------+
                   |
                   v
+--------------------------------------+
| Step 2 -- analysis + pool            |
|                                      |
| for dataset i:                       |
|   fit model with trees[[t_i]]        |
| pool_mi(fits)                        |
|                                      |
| The SAME tree that produced          |
| dataset i is used to fit model i.    |
+--------------------------------------+

The canonical workflow (Nakagawa & de Villemereuil 2019)

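With share_gnn = TRUE (the default), pigauto is fitted once on the reference tree, so imputing across T = 50 posterior trees is cheap compared with refitting the model for every tree. Use one imputation per tree (M = 50 datasets in total), fit the downstream model 50 times (each time with the tree that produced the dataset), and pool with Rubin’s rules.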

library(pigauto)
data(avonet300, trees300)
df <- avonet300
rownames(df) <- df$Species_Key
df$Species_Key <- NULL

mi <- multi_impute_trees(df, trees = trees300, m_per_tree = 1L)
# share_gnn = TRUE, reference_tree = MCC via phangorn -- all default

fits <- with_imputations(mi, function(dat, tree) {
  dat$species <- rownames(dat)
  nlme::gls(
    log(Mass) ~ log(Wing.Length),
    correlation = ape::corBrownian(phy = tree, form = ~species),
    data = dat, method = "ML"
  )
})
pool_mi(fits)   # pooled SEs include both imputation and tree uncertainty

The code above is illustrative and is not run here. Full execution takes about 25 minutes: it fits pigauto once on the MCC reference tree, then fits a GLS model for each of the 50 posterior trees.
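
For orientation, the pooling step can be sketched in base R for a single coefficient. This is a minimal illustration of Rubin's rules (Rubin 1987), not pigauto's pool_mi() implementation; pool_rubin() and the toy estimates are hypothetical and introduced here only for the example.

```r
# Hypothetical helper: Rubin's rules for one coefficient.
# est = per-fit point estimates, se = per-fit standard errors.
pool_rubin <- function(est, se) {
  stopifnot(length(est) == length(se), length(est) >= 2)
  m    <- length(est)
  qbar <- mean(est)                 # pooled point estimate
  ubar <- mean(se^2)                # within-fit variance
  b    <- var(est)                  # between-fit variance
  tvar <- ubar + (1 + 1 / m) * b    # total variance
  df   <- (m - 1) * (1 + ubar / ((1 + 1 / m) * b))^2
  c(estimate = qbar, se = sqrt(tvar), df = df)
}

# Toy inputs: five fits, each on its own imputed dataset and tree.
est <- c(0.50, 0.54, 0.46, 0.52, 0.48)
se  <- rep(0.10, 5)
pooled <- pool_rubin(est, se)
```

Because each fit uses its own tree, the between-fit variance b reflects both imputation and tree variation, so the pooled SE is larger than the within-fit SE of 0.10.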

Why share_gnn = TRUE preserves tree signal

The calibrated gate r_cal controls how much of each prediction comes from the baseline and how much from the GNN. In high-phylogenetic-signal regimes the gate often closes or nearly closes, so pred ≈ baseline(tree_t) and the per-tree baseline carries the tree-uncertainty signal. When the gate is partly open, the GNN component is shared across trees, but the per-tree baseline still varies with tree_t, so predictions still differ between trees. See ?multi_impute_trees under “Share-GNN (tree-sharing) mode” for the fully-open and partially-open cases.
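
The argument can be illustrated with a toy simulation. The sketch below assumes the gate blends the two components as a convex combination, with r_cal = 0 meaning a closed gate; that form, and the names blend() and tree_var(), are assumptions made for illustration and are not taken from pigauto's source.

```r
set.seed(1)
n_species <- 5
n_trees   <- 4

# Per-tree baseline predictions: one column per posterior tree (toy values).
baseline <- matrix(rnorm(n_species * n_trees), n_species, n_trees)
# GNN component: fitted once, so identical for every tree (toy values).
gnn <- rnorm(n_species)

# ASSUMED blending rule: r_cal = 0 is a closed gate (baseline only).
blend <- function(r_cal) (1 - r_cal) * baseline + r_cal * gnn

# Between-tree variance of the prediction, per species.
tree_var <- function(r_cal) apply(blend(r_cal), 1, var)

v_closed <- tree_var(0)    # all baseline variance retained
v_half   <- tree_var(0.5)  # shrunk by (1 - r_cal)^2, but not zero
v_open   <- tree_var(1)    # shared GNN only: no between-tree variance
```

Under this assumed rule the between-tree variance scales with (1 - r_cal)^2, so it vanishes only when the gate is fully open.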

If you need exact per-tree model independence (e.g. for methodological comparison), set share_gnn = FALSE:

mi_slow <- multi_impute_trees(df, trees300, m_per_tree = 1L,
                               share_gnn = FALSE)
# Fits one full pigauto model per tree (T = length(trees300)); roughly 10-15x slower.

Scale choices

T     m_per_tree        M      When
50    1                 50     Default. Canonical N&dV 2019.
20    2                 40     Smaller posterior, still stable.
10    5                 50     Very small posterior; per-tree variance helps.
<10   bump m_per_tree   >=25   Runtime warning fires; Rubin’s rules unstable below M = 25.
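
The arithmetic behind the table is M = T x m_per_tree. As a rough sketch, the smallest m_per_tree that reaches the M = 25 floor can be computed as below; m_per_tree_for() is a hypothetical helper, not a pigauto function, and the table's own rows aim for a larger M (40 to 50) than this minimum.

```r
# Hypothetical helper: smallest m_per_tree with n_trees * m_per_tree >= min_total.
m_per_tree_for <- function(n_trees, min_total = 25L) {
  stopifnot(n_trees >= 1, min_total >= 1)
  as.integer(ceiling(min_total / n_trees))
}

m_small <- m_per_tree_for(8)    # posterior of 8 trees
m_total <- 8L * m_small         # resulting number of datasets
m_large <- m_per_tree_for(50)   # posterior of 50 trees
```

Passing a higher min_total, for example 50, gives settings closer to the recommended rows.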

References

  • Nakagawa S, de Villemereuil P (2019). A general method for simultaneously accounting for phylogenetic and species sampling uncertainty via Rubin’s rules in comparative analysis. Systematic Biology 68(4): 632-641.
  • Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
  • Jetz W et al. (2012). The global diversity of birds in space and time. Nature 491: 444-448.