From species names to a phylogenetic posterior — prepR4pcm + pigauto • prepR4pcm

Phylogenetic comparative methods (PCMs) propagate uncertainty imperfectly. The standard pipeline takes one tree, fits one model, and reports one set of confidence intervals. But the tree itself is typically a point estimate from a posterior, the trait data typically have missing values, and each of those is its own source of uncertainty.

This vignette walks through a pipeline that handles both:

prepR4pcm reconciles species names, fetches a posterior of trees (not just one), and prepares the data for downstream modelling.
pigauto imputes the missing trait values, fits the model on each posterior tree, and pools the results via Rubin’s rules so the reported standard errors reflect both the tree and imputation uncertainty.

The two packages compose: prepR4pcm produces multiPhylo, pigauto consumes multiPhylo. Once the contract is set up, swapping the backend (rtrees vs datelife vs fishtree) doesn’t change the rest of the pipeline.

What we’ll build

A model-ready dataset for fish, with:

A posterior of 50 fish chronograms from the Fish Tree of Life (Rabosky et al. 2018).
5 imputations of the missing trait values per tree (so 5 × 50 = 250 model fits — adjust m_per_tree upward for a real analysis).
Pooled estimates with valid standard errors that account for both axes of uncertainty.

library(prepR4pcm)
# Suggests packages we'll lean on:
# install.packages("fishtree")
# pak::pak("phylotastic/datelife")
# Optional, for the pooling step at the end:
# pak::pak("itchyshin/pigauto")

Step 1 — Retrieve a posterior of trees

Start with whatever names appear in your dataset. They never quite match a tree’s tip labels — formatting drifts, synonyms creep in. pr_get_tree() copes: it normalises every name (underscores, whitespace, authority strings) and, for the fishtree backend, resolves synonyms through Open Tree of Life’s Taxonomic Name Resolution Service before querying the backend.

# Your trait data — typically loaded from disk. Names need not be
# tidy: `Esox_lucius` carries an underscore, which pr_get_tree()
# normalises away.
my_data <- data.frame(
  species   = c("Salmo salar", "Esox_lucius", "Oncorhynchus mykiss"),
  body_size = c(98, NA, 50)        # NA is fine; pigauto will impute it
)

Every backend that can return multiple trees does so via n_tree > 1. Here we ask fishtree for 50 stochastic resolutions (Chang, Rabosky, Alfaro 2019, Syst. Biol.).

trees <- pr_get_tree(my_data,
                     species_col = "species",
                     source      = "fishtree",
                     n_tree      = 50)

class(trees$tree)      # "multiPhylo"
length(trees$tree)     # 50
trees$matched          # the 3 species, in their original input form
trees$unmatched        # character(0) — every species was placed

# Per-tree provenance: one list entry per returned tree, each
# recording source_index, citation, calibration_method, and n_tips.
length(trees$backend_meta$tree_provenance)   # 50
trees$backend_meta$tree_provenance[[1]]      # the provenance of tree 1

pr_get_tree() records what happened to every name in trees$mapping (one row per input species), so a name that quietly fell out of the tree cannot go unnoticed.

For the universal-but-slower option, use source = "datelife" — the chronogram-database backend that returns one calibrated tree per source (Hedges, Bininda-Emonds, Upham, etc.).

trees <- pr_get_tree(my_data,
                     species_col = "species",
                     source      = "datelife",
                     n_tree      = 50)
# trees$tree is multiPhylo; per-source citations are in
# trees$backend_meta$tree_provenance[[i]]$citation

If you already have a topology and want to date it (rather than fetch a new one), use pr_date_tree():

my_topology <- ape::read.tree("my_topology.nex")
dated_trees <- pr_date_tree(my_topology, n_dated = 50)
class(dated_trees$tree)        # "multiPhylo"
length(dated_trees$tree)       # up to 50

What n_dated = 50 actually returns: 50 chronograms that all share my_topology but have different branch lengths. Each variant comes from a different source paper in DateLife’s database (e.g. variant 1 dated against Hedges et al. 2015, variant 2 against Bininda-Emonds et al. 2007, etc.). The topology is fixed; the dating varies.

To get 50 different topologies, fetch them separately first (e.g. pr_get_tree(species, source = "rtrees", taxon = "mammal") returns 100 mammal topologies sampled from VertLife / Upham et al. 2019), then pass that multiPhylo to pr_date_tree() to date each one. That gives you the topology × source cross-product.

Step 2 — Reconcile the data to the posterior

pr_get_tree() resolves names well enough to retrieve a tree, but the modelling step needs a data frame whose rows line up one-to-one with the tree’s tips. reconcile_tree() matches your data against the retrieved topology, and reconcile_apply() returns the aligned data–tree pair. The 50 posterior trees share one tip set, so reconciling against the first tree aligns the data to all of them.

rec <- reconcile_tree(my_data, trees$tree[[1]],
                      x_species = "species",
                      authority = NULL)   # tree tips are plain binomials
rec        # match summary: 1 exact, 2 normalized, 0 unresolved

aligned <- reconcile_apply(rec,
                           data            = my_data,
                           tree            = trees$tree[[1]],
                           species_col     = "species",
                           drop_unresolved = TRUE)

aligned$data    # one row per tree tip — the model-ready frame
#>               species body_size
#> 1         Salmo salar        98
#> 2         Esox_lucius        NA
#> 3 Oncorhynchus mykiss        50

aligned$data carries one row per species in the tree, with Esox_lucius keeping the NA that pigauto will impute; aligned$tree is the matching pruned tree.

Step 3 — Format citations (do this before you submit)

Reviewers ask. Save yourself the round-trip.

# Plain text for an email or a methods paragraph
pr_cite_tree(trees)

# Markdown for a GitHub issue or PR description
cat(pr_cite_tree(trees, format = "markdown"))

# BibTeX for a manuscript bibliography (sanity-check before submitting)
cat(pr_cite_tree(trees, format = "bibtex"))

Step 4 — Hand the posterior to pigauto

This is the contract. pigauto::multi_impute_trees(df, trees, m_per_tree) takes df (data with NAs) and trees (multiPhylo) and returns m_per_tree complete-data imputations per tree. The output knows which imputation came from which tree, and pigauto’s pooling step later uses that index.

library(pigauto)

# 5 imputations per tree × 50 trees = 250 complete datasets
mi <- multi_impute_trees(aligned$data,
                         trees      = trees$tree,
                         m_per_tree = 5)

# Fit your model to each completed dataset. Replace the formula
# with whatever your hypothesis is.
fits <- with_imputations(mi, function(df) {
  glmmTMB::glmmTMB(body_size ~ 1 + (1 | species),
                   data = df)
})

# Pool via Rubin's rules — the standard errors reflect BOTH the
# imputation uncertainty AND the tree-posterior uncertainty.
pooled <- pool_mi(fits,
                  coef_fun = function(fit) fixef(fit)$cond,
                  vcov_fun = function(fit) vcov(fit)$cond)

pooled

The reported intervals are wider than the equivalent single-tree-single-imputation pipeline — and correctly so. The narrower intervals were always overconfident.

Choosing a backend

Verified on a clean macOS R 4.4 install on 2026-05-01. See the Comparing tree backends vignette for the full status table.

If your taxon is…	And you want…	Use	Status
Universal	A single best-guess tree	`pr_get_tree(source = "rotl")`	✅ verified — returns the Open Tree of Life synthesis tree
Universal	A posterior of dated trees	`pr_get_tree(source = "datelife", n_tree = 50)`	likely ✅ — `datelife` is in `Enhances`; install separately with `pak::pak("phylotastic/datelife")`
Birds	A single tree, current Clements	`pr_get_tree(source = "clootl", n_tree = 1)`	✅ verified — works out of the box (uses the v1.6/2025 taxonomy bundled with `clootl`)
Birds	A posterior of trees, current Clements	`pr_get_tree(source = "clootl", n_tree = 50)`	⚠️ requires `clootl::get_avesdata_repo(".")` once before this works; without it `sampleTrees()` errors with `AvesData repo not found`. Capped at 100 upstream.
Fish	Time-calibrated, posterior	`pr_get_tree(source = "fishtree", n_tree = 50)`	✅ verified — returns `multiPhylo[50]`
Bird, mammal, fish, amphibian, reptile, plant, shark, bee, butterfly	Taxon-specific mega-tree, possibly grafted	`pr_get_tree(source = "rtrees", taxon = "<group>")`	✅ works; ⚠️ `n_tree` is informational only — `rtrees::get_tree()` has no `n_tree` argument, so the count is fixed by the chosen mega-tree (`taxon = "bird"` → 100, `taxon = "mammal"` → 100, `taxon = "fish"` → 1, etc.).

For the dating-only case (you already have a topology):

If your topology is…	Use	Status
Universal taxa	`pr_date_tree(my_tree, n_dated = 50)`	likely ✅ — untested in this run; same dependency story as `datelife`

What’s not in this pipeline

Tree comparison across backends. prepR4pcm provides pr_tree_compare() for quick pairwise Jaccard / Robinson-Foulds / branch-length comparisons (see the comparing tree backends vignette). For richer visual comparison use phytools::cophyloplot() or phangorn::densiTree().
Trait imputation alone. That’s pigauto’s impute() / multi_impute(). This vignette skips straight to the trees-aware variant because that’s the integration point.

For repeated runs of the same retrieval (e.g. while iterating on a manuscript), pass cache = TRUE to pr_get_tree(). The cache lives under tools::R_user_dir() by default; redirect with pr_tree_cache_dir() if you want it inside the project.

From species names to a phylogenetic posterior — prepR4pcm + pigauto