
Getting started with pigauto: Phylogenetic Imputation via Graph AUTO-encoders
Source:vignettes/getting-started.Rmd
getting-started.RmdWhat pigauto does
Many comparative datasets have missing trait values: species were not measured, specimens were unavailable, or data were lost. pigauto fills in those gaps using three sources of information:
- The phylogenetic tree. Closely related species tend to share similar traits, so known values of relatives are informative about missing ones.
- Cross-trait correlations. If body mass predicts beak length, then observed masses help impute missing beak lengths even when the species itself was never measured for that trait.
- Environmental covariates (optional). Climate, habitat, or geographic variables can improve imputation when trait variation has a strong environmental component beyond phylogenetic signal.
The package handles continuous measurements, counts, binary
variables, ordered categories, unordered categories, bounded
proportions, zero-inflated counts, and compositional multi-proportion
data — all in a single call to impute().
How it works (briefly)
pigauto combines a phylogenetic baseline with a graph neural network (GNN) correction. The baseline uses Brownian-motion conditional imputation for continuous-family latent columns and phylogenetic label-propagation or threshold/liability candidates for discrete-family columns. The GNN learns additional patterns from tree topology and cross-trait structure. A per-trait gate controls how much the GNN contributes: when the baseline is already good enough, the gate closes and the GNN stays out of the way. This is verified on held-out data after training.
Technical note. The internal class name
ResidualPhyloDAErefers to the ResNet-style skip connections inside the GNN layers, not to a statistical residualy - baseline.
The phylogenetic structure is encoded as:
- A symmetric-normalised Gaussian-kernel adjacency matrix derived from cophenetic (patristic) distances.
- k Laplacian eigenvectors as node positional features — these encode phylogenetic clusters at multiple scales.
Installation
pak::pak("itchyshin/pigauto")
# torch backend (required; ~1 GB first-time download)
torch::install_torch()Quick start
For most users, a single function call is all you need:
result <- impute(traits, tree)
result$completed # data.frame: observed values preserved, NAs filled
result$imputed_mask # logical matrix: TRUE where a cell was imputed
result$prediction$se # per-cell uncertainty (original units)impute() runs the full pipeline internally —
preprocessing, baseline fitting, graph construction, GNN training,
calibration, and prediction. The rest of this vignette walks through
each step individually for users who need fine-grained control, want to
cache intermediate objects, or are benchmarking components.
The AVONET bird data
pigauto ships with avonet300, a 300-species subset of
the AVONET v3 bird morphology database (Tobias et al. 2022, Ecology
Letters), and tree300, a matching pruned Hackett MCC
phylogeny from BirdTree.org.
data(avonet300, tree300, package = "pigauto")
head(avonet300)
#> Species_Key Mass Beak.Length_Culmen Tarsus.Length Wing.Length
#> 1 Eudromia_formosa 772.9 32.5 49.5 223.9
#> 2 Nothoprocta_pentlandii 300.7 27.9 37.0 148.2
#> 3 Rhea_americana 23000.0 86.5 308.0 604.5
#> 4 Ptilopachus_petrosus 193.0 16.9 28.0 117.2
#> 5 Meleagris_ocellata 5525.0 39.8 123.0 362.5
#> 6 Dendragapus_obscurus 1047.3 26.4 38.4 226.5
#> Trophic.Level Primary.Lifestyle Migration
#> 1 Herbivore Terrestrial Resident
#> 2 Herbivore Terrestrial Resident
#> 3 Omnivore Terrestrial Resident
#> 4 Herbivore Terrestrial Resident
#> 5 Herbivore Terrestrial Resident
#> 6 Herbivore Generalist Resident
ape::Ntip(tree300)
#> [1] 300The four morphological traits — body mass, beak length, tarsus length, and wing length — are all log-normally distributed and highly heritable, making them ideal test cases for phylogenetic imputation.
Step 1: Preprocess
preprocess_traits() log-transforms and z-scores the
traits, and reorders rows to match the phylogeny’s tip order (required
by the graph operations).
traits <- avonet300[, -1] # drop Species_Key column
rownames(traits) <- avonet300$Species_Key
pd <- preprocess_traits(traits, tree300, log_transform = TRUE)
print(pd)
#> pigauto_data
#> Species: 300
#> Traits: 7
#> Types: categorical=2, continuous=4, ordinal=1
#> Latent columns: 14
#> Missing values: 0 %The object pd is a pigauto_data list
containing:
-
X_scaled: the z-scored, log-transformed trait matrix (300 × 4) -
means,sds: per-trait normalisation parameters (for back-transformation) -
log_transform:TRUE, recorded sopredict()can invert it automatically
Step 2: Create evaluation splits
We designate 25% of all matrix cells as “missing” for evaluation. These cells are not used during training or baseline fitting; they provide an honest estimate of imputation error.
splits <- make_missing_splits(pd$X_scaled, missing_frac = 0.25, seed = 42)
cat("Total held-out cells:", length(splits$val_idx) + length(splits$test_idx), "\n")
#> Total held-out cells: 1049
cat(" Validation:", length(splits$val_idx), "\n")
#> Validation: 262
cat(" Test: ", length(splits$test_idx), "\n")
#> Test: 787The missing cells are split further: validation cells guide early stopping, test cells provide the final unbiased performance estimate.
Step 3: Fit the BM baseline
fit_baseline() fits the phylogenetic BM baseline after
masking the held-out cells. This gives the benchmark against which the
GNN will be compared.
baseline <- fit_baseline(pd, tree300, splits = splits, model = "BM")
cat("Baseline object contains:\n")
#> Baseline object contains:
cat(" mu matrix:", nrow(baseline$mu), "x", ncol(baseline$mu), "\n")
#> mu matrix: 300 x 14
cat(" se matrix:", nrow(baseline$se), "x", ncol(baseline$se), "\n")
#> se matrix: 300 x 14Step 4: Build the phylogenetic graph
build_phylo_graph() computes the adjacency matrix and
Laplacian spectral features from the tree. The result can be cached to
disk for reuse.
graph <- build_phylo_graph(
tree300,
k_eigen = 8,
sigma_mult = 0.5
)
cat("Graph: n =", graph$n, "species\n")
#> Graph: n = 300 species
cat("Adjacency: [", nrow(graph$adj), "x", ncol(graph$adj), "]\n")
#> Adjacency: [ 300 x 300 ]
cat("Spectral coords: [", nrow(graph$coords), "x", ncol(graph$coords), "]\n")
#> Spectral coords: [ 300 x 8 ]
cat("Kernel bandwidth sigma:", round(graph$sigma, 3), "\n")
#> Kernel bandwidth sigma: 78.41For repeated runs, supply cache_path = "graph_cache.rds"
to skip recomputation.
Step 5: Train the model
The training loop corrupts a random 55% of observed cells each epoch
with a learnable mask token, then minimises the type-appropriate loss on
the corrupted cells plus a shrinkage penalty on
delta - baseline that pulls the GNN toward the baseline.
Early stopping is based on held-out validation RMSE.
fit <- fit_pigauto(
data = pd,
tree = tree300,
splits = splits,
graph = graph,
baseline = baseline,
hidden_dim = 64,
k_eigen = 8,
dropout = 0.10,
lr = 3e-3,
epochs = 2000,
corruption_rate = 0.55,
lambda_shrink = 0.03,
eval_every = 100,
patience = 10,
verbose = TRUE,
seed = 1
)Example console output during training:
Epoch 100 | train loss: 0.8431 | val RMSE: 0.8852
Epoch 200 | train loss: 0.7215 | val RMSE: 0.8601
...
Epoch 900 | train loss: 0.6108 | val RMSE: 0.8289
Early stopping at epoch 900 (no improvement for 10 evals).
print(fit)<pigauto_fit>
Species : 300
Traits : 4 -- Mass, Beak.Length_Culmen, Tarsus.Length, Wing.Length
Architecture: hidden_dim = 64, k_eigen = 8, dropout = 0.1
Val RMSE : 0.828 (z-score)
Test RMSE: 0.831 (z-score)
Step 6: Predict
predict.pigauto_fit() runs a single forward pass through
the trained model, optionally repeated with MC dropout to generate
multiple stochastic completions for uncertainty quantification. The
result is back-transformed to the original (non-log) scale.
pred <- predict(fit, return_se = TRUE)
# pred$imputed: 300 x 4 matrix in original units
# pred$se: 300 x 4 uncertainty matrix (original units)
head(pred$imputed)
# Conformal prediction intervals (95% marginal coverage guarantee)
pred$conformal_lower[["Mass"]] # lower bound per species (original units)
pred$conformal_upper[["Mass"]] # upper bound per species (original units)
pred$conformal_coverage # empirical coverage on val set (target ≈ 0.95)Expected output (values will vary by seed and convergence):
Mass Beak.Length_Culmen Tarsus.Length Wing.Length
t1 28.4150 14.2316 18.0923 136.421
t2 41.2031 18.5034 21.3441 154.832
...
Step 7: Evaluate and compare baselines
We compare the BM baseline to the GNN on the held-out test cells (both evaluated in z-score scale for comparability).
# BM baseline RMSE on test cells
eval_bm <- evaluate_imputation(baseline$mu, pd$X_scaled, splits)
eval_bm[eval_bm$split == "test", c("trait", "n", "rmse", "pearson_r")] trait n rmse pearson_r
Mass Mass 56 0.872 0.611
Beak.Length_Culmen Beak.Length_Culmen 56 0.901 0.582
Tarsus.Length Tarsus.Length 56 0.863 0.628
Wing.Length Wing.Length 56 0.884 0.595
# GNN test RMSE stored in fit object
data.frame(
trait = fit$trait_names,
bm_rmse = eval_bm$rmse[eval_bm$split == "test"],
gnn_rmse = fit$test_rmse
) trait bm_rmse gnn_rmse
1 Mass 0.872 0.831
2 Beak.Length_Culmen 0.901 0.856
3 Tarsus.Length 0.863 0.822
4 Wing.Length 0.884 0.841
In this example output, the GNN reduces test RMSE relative to the BM baseline. How large the improvement is depends on the trait, clade, missingness pattern, and validation split.
Step 8: Visualise
Training history
plot(fit, type = "history")This plots the validation RMSE over epochs, showing where early stopping triggered.
Uncertainty ribbons
plot(pred, type = "intervals", trait = "Mass")Species are sorted by predicted mass. The blue ribbon shows the
conformal prediction interval when conformal scores are available. Grey
points (if truth is supplied) show known values — a useful
visual check of calibration.
Multiple imputation for downstream inference
The single completed dataset from impute() ignores
imputation uncertainty — downstream standard errors will be too small.
Use multi_impute() to generate M stochastic complete
datasets, each a draw from the model’s calibrated uncertainty
distribution, then pool coefficients with Rubin’s rules.
# Step 1: generate M = 50 complete datasets
# draws_method = "conformal" draws from conformal-calibrated
# uncertainty — properly calibrated against held-out residuals.
mi <- multi_impute(traits, tree, m = 50L)
# Step 2: fit your downstream model to each imputed dataset
fits <- with_imputations(mi, function(d) {
lm(log(Wing.Length) ~ log(Mass) + Trophic.Level, data = d)
})
# Step 3: pool with Rubin's rules
pool_mi(fits)
# Returns a tidy data.frame with estimate, std.error, statistic, p.value,
# df (Barnard-Rubin degrees of freedom), fmi (fraction of missing information),
# and riv (relative increase in variance) per coefficient.Choose M ≥ 50 for stable pooled SEs; M ≥ 100 × max(fmi) for reliable
p-values. The fmi column directly tells you how much each
coefficient is affected by missing data — if fmi < 0.05,
imputation uncertainty is negligible for that term.
Phylogenetic tree uncertainty
A single fixed tree ignores phylogenetic uncertainty. pigauto splits this into two steps, matching Nakagawa & de Villemereuil (2019, Syst. Biol. 68:632–641):
Step 1 — imputation.
multi_impute_trees() fits pigauto on each of the T
posterior trees and returns T × M completed datasets. Every dataset is
conditional on a specific tree, recorded in
mi$tree_index:
data(trees300, package = "pigauto") # 50 posterior trees
mi <- multi_impute_trees(traits, trees = trees300, m_per_tree = 5L)
# 50 trees × 5 imputations = 250 completed datasetsStep 2 — analysis. For each completed dataset, fit the downstream comparative model using the same tree that produced that dataset, then pool all T × M fits with Rubin’s rules:
fits <- with_imputations(mi, function(dat, tree) {
dat$species <- rownames(dat)
nlme::gls(
log(Mass) ~ log(Wing.Length),
correlation = ape::corBrownian(
phy = tree, form = ~species),
data = dat, method = "ML"
)
})
pool_mi(fits)The pooled standard errors reflect trait-imputation uncertainty, phylogenetic tree uncertainty, and their interaction in one step.
Compute cost scales linearly with T (each tree
requires a fresh pigauto fit). At n = 300 species this is
~30 minutes for T = 50 trees; at n = 10,000 species it can
be 17+ hours. For large trees we recommend reducing T to 10–20 (the
fmi column in pool_mi() output typically
indicates convergence well before T = 50), or parallelising tree fits
across machines.
Active imputation: where to measure next
When you can measure k more species but not all of them, which ones
should you prioritise to maximally reduce imputation uncertainty across
the rest of your tree? suggest_next_observation() answers
this with a closed-form variance-reduction calculation built on the BM
conditional MVN.
res <- impute(traits, tree)
# Top-10 individual (species, trait) cells, ordered by expected
# total variance reduction:
suggest_next_observation(res, top_n = 10, by = "cell")
# Or aggregated by species (sum of variance reductions across the
# species' currently-missing continuous-family traits):
suggest_next_observation(res, top_n = 10, by = "species")The delta_var_total column is the closed-form expected
reduction in total predictive variance across all currently-missing
cells if you observed that candidate next:
where is the residual matrix at currently-missing cells, is the current relative leverage of cell , and is the REML BM variance. The formula is the standard active-learning variance-reduction bound (Cohn et al. 1996); applying it to phylogenetic trait imputation appears to be new.
Scope. suggest_next_observation() v2
(2026-05-01) supports all eight pigauto trait types:
-
Variance reduction for continuous-family traits
(
continuous,count,ordinal,proportion) and the magnitude column ofzi_count, computed by closed-form Sherman-Morrison rank-1 update on the BM phylogenetic-correlation matrix. -
Entropy reduction for
binary,categorical, and the gate column ofzi_count, computed by closed-form label-propagation posteriors. -
Multi-component variance reduction (per-component
BM, summed across the K CLR-z latent columns) for
multi_proportion.
For zi_count rows the output populates BOTH
delta_var_total (probability-weighted magnitude variance
reduction) AND delta_entropy_total (gate entropy
reduction); the metric column is set to
"variance" so rows sort on the magnitude scale. See
?suggest_next_observation for full details.
Variance vs entropy are not directly comparable —
the cross- metric ordering by delta_var_total is
approximate. When you want a strict ranking on a single metric, filter
by metric first or restrict types to
continuous-family-only or discrete-only.
Multi-obs (multiple observations per species) inputs are not supported and error with a clear message.
Optional advanced features
Phylogenetic-signal gating
impute() runs a per-trait Pagel’s-λ check before
training and routes traits with very weak phylogenetic signal to the
grand-mean corner of the safety-floor simplex (i.e., it
predicts the column mean for those traits). This protects against
over-fitting noise on traits where the phylogeny carries little
information.
It is on by default since v0.9.1.9003. The two arguments you can tune:
result <- impute(traits, tree,
phylo_signal_gate = TRUE, # default
phylo_signal_threshold = 0.2, # default; min lambda to keep BM/GNN
phylo_signal_method = "lambda")If a trait’s λ falls below phylo_signal_threshold, the
calibrated gate routes it to the grand-mean corner.
print(fit) reports which traits triggered the gate and
their λ values.
To disable (e.g., when comparing against an older snapshot or for benchmarking purposes):
result <- impute(traits, tree, phylo_signal_gate = FALSE)Assembling covariates from public data
Two helpers are bundled for the common case of building a covariate matrix from species-level environmental data:
# Step 1: GBIF occurrence centroids (median lat/lon across cleaned points)
gbif <- pull_gbif_centroids(species = rownames(my_traits),
cache_dir = "cache/gbif",
occurrence_limit = 500L)
#> data.frame with columns species, centroid_lat, centroid_lon, n_occurrences
# Step 2: WorldClim bioclim summaries (median + IQR of all 19 bio variables
# across each species' GBIF occurrences)
clim <- pull_worldclim_per_species(species = rownames(my_traits),
gbif_cache_dir = "cache/gbif",
worldclim_cache_dir = "cache/worldclim",
resolution = "10m")
#> data.frame with bio1..bio19 medians + IQRs + n_extracted per species
# Step 3: hand the whole climate frame to impute() as covariates
result <- impute(my_traits, my_tree, covariates = clim)Caching is on by default — both helpers write to the
cache_dir you supply and re-use existing extracts on
subsequent runs. GBIF requires no API key. WorldClim raster downloads
are ~50–500 MB depending on resolution and are cached in
worldclim_cache_dir. Both functions error with a clear
message if the optional dependencies (rgbif,
terra) are not installed.
For a worked end-to-end example with covariates, use this section
alongside ?impute, ?pull_gbif_centroids, and
?pull_worldclim_per_species. The old static covariate
walk-through is being refreshed and is not the current source of
truth.
Tips and next steps
Use GPU if available. pigauto automatically selects CUDA > MPS > CPU. Check availability with:
torch::cuda_is_available() # NVIDIA GPU
torch::backends_mps_is_available() # Apple Silicon GPUCache the phylogenetic graph. For repeated experiments, save time by writing the graph to disk:
graph <- build_phylo_graph(tree300, k_eigen = 8, cache_path = "tree300_graph.rds")Large trees (N > 2000).
build_phylo_graph() computes a dense N × N cophenetic
distance matrix and a Laplacian eigendecomposition. The
eigendecomposition is O(N³) in the worst case, but
k_eigen = "auto" caps the number of eigenvectors at 32 (the
default for all trees), which keeps computation tractable in practice —
pigauto has been tested up to 10,000 species on a standard laptop. The
main cost at very large N is memory for the N × N distance matrix (~800
MB at N = 10,000). If memory is tight, prune to a clade of interest or
watch for a future release that switches to a sparse Lanczos
eigensolver.
Supplying your own data. The workflow above works for any phylogenetic tree and trait matrix:
tree <- ape::read.tree("my_phylogeny.nwk")
traits <- read_traits("my_traits.csv", species_col = "species")
pd <- preprocess_traits(traits, tree)
# ...proceed as aboveTrait columns can be any of eight supported types. Five are
auto-detected from R column class: numeric (continuous),
integer (count), factor with 2 levels
(binary), factor with >2 levels (categorical), and
ordered (ordinal). Three require an explicit declaration
because they cannot be inferred from class alone: numeric
(0–1) with trait_types = "proportion" for bounded
proportions; integer with
trait_types = "zi_count" for zero-inflated counts; and K
numeric columns declared via the
multi_proportion_groups argument for compositional
(simplex) data whose rows sum to 1. pigauto reads column classes to
decide how each trait is encoded and imputed. The species names in
traits must overlap with tip labels in tree
(partial overlap is allowed; unmatched species are dropped with a
warning).
References
- Tobias, J.A. et al. (2022). AVONET: morphological, ecological and geographical data for all birds. Ecology Letters, 25, 581–597.
- Jetz, W. et al. (2012). The global diversity of birds in space and time. Nature, 491, 444–448.
- Goolsby, E.W., Bruggeman, J. & Ané, C. (2017). Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within-species variation. Methods in Ecology and Evolution, 8, 22–27. (pigauto’s internal BM baseline implements the same conditional-MVN approach.)
- Cohn, D.A., Ghahramani, Z. & Jordan, M.I. (1996). Active
learning with statistical models. Journal of Artificial Intelligence
Research, 4, 129–145. (Variance-reduction framework that
suggest_next_observation()adapts to the phylogenetic setting.)