Getting started with pigauto: Phylogenetic Imputation via Graph AUTO-encoders • pigauto

What pigauto does

Many comparative datasets have missing trait values: species were not measured, specimens were unavailable, or data were lost. pigauto fills in those gaps using three sources of information:

The phylogenetic tree. Closely related species tend to share similar traits, so known values of relatives are informative about missing ones.
Cross-trait correlations. If body mass predicts beak length, then observed masses help impute missing beak lengths even when the species itself was never measured for that trait.
Environmental covariates (optional). Climate, habitat, or geographic variables can improve imputation when trait variation has a strong environmental component beyond phylogenetic signal.

The package handles continuous measurements, counts, binary variables, ordered categories, unordered categories, bounded proportions, zero-inflated counts, and compositional multi-proportion data — all in a single call to impute().

How it works (briefly)

pigauto combines a phylogenetic baseline with a graph neural network (GNN) correction. The baseline uses Brownian-motion conditional imputation for continuous-family latent columns and phylogenetic label-propagation or threshold/liability candidates for discrete-family columns. The GNN learns additional patterns from tree topology and cross-trait structure. A per-trait gate controls how much the GNN contributes: when the baseline is already good enough, the gate closes and the GNN stays out of the way. This is verified on held-out data after training.

Technical note. The internal class name ResidualPhyloDAE refers to the ResNet-style skip connections inside the GNN layers, not to a statistical residual y - baseline.

The phylogenetic structure is encoded as:

A symmetric-normalised Gaussian-kernel adjacency matrix derived from cophenetic (patristic) distances.
k Laplacian eigenvectors as node positional features — these encode phylogenetic clusters at multiple scales.

Installation

pak::pak("itchyshin/pigauto")

# torch backend (required; ~1 GB first-time download)
torch::install_torch()

library(pigauto)
library(ape)

Quick start

For most users, a single function call is all you need:

result <- impute(traits, tree)

result$completed    # data.frame: observed values preserved, NAs filled
result$imputed_mask # logical matrix: TRUE where a cell was imputed
result$prediction$se # per-cell uncertainty (original units)

impute() runs the full pipeline internally — preprocessing, baseline fitting, graph construction, GNN training, calibration, and prediction. The rest of this vignette walks through each step individually for users who need fine-grained control, want to cache intermediate objects, or are benchmarking components.

The AVONET bird data

pigauto ships with avonet300, a 300-species subset of the AVONET v3 bird morphology database (Tobias et al. 2022, Ecology Letters), and tree300, a matching pruned Hackett MCC phylogeny from BirdTree.org.

data(avonet300, tree300, package = "pigauto")

head(avonet300)
#>              Species_Key    Mass Beak.Length_Culmen Tarsus.Length Wing.Length
#> 1       Eudromia_formosa   772.9               32.5          49.5       223.9
#> 2 Nothoprocta_pentlandii   300.7               27.9          37.0       148.2
#> 3         Rhea_americana 23000.0               86.5         308.0       604.5
#> 4   Ptilopachus_petrosus   193.0               16.9          28.0       117.2
#> 5     Meleagris_ocellata  5525.0               39.8         123.0       362.5
#> 6   Dendragapus_obscurus  1047.3               26.4          38.4       226.5
#>   Trophic.Level Primary.Lifestyle Migration
#> 1     Herbivore       Terrestrial  Resident
#> 2     Herbivore       Terrestrial  Resident
#> 3      Omnivore       Terrestrial  Resident
#> 4     Herbivore       Terrestrial  Resident
#> 5     Herbivore       Terrestrial  Resident
#> 6     Herbivore        Generalist  Resident
ape::Ntip(tree300)
#> [1] 300

The four morphological traits — body mass, beak length, tarsus length, and wing length — are all log-normally distributed and highly heritable, making them ideal test cases for phylogenetic imputation.

Step 1: Preprocess

preprocess_traits() log-transforms and z-scores the traits, and reorders rows to match the phylogeny’s tip order (required by the graph operations).

traits <- avonet300[, -1]              # drop Species_Key column
rownames(traits) <- avonet300$Species_Key

pd <- preprocess_traits(traits, tree300, log_transform = TRUE)
print(pd)
#> pigauto_data
#>   Species: 300 
#>   Traits:  7 
#>   Types:   categorical=2, continuous=4, ordinal=1 
#>   Latent columns: 14 
#>   Missing values: 0 %

The object pd is a pigauto_data list containing:

X_scaled: the z-scored, log-transformed trait matrix (300 × 4)
means, sds: per-trait normalisation parameters (for back-transformation)
log_transform: TRUE, recorded so predict() can invert it automatically

Step 2: Create evaluation splits

We designate 25% of all matrix cells as “missing” for evaluation. These cells are not used during training or baseline fitting; they provide an honest estimate of imputation error.

splits <- make_missing_splits(pd$X_scaled, missing_frac = 0.25, seed = 42)

cat("Total held-out cells:", length(splits$val_idx) + length(splits$test_idx), "\n")
#> Total held-out cells: 1049
cat("  Validation:", length(splits$val_idx), "\n")
#>   Validation: 262
cat("  Test:      ", length(splits$test_idx), "\n")
#>   Test:       787

The missing cells are split further: validation cells guide early stopping, test cells provide the final unbiased performance estimate.

Step 3: Fit the BM baseline

fit_baseline() fits the phylogenetic BM baseline after masking the held-out cells. This gives the benchmark against which the GNN will be compared.

baseline <- fit_baseline(pd, tree300, splits = splits, model = "BM")

cat("Baseline object contains:\n")
#> Baseline object contains:
cat("  mu matrix:", nrow(baseline$mu), "x", ncol(baseline$mu), "\n")
#>   mu matrix: 300 x 14
cat("  se matrix:", nrow(baseline$se), "x", ncol(baseline$se), "\n")
#>   se matrix: 300 x 14

Step 4: Build the phylogenetic graph

build_phylo_graph() computes the adjacency matrix and Laplacian spectral features from the tree. The result can be cached to disk for reuse.

graph <- build_phylo_graph(
  tree300,
  k_eigen    = 8,
  sigma_mult = 0.5
)

cat("Graph: n =", graph$n, "species\n")
#> Graph: n = 300 species
cat("Adjacency: [", nrow(graph$adj), "x", ncol(graph$adj), "]\n")
#> Adjacency: [ 300 x 300 ]
cat("Spectral coords: [", nrow(graph$coords), "x", ncol(graph$coords), "]\n")
#> Spectral coords: [ 300 x 8 ]
cat("Kernel bandwidth sigma:", round(graph$sigma, 3), "\n")
#> Kernel bandwidth sigma: 78.41

For repeated runs, supply cache_path = "graph_cache.rds" to skip recomputation.

Step 5: Train the model

The training loop corrupts a random 55% of observed cells each epoch with a learnable mask token, then minimises the type-appropriate loss on the corrupted cells plus a shrinkage penalty on delta - baseline that pulls the GNN toward the baseline. Early stopping is based on held-out validation RMSE.

fit <- fit_pigauto(
  data            = pd,
  tree            = tree300,
  splits          = splits,
  graph           = graph,
  baseline        = baseline,
  hidden_dim      = 64,
  k_eigen         = 8,
  dropout         = 0.10,
  lr              = 3e-3,
  epochs          = 2000,
  corruption_rate = 0.55,
  lambda_shrink   = 0.03,
  eval_every      = 100,
  patience        = 10,
  verbose         = TRUE,
  seed            = 1
)

Example console output during training:

Epoch  100 | train loss: 0.8431 | val RMSE: 0.8852
Epoch  200 | train loss: 0.7215 | val RMSE: 0.8601
...
Epoch  900 | train loss: 0.6108 | val RMSE: 0.8289
Early stopping at epoch 900 (no improvement for 10 evals).

print(fit)

<pigauto_fit>
  Species  : 300
  Traits   : 4 -- Mass, Beak.Length_Culmen, Tarsus.Length, Wing.Length
  Architecture: hidden_dim = 64, k_eigen = 8, dropout = 0.1
  Val RMSE : 0.828 (z-score)
  Test RMSE: 0.831 (z-score)

Step 6: Predict

predict.pigauto_fit() runs a single forward pass through the trained model, optionally repeated with MC dropout to generate multiple stochastic completions for uncertainty quantification. The result is back-transformed to the original (non-log) scale.

pred <- predict(fit, return_se = TRUE)

# pred$imputed: 300 x 4 matrix in original units
# pred$se:      300 x 4 uncertainty matrix (original units)
head(pred$imputed)

# Conformal prediction intervals (95% marginal coverage guarantee)
pred$conformal_lower[["Mass"]]    # lower bound per species (original units)
pred$conformal_upper[["Mass"]]    # upper bound per species (original units)
pred$conformal_coverage           # empirical coverage on val set (target ≈ 0.95)

Expected output (values will vary by seed and convergence):

       Mass Beak.Length_Culmen Tarsus.Length Wing.Length
t1  28.4150           14.2316       18.0923     136.421
t2  41.2031           18.5034       21.3441     154.832
...

Step 7: Evaluate and compare baselines

We compare the BM baseline to the GNN on the held-out test cells (both evaluated in z-score scale for comparability).

# BM baseline RMSE on test cells
eval_bm <- evaluate_imputation(baseline$mu, pd$X_scaled, splits)
eval_bm[eval_bm$split == "test", c("trait", "n", "rmse", "pearson_r")]

               trait  n  rmse pearson_r
Mass            Mass 56 0.872     0.611
Beak.Length_Culmen Beak.Length_Culmen 56 0.901     0.582
Tarsus.Length Tarsus.Length 56 0.863     0.628
Wing.Length Wing.Length 56 0.884     0.595

# GNN test RMSE stored in fit object
data.frame(
  trait    = fit$trait_names,
  bm_rmse  = eval_bm$rmse[eval_bm$split == "test"],
  gnn_rmse = fit$test_rmse
)

               trait bm_rmse gnn_rmse
1               Mass   0.872    0.831
2 Beak.Length_Culmen   0.901    0.856
3      Tarsus.Length   0.863    0.822
4        Wing.Length   0.884    0.841

In this example output, the GNN reduces test RMSE relative to the BM baseline. How large the improvement is depends on the trait, clade, missingness pattern, and validation split.

Step 8: Visualise

Training history

plot(fit, type = "history")

This plots the validation RMSE over epochs, showing where early stopping triggered.

Uncertainty ribbons

plot(pred, type = "intervals", trait = "Mass")

Species are sorted by predicted mass. The blue ribbon shows the conformal prediction interval when conformal scores are available. Grey points (if truth is supplied) show known values — a useful visual check of calibration.

Multiple imputation for downstream inference

The single completed dataset from impute() ignores imputation uncertainty — downstream standard errors will be too small. Use multi_impute() to generate M stochastic complete datasets, each a draw from the model’s calibrated uncertainty distribution, then pool coefficients with Rubin’s rules.

# Step 1: generate M = 50 complete datasets
# draws_method = "conformal" draws from conformal-calibrated
# uncertainty — properly calibrated against held-out residuals.
mi <- multi_impute(traits, tree, m = 50L)

# Step 2: fit your downstream model to each imputed dataset
fits <- with_imputations(mi, function(d) {
  lm(log(Wing.Length) ~ log(Mass) + Trophic.Level, data = d)
})

# Step 3: pool with Rubin's rules
pool_mi(fits)
# Returns a tidy data.frame with estimate, std.error, statistic, p.value,
# df (Barnard-Rubin degrees of freedom), fmi (fraction of missing information),
# and riv (relative increase in variance) per coefficient.

Choose M ≥ 50 for stable pooled SEs; M ≥ 100 × max(fmi) for reliable p-values. The fmi column directly tells you how much each coefficient is affected by missing data — if fmi < 0.05, imputation uncertainty is negligible for that term.

Phylogenetic tree uncertainty

A single fixed tree ignores phylogenetic uncertainty. pigauto splits this into two steps, matching Nakagawa & de Villemereuil (2019, Syst. Biol. 68:632–641):

Step 1 — imputation. multi_impute_trees() fits pigauto on each of the T posterior trees and returns T × M completed datasets. Every dataset is conditional on a specific tree, recorded in mi$tree_index:

data(trees300, package = "pigauto")   # 50 posterior trees

mi <- multi_impute_trees(traits, trees = trees300, m_per_tree = 5L)
# 50 trees × 5 imputations = 250 completed datasets

Step 2 — analysis. For each completed dataset, fit the downstream comparative model using the same tree that produced that dataset, then pool all T × M fits with Rubin’s rules:

fits <- with_imputations(mi, function(dat, tree) {
  dat$species <- rownames(dat)
  nlme::gls(
    log(Mass) ~ log(Wing.Length),
    correlation = ape::corBrownian(
      phy = tree, form = ~species),
    data = dat, method = "ML"
  )
})
pool_mi(fits)

The pooled standard errors reflect trait-imputation uncertainty, phylogenetic tree uncertainty, and their interaction in one step.

Compute cost scales linearly with T (each tree requires a fresh pigauto fit). At n = 300 species this is ~30 minutes for T = 50 trees; at n = 10,000 species it can be 17+ hours. For large trees we recommend reducing T to 10–20 (the fmi column in pool_mi() output typically indicates convergence well before T = 50), or parallelising tree fits across machines.

Active imputation: where to measure next

When you can measure k more species but not all of them, which ones should you prioritise to maximally reduce imputation uncertainty across the rest of your tree? suggest_next_observation() answers this with a closed-form variance-reduction calculation built on the BM conditional MVN.

res <- impute(traits, tree)

# Top-10 individual (species, trait) cells, ordered by expected
# total variance reduction:
suggest_next_observation(res, top_n = 10, by = "cell")

# Or aggregated by species (sum of variance reductions across the
# species' currently-missing continuous-family traits):
suggest_next_observation(res, top_n = 10, by = "species")

The delta_var_total column is the closed-form expected reduction in total predictive variance across all currently-missing cells if you observed that candidate next:

$\Delta V(s_{\text{new}}) = \sigma_t^2 \sum_{i \in \mathrm{miss}_t} \frac{D_{ik}^2}{\alpha_k}$

where $D = R_{mm} - R_{mo} R_{oo}^{-1} R_{om}$ is the residual matrix at currently-missing cells, $\alpha_k = D_{kk}$ is the current relative leverage of cell $k$ , and $\sigma_t^2$ is the REML BM variance. The formula is the standard active-learning variance-reduction bound (Cohn et al. 1996); applying it to phylogenetic trait imputation appears to be new.

Scope. suggest_next_observation() v2 (2026-05-01) supports all eight pigauto trait types:

Variance reduction for continuous-family traits (continuous, count, ordinal, proportion) and the magnitude column of zi_count, computed by closed-form Sherman-Morrison rank-1 update on the BM phylogenetic-correlation matrix.
Entropy reduction for binary, categorical, and the gate column of zi_count, computed by closed-form label-propagation posteriors.
Multi-component variance reduction (per-component BM, summed across the K CLR-z latent columns) for multi_proportion.

For zi_count rows the output populates BOTH delta_var_total (probability-weighted magnitude variance reduction) AND delta_entropy_total (gate entropy reduction); the metric column is set to "variance" so rows sort on the magnitude scale. See ?suggest_next_observation for full details.

Variance vs entropy are not directly comparable — the cross- metric ordering by delta_var_total is approximate. When you want a strict ranking on a single metric, filter by metric first or restrict types to continuous-family-only or discrete-only.

Multi-obs (multiple observations per species) inputs are not supported and error with a clear message.

Optional advanced features

Phylogenetic-signal gating

impute() runs a per-trait Pagel’s-λ check before training and routes traits with very weak phylogenetic signal to the grand-mean corner of the safety-floor simplex (i.e., it predicts the column mean for those traits). This protects against over-fitting noise on traits where the phylogeny carries little information.

It is on by default since v0.9.1.9003. The two arguments you can tune:

result <- impute(traits, tree,
                 phylo_signal_gate     = TRUE,   # default
                 phylo_signal_threshold = 0.2,   # default; min lambda to keep BM/GNN
                 phylo_signal_method   = "lambda")

If a trait’s λ falls below phylo_signal_threshold, the calibrated gate routes it to the grand-mean corner. print(fit) reports which traits triggered the gate and their λ values.

To disable (e.g., when comparing against an older snapshot or for benchmarking purposes):

result <- impute(traits, tree, phylo_signal_gate = FALSE)

Assembling covariates from public data

Two helpers are bundled for the common case of building a covariate matrix from species-level environmental data:

# Step 1: GBIF occurrence centroids (median lat/lon across cleaned points)
gbif <- pull_gbif_centroids(species          = rownames(my_traits),
                            cache_dir        = "cache/gbif",
                            occurrence_limit = 500L)
#> data.frame with columns species, centroid_lat, centroid_lon, n_occurrences

# Step 2: WorldClim bioclim summaries (median + IQR of all 19 bio variables
# across each species' GBIF occurrences)
clim <- pull_worldclim_per_species(species             = rownames(my_traits),
                                    gbif_cache_dir      = "cache/gbif",
                                    worldclim_cache_dir = "cache/worldclim",
                                    resolution          = "10m")
#> data.frame with bio1..bio19 medians + IQRs + n_extracted per species

# Step 3: hand the whole climate frame to impute() as covariates
result <- impute(my_traits, my_tree, covariates = clim)

Caching is on by default — both helpers write to the cache_dir you supply and re-use existing extracts on subsequent runs. GBIF requires no API key. WorldClim raster downloads are ~50–500 MB depending on resolution and are cached in worldclim_cache_dir. Both functions error with a clear message if the optional dependencies (rgbif, terra) are not installed.

For a worked end-to-end example with covariates, use this section alongside ?impute, ?pull_gbif_centroids, and ?pull_worldclim_per_species. The old static covariate walk-through is being refreshed and is not the current source of truth.

Tips and next steps

Use GPU if available. pigauto automatically selects CUDA > MPS > CPU. Check availability with:

torch::cuda_is_available()       # NVIDIA GPU
torch::backends_mps_is_available() # Apple Silicon GPU

Cache the phylogenetic graph. For repeated experiments, save time by writing the graph to disk:

graph <- build_phylo_graph(tree300, k_eigen = 8, cache_path = "tree300_graph.rds")

Large trees (N > 2000). build_phylo_graph() computes a dense N × N cophenetic distance matrix and a Laplacian eigendecomposition. The eigendecomposition is O(N³) in the worst case, but k_eigen = "auto" caps the number of eigenvectors at 32 (the default for all trees), which keeps computation tractable in practice — pigauto has been tested up to 10,000 species on a standard laptop. The main cost at very large N is memory for the N × N distance matrix (~800 MB at N = 10,000). If memory is tight, prune to a clade of interest or watch for a future release that switches to a sparse Lanczos eigensolver.

Supplying your own data. The workflow above works for any phylogenetic tree and trait matrix:

tree   <- ape::read.tree("my_phylogeny.nwk")
traits <- read_traits("my_traits.csv", species_col = "species")
pd     <- preprocess_traits(traits, tree)
# ...proceed as above

Trait columns can be any of eight supported types. Five are auto-detected from R column class: numeric (continuous), integer (count), factor with 2 levels (binary), factor with >2 levels (categorical), and ordered (ordinal). Three require an explicit declaration because they cannot be inferred from class alone: numeric (0–1) with trait_types = "proportion" for bounded proportions; integer with trait_types = "zi_count" for zero-inflated counts; and K numeric columns declared via the multi_proportion_groups argument for compositional (simplex) data whose rows sum to 1. pigauto reads column classes to decide how each trait is encoded and imputed. The species names in traits must overlap with tip labels in tree (partial overlap is allowed; unmatched species are dropped with a warning).

References

Tobias, J.A. et al. (2022). AVONET: morphological, ecological and geographical data for all birds. Ecology Letters, 25, 581–597.
Jetz, W. et al. (2012). The global diversity of birds in space and time. Nature, 491, 444–448.
Goolsby, E.W., Bruggeman, J. & Ané, C. (2017). Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within-species variation. Methods in Ecology and Evolution, 8, 22–27. (pigauto’s internal BM baseline implements the same conditional-MVN approach.)
Cohn, D.A., Ghahramani, Z. & Jordan, M.I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 129–145. (Variance-reduction framework that suggest_next_observation() adapts to the phylogenetic setting.)