Comparative datasets increasingly contain multiple data points per species measured under different experimental or environmental conditions: critical thermal maximum (CTmax) at different acclimation temperatures, metabolic rate at different body temperatures, performance at different substrate concentrations. Missing data is ubiquitous in these datasets, and different species may be missing measurements at different condition levels.
The challenge: can imputation methods use observation-level covariates (the experimental condition under which each measurement was taken) to produce covariate-conditional predictions? A species measured at 20°C acclimation should receive a different CTmax imputation than the same species measured at 30°C. Standard phylogenetic imputation methods that operate at the species level cannot make this distinction.
The data-generating process simulates a thermal physiology scenario:
CTmax_ij = mu_i + beta * acclim_temp_j + epsilon_ij
where:
mu_i ~ phylogenetic BM (species-level intercept)
acclim_temp ~ experimental condition (observation-level covariate)
beta = within-species slope (swept: 0, 0.5, 1.0, 1.5)
epsilon_ij ~ N(0, sigma^2) residual noise
Each species has multiple observations at different acclimation temperatures. The parameter beta controls the strength of the within-species covariate effect. When beta = 0, there is no within-species variation driven by the covariate; when beta > 0, the covariate contains information that should improve imputation.
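The data-generating process above can be sketched in a few lines of Python. This is an illustration only, not the benchmark driver: the balanced binary tree, the 15 to 35°C temperature range, the default `sigma`, and the function names `simulate_bm_tips` and `simulate_ctmax` are all assumptions made for the example.

```python
import math
import random


def simulate_bm_tips(n_tips, total_var=1.0, rng=None):
    """Brownian motion down a balanced binary tree (assumed topology).

    Each child value is its parent's value plus a Gaussian increment,
    so closely related tips share most of their evolutionary path.
    """
    if rng is None:
        rng = random.Random(0)
    n_levels = max(1, math.ceil(math.log2(n_tips)))
    branch_sd = math.sqrt(total_var / n_levels)
    values = [0.0]  # root state
    for _ in range(n_levels):
        values = [parent + rng.gauss(0.0, branch_sd)
                  for parent in values for _child in range(2)]
    return values[:n_tips]


def simulate_ctmax(n_species=200, n_obs=5, beta=0.5, sigma=1.0, seed=1):
    """Return rows of (species_index, acclim_temp, ctmax)."""
    rng = random.Random(seed)
    mu = simulate_bm_tips(n_species, rng=rng)  # species-level intercepts
    rows = []
    for i in range(n_species):
        for _ in range(n_obs):
            acclim_temp = rng.uniform(15.0, 35.0)  # assumed range, in deg C
            eps = rng.gauss(0.0, sigma)            # residual noise
            rows.append((i, acclim_temp, mu[i] + beta * acclim_temp + eps))
    return rows


rows = simulate_ctmax(beta=1.0)
print(len(rows))  # 200 species x 5 observations = 1000
```

With `beta=0.0` and `sigma=0.0` every observation of a species collapses to its intercept `mu_i`, which is the case where the covariate carries no information.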
| Method | Description |
|---|---|
| Mean imputation | Column mean of observed values. Ignores phylogeny and covariates. |
| pigauto (no covariates) | Phylogenetic BM baseline + GNN correction. Uses the tree and cross-trait correlations but no observation-level covariates. |
| pigauto + obs-level covariates | Same architecture, but the acclimation temperature is supplied as an observation-level covariate. The refinement MLP can learn within-species adjustments. |
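RMSE by scenario and method (lower is better). The two pigauto columns match the RMSE columns of the ratio table further down.
| Scenario | Mean imputation | pigauto (no covariates) | pigauto + obs-level covariates |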
|---|---|---|---|
| lambda=0.5 beta=0.0 miss=50% | 1.123 | 0.891 | 0.890 |
| lambda=0.9 beta=0.0 miss=50% | 1.382 | 1.090 | 1.101 |
| lambda=0.5 beta=0.5 miss=50% | 2.927 | 2.967 | 2.226 |
| lambda=0.9 beta=0.5 miss=50% | 3.060 | 3.006 | 2.320 |
| lambda=0.5 beta=1.0 miss=50% | 5.107 | 6.214 | 4.003 |
| lambda=0.9 beta=1.0 miss=50% | 5.537 | 6.164 | 4.220 |
| lambda=0.5 beta=0.0 miss=80% | 1.413 | 1.081 | 1.072 |
| lambda=0.9 beta=0.0 miss=80% | 1.975 | 1.552 | 1.557 |
| lambda=0.5 beta=0.5 miss=80% | 2.925 | 3.044 | 2.292 |
| lambda=0.9 beta=0.5 miss=80% | 3.063 | 3.223 | 2.495 |
| lambda=0.5 beta=1.0 miss=80% | 5.228 | 5.524 | 3.984 |
| lambda=0.9 beta=1.0 miss=80% | 5.452 | 5.666 | 4.111 |
| Scenario | Mean imputation | pigauto (no covariates) | pigauto + obs-level covariates |
|---|---|---|---|
| lambda=0.5 beta=0.0 miss=50% | 0.3781 | 0.6754 | 0.6744 |
| lambda=0.9 beta=0.0 miss=50% | 0.3523 | 0.6846 | 0.6775 |
| lambda=0.5 beta=0.5 miss=50% | 0.1267 | 0.2903 | 0.6553 |
| lambda=0.9 beta=0.5 miss=50% | 0.1570 | 0.3471 | 0.6508 |
| lambda=0.5 beta=1.0 miss=50% | 0.0729 | 0.0247 | 0.6414 |
| lambda=0.9 beta=1.0 miss=50% | 0.0644 | 0.0814 | 0.6547 |
| lambda=0.5 beta=0.0 miss=80% | 0.2043 | 0.7041 | 0.7057 |
| lambda=0.9 beta=0.0 miss=80% | 0.2048 | 0.6496 | 0.6480 |
| lambda=0.5 beta=0.5 miss=80% | 0.0599 | 0.1836 | 0.6308 |
| lambda=0.9 beta=0.5 miss=80% | 0.0676 | 0.2254 | 0.5747 |
| lambda=0.5 beta=1.0 miss=80% | -0.0051 | 0.1100 | 0.6522 |
| lambda=0.9 beta=1.0 miss=80% | 0.0448 | 0.1397 | 0.6808 |
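RMSE ratio of pigauto + covariates relative to pigauto (no covariates). Values < 1 indicate that observation-level covariates help. Lift is the corresponding percentage reduction in RMSE, (1 - ratio) x 100, so positive lift means the covariates improved accuracy.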
| Scenario | RMSE (no cov) | RMSE (+ cov) | Ratio | Lift |
|---|---|---|---|---|
| lambda=0.5 beta=0.0 miss=50% | 0.891 | 0.890 | 0.999 | +0.1% |
| lambda=0.9 beta=0.0 miss=50% | 1.090 | 1.101 | 1.010 | -1.0% |
| lambda=0.5 beta=0.5 miss=50% | 2.967 | 2.226 | 0.750 | +25.0% |
| lambda=0.9 beta=0.5 miss=50% | 3.006 | 2.320 | 0.772 | +22.8% |
| lambda=0.5 beta=1.0 miss=50% | 6.214 | 4.003 | 0.644 | +35.6% |
| lambda=0.9 beta=1.0 miss=50% | 6.164 | 4.220 | 0.685 | +31.5% |
| lambda=0.5 beta=0.0 miss=80% | 1.081 | 1.072 | 0.992 | +0.8% |
| lambda=0.9 beta=0.0 miss=80% | 1.552 | 1.557 | 1.003 | -0.3% |
| lambda=0.5 beta=0.5 miss=80% | 3.044 | 2.292 | 0.753 | +24.7% |
| lambda=0.9 beta=0.5 miss=80% | 3.223 | 2.495 | 0.774 | +22.6% |
| lambda=0.5 beta=1.0 miss=80% | 5.524 | 3.984 | 0.721 | +27.9% |
| lambda=0.9 beta=1.0 miss=80% | 5.666 | 4.111 | 0.726 | +27.4% |
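The Ratio and Lift columns follow directly from the two RMSE columns. A minimal sketch of that arithmetic, using the lambda=0.5 beta=0.5 miss=50% row from the table (the function name `rmse_ratio_and_lift` is hypothetical):

```python
def rmse_ratio_and_lift(rmse_no_cov, rmse_cov):
    """Ratio < 1 (positive lift) means the covariates reduced RMSE."""
    ratio = rmse_cov / rmse_no_cov
    lift_pct = (1.0 - ratio) * 100.0
    return ratio, lift_pct


ratio, lift = rmse_ratio_and_lift(2.967, 2.226)
print(f"ratio={ratio:.3f} lift={lift:+.1f}%")  # ratio=0.750 lift=+25.0%
```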
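Standard phylogenetic imputation operates at the species level: one prediction per species per trait. To handle multiple observations per species, pigauto uses a two-stage architecture. The first stage aggregates observations to the species level (scatter_mean), performs phylogenetic message passing on the species graph, then broadcasts the species-level representation back to observations (index_select). This captures inter-species structure from the phylogenetic tree. The second stage concatenates the observation-level covariates to the broadcast representation and passes the result through a refinement MLP, which produces the observation-level output delta_obs and can learn within-species adjustments.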
# Schematic of the multi-obs + covariate pipeline
#
#  observations        species level           observations
#  (n_obs x p)         (n_species x d)         (n_obs x p)
#       |                    |                      |
#  scatter_mean ---> GNN message passing ---> broadcast
#                                                   |
#                                         concat obs covariates
#                                                   |
#                                            refinement MLP
#                                                   |
#                                               delta_obs
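The aggregate, broadcast, and concatenate steps in the schematic can be illustrated with plain Python lists. This is a sketch under stated assumptions, not pigauto's implementation: the GNN message passing and the refinement MLP are omitted, and the helper names `scatter_mean`, `broadcast`, and `concat_covariates` are written here for illustration, mirroring the scatter_mean and index_select operations named in the text.

```python
def scatter_mean(obs_rows, species_idx, n_species):
    """Average observation rows within each species (n_obs x p -> n_species x p)."""
    p = len(obs_rows[0])
    sums = [[0.0] * p for _ in range(n_species)]
    counts = [0] * n_species
    for row, s in zip(obs_rows, species_idx):
        counts[s] += 1
        for k, x in enumerate(row):
            sums[s][k] += x
    # Species with no observations get zeros (an assumption of this sketch).
    return [[v / c if c else 0.0 for v in srow]
            for srow, c in zip(sums, counts)]


def broadcast(species_rows, species_idx):
    """Copy each species row back to its observations (index_select-style)."""
    return [list(species_rows[s]) for s in species_idx]


def concat_covariates(obs_repr, covariates):
    """Append observation-level covariates to each broadcast row."""
    return [list(r) + list(c) for r, c in zip(obs_repr, covariates)]


# Three CTmax observations: two for species 0, one for species 1.
obs = [[40.0], [42.0], [38.0]]
species_idx = [0, 0, 1]
acclim = [[20.0], [30.0], [20.0]]

species_means = scatter_mean(obs, species_idx, n_species=2)
# (the species-level GNN message passing would run here; omitted)
per_obs = broadcast(species_means, species_idx)
mlp_input = concat_covariates(per_obs, acclim)
print(mlp_input)  # [[41.0, 20.0], [41.0, 30.0], [38.0, 20.0]]
```

After the broadcast, both observations of species 0 carry the same representation. Only the appended covariate distinguishes them, which is why the refinement step needs it to make covariate-conditional predictions.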
The final prediction is still the gated blend:
pred = (1 - r_cal) * baseline + r_cal * delta_obs
where baseline is the phylogenetic BM prediction (species-level, broadcast to observations) and delta_obs is the observation-level GNN output that incorporates both phylogenetic structure and covariate information.
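A minimal sketch of the blend formula, applied element-wise to a handful of observations. The function name `gated_blend` and the example values are hypothetical, and treating r_cal as a single scalar gate between 0 and 1 is an assumption of this sketch.

```python
def gated_blend(baseline, delta_obs, r_cal):
    """pred = (1 - r_cal) * baseline + r_cal * delta_obs, per observation."""
    return [(1.0 - r_cal) * b + r_cal * d
            for b, d in zip(baseline, delta_obs)]


# Two observations of one species: the baseline is identical (species-level),
# while delta_obs differs because it has seen the acclimation temperature.
baseline = [40.0, 40.0]
delta_obs = [39.0, 42.0]

print(gated_blend(baseline, delta_obs, r_cal=0.0))  # [40.0, 40.0]
print(gated_blend(baseline, delta_obs, r_cal=0.5))  # [39.5, 41.0]
```

At r_cal = 0 the prediction falls back to the species-level baseline, so any within-species difference in the final prediction has to come through delta_obs.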
Driver: script/bench_multi_obs.R.
Tree: 200 species.
Observations per species: 5.
Training: 200 epochs.
To reproduce:
Rscript script/bench_multi_obs.R
Rscript script/make_bench_multi_obs_html.R