Continuous-trait benchmark: BM, OU, regime shift, nonlinear

Tree: ape::rtree(300) · Traits: 4 continuous per scenario · Models: BM, OU (α = 2), regime shift, nonlinear · Methods: mean · BM baseline · pigauto · Replicates: 5 · Missingness: 25% MCAR (primary) · Commit 794537121b · Report generated 2026-05-30 12:04 · Total wall: 135.5 min

Bottom line. Under pure Brownian motion the BM baseline is near-optimal (RMSE 0.469) and pigauto stays close to it (0.478). This is the expected behaviour: when the baseline is already the true model, the calibrated gate should stay near zero.

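Across the OU, regime-shift, and nonlinear models, pigauto stays close to the BM baseline, with RMSE improvements over the baseline of +7.8%, -1.6%, and -6.9% respectively (positive means pigauto has the lower RMSE, so it is ahead under OU and behind in the other two scenarios). Read these as scenario-specific deltas, not a general dominance claim.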
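The gate expectation stated for the BM scenario can be illustrated with a minimal sketch. This is not pigauto's implementation: it assumes the final prediction has the form baseline + gate × correction and that the gate is picked by grid search on validation cells. All variable names and noise levels are hypothetical. When the correction carries no signal, any positive gate only adds noise, so the selected gate lands at or near zero.

```r
# Hypothetical gated blend: prediction = baseline + gate * correction.
# Assumption: the gate is chosen by grid search on validation cells.
set.seed(42)
n_val <- 500

truth      <- rnorm(n_val)                   # held-out validation values
baseline   <- truth + rnorm(n_val, sd = 0.3) # baseline is already close to the truth
correction <- rnorm(n_val, sd = 0.6)         # correction with no real signal

gates <- seq(0, 1, by = 0.05)

# Validation RMSE of the blended prediction for each candidate gate
val_rmse <- sapply(gates, function(g) {
  sqrt(mean((baseline + g * correction - truth)^2))
})

best_gate <- gates[which.min(val_rmse)]
print(best_gate)
```

With a correction that does carry signal, the same search would move the gate away from zero, which is why the gate value has to be measured rather than assumed.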

Primary sweep: evolutionary model comparison (25% missingness)

Averaged across 4 traits and 5 replicates. ★ marks the best method per scenario. – marks cells where no Pearson r is reported (mean imputation).

RMSE: lower is better. Pearson r: higher is better.
Model	RMSE (Mean)	RMSE (BM)	RMSE (pigauto)	Pearson r (Mean)	Pearson r (BM)	Pearson r (pigauto)
BM	0.997	0.469 ★	0.478	–	0.876 ★	0.869
OU (α = 2)	1.021 ★	1.123	1.035	–	0.158 ★	0.030
Regime shift	0.988	0.360 ★	0.366	–	0.926 ★	0.923
Nonlinear	0.998	0.608 ★	0.650	–	0.796 ★	0.772
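
The percentage deltas quoted in the bottom line can be recomputed from the table. The sketch below uses the rounded RMSE values shown above and the convention that a positive number means pigauto has the lower RMSE. The vector names are illustrative and are not taken from the benchmark scripts.

```r
# RMSE values copied from the primary-sweep table (rounded to 3 decimals)
rmse_bm      <- c(BM = 0.469, OU = 1.123, regime_shift = 0.360, nonlinear = 0.608)
rmse_pigauto <- c(BM = 0.478, OU = 1.035, regime_shift = 0.366, nonlinear = 0.650)

# Percentage improvement of pigauto over the BM baseline;
# positive = pigauto has the lower RMSE
improvement_pct <- round(100 * (rmse_bm - rmse_pigauto) / rmse_bm, 1)
print(improvement_pct)
```

From the rounded table values this gives -1.9 for BM, 7.8 for OU, -1.7 for regime shift, and -6.9 for nonlinear. The regime-shift figure differs from the reported -1.6% in the last digit, presumably because the report was computed from unrounded RMSEs.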

Secondary sweep: RMSE vs missingness (BM + OU)

How each method degrades as the held-out fraction increases. Averaged across traits and replicates.

What the benchmark shows

BM is hard to beat when BM is the truth. Under pure Brownian motion the Rphylopars baseline is the maximum-likelihood estimator. pigauto should be judged against that baseline, not against mean imputation alone.
Non-BM scenarios are mixed. OU (stabilising selection), regime shifts (clade-specific optima), and nonlinear inter-trait relationships all violate BM’s assumptions, but the observed GNN contribution differs by scenario and trait.
Mean imputation is a proper null. The gap between mean imputation and the BM baseline quantifies the phylogenetic signal in the data. The gap between BM and pigauto is the incremental model contribution and can be positive, zero, or negative.
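Higher missingness degrades all methods. Under BM, the BM baseline and pigauto remain well ahead of mean imputation, while their ordering relative to each other is scenario- and trait-dependent; under OU (α = 2) the primary table shows mean imputation with the lowest RMSE at 25% missingness, so that lead does not hold everywhere. With more data held out, calibration has more validation data, but whether the GNN correction helps still has to be read from the measured benchmark cells rather than assumed.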
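
The role of mean imputation as a null can be sketched in base R. This is an illustration, not the benchmark driver: it uses i.i.d. standard-normal stand-in traits instead of traits simulated on a tree, and all names are hypothetical. It shows two things: on unit-variance traits the mean-imputation RMSE sits near 1, as in the table, and the predictions for a trait are a single constant, so a per-trait Pearson r has zero variance on one side and is undefined.

```r
# Stand-in data: 300 tips x 4 traits, i.i.d. N(0, 1).
# Assumption: no phylogeny here, unlike the benchmark's simulated traits.
set.seed(1)
n_tips   <- 300
n_traits <- 4
X <- matrix(rnorm(n_tips * n_traits), n_tips, n_traits)

# 25% MCAR mask: every cell is held out independently with probability 0.25
mask <- matrix(runif(n_tips * n_traits) < 0.25, n_tips, n_traits)
X_obs <- X
X_obs[mask] <- NA

# Mean imputation: fill each held-out cell with the observed mean of its trait
trait_means <- colMeans(X_obs, na.rm = TRUE)
X_hat <- X_obs
for (j in seq_len(n_traits)) {
  X_hat[mask[, j], j] <- trait_means[j]
}

# RMSE on held-out cells only
rmse_mean <- sqrt(mean((X_hat[mask] - X[mask])^2))

# Predictions for one trait are a single repeated value, so their variance
# is zero and a Pearson correlation with the truth cannot be computed
n_distinct_pred <- length(unique(X_hat[mask[, 1], 1]))

print(rmse_mean)
print(n_distinct_pred)
```

Any method that uses the tree should push RMSE below this level when the traits carry phylogenetic signal; how far below is what the Mean and BM columns measure.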

Reproducibility

Driver: script/bench_continuous.R. Tree: ape::rtree(300) with per-cell seeds rep × 100 + scenario_index. Traits: simulate_bm_traits() or simulate_non_bm() (4 traits per scenario). Training: 500 epochs with early stopping. To reproduce: Rscript script/bench_continuous.R, then Rscript script/make_bench_continuous_html.R.
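
The per-cell seed rule rep × 100 + scenario_index can be sketched as below. The scenario ordering and the 1-based index are assumptions made for illustration; the driver script is the authority on the actual ordering.

```r
# Assumed scenario order with 1-based indices (hypothetical; check the driver)
scenarios <- c("BM", "OU", "regime_shift", "nonlinear")

# One benchmark cell per (replicate, scenario) pair
cells <- expand.grid(rep = 1:5, scenario_index = seq_along(scenarios))
cells$scenario <- scenarios[cells$scenario_index]

# Seed rule from the report: rep * 100 + scenario_index
cells$seed <- cells$rep * 100 + cells$scenario_index

print(cells[order(cells$seed), c("rep", "scenario", "seed")])
```

Because the scenario index stays below 100, every (replicate, scenario) pair gets a distinct seed, so a single cell can be rerun in isolation with the same random stream.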

Source: script/bench_continuous.R · Results: script/bench_continuous.rds · Report: pkgdown/assets/dev/bench_continuous.html