Tree: ape::rtree(300)
Traits: 4 continuous per scenario
Models: BM, OU (α = 2), regime shift, nonlinear
Methods: mean · BM baseline · pigauto
Replicates: 5
Missingness: 25% MCAR (primary)
Commit c7082ba089
Run on 2026-05-15 13:42
Total wall: 16.2 min
Bottom line. Under pure Brownian motion the BM baseline is near-optimal (RMSE 0.468) and pigauto stays close to it (0.500). This is the expected behaviour: when the baseline is already the true model, the calibrated gate should stay near zero.
Across OU, regime shift, and nonlinear models, pigauto stays close to the BM baseline, with RMSE deltas of +8.4%, -2.1%, and -6.4% respectively. A positive delta means pigauto's RMSE is lower than the baseline's; a negative delta means it is higher. Read these as scenario-specific deltas, not a general dominance claim.
Primary sweep: evolutionary model comparison (25% missingness)
Average across 4 traits and 5 replicates. ★ marks the best method per scenario.
Model          RMSE (lower is better)          Pearson r (higher is better)
               Mean      BM        pigauto     Mean    BM        pigauto
BM             0.997     0.468 ★   0.500       –       0.876 ★   0.873
OU (α = 2)     1.021 ★   1.148     1.052       –       0.140 ★   0.030
Regime shift   0.988     0.363 ★   0.371       –       0.925 ★   0.922
Nonlinear      0.998     0.595 ★   0.633       –       0.805 ★   0.785
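The percentage deltas quoted in the bottom line can be recomputed from the RMSE columns of the table. The sketch below is plain arithmetic in Python, not code from the benchmark driver; the helper name `pct_reduction` and the dictionary layout are illustrative assumptions. It also computes the gap between mean imputation and the BM baseline. Because it starts from the rounded table values, the regime-shift delta comes out at -2.2% rather than the reported -2.1%, which was presumably computed from unrounded RMSEs.

```python
# RMSE values copied from the primary-sweep table (25% MCAR).
rmse = {
    "BM":           {"mean": 0.997, "bm": 0.468, "pigauto": 0.500},
    "OU (alpha=2)": {"mean": 1.021, "bm": 1.148, "pigauto": 1.052},
    "Regime shift": {"mean": 0.988, "bm": 0.363, "pigauto": 0.371},
    "Nonlinear":    {"mean": 0.998, "bm": 0.595, "pigauto": 0.633},
}


def pct_reduction(reference, candidate):
    """Percent RMSE reduction of `candidate` relative to `reference`.

    Positive means the candidate has the lower RMSE (hypothetical helper,
    chosen to match the sign convention of the reported deltas).
    """
    return 100.0 * (reference - candidate) / reference


deltas = {}
for scenario, row in rmse.items():
    deltas[scenario] = {
        # Gap between mean imputation and the BM baseline.
        "bm_vs_mean": pct_reduction(row["mean"], row["bm"]),
        # Incremental gap between the BM baseline and pigauto.
        "pigauto_vs_bm": pct_reduction(row["bm"], row["pigauto"]),
    }
    print(
        f"{scenario:13s}  BM vs mean: {deltas[scenario]['bm_vs_mean']:+6.1f}%"
        f"   pigauto vs BM: {deltas[scenario]['pigauto_vs_bm']:+5.1f}%"
    )
```

Under OU the first gap is negative: with these table values the BM baseline has a higher RMSE than mean imputation, which is why the ★ in that row sits in the Mean column.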
Secondary sweep: RMSE vs missingness (BM + OU)
How each method degrades as the held-out fraction increases. Average across traits and replicates.
What the benchmark shows
BM is hard to beat when BM is the truth. Under pure Brownian motion the Rphylopars baseline is the maximum-likelihood estimator. pigauto should be judged against that baseline, not against mean imputation alone.
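Non-BM scenarios are mixed. OU (stabilising selection), regime shifts (clade-specific optima), and nonlinear inter-trait relationships all violate BM’s assumptions, but the observed GNN contribution differs by scenario and trait. At 25% missingness, pigauto has a lower RMSE than the BM baseline under OU (1.052 vs 1.148) and a slightly higher one under regime shift (0.371 vs 0.363) and nonlinear (0.633 vs 0.595).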
Mean imputation is a proper null. The gap between mean imputation and the BM baseline quantifies the phylogenetic signal in the data. The gap between BM and pigauto is the incremental model contribution and can be positive, zero, or negative.
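Higher missingness degrades all methods. BM and pigauto generally remain well ahead of mean imputation in these sweeps, while their ordering is scenario- and trait-dependent. OU is the exception in the primary sweep: at 25% missingness mean imputation has the lowest RMSE (1.021, against 1.148 for BM and 1.052 for pigauto). With more data held out, calibration has more validation data and the GNN can sometimes provide a larger correction.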
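The evaluation behind these numbers (hide a fraction of cells completely at random, impute, score the hidden cells) can be sketched for the mean-imputation null. This is a minimal Python illustration under stated assumptions, not the benchmark's code: the trait values are toy Gaussian draws with no phylogeny, and `mcar_mask`, `rmse`, and `pearson_r` are hypothetical helper names. It also suggests a likely reason the Pearson r column shows "–" for mean imputation: a constant prediction has zero variance within a trait, so the correlation is undefined.

```python
import math
import random


def mcar_mask(n, frac, rng):
    """Indices of round(frac * n) cells chosen uniformly at random (MCAR)."""
    return sorted(rng.sample(range(n), round(frac * n)))


def rmse(truth, pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def pearson_r(truth, pred):
    """Pearson correlation, or None when either input is constant."""
    if max(truth) == min(truth) or max(pred) == min(pred):
        return None  # zero variance: correlation is undefined
    n = len(truth)
    mt, mp = sum(truth) / n, sum(pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(truth, pred))
    st = sum((t - mt) ** 2 for t in truth)
    sp = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(st * sp)


rng = random.Random(0)
trait = [rng.gauss(0.0, 1.0) for _ in range(300)]  # toy values for 300 tips
held_out = mcar_mask(len(trait), 0.25, rng)        # 25% of cells hidden
hidden = set(held_out)
observed = [x for i, x in enumerate(trait) if i not in hidden]

fill = sum(observed) / len(observed)               # mean of the observed cells
truth = [trait[i] for i in held_out]
pred = [fill] * len(truth)                         # same value for every hidden cell

print("held-out cells:", len(held_out))
print("mean-imputation RMSE:", round(rmse(truth, pred), 3))
print("mean-imputation Pearson r:", pearson_r(truth, pred))
```

For a standardised trait the mean-imputation RMSE sits near 1, in line with the 0.99 to 1.02 range in the Mean column of the primary sweep.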
Reproducibility
Driver: script/bench_continuous.R. Tree: ape::rtree(300) with per-cell seeds rep × 100 + scenario_index. Traits: simulate_bm_traits() or simulate_non_bm() (4 traits per scenario). Training: 500 epochs with early stopping. To reproduce: Rscript script/bench_continuous.R, then Rscript script/make_bench_continuous_html.R.
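The per-cell seed rule rep × 100 + scenario_index gives every (replicate, scenario) cell its own seed as long as the scenario index stays below 100. The Python sketch below enumerates the grid for 5 replicates and 4 scenarios. It is an illustration only: the 1-based indexing (R's convention) and the scenario order are assumptions, and the real values are whatever script/bench_continuous.R computes.

```python
N_REPS = 5
# Assumed scenario order; the driver script defines the actual order.
SCENARIOS = ["BM", "OU", "regime_shift", "nonlinear"]


def cell_seed(rep, scenario_index):
    """Seed for one (replicate, scenario) cell: rep * 100 + scenario_index."""
    return rep * 100 + scenario_index


# Assumes 1-based replicate and scenario indices.
seeds = {
    (rep, name): cell_seed(rep, idx)
    for rep in range(1, N_REPS + 1)
    for idx, name in enumerate(SCENARIOS, start=1)
}

# No two cells share a seed, so each cell draws an independent random stream.
assert len(set(seeds.values())) == len(seeds)

for (rep, name), seed in sorted(seeds.items(), key=lambda kv: kv[1]):
    print(f"rep {rep}  {name:13s} seed {seed}")
```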