Continuous-trait benchmark: BM, OU, regime shift, nonlinear

Tree: ape::rtree(300) · Traits: 4 continuous per scenario · Models: BM, OU (α = 2), regime shift, nonlinear · Methods: mean · BM baseline · pigauto · Replicates: 5 · Missingness: 25% MCAR (primary) · Commit c7082ba089 · Run on 2026-05-15 13:42 · Total wall: 16.2 min

Bottom line. Under pure Brownian motion the BM baseline is near-optimal (RMSE 0.468) and pigauto stays close to it (0.500, about 7% higher). This is the expected behavior: the calibrated gate should stay near zero when the baseline is already the true model.

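Across the OU, regime-shift, and nonlinear scenarios, pigauto stays close to the BM baseline, with relative RMSE changes of +8.4%, -2.1%, and -6.4% respectively (positive means pigauto has lower RMSE than the BM baseline). Under OU, mean imputation (1.021) beats both the BM baseline (1.148) and pigauto (1.052). Read these as scenario-specific deltas, not a general dominance claim.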
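The deltas quoted above can be checked against the RMSE values in the primary-sweep table below. This is a minimal Python sketch, not part of the R driver. It assumes the delta is the percent RMSE reduction relative to the BM baseline, which is the convention that matches the quoted signs. The helper name `relative_improvement` is hypothetical.

```python
# RMSE values copied from the primary-sweep table (25% missingness).
rmse = {
    "BM":           {"bm": 0.468, "pigauto": 0.500},
    "OU (α = 2)":   {"bm": 1.148, "pigauto": 1.052},
    "Regime shift": {"bm": 0.363, "pigauto": 0.371},
    "Nonlinear":    {"bm": 0.595, "pigauto": 0.633},
}


def relative_improvement(baseline, candidate):
    """Percent RMSE reduction of candidate vs. baseline (positive = candidate better).

    Assumed sign convention; the report does not define it explicitly.
    """
    return 100.0 * (baseline - candidate) / baseline


deltas = {
    name: round(relative_improvement(v["bm"], v["pigauto"]), 1)
    for name, v in rmse.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

From the three-decimal table values this gives +8.4% for OU and -6.4% for nonlinear, matching the text, and -2.2% for regime shift. The quoted -2.1% was presumably computed from unrounded RMSEs.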

Primary sweep: evolutionary model comparison (25% missingness)

Averaged across 4 traits and 5 replicates. ★ marks the best method per scenario and metric.

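RMSE: lower is better. Pearson r: higher is better.

| Model | Mean RMSE | BM RMSE | pigauto RMSE | Mean r | BM r | pigauto r |
|---|---|---|---|---|---|---|
| BM | 0.997 | 0.468 ★ | 0.500 | n/a | 0.876 ★ | 0.873 |
| OU (α = 2) | 1.021 ★ | 1.148 | 1.052 | n/a | 0.140 ★ | 0.030 |
| Regime shift | 0.988 | 0.363 ★ | 0.371 | n/a | 0.925 ★ | 0.922 |
| Nonlinear | 0.998 | 0.595 ★ | 0.633 | n/a | 0.805 ★ | 0.785 |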
[Figure: Average RMSE by evolutionary model. x-axis: evolutionary model (BM, OU (α = 2), regime shift, nonlinear); y-axis: RMSE (latent z-score); series: mean imputation, BM baseline, pigauto (BM + GNN). Values as in the table above.]

[Figure: Average Pearson r by evolutionary model. x-axis: evolutionary model; y-axis: Pearson r; series: mean imputation, BM baseline, pigauto (BM + GNN). Values as in the table above.]
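The page does not spell out how the two metrics are computed. The sketch below is one plausible reading, in Python rather than the R of the driver: both metrics are taken over held-out cells on the z-score scale. The function names and toy numbers are made up for illustration. It also shows why a constant prediction such as mean imputation has no defined Pearson r, which would be consistent with only two r values being reported per scenario.

```python
import math


def rmse(truth, pred):
    """Root-mean-square error over paired held-out values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def pearson_r(truth, pred):
    """Pearson correlation; returns None when either side has zero variance."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_t == 0 or var_p == 0:
        return None  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_t * var_p)


# Toy held-out values on the z-score scale (illustrative, not benchmark data).
truth = [-1.2, 0.3, 0.8, -0.4, 1.5]
mean_pred = [0.0] * len(truth)            # mean imputation, assuming trait mean 0
model_pred = [-0.9, 0.1, 1.0, -0.2, 1.1]  # a hypothetical phylogenetic imputer

print("mean :", rmse(truth, mean_pred), pearson_r(truth, mean_pred))
print("model:", rmse(truth, model_pred), pearson_r(truth, model_pred))
```

On a z-scored trait, mean imputation should land near RMSE 1, which is roughly what the Mean column shows in every scenario.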

Secondary sweep: RMSE vs missingness (BM + OU)

Shows how each method's RMSE degrades as the held-out fraction increases (15%, 30%, 50%), averaged across traits and replicates.

[Figure, two panels (BM; OU (α = 2)): RMSE (latent z-score) against missingness (15%, 30%, 50%); series: mean imputation, BM baseline, pigauto (BM + GNN).]
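For readers unfamiliar with the setup, MCAR (missing completely at random) means every trait cell is equally likely to be held out, independent of its value or its position in the tree. The Python sketch below is an illustrative assumption, not the driver's code. It hides an exact count of cells, whereas the R script may draw per-cell Bernoulli masks or sample per trait. The name `mcar_mask` is hypothetical.

```python
import random


def mcar_mask(n_species, n_traits, fraction, seed):
    """Return the set of (species, trait) cells to hold out, chosen uniformly at random."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_species) for j in range(n_traits)]
    n_hidden = round(fraction * len(cells))
    return set(rng.sample(cells, n_hidden))


# 300 tips and 4 traits, as in the benchmark header; the seed is arbitrary.
for fraction in (0.15, 0.30, 0.50):
    hidden = mcar_mask(300, 4, fraction, seed=1)
    print(f"{fraction:.0%} missingness -> {len(hidden)} of 1200 cells held out")
```

Each method is then scored only on the hidden cells, so a higher fraction both removes training signal and enlarges the evaluation set.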

What the benchmark shows

Reproducibility

Driver: script/bench_continuous.R. Tree: ape::rtree(300) with per-cell seeds rep × 100 + scenario_index. Traits: simulate_bm_traits() or simulate_non_bm() (4 traits per scenario). Training: 500 epochs with early stopping. To reproduce: Rscript script/bench_continuous.R, then Rscript script/make_bench_continuous_html.R.
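The seed rule above can be illustrated with a short Python sketch. It assumes 1-based replicate and scenario indices (the usual R convention) and an arbitrary scenario order. Neither is confirmed by this page, so the concrete seed values are illustrative only.

```python
# Assumed scenario order and 1-based indexing; check the driver for the real ones.
scenarios = ["BM", "OU", "regime_shift", "nonlinear"]
n_reps = 5


def cell_seed(rep, scenario_index):
    """Seed for one (replicate, scenario) cell: rep × 100 + scenario_index."""
    return rep * 100 + scenario_index


seeds = {
    (rep, name): cell_seed(rep, idx)
    for rep in range(1, n_reps + 1)
    for idx, name in enumerate(scenarios, start=1)
}

for (rep, name), seed in seeds.items():
    print(f"rep {rep}, {name}: seed {seed}")
```

With fewer than 100 scenarios the rule cannot collide across replicates, so each of the 20 primary-sweep cells gets its own seed.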