Categorical-trait benchmark: K-level sweep

Tree: ape::rtree(300) · Traits: 2 categorical per scenario · Categories: 3 – 12 · Methods: mode, phylo label propagation, pigauto · Replicates: 5 · Missingness: 25% MCAR · Commit 1ac34b11a9 · Run on 2026-05-11 10:54 · Total wall: 15.2 min

Bottom line. With K = 3 categories, phylo label propagation reaches 74.6% accuracy and pigauto reaches 69.0% (−5.6 pp). Fewer categories mean a higher label-propagation baseline, leaving less room for the GNN to add anything.

With K = 12 categories, the label-propagation baseline drops to 54.0% and pigauto matches it at 54.0% (+0.0 pp). More categories make the classification task harder, but pigauto does not pull ahead in this run: it ties label propagation at K = 5 and K = 12 and trails it at K = 3 and K = 8, so the two methods should be compared scenario by scenario.

Primary sweep: accuracy by number of categories (25% missingness)

Accuracy averaged over 2 traits and 5 replicates. ★ marks the best method per scenario.

K         Mode    LP      pigauto
K = 3     0.518   0.746   0.690
K = 5     0.359   0.636   0.636
K = 8     0.284   0.574   0.559
K = 12    0.232   0.540   0.540
[Figure: Accuracy by number of categories. x-axis: number of categories (K = 3, 5, 8, 12); y-axis: accuracy; series: mode imputation, phylo label propagation, pigauto (LP + GNN).]
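The per-scenario comparison can be worked out directly from the primary-sweep accuracies above. The following is a minimal sketch in base R, not part of the benchmark scripts; the data frame and its column names (`lp`, `gap_pp`, `lp_vs_mode_pp`) are chosen here for illustration only.

```r
# Accuracy values copied from the primary sweep (25% missingness).
acc <- data.frame(
  K       = c(3, 5, 8, 12),
  mode    = c(0.518, 0.359, 0.284, 0.232),
  lp      = c(0.746, 0.636, 0.574, 0.540),
  pigauto = c(0.690, 0.636, 0.559, 0.540)
)

# pigauto minus label propagation, in percentage points.
acc$gap_pp <- round(100 * (acc$pigauto - acc$lp), 1)

# Label propagation minus mode imputation, in percentage points.
acc$lp_vs_mode_pp <- round(100 * (acc$lp - acc$mode), 1)

print(acc[, c("K", "gap_pp", "lp_vs_mode_pp")])
```

For these numbers the pigauto gap is -5.6, 0.0, -1.5 and 0.0 pp for K = 3, 5, 8 and 12, while label propagation's margin over mode imputation grows from 22.8 pp at K = 3 to 30.8 pp at K = 12.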

Secondary sweep: signal strength (K = 5)

Accuracy by phylogenetic signal at fixed K = 5 categories.

Signal    Mode    LP      pigauto
0.3       0.390   0.518   0.491
0.6       0.314   0.639   0.639
1.0       0.347   0.695   0.681

[Figure: Accuracy by phylogenetic signal (K = 5). x-axis: phylogenetic signal (0.3, 0.6, 1.0); y-axis: accuracy; series: mode imputation, phylo label propagation, pigauto (LP + GNN).]

What the benchmark shows

Reproducibility

Driver: script/bench_categorical.R. Tree: ape::rtree(300). Traits: simulate_categorical_traits(). Training: up to 500 epochs, with early stopping. To reproduce, run Rscript script/bench_categorical.R, then Rscript script/make_bench_categorical_html.R.
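To make the evaluation setup concrete, here is a rough sketch of 25% MCAR masking and the mode-imputation baseline in base R. It is not the code in script/bench_categorical.R: the trait is a hypothetical stand-in drawn independently per tip (no tree, no phylogenetic signal), the seed is arbitrary, and masking exactly 25% of tips and scoring only on masked tips are assumptions about how the benchmark is set up.

```r
set.seed(1)  # arbitrary seed, for a repeatable illustration only

n_tips <- 300   # same number of tips as ape::rtree(300)
k      <- 5     # number of categories
p_miss <- 0.25  # MCAR missingness rate

# Hypothetical stand-in trait: categories drawn independently per tip.
# This does NOT reproduce simulate_categorical_traits().
truth <- factor(sample(seq_len(k), n_tips, replace = TRUE),
                levels = seq_len(k))

# MCAR mask: tips are hidden at random, independent of their trait value.
# Assumption: exactly 25% of tips are masked.
masked <- sample(n_tips, size = round(p_miss * n_tips))
observed <- truth
observed[masked] <- NA

# Mode imputation: predict the most frequent observed category
# for every masked tip (table() ignores NA by default).
mode_level <- names(which.max(table(observed)))
pred <- factor(rep(mode_level, length(masked)), levels = levels(truth))

# Accuracy is scored on the masked tips only.
accuracy <- mean(pred == truth[masked])
print(accuracy)
```

With an independently drawn trait like this one, mode imputation should land near 1/K; the phylogenetically structured traits used in the benchmark are what give label propagation and pigauto something to exploit.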