Bottom line. With K = 3 categories, phylogenetic label propagation (LP) achieves 74.6% accuracy and pigauto achieves 69.0% (-5.6 pp). Fewer categories mean higher baseline accuracy, leaving less room for the GNN to improve on it.
With K = 12 categories, label propagation accuracy drops to 52.9% and pigauto matches it at 52.9% (+0.0 pp). More categories make the classification task harder; in this run pigauto should be read against label propagation scenario by scenario.
Primary sweep: accuracy by number of categories (25% missingness)
Accuracy averaged across 2 traits and 5 replicates. ★ marks the best method per scenario; where LP and pigauto tie to three decimals, the ★ is shown on LP.
K         Mode     LP         pigauto
K = 3     0.518    0.746 ★    0.690
K = 5     0.359    0.639 ★    0.639
K = 8     0.284    0.571 ★    0.571
K = 12    0.232    0.529 ★    0.529
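The gaps quoted in the bottom line can be recomputed from the table. The sketch below copies the four rows into a data frame and derives the uniform chance level (1/K) and the differences in percentage points. The object name `results` and its column names are illustrative choices, not taken from the benchmark scripts.

```r
# Accuracies copied from the primary sweep table (25% missingness).
results <- data.frame(
  K       = c(3, 5, 8, 12),
  mode    = c(0.518, 0.359, 0.284, 0.232),
  lp      = c(0.746, 0.639, 0.571, 0.529),
  pigauto = c(0.690, 0.639, 0.571, 0.529)
)

# Uniform chance level for K equally likely categories.
results$chance <- 1 / results$K

# Gaps in percentage points (pp), rounded to one decimal.
results$lp_vs_mode_pp    <- round(100 * (results$lp - results$mode), 1)
results$pigauto_vs_lp_pp <- round(100 * (results$pigauto - results$lp), 1)

print(results)
```

With these numbers, LP leads the Mode column by 22.8, 28.0, 28.7, and 29.7 pp as K grows, while pigauto is at -5.6 pp relative to LP for K = 3 and at 0.0 pp for the other three rows.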
Secondary sweep: signal strength (K = 5)
Accuracy by phylogenetic signal at fixed K = 5 categories.
What the benchmark shows
Phylogenetic label propagation is a strong baseline for categorical traits. It uses phylogenetic distance to weight neighbours and predict the most likely category. With strong phylogenetic signal in this simulator, it can be difficult for the GNN to improve on.
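To make the idea of distance-weighted neighbours concrete, here is a minimal sketch in base R. It is not pigauto's implementation: the function name `propagate_labels`, the exponential kernel, and the `bandwidth` parameter are all assumptions made for illustration. The distance matrix is a toy one; for a real tree it could come from ape::cophenetic.phylo().

```r
# Hypothetical sketch of distance-weighted label propagation.
# D: pairwise distance matrix between tips; labels: character vector with NA
# for tips whose category is missing.
propagate_labels <- function(D, labels, bandwidth = 1) {
  observed <- !is.na(labels)
  classes  <- sort(unique(labels[observed]))
  # Exponential kernel (an assumed choice): closer relatives get larger weights.
  W <- exp(-D / bandwidth)
  diag(W) <- 0  # a tip does not vote for itself
  # One-hot indicator matrix of the observed labels (rows = observed tips).
  Y <- outer(labels[observed], classes, "==") * 1
  # Weighted votes from observed tips for every tip, then row-normalise.
  votes <- W[, observed, drop = FALSE] %*% Y
  probs <- votes / rowSums(votes)
  colnames(probs) <- classes
  pred <- classes[max.col(probs, ties.method = "first")]
  # Keep observed labels; fill in only the missing ones.
  pred[observed] <- labels[observed]
  list(probs = probs, labels = pred)
}

# Toy example: two tight clades (tips 1-3 and 4-6) that are far from each other.
D <- matrix(4, 6, 6)
D[1:3, 1:3] <- 1
D[4:6, 4:6] <- 1
diag(D) <- 0
labels <- c("a", "a", NA, "b", NA, "b")

fit <- propagate_labels(D, labels)
fit$labels
```

In this toy case the missing tip in each clade receives its clade's label, because the within-clade weights (exp(-1)) dominate the between-clade weights (exp(-4)). This is the situation in which strong phylogenetic signal leaves little for a more flexible model to add.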
More categories make the task harder for all methods. As K increases, the chance level drops (1/K) and each category has fewer training examples. The method gap should be read from the table, because pigauto can tie, trail, or slightly improve on label propagation; in the primary sweep it ties at K = 5, 8, and 12 and trails by 5.6 pp at K = 3.
pigauto often matches label propagation but can trail it. The calibrated gate limits unsupported GNN contribution, but this run still has scenarios where pure label propagation is better (K = 3 in the primary sweep).
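One generic way a gate of this kind could work is a convex blend of baseline and GNN class probabilities, with the blend weight chosen on held-out data. The sketch below illustrates that idea only; it is not taken from pigauto's source, and `blend`, `accuracy`, `calibrate_gate`, and the grid of candidate gates are hypothetical names and choices.

```r
# Hypothetical gated blend of two probability matrices
# (rows = species, columns = categories). gate = 0 returns the baseline.
blend <- function(p_base, p_gnn, gate) (1 - gate) * p_base + gate * p_gnn

# Share of rows whose highest-probability column equals the true class index.
accuracy <- function(probs, truth) {
  mean(max.col(probs, ties.method = "first") == truth)
}

# Choose the gate on validation data. which.max() returns the first maximum,
# so ties resolve to the smallest gate and the GNN contributes only when it
# measurably helps.
calibrate_gate <- function(p_base, p_gnn, truth, grid = seq(0, 1, by = 0.1)) {
  acc <- vapply(grid, function(g) accuracy(blend(p_base, p_gnn, g), truth),
                numeric(1))
  grid[which.max(acc)]
}

truth <- c(1, 1, 2)
p_gnn <- rbind(c(0.6, 0.4), c(0.9, 0.1), c(0.3, 0.7))

# Case 1: the baseline is already correct everywhere, so the gate stays at 0.
p_base_good <- rbind(c(0.7, 0.3), c(0.6, 0.4), c(0.2, 0.8))
gate_good <- calibrate_gate(p_base_good, p_gnn, truth)

# Case 2: the baseline misclassifies row 2, so a small non-zero gate is chosen.
p_base_weak <- rbind(c(0.7, 0.3), c(0.45, 0.55), c(0.2, 0.8))
gate_weak <- calibrate_gate(p_base_weak, p_gnn, truth)

c(gate_good = gate_good, gate_weak = gate_weak)
```

Under this sketch, a gate that opens on validation data is not guaranteed to help on the evaluation data, which is one possible reading of rows where the gated model trails pure label propagation.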
Signal strength remains the dominant factor. Even with K = 12 categories, high phylogenetic signal yields good accuracy. Low signal makes the task difficult regardless of method.
Reproducibility
Driver: script/bench_categorical.R
Tree: ape::rtree(300) (random tree with 300 tips)
Traits: simulate_categorical_traits()
Training: 500 epochs with early stopping
To reproduce, run the following in order:
    Rscript script/bench_categorical.R
    Rscript script/make_bench_categorical_html.R