
Build a phylogenetic graph representation from a tree
Source:R/build_phylo_graph.R
build_phylo_graph.RdComputes a symmetric-normalised Gaussian kernel adjacency matrix and
Laplacian spectral node features from the cophenetic distance matrix.
Results are optionally cached to an .rds file.
Arguments
- tree
object of class
"phylo".- k_eigen
integer or
"auto". Number of Laplacian eigenvectors to use as node features. When"auto"(default), scales with tree size:min(max(ceiling(n/20), 4), 32), giving 4 for very small trees, 8 for 100-160 tips, 15 for 300 tips, and 32 for 640+ tips.- sigma_mult
numeric. Bandwidth multiplier (call it \(s\)): \(\sigma = \mathrm{median}(D) \times s\) (default
0.5).- cache_path
character or
NULL. Path to an.rdscache file. If the file exists and dimensions match, it is loaded instead of recomputing.NULLdisables caching (default).
Value
A list with:
- adj
Numeric matrix (n x n). Symmetric-normalised adjacency.
- coords
Numeric matrix (n x k_eigen). Spectral node features.
- n
Integer. Number of tips.
- sigma
Numeric. Bandwidth used.
- D
Numeric matrix (n x n). Cophenetic (patristic) distances between tips, row/column-ordered by
tree$tip.label. Returned so downstream functions can reuse it instead of callingape::cophenetic.phylo()again.- D_sq
Numeric matrix (n x n). Element-wise squared cophenetic distances (
D * D). Used byGraphTransformerBlockfor the learnable per-head Gaussian bandwidth (B2 rate-aware attention). Not freed duringimpute()training — onlyDis freed.- R_phy
Numeric matrix (n x n). Phylogenetic correlation matrix
cov2cor(ape::vcv(tree))(diagonal = 1). Used byfit_baselinefor BM conditional imputation.
Details
The graph is built in three steps: (1) cophenetic distances between all
pairs of tips; (2) Gaussian kernel \(A_{ij} = \exp(-d_{ij}^2 /
(2\sigma^2))\) with \(\sigma = \mathrm{median}(D) \times s\) (where
\(s\) is the user-supplied sigma_mult); (3) symmetric normalisation
\(\tilde{A} = D^{-1/2} A D^{-1/2}\) with self-loops added before
normalisation.
Spectral node features are the k_eigen smallest non-zero
eigenvectors of the unnormalised Laplacian, which encode the broad
cluster structure of the phylogeny.
Scaling
For trees with more than 7500 tips, build_phylo_graph() uses
RSpectra's sparse Lanczos eigensolver (eigs_sym) instead
of the dense base::eigen(). This reduces the spectral step
from \(O(n^3)\) (hours at \(n = 10{,}000\)) to \(O(n \cdot k
\cdot \mathrm{iters})\) (seconds to a minute). The threshold is
conservative because the dense eigensolver is competitive on small
and mid-size trees, and on pathological ultrametric simulations
(ape::rcoal) the Laplacian spectrum has tight degenerate
clusters that slow down Lanczos until the tree is fairly large. On
realistic asymmetric phylogenies with variable branch lengths
(e.g. the AVONET BirdTree Stage2 Hackett tree) the sparse path wins
much earlier, and at \(n = 9{,}993\) it delivers a ~10x speedup
over dense. The dense path is kept as the default for smaller trees
and as a fallback when RSpectra is not installed, or if
Lanczos fails to converge. The cophenetic distance matrix is
computed once per call and reused for both the adjacency and the
spectral features; it is also returned in the result so that
downstream consumers (notably fit_baseline) can skip
recomputation.