Skip to contents

Computes a symmetric-normalised Gaussian kernel adjacency matrix and Laplacian spectral node features from the cophenetic distance matrix. Results are optionally cached to an .rds file.

Usage

build_phylo_graph(tree, k_eigen = "auto", sigma_mult = 0.5, cache_path = NULL)

Arguments

tree

object of class "phylo".

k_eigen

integer or "auto". Number of Laplacian eigenvectors to use as node features. When "auto" (default), scales with tree size: min(max(ceiling(n/20), 4), 32), giving 4 for very small trees, 8 for 100-160 tips, 15 for 300 tips, and 32 for 640+ tips.

sigma_mult

numeric. Bandwidth multiplier (call it \(s\)): \(\sigma = \mathrm{median}(D) \times s\) (default 0.5).

cache_path

character or NULL. Path to an .rds cache file. If the file exists and dimensions match, it is loaded instead of recomputing. NULL disables caching (default).

Value

A list with:

adj

Numeric matrix (n x n). Symmetric-normalised adjacency.

coords

Numeric matrix (n x k_eigen). Spectral node features.

n

Integer. Number of tips.

sigma

Numeric. Bandwidth used.

D

Numeric matrix (n x n). Cophenetic (patristic) distances between tips, row/column-ordered by tree$tip.label. Returned so downstream functions can reuse it instead of calling ape::cophenetic.phylo() again.

D_sq

Numeric matrix (n x n). Element-wise squared cophenetic distances (D * D). Used by GraphTransformerBlock for the learnable per-head Gaussian bandwidth (B2 rate-aware attention). Not freed during impute() training — only D is freed.

R_phy

Numeric matrix (n x n). Phylogenetic correlation matrix cov2cor(ape::vcv(tree)) (diagonal = 1). Used by fit_baseline for BM conditional imputation.

Details

The graph is built in three steps: (1) cophenetic distances between all pairs of tips; (2) Gaussian kernel \(A_{ij} = \exp(-d_{ij}^2 / (2\sigma^2))\) with \(\sigma = \mathrm{median}(D) \times s\) (where \(s\) is the user-supplied sigma_mult); (3) symmetric normalisation \(\tilde{A} = D^{-1/2} A D^{-1/2}\) with self-loops added before normalisation.

Spectral node features are the k_eigen smallest non-zero eigenvectors of the unnormalised Laplacian, which encode the broad cluster structure of the phylogeny.

Scaling

For trees with more than 7500 tips, build_phylo_graph() uses RSpectra's sparse Lanczos eigensolver (eigs_sym) instead of the dense base::eigen(). This reduces the spectral step from \(O(n^3)\) (hours at \(n = 10{,}000\)) to \(O(n \cdot k \cdot \mathrm{iters})\) (seconds to a minute). The threshold is conservative because the dense eigensolver is competitive on small and mid-size trees, and on pathological ultrametric simulations (ape::rcoal) the Laplacian spectrum has tight degenerate clusters that slow down Lanczos until the tree is fairly large. On realistic asymmetric phylogenies with variable branch lengths (e.g. the AVONET BirdTree Stage2 Hackett tree) the sparse path wins much earlier, and at \(n = 9{,}993\) it delivers a ~10x speedup over dense. The dense path is kept as the default for smaller trees and as a fallback when RSpectra is not installed, or if Lanczos fails to converge. The cophenetic distance matrix is computed once per call and reused for both the adjacency and the spectral features; it is also returned in the result so that downstream consumers (notably fit_baseline) can skip recomputation.

Examples

set.seed(1)
tree <- ape::rtree(30)
g <- build_phylo_graph(tree, k_eigen = 4)
dim(g$adj)    # 30 x 30
#> [1] 30 30
dim(g$coords) # 30 x 4
#> [1] 30  4
dim(g$D)      # 30 x 30
#> [1] 30 30