
Dispatches to pigauto's phylogenetic baseline machinery and returns imputed latent-scale means and standard errors for every species.

Usage

fit_baseline(
  data,
  tree,
  splits = NULL,
  model = "BM",
  graph = NULL,
  multi_obs_aggregation = c("hard", "soft"),
  em_iterations = 0L,
  em_tol = 0.001,
  em_offdiag = FALSE
)

Arguments

data

object of class "pigauto_data".

tree

object of class "phylo".

splits

list (output of make_missing_splits) or NULL.

model

character. Evolutionary model: "BM" (default) or "OU".

graph

optional list returned by build_phylo_graph. When supplied, graph$D (cophenetic distances) is reused for label propagation and graph$R_phy (phylogenetic correlation matrix) is reused for BM imputation, avoiding duplicate \(O(n^2)\) allocations. When NULL (default), both matrices are computed here.

multi_obs_aggregation

character. How to aggregate multiple observations per species before the Level-C (Rphylopars) baseline: "hard" (default) thresholds binary proportions at 0.5 and uses argmax for categorical, matching Phase 10 behaviour. "soft" preserves species-level proportions and dispatches the truncated-Gaussian soft E-step (estep_liability_binary_soft) so that intermediate class frequencies contribute fractional liability evidence. Only relevant for multi-obs data with binary or categorical traits when the Level-C joint baseline is active.
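The difference between the two aggregation modes can be sketched in base R. This is only an illustration on a made-up observation table: the column names, the strict `> 0.5` tie handling, and the use of `tapply`/`table` are assumptions, not pigauto's internal code.

```r
# Toy multi-observation data: several rows per species (hypothetical values).
obs <- data.frame(
  species   = c("sp1", "sp1", "sp1", "sp2", "sp2", "sp3"),
  migratory = c(1, 1, 0, 0, 0, 1),                          # binary trait
  diet      = c("seed", "seed", "insect", "insect", "insect", "seed")
)

# Species-level proportion of 1s for the binary trait.
prop <- tapply(obs$migratory, obs$species, mean)

# "soft": keep the proportions as they are.
soft <- prop

# "hard": threshold the proportions at 0.5 (tie handling here is an assumption).
hard <- setNames(as.integer(prop > 0.5), names(prop))

# Categorical trait: per-species class frequencies, then argmax for "hard".
freq     <- prop.table(table(obs$species, obs$diet), margin = 1)
hard_cat <- colnames(freq)[apply(freq, 1, which.max)]
```

Under "soft", a species observed as migratory in two of three records keeps the value 2/3 instead of being rounded to 1, which is what lets intermediate frequencies carry fractional evidence.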

em_iterations

integer. Number of Phase 6 EM iterations for the threshold-joint baseline (binary + ordinal + OVR categorical). Default 0L disables the EM loop and preserves v0.9.1 output byte-for-byte. When >= 1, the BM rate \(\Sigma\) learned by Rphylopars::phylopars() at iteration \(k\) is fed back as the per-trait prior SD at iteration \(k+1\), up to em_iterations times or until em_tol convergence. em_iterations = 1L is a degenerate single-pass run and produces the same baseline output as 0L; >= 2L is needed for actual iteration. Only affects the threshold-joint path (continuous-only traits pass through the existing joint MVN path unchanged).

em_tol

numeric. Relative-Frobenius convergence tolerance for the Phase 6 / 7 EM loop. The loop stops early when \(\lVert \Sigma_k - \Sigma_{k-1} \rVert_F / \lVert \Sigma_{k-1} \rVert_F < \) em_tol. Default 1e-3.
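The stopping rule can be written in a few lines of base R. This is a minimal sketch of the relative-Frobenius criterion only; the helper name `rel_frobenius_change` and the toy matrices are hypothetical and are not part of pigauto.

```r
# Relative Frobenius change between two successive rate-matrix estimates.
rel_frobenius_change <- function(sigma_new, sigma_old) {
  norm(sigma_new - sigma_old, type = "F") / norm(sigma_old, type = "F")
}

# Toy successive estimates (made-up numbers).
sigma_prev <- diag(c(1.0, 2.0))
sigma_next <- sigma_prev + matrix(1e-4, 2, 2)

em_tol    <- 1e-3
converged <- rel_frobenius_change(sigma_next, sigma_prev) < em_tol
```

Because the change is scaled by the norm of the previous estimate, the same em_tol behaves comparably whether the trait variances are large or small.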

em_offdiag

logical. Phase 7 opt-in: when TRUE and em_iterations >= 2L, each liability cell's prior at iteration \(k+1\) is the conditional-MVN \((\mu, sd)\) given the posterior liability of other traits at iteration \(k\), using the full off-diagonal entries of \(\Sigma\). Binary + ordinal only (OVR categorical stays on Phase 6 diagonal). Default FALSE preserves Phase 6 behaviour.
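The conditional-MVN prior described for em_offdiag follows the standard Gaussian conditioning formula. The sketch below applies it to a single species with two traits; the function name `cond_mvn_prior` and all numbers are hypothetical, and it is not claimed to match pigauto's internal implementation.

```r
# Prior (mean, sd) for trait j given the current liabilities z of the other
# traits, under a multivariate normal with mean mu and covariance Sigma.
cond_mvn_prior <- function(j, z, mu, Sigma) {
  o <- setdiff(seq_along(mu), j)                       # indices of other traits
  w <- Sigma[j, o, drop = FALSE] %*% solve(Sigma[o, o, drop = FALSE])
  c(mean = mu[j] + as.numeric(w %*% (z[o] - mu[o])),
    sd   = sqrt(Sigma[j, j] - as.numeric(w %*% Sigma[o, j, drop = FALSE])))
}

Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), 2, 2)   # toy trait covariance with off-diagonals
mu    <- c(0, 0)
z     <- c(NA, 1)                    # liability of trait 2 is known, trait 1 is not

prior_offdiag <- cond_mvn_prior(1, z, mu, Sigma)

# With a diagonal Sigma the other trait carries no information, so the prior
# falls back to the marginal mean and SD.
prior_diag <- cond_mvn_prior(1, z, mu, diag(diag(Sigma)))
```

With the off-diagonal entries in play the prior mean shifts toward the value implied by the other trait and the prior SD shrinks; with a diagonal matrix it reduces to the per-trait prior SD.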

Value

A list with:

mu

Numeric matrix (n_species x p_latent), baseline means in latent scale.

se

Numeric matrix (n_species x p_latent), standard errors.

Details

When splits is supplied the val and test cells are masked to NA before fitting, so the baseline is evaluated under the same conditions as fit_pigauto.
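The masking step can be pictured as follows. This is a rough sketch on a toy matrix: the field names `val_idx` and `test_idx` and the use of linear cell indices are assumptions made for illustration, and the real structure returned by make_missing_splits may differ.

```r
set.seed(1)

# Toy trait matrix: 4 species x 3 latent columns (made-up values).
X <- matrix(rnorm(12), nrow = 4,
            dimnames = list(paste0("sp", 1:4), c("t1", "t2", "t3")))

# Hypothetical split object holding linear indices of held-out cells.
splits <- list(val_idx = c(2L, 7L), test_idx = c(5L, 12L))

held_out <- c(splits$val_idx, splits$test_idx)

# Keep the true values for later evaluation, then hide them from the fit.
truth <- X[held_out]
X_fit <- X
X_fit[held_out] <- NA
```

The baseline would then be fitted to `X_fit` only, so its predictions at the held-out cells can be compared against `truth` without leakage.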

Continuous-family columns use Brownian-motion conditional MVN baselines on the phylogenetic correlation matrix, either independently or through the joint MVN path when the data and optional dependencies support it. Binary, ordinal, categorical, and zero-inflated gate columns use the appropriate label-propagation or threshold/liability baseline candidates, with per-column fallbacks when a joint path is not available.
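For a single continuous column, a Brownian-motion conditional MVN baseline can be sketched in base R as below. The correlation matrix is hand-written rather than derived from a tree, the helper `bm_conditional` is hypothetical, and the sketch ignores uncertainty in the estimated root mean, so it should be read as an illustration of the idea and not as pigauto's code.

```r
# Toy phylogenetic correlation matrix for 4 species (assumed values):
# A and B are close relatives, as are C and D.
R <- matrix(c(1.0, 0.8, 0.2, 0.2,
              0.8, 1.0, 0.2, 0.2,
              0.2, 0.2, 1.0, 0.6,
              0.2, 0.2, 0.6, 1.0), 4, 4,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

y <- c(A = 1.2, B = NA, C = -0.5, D = -0.3)   # B is missing

bm_conditional <- function(y, R) {
  obs      <- !is.na(y)
  R_oo_inv <- solve(R[obs, obs, drop = FALSE])
  ones     <- rep(1, sum(obs))

  # GLS estimate of the root mean from the observed tips.
  mu_hat <- as.numeric(t(ones) %*% R_oo_inv %*% y[obs]) /
            as.numeric(t(ones) %*% R_oo_inv %*% ones)
  resid  <- y[obs] - mu_hat

  # ML-style estimate of the BM rate.
  sigma2 <- as.numeric(t(resid) %*% R_oo_inv %*% resid) / sum(obs)

  # Conditional mean and variance of the missing tips given the observed ones.
  R_mo     <- R[!obs, obs, drop = FALSE]
  cond_var <- diag(R[!obs, !obs, drop = FALSE] - R_mo %*% R_oo_inv %*% t(R_mo))

  mu <- y
  mu[!obs] <- mu_hat + as.numeric(R_mo %*% R_oo_inv %*% resid)
  se <- setNames(rep(0, length(y)), names(y))   # observed cells get SE 0 here
  se[!obs] <- sqrt(sigma2 * cond_var)

  list(mu = mu, se = se)
}

out <- bm_conditional(y, R)
```

The imputed value for B is pulled strongly toward its close relative A, and its standard error is smaller than it would be for a species with no close observed relatives.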

Examples

if (FALSE) { # \dontrun{
data(avonet300, tree300, package = "pigauto")
traits <- avonet300
rownames(traits) <- traits$Species_Key
traits$Species_Key <- NULL
pd     <- preprocess_traits(traits, tree300)
splits <- make_missing_splits(pd$X_scaled, trait_map = pd$trait_map)
bl     <- fit_baseline(pd, tree300, splits)
} # }