Impute missing phylogenetic traits (convenience wrapper)

One-call interface to the full pigauto pipeline: preprocessing, baseline fitting, GNN training, and prediction. For fine-grained control, use the individual functions (preprocess_traits, fit_baseline, fit_pigauto, etc.) directly.

Usage

impute(
  traits,
  tree,
  species_col = NULL,
  trait_types = NULL,
  multi_proportion_groups = NULL,
  log_transform = TRUE,
  missing_frac = 0.25,
  n_imputations = 1L,
  covariates = NULL,
  epochs = 2000L,
  verbose = TRUE,
  seed = 1L,
  multi_obs_aggregation = c("hard", "soft"),
  em_iterations = 0L,
  em_tol = 0.001,
  em_offdiag = FALSE,
  pool_method = c("median", "mean", "mode"),
  clamp_outliers = FALSE,
  clamp_factor = 5,
  match_observed = c("none", "pmm"),
  pmm_K = 5L,
  safety_floor = TRUE,
  phylo_signal_gate = TRUE,
  phylo_signal_threshold = 0.2,
  phylo_signal_method = "lambda",
  ...
)

Arguments

traits

data.frame with species as rownames and trait columns, or (when species_col is supplied) a data.frame with a species column that may have multiple rows per species. Supported column types: numeric (continuous), integer (count), factor (binary/categorical), ordered (ordinal), character (factor → binary/categorical), logical (binary). See the Trait type auto-detection section below.

tree

object of class "phylo".

species_col

character. Name of the column in traits that identifies species. When supplied, multiple observations per species are supported. Default NULL uses row names (one row per species).

trait_types

named character vector overriding the auto-detected type for specific trait columns, e.g. c(Survival = "proportion", Parasites = "zi_count"). Required for the two types that cannot be inferred from R class (see Trait type auto-detection below). Default NULL (auto).

multi_proportion_groups

named list declaring compositional (multi_proportion) traits, e.g. list(colour = c("black", "blue", "red", "yellow")). Each list element names a group and gives the K trait columns that form a simplex (rows summing to 1). Encoded via CLR + per-component z-score. Multi_proportion traits cannot be declared through trait_types — use this argument instead. Default NULL (no multi_proportion groups).
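
The CLR step can be illustrated in base R. This is a minimal sketch rather than pigauto's internal code: the helper name clr_rows is hypothetical, the toy values are made up, and it assumes strictly positive components (zero handling and the per-component z-score are not shown).

```r
# Hypothetical helper: centred log-ratio (CLR) transform, row by row.
# Assumes every component is strictly positive.
clr_rows <- function(x) {
  lx <- log(x)
  lx - rowMeans(lx)   # subtract each row's mean log from that row
}

# Toy compositional group: each row is a simplex (sums to 1).
colour <- rbind(
  sp1 = c(black = 0.70, blue = 0.10, red = 0.10, yellow = 0.10),
  sp2 = c(black = 0.25, blue = 0.25, red = 0.25, yellow = 0.25)
)
stopifnot(all(abs(rowSums(colour) - 1) < 1e-12))

z <- clr_rows(colour)
rowSums(z)   # CLR rows sum to zero, up to floating-point error
```

A perfectly even composition (sp2) maps to a row of zeros, so CLR values measure departure from evenness on a log scale.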

log_transform

logical. Auto-log positive continuous columns (default TRUE).

missing_frac

numeric. Fraction of observed cells held out for validation/test evaluation (default 0.25). Set to 0 to skip splitting (all cells used for training, no evaluation).

n_imputations

integer. Number of MC-dropout imputation sets (default 1). Values > 1 enable between-imputation uncertainty.

covariates

data.frame or matrix of environmental covariates (fully observed — no NAs). Covariates are conditioners: they inform imputation but are not themselves imputed. Numeric/integer columns are z-scored; factor/ordered columns are one-hot encoded automatically. If a variable has missing values, include it in traits instead. Same number of rows as traits. Default NULL (no covariates).
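
As a rough illustration of the encoding described above, the base R sketch below z-scores a numeric covariate and one-hot encodes a factor. The column names and values are invented, and pigauto's own encoder may differ in detail (for example in column naming).

```r
# Toy covariate table: one numeric and one factor column, no NAs.
covs <- data.frame(
  temp  = c(12.5, 18.0, 25.5, 9.0),
  realm = factor(c("Nearctic", "Neotropic", "Neotropic", "Palearctic"))
)

# z-score the numeric column (mean 0, SD 1).
temp_z <- as.numeric(scale(covs$temp))

# One-hot encode the factor: one indicator column per level, no intercept.
realm_onehot <- model.matrix(~ realm - 1, data = covs)

encoded <- cbind(temp = temp_z, realm_onehot)
encoded
```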

epochs

integer. Maximum GNN training epochs (default 2000).

verbose

logical. Print progress (default TRUE).

seed

integer. Random seed (default 1).

multi_obs_aggregation

character. How to aggregate multiple observations per species before the Level-C baseline. "hard" (default) thresholds binary proportions at 0.5 and uses argmax for categorical. "soft" preserves species-level proportions and dispatches a soft E-step so that intermediate class frequencies contribute fractional liability evidence. Passed to fit_baseline.
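
The difference between the two modes can be seen on a toy binary trait. This is only a sketch of the aggregation idea in base R; the data are made up, and how pigauto treats an exact 0.5 tie is not specified here (this sketch sends ties to 0).

```r
# Toy data: repeated binary observations (0/1) for three species.
obs <- data.frame(
  species = c("a", "a", "a", "b", "b", "c"),
  migrant = c(1, 1, 0, 0, 0, 1)
)

# Species-level proportion of 1s.
prop <- vapply(split(obs$migrant, obs$species), mean, numeric(1))

soft <- prop                    # "soft": keep the proportion itself
hard <- as.integer(prop > 0.5)  # "hard": threshold at 0.5
names(hard) <- names(prop)

rbind(soft = soft, hard = hard)
```

Species "a" (two of three observations positive) becomes a plain 1 under "hard", while "soft" retains the 2/3 as fractional evidence.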

em_iterations

integer. Phase 6 EM iterations for the threshold-joint baseline (binary + ordinal + OVR categorical). Default 0L preserves v0.9.1 behaviour byte-for-byte. When >= 2L, the BM rate $\Sigma$ learned by Rphylopars::phylopars() at iteration $k$ is fed back as the per-trait prior SD at iteration $k+1$, up to em_iterations times or until em_tol convergence. Passed to fit_baseline.

em_tol

numeric. Relative-Frobenius convergence tolerance for the Phase 6 / 7 EM loop. Default 1e-3.

em_offdiag

logical. Phase 7 opt-in: when TRUE AND em_iterations >= 2L, each liability cell's prior uses the full conditional-MVN from $\Sigma$'s off-diagonal entries, so that observing one discrete trait shifts (not just tightens) the prior on correlated other traits. Binary + ordinal only; OVR categorical stays on Phase 6 diagonal. Default FALSE. Passed to fit_baseline.
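
To see why off-diagonal entries shift the prior rather than only tightening it, consider a two-trait toy case. The covariance matrix, the observed value, and the zero prior means below are assumptions chosen for illustration; this uses the standard conditional-MVN formulas and is not pigauto's implementation.

```r
# Toy 2-trait liability covariance (an assumption for illustration).
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 1.0), nrow = 2, byrow = TRUE)
x2 <- 1.5   # observed liability on trait 2 (prior means taken as 0)

# Diagonal-only prior on trait 1: ignores trait 2 entirely.
diag_mean <- 0
diag_var  <- Sigma[1, 1]

# Full conditional-MVN prior on trait 1 given trait 2.
cond_mean <- Sigma[1, 2] / Sigma[2, 2] * x2
cond_var  <- Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]

c(diag_mean = diag_mean, cond_mean = cond_mean)  # the mean shifts
c(diag_var  = diag_var,  cond_var  = cond_var)   # the variance shrinks
```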

pool_method

character. How to pool multiple imputation draws (n_imputations > 1) for count, proportion, and zi_count magnitude traits: "median" (default) takes the per-cell median of the M decoded draws — robust to dropout-noisy latents amplified by expm1() / plogis() decoders. "mean" restores the pre-v0.9.2 arithmetic-mean pooling. "mode" (Phase H, v0.9.1.9010+) is intended for ordinal traits: per-cell majority vote across the M draws, avoiding the integer-mean-round bias toward middle classes. For continuous-family traits, "mode" falls back to "median". Binary / categorical / multi_proportion traits always pool by probability average; unaffected by this argument. See issue #40.
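
The middle-class bias of mean pooling on ordinal draws is easy to reproduce by hand. The sketch below uses made-up draws for a single cell of a 3-level ordinal trait; the tie-breaking rule in the majority vote is an arbitrary choice of this example, not a documented pigauto behaviour.

```r
# Toy example: M = 5 decoded draws of a 3-level ordinal trait for one cell.
draws <- c(1L, 1L, 1L, 3L, 3L)

pool_mean   <- round(mean(draws))   # 1.8 rounds to 2, a class never drawn
pool_median <- median(draws)
# Majority vote; ties go to the lowest class here (an arbitrary choice).
pool_mode   <- as.integer(names(which.max(table(draws))))

c(mean = pool_mean, median = pool_median, mode = pool_mode)
```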

clamp_outliers

logical. Phase G (v0.9.1.9011+). When TRUE, post-back-transform predictions for log-transformed continuous, count, and zi_count magnitude traits are capped at tm$obs_max * clamp_factor, where tm$obs_max is the observed maximum on the original scale, recorded at preprocess time. Targets the AVONET Mass tail-extrapolation mode documented in useful/MEMO_2026-05-01_avonet_mass_diag.md, where a $+3\sigma$ to $+4\sigma$ latent overshoot becomes a 50x-100x value error after expm1(). Default FALSE preserves v0.9.1 behaviour exactly.

clamp_factor

numeric scalar (>= 1). Multiplicative factor on the observed maximum used by clamp_outliers. Default 5 (Tukey-style outlier definition: anything >= 5x the observed max is implausible). Ignored when clamp_outliers = FALSE.
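
The cap itself is a single elementwise minimum. The following sketch uses invented numbers and plain variable names (obs_max, pred) standing in for the values pigauto records internally.

```r
obs_max      <- 24000                 # toy observed maximum, original scale
clamp_factor <- 5
pred <- c(35, 1200, 90000, 2.5e6)     # toy back-transformed predictions

cap     <- obs_max * clamp_factor     # upper bound applied to predictions
clamped <- pmin(pred, cap)            # values above the cap are pulled down
clamped
```

Only the last prediction exceeds the cap, so it alone is changed; predictions within range pass through untouched.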

match_observed

character, one of c("none", "pmm"). Phase G' (v0.9.1.9012+). Pass-through to predict.pigauto_fit. When "pmm", uses Predictive Mean Matching for log-transformed continuous, count, zi_count magnitude, and proportion traits: imputed values are drawn from the observed value pool, never extrapolated.

When to use: PMM is a niche feature. pigauto already provides conformal prediction intervals (calibrated against held-out residuals) and multi_impute(draws_method = "conformal") for multi-imputation workflows; those are the recommended paths for honest standard errors on downstream regression. PMM is only worth enabling for: (a) methodological comparison against mice, or (b) workflows that specifically require imputed values to come from the observed data pool. For tail safety, prefer clamp_outliers = TRUE. For honest MI inference, prefer multi_impute(draws_method = "conformal").

Default "none" preserves pre-G' behaviour.

pmm_K

integer (>= 1). Donor pool size for PMM. Default 5L (mice convention). Ignored when match_observed = "none".
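
For readers unfamiliar with PMM, here is a minimal base R sketch of the general technique: for each missing cell, find the K observed cells whose model predictions are closest, then draw one of their observed values. The helper pmm_draw and all data are hypothetical, and pigauto's actual matching rule may differ.

```r
# Hypothetical helper: PMM for one trait.
# pred_obs: model predictions at observed cells; y_obs: their observed values;
# pred_mis: model predictions at missing cells; K: donor pool size.
pmm_draw <- function(pred_obs, y_obs, pred_mis, K = 5L) {
  vapply(pred_mis, function(p) {
    donors <- order(abs(pred_obs - p))[seq_len(min(K, length(y_obs)))]
    y_obs[donors[sample.int(length(donors), 1L)]]
  }, numeric(1))
}

set.seed(1L)
y_obs    <- c(10, 12, 15, 20, 40, 80, 150)
pred_obs <- c(11, 13, 14, 22, 38, 85, 140)
pred_mis <- c(16, 500)   # the second prediction extrapolates far past the data

imputed <- pmm_draw(pred_obs, y_obs, pred_mis, K = 3L)
imputed
```

Even for the prediction of 500, the imputed value is one of the three largest observed values, which is the "never extrapolated" property described above.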

safety_floor

logical. When TRUE (default since v0.9.1.9002), calibration searches the 3-way simplex r_BM * BM + r_GNN * GNN + r_MEAN * MEAN so the grand mean is always in the candidate set. Under the validation metric used for calibration, the selected candidate cannot be worse than that grand-mean corner on the validation cells. This is a validation safeguard, not a guarantee about future held-out data. When FALSE, the v0.9.1 1-D calibration is used exactly. See the Safety floor section below.

phylo_signal_gate, phylo_signal_threshold, phylo_signal_method

Pass-through to fit_pigauto(). See that help page for details.

...

additional arguments passed to fit_pigauto.

Value

An object of class "pigauto_result" with components:

completed: The input traits data.frame with observed values preserved and only missing cells filled in. This is the primary output – typically what users want.
imputed_mask: Logical matrix (same shape as completed) that is TRUE for cells that were imputed (originally NA) and FALSE for observed cells.
prediction: A pigauto_pred object from predict.pigauto_fit containing raw model predictions for every cell (observed + missing), standard errors, class probabilities, and conformal intervals.
fit: The trained pigauto_fit object.
baseline: The phylogenetic baseline.
data: The preprocessed pigauto_data object.
splits: The val/test splits (or NULL if missing_frac = 0).
evaluation: Evaluation metrics on test set (or NULL).

Trait type auto-detection

pigauto infers each trait's type from its R class — no trait_types argument is needed for most data:

R class	pigauto type
`numeric`	continuous (auto-log if all-positive)
`integer`	count
`factor` with 2 levels	binary
`factor` (unordered) with >2 levels	categorical
`ordered` / `factor(..., ordered = TRUE)`	ordinal
`character`	→ factor → binary or categorical
`logical`	binary

Two types cannot be inferred from class alone and must be declared via trait_types:

"proportion": A numeric bounded 0–1, e.g. survival rate: trait_types = c(Survival = "proportion").
"zi_count": An integer with excess zeros, e.g. parasite count: trait_types = c(Parasites = "zi_count").

Use the trait_types argument directly (it is an explicit parameter, not a ... pass-through).
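
The mapping in the table above can be mimicked with a few class checks. The function guess_trait_type is a hypothetical stand-in written for this illustration, not pigauto's detection code, and the toy columns are invented; factors with a level count other than 2 are simply labelled categorical here.

```r
# Hypothetical helper mirroring the class-to-type table; not pigauto's code.
guess_trait_type <- function(x) {
  if (is.character(x)) x <- factor(x)          # character -> factor first
  if (is.ordered(x)) return("ordinal")
  if (is.factor(x)) return(if (nlevels(x) == 2L) "binary" else "categorical")
  if (is.logical(x)) return("binary")
  if (is.integer(x)) return("count")           # checked before is.numeric()
  if (is.numeric(x)) return("continuous")
  stop("unsupported column class: ", class(x)[1])
}

traits <- data.frame(
  Mass      = c(12.1, 30.5, NA),
  Clutch    = c(2L, NA, 5L),
  Diet      = factor(c("seed", "insect", "fruit")),
  Nocturnal = c(TRUE, FALSE, NA),
  Migration = ordered(c("low", "high", "mid"), levels = c("low", "mid", "high"))
)
types <- vapply(traits, guess_trait_type, character(1))
types
```

Nothing in a column's class distinguishes a proportion from any other numeric, or a zero-inflated count from any other integer, which is why those two types need trait_types.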

Traits vs covariates

The distinction is functional, not ontological: a trait is something you want to impute (NA values allowed in traits); a covariate is something you use to sharpen imputation accuracy (must be fully observed, passed via covariates). The same variable can be either depending on the scientific question.

Examples:

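IUCN status with Data Deficient species → put it in traits as an ordered factor with levels = c("LC","NT","VU","EN","CR") and Data Deficient coded as NA, so pigauto predicts the unknown categories. Set the levels explicitly: ordered() on a bare character vector sorts the levels alphabetically.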
IUCN status fully known for all species → pass as a covariate to inform imputation of other traits (e.g. body mass, range size).
Realm / biome (factor) → pass as a covariate; pigauto one-hot encodes factor columns automatically (v0.6.1+).

Variables that belong in traits: anything with missing values you care about predicting. Variables that belong in covariates: fully observed, exogenous to the trait space (geography, climate, habitat, experimental treatment).
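
As a concrete case of the first example, the sketch below prepares an IUCN column for use as a trait. The species names and statuses are invented; the point is the explicit level order and the conversion of Data Deficient ("DD") to NA.

```r
status_raw <- c("LC", "DD", "VU", "EN", "LC", "DD", "CR", "NT")

# Data Deficient becomes NA (a cell to impute). The explicit `levels`
# argument fixes the threat ordering, since character levels are
# otherwise sorted alphabetically.
status <- status_raw
status[status == "DD"] <- NA
iucn <- factor(status, levels = c("LC", "NT", "VU", "EN", "CR"), ordered = TRUE)

traits <- data.frame(IUCN = iucn, row.names = paste0("sp", seq_along(iucn)))
levels(traits$IUCN)       # threat order, least to most threatened
sum(is.na(traits$IUCN))   # number of cells left for imputation
```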

Safety floor (v0.9.1.9002+)

With safety_floor = TRUE (the new default), the post-training calibration grid searches a 3-way convex combination of the Brownian-motion baseline, the GNN delta, and the per-trait grand mean. The simplex is sampled at step 0.05 (231 candidates per latent column). Because the corner (0, 0, 1) — pure grand mean — is always in the grid, the selected candidate cannot be worse than the grand-mean corner on the validation cells under the calibration metric. The fit object gains four new slots: r_cal_bm, r_cal_gnn, r_cal_mean (each a named numeric of length p_latent), and mean_baseline_per_col.

Set safety_floor = FALSE to reproduce the pre-v0.9.1.9002 1-D calibration bit-identically (no mean term; r_cal_mean = 0; r_cal_bm = 1 - r_cal_gnn). See specs/2026-04-23-safety-floor-mean-gate-design.md for the design rationale and plans/2026-04-23-safety-floor-mean-gate.md for the implementation plan.
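
The candidate count and the floor property can be checked with a small simulation. This is a sketch under stated assumptions: the data are simulated, mean squared error stands in for the calibration metric, and the GNN term is treated as a plain predictor rather than a delta. It is not the package's calibration routine.

```r
# Enumerate weights (bm, gnn, mean) on a 0.05 grid that sum to 1.
# Integer steps avoid floating-point trouble when testing the sum.
steps <- 0:20
grid  <- expand.grid(bm = steps, gnn = steps)
grid  <- grid[grid$bm + grid$gnn <= 20, ]
grid$mean <- 20 - grid$bm - grid$gnn
grid <- grid / 20
nrow(grid)   # 231 candidates

# Toy validation targets and three predictors (made up for illustration).
set.seed(1L)
y        <- rnorm(50)
pred_bm  <- y + rnorm(50, sd = 0.5)
pred_gnn <- y + rnorm(50, sd = 0.7)
pred_mu  <- rep(mean(y), 50)

mse <- apply(grid, 1, function(w) {
  blend <- w[["bm"]] * pred_bm + w[["gnn"]] * pred_gnn + w[["mean"]] * pred_mu
  mean((blend - y)^2)
})
best          <- grid[which.min(mse), ]
mse_mean_only <- mean((pred_mu - y)^2)
min(mse) <= mse_mean_only   # TRUE: the (0, 0, 1) corner is one of the candidates
```

The final comparison holds by construction on the cells used for the search, which is exactly why the text above calls it a validation safeguard and not a guarantee about new data.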

What gets imputed (read this first)

pigauto only imputes cells that are NA in the input. Observed cells are preserved as-is in result$completed. The slot result$prediction$imputed contains the model's prediction for every cell – observed and missing alike – and is intended for diagnostics (e.g. checking calibration on training cells), not as the imputed-values output. The imputed values themselves are result$completed[result$imputed_mask].
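
The relationship between the input, the mask, and the completed table can be reproduced with toy objects. The data frames below are invented stand-ins (pred plays the role of the every-cell predictions); this shows the logic only, not pigauto's code.

```r
# Toy input with two missing cells, and toy predictions for every cell.
input <- data.frame(Mass = c(10, NA, 30), Wing = c(NA, 5, 6),
                    row.names = c("sp1", "sp2", "sp3"))
pred  <- data.frame(Mass = c(11, 19, 28), Wing = c(4.2, 5.3, 6.1),
                    row.names = rownames(input))

imputed_mask <- is.na(as.matrix(input))   # TRUE only where the input was NA
completed    <- input
completed[imputed_mask] <- pred[imputed_mask]   # fill gaps, keep observed

completed
```

Observed cells keep their original values (Mass for sp1 stays 10, not the predicted 11), and only the two NA cells take the model's prediction.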

Common pitfall. If you call impute() on a fully observed trait matrix (no NAs anywhere), there is nothing to impute. result$completed is identical to the input, sum(result$imputed_mask) is 0, and result$prediction$imputed is just model predictions for already-known values. This is the right behaviour, but it can look surprising: e.g. on avonet300 (fully observed), the "imputed" Mass values you see are simply the observed body masses passed through (some bird species are 24 kg). To exercise the imputation path on a complete dataset, mask some cells first (see Examples).

Imbalanced K-class traits. At default settings (n_imputations = 1L, pool_method = "median"), a small ordinal / categorical trait whose marginal distribution is heavily skewed (e.g. AVONET Migration, split roughly 78% / 14% / remainder across its K=3 classes) can collapse onto a corner that predicts the majority class everywhere. When this matters, increase n_imputations (>= 20 in our K=3 ordinal benches) and set pool_method = "mode" (Phase H, +6.6 pp on AVONET Migration K=3 vs the default median pool). See useful/MEMO_2026-05-01_phase_h_results.md.

Examples

if (FALSE) { # \dontrun{
# Typical use: your data already has NAs you want filled
result <- impute(my_traits_with_NAs, my_tree)
result$completed                  # observed preserved, NAs filled
result$imputed_mask               # which cells were imputed
sum(result$imputed_mask)          # how many cells were imputed

# Proportion and zi_count must be declared explicitly
result <- impute(my_traits, my_tree,
                 trait_types = c(Survival  = "proportion",
                                 Parasites = "zi_count"))

# Diagnostic: raw predictions for every cell (NOT the imputed values).
# `imputed` here means "model output", not "filled gap".
result$prediction$imputed         # model prediction at every cell
result$prediction$se              # per-cell uncertainty
result$prediction$probabilities$diet  # class probabilities

# Demonstration / sanity-check on a fully observed dataset:
# mask some cells, impute, compare predictions to truth.
data(avonet300, tree300)
df <- avonet300
rownames(df) <- df$Species_Key; df$Species_Key <- NULL
set.seed(1L)
truth <- df$Mass
df_obs <- df
hide  <- sample(which(!is.na(df$Mass)), 30L)
df_obs$Mass[hide] <- NA

result <- impute(df_obs, tree300)
truth[hide]                         # held-out truth
result$completed$Mass[hide]         # pigauto's imputations for those cells
plot(truth[hide], result$completed$Mass[hide],
     log = "xy", xlab = "truth", ylab = "imputed")
abline(0, 1, col = "red")

# For imbalanced K=3 ordinal traits (e.g. Migration), prefer:
result <- impute(df_obs, tree300, n_imputations = 20L,
                 pool_method = "mode")
} # }