One-call interface to the full pigauto pipeline: preprocessing, baseline
fitting, GNN training, and prediction. For fine-grained control, use the
individual functions (preprocess_traits,
fit_baseline, fit_pigauto, etc.) directly.
Usage
impute(
  traits,
  tree,
  species_col = NULL,
  trait_types = NULL,
  multi_proportion_groups = NULL,
  log_transform = TRUE,
  missing_frac = 0.25,
  n_imputations = 1L,
  covariates = NULL,
  epochs = 2000L,
  verbose = TRUE,
  seed = 1L,
  multi_obs_aggregation = c("hard", "soft"),
  em_iterations = 0L,
  em_tol = 0.001,
  em_offdiag = FALSE,
  pool_method = c("median", "mean", "mode"),
  clamp_outliers = FALSE,
  clamp_factor = 5,
  match_observed = c("none", "pmm"),
  pmm_K = 5L,
  safety_floor = TRUE,
  phylo_signal_gate = TRUE,
  phylo_signal_threshold = 0.2,
  phylo_signal_method = "lambda",
  ...
)
Arguments
- traits
data.frame with species as rownames and trait columns, or (when species_col is supplied) a data.frame with a species column that may have multiple rows per species. Supported column types: numeric (continuous), integer (count), factor (binary/categorical), ordered (ordinal), character (converted to factor, then binary/categorical), logical (binary). See the Trait type auto-detection section below.
- tree
object of class "phylo".
- species_col
character. Name of the column in traits that identifies species. When supplied, multiple observations per species are supported. Default NULL uses row names (one row per species).
- trait_types
named character vector overriding the auto-detected type for specific trait columns, e.g. c(Survival = "proportion", Parasites = "zi_count"). Required for the two types that cannot be inferred from R class (see Trait type auto-detection below). Default NULL (auto).
- multi_proportion_groups
named list declaring compositional (multi_proportion) traits, e.g. list(colour = c("black", "blue", "red", "yellow")). Each list element names a group and gives the K trait columns that form a simplex (rows summing to 1). Encoded via CLR + per-component z-score. Multi_proportion traits cannot be declared through trait_types; use this argument instead. Default NULL (no multi_proportion groups).
- log_transform
logical. Auto-log positive continuous columns (default TRUE).
- missing_frac
numeric. Fraction of observed cells held out for validation/test evaluation (default 0.25). Set to 0 to skip splitting (all cells used for training, no evaluation).
- n_imputations
integer. Number of MC-dropout imputation sets (default 1). Values > 1 enable between-imputation uncertainty.
- covariates
data.frame or matrix of environmental covariates (fully observed, no NAs). Covariates are conditioners: they inform imputation but are not themselves imputed. Numeric/integer columns are z-scored; factor/ordered columns are one-hot encoded automatically. If a variable has missing values, include it in traits instead. Must have the same number of rows as traits. Default NULL (no covariates).
- epochs
integer. Maximum GNN training epochs (default 2000).
- verbose
logical. Print progress (default TRUE).
- seed
integer. Random seed (default 1).
- multi_obs_aggregation
character. How to aggregate multiple observations per species before the Level-C baseline. "hard" (default) thresholds binary proportions at 0.5 and uses argmax for categorical. "soft" preserves species-level proportions and dispatches a soft E-step so that intermediate class frequencies contribute fractional liability evidence. Passed to fit_baseline.
- em_iterations
integer. Phase 6 EM iterations for the threshold-joint baseline (binary + ordinal + OVR categorical). Default 0L preserves v0.9.1 behaviour byte-for-byte. When >= 2L, the BM rate \(\Sigma\) learned by Rphylopars::phylopars() at iteration \(k\) is fed back as the per-trait prior SD at iteration \(k+1\), up to em_iterations times or until em_tol convergence. Passed to fit_baseline.
- em_tol
numeric. Relative-Frobenius convergence tolerance for the Phase 6 / 7 EM loop. Default 1e-3.
- em_offdiag
logical. Phase 7 opt-in: when TRUE and em_iterations >= 2L, each liability cell's prior uses the full conditional-MVN from \(\Sigma\)'s off-diagonal entries, so that observing one discrete trait shifts (not just tightens) the prior on correlated other traits. Binary + ordinal only; OVR categorical stays on Phase 6 diagonal. Default FALSE. Passed to fit_baseline.
- pool_method
character. How to pool multiple imputation draws (n_imputations > 1) for count, proportion, and zi_count magnitude traits: "median" (default) takes the per-cell median of the M decoded draws, which is robust to dropout-noisy latents amplified by the expm1() / plogis() decoders. "mean" restores the pre-v0.9.2 arithmetic-mean pooling. "mode" (Phase H, v0.9.1.9010+) is intended for ordinal traits: per-cell majority vote across the M draws, avoiding the integer-mean-round bias toward middle classes. For continuous-family traits, "mode" falls back to "median". Binary / categorical / multi_proportion traits always pool by probability average and are unaffected by this argument. See issue #40.
- clamp_outliers
logical. Phase G (v0.9.1.9011+). When TRUE, post-back-transform predictions for log-transformed continuous, count, and zi_count magnitude traits are capped at tm$obs_max * clamp_factor (where tm$obs_max is the observed maximum on the original scale, recorded at preprocess time). Targets the AVONET Mass tail-extrapolation mode documented in useful/MEMO_2026-05-01_avonet_mass_diag.md, where a \(+3\)-\(4 \sigma\) latent overshoot becomes a 50x-100x value error after expm1(). Default FALSE preserves v0.9.1 behaviour exactly.
- clamp_factor
numeric scalar (>= 1). Multiplicative factor on the observed maximum used by clamp_outliers. Default 5 (Tukey-style outlier definition: anything >= 5x the observed max is implausible). Ignored when clamp_outliers = FALSE.
- match_observed
character, one of c("none", "pmm"). Phase G' (v0.9.1.9012+). Pass-through to predict.pigauto_fit. When "pmm", uses Predictive Mean Matching for log-transformed continuous, count, zi_count magnitude, and proportion traits: imputed values are drawn from the observed value pool, never extrapolated.
When to use: PMM is a niche feature. pigauto already provides conformal prediction intervals (calibrated against held-out residuals) and multi_impute(draws_method = "conformal") for multi-imputation workflows; those are the recommended paths for honest standard errors on downstream regression. PMM is only worth enabling for: (a) methodological comparison against mice, or (b) workflows that specifically require imputed values to come from the observed data pool. For tail safety, prefer clamp_outliers = TRUE. For honest MI inference, prefer multi_impute(draws_method = "conformal").
Default "none" preserves pre-G' behaviour.
- pmm_K
integer (>= 1). Donor pool size for PMM. Default 5L (mice convention). Ignored when match_observed = "none".
- safety_floor
logical. When TRUE (default since v0.9.1.9002), calibration searches the 3-way simplex r_BM * BM + r_GNN * GNN + r_MEAN * MEAN so the grand mean is always in the candidate set. Under the validation metric used for calibration, the selected candidate cannot be worse than that grand-mean corner on the validation cells. This is a validation safeguard, not a guarantee about future held-out data. When FALSE, the v0.9.1 1-D calibration is used exactly. See the Safety floor section below.
- phylo_signal_gate, phylo_signal_threshold, phylo_signal_method
Pass-through to fit_pigauto(). See that help page for details.
- ...
additional arguments passed to fit_pigauto.
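The capping rule described under clamp_outliers can be illustrated in base R. This is a minimal sketch of the arithmetic only, not pigauto's internal code: the vectors observed and predicted are made-up stand-ins, and obs_max plays the role of tm$obs_max.

```r
# Hypothetical stand-ins: observed values on the original scale and raw
# back-transformed predictions (these are not pigauto outputs).
observed  <- c(12, 35, 150, 900)
predicted <- c(40, 2000, 60000)

obs_max      <- max(observed, na.rm = TRUE)  # plays the role of tm$obs_max
clamp_factor <- 5                            # the documented default

# Cap each prediction at obs_max * clamp_factor; smaller values pass through.
clamped <- pmin(predicted, obs_max * clamp_factor)
clamped
```

Only the third prediction exceeds the cap of 4500, so it is the only one changed. Values below the cap are left untouched, which is why the clamp targets tail extrapolation and not ordinary predictions.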
Value
An object of class "pigauto_result" with components:
- completed
The input traits data.frame with observed values preserved and only missing cells filled in. This is the primary output, typically what users want.
- imputed_mask
Logical matrix (same shape as completed) that is TRUE for cells that were imputed (originally NA) and FALSE for observed cells.
- prediction
A pigauto_pred object from predict.pigauto_fit containing raw model predictions for every cell (observed + missing), standard errors, class probabilities, and conformal intervals.
- fit
The trained pigauto_fit object.
- baseline
The phylogenetic baseline.
- data
The preprocessed pigauto_data object.
- splits
The val/test splits (or NULL if missing_frac = 0).
- evaluation
Evaluation metrics on test set (or NULL).
Trait type auto-detection
pigauto infers each trait's type from its R class — no trait_types
argument is needed for most data:
| R class | pigauto type |
| --- | --- |
| numeric | continuous (auto-log if all-positive) |
| integer | count |
| factor with 2 levels | binary |
| factor (unordered) with >2 levels | categorical |
| ordered / factor(..., ordered = TRUE) | ordinal |
| character | converted to factor, then binary or categorical |
| logical | binary |
Two types cannot be inferred from class alone and must be declared
via trait_types:
- "proportion"
A numeric bounded 0–1, e.g. survival rate: trait_types = c(Survival = "proportion").
- "zi_count"
An integer with excess zeros, e.g. parasite count: trait_types = c(Parasites = "zi_count").
Use the trait_types argument directly (it is an explicit
parameter, not a ... pass-through).
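The class-to-type mapping in the table can be sketched as a small base-R function. The helper guess_trait_type() below is hypothetical and written only for illustration; it is not pigauto's detection code, and it deliberately cannot return "proportion" or "zi_count", since those must be declared through trait_types.

```r
# Hypothetical helper mirroring the auto-detection table (not pigauto's code).
guess_trait_type <- function(x) {
  if (is.logical(x)) return("binary")
  if (is.character(x)) x <- factor(x)          # character -> factor first
  if (is.ordered(x)) return("ordinal")         # test ordered before factor
  if (is.factor(x)) {
    return(if (nlevels(x) == 2L) "binary" else "categorical")
  }
  if (is.integer(x)) return("count")           # test integer before numeric
  if (is.numeric(x)) return("continuous")
  stop("unsupported column class: ", class(x)[1L])
}

# Made-up trait table for demonstration.
df <- data.frame(
  mass      = c(1.2, 3.4, NA),
  clutch    = c(2L, 4L, NA),
  diet      = factor(c("herb", "carn", "omni")),
  migratory = c(TRUE, FALSE, NA)
)
types <- vapply(df, guess_trait_type, character(1))
types
```

The order of the checks matters: is.numeric() is also TRUE for integer vectors, and is.factor() is also TRUE for ordered factors, so the more specific class is tested first.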
Traits vs covariates
The distinction is functional, not ontological: a trait is something
you want to impute (NA values allowed in traits); a covariate is
something you use to sharpen imputation accuracy (must be fully observed,
passed via covariates). The same variable can be either depending on
the scientific question.
Examples:
- IUCN status with Data Deficient species → put it in traits as an ordered factor with explicit levels, e.g. factor(x, levels = c("LC","NT","VU","EN","CR"), ordered = TRUE), so pigauto predicts the unknown categories. (Without explicit levels, R sorts factor levels alphabetically.)
- IUCN status fully known for all species → pass as a covariate to inform imputation of other traits (e.g. body mass, range size).
- Realm / biome (factor) → pass as a covariate; pigauto one-hot encodes factor columns automatically (v0.6.1+).
Variables that belong in traits: anything with missing values you
care about predicting. Variables that belong in covariates: fully
observed, exogenous to the trait space (geography, climate, habitat,
experimental treatment).
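The IUCN case can be prepared in base R as follows. This is a sketch on made-up data: the vector status and the simple anyNA() rule for picking a role are illustrative assumptions, not part of pigauto.

```r
# Made-up IUCN statuses; NA marks a Data Deficient species.
status      <- c("VU", "LC", NA, "CR", "NT")
iucn_levels <- c("LC", "NT", "VU", "EN", "CR")   # least to most threatened

# Explicit levels keep the threat ordering; ordered(status) alone would
# sort the levels alphabetically.
iucn <- factor(status, levels = iucn_levels, ordered = TRUE)
alphabetical <- levels(ordered(status))

# Illustrative rule of thumb: anything with NAs you want predicted is a
# trait; a fully observed variable can serve as a covariate.
role <- if (anyNA(iucn)) "trait" else "covariate"
role
```

Here iucn contains a missing value, so it belongs in traits. With every status known, the same column could instead be passed through covariates.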
Safety floor (v0.9.1.9002+)
With safety_floor = TRUE (the new default), the post-training
calibration grid searches a 3-way convex combination of the
Brownian-motion baseline, the GNN delta, and the per-trait grand
mean. The simplex is sampled at step 0.05 (231 candidates per latent
column). Because the corner (0, 0, 1) — pure grand mean —
is always in the grid, the selected candidate cannot be worse than
the grand-mean corner on the validation cells under the calibration
metric. The fit object gains four new slots:
r_cal_bm, r_cal_gnn, r_cal_mean (each a named
numeric of length p_latent), and
mean_baseline_per_col.
Set safety_floor = FALSE to reproduce the pre-v0.9.1.9002
1-D calibration bit-identically (no mean term; r_cal_mean = 0;
r_cal_bm = 1 - r_cal_gnn). See
specs/2026-04-23-safety-floor-mean-gate-design.md for the
design rationale and
plans/2026-04-23-safety-floor-mean-gate.md for the
implementation plan.
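The candidate count and the floor property can be checked with a short base-R sketch. Everything below is illustrative: the predictions are simulated, RMSE is an assumed stand-in for the calibration metric, and the grid construction is one straightforward way to enumerate a step-0.05 simplex, not necessarily how pigauto builds it.

```r
# Enumerate the 3-way simplex at step 0.05 using integer counts out of 20,
# which avoids floating-point trouble when testing that weights sum to 1.
n    <- 20L                                   # 1 / 0.05
grid <- expand.grid(i_bm = 0:n, i_gnn = 0:n)
grid <- grid[grid$i_bm + grid$i_gnn <= n, ]
weights <- data.frame(
  r_bm   = grid$i_bm / n,
  r_gnn  = grid$i_gnn / n,
  r_mean = (n - grid$i_bm - grid$i_gnn) / n
)
n_candidates <- nrow(weights)

# Simulated validation cells for one latent column (made-up data).
set.seed(1)
truth     <- rnorm(50)
pred_bm   <- truth + rnorm(50, sd = 0.5)
pred_gnn  <- truth + rnorm(50, sd = 0.7)
pred_mean <- rep(mean(truth), 50)

# Assumed metric: RMSE of the blended prediction on the validation cells.
rmse <- function(w) {
  blend <- w[1] * pred_bm + w[2] * pred_gnn + w[3] * pred_mean
  sqrt(mean((blend - truth)^2))
}
scores      <- apply(as.matrix(weights), 1, rmse)
best        <- weights[which.min(scores), ]
mean_corner <- scores[weights$r_bm == 0 & weights$r_gnn == 0]
```

Because the (0, 0, 1) corner is one of the candidates, min(scores) can never exceed mean_corner on these validation cells. The same argument says nothing about cells outside the validation set.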
What gets imputed (read this first)
pigauto only imputes cells that are NA in the input.
Observed cells are preserved as-is in result$completed. The
slot result$prediction$imputed contains the model's
prediction for every cell – observed and missing alike –
and is intended for diagnostics (e.g. checking calibration on
training cells), not as the imputed-values output. The
imputed values themselves are result$completed[result$imputed_mask].
Common pitfall. If you call impute() on a fully
observed trait matrix (no NAs anywhere), there is nothing
to impute. result$completed is identical to the input,
sum(result$imputed_mask) is 0, and
result$prediction$imputed is just model predictions for
already-known values. This is the right behaviour, but it can
look surprising: e.g. on avonet300 (fully observed), the
"imputed" Mass values you see are simply the observed body
masses passed through (some bird species are 24 kg). To exercise
the imputation path on a complete dataset, mask some cells first
(see Examples).
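The relationship between the input, the mask, and the completed table can be shown without running the model. In this sketch traits is made-up data and completed is filled in by hand as a stand-in for result$completed; no pigauto function is called.

```r
# Made-up input with two missing cells.
traits <- data.frame(
  Mass = c(10, NA, 30),
  Wing = c(NA, 5, 6),
  row.names = c("sp1", "sp2", "sp3")
)

# Same shape as the input: TRUE where a cell was originally NA.
imputed_mask <- is.na(as.matrix(traits))

# Stand-in for result$completed: only the NA cells are filled.
completed <- traits
completed$Mass[2] <- 20
completed$Wing[1] <- 4

# The imputed values are the masked cells, in column-major order.
filled <- as.matrix(completed)[imputed_mask]

# Observed cells are carried over unchanged.
observed_preserved <- identical(
  as.matrix(completed)[!imputed_mask],
  as.matrix(traits)[!imputed_mask]
)
```

If the input had no NAs, imputed_mask would be all FALSE and filled would be empty, which is the fully observed case described above.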
Imbalanced K-class traits. At default settings
(n_imputations = 1L, pool_method = "median"), a
small ordinal / categorical trait whose marginal distribution is
heavily skewed (e.g. AVONET Migration, whose classes split roughly
78% / 14% / remainder) can
collapse onto a corner that predicts the majority class
everywhere. When this matters, increase n_imputations
(>= 20 in our K=3 ordinal benches) and set
pool_method = "mode" (Phase H, +6.6 pp on AVONET
Migration K=3 vs the default median pool). See
useful/MEMO_2026-05-01_phase_h_results.md.
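The difference between pooling rules for one ordinal cell can be seen on a toy vector of draws. The helper pool_mode() is hypothetical and is not pigauto's implementation; in particular, breaking ties in favour of the lowest class is an assumption of this sketch.

```r
# Hypothetical majority-vote pooling for a single cell.
# Ties go to the lowest class code (an assumption of this sketch).
pool_mode <- function(draws) {
  counts <- table(draws)
  as.integer(names(counts)[which.max(counts)])
}

# Made-up M = 5 draws for one cell of a K = 3 ordinal trait coded 1..3.
draws <- c(1L, 1L, 1L, 3L, 3L)

mean_round <- round(mean(draws))   # mean is 1.8, which rounds to class 2
by_mode    <- pool_mode(draws)     # class 1, the most frequent draw
```

Rounding the mean lands on class 2, which was never drawn, whereas the majority vote returns a class that actually appeared. This is the middle-class bias that pool_method = "mode" is meant to avoid.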
Examples
if (FALSE) { # \dontrun{
# Typical use: your data already has NAs you want filled
result <- impute(my_traits_with_NAs, my_tree)
result$completed # observed preserved, NAs filled
result$imputed_mask # which cells were imputed
sum(result$imputed_mask) # how many cells were imputed
# Proportion and zi_count must be declared explicitly
result <- impute(my_traits, my_tree,
                 trait_types = c(Survival = "proportion",
                                 Parasites = "zi_count"))
# Diagnostic: raw predictions for every cell (NOT the imputed values).
# `imputed` here means "model output", not "filled gap".
result$prediction$imputed # model prediction at every cell
result$prediction$se # per-cell uncertainty
result$prediction$probabilities$diet # class probabilities
# Demonstration / sanity-check on a fully observed dataset:
# mask some cells, impute, compare predictions to truth.
data(avonet300, tree300)
df <- avonet300
rownames(df) <- df$Species_Key; df$Species_Key <- NULL
set.seed(1L)
truth <- df$Mass
df_obs <- df
hide <- sample(which(!is.na(df$Mass)), 30L)
df_obs$Mass[hide] <- NA
result <- impute(df_obs, tree300)
truth[hide] # held-out truth
result$completed$Mass[hide] # pigauto's imputations for those cells
plot(truth[hide], result$completed$Mass[hide],
     log = "xy", xlab = "truth", ylab = "imputed")
abline(0, 1, col = "red")
# For imbalanced K=3 ordinal traits (e.g. Migration), prefer:
result <- impute(df_obs, tree300, n_imputations = 20L,
                 pool_method = "mode")
} # }
