
This vignette collects the questions early users most often hit when running pigauto on their own data. Each section follows the same template:

  • Symptom — the surprising output, in user voice.
  • Why this happens — the mechanism in 2–3 sentences.
  • Diagnose — 1–3 R commands that confirm the pattern.
  • Fix — concrete code change with a short explanation.
  • See also — links to the relevant ?function, design memo, or Methodology bench under the Methodology navbar dropdown.

If you hit something that isn’t here and feels surprising, please open an issue — most of the items below were added because real users tripped on them.

1. “I called impute() and result$prediction$imputed looks like my input”

Symptom. You run result <- impute(df, tree) on a fully-observed dataset (e.g. the bundled avonet300) and read result$prediction$imputed$Mass expecting “the imputed values” — but the values look exactly like your input data, including legitimately huge ones (a 25 kg rhea, a 12 kg vulture).

Why this happens. impute() only imputes cells that are NA in the input. Your input was fully observed, so nothing was imputed: result$completed equals the input, sum(result$imputed_mask) is zero, and result$prediction$imputed contains the model’s prediction for every cell — observed and missing alike. For observed cells, the well-calibrated gate keeps the prediction close to the input value, so what comes back is essentially the original data passed through. The slot is intended for diagnostics (checking calibration on training cells), not as the imputed-values output.

Diagnose.

library(pigauto)
data(avonet300, tree300)
df <- avonet300
rownames(df) <- df$Species_Key
df$Species_Key <- NULL
sum(is.na(df))                  # if 0, there's nothing for impute() to do

Fix. Mask some cells before calling impute(), then evaluate predictions only on the held-out cells:

set.seed(1L)
hide   <- sample(which(!is.na(df$Mass)), 30L)
df_obs <- df
df_obs$Mass[hide] <- NA          # hide 30 mass values

result <- impute(df_obs, tree300)

result$completed$Mass[hide]      # pigauto's imputations
df$Mass[hide]                    # held-out truth, for comparison
sum(result$imputed_mask[, "Mass"])  # 30

For your own data with real NAs, the imputed values you actually care about are result$completed[result$imputed_mask], not result$prediction$imputed.
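The relationship between input NAs, the mask, and the completed table can be sketched in base R without pigauto. This toy uses a column-mean fill as a stand-in imputer (an assumption for illustration only, not pigauto's model). It also assumes, as the expressions above suggest, that imputed_mask is a logical matrix with the same shape as the data. All toy_* names are hypothetical.

```r
# Toy stand-in for impute(): fills NAs with column means.
# This is NOT pigauto's model; it only illustrates the mask bookkeeping.
toy_input <- data.frame(Mass = c(10, NA, 30), Wing = c(1, 2, NA))

toy_mask <- is.na(toy_input)             # TRUE only where the input was NA
toy_completed <- toy_input
for (col in names(toy_input)) {
  fill <- mean(toy_input[[col]], na.rm = TRUE)
  toy_completed[[col]][toy_mask[, col]] <- fill
}

n_imputed      <- sum(toy_mask)                       # number of imputed cells
imputed_values <- as.numeric(toy_completed[toy_mask]) # only the imputed values
observed_kept  <- all(toy_completed[!toy_mask] == toy_input[!toy_mask])

n_imputed
imputed_values
observed_kept
```

If the mask sums to zero, the last two lines show at a glance that the completed table is just the input.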

See also. ?impute (“What gets imputed (read this first)”), issue #67.

2. “My ordinal trait predicted 100 % majority class”

Symptom. You impute an ordinal trait and the prediction is the majority class for every species. For example, on avonet300$Migration (K = 3 ordinal: Resident / Partial / Full), the predicted class counts come back as 300 / 0 / 0.

Why this happens. Two things compound:

  1. If your input has no NAs in that column, there’s nothing to impute externally (see Pitfall 1) — result$prediction$imputed$Migration reflects the model’s calibrated-gate output, not new imputations.
  2. At default settings (n_imputations = 1L, pool_method = "median"), a small ordinal trait whose marginal distribution is heavily skewed (AVONET Migration is ~78 % Resident / 14 % Partial / 8 % Full at n = 300) can have its calibrated gate snap to a corner that returns the majority class for every species.

Diagnose.

library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL

table(df$Migration)              # check the marginal distribution
result <- impute(df, tree300, verbose = FALSE)
table(result$prediction$imputed$Migration)

Fix. For imbalanced K-class ordinal traits, increase n_imputations and switch to pool_method = "mode" (Phase H). On the AVONET multi-seed bench this improved accuracy on Migration (K = 3) by 6.6 percentage points over the default median pool.

set.seed(1L)
hide   <- sample(which(!is.na(df$Migration)), 30L)
df_obs <- df
df_obs$Migration[hide] <- NA

# Default settings: prone to majority-class collapse on imbalanced K = 3
result <- impute(df_obs, tree300, verbose = FALSE)
table(result$completed$Migration[hide], df$Migration[hide])

# Recommended for K = 3 ordinal: more draws + mode pooling
result_mode <- impute(df_obs, tree300, n_imputations = 20L,
                      pool_method = "mode", verbose = FALSE)
table(result_mode$completed$Migration[hide], df$Migration[hide])
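A confusion table is easier to read next to two summary numbers: held-out accuracy and the accuracy of always guessing the majority class. The sketch below computes both in base R on made-up vectors. The helper collapse_check() and the toy data are hypothetical and not part of pigauto.

```r
# Hypothetical held-out truth and predictions for a K = 3 ordinal trait.
lv <- c("Resident", "Partial", "Full")
truth <- factor(c("Resident", "Resident", "Partial", "Full", "Resident", "Partial"),
                levels = lv, ordered = TRUE)
pred_collapsed <- factor(rep("Resident", 6), levels = lv, ordered = TRUE)
pred_mixed <- factor(c("Resident", "Resident", "Partial", "Partial", "Resident", "Full"),
                     levels = lv, ordered = TRUE)

# Compare predictions with the truth and with a majority-class guess.
collapse_check <- function(pred, truth) {
  majority <- names(which.max(table(truth)))
  list(
    accuracy            = mean(as.character(pred) == as.character(truth)),
    majority_baseline   = mean(as.character(truth) == majority),
    n_classes_predicted = length(unique(as.character(pred)))
  )
}

chk_collapsed <- collapse_check(pred_collapsed, truth)
chk_mixed     <- collapse_check(pred_mixed, truth)
str(chk_collapsed)
str(chk_mixed)
```

An accuracy equal to the majority baseline together with n_classes_predicted == 1 is the signature of the collapse described in this pitfall. With real output, pass result$completed$Migration[hide] and df$Migration[hide] instead of the toy vectors.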

See also. ?impute (“Imbalanced K-class traits”), Phase H memo, issue #68.

3. “The gate stays closed and the GNN seems to do nothing”

Symptom. You expected the GNN to dominate, but inspecting the fitted model shows the calibrated gate is fully or near-fully closed (r_cal_gnn ≈ 0) — predictions equal the BM baseline.

Why this happens. This is the safety-floor design behaviour, not a bug. After training, pigauto picks the per-latent-column gate that minimises validation loss across the simplex $r_\text{BM} \cdot \text{BM} + r_\text{GNN} \cdot \text{GNN} + r_\text{MEAN} \cdot \text{MEAN}$. When the GNN cannot beat BM on the held-out validation set, the optimum can be r_cal_gnn = 0. In that case the calibrated prediction stays on the validation-supported baseline or mean corner instead of forcing a GNN contribution. This is what the package was designed to do on high-phylogenetic-signal traits where BM is already hard to beat.
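To see how a validation-driven search over the simplex can land on r_GNN = 0, here is a minimal base-R sketch. It is not pigauto's calibration code: the 0.1-step grid search, the squared-error loss, and every vector below are assumptions chosen so the result is easy to verify by hand.

```r
# Deterministic toy validation set (all values hypothetical).
truth <- 1:8
bm    <- truth + 0.2 * rep(c(1, -1), times = 4)   # baseline with a small error
gnn   <- bm + rep(c(1, 1, -1, -1), times = 2)     # GNN adds only noise here
mu    <- rep(mean(truth), length(truth))          # grand-mean prediction

# All weight triples (r_bm, r_gnn, r_mean) on a 0.1-step grid of the simplex.
grid <- expand.grid(r_bm = (0:10) / 10, r_gnn = (0:10) / 10)
grid$r_mean <- 1 - grid$r_bm - grid$r_gnn
grid <- grid[grid$r_mean > -1e-9, ]               # keep points inside the simplex
grid$r_mean <- pmax(grid$r_mean, 0)               # remove tiny negative round-off

# Validation mean squared error of each blend.
grid$val_mse <- mapply(function(w_bm, w_gnn, w_mean) {
  mean((w_bm * bm + w_gnn * gnn + w_mean * mu - truth)^2)
}, grid$r_bm, grid$r_gnn, grid$r_mean)

best <- grid[which.min(grid$val_mse), ]
best
```

Because the toy GNN only adds error on top of BM, the lowest validation loss sits at the corner r_bm = 1, r_gnn = 0, r_mean = 0. That is the "closed gate" pattern described above.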

Diagnose.

library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL
fit <- impute(df, tree300, verbose = FALSE)$fit

# Per-latent-column calibrated gates (since v0.9.1.9002):
fit$r_cal_bm        # r assigned to the BM baseline
fit$r_cal_gnn       # r assigned to the GNN delta
fit$r_cal_mean      # r assigned to the grand mean

An entry of r_cal_gnn that is small (< 0.1) means the gate has effectively closed for that latent column.

Fix. Often there is nothing to fix — the closed gate is evidence of high phylogenetic signal, not a problem. If you suspect the GNN should be helping (e.g. you’ve added covariates, or the trait has known cross-trait structure) but the gate is closed:

  • Check the validation set is not pathologically small (the “small validation set” warning during fitting is a red flag — see ?fit_pigauto “Calibration at small n”).
  • Verify covariates are not all-NA or constant after preprocessing.
  • For ordinal / categorical traits, see Pitfall 2 — the gate may be closing onto a majority-class corner that mode pooling resolves.

See also. ?fit_pigauto (phylo_signal_gate, “Safety floor”), design spec.

4. “How do I know if my dataset has enough phylogenetic signal?”

Symptom. You aren’t sure whether pigauto’s BM kriging baseline will outperform a simple mean impute on your dataset.

Why this matters. pigauto’s BM baseline buys you accuracy in proportion to the phylogenetic signal in the trait. At Pagel’s λ ≈ 0 (no signal), BM kriging reduces to the grand mean across species, so pigauto won’t beat a simple mean baseline; at λ ≈ 1 (strong signal), BM kriging materially outperforms the mean. The Phase 8 signal-strength sweep shows the crossover empirically. Re-running the sweep locally produces the evidence; the deployed Methodology dropdown surfaces it once the bench HTML is regenerated.
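The two limits can be checked with a few lines of textbook kriging in base R. This is a generic sketch, not pigauto's implementation: the 4-species covariance matrix, the trait values, and the helper krige_missing() are hypothetical, and Pagel's λ is applied in the usual way by rescaling the off-diagonal covariances.

```r
# Hypothetical BM covariance for 4 species: (1, 2) and (3, 4) are sister pairs.
C <- matrix(c(1.0, 0.8, 0.2, 0.2,
              0.8, 1.0, 0.2, 0.2,
              0.2, 0.2, 1.0, 0.6,
              0.2, 0.2, 0.6, 1.0), nrow = 4, byrow = TRUE)
y <- c(10, NA, 2, 3)                      # species 2 is missing

krige_missing <- function(C, y, lambda) {
  C_l <- C * lambda
  diag(C_l) <- diag(C)                    # lambda rescales off-diagonals only
  obs  <- !is.na(y)
  C_oo <- C_l[obs, obs, drop = FALSE]     # observed-observed block
  C_mo <- C_l[!obs, obs, drop = FALSE]    # missing-observed block
  w    <- solve(C_oo, rep(1, sum(obs)))
  mu   <- sum(w * y[obs]) / sum(w)        # GLS estimate of the overall mean
  as.numeric(mu + C_mo %*% solve(C_oo, y[obs] - mu))
}

pred_no_signal   <- krige_missing(C, y, lambda = 0)  # equals mean(c(10, 2, 3)) = 5
pred_full_signal <- krige_missing(C, y, lambda = 1)  # pulled toward sister species 1
c(no_signal = pred_no_signal, full_signal = pred_full_signal)
```

At λ = 0 the covariance is diagonal, the missing species borrows nothing from its relatives, and the prediction is the plain mean of the observed values. At λ = 1 the prediction moves well toward the value of the sister species.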

Diagnose. The fitted object stores the per-trait λ values used by phylo_signal_gate:

library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL

fit <- impute(df, tree300)$fit
fit$phylo_signal_per_trait

The output reports λ on the observed cells where it can be estimated. Discrete traits use the package’s internal continuous proxy on the latent/liability scale.

Fix. Use the lambda estimate to set expectations:

  • λ > 0.7: pigauto’s BM baseline alone usually beats grand-mean. GNN may add little if the trait is well-conserved.
  • 0.3 < λ < 0.7: pigauto’s GNN typically helps on top of BM, especially if you have covariates.
  • λ < 0.3: phylogenetic information is weak. Consider whether a phylogenetic imputation method is the right tool at all. A simple mean impute or a covariate-only regression may do as well.
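The bands above can be applied to several traits at once with cut(). The sketch assumes the λ estimates are available as a named numeric vector; the trait names and values below are made up. The bullets leave λ exactly equal to 0.3 or 0.7 unassigned, so putting those boundary values in the lower band (the cut() default) is an arbitrary choice made here.

```r
# Hypothetical per-trait lambda estimates.
lambda <- c(trait_a = 0.92, trait_b = 0.45, trait_c = 0.12)

band <- cut(lambda,
            breaks = c(-Inf, 0.3, 0.7, Inf),
            labels = c("weak signal: question the method",
                       "moderate: GNN typically helps",
                       "strong: BM baseline usually enough"))

guidance <- setNames(as.character(band), names(lambda))
guidance
```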

See also. ?fit_pigauto (phylo_signal_gate), Phase 8 signal sweep memo.

5. “Predictions are way bigger than anything I observed”

Symptom. A masked log-transformed continuous trait (body mass, seed mass, fish weight) predicts a value 50–100× larger than anything observed. On AVONET, the canonical case is the cassowary: truth ≈ 35 kg, predicted up to ~540 kg.

Why this happens. For log-transformed traits, the GNN’s MC-dropout draws live on the log scale. A latent draw roughly 3–4 σ above the training distribution becomes a ~50–100× error on the original scale after the expm1() back-transformation. With n_imputations = 1, a single unlucky dropout pattern can produce this. With pool_method = "median" (the default), the median of M draws is robust to one bad draw, but a small M (≤ 5) on the long tail of the latent distribution can still mis-pool.
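The arithmetic behind this can be reproduced in base R. The sketch works directly in log units (how many log units one latent σ corresponds to depends on the training data and is not assumed here), and the draws are invented for illustration.

```r
truth        <- 35                         # kg, original scale
latent_truth <- log1p(truth)               # value on the log scale

# An additive overshoot on the log scale is a multiplicative error after expm1().
overshoot  <- c(0, 1, 2, 3, 4)             # log units above the truth
pred       <- expm1(latent_truth + overshoot)
fold_error <- pred / truth
round(fold_error, 1)

# Median pooling over M = 5 hypothetical draws: one bad draw is ignored.
draws       <- latent_truth + c(0.10, -0.20, 0.05, 4.00, -0.10)
single_bad  <- expm1(draws[4])             # what n_imputations = 1 could return
pooled      <- expm1(median(draws))        # median of the five draws
c(single_bad = single_bad, pooled = pooled)
```

An overshoot of 4 log units already gives a fold error above 50, while the pooled value stays close to the truth. With M odd, the median gives the same answer whether it is taken before or after back-transformation, because expm1() is monotone.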

Diagnose.

# After running impute(), check whether any imputation exceeds the
# observed maximum by an unrealistic factor:
predicted_mass <- result$completed$Mass[result$imputed_mask[, "Mass"]]
obs_max <- max(df$Mass, na.rm = TRUE)
sum(predicted_mass > 5 * obs_max)

A non-zero count is a signal of tail extrapolation.

Fix. The Phase G option clamp_outliers = TRUE caps post-back-transform predictions for log-transformed continuous, count, and zi_count magnitude traits at obs_max * clamp_factor (default 5). It is opt-in because some datasets (growth curves, for example) legitimately contain values more than 5× the observed maximum, and there the cap is unwanted.

result <- impute(df_obs, tree300,
                 clamp_outliers = TRUE,
                 clamp_factor   = 5,    # cap at 5 x the observed maximum (default)
                 verbose = FALSE)
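The capping rule itself, obs_max * clamp_factor, is simple enough to reproduce by hand when you want to see what it would do before turning it on. This base-R sketch applies the stated rule to invented numbers; it does not show how pigauto applies the cap internally.

```r
obs  <- c(0.5, 2, 35)                  # observed values, original scale (hypothetical)
pred <- c(1.2, 40, 540)                # back-transformed predictions (hypothetical)

clamp_factor <- 5
cap     <- max(obs, na.rm = TRUE) * clamp_factor   # 35 * 5 = 175
clamped <- pmin(pred, cap)                         # values above the cap are set to it

data.frame(pred = pred, clamped = clamped, was_capped = pred > cap)
```

Only the 540 prediction is affected; values below the cap pass through unchanged.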

See also. ?impute (clamp_outliers, clamp_factor arguments), AVONET Mass diagnosis memo, Phase G results.