Suggest which cell to observe next to maximise imputation precision

For a fitted pigauto_result (returned by impute), compute the closed-form expected reduction in total predictive variance across all currently-missing cells if each candidate cell were observed next. Useful for sampling-design guidance: when you have time/budget to measure k more species, this function tells you which ones contribute most to imputation precision.

Usage

suggest_next_observation(
  result,
  top_n = 10L,
  by = c("cell", "species"),
  types = c("continuous", "count", "ordinal", "proportion", "binary", "categorical",
    "zi_count", "multi_proportion")
)

Arguments

result: A pigauto_result object returned by impute. Must have been produced from single-observation data; multi-observation inputs raise an error with a clear message.
top_n: integer, default 10L. Number of suggestions to return (descending by delta).
by: character, one of "cell" (default) or "species". "cell" returns individual (species, trait) pairs. "species" aggregates by species (summing reductions across the species' currently-missing traits).
types: character vector of pigauto trait types to include. Default includes all eight supported types: continuous, count, ordinal, proportion, binary, categorical, zi_count (added v2, 2026-05-01), multi_proportion (added v2).

Value

A data.frame of class "pigauto_active". Columns when by = "cell": species, trait, type, metric ("variance" or "entropy"), delta, delta_var_total (NA for discrete rows), and delta_entropy_total (NA for continuous rows), sorted by delta descending. When by = "species": species, delta_var_total, delta_entropy_total, n_traits_missing, sorted by the SUM of available metrics.

Details

Two metrics are supported, dispatched by trait type:

Continuous-family traits (continuous, count, ordinal, proportion) use the BM variance-reduction formula. It is derived from a Sherman-Morrison rank-1 inverse update on the BM conditional MVN: adding species $s$ to the observed set updates the inverse correlation matrix by a known closed form. For each candidate cell $(s, t)$, $$ \Delta V(s, t) = \sigma_t^2 \sum_{i \in \mathrm{miss}_t} \frac{D_{ik}^2}{\alpha_k} $$ where $\mathrm{miss}_t$ is the set of species currently missing trait $t$, $k$ is the index of candidate species $s$ within $\mathrm{miss}_t$, $D = R_{mm} - R_{mo} R_{oo}^{-1} R_{om}$ is the residual (conditional) correlation matrix at currently-missing cells, $\alpha_k = D_{kk}$ is the current relative leverage of cell $k$, and $\sigma_t^2$ is the REML BM variance for trait $t$.
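The closed form above can be checked numerically. The following is a minimal base-R sketch, not pigauto's internal code: the correlation matrix, the observed/missing split, and `sigma2_t` are invented toy values, and `cond_corr` is a hypothetical helper. It computes $D$, evaluates $\sigma_t^2 \sum_i D_{ik}^2 / D_{kk}$ for every candidate, and compares one candidate against a brute-force recomputation after moving it into the observed set.

```r
# Toy phylogenetic correlation matrix for 5 species (invented values)
set.seed(1)
n <- 5
A <- matrix(rnorm(n * n), n, n)
R <- cov2cor(crossprod(A) + diag(n))  # symmetric positive definite, unit diagonal

obs  <- c(1, 2)      # species with trait t observed
miss <- c(3, 4, 5)   # species with trait t missing
sigma2_t <- 2.5      # assumed BM variance for trait t

# Hypothetical helper: D = R_mm - R_mo R_oo^{-1} R_om
cond_corr <- function(R, obs, miss) {
  R[miss, miss, drop = FALSE] -
    R[miss, obs, drop = FALSE] %*%
    solve(R[obs, obs, drop = FALSE], R[obs, miss, drop = FALSE])
}
D <- cond_corr(R, obs, miss)

# Closed-form reduction for every candidate k at once
delta_v <- sigma2_t * colSums(D^2) / diag(D)
names(delta_v) <- paste0("sp", miss)

# Brute-force check for candidate k = 1: observe it, recompute total variance
k <- 1
D_after <- cond_corr(R, c(obs, miss[k]), miss[-k])
brute <- sigma2_t * (sum(diag(D)) - sum(diag(D_after)))

isTRUE(all.equal(brute, unname(delta_v[k])))
```

The sum includes the term $i = k$, which equals $D_{kk}$: the candidate's own predictive variance is removed entirely once it is observed.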

Discrete traits (binary, categorical) use a label-propagation expected-entropy-reduction formula. The current LP probability at species $i$ is $p_i = \mathrm{sim}[i, \mathrm{obs}] y_{\mathrm{obs}} / \sum \mathrm{sim}[i, \mathrm{obs}]$, with entropy $H(p_i) = -\sum_k p_{i,k} \log p_{i,k}$. After observing $s_{\mathrm{new}}$ with unknown class $y_{\mathrm{new}}$, the new LP probability has a closed form, and the expected entropy is averaged over $P(y_{\mathrm{new}})$ = current LP estimate at $s_{\mathrm{new}}$. Total expected entropy reduction sums across all currently-missing cells (the entropy at $s_{\mathrm{new}}$ itself drops to 0).
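A minimal base-R sketch of this calculation follows. It is an illustration under stated assumptions rather than pigauto's implementation: the similarity matrix and labels are invented, `lp_prob` and `expected_entropy_reduction` are hypothetical helpers, and the updated LP probability is assumed to be the same similarity-weighted average with $s_{\mathrm{new}}$ appended to the observed set.

```r
entropy <- function(p) {
  p <- p[p > 0]                      # treat 0 * log(0) as 0
  -sum(p * log(p))
}

# LP probability per row: similarity-weighted class frequencies
lp_prob <- function(sim_rows, Y) {
  (sim_rows %*% Y) / rowSums(sim_rows)
}

# Toy similarity matrix for 5 species (invented values)
sim <- matrix(c(1.0, 0.6, 0.5, 0.2, 0.1,
                0.6, 1.0, 0.4, 0.3, 0.2,
                0.5, 0.4, 1.0, 0.7, 0.3,
                0.2, 0.3, 0.7, 1.0, 0.6,
                0.1, 0.2, 0.3, 0.6, 1.0), 5, 5, byrow = TRUE)
obs   <- c(1, 2)
miss  <- c(3, 4, 5)
Y_obs <- rbind(c(1, 0), c(0, 1))     # one-hot labels for species 1 and 2

expected_entropy_reduction <- function(s_new) {
  p_now  <- lp_prob(sim[miss, obs, drop = FALSE], Y_obs)
  H_now  <- sum(apply(p_now, 1, entropy))
  p_new  <- p_now[match(s_new, miss), ]   # P(y_new) = current LP estimate
  others <- setdiff(miss, s_new)
  H_after <- 0
  for (cl in seq_len(ncol(Y_obs))) {
    # Assumed update: append s_new with class cl to the observed set
    Y_aug <- rbind(Y_obs, diag(ncol(Y_obs))[cl, ])
    p_aug <- lp_prob(sim[others, c(obs, s_new), drop = FALSE], Y_aug)
    H_after <- H_after + p_new[cl] * sum(apply(p_aug, 1, entropy))
  }
  H_now - H_after                    # entropy at s_new itself drops to 0
}

deltas <- sapply(miss, expected_entropy_reduction)
names(deltas) <- paste0("sp", miss)
```

Because the class of $s_{\mathrm{new}}$ is unknown in advance, the loop weights each hypothetical outcome by the current LP estimate at that species.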

Variance and entropy are NOT directly comparable. The ordering by delta is meaningful within a metric; across metrics it is only approximate. For a strict ranking, filter by metric first.
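Filtering by metric can be done with ordinary data-frame subsetting. The sketch below uses a mock data frame with a subset of the documented by = "cell" columns; the species, traits, and delta values are invented, and `rank_within` is a hypothetical helper, not part of pigauto.

```r
# Mock of the documented columns (invented values)
suggestions <- data.frame(
  species = c("sp_a", "sp_b", "sp_c", "sp_d"),
  trait   = c("mass", "diet", "mass", "migratory"),
  type    = c("continuous", "categorical", "continuous", "binary"),
  metric  = c("variance", "entropy", "variance", "entropy"),
  delta   = c(0.42, 1.10, 0.87, 0.35),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one metric, then sort by delta descending
rank_within <- function(df, which_metric) {
  out <- df[df$metric == which_metric, , drop = FALSE]
  out[order(out$delta, decreasing = TRUE), , drop = FALSE]
}

var_rank <- rank_within(suggestions, "variance")
ent_rank <- rank_within(suggestions, "entropy")
```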

Reductions are summed across the included traits for each species when by = "species", supporting the typical use case where measuring a species observes all of its currently-missing traits at once. When by = "species", the per-trait variance and entropy reductions are summed separately into delta_var_total and delta_entropy_total columns; the delta column is whichever is non-NA (or delta_var_total when both are populated). Cross-type species-level ranking is approximate; see the variance-vs-entropy caveat above.

zi_count (v2): observing a missing zi_count cell reveals the gate value (entropy reduction at the gate column, computed via the LP binary formula) AND, with probability $p_{\mathrm{gate}}$ (current LP estimate at $s_{\mathrm{new}}$), reveals a magnitude (variance reduction at the magnitude column, computed via the BM Sherman-Morrison formula on the gate=1 subset). Output rows for zi_count populate BOTH delta_var_total (= expected magnitude variance reduction = $p_{\mathrm{gate}} \times \Delta V_{\mathrm{mag}}$) AND delta_entropy_total (= gate entropy reduction). metric is set to "variance" so the row sorts on the magnitude scale; delta_entropy_total is available for users who care about gate-uncertainty separately.
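As a small worked illustration of how the two quantities land in one output row, the sketch below assembles a zi_count row from three inputs. The numeric inputs are invented and `zi_count_row` is a hypothetical helper; in pigauto the gate entropy reduction and magnitude variance reduction would come from the LP and BM formulas described above.

```r
# Hypothetical helper: combine gate and magnitude reductions for one cell
zi_count_row <- function(p_gate, delta_v_mag, delta_h_gate) {
  data.frame(
    type                = "zi_count",
    metric              = "variance",            # row sorts on the magnitude scale
    delta_var_total     = p_gate * delta_v_mag,  # expected magnitude variance reduction
    delta_entropy_total = delta_h_gate,          # gate entropy reduction
    stringsAsFactors    = FALSE
  )
}

# Invented inputs: 25% chance the gate is 1, magnitude reduction 4, gate entropy reduction 0.6
row <- zi_count_row(p_gate = 0.25, delta_v_mag = 4, delta_h_gate = 0.6)
row$delta_var_total      # 0.25 * 4 = 1
```

The magnitude term is weighted by $p_{\mathrm{gate}}$ because a magnitude is only revealed when the observed gate turns out to be 1.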

multi_proportion (v2): observing a row reveals all K simplex components simultaneously. Per-component variance reductions are computed via BM Sherman-Morrison on each CLR-z latent column, summed across components. metric is "variance"; delta_var_total is the K-component sum.
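The component sum can be sketched in base R as follows. This assumes, for illustration only, that all K components of a row share the same missingness pattern (so one conditional matrix $D$ applies to every component) and that each latent column has its own BM variance; the values are invented, `clr` is a hypothetical helper, and the z-scaling step is omitted.

```r
# Centred log-ratio transform of compositional rows (each row sums to 1)
clr <- function(P) {
  L <- log(P)
  L - rowMeans(L)
}
P <- rbind(c(0.2, 0.3, 0.5),
           c(0.6, 0.1, 0.3))
Z <- clr(P)                       # each row of Z sums to 0

# Invented inputs for the variance-reduction step
sigma2 <- c(1.2, 0.8, 0.5)        # assumed BM variance per latent column
D <- matrix(c(0.9, 0.3,
              0.3, 0.7), 2, 2)    # assumed conditional matrix among missing species
k <- 1                            # candidate species index within the missing set

# Same closed form per component, then summed across the K components
per_component   <- sigma2 * sum(D[, k]^2) / D[k, k]
delta_var_total <- sum(per_component)
```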

Examples