For a fitted pigauto_result (returned by impute), compute the closed-form expected reduction in total predictive variance across all currently-missing cells if each candidate cell were observed next. Useful for sampling-design guidance: when you have time/budget to measure k more species, this function tells you which ones contribute most to imputation precision.

Usage

suggest_next_observation(
  result,
  top_n = 10L,
  by = c("cell", "species"),
  types = c("continuous", "count", "ordinal", "proportion", "binary", "categorical",
    "zi_count", "multi_proportion")
)

Arguments

result

A pigauto_result object returned by impute. Must have been produced from single-obs data; multi-obs inputs error with a clear message.

top_n

integer, default 10L. Number of suggestions to return (descending by delta).

by

character, one of "cell" (default) or "species". "cell" returns individual (species, trait) pairs. "species" aggregates by species (summing reductions across the species' currently-missing traits).

types

character vector of pigauto trait types to include. Default includes all eight supported types: continuous, count, ordinal, proportion, binary, categorical, zi_count (added v2, 2026-05-01), multi_proportion (added v2).

Value

A data.frame of class "pigauto_active". Columns when by = "cell": species, trait, type, metric ("variance" or "entropy"), delta, delta_var_total (NA for discrete rows), and delta_entropy_total (NA for continuous rows), sorted by delta descending. When by = "species": species, delta_var_total, delta_entropy_total, n_traits_missing, sorted by the SUM of available metrics.

Details

Two metrics are supported, dispatched by trait type:

Continuous-family traits (continuous, count, ordinal, proportion) use the BM variance-reduction formula. It is derived from a Sherman-Morrison rank-1 inverse update on the BM conditional MVN: adding species \(s\) to the observed set updates the inverse correlation matrix by a known closed form. For each candidate cell \((s, t)\), $$ \Delta V(s, t) = \sigma_t^2 \sum_{i \in \mathrm{miss}_t} \frac{D_{ik}^2}{\alpha_k} $$ where \(k\) is the position of species \(s\) within \(\mathrm{miss}_t\), \(D = R_{mm} - R_{mo} R_{oo}^{-1} R_{om}\) is the residual matrix at currently-missing cells, \(\alpha_k = D_{kk}\) is the current relative leverage of cell \(k\), and \(\sigma_t^2\) is the REML BM variance for trait \(t\).
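The closed form can be checked numerically on a toy example. The sketch below is not pigauto's internal code: the correlation matrix `R`, the observed/missing split, `sigma2_t`, and the helper `schur()` are all made-up illustrations. It compares the formula with a brute-force recomputation in which each candidate species is moved into the observed set.

```r
# Toy correlation matrix for 5 species (hypothetical values, not from pigauto).
set.seed(1)
n <- 5
A <- matrix(rnorm(n * n), n, n)
R <- cov2cor(crossprod(A) + diag(n))   # symmetric, positive definite

obs      <- c(1, 2)      # species with trait t observed
miss     <- c(3, 4, 5)   # species with trait t missing
sigma2_t <- 1.5          # assumed BM variance for trait t

# D = R_mm - R_mo R_oo^{-1} R_om (hypothetical helper).
schur <- function(R, obs, miss) {
  R[miss, miss, drop = FALSE] -
    R[miss, obs, drop = FALSE] %*%
      solve(R[obs, obs, drop = FALSE], R[obs, miss, drop = FALSE])
}
D <- schur(R, obs, miss)

# Closed form: one Delta V per candidate k (a column of D).
delta_closed <- sigma2_t * colSums(D^2) / diag(D)

# Brute force: observe candidate k, recompute D, and difference the traces.
delta_brute <- vapply(seq_along(miss), function(k) {
  D_new <- schur(R, c(obs, miss[k]), miss[-k])
  sigma2_t * (sum(diag(D)) - sum(diag(D_new)))
}, numeric(1))

names(delta_closed) <- names(delta_brute) <- paste0("sp", miss)
all.equal(delta_closed, delta_brute)
```

The two vectors agree because the sum over \(i\) includes \(i = k\), which contributes \(D_{kk}\): the candidate's own variance drops to zero once it is observed.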

Discrete traits (binary, categorical) use a label-propagation expected-entropy-reduction formula. The current LP probability at species \(i\) is \(p_i = \mathrm{sim}[i, \mathrm{obs}] y_{\mathrm{obs}} / \sum \mathrm{sim}[i, \mathrm{obs}]\), with entropy \(H(p_i) = -\sum_k p_{i,k} \log p_{i,k}\). After observing \(s_{\mathrm{new}}\) with unknown class \(y_{\mathrm{new}}\), the new LP probability has a closed form, and the expected entropy is averaged over \(P(y_{\mathrm{new}})\), taken to be the current LP estimate at \(s_{\mathrm{new}}\). Total expected entropy reduction sums across all currently-missing cells (the entropy at \(s_{\mathrm{new}}\) itself drops to 0).
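A minimal sketch of this calculation on toy data follows. It assumes the updated LP probability is obtained by simply appending \(s_{\mathrm{new}}\), with a one-hot label, to the observed set; the page does not spell out the closed form, so treat that step as an assumption. The similarity matrix `sim`, the labels `Y_obs`, and the helper names are hypothetical.

```r
# Shannon entropy, treating 0 * log(0) as 0.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# LP probability: similarity-weighted class frequencies over observed species.
lp_prob <- function(w, Y) as.vector(w %*% Y) / sum(w)

set.seed(2)
n   <- 6
sim <- matrix(runif(n * n, 0.1, 1), n, n)
sim <- (sim + t(sim)) / 2                   # symmetric toy similarity matrix
obs   <- 1:3
miss  <- 4:6
Y_obs <- rbind(c(1, 0), c(0, 1), c(1, 0))   # one-hot labels, 2 classes

expected_entropy_reduction <- function(s_new) {
  p_new <- lp_prob(sim[s_new, obs], Y_obs)  # P(y_new) = current LP estimate
  total <- entropy(p_new)                   # entropy at s_new drops to 0
  for (i in setdiff(miss, s_new)) {
    h_now <- entropy(lp_prob(sim[i, obs], Y_obs))
    h_exp <- 0
    for (k in seq_along(p_new)) {
      y_k <- as.numeric(seq_along(p_new) == k)   # s_new revealed as class k
      p_i <- lp_prob(c(sim[i, obs], sim[i, s_new]), rbind(Y_obs, y_k))
      h_exp <- h_exp + p_new[k] * entropy(p_i)
    }
    total <- total + (h_now - h_exp)
  }
  total
}

delta_H <- vapply(miss, expected_entropy_reduction, numeric(1))
names(delta_H) <- paste0("sp", miss)
delta_H
```

Under this assumed update, the contribution from a species other than \(s_{\mathrm{new}}\) is not guaranteed to be positive, since the expected entropy at a neighbour can rise slightly after a new label is added.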

Variance and entropy are NOT directly comparable. Ordering by delta is meaningful within a metric, but the cross-metric ordering is only approximate. When you want a strict ranking, filter by metric first.
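Filtering by metric can be done with ordinary data-frame subsetting. The sketch below uses a small hand-made data frame, `sugg`, as a stand-in for the by = "cell" output; its values and the helper `rank_within()` are hypothetical, and only the column names come from the Value section.

```r
# Hypothetical stand-in for the by = "cell" output (values are made up).
sugg <- data.frame(
  species = c("sp_a", "sp_b", "sp_c", "sp_d"),
  trait   = c("mass", "diet", "mass", "migratory"),
  type    = c("continuous", "categorical", "continuous", "binary"),
  metric  = c("variance", "entropy", "variance", "entropy"),
  delta   = c(0.8, 1.9, 1.2, 0.4),
  stringsAsFactors = FALSE
)

# Keep one metric, then sort by delta so the ranking is on a single scale.
rank_within <- function(x, m) {
  x <- x[x$metric == m, , drop = FALSE]
  x[order(x$delta, decreasing = TRUE), , drop = FALSE]
}

var_rank <- rank_within(sugg, "variance")   # continuous-family cells only
ent_rank <- rank_within(sugg, "entropy")    # discrete cells only
```

With a real result, the same subsetting would be applied to the data frame returned by suggest_next_observation(); requesting a larger top_n first avoids losing rows of the metric of interest to the cross-metric cut-off.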

Reductions are summed across the included traits for each species when by = "species", supporting the typical use case where measuring a species observes all of its currently-missing traits at once. At by = "species", the per-trait variance and entropy reductions are summed separately into delta_var_total and delta_entropy_total columns; the delta column is whichever is non-NA (or delta_var_total when both are populated). Cross-type species-level ranking is approximate; see the variance-vs-entropy caveat above.

zi_count (v2): observing a missing zi_count cell reveals the gate value (entropy reduction at the gate column, computed via the LP binary formula) AND, with probability \(p_{\mathrm{gate}}\) (current LP estimate at \(s_{\mathrm{new}}\)), reveals a magnitude (variance reduction at the magnitude column, computed via the BM Sherman-Morrison formula on the gate=1 subset). Output rows for zi_count populate BOTH delta_var_total (= expected magnitude variance reduction = \(p_{\mathrm{gate}} \times \Delta V_{\mathrm{mag}}\)) AND delta_entropy_total (= gate entropy reduction). metric is set to "variance" so the row sorts on the magnitude scale; delta_entropy_total is available for users who care about gate-uncertainty separately.

multi_proportion (v2): observing a row reveals all K simplex components simultaneously. Per-component variance reductions are computed via BM Sherman-Morrison on each CLR-z latent column, summed across components. metric is "variance"; delta_var_total is the K-component sum.

Examples

if (FALSE) { # \dontrun{
data(avonet300, tree300, package = "pigauto")
res <- impute(avonet300, tree300)
suggest_next_observation(res, top_n = 5)              # top-5 cells
suggest_next_observation(res, top_n = 10, by = "species")  # top-10 species

# Continuous only:
suggest_next_observation(res, top_n = 10,
  types = c("continuous", "count", "ordinal", "proportion"))
} # }