
Suggest which cell to observe next to maximise imputation precision
Source: R/active_impute.R
suggest_next_observation.Rd
For a fitted pigauto_result (returned by impute),
compute the closed-form expected reduction in total predictive
variance across all currently-missing cells if each candidate cell
were observed next. Useful for sampling-design guidance: when you
have time/budget to measure k more species, this function
tells you which ones contribute most to imputation precision.
Arguments
- result
A pigauto_result object returned by impute. Must have been produced from single-obs data; multi-obs inputs error with a clear message.
- top_n
integer, default 10L. Number of suggestions to return (descending by delta).
- by
character, one of "cell" (default) or "species". "cell" returns individual (species, trait) pairs. "species" aggregates by species (summing reductions across the species' currently-missing traits).
- types
character vector of pigauto trait types to include. Default includes all eight supported types: continuous, count, ordinal, proportion, binary, categorical, zi_count (added v2, 2026-05-01), multi_proportion (added v2).
Value
A data.frame of class "pigauto_active". Columns
when by = "cell": species, trait,
type, metric ("variance" or "entropy"),
delta, delta_var_total (NA for discrete rows), and
delta_entropy_total (NA for continuous rows), sorted by
delta descending. When by = "species":
species, delta_var_total, delta_entropy_total,
n_traits_missing, sorted by the SUM of available metrics.
Details
Two metrics are supported, dispatched by trait type:
Continuous-family traits (continuous, count, ordinal, proportion) use the BM variance-reduction formula, derived from a Sherman-Morrison rank-1 inverse update on the BM conditional MVN: adding species \(s\) to the observed set updates the inverse correlation matrix by a known closed form. For each candidate cell \((s, t)\), $$ \Delta V(s, t) = \sigma_t^2 \sum_{i \in \mathrm{miss}_t} \frac{D_{ik}^2}{\alpha_k} $$ where \(k\) is the position of species \(s\) among the cells currently missing for trait \(t\), \(D = R_{mm} - R_{mo} R_{oo}^{-1} R_{om}\) is the residual matrix at currently-missing cells (subscripts \(m\) and \(o\) denote missing and observed species), \(\alpha_k = D_{kk}\) is the current relative leverage of cell \(k\), and \(\sigma_t^2\) is the REML BM variance for trait \(t\).
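The variance-reduction formula can be checked numerically on a toy problem. The sketch below is not pigauto's internal code: the correlation matrix, the observed/missing split and the value of sigma2_t are made-up assumptions. It computes the residual matrix D, evaluates the formula for every candidate, and compares one candidate against a brute-force recomputation of the total conditional variance.

```r
# Toy correlation matrix for 5 species (assumed AR(1)-style values,
# chosen only because they are guaranteed positive definite).
n <- 5
R <- 0.6 ^ abs(outer(seq_len(n), seq_len(n), "-"))
obs  <- c(1, 2)       # species with trait t observed
miss <- c(3, 4, 5)    # species with trait t missing
sigma2_t <- 2.0       # assumed BM variance for trait t

# Residual (Schur-complement) matrix at the currently-missing cells.
D <- R[miss, miss] - R[miss, obs] %*% solve(R[obs, obs], R[obs, miss])

# Closed form for every candidate k: sigma_t^2 * sum_i D[i, k]^2 / D[k, k].
delta_V <- sigma2_t * colSums(D^2) / diag(D)
names(delta_V) <- paste0("sp", miss)

# Brute-force check for species 3: move it to the observed set and
# recompute the total conditional variance over the remaining missing cells.
total_var <- function(obs, miss) {
  Dm <- R[miss, miss, drop = FALSE] -
    R[miss, obs, drop = FALSE] %*%
    solve(R[obs, obs], R[obs, miss, drop = FALSE])
  sigma2_t * sum(diag(Dm))
}
brute <- total_var(obs, miss) - total_var(c(obs, 3), c(4, 5))

all.equal(unname(delta_V[1]), brute)
```

The sum over \(i\) includes the candidate itself, which contributes \(D_{kk}\): the candidate's own predictive variance disappears once it is observed.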
Discrete traits (binary, categorical) use a label-propagation expected-entropy-reduction formula. The current LP probability at species \(i\) is \(p_i = \mathrm{sim}[i, \mathrm{obs}] y_{\mathrm{obs}} / \sum \mathrm{sim}[i, \mathrm{obs}]\), with entropy \(H(p_i) = -\sum_k p_{i,k} \log p_{i,k}\). After observing \(s_{\mathrm{new}}\) with unknown class \(y_{\mathrm{new}}\), the new LP probability has a closed form, and the expected entropy is averaged over \(P(y_{\mathrm{new}})\) = current LP estimate at \(s_{\mathrm{new}}\). Total expected entropy reduction sums across all currently-missing cells (the entropy at \(s_{\mathrm{new}}\) itself drops to 0).
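A minimal sketch of the expected-entropy-reduction calculation on a toy binary trait follows. The similarity matrix and labels are invented, and the updated LP probability is assumed here to be the same similarity-weighted average with \(s_{\mathrm{new}}\) appended to the observed set; the package's actual closed form may differ in detail.

```r
entropy <- function(p) {
  p <- p[p > 0]                      # treat 0 * log(0) as 0
  -sum(p * log(p))
}

# Toy similarity matrix (assumed values) for 4 species.
sim <- matrix(c(1.0, 0.8, 0.3, 0.2,
                0.8, 1.0, 0.4, 0.3,
                0.3, 0.4, 1.0, 0.7,
                0.2, 0.3, 0.7, 1.0), nrow = 4, byrow = TRUE)
obs   <- c(1, 3)
miss  <- c(2, 4)
y_obs <- rbind(c(1, 0), c(0, 1))     # one-hot labels, one row per observed species

# LP probability at species i: similarity-weighted class frequencies.
lp_prob <- function(i, obs, y) {
  w <- sim[i, obs]
  as.vector(w %*% y) / sum(w)
}

expected_reduction <- function(s_new) {
  p_new  <- lp_prob(s_new, obs, y_obs)           # P(y_new)
  others <- setdiff(miss, s_new)
  h_now  <- sum(vapply(miss, function(i) entropy(lp_prob(i, obs, y_obs)),
                       numeric(1)))
  h_after <- 0
  for (cls in seq_along(p_new)) {
    # Pretend s_new was observed in class `cls`, then re-run LP elsewhere.
    y_aug <- rbind(y_obs, as.numeric(seq_along(p_new) == cls))
    h_cls <- sum(vapply(others,
                        function(i) entropy(lp_prob(i, c(obs, s_new), y_aug)),
                        numeric(1)))
    h_after <- h_after + p_new[cls] * h_cls      # entropy at s_new itself is 0
  }
  h_now - h_after
}

sapply(miss, expected_reduction)
```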
Variance and entropy are NOT directly comparable. The
output sorts within each metric and the cross-metric ordering by
delta is approximate. When you want a strict ranking,
filter by metric first.
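One way to obtain a strict ranking is to split the returned data frame by metric before ordering. The sketch below uses a hand-built stand-in with the documented by = "cell" columns; the species, traits and delta values are made up for illustration.

```r
# Mock suggestions table (made-up values, documented column names).
sug <- data.frame(
  species = c("sp_a", "sp_b", "sp_c", "sp_d"),
  trait   = c("mass", "diet", "mass", "migratory"),
  type    = c("continuous", "categorical", "continuous", "binary"),
  metric  = c("variance", "entropy", "variance", "entropy"),
  delta   = c(0.42, 0.90, 0.77, 0.15),
  stringsAsFactors = FALSE
)

# Rank within each metric separately, so variance and entropy never mix.
by_metric <- lapply(split(sug, sug$metric),
                    function(d) d[order(-d$delta), ])

by_metric$variance
by_metric$entropy
```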
Reductions are summed across the included traits for each species
when by = "species", supporting the typical use case where
measuring a species observes all of its currently-missing traits
at once. At by = "species", the per-trait variance and
entropy reductions are summed separately into
delta_var_total and delta_entropy_total columns; the
delta column is whichever is non-NA (or
delta_var_total when both are populated). Cross-type
species-level ranking is approximate – see the
variance-vs-entropy caveat above.
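The species-level aggregation rule can be illustrated on a small mock table. This is a sketch of the described behaviour, not the package's code; the input values and the helper sum_or_na are hypothetical.

```r
# Mock per-cell reductions (made-up values): variance rows carry NA entropy
# and vice versa, as described for by = "cell".
cells <- data.frame(
  species             = c("sp_a", "sp_a", "sp_b", "sp_c"),
  delta_var_total     = c(0.40, NA, 0.25, NA),
  delta_entropy_total = c(NA, 0.30, NA, 0.55),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum the available values, but stay NA when a species
# has no rows of that metric at all.
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)

sp <- data.frame(
  species = sort(unique(cells$species)),
  delta_var_total =
    as.vector(tapply(cells$delta_var_total, cells$species, sum_or_na)),
  delta_entropy_total =
    as.vector(tapply(cells$delta_entropy_total, cells$species, sum_or_na)),
  n_traits_missing = as.vector(table(cells$species)),
  stringsAsFactors = FALSE
)

# delta is whichever total is non-NA, preferring the variance total
# when both are populated.
sp$delta <- ifelse(is.na(sp$delta_var_total),
                   sp$delta_entropy_total, sp$delta_var_total)
sp
```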
zi_count (v2): observing a missing zi_count cell reveals
the gate value (entropy reduction at the gate column, computed via
the LP binary formula) AND, with probability \(p_{\mathrm{gate}}\)
(current LP estimate at \(s_{\mathrm{new}}\)), reveals a
magnitude (variance reduction at the magnitude column, computed
via the BM Sherman-Morrison formula on the gate=1 subset). Output
rows for zi_count populate BOTH delta_var_total (= expected
magnitude variance reduction = \(p_{\mathrm{gate}} \times \Delta
V_{\mathrm{mag}}\)) AND delta_entropy_total (= gate entropy
reduction). metric is set to "variance" so the row
sorts on the magnitude scale; delta_entropy_total is
available for users who care about gate-uncertainty separately.
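As a worked illustration of how a zi_count row is populated, the sketch below combines three assumed inputs: a gate probability, a magnitude variance reduction and a gate entropy reduction. All numbers and the trait name are invented, and setting delta to the variance total is an inference from the sorting rule above, not something the source states.

```r
p_gate       <- 0.7    # assumed current LP estimate that the gate is 1 at s_new
delta_V_mag  <- 1.2    # assumed BM variance reduction on the magnitude column
delta_H_gate <- 0.35   # assumed entropy reduction on the gate column

zi_row <- data.frame(
  species = "sp_new",
  trait   = "parasite_load",                 # hypothetical trait name
  type    = "zi_count",
  metric  = "variance",                      # sorts on the magnitude scale
  delta_var_total     = p_gate * delta_V_mag,  # expected magnitude reduction
  delta_entropy_total = delta_H_gate,
  stringsAsFactors = FALSE
)
zi_row$delta <- zi_row$delta_var_total       # assumed, see lead-in
zi_row
```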
multi_proportion (v2): observing a row reveals all K
simplex components simultaneously. Per-component variance
reductions are computed via BM Sherman-Morrison on each CLR-z
latent column, summed across components. metric is
"variance"; delta_var_total is the K-component sum.
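A rough sketch of the K-component sum follows, under stated assumptions: a plain centred log-ratio transform stands in for the CLR-z latent scale, the correlation matrix and per-component BM variances are made up, and every component is taken to share the same observed/missing split because a row is observed all at once.

```r
# Centred log-ratio transform: each row of the result sums to zero.
clr <- function(x) {
  lx <- log(x)
  lx - rowMeans(lx)
}
comp <- rbind(c(0.2, 0.3, 0.5),
              c(0.6, 0.3, 0.1))     # two observed compositions, K = 3
z <- clr(comp)

# Toy correlation matrix and split (assumed values).
n <- 4
R <- 0.5 ^ abs(outer(seq_len(n), seq_len(n), "-"))
obs  <- c(1, 2)
miss <- c(3, 4)
D <- R[miss, miss] - R[miss, obs] %*% solve(R[obs, obs], R[obs, miss])

sigma2 <- c(1.0, 0.8, 1.5)          # assumed BM variance per latent column

# Rows = candidate species, columns = simplex components.
per_component <- sapply(sigma2, function(s2) s2 * colSums(D^2) / diag(D))
delta_var_total <- rowSums(per_component)
delta_var_total
```

Because the residual matrix is shared across components in this setup, the sum reduces to \(\left(\sum_c \sigma_c^2\right) \sum_i D_{ik}^2 / \alpha_k\).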
Examples
if (FALSE) { # \dontrun{
data(avonet300, tree300, package = "pigauto")
res <- impute(avonet300, tree300)
suggest_next_observation(res, top_n = 5) # top-5 cells
suggest_next_observation(res, top_n = 10, by = "species") # top-10 species
# Continuous only:
suggest_next_observation(res, top_n = 10,
                         types = c("continuous", "count", "ordinal", "proportion"))
} # }