
Randomly designates a fraction of cells as "missing" and splits them into validation and test sets. When a trait_map is supplied, masking operates at the original-trait level: all latent columns belonging to one trait are held out together. This matters for categorical traits, which can span several latent columns; masking only some of them would leave the remaining columns to reveal the held-out value.
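To make the trait-level masking concrete, here is a minimal base-R sketch. It is not pigauto's implementation: `trait_cols` is a hypothetical, simplified stand-in for a trait_map (trait name to latent column indices), and the trait names are invented for illustration.

```r
# Hypothetical stand-in for a trait_map: which latent columns belong to each trait.
# "diet" plays the role of a categorical trait spread over three latent columns.
trait_cols <- list(body_mass = 1L, diet = 2:4, range = 5L)
n <- 20L                      # species (rows)
p <- 5L                       # latent columns
n_traits <- length(trait_cols)
set.seed(1)

# Pick held-out cells in original-trait space (n x n_traits), as linear indices.
trait_cells <- sample.int(n * n_traits, 15)
rc <- arrayInd(trait_cells, .dim = c(n, n_traits))  # columns: row, trait

# Expand every held-out (species, trait) cell to all latent columns of that trait.
mask <- matrix(TRUE, n, p)    # TRUE = observed
for (k in seq_len(nrow(rc))) {
  mask[rc[k, 1], trait_cols[[rc[k, 2]]]] <- FALSE
}

# For the categorical trait, each species has either 0 or all 3 columns hidden.
diet_hidden <- rowSums(!mask[, 2:4])
all(diet_hidden %in% c(0, 3))
#> [1] TRUE
```

Masking per latent column instead could hide one of the three "diet" columns while leaving the other two visible, which is the situation the trait-level behaviour avoids.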

Usage

make_missing_splits(
  X,
  missing_frac = 0.25,
  val_frac = 0.25,
  seed = 555,
  trait_map = NULL,
  mechanism = c("MCAR", "MAR_trait", "MAR_phylo", "MNAR"),
  mechanism_args = list(),
  tree = NULL
)

Arguments

X

numeric matrix (species x latent columns from preprocess_traits). Under "MCAR" only its dimensions are used; the "MAR_trait" and "MNAR" mechanisms also depend on values in X.

missing_frac

numeric. Fraction of all (species, trait) cells to designate as missing (default 0.25).

val_frac

numeric. Fraction of the missing cells assigned to the validation set (default 0.25); the remainder form the test set.

seed

integer. Random seed for reproducibility (default 555).

trait_map

list of trait descriptors (from pigauto_data). If NULL, masking is applied per latent column (v0.1 behaviour).

mechanism

character. Missingness mechanism: "MCAR" (default, uniform random), "MAR_trait" (trait-dependent), "MAR_phylo" (clade-structured), or "MNAR" (value-dependent).

mechanism_args

named list of mechanism-specific parameters:

For "MAR_trait":

driver_col (integer, column index in X that drives missingness; default 1), beta (numeric, severity; default 2.0).

For "MAR_phylo":

n_clades (integer, number of high-missingness clades; default 2), p_clade (numeric, within-clade missingness probability; default 0.7), p_base (numeric, background missingness probability; default 0.1).

For "MNAR":

beta (numeric, severity; default 2.0).

tree

object of class "phylo". Required for mechanism = "MAR_phylo", ignored otherwise.
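As a rough illustration of how missing_frac and val_frac interact under the default "MCAR" mechanism without a trait_map, the following base-R sketch draws cells uniformly and splits them. It is a sketch under stated assumptions, not the package's code: the rounding rule and the order in which cells are assigned to the two sets are guesses.

```r
# Illustrative only: assumed rounding and assignment order, not pigauto's internals.
set.seed(1)
n <- 20
p <- 5                                       # dimensions of X
missing_frac <- 0.25
val_frac <- 0.25

n_missing <- round(missing_frac * n * p)     # 25 of the 100 cells
n_val <- round(val_frac * n_missing)         # 6 validation cells (assumed rounding)

# Uniform draw over all cells, without replacement (the MCAR case).
missing_idx <- sample.int(n * p, n_missing)
val_idx <- sort(missing_idx[seq_len(n_val)])
test_idx <- sort(missing_idx[-seq_len(n_val)])

length(val_idx)
#> [1] 6
length(test_idx)
#> [1] 19
```

With a 20 x 5 matrix and both fractions at 0.25, this yields 6 validation cells, which agrees with the output shown in the Examples section.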

Value

A list with:

val_idx

Integer vector of linear indices of the validation cells (latent space).

test_idx

Integer vector of linear indices of the test cells (latent space).

val_idx_trait

Integer vector of linear indices of the validation cells in original-trait space (if trait_map is supplied).

test_idx_trait

Integer vector of linear indices of the test cells in original-trait space (if trait_map is supplied).

n

Number of species (rows).

p

Number of latent columns.

n_traits

Number of original traits.

mask

Logical matrix (n x p, species x latent columns). TRUE = observed, FALSE = designated missing.

mechanism

Character string of the mechanism used.
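One way the returned pieces might be used together is sketched below. To keep it runnable without the package, `splits` is a hand-built stand-in for the return value, and it assumes that mask is FALSE at both validation and test cells.

```r
# Hand-built stand-in for the list returned by make_missing_splits().
set.seed(1)
X <- matrix(rnorm(100), nrow = 20)
held_out <- sample.int(length(X), 25)
splits <- list(val_idx = held_out[1:6], test_idx = held_out[7:25])
splits$mask <- matrix(TRUE, nrow(X), ncol(X))
# Assumption: every held-out cell (validation or test) is FALSE in the mask.
splits$mask[c(splits$val_idx, splits$test_idx)] <- FALSE

# Hide the held-out cells before fitting an imputation model.
X_train <- X
X_train[!splits$mask] <- NA

# Keep the true values so predictions can be scored on the validation cells.
truth_val <- X[splits$val_idx]

sum(is.na(X_train))
#> [1] 25
```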

Details

The returned index vectors use linear (column-major) indexing. Both original-trait-space and latent-space indices are returned when a trait_map is present.
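Linear column-major indices can be converted back to (row, column) positions with base R's arrayInd(). The short sketch below uses made-up indices for a 20 x 5 matrix and also shows the equivalent arithmetic.

```r
n <- 20L                              # rows (species)
p <- 5L                               # latent columns
idx <- c(1L, 20L, 21L, 100L)          # example linear indices, chosen for illustration

# Convert linear indices to (row, column) pairs.
rc <- arrayInd(idx, .dim = c(n, p))
colnames(rc) <- c("species_row", "latent_col")
rc
#>      species_row latent_col
#> [1,]           1          1
#> [2,]          20          1
#> [3,]           1          2
#> [4,]          20          5

# The same mapping written out: columns are filled top to bottom, left to right.
row <- (idx - 1L) %% n + 1L
col <- (idx - 1L) %/% n + 1L
```

For indices in original-trait space, the same conversion applies with the number of original traits in place of the number of latent columns.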

Examples

X <- matrix(rnorm(100), nrow = 20)
splits <- make_missing_splits(X, missing_frac = 0.25, seed = 1)
length(splits$val_idx)
#> [1] 6

# MAR: missingness depends on another trait
splits_mar <- make_missing_splits(X, mechanism = "MAR_trait",
  mechanism_args = list(driver_col = 1, beta = 2))