
Preprocess trait data: align to tree, encode into latent space
Source: R/preprocess_traits.R
Aligns species in the trait data frame to the tree, detects or accepts trait types (continuous, binary, categorical, ordinal, count, proportion, zi_count), and encodes each trait into a continuous latent matrix.
Usage
preprocess_traits(
  traits,
  tree,
  species_col = NULL,
  trait_types = NULL,
  multi_proportion_groups = NULL,
  log_cols = NULL,
  log_transform = TRUE,
  center = TRUE,
  scale = TRUE,
  covariates = NULL
)
Arguments
- traits
data.frame with species as row names (one row per species), or with a species column identified by species_col (potentially multiple rows per species). Columns may be numeric, integer, factor, ordered, character, or logical.
- tree
object of class "phylo".
- species_col
character. Name of the column in traits that identifies species. When supplied, traits may have multiple rows per species. The column is removed from trait columns before encoding. Default NULL uses row names (one row per species).
- trait_types
named character vector overriding auto-detection, e.g. c(Mass = "continuous", Diet = "categorical"). Valid types: "continuous", "binary", "categorical", "ordinal", "count", "proportion", "zi_count". Proportion and zi_count are override-only (not auto-detected). Unspecified traits are auto-detected. Note that "multi_proportion" is NOT set here; use multi_proportion_groups instead.
- multi_proportion_groups
named list declaring compositional (multi-proportion) trait groups. Each element is a character vector of column names whose row-wise values sum to 1 (e.g. list(diet = c("plants", "insects", "fish"))). The group name becomes a single trait in the output, encoded via centred log-ratio (CLR) + per-component z-score. Group names must NOT match any column in traits. Default NULL.
- log_cols
character vector of continuous trait names to log-transform. Default NULL means auto-detect (log if all observed values are positive). Set to character(0) to disable.
- log_transform
logical. Legacy parameter: if TRUE and log_cols is NULL, log-transform all continuous traits with all-positive values. Overridden by log_cols when both are supplied.
- center
logical. Subtract column means for continuous/count/ordinal? Default TRUE.
- scale
logical. Divide by column SDs for continuous/count/ordinal? Default TRUE.
- covariates
data.frame or numeric matrix of environmental covariates. Covariates are conditioners: they inform imputation but are not themselves imputed, so they must be fully observed (no NAs; if a variable has missing values, put it in traits instead). Must have the same number of rows as traits after alignment to the tree.
- Numeric / integer columns
z-scored automatically.
- Factor / ordered columns
one-hot encoded (K binary columns per factor with K levels). Column names become "var.level".
- Character / logical columns
coerced to factor, then one-hot.
Default NULL (no covariates).
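The covariate rules above can be sketched in base R. This is an illustration only: encode_covariates_sketch is a hypothetical helper, not a pigauto function, and it assumes the sample standard deviation is used for z-scoring.

```r
# Hypothetical helper (not part of pigauto): mimics the documented covariate rules.
encode_covariates_sketch <- function(covariates) {
  covariates <- as.data.frame(covariates, stringsAsFactors = FALSE)
  if (anyNA(covariates)) stop("covariates must be fully observed (no NAs)")
  blocks <- lapply(names(covariates), function(v) {
    x <- covariates[[v]]
    if (is.numeric(x)) {
      # Numeric / integer columns: z-score (assumes sample SD via sd())
      out <- matrix((x - mean(x)) / sd(x), ncol = 1)
      colnames(out) <- v
    } else {
      # Character / logical columns: coerce to factor first
      f <- if (is.factor(x)) x else factor(x)
      # One indicator column per level, named "var.level"
      out <- sapply(levels(f), function(l) as.numeric(f == l))
      out <- matrix(out, nrow = length(f))
      colnames(out) <- paste(v, levels(f), sep = ".")
    }
    out
  })
  do.call(cbind, blocks)
}

# Toy covariates, invented for this example
cov <- data.frame(
  temp = c(10, 20, 30, 40),
  habitat = factor(c("forest", "marsh", "forest", "reef"))
)
Z <- encode_covariates_sketch(cov)
colnames(Z)
```

Each factor with K levels contributes K indicator columns, so every row has exactly one 1 within a factor's block.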
Value
A list of class "pigauto_data" with components:
- X_scaled
Numeric matrix (n_obs x p_latent), latent encoding. When species_col is NULL, n_obs = n_species.
- X_raw
Numeric matrix of continuous traits after optional log but before z-scoring (for backward compatibility).
- X_original
Original data.frame (aligned to tree, before encoding).
- means
Named numeric vector of column means used for z-scoring (continuous/count/ordinal traits only).
- sds
Named numeric vector of column SDs.
- species_names
Character vector of unique species matching tree$tip.label order (length = n_species).
- obs_species
Character vector of species labels per observation (length = n_obs). When multi-obs, can have duplicates. NULL when species_col is NULL.
- obs_to_species
Integer vector (length = n_obs) mapping each observation to its species index in species_names. NULL when species_col is NULL.
- n_species
Integer, number of unique species.
- n_obs
Integer, number of observations (= n_species when single-obs).
- multi_obs
Logical, TRUE when multiple observations per species are present.
- trait_names
Character vector of original trait names.
- latent_names
Character vector of latent column names.
- trait_map
List of trait descriptors (see Details).
- p_latent
Integer, total number of latent columns.
- log_transform
Logical, legacy field (TRUE if any continuous trait was log-transformed).
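As a rough illustration of how X_raw, means, sds, and X_scaled relate for a single continuous trait, the sketch below uses toy values rather than real pigauto output. It assumes a log-transformed trait, the sample standard deviation, and no missing values.

```r
# Toy values standing in for one continuous trait (body mass in grams).
mass <- c(12.5, 40, 310, 1500)

x_raw <- log(mass)                 # role of X_raw: after log, before z-scoring
mu <- mean(x_raw)                  # role of the matching entry in `means`
sigma <- sd(x_raw)                 # role of the matching entry in `sds` (assumed sample SD)
x_scaled <- (x_raw - mu) / sigma   # role of the matching column of X_scaled

# Inverting the encoding: undo the z-score, then undo the log
mass_back <- exp(x_scaled * sigma + mu)
all.equal(mass_back, mass)
```

Keeping means and sds alongside the latent matrix is what makes this inverse step possible for continuous, count, and ordinal traits.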
Details
When each species has one observation (the default), output rows match
tree$tip.label order. When species_col is supplied,
multiple observations per species are supported: the output matrix has
one row per observation, plus an obs_to_species mapping for the
GNN (which operates at species level).
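The observation-to-species mapping can be pictured with base R's match(). The species labels and values below are invented for illustration, and how pigauto consumes the mapping internally is not shown here.

```r
# Toy labels: species_names stands in for the tree$tip.label order
species_names <- c("sp_a", "sp_b", "sp_c")
obs_species <- c("sp_b", "sp_a", "sp_b", "sp_c", "sp_b")  # one label per observation

# Index of each observation's species within species_names
obs_to_species <- match(obs_species, species_names)

# Number of observations per species
n_per_species <- tabulate(obs_to_species, nbins = length(species_names))

# Broadcasting hypothetical species-level values back to observation rows
species_level <- c(sp_a = 0.2, sp_b = -1.1, sp_c = 0.7)
obs_level <- unname(species_level[obs_to_species])
```

Indexing a species-level vector by obs_to_species yields one value per observation row, which is how a species-level quantity can be lined up with an n_obs-row matrix.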
Automatic type detection (when trait_types = NULL) follows the
R class of each column — no user input is required for most data:
| R class | pigauto type |
|---|---|
| numeric (non-integer) | continuous |
| integer | count |
| factor with 2 levels | binary |
| factor (unordered) with >2 levels | categorical |
| ordered / factor(..., ordered = TRUE) | ordinal |
| character | converted to factor, then binary or categorical |
| logical | binary (FALSE = 0, TRUE = 1) |
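The table can be read as a small dispatch on column class. The function below is a hypothetical sketch of that logic, not pigauto's actual detection code. It makes two assumptions the table does not settle: ordered factors are checked before the two-level rule, and unordered factors with fewer than two levels fall through to categorical.

```r
# Hypothetical helper mirroring the class-to-type table; pigauto's own code may differ.
detect_type_sketch <- function(x) {
  if (is.logical(x)) return("binary")
  if (is.character(x)) x <- factor(x)       # character: converted to factor first
  if (is.ordered(x)) return("ordinal")      # assumption: ordered takes precedence
  if (is.factor(x)) {
    return(if (nlevels(x) == 2) "binary" else "categorical")
  }
  if (is.integer(x)) return("count")
  if (is.numeric(x)) return("continuous")
  stop("unsupported column class: ", class(x)[1])
}

# Toy trait table, invented for this example
toy <- data.frame(
  Mass = c(12.5, 40, 310),
  Clutch = c(2L, 4L, 3L),
  Migratory = factor(c("Yes", "No", "Yes")),
  Diet = factor(c("plants", "insects", "fish")),
  IUCN = factor(c("LC", "VU", "NT"), levels = c("LC", "NT", "VU"), ordered = TRUE),
  Nocturnal = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
types <- vapply(toy, detect_type_sketch, character(1))
types
```

The factor checks come before is.integer() because R stores factors as integer codes, so testing the storage type first would misclassify them as counts.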
Two types require an explicit override because they cannot be distinguished from their R class alone:
- "proportion"
A numeric column bounded 0–1 looks identical to continuous. Declare it explicitly: trait_types = c(SurvivalRate = "proportion"). Encoded via qlogis(clamp(x, 0.001, 0.999)).
- "zi_count"
An integer column with excess zeros looks identical to count. Declare it explicitly: trait_types = c(Parasites = "zi_count"). Encoded as a binary zero/non-zero gate plus log1p-z magnitude.
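A minimal base-R sketch of the two override-only encodings follows. The helper names are hypothetical. pmin()/pmax() stand in for clamp(), which is not a base R function. Leaving the magnitude column as NA where the count is zero is an assumption of this sketch, not documented pigauto behaviour.

```r
# Hypothetical helper: logit of a proportion after clamping away from 0 and 1
encode_proportion_sketch <- function(x, lo = 0.001, hi = 0.999) {
  qlogis(pmin(pmax(x, lo), hi))
}

# Hypothetical helper: zero/non-zero gate plus z-scored log1p magnitude
encode_zi_count_sketch <- function(x) {
  gate <- as.numeric(x > 0)
  nz <- !is.na(x) & x > 0
  mag <- rep(NA_real_, length(x))   # assumption: zeros carry no magnitude
  lm <- log1p(x[nz])
  mag[nz] <- (lm - mean(lm)) / sd(lm)
  cbind(gate = gate, magnitude = mag)
}

# Toy data, invented for this example
survival <- c(0, 0.25, 0.5, 1)
p <- encode_proportion_sketch(survival)

parasites <- c(0L, 0L, 3L, 0L, 12L, 1L)
enc <- encode_zi_count_sketch(parasites)
```

Clamping keeps exact 0 and 1 finite on the logit scale, and the gate column separates "is there any count at all" from "how large is it when present".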
Practical examples of type assignment:
- Body mass (numeric, all positive) → continuous, auto-log-transformed.
- Clutch size (integer) → count.
- Migratory (factor with levels "Yes"/"No") → binary.
- Diet (factor with >2 levels) → categorical.
- IUCN status (ordered factor, LC < NT < VU < EN < CR) → ordinal. If left as an unordered factor, it becomes categorical; both are valid depending on the question.
- Parasite load (integer with many zeros) → needs trait_types = c(Parasites = "zi_count").
- Survival rate (numeric, values in 0 to 1) → needs trait_types = c(Survival = "proportion").
Latent encoding per type:
- continuous
optional log(), then z-score (1 latent column)
- binary
0/1 encoding (1 latent column)
- count
log1p(), then z-score (1 latent column)
- ordinal
integer coding (0 to K-1), then z-score (1 latent column)
- categorical
one-hot encoding (K latent columns)
- proportion
qlogis(clamp(x, 0.001, 0.999)), then z-score (1 latent column)
- zi_count
gate (0/1) + log1p-z of non-zeros (2 latent columns)
- multi_proportion
centred log-ratio (CLR) + per-component z-score (K latent columns per group). Rows must sum to 1. Declared via the multi_proportion_groups argument, not trait_types.
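Several of the encodings listed above can be sketched in a few lines of base R. The data are toy values and zscore is a helper defined here for illustration. The CLR step assumes strictly positive components, since log(0) is not finite; how pigauto treats zeros in a composition is not covered by this sketch.

```r
# Helper defined for this sketch (assumes sample SD)
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# continuous: optional log(), then z-score
mass <- c(12.5, 40, 310, 1500)
z_mass <- zscore(log(mass))

# count: log1p(), then z-score
clutch <- c(2L, 4L, 3L, 6L)
z_clutch <- zscore(log1p(clutch))

# ordinal: integer codes 0 to K-1, then z-score
iucn <- factor(c("LC", "VU", "NT", "LC"),
               levels = c("LC", "NT", "VU", "EN", "CR"), ordered = TRUE)
codes <- as.integer(iucn) - 1L
z_iucn <- zscore(codes)

# categorical: one-hot, K columns for K levels
diet <- factor(c("plants", "insects", "fish", "insects"))
onehot <- sapply(levels(diet), function(l) as.numeric(diet == l))

# multi_proportion: centred log-ratio per row, then z-score each component
comp <- rbind(c(0.6, 0.3, 0.1),
              c(0.2, 0.5, 0.3),
              c(0.1, 0.1, 0.8),
              c(0.4, 0.4, 0.2))
colnames(comp) <- c("plants", "insects", "fish")
log_comp <- log(comp)
clr <- log_comp - rowMeans(log_comp)   # subtract each row's mean log
z_clr <- apply(clr, 2, zscore)
```

Each CLR row sums to zero by construction, which is why a K-part composition maps to K latent columns that are then standardised component by component.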
Examples
# Single-obs per species (backward compatible)
data(avonet300, tree300, package = "pigauto")
traits <- avonet300
rownames(traits) <- traits$Species_Key
traits$Species_Key <- NULL
pd <- preprocess_traits(traits, tree300)
dim(pd$X_scaled) # 300 x p_latent
#> [1] 300 14
# Multi-obs per species (via species_col)
pd2 <- preprocess_traits(avonet300, tree300, species_col = "Species_Key")
pd2$n_obs # number of observations
#> [1] 300
pd2$n_species # number of unique species
#> [1] 300