Aligns species in the trait data frame to the tree, detects or accepts trait types (continuous, binary, categorical, ordinal, count, proportion, zi_count), and encodes each trait into a continuous latent matrix.

Usage

preprocess_traits(
  traits,
  tree,
  species_col = NULL,
  trait_types = NULL,
  multi_proportion_groups = NULL,
  log_cols = NULL,
  log_transform = TRUE,
  center = TRUE,
  scale = TRUE,
  covariates = NULL
)

Arguments

traits

data.frame with species as row names (one row per species), or with a species column identified by species_col (potentially multiple rows per species). Columns may be numeric, integer, factor, ordered, character, or logical.

tree

object of class "phylo".

species_col

character. Name of the column in traits that identifies species. When supplied, traits may have multiple rows per species. The column is removed from trait columns before encoding. Default NULL uses row names (one row per species).

trait_types

named character vector overriding auto-detection, e.g. c(Mass = "continuous", Diet = "categorical"). Valid types: "continuous", "binary", "categorical", "ordinal", "count", "proportion", "zi_count". Proportion and zi_count are override-only (not auto-detected). Unspecified traits are auto-detected. Note that "multi_proportion" is NOT set here — use multi_proportion_groups instead.

multi_proportion_groups

named list declaring compositional (multi-proportion) trait groups. Each element is a character vector of column names whose row-wise values sum to 1 (e.g. list(diet = c("plants", "insects", "fish"))). The group name becomes a single trait in the output, encoded via centred log-ratio (CLR) + per-component z-score. Group names must NOT match any column in traits. Default NULL.

log_cols

character vector of continuous trait names to log-transform. Default NULL means auto-detect (log if all observed values are positive). Set to character(0) to disable.

log_transform

logical. Legacy parameter: if TRUE and log_cols is NULL, log-transform all continuous traits with all-positive values. Overridden by log_cols when both are supplied.

center

logical. Subtract column means for continuous/count/ordinal? Default TRUE.

scale

logical. Divide by column SDs for continuous/count/ordinal? Default TRUE.

covariates

data.frame or numeric matrix of environmental covariates. Covariates are conditioners: they inform imputation but are not themselves imputed, so they must be fully observed (no NAs — if a variable has missing values, put it in traits instead). Must have the same number of rows as traits after alignment to the tree.

Numeric / integer columns

z-scored automatically.

Factor / ordered columns

one-hot encoded (K binary columns per factor with K levels). Column names become "var.level".

Character / logical columns

coerced to factor, then one-hot.

Default NULL (no covariates).
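
The one-hot naming rule for factor covariates can be illustrated in base R. This is a minimal sketch of the described behaviour, not pigauto's internal code; the helper `one_hot` and the example variable `habitat` are hypothetical.

```r
# Hypothetical helper: encode a factor as K 0/1 columns named "var.level".
one_hot <- function(x, var) {
  f <- as.factor(x)
  # Compare every value against every level; multiply by 1L to get 0/1 integers.
  m <- outer(as.character(f), levels(f), "==") * 1L
  colnames(m) <- paste(var, levels(f), sep = ".")
  m
}

habitat <- c("forest", "wetland", "forest", "grassland")
enc <- one_hot(habitat, "habitat")
colnames(enc)
#> [1] "habitat.forest"    "habitat.grassland" "habitat.wetland"
```

Each row of `enc` contains exactly one 1, so a factor with K levels contributes K conditioning columns.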

Value

A list of class "pigauto_data" with components:

X_scaled

Numeric matrix (n_obs x p_latent), latent encoding. When species_col is NULL, n_obs = n_species.

X_raw

Numeric matrix of continuous traits after optional log but before z-scoring (for backward compatibility).

X_original

Original data.frame (aligned to tree, before encoding).

means

Named numeric vector of column means used for z-scoring (continuous/count/ordinal traits only).

sds

Named numeric vector of column SDs.

species_names

Character vector of unique species matching tree$tip.label order (length = n_species).

obs_species

Character vector of species labels, one per observation (length = n_obs). With multiple observations per species, labels repeat. NULL when species_col is NULL.

obs_to_species

Integer vector (length = n_obs) mapping each observation to its species index in species_names. NULL when species_col is NULL.

n_species

Integer, number of unique species.

n_obs

Integer, number of observations (= n_species when single-obs).

multi_obs

Logical, TRUE when multiple observations per species are present.

trait_names

Character vector of original trait names.

latent_names

Character vector of latent column names.

trait_map

List of trait descriptors (see Details).

p_latent

Integer, total number of latent columns.

log_transform

Logical, legacy field (TRUE if any continuous trait was log-transformed).

Details

When each species has one observation (the default), output rows match tree$tip.label order. When species_col is supplied, multiple observations per species are supported: the output matrix has one row per observation, plus an obs_to_species mapping for the GNN (which operates at species level).
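
The relationship between obs_species, species_names and obs_to_species can be sketched with base R's match(). The species labels below are made up, and the sketch assumes the mapping uses R's usual 1-based indices; it illustrates the idea rather than reproducing pigauto's code.

```r
species_names <- c("sp_a", "sp_b", "sp_c")          # tree tip order (hypothetical labels)
obs_species   <- c("sp_b", "sp_a", "sp_b", "sp_c")  # one label per observation

# Index of each observation's species within species_names.
obs_to_species <- match(obs_species, species_names)
obs_to_species
#> [1] 2 1 2 3

# Number of observations available for each species.
obs_per_species <- tabulate(obs_to_species, nbins = length(species_names))
obs_per_species
#> [1] 1 2 1
```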

Automatic type detection (when trait_types = NULL) follows the R class of each column — no user input is required for most data:

R class                                  pigauto type
numeric (non-integer)                    continuous
integer                                  count
factor with 2 levels                     binary
factor (unordered) with >2 levels        categorical
ordered / factor(..., ordered = TRUE)    ordinal
character                                converted to factor, then binary or categorical
logical                                  binary (FALSE = 0, TRUE = 1)
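
The class-based rules in the table can be mirrored by a small base R function. This is a sketch for checking what type a column is likely to receive; `guess_type` is a hypothetical helper, not a pigauto function, and it assumes that an ordered factor is treated as ordinal even when it has only two levels.

```r
# Hypothetical helper that follows the table row by row.
guess_type <- function(x) {
  if (is.character(x)) x <- factor(x)   # character is converted to factor first
  if (is.logical(x)) return("binary")
  if (is.ordered(x)) return("ordinal")  # assumption: checked before the 2-level rule
  if (is.factor(x)) return(if (nlevels(x) == 2) "binary" else "categorical")
  if (is.integer(x)) return("count")
  if (is.numeric(x)) return("continuous")
  stop("unsupported column class: ", class(x)[1])
}

traits <- data.frame(
  Mass      = c(12.5, 340, 8.1),
  Clutch    = c(2L, 4L, 3L),
  Migratory = factor(c("Yes", "No", "Yes")),
  Diet      = factor(c("plants", "insects", "fish")),
  Status    = factor(c("LC", "NT", "VU"), levels = c("LC", "NT", "VU"), ordered = TRUE),
  Nocturnal = c(TRUE, FALSE, FALSE)
)

detected <- vapply(traits, guess_type, character(1))
detected
```

Running a check like this before preprocessing makes it easy to spot columns stored with the wrong class, for example counts read in as doubles.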

Two types require an explicit override because they cannot be distinguished from their R class alone:

"proportion"

A numeric column bounded 0–1 looks identical to continuous. Declare it explicitly: trait_types = c(SurvivalRate = "proportion"). Encoded via qlogis(clamp(x, 0.001, 0.999)), then z-scored.

"zi_count"

An integer column with excess zeros looks identical to count. Declare it explicitly: trait_types = c(Parasites = "zi_count"). Encoded as a binary zero/non-zero gate plus the z-scored log1p() magnitude of the non-zero counts.
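
Both encodings can be approximated in base R. The sketch below is illustrative only: clamp() is not a base R function, so pmin()/pmax() stand in for it, and the helpers `encode_proportion` and `encode_zi_count` are hypothetical. Leaving the magnitude as NA for zero counts is an assumption of this sketch, not documented pigauto behaviour, and the final z-scoring of the logit is omitted.

```r
# Clamp to [lo, hi], then apply the logit; avoids -Inf/Inf at exactly 0 or 1.
encode_proportion <- function(x, lo = 0.001, hi = 0.999) {
  qlogis(pmin(pmax(x, lo), hi))
}

# Gate column (0 = zero count, 1 = non-zero) plus z-scored log1p of the non-zeros.
encode_zi_count <- function(x) {
  gate <- as.integer(x > 0)
  mag <- rep(NA_real_, length(x))   # assumption: magnitude undefined for zeros
  nz <- !is.na(x) & x > 0
  lg <- log1p(x[nz])
  mag[nz] <- (lg - mean(lg)) / sd(lg)
  cbind(gate = gate, magnitude = mag)
}

survival  <- c(0, 0.25, 0.5, 1)
parasites <- c(0L, 0L, 3L, 12L, 0L, 1L)

logit_survival <- encode_proportion(survival)
zi <- encode_zi_count(parasites)
```

Splitting the count into a gate and a magnitude lets the presence of any parasites and the size of the load be modelled separately.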

Practical examples of type assignment:

  • Body mass (numeric, all positive) → continuous, auto-log-transformed.

  • Clutch size (integer) → count.

  • Migratory (factor with levels "Yes"/"No") → binary.

  • Diet (factor with >2 levels) → categorical.

  • IUCN status (ordered factor, LC < NT < VU < EN < CR) → ordinal. If left as an unordered factor, it becomes categorical — both are valid depending on the question.

  • Parasite load (integer with many zeros) → needs trait_types = c(Parasites = "zi_count").

  • Survival rate (numeric, values in 0 to 1) → needs trait_types = c(Survival = "proportion").
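
The body mass case (continuous, auto-log-transformed, then z-scored) can be worked through by hand in base R. This is a sketch of the described steps with made-up values, not the package's implementation; it also shows how a stored mean and SD allow values to be mapped back to the original scale.

```r
mass <- c(12.5, NA, 340, 8.1)   # hypothetical body masses with one missing value

# Auto-log rule: log only if every observed value is positive.
use_log <- all(mass > 0, na.rm = TRUE)
x <- if (use_log) log(mass) else mass

# z-score using observed values only; missing entries stay NA.
mu <- mean(x, na.rm = TRUE)
sigma <- sd(x, na.rm = TRUE)
mass_latent <- (x - mu) / sigma

# Undo the z-score and the log to recover the original scale.
mass_back <- exp(mass_latent * sigma + mu)
```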

Latent encoding per type:

continuous

optional log(), then z-score (1 latent column)

binary

0/1 encoding (1 latent column)

count

log1p(), then z-score (1 latent column)

ordinal

integer coding (0 to K-1), then z-score (1 latent column)

categorical

one-hot encoding (K latent columns)

proportion

qlogis(clamp(x, 0.001, 0.999)), then z-score (1 latent column)

zi_count

gate (0/1) + log1p-z of non-zeros (2 latent columns)

multi_proportion

centred log-ratio (CLR) + per-component z-score (K latent columns per group). Rows must sum to 1. Declared via the multi_proportion_groups argument, not trait_types.
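
The multi_proportion encoding can be sketched in base R as follows. The `clr` helper and the diet values are hypothetical, and the sketch assumes strictly positive components; how pigauto handles zero components is not covered here.

```r
# Centred log-ratio: log of each component minus the row mean of the logs.
clr <- function(p) {
  lp <- log(p)
  lp - rowMeans(lp)   # the length-nrow vector recycles down columns, so row i loses its own mean
}

diet <- rbind(
  c(plants = 0.6, insects = 0.3, fish = 0.1),
  c(plants = 0.2, insects = 0.5, fish = 0.3),
  c(plants = 0.1, insects = 0.1, fish = 0.8)
)
stopifnot(all(abs(rowSums(diet) - 1) < 1e-8))   # rows must sum to 1

diet_clr    <- clr(diet)        # each row now sums to 0
diet_latent <- scale(diet_clr)  # per-component z-score: K latent columns
```

The CLR step removes the sum-to-1 constraint, so the K components can then be centred and scaled like ordinary continuous columns.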

Examples

# Single-obs per species (backward compatible)
data(avonet300, tree300, package = "pigauto")
traits <- avonet300
rownames(traits) <- traits$Species_Key
traits$Species_Key <- NULL
pd <- preprocess_traits(traits, tree300)
dim(pd$X_scaled)   # 300 x p_latent
#> [1] 300  14

# Multi-obs per species (via species_col)
pd2 <- preprocess_traits(avonet300, tree300, species_col = "Species_Key")
pd2$n_obs      # number of observations
#> [1] 300
pd2$n_species  # number of unique species
#> [1] 300