Fit a stacked-trait, multivariate GLLVM

Top-level entry point for the multivariate site–trait model of Nakagawa et al. (in prep) functional biogeography (and the related behavioural-syndromes / phylogenetic / systematic-mapping reduced-rank models). Long-format input with one row per (unit, trait) observation; glmmTMB-style formula syntax for fixed effects, plus a 3 x 5 grid of covariance-structure keywords organised by correlation (none / phylo / spatial) and mode (scalar / unique / indep / dep / latent):

Usage

gllvmTMB(
  formula,
  data,
  trait = "trait",
  unit = "site",
  unit_obs = "site_species",
  cluster = "species",
  family = gaussian(),
  weights = NULL,
  mesh = NULL,
  phylo_vcv = NULL,
  phylo_tree = NULL,
  known_V = NULL,
  lambda_constraint = NULL,
  control = gllvmTMBcontrol(),
  silent = TRUE,
  site = NULL,
  species = NULL
)

Arguments

formula

A glmmTMB-style formula, e.g. value ~ 0 + trait + (0 + trait):env_temp + (0 + trait):env_precip. Fixed effects and any of the 3 x 5 keyword-grid covstructs above are supported (plus phylo_slope() and meta_known_V()).

data

A long-format data frame with one row per (unit, trait) observation. Must include the response (LHS of formula), the columns named in unit, unit_obs, cluster, trait, and any predictors referenced in the formula.

trait

Name of the column holding the trait factor (the "trait" dimension of the unit × trait response matrix). Default "trait".

unit

Name of the column holding the between-unit grouping factor (the "unit" dimension of the unit × trait response matrix). Examples: "site" for site × species data, "individual" for behavioural-syndrome data, "species" for PGLLVM, "paper" for systematic mapping. Default "site".

unit_obs

Name of the column holding the within-unit grouping factor — one level per (unit, replicate) cell — used by latent(0 + trait | unit_obs, ...) and unique(0 + trait | unit_obs) for the W-tier covariance. Default "site_species" (the conventional name in joint species distribution modelling; safe for site × species data). For other domains pass e.g. unit_obs = "obs" for behavioural syndromes.

cluster

Name of the column holding the third grouping factor (the "cluster" slot). When the column name matches the row/column names of phylo_vcv (or the tip labels of phylo_tree), this slot also drives the phylogenetic random effects (phylo_latent, phylo_scalar, phylo_unique, phylo_slope). When the column does not match a phylogenetic correlation, the slot still functions as a regular crossed/nested third grouping (e.g. cluster = "population" for 3-level personality data, cluster = "study" for multi-study meta-analysis). Default "species". For datasets with no third grouping the column may be absent from data; the engine will synthesise a placeholder.

The engine does not enforce nesting between unit_obs, unit, and cluster — crossed and nested designs both fit. Two canonical patterns:

Functional biogeography (crossed): unit = "site", unit_obs = "site_species", cluster = "species". Sites and species are crossed; site_species indexes the intersection cell.
3-level personality (strictly nested): unit = "individual", unit_obs = "session_id", cluster = "population".

family

A family object. The multivariate engine (formulas with latent(), unique(), etc.) supports gaussian(), binomial() (logit / probit / cloglog), poisson() (log link), ordinal_probit() (the gllvmTMB-native ordinal threshold family with sigma_d = 1 fixed exactly; no delta-method approximation), lognormal() (log link), Gamma(link = "log"), nbinom2() (NB2 negative binomial; log link), tweedie() (compound Poisson-Gamma; log link), Beta() (logit link, mean-precision parameterisation; y in (0, 1)), betabinomial() (logit link, mean-precision parameterisation; k-of-n data), student() (heavy-tailed continuous; identity link), truncated_poisson() (zero-truncated count; log link), truncated_nbinom2() (zero-truncated NB2; log link), and the hurdle / two-part families delta_lognormal() / delta_gamma() (Bernoulli for presence + Lognormal/Gamma for the positive component; share one linear predictor under the current implementation – see Details). For multi-trial binomial or beta-binomial (k-of-n) data, write the LHS of formula as cbind(successes, failures) (the canonical R / glm() / glmmTMB convention); the engine then uses the per-row trial count successes + failures as the binomial size. A weights = numeric vector of trial counts is also accepted as an alternative API (see Details). The nbinom2() family fits one log-dispersion parameter per trait (Var = mu + mu^2 / phi). The tweedie() family fits one log-phi and one logit-(p-1) per trait, with p in (1, 2). The Beta() and betabinomial() families each fit one log-precision per trait (a = mu*phi, b = (1-mu)*phi; Smithson & Verkuilen 2006). The student() family fits one log-sigma and one log(df-1) per trait (df > 1; pass student(df = 3) to fix df at a known value). The truncated_*() families require strictly positive integer responses (y >= 1). The delta families fit one log-dispersion of the positive component per trait (sigma_lognormal or phi_gamma).

weights

Optional numeric vector of length nrow(data). Family- conditional semantics (lme4 / glmmTMB convention):

For non-binomial families, weights is interpreted as per- observation likelihood multipliers — each row's log-likelihood contribution is multiplied by weights[i]. This matches the weights argument of lme4::lmer(), glmmTMB::glmmTMB() and stats::glm(). Use this for relative-abundance weighting (sqrt(p_si)), inverse-variance weighting (1 / var_i for heteroscedastic Gaussian fits), or to down-weight outliers. Must be non-negative and finite; weights[i] = 0 zeroes that row's contribution to the joint NLL.
For binomial families fit without a cbind(succ, fail) LHS, weights continues to be interpreted as the per-row trial count (binomial size; alternative API to cbind(succ, fail)). This pre-existing semantics is preserved on a per-row basis: in mixed- family fits, binomial rows use weights[i] as their trial count and non-binomial rows use it as the likelihood multiplier. Default NULL is equivalent to a length-nrow(data) vector of ones (unweighted Bernoulli on binomial rows). Wide matrix calls through gllvmTMB_wide() and wide data-frame calls through traits() normalise their accepted weight shapes to this same long-format vector before fitting.

mesh

Optional mesh object from make_mesh(). Required when the formula includes a spatial() term; ignored otherwise.

phylo_vcv

(legacy global) Optional tip-only n_species × n_species phylogenetic correlation matrix. The canonical syntax is vcv = inside each phylo_*() keyword. — prefer tree =. The dense path inverts via Matrix::solve(), giving the same MLE as the sparse path but at O(n^2) memory and O(n^3) Cholesky cost. Use only when you have a Cphy in hand and no Newick tree (e.g. comparing against nlme::corPagel). See dev/design/01-phylo-api-canonical-tree.md for the full technical justification.

phylo_tree

(legacy global) Optional ape::phylo tree. The canonical Phase L (May 2026) syntax is to pass tree = inside each phylo_*() keyword (e.g. phylo_latent(species, d = K, tree = tree)); this argument is the older outer-level fallback and is slated for soft deprecation. When supplied (in either form) the phylo_*() terms build a sparse \(\mathbf{A}^{-1}\) over tips + internal nodes via MCMCglmm::inverseA(tree) (Hadfield & Nakagawa 2010, appendix eqs. 26-29). The result is ~5n non-zeros for an n-tip tree, vs n^2 for the dense path. This is the recommended path at any n_species; the speedup grows to ~24× at n_species = 1000. Requires the MCMCglmm package.

known_V

Optional list of block-diagonal sampling-error matrices V_t. Used by the equalto() two-stage workflow when sampling-error matrices are available from a prior stage of estimation.

lambda_constraint

Optional list with elements B and / or W, each an n_traits × d matrix of confirmatory loading constraints (galamm-style). NA entries are estimated; numerical entries are pinned. Upper-triangle entries are silently ignored — the engine's lower-triangular parameterisation already fixes those at zero. Default NULL uses the engine's exploratory lower-triangular convention.

control

Output of gllvmTMBcontrol().

silent

Logical; suppress TMB and gllvmTMB chatter. Default TRUE.

site

(deprecated) alias for unit. Kept for backward compatibility. Use unit = ... in new code.

species

(deprecated) alias for cluster. Kept for backward compatibility. Use cluster = ... in new code.

Value

A gllvmTMB object. With no covariance-structure terms in the formula the result has class "gllvmTMB" (single-response engine); with latent() or other covstruct sugar it has class c("gllvmTMB_multi", "gllvmTMB") (multi-trait engine). Either way, S3 methods such as tidy(), predict(), vcov(), logLik() etc. dispatch on gllvmTMB; gllvmTMB_multi- specific methods (e.g. trait-level extract_ICC_site(), extract_communality()) are available for multi-trait fits.

Details

correlation \ mode	scalar	unique	indep	dep	latent
none	(omit)	`unique()`	`indep()`	`dep()`	`latent()`
phylo	`phylo_scalar()`	`phylo_unique()`	`phylo_indep()`	`phylo_dep()`	`phylo_latent()`
spatial	`spatial_scalar()`	`spatial_unique()`	`spatial_indep()`	`spatial_dep()`	`spatial_latent()`

The four "quartet" modes (unique / indep / dep / latent) encode covstruct intent under a strict always-paired-vs-always-alone convention:

latent + unique (paired) — the decomposition mode: \(\boldsymbol\Sigma = \boldsymbol\Lambda \boldsymbol\Lambda^\top + \mathbf S\). unique standalone is also legitimate (e.g. observation-level random effects), but the canonical pattern is paired with latent.
indep (alone) — the marginal-only mode: each trait gets its own variance, no cross-trait covariance.
dep (alone) — the full unstructured mode: \(\boldsymbol\Sigma\) is free with \(T(T+1)/2\) parameters via a Cholesky factor.

Plus the supporting phylo_slope() (random slope), meta_known_V() (known-V meta-analytic), and the engine-internal propto() / equalto() covstructs (used by the canonical keywords above; not typically called directly). See vignette("api-keyword-grid") for a one-paragraph tour of each cell.

gllvmTMB() parses the glmmTMB-style formula, checks the long-format data, and dispatches to the underlying TMB template (built on sdmTMB's SPDE machinery). Covariance-structure terms (latent(), unique(), propto(), equalto(), spatial()) are processed by extending the formula parser and the TMB template.

Per the manuscript, when stacking traits one should set dispformula = ~ 0 so that no implicit residual variance competes with structured unique(0 + trait | …) terms. gllvmTMB() enforces this internally.

Multi-trial binomial. The TMB engine evaluates dbinom(y, n_trials, p) so binomial fits are not restricted to Bernoulli (size = 1). To pass a per-row trial count, use either:

cbind(successes, failures) ~ ... on the LHS of the formula (canonical R / glmmTMB convention; recommended), or
a flat response y with weights = n_trials (the alternative glmmTMB API).

Both interfaces produce identical fits. For Bernoulli data, omit weights and use a 0/1 response; the engine sets n_trials = 1 and the likelihood is identical to the previous Bernoulli-only behaviour.

Delta (hurdle) families. delta_lognormal() and delta_gamma() use a single linear predictor for both components: presence is \(\Pr(y > 0) = \mathrm{invlogit}(\eta)\) and the positive-component mean is \(E[y \mid y > 0] = \exp(\eta)\). This matches sdmTMB's standard delta default (one fixed-effects matrix, one set of trait random effects). The shared-predictor scheme is the simplest hurdle formulation; a future release may decouple the two predictors so presence and abundance can have independent fixed and random effects. Only type = "standard" (the default) is currently wired in the multivariate engine; type = "poisson-link" is on the roadmap. Each delta family carries one per-trait dispersion of the positive component (no extra Bernoulli dispersion). The response must be non-negative.

Per-trait residual variance: when does it activate?

"Residual" in a mixed-effects model is scale-relative: what counts as residual variance shifts as you add levels to the model. In a no-unique Gaussian fit, the residual is row-level noise captured by a single shared sigma_eps. Once you add a per-row unique(0 + trait | obs) term, the row-level residual is now T per-trait random-effect variances and sigma_eps is auto-suppressed to avoid double-counting. Once you also add unique(0 + trait | site) on top, the row-level term remains the residual and the site-level term is now an additional, higher-level random effect — not a residual at all. The dispatch table below records this explicitly for the configurations the engine supports today.

For continuous-family responses (Gaussian, lognormal, Gamma) the engine has one residual scale parameter, sigma_eps. By default sigma_eps is estimated as a single shared scalar across all continuous-family rows — per-trait residual variances only appear if you explicitly add a per-row unique(...) (or, when Phase B lands, indep(...)) term. The dispatch is automatic:

No unique / indep, continuous families present: One shared sigma_eps across all rows. Not per-trait.
No unique / indep, no continuous families: sigma_eps is mapped off; the family's intrinsic dispersion handles the residual.
unique(0 + trait | g) where g has fewer levels than rows (e.g. g = "site"): sigma_eps is still estimated as the row-level residual; the unique term adds a per-trait random effect at level g on top.
unique(0 + trait | obs) at the per-row level (one level per row), continuous families fitted: sigma_eps is auto-suppressed (mapped off, fixed at a tiny stabiliser); the T per-trait unique random effects are the residual. A one-shot cli::cli_inform fires at fit time announcing the auto-suppression.
unique(0 + trait | obs) at the per-row level, non-Gaussian or mixed-family fit: Treated as observation-level random effects (OLRE). For Bernoulli traits the OLRE is statistically unidentifiable and is mapped off; for hurdle / delta families a warning is emitted (see "Per-family-aware OLRE selection" below).
indep(0 + trait | obs) at the per-row level (Phase B, not yet implemented): Mathematically equivalent to the per-row unique case; same auto-suppression dispatch.

Mnemonic: continuous-family sigma_eps is the default; per-row unique / indep replaces it with T per-trait residuals; non-per-row unique / indep adds a higher-level random effect on top of sigma_eps; non-continuous families never carry sigma_eps regardless.

Per-family-aware OLRE selection. When unique(0 + trait | <unit_obs>) is at per-row resolution (i.e. one row per (trait, unit_obs) cell), the resulting per-trait random effects on the linear predictor are an observation-level random effect (OLRE). The engine now decides per trait what to do with the OLRE variance based on the trait's response family:

single-trial Bernoulli (binomial(), all rows have n_trials == 1): theta_diag_W[t] and the corresponding s_W column are mapped off. The OLRE is statistically unidentifiable here (no within-cell replication) and the MLE is the trivial boundary \(\sigma_W \to 0\); pinning the parameter at \(\sigma_W \approx 10^{-6}\) removes the spurious free parameter and emits a one-shot informational message. Multi-trial binomial (via cbind(succ, fail) or weights = n_trials) is identifiable and fit normally.
delta / hurdle families (delta_lognormal(), delta_gamma()): OLRE is fit as before, but a one-shot warning is emitted because the OLRE acts on the shared linear predictor of the hurdle and mixes presence and positive-component noise.
all other families: OLRE is fit as before.

In a mixed-family fit (per-row family list) the per-trait decision applies only to the affected trait. References: Nakagawa & Schielzeth (2010) Biol. Rev. 85: 935-956; Nakagawa, Johnson & Schielzeth (2017) J. R. Soc. Interface 14: 20170213.

Examples

if (FALSE) { # \dontrun{
set.seed(1)
sim <- simulate_site_trait(n_sites = 50, n_species = 8, n_traits = 3,
                           mean_species_per_site = 5)
fit <- gllvmTMB(
  value ~ 0 + trait + (0 + trait):env_1 + (0 + trait):env_2,
  data   = sim$data,
  family = gaussian()
)
summary(fit)
} # }