Adding distribution families • drmTMB

This article describes the workflow for adding a distribution family to drmTMB. It is written for contributors, not end users. A new family is not a single constructor in R/family.R; it is a contract across symbolic equations, formula grammar, R builders, TMB likelihood code, simulation, tests, documentation, pkgdown, and after-task notes.

Read Formula grammar first for the current parser vocabulary; this article is about adding a family the grammar does not yet support.

The planning references are docs/design/02-family-registry.md, docs/design/03-likelihoods.md, docs/design/05-testing-strategy.md, and docs/design/19-family-link-contract.md, plus .agents/skills/add-family/SKILL.md.

Scope first

Every new family must stay inside the package boundary:

one response or two responses only;
one formula per distributional parameter;
explicit parameter names such as mu, sigma, nu, rho12, and future second-shape tau;
no higher-dimensional multivariate response models;
no family is complete without simulation tests and documentation.

In this article, location means a mean-like parameter such as mu; scale means a residual spread or dispersion-like parameter such as sigma; shape means a family-specific parameter such as Student-t nu; and coscale means residual coupling, represented in the bivariate Gaussian model by rho12.

Do not add a family only because another package has it. Add one when it serves a distributional-regression question that drmTMB can answer clearly.

Start with the mathematical contract

Before touching code, write the model as equations. The implemented Student-t family is the simplest current example of a location-scale-shape family:

y_i | mu_i, sigma_i, nu_i ~ Student-t(mu_i, sigma_i, nu_i)
mu_i = X_mu[i, ] beta_mu
log(sigma_i) = X_sigma[i, ] beta_sigma
nu_i = 2 + exp(X_nu[i, ] beta_nu)

The matching R syntax is:

fit <- drmTMB(
  drm_formula(
    y ~ x1,
    sigma ~ x2,
    nu ~ x3
  ),
  family = student(),
  data = dat
)

This equation-syntax pair decides almost everything that follows. It tells the parser which distributional parameters are legal, the builder which model matrices to create, the TMB template which transformed parameters to use, and the tests what data-generating process to simulate.

For a two-response family, also define what any residual coupling parameter means. The implemented all-Gaussian composed family uses:

[y1_i, y2_i]' ~ MVN([mu1_i, mu2_i]', Omega_i)
Omega_i =
  [ sigma1_i^2                         rho12_i sigma1_i sigma2_i ]
  [ rho12_i sigma1_i sigma2_i          sigma2_i^2                ]
rho12_i = tanh(eta_rho12_i)

The TMB likelihood can still add a tiny numerical guard to this transform when needed for positive-definite covariance matrices. Keep the family-facing model equation readable, then document numerical guards in the likelihood implementation notes and tests.

with:

fit <- drmTMB(
  drm_formula(
    mu1 = y1 ~ x1 + x2,
    mu2 = y2 ~ x1,
    sigma1 = ~ x1 + x2,
    sigma2 = ~ x1,
    rho12 = ~ x1 + x2
  ),
  family = c(gaussian(), gaussian()),
  data = dat
)

For later mixed composed families, such as a Gaussian response plus a count response, do not assume rho12 is automatically meaningful. The family design must say whether rho12 is an observed residual correlation, a latent residual correlation, a copula parameter, or unsupported.

Define the registry entry

Each family needs a small structured contract. The design document lists these fields:

name
n_response
dpars
links
inverse_links
bounds
density_id
simulate
starting_values
check_data

The current lightweight R family constructors expose the pieces that are needed by the implemented builders:

student <- function() {
  structure(
    list(
      name = "student",
      family = "student",
      n_response = 1L,
      dpars = c("mu", "sigma", "nu"),
      links = c(mu = "identity", sigma = "log", nu = "logm2")
    ),
    class = "drm_family"
  )
}

Use the canonical GAMLSS-style names unless there is a strong reason not to:

mu     location or mean-like parameter
sigma  scale, residual SD, or dispersion-like parameter
nu     first shape parameter
tau    second shape parameter

Aliases such as skew, df, or shape should wait until the canonical grammar is stable. If an alias is ever added, tests must show that it resolves to the same internal parameter before model matrices are built.

Do not assume identity links

The name mu does not automatically mean “identity-linked arithmetic mean”. That is true for Gaussian models, but it is not a general rule.

For the implemented lognormal family:

log(y_i) | mu_i, sigma_i ~ Normal(mu_i, sigma_i^2)
E[y_i] = exp(mu_i + sigma_i^2 / 2)

predict(fit, dpar = "mu") returns the mean of log(y), while fitted(fit) returns the arithmetic mean of y. The implemented Gamma mean-CV route uses log(mu) so mu is positive and equal to E[y]. The implemented beta mean-scale route uses logit(mu) because mu is a probability.

Before adding another family with non-identity mu, update the family link contract and add tests for all three quantities:

predict(type = "link")
predict(type = "response")
fitted()

Those tests protect the API distinction between a distributional parameter and the expected response.

Connect the family to the builder

A family constructor is not enough. The high-level fitting function has to route the family to a model builder. In broad terms, that builder must:

validate the formula only uses legal distributional parameters;
reject unsupported neighbouring syntax before optimization;
build one model matrix for each distributional parameter;
create starting values and parameter maps;
assemble the data list passed to TMB;
record enough metadata for prediction, simulation, summaries, and checks.

For the Student-t family, ordinary sigma random effects, meta_V(V = V), random-effect scale formulae, and bivariate terms remain unavailable. The formula grammar documents the narrow exceptions separately: selected mu structured paths are recovery-grade and an intercept-only phylogenetic nu path is diagnostic-grade. That distinction matters: a fitted exception is not a blanket random-effect capability. A clear unsupported-feature error is better than a plausible fit whose likelihood has not been implemented.

When adding a family, update formula grammar docs only after the fitting path and tests exist. If syntax is planned but not fitted, label it as planned.

Implement the likelihood

The likelihood should be written on stable transformed scales. For the implemented Student-t family, TMB receives unconstrained linear predictors and uses:

sigma_i = exp(eta_sigma_i)
nu_i = 2 + exp(eta_nu_i)
z_i = (y_i - mu_i) / sigma_i

The log density is:

log f(y_i) =
  lgamma((nu_i + 1) / 2) -
  lgamma(nu_i / 2) -
  0.5 log(nu_i pi) -
  log(sigma_i) -
  0.5 (nu_i + 1) log(1 + z_i^2 / nu_i)

The -log(sigma_i) term is part of the scaled Student-t density. Likelihood constants matter because logLik(), AIC, BIC, comparator tests, and profile likelihoods depend on them.

For a new family, document:

the response-scale parameter definitions;
the link functions and inverse links;
the parameter bounds;
any density constants;
the starting-value strategy;
what is reported and predicted on the response scale.

Add simulation and method support

Every family should be able to simulate from a fitted model. Simulation is not just a convenience for users; it is how we test whether the likelihood and extractors agree with the symbolic model.

For a new univariate family, check these methods:

coef(fit, "mu")
coef(fit, "sigma")
predict(fit, dpar = "mu")
predict(fit, dpar = "sigma")
simulate(fit, nsim = 2, seed = 1)
residuals(fit)
check_drm(fit)
summary(fit)

Add family-specific diagnostics when a parameter has a boundary or biological interpretation risk. The implemented Student-t path adds check_drm() guidance for nu, because very small nu is boundary-prone and very large nu means the likelihood is close to Gaussian.

Write tests before calling the family implemented

A minimal family test set should include four layers.

First, test parameter recovery from a known data-generating process:

set.seed(1)
dat <- simulate_from_known_parameters()

fit <- drmTMB(
  drm_formula(y ~ x1, sigma ~ x2, nu ~ x3),
  family = student(),
  data = dat
)

expect_equal(fit$opt$convergence, 0)
expect_lt(max(abs(coef(fit, "mu") - beta_mu)), tolerance_mu)
expect_lt(max(abs(coef(fit, "sigma") - beta_sigma)), tolerance_sigma)
expect_lt(max(abs(coef(fit, "nu") - beta_nu)), tolerance_nu)

Second, add an independent likelihood check when the density is simple enough:

loglik <- sum(
  stats::dt((y - mu) / sigma, df = nu, log = TRUE) - log(sigma)
)

expect_equal(
  as.numeric(logLik(fit)),
  loglik,
  tolerance = 1e-8
)

See Testing likelihoods for how to check a new density against a reference implementation before treating a family as shipped.

Third, test methods such as prediction, simulation, residuals, summaries, and diagnostics.

Fourth, test rejection paths. Unsupported combinations should fail before TMB optimization with useful messages:

expect_error(
  drmTMB(
    drm_formula(y ~ x + meta_V(V = V), sigma ~ 1),
    family = student(),
    data = dat
  ),
  "not implemented"
)

Use comparator packages only where the parameterization overlaps cleanly. For example, metafor is useful for Gaussian meta-analysis with known sampling covariance, and lme4 is useful for overlapping Gaussian mixed models. Comparator tests should use testthat::skip_if_not_installed() so package checks remain portable.

Update user and developer documentation

A family task is incomplete until the documentation tells the same story as the code and tests.

Update these files when relevant:

R/family.R
R/drmTMB.R
R/methods.R
R/check.R
src/drmTMB.cpp
tests/testthat/test-<family>.R
docs/design/02-family-registry.md
docs/design/03-likelihoods.md
docs/design/05-testing-strategy.md
docs/design/06-distribution-roadmap.md
vignettes/distribution-families.Rmd
vignettes/testing-likelihoods.Rmd
README.md
NEWS.md
_pkgdown.yml

This list is not automatic ownership. Touch only the files affected by the feature. The point is to check whether each public claim remains true.

For exported family constructors, add roxygen documentation and examples. If the family changes the user-facing learning path, add or update a vignette and make sure _pkgdown.yml exposes it in the right section.

Close with an after-task audit

Before merging or pushing a family change, use the after-task audit. The minimum closure record should answer:

What model is now implemented?
Which equations define the likelihood?
Which R syntax is supported?
Which unsupported neighbouring syntax is rejected?
Which simulation, comparator, or independent likelihood checks passed?
Did devtools::document(), devtools::test(), pkgdown::check_pkgdown(), pkgdown::build_site(), and devtools::check() pass?
Which docs, roadmap entries, NEWS bullets, and known-limitations notes were synchronized?
What did not go smoothly?

Write the answer in docs/dev-log/check-log.md and in a compact after-task file under docs/dev-log/after-task/.

The goal is not bureaucracy. It is how drmTMB avoids the usual modelling package failure mode: code, equations, examples, and tests drifting apart while the public API still looks confident.