Supplementary Tutorial: A new effect size for meta-analysis of magnitude: lnM

Worked examples + extra results from the simulation (figures + tables)

Authors

Shinichi Nakagawa

Ayumi Mizuno

Coralie Williams

Malgorzata Lagisz

Rose E. O’Dea

Daniel W. A. Noble

Alistair M. Senior

Erick Lundgren

Santiago Ortega

Published

April 3, 2026

Code
# ------------------------------------------------------------------------------
# SETUP
# This chunk:
#   1) Sets global chunk options (figures, widths)
#   2) Loads required packages
#   3) Defines helper functions used throughout the supplement
# ------------------------------------------------------------------------------

knitr::opts_chunk$set(
  fig.align = "center",
  fig.width = 10,
  fig.height = 6,
  out.width = "100%"
)

suppressPackageStartupMessages({
  library(tidyverse)
  library(metafor)        # escalc(), rma.mv()
  library(here)           # robust file paths
  library(orchaRd)        # orchard_plot(), bubble_plot(), .safe_lnM_indep()
  library(kableExtra)     # HTML tables
  library(patchwork)      # combining ggplots   
  library(multcomp)        # multiple comparisons
})

# ---- Ensure compatible orchaRd version -------------------------------------
orchard_min_version <- "2.1.3"  # version that includes SAFE lnM + seed argument

if (packageVersion("orchaRd") < package_version(orchard_min_version)) {
  stop(
    "This tutorial requires orchaRd >= ", orchard_min_version, ".\n",
    "Your version is ", packageVersion("orchaRd"), ".\n\n",
    "Please update by running:\n",
    "  install.packages('orchaRd')\n",
    "or, if needed:\n",
    "  remotes::install_github('daniel1noble/orchaRd')"
  )
}

# ---- Check that the required SAFE function is available ---------------------
if (!exists(".safe_lnM_indep", where = asNamespace("orchaRd"), inherits = FALSE)) {
  stop(
    "This tutorial requires orchaRd:::.safe_lnM_indep(), but it was not found.\n",
    "Please update orchaRd to a version that includes SAFE lnM."
  )
}

# ---- SAFE lnM (orchaRd) -----------------------------------------------------
# Note on purrr::pmap_dfr():
# pmap_dfr() applies a function row-by-row to several input columns (m1i, m2i, sd1i, ...),
# returning a data frame with one output row per effect size.
#
# Settings below match the simulation defaults used in this project.
# Passing a fixed seed makes each SAFE estimate reproducible.

get_lnM_safe <- function(m1, m2, s1, s2, n1, n2,
                         min_kept = 100000,
                         chunk_init = 5000,
                         seed = 123) {
  out <- orchaRd:::.safe_lnM_indep(
    x1bar = m1, x2bar = m2,
    sd1 = s1, sd2 = s2,
    n1 = n1, n2 = n2,
    min_kept = min_kept,
    chunk_init = chunk_init,
    seed = seed
  )

  tibble::tibble(
    yi_lnM_safe = out$lnM_SAFE,
    vi_lnM_safe = out$var_lnM_SAFE,
    draws_kept  = out$kept,
    draws_total = out$total,
    attempts    = out$attempts,
    status      = out$status
  )
}



# ---- Folded-normal helper for magnitude of SMD ------------------------------
# This converts a signed normal estimate (mu, var) into the expected magnitude
# under the implied folded-normal distribution.

folded_norm <- function(mu, var) {
  sigma  <- sqrt(var)
  if (sigma == 0) return(c(point = abs(mu), var = 0))

  z      <- mu / sigma
  meanFN <- sigma * sqrt(2 / pi) * exp(-0.5 * z^2) +
    mu    * (2 * pnorm(z) - 1)
  varFN  <- mu^2 + var - meanFN^2

  c(point = meanFN, var = varFN)
}

# ---- Delta plug-in lnM (analytic) -------------------------------------------
# IMPORTANT: This requires lnM_delta1_indep() to be available; we source it here
source(here("R", "SAFE_fun.R")) 

get_lnM_raw <- function(m1, m2, s1, s2, n1, n2) {
  out <- lnM_delta1_indep(m1, m2, s1, s2, n1, n2)   # returns c(point, var)
  tibble(
    yi_lnM_raw = out["point"],
    vi_lnM_raw = out["var"]
  )
}

Overview

This supplement has two parts:

Part I — Worked examples (Examples 1–2).

We demonstrate a complete, end-to-end workflow for quantifying effect magnitude using lnM. For each dataset, we first compute a conventional standardized mean difference (SMD) and compare results obtained using absolute SMD–based measures with those obtained using lnM.

We note that we estimate lnM with two approaches:

  1. a Delta plug-in estimator and
  2. the SAFE bootstrap bias-corrected point estimate (SAFE-BC), which is designed to remain usable when the plug-in estimator is undefined (e.g., when \(MS_B \le MS_W\)).

Using these estimates, we then fit multilevel meta-analytic models and meta-regressions and visualize results using the same plot types reported in the main manuscript.

Part II — Simulation reproduction (Figures + Tables).

We reproduce all simulation-based figures and tables reported in the manuscript using pre-computed simulation summaries and, where required, outputs from individual simulation replicates (i.e., results from each Monte Carlo run within a design cell). These results characterise estimator performance in terms of bias, variance behaviour, empirical coverage, root mean square error (RMSE), and Monte Carlo standard error (MCSE).

We additionally provide compact tables summarising estimator feasibility, numerical stability, and SAFE bootstrap acceptance rates across simulation designs.

Part I — Worked examples

Part I illustrates how lnM is computed, analysed, and interpreted in real meta-analytic datasets using two published examples with contrasting data structures.

Example 1 — Almeida et al. nutrition dataset

Almeida et al. (2021) examined whether experimental manipulations of nutrition produce consistently large deviations from control conditions across a range of physiological and performance traits. Because the biological interest lies in the magnitude of responses rather than their direction, this dataset provides a natural setting for comparing analyses based on absolute SMD measures with corresponding analyses based on lnM.

Effect size calculations

This section calculates four effect-size quantities from the same raw inputs (group means, SDs, and sample sizes). Each step adds new columns to dat.

# Load the dataset from CSV files

dat <- read_csv(here("data", "data_almeida.csv"), show_col_types = FALSE)
Note
Type             | Variable names            | Description
Inputs (Group 1) | m1i, sd1i, n1i            | Mean, SD, and sample size for group 1
Inputs (Group 2) | m2i, sd2i, n2i            | Mean, SD, and sample size for group 2
Output (1)       | yi_d, vi_d                | Signed SMD
Output (2)       | yi_d_abs, vi_d_abs        | Magnitude of SMD via folded-normal
Output (3)       | yi_lnM_raw, vi_lnM_raw    | lnM via Delta plug-in
Output (4)       | yi_lnM_safe, vi_lnM_safe  | lnM via SAFE

Conventional SMD (signed)

We first compute the standardized mean difference (SMD). This is the conventional effect size used in many meta-analyses.
metafor::escalc() returns a point estimate (yi_d) and its sampling variance (vi_d).

dat <- escalc(
  measure = "SMD",
  m1i = m1i, m2i = m2i,
  sd1i = sd1i, sd2i = sd2i,
  n1i = n1i, n2i = n2i,
  data = dat, append = TRUE,
  var.names = c("yi_d", "vi_d")
)

Folded-normal magnitude of SMD

Many researchers summarise “effect magnitude” by taking \(|SMD|\). However, absolute values are biased upward, especially when sampling error is large.
Instead of taking \(|SMD|\) directly, we use the folded-normal distribution implied by \((\hat{d}, \mathrm{Var}(\hat{d}))\), yielding yi_d_abs and vi_d_abs.
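To see why naive absolute values overstate magnitude, consider a minimal standalone sketch (simulated values, not from this dataset): even when the true effect is exactly zero, the mean of \(|\hat{d}|\) is positive, and the inflation grows with sampling error.

```r
# Toy illustration: when the true effect is 0, E|d_hat| is not 0.
# For X ~ N(0, sd), E|X| = sd * sqrt(2/pi), so the upward bias scales with sd.
set.seed(1)
d_hat_precise <- rnorm(1e5, mean = 0, sd = 0.1)  # precise studies
d_hat_noisy   <- rnorm(1e5, mean = 0, sd = 0.5)  # noisy studies
mean(abs(d_hat_precise))  # ~ 0.08, not 0
mean(abs(d_hat_noisy))    # ~ 0.40, not 0
```

The folded-normal correction below accounts for exactly this sampling-error-driven inflation.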

dat <- dat %>%
  mutate(
    abs_d = map2_dfr(yi_d, vi_d, ~{
      out <- folded_norm(.x, .y)
      tibble(yi_d_abs = out["point"], vi_d_abs = out["var"])
    })
  ) %>%
  unnest(abs_d)
Note

About map2_dfr() and pmap_dfr() (row-wise function application)
The chunk above uses purrr::map2_dfr(), which applies a function once per row using two columns as inputs (here yi_d and vi_d). purrr::pmap_dfr(), used for the lnM calculations below, generalises this to any number of input columns (e.g., means, SDs, and sample sizes). Both return a data frame with one output row per effect size.
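As a self-contained toy illustration (not part of the analysis), pmap_dfr() passes each row of the named inputs to the function once and stacks the returned one-row data frames:

```r
library(purrr)
library(tibble)

# Each list element is matched to the function argument of the same name;
# row 1 calls f(a = 1, b = 10), row 2 calls f(a = 2, b = 20).
demo <- pmap_dfr(
  list(a = c(1, 2), b = c(10, 20)),
  function(a, b) tibble(sum = a + b, prod = a * b)
)
demo  # sum: 11, 22; prod: 10, 40
```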

lnM via Delta plug-in estimator

We compute lnM using an analytic (Delta-method) plug-in estimator. This can be undefined near boundary conditions, which we diagnose later. The result is stored as yi_lnM_raw with sampling variance vi_lnM_raw.

dat <- dat %>%
  mutate(
    lnM_raw = purrr::pmap_dfr(
      list(m1i, m2i, sd1i, sd2i, n1i, n2i),
      get_lnM_raw
    )
  ) %>%
  unnest(lnM_raw)

lnM via SAFE

We compute lnM using SAFE (single-fit parametric bootstrap), which is designed to remain usable when the plug-in estimator fails. The result is stored as yi_lnM_safe with sampling variance vi_lnM_safe.

dat <- dat %>%
  dplyr::mutate(
    lnM_safe = purrr::pmap_dfr(
      list(m1i, m2i, sd1i, sd2i, n1i, n2i),
      get_lnM_safe
    )
  ) %>%
  tidyr::unnest(lnM_safe)

How many lnM estimates can be obtained?

The Delta plug-in estimator of lnM can be undefined for some effect sizes, particularly near boundary conditions. SAFE is designed to be more robust in these cases by relying on a single-fit parametric bootstrap. The table below summarises (i) how often each method yields a usable lnM point estimate and sampling variance, and (ii) how often SAFE provides valid estimates in cases where the Delta plug-in estimator fails.

Note

Percentages are computed relative to the total number of effect sizes in the dataset.

Because lnM can be undefined when the between-group variance is not estimable, we first quantify how often each estimator returns usable lnM point estimates and sampling variances.

Code
stopifnot(exists("dat"), is.data.frame(dat))

n_total <- nrow(dat)

# ---- Helper counts -----------------------------------------------------------

summary_tbl <- tibble::tibble(
  Metric = c(
    "Total effect sizes",
    "Delta plug-in lnM: point estimate unavailable (undefined)",
    "Delta plug-in lnM: sampling variance unavailable",
    "SAFE lnM: point estimate unavailable",
    "SAFE lnM: sampling variance unavailable",
    "SAFE rescue: point estimate available when Delta plug-in is undefined",
    "SAFE rescue: sampling variance available when Delta plug-in is undefined"
  ),
  Count = c(
    n_total,
    sum(is.na(dat$yi_lnM_raw)),
    sum(is.na(dat$vi_lnM_raw)),
    sum(is.na(dat$yi_lnM_safe)),
    sum(is.na(dat$vi_lnM_safe)),
    sum(is.na(dat$yi_lnM_raw) & !is.na(dat$yi_lnM_safe)),
    sum(is.na(dat$vi_lnM_raw) & !is.na(dat$vi_lnM_safe))
  )
) %>%
  dplyr::mutate(
    Percent = dplyr::if_else(
      Metric == "Total effect sizes",
      NA_real_,
      100 * Count / n_total
    )
  )

# Headline numbers people care about
n_saved <- sum(!is.na(dat$yi_lnM_safe))
p_saved <- 100 * n_saved / n_total

cat(sprintf(
  "\n\n**SAFE produced defined lnM point estimates for %d/%d effect sizes (%.1f%%).**\n\n",
  n_saved, n_total, p_saved
))


**SAFE produced defined lnM point estimates for 683/683 effect sizes (100.0%).**
Code
summary_tbl %>%
  kableExtra::kable(
    digits = 1,
    caption = "Data availability for lnM estimates: Delta plug-in vs SAFE (counts and % of all effect sizes).",
    align = c("l", "r", "r")
  ) %>%
  kableExtra::kable_styling(full_width = FALSE) %>%
  kableExtra::scroll_box(width = "100%", height = "400px")
Data availability for lnM estimates: Delta plug-in vs SAFE (counts and % of all effect sizes).
Metric Count Percent
Total effect sizes 683 NA
Delta plug-in lnM: point estimate unavailable (undefined) 270 39.5
Delta plug-in lnM: sampling variance unavailable 270 39.5
SAFE lnM: point estimate unavailable 0 0.0
SAFE lnM: sampling variance unavailable 0 0.0
SAFE rescue: point estimate available when Delta plug-in is undefined 270 39.5
SAFE rescue: sampling variance available when Delta plug-in is undefined 270 39.5
Tip

Rows labelled “SAFE rescue” count effect sizes for which the Delta plug-in estimator failed but SAFE returned a valid estimate.

This table shows that the Delta plug-in estimator fails to return a defined lnM point estimate or sampling variance for a substantial fraction of effect sizes in this dataset. In contrast, SAFE returns both a point estimate and a sampling variance for all effect sizes. As a result, SAFE “rescues” a large number of comparisons that would otherwise need to be excluded from lnM-based analyses.

Meta-analysis model

In this worked example we focus on body condition outcomes from the Almeida dataset. Almeida et al. report multiple outcome categories (e.g., condition and other trait classes); to keep the tutorial concrete and comparable across methods, we restrict to the Category == "Condition" subset.

We then fit the same multilevel random-effects structure for two magnitude metrics: (i) the folded-normal magnitude of SMD and (ii) lnM estimated with SAFE. In both cases, effect sizes are nested within studies (random intercepts for Study and for the within-study effect-size ID, ES.ID).

dat_cond <- dat %>% filter(Category == "Condition") %>%
  rename(exposure = magnitude)
dim(dat_cond)
[1] 336  33

Absolute SMD

(folded-normal)

We meta-analyse the SMD magnitude using the folded-normal transformation.

ma_abs <- rma.mv(
  yi     = yi_d_abs,
  V      = vi_d_abs,
  random = list(~ 1 | Study, ~ 1 | ES.ID),
  data   = dat_cond,
  method = "REML", test = "t"
)

summary(ma_abs)

Multivariate Meta-Analysis Model (k = 336; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-613.1255  1226.2511  1232.2511  1243.6935  1232.3236   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  1.5845  1.2588     53     no   Study 
sigma^2.2  0.6205  0.7877    336     no   ES.ID 

Test for Heterogeneity:
Q(df = 335) = 2675.5490, p-val < .0001

Model Results:

estimate      se    tval   df    pval   ci.lb   ci.ub      
  1.3451  0.1880  7.1540  335  <.0001  0.9752  1.7149  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i2_ml(ma_abs)
I2_Total I2_Study I2_ES.ID 
95.44936 68.59090 26.85846 
ml_m1(ma_abs)
 M1_Total  M1_Study  M1_ES.ID 
0.5247102 0.4448015 0.2783387 
main <- orchaRd::mod_results(ma_abs, group = "Study")
main <- unclass(main)
main$mod_table$name <- "Overall"
main$data$moderator <- "Overall"
class(main) <- "orchard"

p1 <- orchard_plot(main,  xlab = "abs(SMD)", 
                   group = "Study",
                   trunk.size = 0.6,
                   branch.size = 5) +
  scale_colour_manual(values = rep("grey20",8)) +
  scale_fill_manual(values = "#CC6677") 

p1

lnM

(SAFE)

We meta-analyse lnM using SAFE estimates.

ma_safe <- rma.mv(
  yi     = yi_lnM_safe,
  V      = vi_lnM_safe,
  random = list(~ 1 | Study, ~ 1 | ES.ID),
  data   = dat_cond,
  method = "REML", test   = "t"
)

summary(ma_safe)

Multivariate Meta-Analysis Model (k = 336; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-403.0904   806.1807   812.1807   823.6231   812.2533   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  0.5138  0.7168     53     no   Study 
sigma^2.2  0.3049  0.5521    336     no   ES.ID 

Test for Heterogeneity:
Q(df = 335) = 3627.0378, p-val < .0001

Model Results:

estimate      se     tval   df    pval    ci.lb   ci.ub    
 -0.1445  0.1152  -1.2545  335  0.2105  -0.3710  0.0821    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i2_ml(ma_safe)
I2_Total I2_Study I2_ES.ID 
89.43375 56.13035 33.30340 
ml_m1(ma_safe)
 M1_Total  M1_Study  M1_ES.ID 
0.8622993 0.6831345 0.5262009 
main2 <- orchaRd::mod_results(ma_safe, group = "Study")
main2 <- unclass(main2)
main2$mod_table$name <- "Overall"
main2$data$moderator <- "Overall"
class(main2) <- "orchard"

p2 <- orchard_plot(main2,  xlab = "lnM", 
                   group = "Study",
                   trunk.size = 0.6,
                   branch.size = 5) +
  scale_colour_manual(values = rep("grey20",8)) +
  scale_fill_manual(values = "#117733") 

p2

Note

Interpreting lnM on an SMD scale

For interpretability, we map \(lnM\) onto an SMD-equivalent effect size, \(d_{\mathrm{eq}}\), which is expressed in units of the within-group standard deviation. Using the large-sample approximation:

\(d_{\mathrm{eq}} \approx \sqrt{2} e^{\ln M}\)

Code
lnm_to_deq_approx <- function(model, label = "model") {
  b0_hat <- as.numeric(coef(model)[1])
  se_b0  <- sqrt(vcov(model)[1, 1])

  df_b0 <- tryCatch(as.numeric(model$ddf[1]), error = function(e) NA_real_)
  crit  <- if (is.finite(df_b0)) qt(0.975, df = df_b0) else qnorm(0.975)

  ci_lb <- b0_hat - crit * se_b0
  ci_ub <- b0_hat + crit * se_b0

  deq_hat <- sqrt(2) * exp(b0_hat)
  deq_ci  <- sqrt(2) * exp(c(ci_lb, ci_ub))

  cat(sprintf(
    "%s: d_eq (approx) = %.2f (95%% CI: %.2f to %.2f)\n",
    label, deq_hat, deq_ci[1], deq_ci[2]
  ))
}

lnm_to_deq_approx(ma_safe, label = "Overall model (ma_safe)")
Overall model (ma_safe): d_eq (approx) = 1.22 (95% CI: 0.98 to 1.54)

Because this transformation is monotone (larger lnM always implies larger \(d_{\mathrm{eq}}\)), we can obtain a 95% interval for \(d_{\mathrm{eq}}\) by transforming the lower and upper endpoints of the lnM interval.
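Concretely, applying \(d_{\mathrm{eq}} = \sqrt{2}\, e^{\ln M}\) to the confidence limits of the overall model reported above reproduces the interval printed by lnm_to_deq_approx():

```r
# Transform the lnM CI endpoints from the overall model (ma_safe) directly;
# monotonicity guarantees the transformed endpoints bound d_eq.
lnM_ci <- c(lb = -0.3710, ub = 0.0821)
deq_ci <- sqrt(2) * exp(lnM_ci)
round(deq_ci, 2)  # lb = 0.98, ub = 1.54
```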

In the Almeida dataset, the folded-normal SMD magnitude is constrained to be non-negative and many estimates accumulate near zero, with a small number of extreme values extending the upper tail. lnM, in contrast, is defined on an unbounded scale and spreads effect magnitudes more evenly, making between-study differences in overall magnitude easier to resolve when direction is not of primary interest.

Meta-regression

We next examine whether variation in effect magnitude is associated with study-level covariates. Specifically, we regress each magnitude metric on exposure intensity, exposure duration, and recovery period while retaining the same multilevel structure (random intercepts for Study and within-study effect-size ID, ES.ID).

Absolute SMD

(folded-normal)

mod_abs <- rma.mv(
  yi   = yi_d_abs,
  V    = vi_d_abs,
  mods = ~ exposure + duration + Recovery,
  random = list(
    ~ 1 | Study,
    ~ 1 | ES.ID
  ),
  data   = dat_cond,
  method = "REML", test   = "t"
  )

summary(mod_abs)

Multivariate Meta-Analysis Model (k = 265; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-452.5102   905.0203   917.0203   938.4075   917.3510   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  2.7288  1.6519     41     no   Study 
sigma^2.2  0.3319  0.5761    265     no   ES.ID 

Test for Residual Heterogeneity:
QE(df = 261) = 1769.3579, p-val < .0001

Test of Moderators (coefficients 2:4):
F(df1 = 3, df2 = 261) = 35.0981, p-val < .0001

Model Results:

          estimate      se     tval   df    pval    ci.lb    ci.ub      
intrcpt    -1.6401  0.6135  -2.6734  261  0.0080  -2.8481  -0.4321   ** 
exposure    3.5148  0.4732   7.4281  261  <.0001   2.5831   4.4466  *** 
duration    2.6195  0.3295   7.9494  261  <.0001   1.9707   3.2684  *** 
Recovery   -0.7042  0.4981  -1.4139  261  0.1586  -1.6850   0.2765      

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p3 <- bubble_plot(
  mod_abs, mod = "Recovery",
  ylab = "abs(SMD)",
  xlab = "Recovery period",
  group = "Study"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#CC6677")

p3

lnM

(SAFE)

mod_safe <- rma.mv(
  yi   = yi_lnM_safe,
  V    = vi_lnM_safe,
  mods = ~ exposure + duration + Recovery,
  random = list(
    ~ 1 | Study,
    ~ 1 | ES.ID
  ),
  data   = dat_cond,
  method = "REML", test   = "t"
)

summary(mod_safe)

Multivariate Meta-Analysis Model (k = 265; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-276.0960   552.1920   564.1920   585.5792   564.5227   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  0.6506  0.8066     41     no   Study 
sigma^2.2  0.1384  0.3720    265     no   ES.ID 

Test for Residual Heterogeneity:
QE(df = 261) = 2059.1220, p-val < .0001

Test of Moderators (coefficients 2:4):
F(df1 = 3, df2 = 261) = 25.0490, p-val < .0001

Model Results:

          estimate      se     tval   df    pval    ci.lb    ci.ub      
intrcpt    -1.0267  0.3859  -2.6603  261  0.0083  -1.7867  -0.2668   ** 
exposure    1.6612  0.3023   5.4947  261  <.0001   1.0659   2.2565  *** 
duration    0.8960  0.2280   3.9303  261  0.0001   0.4471   1.3449  *** 
Recovery   -0.8383  0.3962  -2.1157  261  0.0353  -1.6185  -0.0581    * 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p4 <- bubble_plot(
  mod_safe, mod = "Recovery",
  ylab = "lnM",
  xlab = "Recovery period",
  group = "Study", legend.pos = "bottom.left"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#117733")

p4

The folded-normal SMD magnitude is bounded at zero and shows strong compression near the lower end of the scale, which can attenuate moderator slopes. lnM is unbounded and approximately normal, providing greater dynamic range and making moderator relationships (e.g., with recovery period) easier to detect and interpret under the same multilevel structure.

Small-study effects diagnostic

We use an Egger-style meta-regression (Egger et al. 1997) to assess whether estimated effect magnitudes systematically vary with study size, a pattern that can be consistent with small-study effects (including publication bias). In this approach, effect sizes are regressed on a measure of study precision rather than being compared visually in a funnel plot.

For lnM, using the usual sampling standard error as a predictor is problematic because lnM and its estimated standard error are both functions of the same underlying variance components. This creates a structural dependence between the response and predictor that violates the assumptions of Egger-type regressions. Following recent recommendations for multilevel meta-analyses, we therefore use a sample-size-based measure of precision instead (Nakagawa et al. 2022).

We define the harmonic-mean (or “effective”) sample size as:

\[ n_0 = \frac{n_1 n_2}{n_1 + n_2} \]

and fit two related models using:

  • 1 / sqrt(n₀), an SE-analogue, to test for the presence of small-study effects; and
  • 1 / n₀, a variance-analogue, to obtain a bias-adjusted overall effect estimate if evidence of small-study effects is detected.

Both predictors depend only on sample sizes and are therefore independent of the observed lnM values themselves. This allows small-study effects to be assessed without inducing artefactual correlations between effect sizes and their precision.

Worked example: Egger-style models using n₀

# Create sample-size-based predictors using the harmonic mean sample size n0
dat_cond$n0 <- (dat_cond$n1i * dat_cond$n2i) / (dat_cond$n1i + dat_cond$n2i)
dat_cond$nSE <- 1 / sqrt(dat_cond$n0)
dat_cond$nV <- 1 / dat_cond$n0
Note

How to read these models

The models using 1 / sqrt(n₀) test for the presence of small-study effects (Egger-style slope test).
The model using 1 / n₀ provides a bias-adjusted estimate of the overall effect size, evaluated at infinite sample size.
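The adjustment logic can be sketched with simulated data (a toy lm() fit, not the rma.mv() models used in this supplement): when effects are regressed on 1 / n₀, the intercept is the model's prediction at 1 / n₀ = 0, i.e., the effect expected at infinite sample size.

```r
# Toy sketch of bias adjustment via a sample-size predictor (simulated data).
# True "infinite-n" effect is 0.3; small studies inflate effects by 1.5 / n0.
set.seed(7)
n0  <- sample(5:50, 200, replace = TRUE)
eff <- 0.3 + 1.5 / n0 + rnorm(200, sd = 0.05)
fit <- lm(eff ~ I(1 / n0))
coef(fit)[["(Intercept)"]]  # close to 0.3, the bias-adjusted effect
```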

Absolute SMD

(folded-normal; 1/ sqrt(n₀))

egger_abs <- rma.mv(
  yi   = yi_d_abs, 
  V    = vi_d_abs,
  mods = ~ nSE,
  random = list(
    ~ 1 | Study,
    ~ 1 | ES.ID
  ),
  data   = dat_cond,
  method = "REML",
  test   = "t"
)

summary(egger_abs)

Multivariate Meta-Analysis Model (k = 336; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-610.0077  1220.0155  1228.0155  1243.2600  1228.1371   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  1.7731  1.3316     53     no   Study 
sigma^2.2  0.6157  0.7847    336     no   ES.ID 

Test for Residual Heterogeneity:
QE(df = 334) = 2664.7524, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 334) = 2.7804, p-val = 0.0964

Model Results:

         estimate      se    tval   df    pval    ci.lb   ci.ub    
intrcpt    0.7986  0.3840  2.0799  334  0.0383   0.0433  1.5539  * 
nSE        1.0866  0.6516  1.6675  334  0.0964  -0.1952  2.3684  . 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p5 <- bubble_plot(
  egger_abs, mod = "nSE",
  ylab = "abs(SMD)",
  xlab = "sqrt(inverse sample size)",
  group = "Study"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#CC6677")

p5

lnM

(SAFE; 1 / sqrt(n₀))

egger_safe <- rma.mv(
  yi   = yi_lnM_safe, 
  V    = vi_lnM_safe,
  mods = ~ nSE,
  random = list(
    ~ 1 | Study,
    ~ 1 | ES.ID
  ),
  data   = dat_cond,
  method = "REML",
  test   = "t"
)

summary(egger_safe)

Multivariate Meta-Analysis Model (k = 336; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-394.3423   788.6846   796.6846   811.9291   796.8061   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  0.5103  0.7144     53     no   Study 
sigma^2.2  0.2897  0.5383    336     no   ES.ID 

Test for Residual Heterogeneity:
QE(df = 334) = 3608.6350, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 334) = 14.8738, p-val = 0.0001

Model Results:

         estimate      se     tval   df    pval    ci.lb    ci.ub      
intrcpt   -0.9476  0.2383  -3.9758  334  <.0001  -1.4164  -0.4787  *** 
nSE        1.6146  0.4186   3.8567  334  0.0001   0.7911   2.4381  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p6 <- bubble_plot(
  egger_safe, mod = "nSE",
  ylab = "lnM",
  xlab = "sqrt(inverse sample size)",
  group = "Study",legend.pos = "bottom.right"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#117733")

p6

For the folded-normal absolute SMD, the fitted slope is shallow and the bulk of estimates cluster near the lower bound of the scale, indicating limited sensitivity of absolute SMD to study size and little ability to distinguish small-study effects. In contrast, lnM exhibits a clearer positive association with 1 / sqrt(n₀), with smaller effective sample sizes corresponding to larger estimated magnitudes. This pattern indicates the presence of small-study effects for lnM and illustrates that lnM provides a wider dynamic range for diagnosing size-related structure in effect magnitudes under the same multilevel model.

lnM

(SAFE; 1 / n₀ — bias-adjusted estimate)


# Variance-analogue specification for bias adjustment
egger_safe2 <- rma.mv(
  yi   = yi_lnM_safe,
  V    = vi_lnM_safe,
  mods = ~ nV,
  random = list(
    ~ 1 | Study,
    ~ 1 | ES.ID
  ),
  data   = dat_cond,
  method = "REML",
  test   = "t"
)

summary(egger_safe2)

Multivariate Meta-Analysis Model (k = 336; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-394.0818   788.1635   796.1635   811.4081   796.2851   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  0.4968  0.7048     53     no   Study 
sigma^2.2  0.2903  0.5388    336     no   ES.ID 

Test for Residual Heterogeneity:
QE(df = 334) = 3602.1321, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 334) = 15.3951, p-val = 0.0001

Model Results:

         estimate      se     tval   df    pval    ci.lb    ci.ub      
intrcpt   -0.5892  0.1609  -3.6625  334  0.0003  -0.9057  -0.2728  *** 
nV         1.5028  0.3830   3.9237  334  0.0001   0.7494   2.2562  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bubble_plot(
  egger_safe2, mod = "nV",
  ylab = "lnM",
  xlab = "inverse sample size",
  group = "Study"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#117733")

Note

Mapping lnM to an SMD-equivalent effect size

For interpretability, we convert the intercept from the 1 / n₀ model (the bias-adjusted lnM) to an SMD-equivalent magnitude using the large-sample approximation.

lnm_to_deq_approx(egger_safe2, "Egger-style model (egger_safe2)")
Egger-style model (egger_safe2): d_eq (approx) = 0.78 (95% CI: 0.57 to 1.08)

Sensitivity analysis

We repeated the lnM analyses after restricting effect sizes to comparisons with total sample size n₁ + n₂ ≥ 40.

dat_cond_sens <- dat_cond %>% dplyr::filter((n1i + n2i) >= 40)
dim(dat_cond_sens)
[1] 77 36

lnM: meta-analysis (sensitivity subset)

ma_safe_sens <- rma.mv(
  yi     = yi_lnM_safe, 
  V      = vi_lnM_safe,
  random = list(~ 1 | Study, ~ 1 | ES.ID),
  data   = dat_cond_sens,
  method = "REML",
  test   = "t"
)

summary(ma_safe_sens)

Multivariate Meta-Analysis Model (k = 77; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-102.8434   205.6868   211.6868   218.6790   212.0201   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  0.8863  0.9414     16     no   Study 
sigma^2.2  0.4862  0.6973     77     no   ES.ID 

Test for Heterogeneity:
Q(df = 76) = 1804.0553, p-val < .0001

Model Results:

estimate      se     tval  df    pval    ci.lb   ci.ub    
 -0.4138  0.2703  -1.5311  76  0.1299  -0.9521  0.1245    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i2_ml(ma_safe_sens)
I2_Total I2_Study I2_ES.ID 
97.33345 62.85424 34.47921 
orchard_plot(ma_safe_sens, xlab = "lnM", group = "Study")

Note

Mapping lnM to an SMD-equivalent effect size
lnm_to_deq_approx(ma_safe_sens, "Sensitivity model (n₁ + n₂ ≥ 40)")
Sensitivity model (n₁ + n₂ ≥ 40): d_eq (approx) = 0.93 (95% CI: 0.55 to 1.60)

lnM: meta-regression (main vs sensitivity)

To compare moderator patterns, we fit the same meta-regression model to the full dataset and to the sensitivity subset.

Main dataset

mod_safe_main <- rma.mv(
  yi     = yi_lnM_safe, 
  V      = vi_lnM_safe,
  mods   = ~ exposure + duration + Recovery + nSE,
  random = list(~ 1 | Study, ~ 1 | ES.ID),
  data   = dat_cond,
  method = "REML",
  test   = "t"
)

summary(mod_safe_main)

Multivariate Meta-Analysis Model (k = 265; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-274.6719   549.3439   563.3439   588.2687   563.7883   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  0.6567  0.8104     41     no   Study 
sigma^2.2  0.1400  0.3741    265     no   ES.ID 

Test for Residual Heterogeneity:
QE(df = 260) = 2059.1090, p-val < .0001

Test of Moderators (coefficients 2:5):
F(df1 = 4, df2 = 260) = 18.6838, p-val < .0001

Model Results:

          estimate      se     tval   df    pval    ci.lb    ci.ub      
intrcpt    -1.0846  0.4218  -2.5716  260  0.0107  -1.9151  -0.2541    * 
exposure    1.6259  0.3211   5.0640  260  <.0001   0.9937   2.2581  *** 
duration    0.8755  0.2367   3.6988  260  0.0003   0.4094   1.3415  *** 
Recovery   -0.8361  0.3974  -2.1039  260  0.0363  -1.6185  -0.0536    * 
nSE         0.1668  0.4932   0.3381  260  0.7355  -0.8045   1.1380      

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
r2_ml(mod_safe_main)
   R2_marginal R2_conditional 
     0.3130056      0.8793096 
bubble_plot(
  mod_safe_main, mod = "Recovery",
  ylab = "lnM",
  xlab = "Recovery period",
  group = "Study",
  legend.pos = "bottom.left"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#117733")

Sensitivity subset (n₁ + n₂ ≥ 40)

mod_safe_sens <- rma.mv(
  yi     = yi_lnM_safe, 
  V      = vi_lnM_safe,
  mods   = ~ exposure + duration + Recovery + nSE,
  random = list(~ 1 | Study, ~ 1 | ES.ID),
  data   = dat_cond_sens,
  method = "REML",
  test   = "t"
)

summary(mod_safe_sens)

Multivariate Meta-Analysis Model (k = 50; method: REML)

  logLik  Deviance       AIC       BIC      AICc   
-47.0208   94.0415  108.0415  120.6882  111.0686   

Variance Components:

            estim    sqrt  nlvls  fixed  factor 
sigma^2.1  0.7949  0.8916     10     no   Study 
sigma^2.2  0.1128  0.3358     50     no   ES.ID 

Test for Residual Heterogeneity:
QE(df = 45) = 379.4244, p-val < .0001

Test of Moderators (coefficients 2:5):
F(df1 = 4, df2 = 45) = 5.1463, p-val = 0.0017

Model Results:

          estimate      se     tval  df    pval    ci.lb    ci.ub    
intrcpt    -0.5878  1.1798  -0.4982  45  0.6208  -2.9641   1.7885    
exposure    3.7537  1.5867   2.3658  45  0.0224   0.5580   6.9494  * 
duration    0.7689  0.5269   1.4594  45  0.1514  -0.2923   1.8300    
Recovery   -2.6607  1.0002  -2.6603  45  0.0108  -4.6752  -0.6463  * 
nSE        -2.2327  2.5729  -0.8678  45  0.3901  -7.4148   2.9494    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
r2_ml(mod_safe_sens)
   R2_marginal R2_conditional 
     0.4285730      0.9289948 
bubble_plot(
  mod_safe_sens, mod = "Recovery",
  ylab = "lnM",
  xlab = "Recovery period",
  group = "Study",
  legend.pos = "bottom.left"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#117733")

These sensitivity results indicate that lnM-based inferences are robust to the exclusion of small-sample comparisons.

Figure 1

Figure 1 collates the three analyses above (overall model, meta-regression, and small-study effects diagnostic) to highlight how inference differs when magnitude is quantified using folded-normal SMD versus lnM (SAFE).

# Combine panels into a 3 × 2 layout (three rows, two columns):
# Row 1: overall meta-analyses (absolute SMD vs lnM)
# Row 2: meta-regressions (absolute SMD vs lnM)
# Row 3: Egger-style diagnostics (absolute SMD vs lnM)
(p1 + ggtitle("SMD estimates") + p2 + ggtitle("lnM estimates")) / 
  (p3 + p4) / 
  (p5 + p6) + 
  plot_annotation(tag_levels = 'A') & 
  theme(plot.tag = element_text(size = 16, face = "bold"))
Figure 1

Example 2 — Kunc & Schmidt anthropogenic noise dataset

This dataset synthesises experiments testing whether exposure to anthropogenic noise alters animal behaviour or physiology relative to control (quiet) conditions (Kunc and Schmidt 2019). Each effect size contrasts a noise treatment with a control group. In many applications the direction of change can also be important; here, we focus on magnitude to illustrate how lnM can summarise the strength of responses across taxa and response variables.

Effect size calculations

This section mirrors Example 1. We compute a conventional standardized mean difference (here using the unequal-variance variant, SMDH), its folded-normal magnitude, and lnM using both the Delta plug-in and SAFE.

# Load data
dat_raw <- read.csv(here("data", "kunc_2019.csv"), na.strings = c("", "NA")) %>%
  as.data.frame()

# Basic cleaning: keep rows with species info and usable SDs
dat_raw <- dat_raw %>%
  dplyr::filter(!is.na(species.latin), species.latin != "NA") %>%
  dplyr::mutate(
    sd.control = as.numeric(as.character(sd.control)),
    sd.noise   = as.numeric(as.character(sd.noise))
  ) %>%
  dplyr::filter(!is.na(sd.control), !is.na(sd.noise))

dim(dat_raw)
[1] 490  14
Note
Type             | Variable names                                | Description
Inputs (control) | mean.control, sd.control, sample.size.control | Mean, SD, and sample size in the control condition
Inputs (noise)   | mean.noise, sd.noise, sample.size.noise.1     | Mean, SD, and sample size in the noise condition
Output (1)       | yi_d, vi_d                                    | Signed SMDH
Output (2)       | yi_d_abs, vi_d_abs                            | Magnitude of SMDH via the folded-normal transformation
Output (3)       | yi_lnM_raw, vi_lnM_raw                        | lnM via the Delta plug-in
Output (4)       | yi_lnM_safe, vi_lnM_safe                      | lnM via SAFE

Standardized mean difference

We compute an unequal-variance standardized mean difference (SMDH) using metafor::escalc(). This returns a point estimate (yi_d) and sampling variance (vi_d).

dat <- escalc(
  measure = "SMDH",
  m1i = mean.control, sd1i = sd.control, n1i = sample.size.control,
  m2i = mean.noise,   sd2i = sd.noise,   n2i = sample.size.noise.1,
  data = dat_raw, append = TRUE,
  var.names = c("yi_d", "vi_d")
) %>%
  dplyr::filter(is.finite(yi_d), is.finite(vi_d))

Folded-normal magnitude of SMDH

We convert the signed SMDH into a magnitude measure using the folded-normal transformation.

#| label: ex2-folded-normal-smd
#| message: false
#| warning: false

dat <- dat %>%
  dplyr::mutate(
    abs_d = purrr::map2_dfr(yi_d, vi_d, ~{
      out <- folded_norm(.x, .y)
      tibble::tibble(yi_d_abs = out["point"], vi_d_abs = out["var"])
    })
  ) %>%
  tidyr::unnest(abs_d)
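The folded-normal transformation has a closed form. As a point of reference, here is a minimal base-R sketch of the folded-normal mean and variance; `folded_norm_sketch()` is a name invented for this illustration, and the `folded_norm()` helper defined in the setup chunk may differ in its internals (e.g., additional small-sample handling):

```r
# Mean and variance of |X| where X ~ N(mu, sigma2) (folded-normal distribution).
# Applied to an SMD, mu would be yi_d and sigma2 would be vi_d.
folded_norm_sketch <- function(mu, sigma2) {
  s <- sqrt(sigma2)
  m <- s * sqrt(2 / pi) * exp(-mu^2 / (2 * sigma2)) +
       mu * (1 - 2 * pnorm(-mu / s))       # E|X|
  v <- mu^2 + sigma2 - m^2                  # Var|X| = E[X^2] - (E|X|)^2
  c(point = m, var = v)
}

folded_norm_sketch(0, 1)  # point = sqrt(2/pi) ≈ 0.798, var = 1 - 2/pi ≈ 0.363
```

When |mu|/sigma is large the folding barely matters and E|X| ≈ |mu|; the transformation only changes things appreciably when the effect is small relative to its sampling error.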

lnM via Delta plug-in estimator

We compute lnM using the analytic (Delta-method) plug-in estimator. This estimator can be undefined near boundary conditions, in which case the point estimate and/or sampling variance are returned as missing (see the availability summary below).

dat <- dat %>%
  dplyr::mutate(
    lnM_raw = purrr::pmap_dfr(
      list(mean.control, mean.noise, sd.control, sd.noise,
           sample.size.control, sample.size.noise.1),
      get_lnM_raw
    )
  ) %>%
  tidyr::unnest(lnM_raw)

lnM via SAFE

We compute lnM using SAFE (single-fit parametric bootstrap).

dat <- dat %>%
  dplyr::mutate(
    lnM_safe = purrr::pmap_dfr(
      list(mean.control, mean.noise, sd.control, sd.noise,
           sample.size.control, sample.size.noise.1),
      get_lnM_safe
    )
  ) %>%
  tidyr::unnest(lnM_safe)
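To make the "single-fit parametric bootstrap" idea concrete, here is a conceptual base-R sketch. It assumes, purely for illustration, that the target is the log absolute standardized mean difference; the actual estimand, bias corrections, and algorithm inside `orchaRd`'s `get_lnM_safe()` / `.safe_lnM_indep()` may differ, and `lnM_boot_sketch()` is a hypothetical name:

```r
# Illustrative single-fit parametric bootstrap for a log-magnitude effect size.
# ASSUMPTION (illustration only): the target is log(|m2 - m1| / sd_pooled).
lnM_boot_sketch <- function(m1, m2, sd1, sd2, n1, n2, B = 2000, seed = 1) {
  set.seed(seed)
  # Resample the summary statistics from their sampling distributions:
  m1_b  <- rnorm(B, m1, sd1 / sqrt(n1))
  m2_b  <- rnorm(B, m2, sd2 / sqrt(n2))
  sd1_b <- sd1 * sqrt(rchisq(B, n1 - 1) / (n1 - 1))
  sd2_b <- sd2 * sqrt(rchisq(B, n2 - 1) / (n2 - 1))
  sp_b  <- sqrt(((n1 - 1) * sd1_b^2 + (n2 - 1) * sd2_b^2) / (n1 + n2 - 2))
  # log|difference| is defined for every replicate, even when the observed
  # difference is ~0 (where a Delta-method plug-in breaks down):
  lnM_b <- log(abs(m2_b - m1_b) / sp_b)
  c(yi = mean(lnM_b), vi = var(lnM_b))
}

lnM_boot_sketch(m1 = 10, m2 = 12, sd1 = 2, sd2 = 2, n1 = 20, n2 = 20)
```

The key point is that a single set of bootstrap draws yields both the point estimate and its sampling variance, with no analytic derivative required.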

How many lnM estimates can be obtained?

Code
stopifnot(exists("dat"), is.data.frame(dat))

n_total <- nrow(dat)

summary_tbl <- tibble::tibble(
  Metric = c(
    "Total effect sizes",
    "Delta plug-in lnM: point estimate unavailable (undefined)",
    "Delta plug-in lnM: sampling variance unavailable",
    "SAFE lnM: point estimate unavailable",
    "SAFE lnM: sampling variance unavailable",
    "SAFE rescue: point estimate available when Delta plug-in lnM is undefined",
    "SAFE rescue: sampling variance available when Delta plug-in lnM is undefined"
  ),
  Count = c(
    n_total,
    sum(is.na(dat$yi_lnM_raw)),
    sum(is.na(dat$vi_lnM_raw)),
    sum(is.na(dat$yi_lnM_safe)),
    sum(is.na(dat$vi_lnM_safe)),
    sum(is.na(dat$yi_lnM_raw) & !is.na(dat$yi_lnM_safe)),
    sum(is.na(dat$vi_lnM_raw) & !is.na(dat$vi_lnM_safe))
  )
) %>%
  dplyr::mutate(
    Percent = dplyr::if_else(Metric == "Total effect sizes", NA_real_, 100 * Count / n_total)
  )

n_saved <- sum(!is.na(dat$yi_lnM_safe))
p_saved <- 100 * n_saved / n_total

cat(sprintf(
  "\n\n**SAFE produced defined lnM point estimates for %d/%d effect sizes (%.1f%%).**\n\n",
  n_saved, n_total, p_saved
))


**SAFE produced defined lnM point estimates for 486/486 effect sizes (100.0%).**
Code
summary_tbl %>%
  kableExtra::kable(
    digits = 1,
    caption = "Data availability for lnM estimates (Example 2): Delta plug-in vs SAFE (counts and % of all effect sizes).",
    align = c("l", "r", "r")
  ) %>%
  kableExtra::kable_styling(full_width = FALSE) %>%
  kableExtra::scroll_box(width = "100%", height = "400px")
Data availability for lnM estimates (Example 2): Delta plug-in vs SAFE (counts and % of all effect sizes).
Metric Count Percent
Total effect sizes 486 NA
Delta plug-in lnM: point estimate unavailable (undefined) 177 36.4
Delta plug-in lnM: sampling variance unavailable 177 36.4
SAFE lnM: point estimate unavailable 0 0.0
SAFE lnM: sampling variance unavailable 0 0.0
SAFE rescue: point estimate available when Delta plug-in lnM is undefined 177 36.4
SAFE rescue: sampling variance available when Delta plug-in lnM is undefined 177 36.4

Meta-analysis model

We focus on the main taxa retained in the worked example (birds, mammals, amphibians, and fish), excluding groups with very sparse representation.

#| label: ex2-filter-taxa
#| message: false
#| warning: false

dat <- dat %>%
  dplyr::filter(!taxon.for.plot %in% c("reptilia", "mollusca", "arthropoda"))

Absolute SMD

(folded-normal)

ma_abs_ex2 <- rma.mv(
  yi     = yi_d_abs,
  V      = vi_d_abs,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat,
  method = "REML",
  test   = "t"
)

summary(ma_abs_ex2)

Multivariate Meta-Analysis Model (k = 379; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-489.6619   979.3238   985.3238   997.1284   985.3879   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.4514  0.6719     88     no    study 
sigma^2.2  0.0110  0.1047    379     no  case.nr 

Test for Heterogeneity:
Q(df = 378) = 1464.9688, p-val < .0001

Model Results:

estimate      se     tval   df    pval   ci.lb   ci.ub      
  0.8663  0.0772  11.2195  378  <.0001  0.7145  1.0182  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i2_ml(ma_abs_ex2)
  I2_Total   I2_study I2_case.nr 
 84.992852  82.976595   2.016257 
main <- orchaRd::mod_results(ma_abs_ex2, group = "study")
main <- unclass(main)
main$mod_table$name <- "Overall"
main$data$moderator <- "Overall"
class(main) <- "orchard"

p7 <- orchard_plot(main, xlab = "Folded-normal SMD magnitude", group = "study",
                   trunk.size = 0.6, branch.size = 5) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#CC6677")
p7

lnM

(SAFE)

ma_safe_ex2 <- rma.mv(
  yi     = yi_lnM_safe,
  V      = vi_lnM_safe,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat,
  method = "REML",
  test   = "t"
)

summary(ma_safe_ex2)

Multivariate Meta-Analysis Model (k = 379; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-438.9700   877.9401   883.9401   895.7448   884.0043   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.6214  0.7883     88     no    study 
sigma^2.2  0.1967  0.4435    379     no  case.nr 

Test for Heterogeneity:
Q(df = 378) = 3267.6846, p-val < .0001

Model Results:

estimate      se     tval   df    pval    ci.lb    ci.ub      
 -0.4625  0.0956  -4.8378  378  <.0001  -0.6504  -0.2745  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i2_ml(ma_safe_ex2)
  I2_Total   I2_study I2_case.nr 
  83.57297   63.48178   20.09119 
main2 <- orchaRd::mod_results(ma_safe_ex2, group = "study")
main2 <- unclass(main2)
main2$mod_table$name <- "Overall"
main2$data$moderator <- "Overall"
class(main2) <- "orchard"

p8 <- orchard_plot(main2, xlab = "lnM", group = "study",
                   trunk.size = 0.8, branch.size = 5) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#117733")
p8

Note: Mapping lnM to an SMD-equivalent effect size
lnm_to_deq_approx(ma_safe_ex2, "Overall model (Example 2)")
Overall model (Example 2): d_eq (approx) = 0.89 (95% CI: 0.74 to 1.07)

In this example, absolute SMD estimates again cluster near the lower bound of the scale, with a small number of extreme values dominating visual spread. By contrast, lnM produces a more symmetric and dispersed distribution of magnitudes, allowing differences in response strength across studies to be more clearly resolved. As in Example 1, this illustrates how lnM can summarise the magnitude of treatment effects across heterogeneous outcomes while retaining a scale that is better suited for synthesis and model-based inference.

Meta-regression

We next examine whether effect magnitudes vary across broad taxonomic groups. Here, taxon.for.plot is treated as a categorical moderator, with the intercept removed (- 1 in the model formula) so that each group mean is estimated directly.
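The effect of dropping the intercept is easiest to see in the design matrix. A minimal base-R check (the factor `f` and its levels are arbitrary placeholders, not columns from the dataset):

```r
# How "- 1" changes the coding of a categorical moderator
f <- factor(c("a", "a", "b", "c"))

colnames(model.matrix(~ f))      # "(Intercept)" "fb" "fc": contrasts vs baseline level
colnames(model.matrix(~ f - 1))  # "fa" "fb" "fc": one column (mean) per level
```

With the intercept, coefficients are differences from the reference level; without it, each coefficient is that group's mean, which is why the taxon models below report one estimate (and CI) per taxon.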

Absolute SMD

(folded-normal magnitude)

mod_abs_ex2 <- rma.mv(
  yi     = yi_d_abs,
  V      = vi_d_abs,
  mods   = ~ taxon.for.plot - 1,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat,
  method = "REML",
  test   = "t"
)

summary(mod_abs_ex2)

Multivariate Meta-Analysis Model (k = 379; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-479.6620   959.3239   971.3239   994.8855   971.5522   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.3997  0.6323     88     no    study 
sigma^2.2  0.0117  0.1083    379     no  case.nr 

Test for Residual Heterogeneity:
QE(df = 375) = 1382.5880, p-val < .0001

Test of Moderators (coefficients 1:4):
F(df1 = 4, df2 = 375) = 37.4973, p-val < .0001

Model Results:

                        estimate      se    tval   df    pval   ci.lb   ci.ub 
taxon.for.plotamphibia    0.6673  0.1832  3.6430  375  0.0003  0.3071  1.0275 
taxon.for.plotaves        0.7237  0.1133  6.3872  375  <.0001  0.5009  0.9465 
taxon.for.plotfish        0.8606  0.1370  6.2813  375  <.0001  0.5912  1.1300 
taxon.for.plotmammalia    1.4839  0.1975  7.5143  375  <.0001  1.0956  1.8722 
                            
taxon.for.plotamphibia  *** 
taxon.for.plotaves      *** 
taxon.for.plotfish      *** 
taxon.for.plotmammalia  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p9 <- orchard_plot(
  mod_abs_ex2,
  mod = "taxon.for.plot",
  xlab = "Folded-normal SMD magnitude",
  group = "study",
  trunk.size = 0.6,
  branch.size = 5
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = rep("#CC6677", 4))

p9

# Pairwise comparisons among taxon means (Tukey-type contrasts, unadjusted p values)
summary(
  glht(mod_abs_ex2, linfct = cbind(contrMat(rep(1, 4), type = "Tukey"))),
  test = adjusted("none")
)

     Simultaneous Tests for General Linear Hypotheses

Fit: rma.mv(yi = yi_d_abs, V = vi_d_abs, mods = ~taxon.for.plot - 
    1, data = dat, random = list(~1 | study, ~1 | case.nr), method = "REML", 
    test = "t")

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)    
2 - 1 == 0   0.0564     0.2154   0.262 0.793438    
3 - 1 == 0   0.1932     0.2288   0.845 0.398267    
4 - 1 == 0   0.8166     0.2694   3.032 0.002432 ** 
3 - 2 == 0   0.1368     0.1778   0.770 0.441537    
4 - 2 == 0   0.7602     0.2277   3.339 0.000841 ***
4 - 3 == 0   0.6234     0.2404   2.594 0.009498 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- none method)

lnM

(SAFE)

mod_safe_ex2 <- rma.mv(
  yi     = yi_lnM_safe,
  V      = vi_lnM_safe,
  mods   = ~ taxon.for.plot - 1,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat,
  method = "REML",
  test   = "t"
)

summary(mod_safe_ex2)

Multivariate Meta-Analysis Model (k = 379; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-427.7495   855.4990   867.4990   891.0606   867.7273   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.5337  0.7305     88     no    study 
sigma^2.2  0.1970  0.4438    379     no  case.nr 

Test for Residual Heterogeneity:
QE(df = 375) = 2829.2541, p-val < .0001

Test of Moderators (coefficients 1:4):
F(df1 = 4, df2 = 375) = 10.0157, p-val < .0001

Model Results:

                        estimate      se     tval   df    pval    ci.lb 
taxon.for.plotamphibia   -0.7424  0.2311  -3.2129  375  0.0014  -1.1968 
taxon.for.plotaves       -0.5879  0.1421  -4.1377  375  <.0001  -0.8673 
taxon.for.plotfish       -0.5610  0.1700  -3.2992  375  0.0011  -0.8953 
taxon.for.plotmammalia    0.2910  0.2209   1.3171  375  0.1886  -0.1434 
                          ci.ub      
taxon.for.plotamphibia  -0.2881   ** 
taxon.for.plotaves      -0.3085  *** 
taxon.for.plotfish      -0.2266   ** 
taxon.for.plotmammalia   0.7255      

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p10 <- orchard_plot(
  mod_safe_ex2,
  mod = "taxon.for.plot",
  xlab = "lnM (SAFE)",
  group = "study",
  trunk.size = 0.6,
  branch.size = 5
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = rep("#117733", 4))

p10

# Pairwise comparisons among taxon means (Tukey-type contrasts, unadjusted p values)
summary(
  glht(mod_safe_ex2, linfct = cbind(contrMat(rep(1, 4), type = "Tukey"))),
  test = adjusted("none")
)

     Simultaneous Tests for General Linear Hypotheses

Fit: rma.mv(yi = yi_lnM_safe, V = vi_lnM_safe, mods = ~taxon.for.plot - 
    1, data = dat, random = list(~1 | study, ~1 | case.nr), method = "REML", 
    test = "t")

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)    
2 - 1 == 0  0.15456    0.27127   0.570  0.56884    
3 - 1 == 0  0.18147    0.28690   0.633  0.52705    
4 - 1 == 0  1.03345    0.31972   3.232  0.00123 ** 
3 - 2 == 0  0.02691    0.22158   0.121  0.90334    
4 - 2 == 0  0.87890    0.26269   3.346  0.00082 ***
4 - 3 == 0  0.85199    0.27880   3.056  0.00224 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- none method)

Using folded-normal SMDs, group means are strongly compressed toward zero and heavily influenced by a small number of extreme values, resulting in substantial overlap among taxa and limited ability to distinguish typical response magnitudes. In contrast, lnM yields a wider and more symmetric distribution of effect magnitudes within groups, revealing clearer separation among taxa and reducing the influence of extreme observations.

Small-study effects diagnostic

We use the same approach as in Example 1.

dat <- dat %>%
  dplyr::mutate(
    n0  = (sample.size.control * sample.size.noise.1) /
          (sample.size.control + sample.size.noise.1),
    nSE = 1 / sqrt(n0),
    nV  = 1 / n0
  )
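The predictor n₀ is the effective sample size of a two-group comparison (half the harmonic mean of n₁ and n₂), so 1/n₀ scales like the sampling variance of a mean difference and 1/√n₀ like its standard error. A quick numeric check (the sample sizes are arbitrary):

```r
# Effective sample size for a two-group comparison and its precision transforms
n1 <- 12; n2 <- 24
n0  <- (n1 * n2) / (n1 + n2)  # 8: dominated by the smaller group
nSE <- 1 / sqrt(n0)           # standard-error-like predictor
nV  <- 1 / n0                 # variance-like predictor
c(n0 = n0, nSE = nSE, nV = nV)
```

Note that n₀ is always less than the smaller group's size, which is why very unbalanced designs carry little effective information.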

Absolute SMD

(folded-normal; 1 / sqrt(n₀))

egger_abs_ex2 <- rma.mv(
  yi     = yi_d_abs,
  V      = vi_d_abs,
  mods   = ~ nSE,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat,
  method = "REML",
  test   = "t"
)

summary(egger_abs_ex2)

Multivariate Meta-Analysis Model (k = 379; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-487.1753   974.3506   982.3506   998.0795   982.4581   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.4097  0.6401     88     no    study 
sigma^2.2  0.0111  0.1052    379     no  case.nr 

Test for Residual Heterogeneity:
QE(df = 377) = 1368.1531, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 377) = 3.2713, p-val = 0.0713

Model Results:

         estimate      se    tval   df    pval    ci.lb   ci.ub     
intrcpt    0.5739  0.1744  3.2912  377  0.0011   0.2310  0.9168  ** 
nSE        0.7707  0.4261  1.8087  377  0.0713  -0.0672  1.6085   . 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p11 <- bubble_plot(
  egger_abs_ex2, mod = "nSE",
  ylab = "Folded-normal SMD magnitude",
  xlab = "sqrt(inverse sample size)",
  group = "study"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#CC6677")

p11

lnM

(SAFE; 1 / sqrt(n₀))

egger_safe_ex2 <- rma.mv(
  yi     = yi_lnM_safe,
  V      = vi_lnM_safe,
  mods   = ~ nSE,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat,
  method = "REML",
  test   = "t"
)

summary(egger_safe_ex2)

Multivariate Meta-Analysis Model (k = 379; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-429.5713   859.1425   867.1425   882.8715   867.2501   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.4910  0.7007     88     no    study 
sigma^2.2  0.2041  0.4518    379     no  case.nr 

Test for Residual Heterogeneity:
QE(df = 377) = 3202.4662, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 377) = 17.7738, p-val < .0001

Model Results:

         estimate      se     tval   df    pval    ci.lb    ci.ub      
intrcpt   -1.4564  0.2526  -5.7649  377  <.0001  -1.9531  -0.9596  *** 
nSE        2.6049  0.6179   4.2159  377  <.0001   1.3900   3.8197  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p12 <- bubble_plot(
  egger_safe_ex2, mod = "nSE",
  ylab = "lnM (SAFE)",
  xlab = "sqrt(inverse sample size)",
  group = "study",
  legend.pos = "bottom.right"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#117733")

p12

The folded-normal SMD shows little association with precision, reflecting compression of effect magnitudes near zero. In contrast, lnM increases clearly as 1/√n₀ grows (i.e., magnitudes are larger in smaller studies), indicating greater sensitivity to size-dependent differences in effect magnitude.

lnM

(SAFE; 1 / n₀)

egger_safe2_ex2 <- rma.mv(
  yi     = yi_lnM_safe,
  V      = vi_lnM_safe,
  mods   = ~ nV,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat,
  method = "REML",
  test   = "t"
)

summary(egger_safe2_ex2)

Multivariate Meta-Analysis Model (k = 379; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-429.8243   859.6486   867.6486   883.3776   867.7561   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.5130  0.7162     88     no    study 
sigma^2.2  0.2013  0.4487    379     no  case.nr 

Test for Residual Heterogeneity:
QE(df = 377) = 3230.6182, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 377) = 16.4874, p-val < .0001

Model Results:

         estimate      se     tval   df    pval    ci.lb    ci.ub      
intrcpt   -0.9303  0.1464  -6.3552  377  <.0001  -1.2181  -0.6425  *** 
nV         2.9297  0.7215   4.0605  377  <.0001   1.5110   4.3484  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bubble_plot(
  egger_safe2_ex2, mod = "nV",
  ylab = "lnM (SAFE)",
  xlab = "inverse sample size",
  group = "study"
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = "#117733")

Note: Mapping lnM to an SMD-equivalent effect size
lnm_to_deq_approx(egger_safe2_ex2, "Egger-style model (Example 2; 1 / no)")
Egger-style model (Example 2; 1 / no): d_eq (approx) = 0.56 (95% CI: 0.42 to 0.74)

Sensitivity analysis

We repeated the lnM analyses after restricting effect sizes to comparisons with total sample size n₁ + n₂ ≥ 40.

dat_sens <- dat %>%
  dplyr::filter((sample.size.control + sample.size.noise.1) >= 40)

dim(dat_sens)
[1] 105  29

lnM: meta-analysis (sensitivity subset)

ma_safe_sens_ex2 <- rma.mv(
  yi     = yi_lnM_safe,
  V      = vi_lnM_safe,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat_sens,
  method = "REML",
  test   = "t"
)

summary(ma_safe_sens_ex2)

Multivariate Meta-Analysis Model (k = 105; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-113.7869   227.5737   233.5737   241.5069   233.8137   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.6462  0.8039     36     no    study 
sigma^2.2  0.1040  0.3225    105     no  case.nr 

Test for Heterogeneity:
Q(df = 104) = 827.7187, p-val < .0001

Model Results:

estimate      se     tval   df    pval    ci.lb    ci.ub      
 -0.7593  0.1519  -5.0003  104  <.0001  -1.0604  -0.4582  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i2_ml(ma_safe_sens_ex2)
  I2_Total   I2_study I2_case.nr 
  84.48766   72.77735   11.71031 
orchard_plot(ma_safe_sens_ex2, xlab = "lnM", group = "study")

Note: Mapping lnM to an SMD-equivalent effect size
lnm_to_deq_approx(ma_safe_sens_ex2, "Sensitivity model (Example 2; n₁ + n₂ ≥ 40)")
Sensitivity model (Example 2; n₁ + n₂ ≥ 40): d_eq (approx) = 0.66 (95% CI: 0.49 to 0.89)

lnM: taxon model (main vs sensitivity)

To compare patterns across taxa, we refit the same taxon model in the sensitivity subset.

Main dataset

tax_model_main_ex2 <- rma.mv(
  yi     = yi_lnM_safe,
  V      = vi_lnM_safe,
  mods   = ~ taxon.for.plot - 1 + nSE,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat,
  method = "REML",
  test   = "t"
)

summary(tax_model_main_ex2)

Multivariate Meta-Analysis Model (k = 379; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-419.9461   839.8922   853.8922   881.3620   854.1982   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.4416  0.6645     88     no    study 
sigma^2.2  0.2032  0.4508    379     no  case.nr 

Test for Residual Heterogeneity:
QE(df = 374) = 2816.4068, p-val < .0001

Test of Moderators (coefficients 1:5):
F(df1 = 5, df2 = 374) = 11.9256, p-val < .0001

Model Results:

                        estimate      se     tval   df    pval    ci.lb 
taxon.for.plotamphibia   -1.5287  0.3012  -5.0752  374  <.0001  -2.1210 
taxon.for.plotaves       -1.4860  0.2752  -5.3995  374  <.0001  -2.0271 
taxon.for.plotfish       -1.3918  0.2735  -5.0882  374  <.0001  -1.9297 
taxon.for.plotmammalia   -0.7116  0.3387  -2.1007  374  0.0363  -1.3776 
nSE                       2.3032  0.6154   3.7426  374  0.0002   1.0931 
                          ci.ub      
taxon.for.plotamphibia  -0.9364  *** 
taxon.for.plotaves      -0.9448  *** 
taxon.for.plotfish      -0.8540  *** 
taxon.for.plotmammalia  -0.0455    * 
nSE                      3.5133  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
r2_ml(tax_model_main_ex2)
   R2_marginal R2_conditional 
     0.2124293      0.7517636 
res_main <- mod_results(
  tax_model_main_ex2,
  group = "study",
  mod = "taxon.for.plot",
  at = list(nSE = 0)
)

orchard_plot(
  res_main,
  mod = "taxon.for.plot",
  xlab = "lnM (SAFE)",
  group = "study",
  trunk.size = 0.6,
  branch.size = 5
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = rep("#117733", 4))

Sensitivity subset (n₁ + n₂ ≥ 40)

tax_model_sens_ex2 <- rma.mv(
  yi     = yi_lnM_safe,
  V      = vi_lnM_safe,
  mods   = ~ taxon.for.plot - 1 + nSE,
  random = list(~ 1 | study, ~ 1 | case.nr),
  data   = dat_sens,
  method = "REML",
  test   = "t"
)

summary(tax_model_sens_ex2)

Multivariate Meta-Analysis Model (k = 105; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-103.7305   207.4609   221.4609   239.6971   222.6783   

Variance Components:

            estim    sqrt  nlvls  fixed   factor 
sigma^2.1  0.5479  0.7402     36     no    study 
sigma^2.2  0.1010  0.3178    105     no  case.nr 

Test for Residual Heterogeneity:
QE(df = 100) = 764.6320, p-val < .0001

Test of Moderators (coefficients 1:5):
F(df1 = 5, df2 = 100) = 7.7083, p-val < .0001

Model Results:

                        estimate      se     tval   df    pval    ci.lb 
taxon.for.plotamphibia   -2.6986  0.7521  -3.5884  100  0.0005  -4.1907 
taxon.for.plotaves       -2.9561  0.7125  -4.1489  100  <.0001  -4.3697 
taxon.for.plotfish       -2.4293  0.6547  -3.7104  100  0.0003  -3.7282 
taxon.for.plotmammalia   -2.2179  0.8154  -2.7200  100  0.0077  -3.8356 
nSE                       7.1416  2.4720   2.8890  100  0.0047   2.2372 
                          ci.ub      
taxon.for.plotamphibia  -1.2066  *** 
taxon.for.plotaves      -1.5425  *** 
taxon.for.plotfish      -1.1303  *** 
taxon.for.plotmammalia  -0.6002   ** 
nSE                     12.0459   ** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
r2_ml(tax_model_sens_ex2)
   R2_marginal R2_conditional 
     0.1764212      0.8717989 
res_sens <- mod_results(
  tax_model_sens_ex2,
  group = "study",
  mod = "taxon.for.plot",
  at = list(nSE = 0)
)

orchard_plot(
  res_sens,
  mod = "taxon.for.plot",
  xlab = "lnM (SAFE)",
  group = "study",
  trunk.size = 0.6,
  branch.size = 5
) +
  scale_colour_manual(values = rep("grey20", 8)) +
  scale_fill_manual(values = rep("#117733", 4))

Figure 2

Comparison of overall meta-analytic estimates, taxon-level meta-regressions, and Egger-style small-study effect diagnostics for folded-normal SMD magnitude and lnM (SAFE) in the Kunc & Schmidt dataset.

# Combine panels into a 3 × 2 layout (three rows, two columns):
# Row 1: overall meta-analyses (absolute SMD vs lnM)
# Row 2: taxon meta-regressions (absolute SMD vs lnM)
# Row 3: Egger-style diagnostics (absolute SMD vs lnM)
(p7 + p8) / (p9 + p10) / (p11 + p12) + plot_annotation(tag_levels = 'A') & theme(plot.tag = element_text(size = 16, face = "bold"))

Part II — Simulation reproduction (figures + tables)

Part II reproduces all simulation-based figures and tables reported in the manuscript using:

  1. a pre-computed simulation summary (one row per simulation setting), and
  2. where needed, replicate-level outputs (results from each Monte Carlo replicate within a simulation setting).

A simulation setting (or grid cell) is one combination of: design × sample sizes × true lnM value (theta, θ). Within each setting, we ran several Monte Carlo replicates, computed lnM using the Delta plug-in and SAFE-BC, and summarised estimator performance.
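Such a grid of settings can be laid out with expand.grid; the levels below are placeholders for illustration, not the values used in the manuscript:

```r
# Illustrative grid of simulation settings (design x sample size x true lnM);
# the specific levels here are placeholders, not the manuscript's values.
settings <- expand.grid(
  design = c("indep", "paired"),
  n1     = c(10, 20, 40),
  theta  = seq(-2, 1, by = 0.5),  # true lnM
  stringsAsFactors = FALSE
)
settings$n2 <- settings$n1        # balanced designs only, in this sketch

nrow(settings)  # 2 designs x 3 sizes x 7 theta values = 42 settings
```

Each row of such a grid would then drive one batch of Monte Carlo replicates.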

Note: Key quantities
Quantity Definition
\(\theta\) (theta) The true lnM value used to generate data in the simulation setting (\(x\)-axis in all plots).
Bias \(\text{mean}(\text{estimate}) - \text{true lnM}\) (computed within each setting).
RMSE \(\sqrt{\text{mean}((\text{estimate} - \text{true lnM})^2)}\) (combines bias + variability).
Coverage The proportion of nominal 95% intervals that contain the true lnM (target is 0.95).
MCSE Monte Carlo standard error; uncertainty in simulation summaries due to a finite number of Monte Carlo replicates.

Supplementary Figures

Load summary results

The summary file contains one row per simulation setting (design × sample sizes × θ), including bias, RMSE, coverage, failure rates, and MCSE diagnostics computed from the Monte Carlo replicates.

Code
# Load the pre-computed simulation summary used to reproduce figures and tables results

results <- readRDS(here::here("results", "lnM_summary_SAFEfun_2025-12-21.rds"))

# Create a compact facet label for plotting 
results <- results %>%
    mutate(facet_label = paste0(design, " n1=", n1, " n2=", n2))

facet_info <- results %>% distinct(facet_label, design, n1, n2)

balanced_ind <- facet_info %>%
    filter(design == "indep", n1 == n2) %>% arrange(n1) %>% pull(facet_label)
unbalanced_ind <- facet_info %>%
    filter(design == "indep", n1 != n2) %>% arrange(n1) %>% pull(facet_label)
paired_balanced <- facet_info %>%
    filter(design == "paired") %>% arrange(n1) %>% pull(facet_label)

new_levels <- c(balanced_ind, unbalanced_ind, paired_balanced)
results <- results %>% mutate(facet_label = factor(facet_label, levels = new_levels))

strip_labeller <- function(x) gsub(" n1=", "   n1=", gsub(" n2=", "   n2=", x))

# Shared plotting theme for all simulation figures

base_theme <- theme_bw(11) +
    theme(
      legend.title = element_text(),
      legend.position = "right",
      strip.background = element_rect(fill = "grey85", colour = "grey40"),
      strip.text = element_text(face = "plain"),
      panel.grid.minor = element_blank()
)

Figure — Bias of point estimates

Figure 2 plots point-estimate bias for each estimator within each simulation setting. Bias is computed as \(\text{bias} = \text{mean}(\text{estimate}) - \text{true lnM}\), so values above 0 indicate systematic overestimation and values below 0 indicate underestimation. The dashed horizontal line marks zero bias (ideal). Each panel corresponds to one design × sample-size combination; the x-axis (theta) is the true lnM used to generate the data.

Code
# Construct long-format data for plotting bias curves
bias_df <- dplyr::bind_rows(
    results %>% dplyr::select(theta, facet_label, delta_bias) %>%
      dplyr::rename(bias = delta_bias) %>% dplyr::mutate(estimator = "PI"),
    results %>% dplyr::select(theta, facet_label, safe_bias) %>%
      dplyr::rename(bias = safe_bias) %>% dplyr::mutate(estimator = "SAFE")
  )

  ggplot(bias_df, aes(theta, bias, colour = estimator, group = estimator)) +
    geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
    geom_line(linewidth = 0.6) +
    facet_wrap(~ facet_label, ncol = 4, nrow = 3,
               labeller = labeller(facet_label = strip_labeller)) +
    scale_color_discrete("Estimator", 
                         labels = c("PI" = "Plugin",
                                   "SAFE" = "SAFE bootstrapping"))+
    labs(x = expression(theta),
         y = "Bias (estimate − true lnM)",
         colour = "estimator") +
    base_theme + 
  theme(legend.position = "bottom")
Figure 2: Point-estimate bias (estimate − true lnM). Dashed line indicates 0 bias.

Relative bias of variance estimates

Relative bias of the reported sampling variances is expressed as a percentage (Figure 3). In the simulation code, this is computed relative to an empirical Monte Carlo baseline variance for the SAFE-BC point estimator (computed within each setting on a common “ok” subset). The dashed line at 0% indicates an unbiased variance estimate under this baseline.

Code
# Construct long-format data for plotting relative bias of variance estimates

rb_df <- dplyr::bind_rows(
    results %>% dplyr::select(theta, facet_label, relbias_delta) %>%
      dplyr::rename(relbias = relbias_delta) %>% dplyr::mutate(estimator = "Delta"),
    results %>% dplyr::select(theta, facet_label, relbias_safe) %>%
      dplyr::rename(relbias = relbias_safe)  %>% dplyr::mutate(estimator = "SAFE")
  )

ggplot(rb_df, aes(theta, relbias, colour = estimator, group = estimator)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_line(linewidth = 0.6) +
  facet_wrap(~ facet_label, ncol = 4, nrow = 3,
             labeller = labeller(facet_label = strip_labeller)) +
  scale_color_discrete("Estimator",
                       labels = c("Delta" = "Delta method derived plugin",
                                  "SAFE" = "SAFE bootstrapping")) +
  labs(x = expression(theta),
       y = "Relative bias of variance (%)") +
  base_theme +
  theme(legend.position = "bottom")
Figure 3: Relative bias of variance estimates (%). Dashed line indicates 0% (unbiased).

Coverage of nominal 95% intervals

Coverage is the fraction of Monte Carlo replicates in which the nominal 95% interval contains the true lnM value (Figure 4). The dashed line at 0.95 is the nominal target.
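As a minimal sketch (with made-up estimates and standard errors, not the simulation outputs), coverage is simply the fraction of Wald intervals that trap the true value:

```r
# Hypothetical replicate-level estimates with correctly reported SEs
set.seed(42)
true_lnM <- 0.3
est <- rnorm(1e4, mean = true_lnM, sd = 0.1)  # point estimates
se  <- rep(0.1, 1e4)                          # reported standard errors

lo <- est - qnorm(0.975) * se
hi <- est + qnorm(0.975) * se
mean(lo <= true_lnM & true_lnM <= hi)  # empirical coverage; close to 0.95 here
```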

Code
# Construct long-format data for plotting empirical coverage
cov_df <- dplyr::bind_rows(
    results %>% transmute(theta, facet_label, estimator = "PI",   cover = cover_delta),
    results %>% transmute(theta, facet_label, estimator = "SAFE", cover = cover_safe)
  )

ggplot(cov_df, aes(theta, cover, colour = estimator, group = estimator)) +
  geom_hline(yintercept = 0.95, linetype = "dashed", colour = "grey50") +
  geom_line(linewidth = 0.6) +
  facet_wrap(~ facet_label, ncol = 4, nrow = 3,
             labeller = labeller(facet_label = strip_labeller)) +
  scale_color_discrete("Estimator",
                       labels = c("PI" = "Plugin",
                                  "SAFE" = "SAFE bootstrapping")) +
  labs(x = expression(theta),
       y = "Empirical coverage") +
  base_theme +
  theme(legend.position = "bottom")
Figure 4: Empirical coverage of nominal 95% intervals. Dashed line indicates 0.95.

RMSE of point estimates

RMSE (root mean square error) summarizes overall point-estimate accuracy by combining bias and variability (Figure 5). Lower RMSE indicates better performance.
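The bias–variance decomposition behind this summary can be verified numerically with made-up replicate estimates (not the simulation outputs):

```r
# Hypothetical, slightly biased point estimates
set.seed(7)
true_lnM <- 0.5
est <- rnorm(1e4, mean = 0.52, sd = 0.15)

bias <- mean(est) - true_lnM
rmse <- sqrt(mean((est - true_lnM)^2))

# RMSE^2 equals bias^2 plus the (uncorrected) Monte Carlo variance, exactly:
all.equal(rmse^2, bias^2 + mean((est - mean(est))^2))  # TRUE
```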

Code
# Construct long-format data for plotting RMSE curves
rmse_df <- dplyr::bind_rows(
    results %>% transmute(theta, facet_label, estimator = "PI",   rmse = rmse_delta),
    results %>% transmute(theta, facet_label, estimator = "SAFE", rmse = rmse_safe)
  )

ggplot(rmse_df, aes(theta, rmse, colour = estimator, group = estimator)) +
  geom_line(linewidth = 0.6) +
  facet_wrap(~ facet_label, ncol = 4, nrow = 3,
             labeller = labeller(facet_label = strip_labeller)) +
  scale_color_discrete("Estimator",
                       labels = c("PI" = "Plugin",
                                  "SAFE" = "SAFE bootstrapping")) +
  labs(x = expression(theta),
       y = expression(RMSE~"("~hat(lnM)~"\u2212"~lnM~")")) +
  base_theme +
  theme(legend.position = "bottom")
Figure 5: RMSE of point estimates. Lower values indicate better overall accuracy.

Reading in data for Monte Carlo (MC) standard error (SE)

The chunks below read replicate-level outputs saved during the simulation run (raw_runs/row_###.rds). Each file corresponds to one simulation setting (one “grid cell”), and contains one row per Monte Carlo replicate. We use these files to reconstruct additional Monte Carlo error diagnostics (MCSE) that are easier to compute from replicate-level results.

Note

Each raw_runs/row_###.rds file corresponds to one row of the simulation parameter grid used in the driver script (i.e., one design × sample size × θ combination).

Code
# Read replicate-level simulation outputs (one file per grid cell)
raw_dir <- here("raw_runs")
raw_files <- list.files(raw_dir, pattern = "^row_[0-9]{3}\\.rds$", full.names = TRUE)
if (!length(raw_files)) stop("No raw files found in: ", raw_dir)


# Extract grid-cell index from filename (e.g., row_012.rds -> 12)

row_index_from_path <- function(p) as.integer(sub("^row_([0-9]{3})\\.rds$", "\\1", basename(p)))

# Compute diagnostics for a single grid cell (one replicate-level data frame)

one_raw_diag <- function(raw_df) {
  
   # Defensive: remove replicate ID if present (not used in computations)
  if ("rep" %in% names(raw_df)) raw_df <- raw_df %>% dplyr::select(-rep)
  
  # Match the “ok_d” logic used in the simulation summary:
  # keep replicates where the Delta-method point & variance are defined and variance > 0
  ok_d <- is.finite(raw_df$delta_pt) & is.finite(raw_df$delta_var) & (raw_df$delta_var > 0)
  
  # If nothing survives the ok_d filter, return NAs
  if (!any(ok_d)) {
    return(data.frame(
      n_ok_d = 0L,
      n_ok_safe = 0L,
      RSE_VarMC_SAFEpt = NA_real_,
      sd_PI = NA_real_,
      sd_SAFE = NA_real_
    ))
  }
  
  df_ok <- raw_df[ok_d, , drop = FALSE]
  
  # Within the ok_d subset, check how many finite SAFE-BC point estimates exist
  ok_safe_pt <- is.finite(df_ok$safe_pt)
  n_ok_safe <- sum(ok_safe_pt)
  
  # Relative standard error (RSE) of the Monte Carlo variance estimate:
  # RSE(Var_hat) ≈ sqrt(2/(n-1)) for sample variance under normality
  RSE_varmc <- if (n_ok_safe >= 3) sqrt(2 / (n_ok_safe - 1)) else NA_real_
  
    # SD of point estimates (used to compute MCSE of bias later)
  # Since true_lnM is constant within a grid cell, sd(est - true) = sd(est).
  sd_PI   <- sd(df_ok$delta_pt, na.rm = TRUE)
  sd_SAFE <- sd(df_ok$safe_pt,  na.rm = TRUE)
  
  data.frame(
    n_ok_d = nrow(df_ok),
    n_ok_safe = n_ok_safe,
    RSE_VarMC_SAFEpt = RSE_varmc,
    sd_PI = sd_PI,
    sd_SAFE = sd_SAFE
  )
}
# Compute diagnostics for each grid cell (row_###.rds)
diag_list <- pbapply::pblapply(raw_files, function(f) {
  rid <- row_index_from_path(f)
  raw_df <- readRDS(f)
  d <- one_raw_diag(raw_df)
  d$row_id <- rid
  d
})

diag_df <- bind_rows(diag_list)
# Join diagnostics back onto the simulation summary
results2 <- results %>%
  mutate(row_id = seq_len(nrow(results))) %>%
  left_join(diag_df, by = "row_id")

MCSE of bias (from replicate-level files)

Note

How to read MCSE plots.
MCSE is simulation uncertainty, not estimator uncertainty. It measures how precisely the simulation has estimated quantities like bias and mean(variance). Lower MCSE implies more stable simulation summaries.

Figure 6 shows the Monte Carlo standard error (MCSE) of the estimated bias within each simulation setting. Here, MCSE quantifies how much the estimated bias would vary if we re-ran the Monte Carlo simulation with the same number of replicates.

We compute this from replicate-level outputs using the approximation:

MCSE(bias) ≈ sd(estimate) / sqrt(n_eff)

where n_eff is the number of usable Monte Carlo replicates in that setting (after applying the same validity filters used in the simulation summary). Smaller MCSE values indicate more precise bias estimates.

Code
# MCSE(bias) approximation:
#   MCSE(bias) ≈ sd(estimator) / sqrt(n_eff),
# where n_eff is the number of usable replicates in each grid cell.

results2 <- results2 %>%
  mutate(
    mcse_bias_PI_disk   = ifelse(is.finite(sd_PI)   & n_ok_d >= 2,    sd_PI   / sqrt(n_ok_d),     NA_real_),
    mcse_bias_SAFE_disk = ifelse(is.finite(sd_SAFE) & n_ok_safe >= 2, sd_SAFE / sqrt(n_ok_safe),  NA_real_)
  )

mc_long <- results2 %>%
  transmute(
    theta, facet_label,
    mcse_bias_PI   = mcse_bias_PI_disk,
    mcse_bias_SAFE = mcse_bias_SAFE_disk
  ) %>%
  pivot_longer(
    cols = c(mcse_bias_PI, mcse_bias_SAFE),
    names_to = "metric", values_to = "value"
  )

p_mc <- ggplot(mc_long, aes(theta, value, colour = metric, group = metric)) +
  geom_line() +
  facet_wrap(~ facet_label, ncol = 4) +
  scale_color_discrete("Estimator", 
                         labels = c("mcse_bias_PI" = "Plugin",
                                   "mcse_bias_SAFE" = "SAFE bootstrapping"))+
  labs(
    x = expression(theta),
    y = "MCSE of bias",
    title = "Monte-Carlo error diagnostics: MCSE of bias (per grid cell; from raw runs)"
  ) +
  base_theme + 
  theme(legend.position = "bottom")

print(p_mc)
Figure 6: MCSE of bias (per grid cell) reconstructed from replicate-level files in raw_runs.

MCSE of mean(variance estimate) (from summary)

Figure 7 shows the MCSE of the mean reported sampling variance for each estimator within each simulation setting. This is a precision diagnostic for the simulation itself: it indicates how much uncertainty remains in the simulation summary of mean_var_delta and mean_var_safe due to finite Monte Carlo replicates.

In practice, smaller MCSE values mean the simulation is sufficiently large to estimate average variance behavior reliably for that setting; larger values indicate that more Monte Carlo replicates would be needed to stabilize the mean variance estimate.
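The underlying quantity is just the standard error of a mean over K Monte Carlo replicates. A sketch with made-up reported variances (`var_hat` is illustrative, not a simulation output):

```r
# MCSE of the mean of K reported sampling variances
set.seed(3)
var_hat <- rlnorm(1e5, meanlog = -3, sdlog = 0.5)  # hypothetical variances
mcse_varbar <- sd(var_hat) / sqrt(length(var_hat))
mcse_varbar  # shrinks at rate 1/sqrt(K) as replicates are added
```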

Code
# Construct long-format data for plotting MCSE of mean variance estimates
var_mcse_df <- results %>%
  transmute(theta, facet_label,
            PI_var_MCSE   = mcse_varbar_delta,
            SAFE_var_MCSE = mcse_varbar_safe) %>%
  pivot_longer(cols = c(PI_var_MCSE, SAFE_var_MCSE),
               names_to = "estimator", values_to = "mcse") %>%
  mutate(estimator = recode(estimator,
                            PI_var_MCSE   = "PI (delta-var)",
                            SAFE_var_MCSE = "SAFE (safe-var)"))

p_var_mcse <- ggplot(var_mcse_df, aes(theta, mcse, colour = estimator, group = estimator)) +
  geom_line() +
  facet_wrap(~ facet_label, ncol = 4) +
  labs(x = expression(theta),
       y = "MCSE of mean(variance estimate)",
       colour = NULL,
       title = "Monte-Carlo error of variance estimators (MCSE of the mean)") +
  base_theme + 
  theme(legend.position = "bottom")

print(p_var_mcse)
Figure 7: MCSE of mean variance estimates (from the simulation summary).

Supplementary Tables

Table S1 — Method feasibility and stability

Table 1 summarizes feasibility and stability diagnostics for lnM estimation across simulation settings. Values are summarized across θ within each design × (n1, n2) facet. Medians represent typical behavior across θ, while maxima highlight worst-case behavior near boundary regions.

Variable Name Definition
θ_min, θ_max Minimum and maximum true lnM values evaluated in the simulation grid for a given design × sample-size combination.
delta_fail_median Median proportion of simulation replicates (across \(\theta\)) for which the Delta plug-in lnM estimator was undefined (e.g., due to boundary violations).
safe_fail_median Median proportion of simulation replicates for which SAFE failed to return a usable lnM estimate.
SAFE_ok_median Median proportion of replicates in which SAFE returned a valid (accepted) bootstrap estimate.
delta_cap_median Median proportion of replicates in which the Delta-method sampling variance was capped for numerical stability.
delta_fail_max, safe_fail_max, delta_cap_max Worst-case (maximum) values of the corresponding metrics across \(\theta\), highlighting boundary regions where estimator behavior deteriorates.
Code
tab1_stability <- results %>%
  group_by(design, n1, n2, facet_label) %>%
  summarise(
    # Range of θ (theta) values evaluated in each facet
    theta_min = min(theta, na.rm = TRUE),
    theta_max = max(theta, na.rm = TRUE),
    
    # Median failure rates across θ
    delta_fail_median = median(delta_fail_prop, na.rm = TRUE),
    safe_fail_median  = median(safe_fail_prop,  na.rm = TRUE),
    
    # Median SAFE convergence / acceptance rate
    SAFE_ok_median    = median(SAFE_ok_rate, na.rm = TRUE),
    
    # Median rate of Delta variance capping
    delta_cap_median  = median(delta_cap_rate, na.rm = TRUE),
    
    # Worst-case behaviour across θ
    delta_fail_max = max(delta_fail_prop, na.rm = TRUE),
    safe_fail_max  = max(safe_fail_prop,  na.rm = TRUE),
    delta_cap_max  = max(delta_cap_rate,  na.rm = TRUE),
    
    .groups = "drop"
  ) %>%
  arrange(factor(facet_label, levels = levels(results$facet_label))) %>%
  mutate(
    design = as.character(design),
    facet_label = as.character(facet_label)
  )

kable(tab1_stability, digits = 3,
      caption = "Table S1. Feasibility / stability diagnostics summarised across θ within each design × (n1,n2) facet.") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))%>%
  kableExtra::scroll_box(width = "100%", height = "400px")
Table 1: Table S1. Feasibility / stability diagnostics summarised across θ within each design × (n1,n2) facet.
design n1 n2 facet_label theta_min theta_max delta_fail_median safe_fail_median SAFE_ok_median delta_cap_median delta_fail_max safe_fail_max delta_cap_max
indep 5 5 indep n1=5 n2=5 0.2 5 0.227 0 1 0.060 0.631 0 0.171
indep 10 10 indep n1=10 n2=10 0.2 5 0.077 0 1 0.025 0.622 0 0.169
indep 20 20 indep n1=20 n2=20 0.2 5 0.009 0 1 0.004 0.584 0 0.155
indep 100 100 indep n1=100 n2=100 0.2 5 0.000 0 1 0.000 0.333 0 0.083
indep 3 7 indep n1=3 n2=7 0.2 5 0.269 0 1 0.067 0.635 0 0.166
indep 6 14 indep n1=6 n2=14 0.2 5 0.110 0 1 0.032 0.630 0 0.170
indep 12 28 indep n1=12 n2=28 0.2 5 0.019 0 1 0.007 0.602 0 0.159
indep 40 160 indep n1=40 n2=160 0.2 5 0.000 0 1 0.000 0.430 0 0.106
paired 5 5 paired n1=5 n2=5 0.6 5 0.002 0 1 0.001 0.493 0 0.097
paired 10 10 paired n1=10 n2=10 0.6 5 0.000 0 1 0.000 0.227 0 0.050
paired 20 20 paired n1=20 n2=20 0.3 5 0.000 0 1 0.000 0.535 0 0.103
paired 100 100 paired n1=100 n2=100 0.2 5 0.000 0 1 0.000 0.181 0 0.041

Table S2 — SAFE bootstrap acceptance (kept/tried)

Table 2 summarizes the SAFE bootstrap acceptance rate (“kept / tried”) across θ within each facet. Quantiles show how acceptance varies across θ, including worst-case regions where accepted draws are rare.

Variable Name Definition
q0, q25, q50, q75, q100 Minimum, 25th percentile, median, 75th percentile, and maximum SAFE bootstrap acceptance rates across values of \(\theta\) within each facet.
mean Mean SAFE bootstrap acceptance rate across \(\theta\).
Code
tab2_accept <- results %>%
  group_by(design, n1, n2, facet_label) %>%
  summarise(
    q0  = quantile(boot_accept_prop, 0.00, na.rm = TRUE),
    q25 = quantile(boot_accept_prop, 0.25, na.rm = TRUE),
    q50 = quantile(boot_accept_prop, 0.50, na.rm = TRUE),
    q75 = quantile(boot_accept_prop, 0.75, na.rm = TRUE),
    q100= quantile(boot_accept_prop, 1.00, na.rm = TRUE),
    mean = mean(boot_accept_prop, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(factor(facet_label, levels = levels(results$facet_label)))

kable(tab2_accept, digits = 4,
      caption = "Table S2. SAFE acceptance rate (kept/tried) summarised across θ for each facet. Quantiles highlight worst-case acceptance.") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))%>%
  kableExtra::scroll_box(width = "100%", height = "400px")
Table 2: Table S2. SAFE acceptance rate (kept/tried) summarised across θ for each facet. Quantiles highlight worst-case acceptance.
design n1 n2 facet_label q0 q25 q50 q75 q100 mean
indep 5 5 indep n1=5 n2=5 0.4820 0.5444 0.6752 0.9079 1.0000 0.7226
indep 10 10 indep n1=10 n2=10 0.4663 0.5892 0.8083 0.9871 1.0000 0.7800
indep 20 20 indep n1=20 n2=20 0.4718 0.7044 0.9441 0.9997 1.0000 0.8428
indep 100 100 indep n1=100 n2=100 0.5779 0.9897 1.0000 1.0000 1.0000 0.9511
indep 3 7 indep n1=3 n2=7 0.4477 0.5030 0.6285 0.8755 0.9998 0.6898
indep 6 14 indep n1=6 n2=14 0.4484 0.5575 0.7642 0.9761 1.0000 0.7571
indep 12 28 indep n1=12 n2=28 0.4586 0.6634 0.9161 0.9991 1.0000 0.8246
indep 40 160 indep n1=40 n2=160 0.5247 0.9510 0.9997 1.0000 1.0000 0.9281
paired 5 5 paired n1=5 n2=5 0.1982 0.6858 0.9722 0.9996 1.0000 0.8156
paired 10 10 paired n1=10 n2=10 0.4293 0.9150 0.9990 1.0000 1.0000 0.9089
paired 20 20 paired n1=20 n2=20 0.1547 0.9570 1.0000 1.0000 1.0000 0.8822
paired 100 100 paired n1=100 n2=100 0.5318 1.0000 1.0000 1.0000 1.0000 0.9709

Acceptance rate is defined as the proportion of bootstrap resamples retained after SAFE’s feasibility and stability checks. Lower values indicate reduced effective bootstrap sample sizes, typically occurring near parameter boundaries or under small sample sizes.

Table S3 — MCSE diagnostics

Table 4 summarizes Monte Carlo precision. Values are the maximum MCSE observed across θ within each facet, giving a conservative measure of simulation uncertainty.

Table 3: Column definitions for Monte Carlo standard errors
Variable Name Definition
mcse_bias_PI_max Maximum Monte Carlo standard error of the Delta-method lnM bias estimate across \(\theta\).
mcse_bias_SAFE_max Maximum Monte Carlo standard error of the SAFE lnM bias estimate across \(\theta\).
mcse_var_PI_max Maximum Monte Carlo standard error of the mean Delta-method variance estimate across \(\theta\).
mcse_var_SAFE_max Maximum Monte Carlo standard error of the mean SAFE variance estimate across \(\theta\).
Code
has_mcse_bias <- all(c("mcse_bias_delta","mcse_bias_safe") %in% names(results))

tab3_mcse <- results %>%
  group_by(design, n1, n2, facet_label) %>%
  summarise(
    # bias MCSE (optional)
    mcse_bias_PI_max   = if (has_mcse_bias) max(mcse_bias_delta, na.rm = TRUE) else NA_real_,
    mcse_bias_SAFE_max = if (has_mcse_bias) max(mcse_bias_safe,  na.rm = TRUE) else NA_real_,
    
    # variance MCSE (always present in the summary results)
    mcse_var_PI_max    = max(mcse_varbar_delta, na.rm = TRUE),
    mcse_var_SAFE_max  = max(mcse_varbar_safe,  na.rm = TRUE),
    
    .groups = "drop"
  ) %>%
  arrange(factor(facet_label, levels = levels(results$facet_label)))

kable(tab3_mcse, digits = 5,
      caption = "Table S3. Monte-Carlo precision (max MCSE over θ within each facet).") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover","condensed"))%>%
  kableExtra::scroll_box(width = "100%", height = "400px")
Table 4: Table S3. Monte-Carlo precision (max MCSE over θ within each facet).
design n1 n2 facet_label mcse_bias_PI_max mcse_bias_SAFE_max mcse_var_PI_max mcse_var_SAFE_max
indep 5 5 indep n1=5 n2=5 0.00293 0.00283 0.03832 0.00039
indep 10 10 indep n1=10 n2=10 0.00278 0.00276 0.03776 0.00037
indep 20 20 indep n1=20 n2=20 0.00263 0.00271 0.03525 0.00036
indep 100 100 indep n1=100 n2=100 0.00246 0.00269 0.02258 0.00035
indep 3 7 indep n1=3 n2=7 0.00305 0.00284 0.03836 0.00042
indep 6 14 indep n1=6 n2=14 0.00292 0.00276 0.03829 0.00038
indep 12 28 indep n1=12 n2=28 0.00273 0.00273 0.03628 0.00037
indep 40 160 indep n1=40 n2=160 0.00274 0.00271 0.02673 0.00035
paired 5 5 paired n1=5 n2=5 0.00253 0.00285 0.02733 0.00036
paired 10 10 paired n1=10 n2=10 0.00219 0.00265 0.01713 0.00033
paired 20 20 paired n1=20 n2=20 0.00221 0.00258 0.02920 0.00031
paired 100 100 paired n1=100 n2=100 0.00190 0.00248 0.01534 0.00029

MCSE values quantify simulation uncertainty due to a finite number of Monte Carlo replicates; smaller values indicate more precise simulation estimates.

Part 3. Supplementary simulation for the absolute standardized mean difference, |d|

To complement the main simulation study for lnM, we conducted an additional simulation for the absolute standardized mean difference, |d|, a commonly used workaround for analysing effect magnitude.

The data-generating process, parameter grid, and performance summaries were the same as in the main simulation (Section 2.3), so that results for |d| and lnM were directly comparable. Here, we describe only the effect-size estimators, targets, and sampling-variance formulas specific to |d|.

For each simulated dataset, we first calculated a signed standardized mean difference and then converted it to magnitude by taking its absolute value.

Signed standardized mean differences

For independent groups, we used the following estimator of \(d\):

\[ \hat d_{\mathrm{IND}} = \frac{\bar X_1 - \bar X_2}{s_p} \]

where

\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

For paired (dependent) groups, we used a repeated-measures Cohen-type standardized mean difference based on the average within-condition standard deviation:

\[ \hat d_{\mathrm{DEP}} = \frac{\bar X_1 - \bar X_2}{s_{\mathrm{av}}} \]

where

\[ s_{\mathrm{av}} = \sqrt{\frac{s_1^2 + s_2^2}{2}} \]
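The two estimators above can be transcribed directly. This is a minimal sketch; `x1` and `x2` stand for hypothetical raw data vectors (with equal length in the paired case):

```r
# Pooled-SD standardized mean difference for independent groups
d_ind <- function(x1, x2) {
  n1 <- length(x1); n2 <- length(x2)
  s_p <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / s_p
}

# Average-SD standardized mean difference for paired groups
d_dep <- function(x1, x2) {
  s_av <- sqrt((var(x1) + var(x2)) / 2)
  (mean(x1) - mean(x2)) / s_av
}

# When the two sample variances are equal, s_p = s_av and the two agree:
d_ind(c(1, 2, 3), c(2, 3, 4))  # -1
d_dep(c(1, 2, 3), c(2, 3, 4))  # -1
```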

Under our simulation parameterization (\(\mu_1 = 0\), \(\mu_2 = \theta\), and \(\sigma_1 = \sigma_2 = 1\)), the population signed standardized mean difference equals \(\theta\) for both designs. However, because the goal here was to evaluate unsigned magnitude estimators, we did not treat \(|\theta|\) as the primary target. Instead, following the logic that magnitude is obtained by transforming a signed effect with sampling uncertainty, we defined the target as the mean of the corresponding folded-normal distribution:

\[ |d|_{\mathrm{FN,true}} = \mathbb{E}(|D|), \qquad D \sim \mathcal{N}(d_{\mathrm{true}}, v_{d,\mathrm{true}}) \] where \(d_{\mathrm{true}} = \theta\) and \(v_{d,\mathrm{true}}\) is the design-specific sampling variance of the signed standardized mean difference.

For independent groups, we used the common large-sample approximation:

\[ v_{d,\mathrm{true,IND}} \approx \frac{n_1 + n_2}{n_1 n_2} + \frac{d_{\mathrm{true}}^2}{2(n_1 + n_2 - 2)} \]

For paired groups, we used the corresponding large-sample approximation for the repeated-measures standardized mean difference:

\[ v_{d,\mathrm{true,DEP}} \approx \frac{2(1-\rho)}{n} + \frac{d_{\mathrm{true}}^2}{2(n-1)} \]

where \(\rho\) is the population within-pair correlation and \(n\) is the number of pairs.
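Transcribed into code, the two large-sample variance approximations are:

```r
# Sampling variance of the signed SMD, independent groups
v_d_ind <- function(d, n1, n2) {
  (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2 - 2))
}

# Sampling variance of the repeated-measures SMD, paired groups
v_d_dep <- function(d, n, rho) {
  2 * (1 - rho) / n + d^2 / (2 * (n - 1))
}

v_d_ind(0, 10, 10)        # 0.2: null-effect variance with n1 = n2 = 10
v_d_dep(0, 10, rho = 0.5) # 0.1: halved by a within-pair correlation of 0.5
```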

Thus, the theoretical folded-normal target was: \[ |d|_{\mathrm{FN,true}} = \sqrt{\frac{2v_{d,\mathrm{true}}}{\pi}} \exp\left(-\frac{d_{\mathrm{true}}^2}{2v_{d,\mathrm{true}}}\right) + |d_{\mathrm{true}}| \left[ 2\Phi\left(\frac{|d_{\mathrm{true}}|}{\sqrt{v_{d,\mathrm{true}}}}\right) - 1 \right] \]

Unsigned magnitude estimators for |d|

The naive estimator of magnitude was simply:

\[ |\hat d|_{\mathrm{naive}} = |\hat d| \]

where \(\hat d\) denotes either \(\hat d_{\mathrm{IND}}\) or \(\hat d_{\mathrm{DEP}}\), depending on the design.

For the naive approach, we used the usual sampling variance of the corresponding signed standardized mean difference, unchanged after taking the absolute value. For independent groups: \[ \widehat{\mathrm{Var}}_{\mathrm{naive,IND}}(|d|) \approx \widehat{\mathrm{Var}}(\hat d_{\mathrm{IND}}) = \frac{n_1 + n_2}{n_1 n_2} + \frac{\hat d_{\mathrm{IND}}^2}{2(n_1 + n_2 - 2)} \]

For paired groups:

\[ \widehat{\mathrm{Var}}_{\mathrm{naive,DEP}}(|d|) \approx \widehat{\mathrm{Var}}(\hat d_{\mathrm{DEP}}) = \frac{2(1-r)}{n} + \frac{\hat d_{\mathrm{DEP}}^2}{2(n-1)} \]

where \(r\) is the sample within-pair correlation.

To examine a distribution-based alternative, we also considered a folded-normal plug-in estimator. Specifically, if the signed standardized mean difference is approximated by

\[ \hat d \sim \mathcal{N}(\mu_d, v_d) \] then the corresponding magnitude \(|\hat d|\) has a folded-normal distribution. Its mean is:

\[ \mathbb{E}(|\hat d|) = \sqrt{\frac{2v_d}{\pi}} \exp\left(-\frac{\mu_d^2}{2v_d}\right) + |\mu_d| \left[ 2\Phi\left(\frac{|\mu_d|}{\sqrt{v_d}}\right) - 1 \right] \]

and its variance is:

\[ \mathrm{Var}(|\hat d|) = \mu_d^2 + v_d - \left\{\mathbb{E}(|\hat d|)\right\}^2 \]

In practice, we used the observed signed effect-size estimate and its reported sampling variance as plug-in values:

\[ \mu_d \leftarrow \hat d, \qquad v_d \leftarrow \widehat{\mathrm{Var}}(\hat d) \]

Therefore, the folded-normal point estimator was:

\[ |\hat d|_{\mathrm{FN}} = \sqrt{\frac{2\hat v_d}{\pi}} \exp\left(-\frac{\hat d^2}{2\hat v_d}\right) + |\hat d| \left[ 2\Phi\left(\frac{|\hat d|}{\sqrt{\hat v_d}}\right) - 1 \right] \]

with sampling variance:

\[ \widehat{\mathrm{Var}}_{\mathrm{FN}}(|d|) = \hat d^2 + \hat v_d - |\hat d|_{\mathrm{FN}}^2 \]
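These two plug-in formulas translate directly into code (a minimal sketch; `d_hat` and `v_hat` stand for an observed signed estimate and its reported sampling variance):

```r
# Folded-normal plug-in point estimate of |d|
fn_point <- function(d_hat, v_hat) {
  sqrt(2 * v_hat / pi) * exp(-d_hat^2 / (2 * v_hat)) +
    abs(d_hat) * (2 * pnorm(abs(d_hat) / sqrt(v_hat)) - 1)
}

# Its plug-in sampling variance
fn_var <- function(d_hat, v_hat) {
  d_hat^2 + v_hat - fn_point(d_hat, v_hat)^2
}

fn_point(0, 1)       # half-normal mean, sqrt(2/pi) ≈ 0.798
fn_point(0.4, 0.05)  # folding inflates a small signed estimate
fn_var(0.4, 0.05)    # with a correspondingly shrunken variance
```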

Performance summaries

For each parameter combination, we repeated the simulation \(K = 10^5\) times. For every replicate, we recorded: (i) the signed standardized mean difference \(\hat d\), (ii) the naive absolute-value estimator \(|\hat d|_{\mathrm{naive}}\), (iii) its sampling variance from the equations above, (iv) the folded-normal estimator \(|\hat d|_{\mathrm{FN}}\), and (v) its sampling variance from the folded-normal plug-in formula.

We then evaluated both point estimators against the folded-normal target above. For each estimator \(|\hat d|_l \in \{|\hat d|_{\mathrm{naive}}, |\hat d|_{\mathrm{FN}}\}\), we calculated empirical bias as:

\[ \mathrm{bias}\left(|\hat d|_l\right) = \overline{|\hat d|_l} - |d|_{\mathrm{FN,true}} \]

where

\[ \overline{|\hat d|_l} = \frac{1}{K} \sum_{k=1}^{K} |\hat d|_l^{(k)} \]

For the variance estimators, we calculated relative bias against the Monte Carlo variance of each point estimator:

\[ \mathrm{relative\ bias}\left(\widehat{\mathrm{Var}}_l\right) = \frac{ \overline{\widehat{\mathrm{Var}}_l} - \mathrm{Var}_{\mathrm{MC}}\left(|\hat d|_l\right) }{ \mathrm{Var}_{\mathrm{MC}}\left(|\hat d|_l\right) } \times 100 \]

where

\[ \overline{\widehat{\mathrm{Var}}_l} = \frac{1}{K} \sum_{k=1}^{K} \widehat{\mathrm{Var}}_l^{(k)} \]

and

\[ \mathrm{Var}_{\mathrm{MC}}\left(|\hat d|_l\right) = \frac{1}{K-1} \sum_{k=1}^{K} \left( |\hat d|_l^{(k)} - \overline{|\hat d|_l} \right)^2 \]

We also calculated root mean squared error (RMSE), nominal 95% Wald-interval coverage, and Monte Carlo standard errors (MCSE). In this supplementary simulation, the folded-normal target is the appropriate benchmark because both the naive absolute-value estimator and the folded-normal estimator aim to estimate an unsigned magnitude induced by folding a signed effect-size distribution, rather than the simpler quantity \(|d_{\mathrm{true}}|\).

Simulation results for |d|

Bias of point estimators

Figure 8 shows the empirical bias of the two |d| estimators across the simulation grid. The folded-normal estimator generally reduced bias relative to the naive absolute-value estimator when standardized mean differences were small, especially for independent-group designs. However, both approaches remained positively biased because the target itself is an unsigned folded quantity.

Figure 8: Empirical bias of the two |d| estimators across the simulation grid for independent and paired designs. The folded-normal estimator is shown in blue and the naive absolute-value estimator in red.

Relative bias of variance estimators

Figure 9 shows the relative bias of the variance estimators for the two |d| approaches. For independent-group designs, both variance estimators tended to overestimate variance when \(\theta\) was very small and then underestimate it over much of the remaining range. For paired designs, patterns differed more clearly with sample size, but the folded-normal and naive variance estimators remained broadly similar.

Figure 9: Relative bias of the variance estimators for the naive and folded-normal |d| estimators across the simulation grid.

Coverage of 95% Wald intervals

Figure 10 shows the coverage of nominal 95% Wald intervals for the two |d| estimators. Coverage was often close to nominal overall, but the folded-normal estimator showed a clearer dip below 95% in some independent-group scenarios with small \(\theta\), whereas paired designs were generally slightly conservative.

Figure 10: Coverage of nominal 95% Wald intervals for the naive and folded-normal |d| estimators across the simulation grid.

Root mean squared error

Figure 11 summarizes overall estimator performance in terms of RMSE. Differences between the two |d| estimators were generally modest, with the folded-normal estimator showing slightly lower RMSE in some scenarios, especially when \(\theta\) was small.

Figure 11: Root mean squared error (RMSE) of the naive and folded-normal |d| estimators across the simulation grid.

Direct comparison with lnM

To facilitate comparison with the main simulation for lnM, Figure 12 and Figure 13 show the corresponding bias and variance-estimation patterns for |d| and lnM together. These combined plots make the main contrast visually clear: lnM can exhibit much larger point-estimator and variance-estimator problems near the boundary where between-group separation is weak, whereas |d|-based estimators remain defined but target a different folded quantity.

Figure 12: Empirical bias of |d| and lnM estimators across the simulation grid.
Figure 13: Relative bias of variance estimators for |d| and lnM across the simulation grid.

Monte Carlo error diagnostics

For completeness, Figure 14 and Figure 15 show Monte Carlo standard errors for the estimated bias and variance summaries. These diagnostics indicate that the reported simulation summaries were estimated with high precision across the parameter grid.

Figure 14: Monte Carlo standard errors of estimated bias for the |d| estimators.
Figure 15: Monte Carlo standard errors of the mean estimated sampling variances for the |d| estimators.


Session info

sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Guernsey.utf8  LC_CTYPE=English_Guernsey.utf8   
[3] LC_MONETARY=English_Guernsey.utf8 LC_NUMERIC=C                     
[5] LC_TIME=English_Guernsey.utf8    

time zone: America/Edmonton
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] multcomp_1.4-30     TH.data_1.1-5       MASS_7.3-65        
 [4] survival_3.8-6      mvtnorm_1.3-3       patchwork_1.3.2    
 [7] kableExtra_1.4.0    orchaRd_2.1.3       here_1.0.2         
[10] metafor_4.8-0       numDeriv_2016.8-1.1 metadat_1.4-0      
[13] Matrix_1.7-4        lubridate_1.9.5     forcats_1.0.1      
[16] stringr_1.6.0       dplyr_1.2.0         purrr_1.2.1        
[19] readr_2.2.0         tidyr_1.3.2         tibble_3.3.1       
[22] ggplot2_4.0.2       tidyverse_2.0.0    

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1   viridisLite_0.4.3  vipor_0.4.7        farver_2.1.2      
 [5] S7_0.2.1           fastmap_1.2.0      latex2exp_0.9.8    mathjaxr_2.0-0    
 [9] digest_0.6.39      timechange_0.4.0   estimability_1.5.1 lifecycle_1.0.5   
[13] magrittr_2.0.4     compiler_4.5.2     rlang_1.1.7        tools_4.5.2       
[17] yaml_2.3.12        knitr_1.51         labeling_0.4.3     htmlwidgets_1.6.4 
[21] bit_4.6.0          xml2_1.5.2         RColorBrewer_1.1-3 withr_3.0.2       
[25] tester_0.3.0       grid_4.5.2         xtable_1.8-8       emmeans_2.0.2     
[29] scales_1.4.0       cli_3.6.5          rmarkdown_2.30     crayon_1.5.3      
[33] generics_0.1.4     otel_0.2.0         rstudioapi_0.18.0  tzdb_0.5.0        
[37] pbapply_1.7-4      ggbeeswarm_0.7.3   splines_4.5.2      parallel_4.5.2    
[41] vctrs_0.7.1        sandwich_3.1-1     jsonlite_2.0.0     hms_1.1.4         
[45] bit64_4.6.0-1      beeswarm_0.4.0     systemfonts_1.3.2  glue_1.8.0        
[49] codetools_0.2-20   stringi_1.8.7      gtable_0.3.6       pillar_1.11.1     
[53] htmltools_0.5.9    R6_2.6.1           textshaping_1.0.5  rprojroot_2.1.1   
[57] vroom_1.7.0        evaluate_1.0.5     lattice_0.22-9     svglite_2.2.2     
[61] coda_0.19-4.1      nlme_3.1-168       mgcv_1.9-4         xfun_0.56         
[65] zoo_1.8-15         pkgconfig_2.0.3