Worked examples + extra results from the simulation (figures + tables)
Authors
Shinichi Nakagawa
Ayumi Mizuno
Coralie Williams
Malgorzata Lagisz
Rose E. O’Dea
Daniel W. A. Noble
Alistair M. Senior
Erick Lundgren
Santiago Ortega
Published
April 3, 2026
Code
# ------------------------------------------------------------------------------
# SETUP
# This chunk:
# 1) Sets global chunk options (figures, widths)
# 2) Loads required packages
# 3) Defines helper functions used throughout the supplement
# ------------------------------------------------------------------------------
knitr::opts_chunk$set(
  fig.align = "center",
  fig.width = 10,
  fig.height = 6,
  out.width = "100%"
)

suppressPackageStartupMessages({
  library(tidyverse)
  library(metafor)    # escalc(), rma.mv()
  library(here)       # robust file paths
  library(orchaRd)    # orchard_plot(), bubble_plot(), .safe_lnM_indep()
  library(kableExtra) # HTML tables
  library(patchwork)  # combining ggplots
  library(multcomp)   # multiple comparisons
})

# ---- Ensure compatible orchaRd version -------------------------------------
orchard_min_version <- "2.1.3" # version that includes SAFE lnM + seed argument

if (packageVersion("orchaRd") < package_version(orchard_min_version)) {
  stop(
    "This tutorial requires orchaRd >= ", orchard_min_version, ".\n",
    "Your version is ", packageVersion("orchaRd"), ".\n\n",
    "Please update by running:\n",
    "  install.packages('orchaRd')\n",
    "or, if needed:\n",
    "  remotes::install_github('daniel1noble/orchaRd')"
  )
}

# ---- Check that the required SAFE function is available ---------------------
if (!exists(".safe_lnM_indep", where = asNamespace("orchaRd"), inherits = FALSE)) {
  stop(
    "This tutorial requires orchaRd:::.safe_lnM_indep(), but it was not found.\n",
    "Please update orchaRd to a version that includes SAFE lnM."
  )
}

# ---- SAFE lnM (orchaRd) -----------------------------------------------------
# Note on purrr::pmap_dfr():
# pmap_dfr() applies a function row-by-row to several input columns
# (m1i, m2i, sd1i, ...), returning a data frame with one output row per
# effect size.
#
# Settings below match the simulation defaults used in this project.
# Passing a fixed seed makes each SAFE estimate reproducible.
get_lnM_safe <- function(m1, m2, s1, s2, n1, n2,
                         min_kept = 100000, chunk_init = 5000, seed = 123) {
  out <- orchaRd:::.safe_lnM_indep(
    x1bar = m1, x2bar = m2,
    sd1 = s1, sd2 = s2,
    n1 = n1, n2 = n2,
    min_kept = min_kept,
    chunk_init = chunk_init,
    seed = seed
  )
  tibble::tibble(
    yi_lnM_safe = out$lnM_SAFE,
    vi_lnM_safe = out$var_lnM_SAFE,
    draws_kept = out$kept,
    draws_total = out$total,
    attempts = out$attempts,
    status = out$status
  )
}

# ---- Folded-normal helper for magnitude of SMD ------------------------------
# This converts a signed normal estimate (mu, var) into the expected magnitude
# under the implied folded-normal distribution.
folded_norm <- function(mu, var) {
  sigma <- sqrt(var)
  if (sigma == 0) return(c(point = abs(mu), var = 0))
  z <- mu / sigma
  meanFN <- sigma * sqrt(2 / pi) * exp(-0.5 * z^2) + mu * (2 * pnorm(z) - 1)
  varFN <- mu^2 + var - meanFN^2
  c(point = meanFN, var = varFN)
}

# ---- Delta plug-in lnM (analytic) -------------------------------------------
# IMPORTANT: This requires lnM_delta1_indep() to be available; source it here.
source(here("R", "SAFE_fun.R"))

get_lnM_raw <- function(m1, m2, s1, s2, n1, n2) {
  out <- lnM_delta1_indep(m1, m2, s1, s2, n1, n2) # returns c(point, var)
  tibble(
    yi_lnM_raw = out["point"],
    vi_lnM_raw = out["var"]
  )
}
Overview
This supplement has two parts:
Part I — Worked examples (Examples 1–2).
We demonstrate a complete, end-to-end workflow for quantifying effect magnitude using lnM. For each dataset, we first compute a conventional standardized mean difference (SMD) and compare results obtained using absolute SMD–based measures with those obtained using lnM.
We note that we estimate lnM with two approaches:
a Delta plug-in estimator and
the SAFE bootstrap bias-corrected point estimate (SAFE-BC), which is designed to remain usable when the plug-in estimator is undefined (e.g., when \(MS_B \le MS_W\)).
Using these estimates, we then fit multilevel meta-analytic models and meta-regressions and visualize results using the same plot types reported in the main manuscript.
Part II — Simulation reproduction (Figures + Tables).
We reproduce all simulation-based figures and tables reported in the manuscript using pre-computed simulation summaries and, where required, outputs from individual simulation replicates (i.e. results from each Monte Carlo run within a design cell). These results characterise estimator performance in terms of bias, variance behaviour, empirical coverage, root mean square error (RMSE), and Monte Carlo standard error (MCSE).
We additionally provide compact tables summarising estimator feasibility, numerical stability, and SAFE bootstrap acceptance rates across simulation designs.
Part I — Worked examples
Part I illustrates how lnM is computed, analysed, and interpreted in real meta-analytic datasets using two published examples with contrasting data structures.
Example 1 — Almeida et al. nutrition dataset
Almeida et al. (2021) examined whether experimental manipulations of nutrition produce consistently large deviations from control conditions across a range of physiological and performance traits. Because the biological interest lies in the magnitude of responses rather than their direction, this dataset provides a natural setting for comparing analyses based on absolute SMD measures with corresponding analyses based on lnM.
Effect size calculations
This section calculates four effect-size quantities from the same raw inputs (group means, SDs, and sample sizes). Each step adds new columns to dat.
# Load the dataset from CSV files
dat <- read_csv(here("data", "data_almeida.csv"), show_col_types = FALSE)
Note
| Type | Variable Names | Description |
|---|---|---|
| Inputs (Group 1) | m1i, sd1i, n1i | Mean, SD, and sample size for Group 1 |
| Inputs (Group 2) | m2i, sd2i, n2i | Mean, SD, and sample size for Group 2 |
| Output (1) | yi_d, vi_d | Signed SMD |
| Output (2) | yi_d_abs, vi_d_abs | Magnitude of SMD via folded-normal |
| Output (3) | yi_lnM_raw, vi_lnM_raw | lnM via Delta plug-in |
| Output (4) | yi_lnM_safe, vi_lnM_safe | lnM via SAFE |
Conventional SMD (signed)
We first compute the standardized mean difference (SMD). This is the conventional effect size used in many meta-analyses. metafor::escalc() returns a point estimate (yi_d) and its sampling variance (vi_d).
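A minimal sketch of this step on toy data, using the input column names from the variable table above (the values and the exact call here are illustrative; the original chunk may differ):

```r
library(metafor)

# Toy rows standing in for dat (column names follow the variable table above)
toy <- data.frame(
  m1i = c(10.2, 8.5), sd1i = c(2.1, 1.8), n1i = c(12, 8),
  m2i = c(9.1, 9.0),  sd2i = c(2.0, 1.6), n2i = c(12, 10)
)

# measure = "SMD" gives the bias-corrected standardized mean difference;
# escalc() appends the point estimate (yi_d) and sampling variance (vi_d)
toy <- escalc(
  measure = "SMD",
  m1i = m1i, sd1i = sd1i, n1i = n1i,
  m2i = m2i, sd2i = sd2i, n2i = n2i,
  data = toy, var.names = c("yi_d", "vi_d")
)
```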
Many researchers summarise “effect magnitude” by taking \(|SMD|\). However, absolute values are biased upward, especially when sampling error is large.
Instead of taking \(|SMD|\) directly, we use the folded-normal distribution implied by \((\hat{d}, \mathrm{Var}(\hat{d}))\), yielding yi_d_abs and vi_d_abs.
dat <- dat %>%
  mutate(
    abs_d = map2_dfr(yi_d, vi_d, ~ {
      out <- folded_norm(.x, .y)
      tibble(yi_d_abs = out["point"], vi_d_abs = out["var"])
    })
  ) %>%
  unnest(abs_d)
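To see why the folded-normal correction matters, consider a true effect of exactly zero: the naive \(|SMD|\) of a noisy estimate is biased upward, and the folded-normal mean makes that bias explicit. A standalone check using the same formula as the folded_norm() helper from the setup chunk (redefined here so the snippet runs on its own):

```r
# Folded-normal mean/variance for a signed estimate (mu, var);
# same formula as the helper defined in the setup chunk.
folded_norm <- function(mu, var) {
  sigma <- sqrt(var)
  if (sigma == 0) return(c(point = abs(mu), var = 0))
  z <- mu / sigma
  meanFN <- sigma * sqrt(2 / pi) * exp(-0.5 * z^2) + mu * (2 * pnorm(z) - 1)
  c(point = meanFN, var = mu^2 + var - meanFN^2)
}

# True SMD = 0 with sampling variance 0.2: the expected magnitude is
# sqrt(2 * 0.2 / pi) ≈ 0.36, not 0 -- the upward bias of naive |SMD|.
fn0 <- folded_norm(0, 0.2)
```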
Note
About map2_dfr() and pmap_dfr() (row-wise function application): purrr::map2_dfr() applies a function once per row using two columns as inputs (as above, with yi_d and vi_d); purrr::pmap_dfr() generalises this to any number of columns (e.g., means, SDs, and sample sizes). Both return a data frame with one output row per effect size.
lnM via Delta plug-in estimator
We compute lnM using an analytic (Delta-method) plug-in estimator. This can be undefined near boundary conditions, which we diagnose later. The result is stored as yi_lnM_raw with sampling variance vi_lnM_raw.
dat <- dat %>%
  mutate(
    lnM_raw = purrr::pmap_dfr(
      list(m1i, m2i, sd1i, sd2i, n1i, n2i),
      get_lnM_raw
    )
  ) %>%
  unnest(lnM_raw)
lnM via SAFE
We compute lnM using SAFE (single-fit parametric bootstrap), which is designed to remain usable when the plug-in estimator fails. The result is stored as yi_lnM_safe with sampling variance vi_lnM_safe.
dat <- dat %>%
  dplyr::mutate(
    lnM_safe = purrr::pmap_dfr(
      list(m1i, m2i, sd1i, sd2i, n1i, n2i),
      get_lnM_safe
    )
  ) %>%
  tidyr::unnest(lnM_safe)
How many lnM estimates can be obtained?
The Delta plug-in estimator of lnM can be undefined for some effect sizes, particularly near boundary conditions. SAFE is designed to be more robust in these cases by relying on a single-fit parametric bootstrap. The table below summarises (i) how often each method yields a usable lnM point estimate and sampling variance, and (ii) how often SAFE provides valid estimates in cases where the Delta plug-in estimator fails.
Note
Percentages are computed relative to the total number of effect sizes in the dataset.
Because lnM can be undefined when the between-group variance is not estimable, we first quantify how often each estimator returns usable lnM point estimates and sampling variances.
Code
stopifnot(exists("dat"), is.data.frame(dat))
n_total <- nrow(dat)

# ---- Helper counts -----------------------------------------------------------
summary_tbl <- tibble::tibble(
  Metric = c(
    "Total effect sizes",
    "Delta plug-in lnM: point estimate unavailable (undefined)",
    "Delta plug-in lnM: sampling variance unavailable",
    "SAFE lnM: point estimate unavailable",
    "SAFE lnM: sampling variance unavailable",
    "SAFE rescue: point estimate available when Delta plug-in is undefined",
    "SAFE rescue: sampling variance available when Delta plug-in is undefined"
  ),
  Count = c(
    n_total,
    sum(is.na(dat$yi_lnM_raw)),
    sum(is.na(dat$vi_lnM_raw)),
    sum(is.na(dat$yi_lnM_safe)),
    sum(is.na(dat$vi_lnM_safe)),
    sum(is.na(dat$yi_lnM_raw) & !is.na(dat$yi_lnM_safe)),
    sum(is.na(dat$vi_lnM_raw) & !is.na(dat$vi_lnM_safe))
  )
) %>%
  dplyr::mutate(
    Percent = dplyr::if_else(
      Metric == "Total effect sizes",
      NA_real_,
      100 * Count / n_total
    )
  )

# Headline numbers people care about
n_saved <- sum(!is.na(dat$yi_lnM_safe))
p_saved <- 100 * n_saved / n_total
cat(sprintf(
  "\n\n**SAFE produced defined lnM point estimates for %d/%d effect sizes (%.1f%%).**\n\n",
  n_saved, n_total, p_saved
))
**SAFE produced defined lnM point estimates for 683/683 effect sizes (100.0%).**
Code
summary_tbl %>%
  kableExtra::kable(
    digits = 1,
    caption = "Data availability for lnM estimates: Delta plug-in vs SAFE (counts and % of all effect sizes).",
    align = c("l", "r", "r")
  ) %>%
  kableExtra::kable_styling(full_width = FALSE) %>%
  kableExtra::scroll_box(width = "100%", height = "400px")
Data availability for lnM estimates: Delta plug-in vs SAFE (counts and % of all effect sizes).
| Metric | Count | Percent |
|---|---:|---:|
| Total effect sizes | 683 | NA |
| Delta plug-in lnM: point estimate unavailable (undefined) | 270 | 39.5 |
| Delta plug-in lnM: sampling variance unavailable | 270 | 39.5 |
| SAFE lnM: point estimate unavailable | 0 | 0.0 |
| SAFE lnM: sampling variance unavailable | 0 | 0.0 |
| SAFE rescue: point estimate available when Delta plug-in is undefined | 270 | 39.5 |
| SAFE rescue: sampling variance available when Delta plug-in is undefined | 270 | 39.5 |
Tip
Rows labelled “SAFE rescue” count effect sizes for which the Delta plug-in estimator failed but SAFE returned a valid estimate.
This table shows that the Delta plug-in estimator fails to return a defined lnM point estimate or sampling variance for a substantial fraction of effect sizes in this dataset. In contrast, SAFE returns both a point estimate and a sampling variance for all effect sizes. As a result, SAFE “rescues” a large number of comparisons that would otherwise need to be excluded from lnM-based analyses.
Meta-analysis model
In this worked example we focus on body condition outcomes from the Almeida dataset. Almeida et al. report multiple outcome categories (e.g., condition and other trait classes); to keep the tutorial concrete and comparable across methods, we restrict to the Category == "Condition" subset.
We then fit the same multilevel random-effects structure for two magnitude metrics: (i) the folded-normal magnitude of SMD and (ii) lnM estimated with SAFE. In both cases, effect sizes are nested within studies (random intercepts for Study and for the within-study effect-size ID, ES.ID).
dat_cond <- dat %>%
  filter(Category == "Condition") %>%
  rename(exposure = magnitude)

dim(dat_cond)
[1] 336 33
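The multilevel structure described above can be sketched with metafor::rma.mv(). The snippet below uses simulated toy data (the actual model objects are fitted on dat_cond in chunks not shown here), but the random-effects specification matches the text: random intercepts for Study and for the within-study effect-size ID, ES.ID.

```r
library(metafor)

# Toy stand-in for dat_cond: lnM (SAFE) effect sizes nested within studies
set.seed(1)
toy <- data.frame(
  Study = rep(paste0("S", 1:10), each = 3),
  ES.ID = 1:30,
  yi_lnM_safe = rnorm(30, mean = 1, sd = 0.4),
  vi_lnM_safe = runif(30, 0.02, 0.10)
)

# Multilevel random-effects model: random intercepts for Study and ES.ID
ma_toy <- rma.mv(
  yi = yi_lnM_safe, V = vi_lnM_safe,
  random = list(~ 1 | Study, ~ 1 | ES.ID),
  data = toy
)
```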
Absolute SMD (folded-normal)
We meta-analyse the SMD magnitude using the folded-normal transformation.
For interpretability, we map \(\ln M\) onto an SMD-equivalent effect size, \(d_{\mathrm{eq}}\), expressed in units of the within-group standard deviation, using the large-sample approximation implemented in the helper lnm_to_deq_approx():
Overall model (ma_safe): d_eq (approx) = 1.22 (95% CI: 0.98 to 1.54)
Because this transformation is monotone (larger lnM always implies larger \(d_{\mathrm{eq}}\)), we can obtain a 95% interval for \(d_{\mathrm{eq}}\) by transforming the lower and upper endpoints of the lnM interval.
In the Almeida dataset, the folded-normal SMD magnitude is constrained to be non-negative and many estimates accumulate near zero, with a small number of extreme values extending the upper tail. lnM, in contrast, is defined on an unbounded scale and spreads effect magnitudes more evenly, making between-study differences in overall magnitude easier to resolve when direction is not of primary interest.
Meta-regression
We next examine whether variation in effect magnitude is associated with study-level covariates. Specifically, we regress each magnitude metric on exposure intensity, exposure duration, and recovery period while retaining the same multilevel structure (random intercepts for Study and within-study effect-size ID, ES.ID).
The folded-normal SMD magnitude is bounded at zero and shows strong compression near the lower end of the scale, which can attenuate moderator slopes. lnM is unbounded and approximately normal, providing greater dynamic range and making moderator relationships (e.g., with recovery period) easier to detect and interpret under the same multilevel structure.
Small-study effects diagnostic
We use an Egger-style meta-regression (Egger et al. 1997) to assess whether estimated effect magnitudes systematically vary with study size, a pattern that can be consistent with small-study effects (including publication bias). In this approach, effect sizes are regressed on a measure of study precision rather than being compared visually in a funnel plot.
For lnM, using the usual sampling standard error as a predictor is problematic because lnM and its estimated standard error are both functions of the same underlying variance components. This creates a structural dependence between the response and predictor that violates the assumptions of Egger-type regressions. Following recent recommendations for multilevel meta-analyses, we therefore use a sample-size-based measure of precision instead (Nakagawa et al. 2022).
We define the effective sample size, based on the harmonic mean of the two group sizes, as:
\[
n_0 = \frac{n_1 n_2}{n_1 + n_2}
\]
and fit two related models using:
1 / sqrt(n₀), an SE-analogue, to test for the presence of small-study effects; and
1 / n₀, a variance-analogue, to obtain a bias-adjusted overall effect estimate if evidence of small-study effects is detected.
Both predictors depend only on sample sizes and are therefore independent of the observed lnM values themselves. This allows small-study effects to be assessed without inducing artefactual correlations between effect sizes and their precision.
Worked example: Egger-style models using n₀
# Create sample-size-based predictors using the effective sample size n0
dat_cond$n0  <- (dat_cond$n1i * dat_cond$n2i) / (dat_cond$n1i + dat_cond$n2i)
dat_cond$nSE <- 1 / sqrt(dat_cond$n0)
dat_cond$nV  <- 1 / dat_cond$n0
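With these predictors in place, the two Egger-style models can be fitted with the same multilevel structure. A sketch on toy data (the object names egger_safe1/egger_safe2 follow the text; the original fitting chunk is not shown, so the exact call may differ):

```r
library(metafor)

# Toy stand-in for dat_cond with sample-size-based predictors
set.seed(2)
toy <- data.frame(
  Study = rep(paste0("S", 1:10), each = 3),
  ES.ID = 1:30,
  yi_lnM_safe = rnorm(30, 1, 0.4),
  vi_lnM_safe = runif(30, 0.02, 0.10),
  n0 = sample(5:50, 30, replace = TRUE)
)
toy$nSE <- 1 / sqrt(toy$n0)
toy$nV  <- 1 / toy$n0

# Slope on 1/sqrt(n0): Egger-style test for small-study effects
egger_safe1 <- rma.mv(yi_lnM_safe, vi_lnM_safe, mods = ~ nSE,
                      random = list(~ 1 | Study, ~ 1 | ES.ID), data = toy)

# Intercept of the 1/n0 model: bias-adjusted overall lnM (at infinite n)
egger_safe2 <- rma.mv(yi_lnM_safe, vi_lnM_safe, mods = ~ nV,
                      random = list(~ 1 | Study, ~ 1 | ES.ID), data = toy)
```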
Note
How to read these models
The models using 1 / sqrt(n₀) test for the presence of small-study effects (Egger-style slope test).
The model using 1 / n₀ provides a bias-adjusted estimate of the overall effect size, evaluated at infinite sample size.
For the folded-normal absolute SMD, the fitted slope is shallow and the bulk of estimates cluster near the lower bound of the scale, indicating limited sensitivity of absolute SMD to study size and little ability to distinguish small-study effects. In contrast, lnM exhibits a clearer positive association with 1 / sqrt(n₀), with smaller effective sample sizes corresponding to larger estimated magnitudes. This pattern indicates the presence of small-study effects for lnM and illustrates that lnM provides a wider dynamic range for diagnosing size-related structure in effect magnitudes under the same multilevel model.
For interpretability, we convert the intercept from the 1 / n₀ model (the bias-adjusted lnM) to an SMD-equivalent magnitude using the large-sample approximation.
lnm_to_deq_approx(egger_safe2, "Egger-style model (egger_safe2)")
Egger-style model (egger_safe2): d_eq (approx) = 0.78 (95% CI: 0.57 to 1.08)
Sensitivity analysis
We repeated the lnM analyses after restricting effect sizes to comparisons with total sample size n₁ + n₂ ≥ 40.
These sensitivity results indicate that lnM-based inferences are robust to the exclusion of small-sample comparisons.
Figure 1
Figure 1 collates the three analyses above (overall model, meta-regression, and small-study effects diagnostic) to highlight how inference differs when magnitude is quantified using folded-normal SMD versus lnM (SAFE).
# Combine panels into a 3 × 2 layout:
# Row 1: overall meta-analyses (absolute SMD vs lnM)
# Row 2: meta-regressions (absolute SMD vs lnM)
# Row 3: Egger-style diagnostics (absolute SMD vs lnM)
(p1 + ggtitle("SMD estimates") + p2 + ggtitle("lnM estimates")) /
  (p3 + p4) /
  (p5 + p6) +
  plot_annotation(tag_levels = 'A') &
  theme(plot.tag = element_text(size = 16, face = "bold"))
Figure 1
Example 2 — Kunc & Schmidt anthropogenic noise dataset
This dataset synthesises experiments testing whether exposure to anthropogenic noise alters animal behaviour or physiology relative to control (quiet) conditions (Kunc and Schmidt 2019). Each effect size contrasts a noise treatment with a control group. In many applications the direction of change can also be important; here, we focus on magnitude to illustrate how lnM can summarise the strength of responses across taxa and response variables.
Effect size calculations
This section mirrors Example 1. We compute a conventional standardized mean difference (here using the unequal-variance variant, SMDH), its folded-normal magnitude, and lnM using both the Delta plug-in and SAFE.
# Load data
dat_raw <- read.csv(here("data", "kunc_2019.csv"), na.strings = c("", "NA")) %>%
  as.data.frame()

# Basic cleaning: keep rows with species info and usable SDs
dat_raw <- dat_raw %>%
  dplyr::filter(!is.na(species.latin), species.latin != "NA") %>%
  dplyr::mutate(
    sd.control = as.numeric(as.character(sd.control)),
    sd.noise = as.numeric(as.character(sd.noise))
  ) %>%
  dplyr::filter(!is.na(sd.control), !is.na(sd.noise))

dim(dat_raw)
[1] 490 14
Note
| Type | Variable Names | Description |
|---|---|---|
| Inputs (Control) | mean.control, sd.control, sample.size.control | Mean, SD, and sample size in the control condition |
| Inputs (Noise) | mean.noise, sd.noise, sample.size.noise.1 | Mean, SD, and sample size in the noise condition |
| Output (1) | yi_d, vi_d | Signed SMDH |
| Output (2) | yi_d_abs, vi_d_abs | Magnitude of SMDH via folded-normal |
| Output (3) | yi_lnM_raw, vi_lnM_raw | lnM via Delta plug-in |
| Output (4) | yi_lnM_safe, vi_lnM_safe | lnM via SAFE |
Standardized mean difference
We compute an unequal-variance standardized mean difference (SMDH) using metafor::escalc(). This returns a point estimate (yi_d) and sampling variance (vi_d).
lnm_to_deq_approx(ma_safe_ex2, "Overall model (Example 2)")
Overall model (Example 2): d_eq (approx) = 0.89 (95% CI: 0.74 to 1.07)
In this example, absolute SMD estimates again cluster near the lower bound of the scale, with a small number of extreme values dominating visual spread. By contrast, lnM produces a more symmetric and dispersed distribution of magnitudes, allowing differences in response strength across studies to be more clearly resolved. As in Example 1, this illustrates how lnM can summarise the magnitude of treatment effects across heterogeneous outcomes while retaining a scale that is better suited for synthesis and model-based inference.
Meta-regression
We next examine whether effect magnitudes vary across broad taxonomic groups. Here, taxon.for.plot is treated as a categorical moderator (the model formula drops the intercept via - 1, so each group mean is estimated directly).
# Pairwise comparisons among taxon means (Tukey-type)
summary(
  glht(mod_abs_ex2, linfct = cbind(contrMat(rep(1, 4), type = "Tukey"))),
  test = adjusted("none")
)
# Pairwise comparisons among taxon means (Tukey-type)
summary(
  glht(mod_safe_ex2, linfct = cbind(contrMat(rep(1, 4), type = "Tukey"))),
  test = adjusted("none")
)
Using folded-normal SMDs, group means are strongly compressed toward zero and heavily influenced by a small number of extreme values, resulting in substantial overlap among taxa and limited ability to distinguish typical response magnitudes. In contrast, lnM yields a wider and more symmetric distribution of effect magnitudes within groups, revealing clearer separation among taxa and reducing the influence of extreme observations.
Small-study effects diagnostic
We use the same approach as in Example 1.
dat <- dat %>%
  dplyr::mutate(
    n0 = (sample.size.control * sample.size.noise.1) /
      (sample.size.control + sample.size.noise.1),
    nSE = 1 / sqrt(n0),
    nV = 1 / n0
  )
The folded-normal SMD shows little association with precision, reflecting compression of effect magnitudes near zero. In contrast, lnM exhibits a clear positive relationship with decreasing study size, indicating greater sensitivity to size-dependent differences in effect magnitude.
Comparison of overall meta-analytic estimates, taxon-level meta-regressions, and Egger-style small-study effect diagnostics for folded-normal SMD magnitude and lnM (SAFE) in the Kunc & Schmidt dataset.
# Combine panels into a 3 × 2 layout:
# Row 1: overall meta-analyses (absolute SMD vs lnM)
# Row 2: taxon meta-regressions (absolute SMD vs lnM)
# Row 3: Egger-style diagnostics (absolute SMD vs lnM)
(p7 + p8) / (p9 + p10) / (p11 + p12) +
  plot_annotation(tag_levels = 'A') &
  theme(plot.tag = element_text(size = 16, face = "bold"))
Part II — Simulation reproduction (figures + tables)
Part II reproduces all simulation-based figures and tables reported in the manuscript using:
a pre-computed simulation summary (one row per simulation setting), and
where needed, replicate-level outputs (results from each Monte Carlo replicate within a simulation setting).
A simulation setting (or grid cell) is one combination of: design × sample sizes × true lnM value (theta, θ). Within each setting, we ran several Monte Carlo replicates, computed lnM using the Delta plug-in and SAFE-BC, and summarised estimator performance.
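For concreteness, such a grid can be built with expand.grid(). The values below are illustrative, drawn from the facets shown in Tables S1–S2; the actual grid lives in the driver script.

```r
# Illustrative simulation grid: design x sample sizes x true lnM (theta).
# These values are examples, not the exact grid used by the driver script.
grid <- expand.grid(
  design = c("indep", "paired"),
  n_pair = c("5,5", "10,10", "20,20", "100,100"),
  theta = seq(0.2, 5, by = 0.2)
)
# One row per simulation setting; Monte Carlo replicates are run within each row
```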
Note
Key quantities
| Quantity | Definition |
|---|---|
| \(\theta\) (theta) | The true lnM value used to generate data in the simulation setting (the \(x\)-axis in all plots). |
| Bias | \(\text{mean}(\text{estimate}) - \text{true lnM}\), computed within each setting. |
| Coverage | The proportion of nominal 95% intervals that contain the true lnM (target is 0.95). |
| MCSE | Monte Carlo standard error; uncertainty in simulation summaries due to a finite number of Monte Carlo replicates. |
Supplementary Figures
Load summary results
The summary file contains one row per simulation setting (design × sample sizes × θ), including bias, RMSE, coverage, failure rates, and MCSE diagnostics computed from the Monte Carlo replicates.
Figure 2 plots point-estimate bias for each estimator within each simulation setting. Bias is computed as \(\text{bias} = \text{mean}(\text{estimate}) - \text{true lnM}\), so values above 0 indicate systematic overestimation and values below 0 indicate underestimation. The dashed horizontal line marks 0 bias (ideal). Each panel corresponds to one design × sample-size combination; the x-axis (theta) is the true lnM used to generate the data.
Relative bias of the reported sampling variances is expressed as a percentage (Figure 3). In the simulation code, this is computed relative to an empirical Monte Carlo baseline variance for the SAFE-BC point estimator (computed within each setting on a common “ok” subset). The dashed line at 0% indicates an unbiased variance estimate under this baseline.
Figure 3: Relative bias of variance estimates (%). Dashed line indicates 0% (unbiased).
Coverage of nominal 95% intervals
Coverage is the fraction of Monte Carlo replicates in which the nominal 95% interval contains the true lnM value (Figure 4). The dashed line at 0.95 is the nominal target.
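As a toy illustration of how empirical coverage is computed (names and values here are illustrative, not taken from the simulation code):

```r
# Fraction of nominal 95% Wald intervals containing the true value.
set.seed(42)
theta_true <- 0.5
n_rep <- 5000
est <- rnorm(n_rep, mean = theta_true, sd = 0.2)  # unbiased toy estimates
se <- rep(0.2, n_rep)                             # known SE for the toy
covered <- (est - 1.96 * se <= theta_true) & (theta_true <= est + 1.96 * se)
coverage <- mean(covered)  # should sit close to the nominal 0.95
```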
Figure 5: RMSE of point estimates. Lower values indicate better overall accuracy.
Reading in data for Monte Carlo (MC) standard error (SE)
The chunks below read replicate-level outputs saved during the simulation run (raw_runs/row_###.rds). Each file corresponds to one simulation setting (one “grid cell”), and contains one row per Monte Carlo replicate. We use these files to reconstruct additional Monte Carlo error diagnostics (MCSE) that are easier to compute from replicate-level results.
Note
Each raw_runs/row_###.rds file corresponds to one row of the simulation parameter grid used in the driver script (i.e., one design × sample size × θ combination).
Code
# Read replicate-level simulation outputs (one file per grid cell)
raw_dir <- here("raw_runs")
raw_files <- list.files(raw_dir, pattern = "^row_[0-9]{3}\\.rds$", full.names = TRUE)
if (!length(raw_files)) stop("No raw files found in: ", raw_dir)

# Extract grid-cell index from filename (e.g., row_012.rds -> 12)
row_index_from_path <- function(p) {
  as.integer(sub("^row_([0-9]{3})\\.rds$", "\\1", basename(p)))
}

# Compute diagnostics for a single grid cell (one replicate-level data frame)
one_raw_diag <- function(raw_df) {
  # Defensive: remove replicate ID if present (not used in computations)
  if ("rep" %in% names(raw_df)) raw_df <- raw_df %>% dplyr::select(-rep)

  # Match the "ok_d" logic used in the simulation summary:
  # keep replicates where the Delta-method point & variance are defined
  # and the variance is positive
  ok_d <- is.finite(raw_df$delta_pt) & is.finite(raw_df$delta_var) & (raw_df$delta_var > 0)

  # If nothing survives the ok_d filter, return NAs
  if (!any(ok_d)) {
    return(data.frame(
      n_ok_d = 0L,
      n_ok_safe = 0L,
      RSE_VarMC_SAFEpt = NA_real_,
      sd_PI = NA_real_,
      sd_SAFE = NA_real_
    ))
  }

  df_ok <- raw_df[ok_d, , drop = FALSE]

  # Within the ok_d subset, check how many finite SAFE-BC point estimates exist
  ok_safe_pt <- is.finite(df_ok$safe_pt)
  n_ok_safe <- sum(ok_safe_pt)

  # Relative standard error (RSE) of the Monte Carlo variance estimate:
  # RSE(Var_hat) ~ sqrt(2/(n-1)) for a sample variance under normality
  RSE_varmc <- if (n_ok_safe >= 3) sqrt(2 / (n_ok_safe - 1)) else NA_real_

  # SD of point estimates (used to compute MCSE of bias later).
  # Since true_lnM is constant within a grid cell, sd(est - true) = sd(est).
  sd_PI <- sd(df_ok$delta_pt, na.rm = TRUE)
  sd_SAFE <- sd(df_ok$safe_pt, na.rm = TRUE)

  data.frame(
    n_ok_d = nrow(df_ok),
    n_ok_safe = n_ok_safe,
    RSE_VarMC_SAFEpt = RSE_varmc,
    sd_PI = sd_PI,
    sd_SAFE = sd_SAFE
  )
}

# Compute diagnostics for each grid cell (row_###.rds)
diag_list <- pbapply::pblapply(raw_files, function(f) {
  rid <- row_index_from_path(f)
  raw_df <- readRDS(f)
  d <- one_raw_diag(raw_df)
  d$row_id <- rid
  d
})
diag_df <- bind_rows(diag_list)

# Join diagnostics back onto the simulation summary
results2 <- results %>%
  mutate(row_id = seq_len(nrow(results))) %>%
  left_join(diag_df, by = "row_id")
MCSE of bias (from replicate-level files)
Note
How to read MCSE plots.
MCSE is simulation uncertainty, not estimator uncertainty. It measures how precisely the simulation has estimated quantities like bias and mean(variance). Lower MCSE implies more stable simulation summaries.
Figure 6 shows the Monte Carlo standard error (MCSE) of the estimated bias within each simulation setting. Here, MCSE quantifies how much the estimated bias would vary if we re-ran the Monte Carlo simulation with the same number of replicates.
We compute this from replicate-level outputs using the approximation:
MCSE(bias) ≈ sd(estimate) / sqrt(n_eff)
where n_eff is the number of usable Monte Carlo replicates in that setting (after applying the same validity filters used in the simulation summary). Smaller MCSE values indicate more precise bias estimates.
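The approximation above can be checked on simulated replicates (all names and values below are illustrative):

```r
# MCSE(bias) ~ sd(estimates) / sqrt(n_eff), where n_eff counts usable replicates.
set.seed(1)
theta_true <- 1.0
est <- rnorm(2000, mean = theta_true + 0.02, sd = 0.3)  # replicates with small true bias
n_eff <- sum(is.finite(est))                            # validity filter (all pass here)
bias_hat <- mean(est) - theta_true
mcse_bias <- sd(est) / sqrt(n_eff)
# bias_hat should land within a few multiples of mcse_bias of the true bias (0.02)
```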
Code
# MCSE(bias) approximation:
# MCSE(bias) ~ sd(estimator) / sqrt(n_eff),
# where n_eff is the number of usable replicates in each grid cell.
results2 <- results2 %>%
  mutate(
    mcse_bias_PI_disk = ifelse(is.finite(sd_PI) & n_ok_d >= 2,
                               sd_PI / sqrt(n_ok_d), NA_real_),
    mcse_bias_SAFE_disk = ifelse(is.finite(sd_SAFE) & n_ok_safe >= 2,
                                 sd_SAFE / sqrt(n_ok_safe), NA_real_)
  )

mc_long <- results2 %>%
  transmute(
    theta, facet_label,
    mcse_bias_PI = mcse_bias_PI_disk,
    mcse_bias_SAFE = mcse_bias_SAFE_disk
  ) %>%
  pivot_longer(
    cols = c(mcse_bias_PI, mcse_bias_SAFE),
    names_to = "metric", values_to = "value"
  )

p_mc <- ggplot(mc_long, aes(theta, value, colour = metric, group = metric)) +
  geom_line() +
  facet_wrap(~ facet_label, ncol = 4) +
  scale_color_discrete(
    "Estimator",
    labels = c(
      "mcse_bias_PI" = "Plugin",
      "mcse_bias_SAFE" = "SAFE bootstrapping"
    )
  ) +
  labs(
    x = expression(theta),
    y = "MCSE of bias",
    title = "Monte-Carlo error diagnostics: MCSE of bias (per grid cell; from raw runs)"
  ) +
  base_theme +
  theme(legend.position = "bottom")

print(p_mc)
Figure 6: MCSE of bias (per grid cell) reconstructed from replicate-level files in raw_runs.
MCSE of mean(variance estimate) (from summary)
Figure 7 shows the MCSE of the mean reported sampling variance for each estimator within each simulation setting. This is a precision diagnostic for the simulation itself: it indicates how much uncertainty remains in the simulation summary of mean_var_delta and mean_var_safe due to finite Monte Carlo replicates.
In practice, smaller MCSE values mean the simulation is sufficiently large to estimate average variance behavior reliably for that setting; larger values indicate that more Monte Carlo replicates would be needed to stabilize the mean variance estimate.
Code
# Construct long-format data for plotting MCSE of mean variance estimates
var_mcse_df <- results %>%
  transmute(
    theta, facet_label,
    PI_var_MCSE = mcse_varbar_delta,
    SAFE_var_MCSE = mcse_varbar_safe
  ) %>%
  pivot_longer(
    cols = c(PI_var_MCSE, SAFE_var_MCSE),
    names_to = "estimator", values_to = "mcse"
  ) %>%
  mutate(estimator = recode(
    estimator,
    PI_var_MCSE = "PI (delta-var)",
    SAFE_var_MCSE = "SAFE (safe-var)"
  ))

p_var_mcse <- ggplot(var_mcse_df, aes(theta, mcse, colour = estimator, group = estimator)) +
  geom_line() +
  facet_wrap(~ facet_label, ncol = 4) +
  labs(
    x = expression(theta),
    y = "MCSE of mean(variance estimate)",
    colour = NULL,
    title = "Monte-Carlo error of variance estimators (MCSE of the mean)"
  ) +
  base_theme +
  theme(legend.position = "bottom")

print(p_var_mcse)
Figure 7: MCSE of mean variance estimates (from the simulation summary).
Supplementary Tables
Table S1 — Method feasibility and stability
Table 1 summarizes feasibility and stability diagnostics for lnM estimation across simulation settings. Values are summarized across theta within each design × (n1, n2) facet. Medians represent typical behavior across theta, while maxima highlight worst-case behavior near boundary regions.
| Variable Name | Definition |
|---|---|
| θ_min, θ_max | Minimum and maximum true lnM values evaluated in the simulation grid for a given design × sample-size combination. |
| delta_fail_median | Median proportion of simulation replicates (across \(\theta\)) for which the Delta plug-in lnM estimator was undefined (e.g., due to boundary violations). |
| safe_fail_median | Median proportion of simulation replicates for which SAFE failed to return a usable lnM estimate. |
| SAFE_ok_median | Median proportion of replicates in which SAFE returned a valid (accepted) bootstrap estimate. |
| delta_cap_median | Median proportion of replicates in which the Delta-method sampling variance was capped for numerical stability. |
| delta_fail_max, safe_fail_max, delta_cap_max | Worst-case (maximum) values of the corresponding metrics across \(\theta\), highlighting boundary regions where estimator behaviour deteriorates. |
Code
tab1_stability <- results %>%
  group_by(design, n1, n2, facet_label) %>%
  summarise(
    # Range of θ (theta) values evaluated in each facet
    theta_min = min(theta, na.rm = TRUE),
    theta_max = max(theta, na.rm = TRUE),
    # Median failure rates across θ
    delta_fail_median = median(delta_fail_prop, na.rm = TRUE),
    safe_fail_median = median(safe_fail_prop, na.rm = TRUE),
    # Median SAFE convergence / acceptance rate
    SAFE_ok_median = median(SAFE_ok_rate, na.rm = TRUE),
    # Median rate of Delta variance capping
    delta_cap_median = median(delta_cap_rate, na.rm = TRUE),
    # Worst-case behaviour across θ
    delta_fail_max = max(delta_fail_prop, na.rm = TRUE),
    safe_fail_max = max(safe_fail_prop, na.rm = TRUE),
    delta_cap_max = max(delta_cap_rate, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(factor(facet_label, levels = levels(results$facet_label))) %>%
  mutate(
    design = as.character(design),
    facet_label = as.character(facet_label)
  )

kable(
  tab1_stability,
  digits = 3,
  caption = "Table S1. Feasibility / stability diagnostics summarised across θ within each design × (n1,n2) facet."
) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  kableExtra::scroll_box(width = "100%", height = "400px")
Table 1: Table S1. Feasibility / stability diagnostics summarised across θ within each design × (n1,n2) facet.

| design | n1 | n2 | facet_label | theta_min | theta_max | delta_fail_median | safe_fail_median | SAFE_ok_median | delta_cap_median | delta_fail_max | safe_fail_max | delta_cap_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| indep | 5 | 5 | indep n1=5 n2=5 | 0.2 | 5 | 0.227 | 0 | 1 | 0.060 | 0.631 | 0 | 0.171 |
| indep | 10 | 10 | indep n1=10 n2=10 | 0.2 | 5 | 0.077 | 0 | 1 | 0.025 | 0.622 | 0 | 0.169 |
| indep | 20 | 20 | indep n1=20 n2=20 | 0.2 | 5 | 0.009 | 0 | 1 | 0.004 | 0.584 | 0 | 0.155 |
| indep | 100 | 100 | indep n1=100 n2=100 | 0.2 | 5 | 0.000 | 0 | 1 | 0.000 | 0.333 | 0 | 0.083 |
| indep | 3 | 7 | indep n1=3 n2=7 | 0.2 | 5 | 0.269 | 0 | 1 | 0.067 | 0.635 | 0 | 0.166 |
| indep | 6 | 14 | indep n1=6 n2=14 | 0.2 | 5 | 0.110 | 0 | 1 | 0.032 | 0.630 | 0 | 0.170 |
| indep | 12 | 28 | indep n1=12 n2=28 | 0.2 | 5 | 0.019 | 0 | 1 | 0.007 | 0.602 | 0 | 0.159 |
| indep | 40 | 160 | indep n1=40 n2=160 | 0.2 | 5 | 0.000 | 0 | 1 | 0.000 | 0.430 | 0 | 0.106 |
| paired | 5 | 5 | paired n1=5 n2=5 | 0.6 | 5 | 0.002 | 0 | 1 | 0.001 | 0.493 | 0 | 0.097 |
| paired | 10 | 10 | paired n1=10 n2=10 | 0.6 | 5 | 0.000 | 0 | 1 | 0.000 | 0.227 | 0 | 0.050 |
| paired | 20 | 20 | paired n1=20 n2=20 | 0.3 | 5 | 0.000 | 0 | 1 | 0.000 | 0.535 | 0 | 0.103 |
| paired | 100 | 100 | paired n1=100 n2=100 | 0.2 | 5 | 0.000 | 0 | 1 | 0.000 | 0.181 | 0 | 0.041 |
Table S2 — SAFE bootstrap acceptance (kept/tried)
Table 2 summarizes the SAFE bootstrap acceptance rate (“kept / tried”) across \(\theta\) within each facet. The quantiles show how acceptance varies across \(\theta\), including worst-case regions where accepted draws are rare.
| Variable Name | Definition |
|---|---|
| q0, q25, q50, q75, q100 | Minimum, 25th percentile, median, 75th percentile, and maximum SAFE bootstrap acceptance rates across values of \(\theta\) within each facet. |
| mean | Mean SAFE bootstrap acceptance rate across \(\theta\). |
Table 2: Table S2. SAFE acceptance rate (kept/tried) summarised across θ for each facet. Quantiles highlight worst-case acceptance.

| design | n1 | n2 | facet_label | q0 | q25 | q50 | q75 | q100 | mean |
|---|---|---|---|---|---|---|---|---|---|
| indep | 5 | 5 | indep n1=5 n2=5 | 0.4820 | 0.5444 | 0.6752 | 0.9079 | 1.0000 | 0.7226 |
| indep | 10 | 10 | indep n1=10 n2=10 | 0.4663 | 0.5892 | 0.8083 | 0.9871 | 1.0000 | 0.7800 |
| indep | 20 | 20 | indep n1=20 n2=20 | 0.4718 | 0.7044 | 0.9441 | 0.9997 | 1.0000 | 0.8428 |
| indep | 100 | 100 | indep n1=100 n2=100 | 0.5779 | 0.9897 | 1.0000 | 1.0000 | 1.0000 | 0.9511 |
| indep | 3 | 7 | indep n1=3 n2=7 | 0.4477 | 0.5030 | 0.6285 | 0.8755 | 0.9998 | 0.6898 |
| indep | 6 | 14 | indep n1=6 n2=14 | 0.4484 | 0.5575 | 0.7642 | 0.9761 | 1.0000 | 0.7571 |
| indep | 12 | 28 | indep n1=12 n2=28 | 0.4586 | 0.6634 | 0.9161 | 0.9991 | 1.0000 | 0.8246 |
| indep | 40 | 160 | indep n1=40 n2=160 | 0.5247 | 0.9510 | 0.9997 | 1.0000 | 1.0000 | 0.9281 |
| paired | 5 | 5 | paired n1=5 n2=5 | 0.1982 | 0.6858 | 0.9722 | 0.9996 | 1.0000 | 0.8156 |
| paired | 10 | 10 | paired n1=10 n2=10 | 0.4293 | 0.9150 | 0.9990 | 1.0000 | 1.0000 | 0.9089 |
| paired | 20 | 20 | paired n1=20 n2=20 | 0.1547 | 0.9570 | 1.0000 | 1.0000 | 1.0000 | 0.8822 |
| paired | 100 | 100 | paired n1=100 n2=100 | 0.5318 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9709 |
Acceptance rate is defined as the proportion of bootstrap resamples retained after SAFE’s feasibility and stability checks. Lower values indicate reduced effective bootstrap sample sizes, typically occurring near parameter boundaries or under small sample sizes.
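The quantile summary in Table S2 can be reproduced with a dplyr pipeline analogous to the Table S1 chunk. Below is a minimal sketch on a single toy facet; the data frame `results_demo` and its values are illustrative stand-ins for the simulation object, not the simulation output itself:

```r
library(dplyr)

# Toy stand-in for one facet of the simulation summary (values hypothetical):
results_demo <- tibble::tibble(
  design = "indep", n1 = 5, n2 = 5, facet_label = "indep n1=5 n2=5",
  theta = c(0.2, 0.5, 1, 2, 5),
  SAFE_ok_rate = c(0.48, 0.54, 0.68, 0.91, 1.00)
)

# Quantiles of the acceptance rate across theta within each facet (Table S2 shape):
tab2_demo <- results_demo %>%
  group_by(design, n1, n2, facet_label) %>%
  summarise(
    q0   = min(SAFE_ok_rate),
    q25  = quantile(SAFE_ok_rate, 0.25),
    q50  = median(SAFE_ok_rate),
    q75  = quantile(SAFE_ok_rate, 0.75),
    q100 = max(SAFE_ok_rate),
    mean = mean(SAFE_ok_rate),
    .groups = "drop"
  )
```

Running the same pipeline over the full `results` object, with one row per \(\theta\) value, yields the twelve facet rows shown above.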
Table S3 — MCSE diagnostics
Table 4 summarizes Monte Carlo precision. Values are the maximum MCSE observed across \(\theta\) within each facet, giving a conservative measure of simulation uncertainty.
Table 3: Column definitions for Monte Carlo standard errors

| Variable Name | Definition |
|---|---|
| mcse_bias_PI_max | Maximum Monte Carlo standard error of the Delta-method lnM bias estimate across \(\theta\). |
| mcse_bias_SAFE_max | Maximum Monte Carlo standard error of the SAFE lnM bias estimate across \(\theta\). |
| mcse_var_PI_max | Maximum Monte Carlo standard error of the mean Delta-method variance estimate across \(\theta\). |
| mcse_var_SAFE_max | Maximum Monte Carlo standard error of the mean SAFE variance estimate across \(\theta\). |
Table 4: Table S3. Monte-Carlo precision (max MCSE over θ within each facet).

| design | n1 | n2 | facet_label | mcse_bias_PI_max | mcse_bias_SAFE_max | mcse_var_PI_max | mcse_var_SAFE_max |
|---|---|---|---|---|---|---|---|
| indep | 5 | 5 | indep n1=5 n2=5 | 0.00293 | 0.00283 | 0.03832 | 0.00039 |
| indep | 10 | 10 | indep n1=10 n2=10 | 0.00278 | 0.00276 | 0.03776 | 0.00037 |
| indep | 20 | 20 | indep n1=20 n2=20 | 0.00263 | 0.00271 | 0.03525 | 0.00036 |
| indep | 100 | 100 | indep n1=100 n2=100 | 0.00246 | 0.00269 | 0.02258 | 0.00035 |
| indep | 3 | 7 | indep n1=3 n2=7 | 0.00305 | 0.00284 | 0.03836 | 0.00042 |
| indep | 6 | 14 | indep n1=6 n2=14 | 0.00292 | 0.00276 | 0.03829 | 0.00038 |
| indep | 12 | 28 | indep n1=12 n2=28 | 0.00273 | 0.00273 | 0.03628 | 0.00037 |
| indep | 40 | 160 | indep n1=40 n2=160 | 0.00274 | 0.00271 | 0.02673 | 0.00035 |
| paired | 5 | 5 | paired n1=5 n2=5 | 0.00253 | 0.00285 | 0.02733 | 0.00036 |
| paired | 10 | 10 | paired n1=10 n2=10 | 0.00219 | 0.00265 | 0.01713 | 0.00033 |
| paired | 20 | 20 | paired n1=20 n2=20 | 0.00221 | 0.00258 | 0.02920 | 0.00031 |
| paired | 100 | 100 | paired n1=100 n2=100 | 0.00190 | 0.00248 | 0.01534 | 0.00029 |
MCSE values quantify simulation uncertainty due to a finite number of Monte Carlo replicates; smaller values indicate more precise simulation estimates.
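As a reminder of how these diagnostics are obtained, the MCSE of a simulated bias estimate is the standard deviation of the per-replicate estimates divided by \(\sqrt{K}\). A minimal base-R sketch with hypothetical draws (the true value and noise level here are invented for illustration):

```r
# MCSE of a Monte Carlo bias estimate: sd(per-replicate estimates) / sqrt(K)
set.seed(1)
K <- 1e5
true_val <- 0.5
est <- rnorm(K, mean = true_val, sd = 0.2)  # hypothetical replicate estimates

bias_hat  <- mean(est) - true_val  # estimated bias (here, near zero)
mcse_bias <- sd(est) / sqrt(K)     # Monte Carlo standard error of that bias
```

With \(K = 10^5\) replicates and a replicate-level standard deviation of about 0.2, the MCSE is roughly \(0.2/\sqrt{10^5} \approx 0.0006\), which is the order of magnitude seen in Table S3.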
Part 3. Supplementary simulation for the absolute standardized mean difference, |d|
To complement the main simulation study for lnM, we conducted an additional simulation for the absolute standardized mean difference, |d|, a commonly used workaround for analysing effect magnitude.
The data-generating process, parameter grid, and performance summaries were the same as in the main simulation (Section 2.3), so that results for |d| and lnM were directly comparable. Here, we describe only the effect-size estimators, targets, and sampling-variance formulas specific to |d|.
For each simulated dataset, we first calculated a signed standardized mean difference and then converted it to magnitude by taking its absolute value.
Signed standardized mean differences
For independent groups, we used the following pooled-standard-deviation (Cohen-type) estimator of \(d\):

\[
\hat d_{\mathrm{IND}} = \frac{\bar{x}_2 - \bar{x}_1}{s_{\mathrm{pooled}}},
\qquad
s_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}.
\]

For paired (dependent) groups, we used a repeated-measures Cohen-type standardized mean difference based on the average within-condition standard deviation:

\[
\hat d_{\mathrm{DEP}} = \frac{\bar{x}_2 - \bar{x}_1}{s_{\mathrm{av}}},
\qquad
s_{\mathrm{av}} = \sqrt{\frac{s_1^2 + s_2^2}{2}}.
\]
Under our simulation parameterization (\(\mu_1 = 0\), \(\mu_2 = \theta\), and \(\sigma_1 = \sigma_2 = 1\)), the population signed standardized mean difference equals \(\theta\) for both designs. However, because the goal here was to evaluate unsigned magnitude estimators, we did not treat \(|\theta|\) as the primary target. Instead, following the logic that magnitude is obtained by transforming a signed effect with sampling uncertainty, we defined the target as the mean of the corresponding folded-normal distribution:
\[
|d|_{\mathrm{FN,true}} = \mathbb{E}(|D|), \qquad D \sim \mathcal{N}(d_{\mathrm{true}}, v_{d,\mathrm{true}})
\] where \(d_{\mathrm{true}} = \theta\) and \(v_{d,\mathrm{true}}\) is the design-specific sampling variance of the signed standardized mean difference.
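Numerically, the target \(|d|_{\mathrm{FN,true}}\) is just the folded-normal expectation used by the `folded_norm()` helper in the setup chunk. The following self-contained sketch reproduces that expectation and evaluates it for one illustrative independent-group scenario (the choices \(\theta = 0.2\), \(n_1 = n_2 = 5\) are ours, not a specific cell of the simulation grid):

```r
# Folded-normal mean: E|D| for D ~ N(mu, var)
folded_mean <- function(mu, var) {
  sigma <- sqrt(var)
  z <- mu / sigma
  sigma * sqrt(2 / pi) * exp(-0.5 * z^2) + mu * (2 * pnorm(z) - 1)
}

# Independent-group example: d_true = theta, with the usual large-sample
# variance of the signed SMD (illustrative values)
theta <- 0.2; n1 <- 5; n2 <- 5
v_d_true <- (n1 + n2) / (n1 * n2) + theta^2 / (2 * (n1 + n2 - 2))
d_FN_true <- folded_mean(theta, v_d_true)
# d_FN_true exceeds theta: folding a noisy signed effect inflates its magnitude
```

This illustrates why the folded-normal target, not \(|\theta|\), is the appropriate benchmark: for small \(\theta\) and small samples, \(\mathbb{E}(|D|)\) can be substantially larger than \(|\theta|\).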
For independent groups, we used the common large-sample approximation

\[
\widehat{\mathrm{Var}}(\hat d) \approx \frac{n_1 + n_2}{n_1 n_2} + \frac{\hat d^2}{2(n_1 + n_2 - 2)},
\]

and for paired groups

\[
\widehat{\mathrm{Var}}(\hat d) \approx \frac{2(1 - r)}{n} + \frac{\hat d^2}{2(n - 1)},
\]

where \(\hat d\) denotes either \(\hat d_{\mathrm{IND}}\) or \(\hat d_{\mathrm{DEP}}\), depending on the design, \(r\) is the within-pair correlation, and \(n\) is the number of pairs.
For the naive approach, we used the usual sampling variance of the corresponding signed standardized mean difference, unchanged after taking the absolute value. For independent groups: \[
\widehat{\mathrm{Var}}_{\mathrm{naive,IND}}(|d|)
\approx
\widehat{\mathrm{Var}}(\hat d_{\mathrm{IND}})
=
\frac{n_1 + n_2}{n_1 n_2}
+
\frac{\hat d_{\mathrm{IND}}^2}{2(n_1 + n_2 - 2)},
\]
and for paired groups: \[
\widehat{\mathrm{Var}}_{\mathrm{naive,DEP}}(|d|)
\approx
\widehat{\mathrm{Var}}(\hat d_{\mathrm{DEP}})
=
\frac{2(1 - r)}{n}
+
\frac{\hat d_{\mathrm{DEP}}^2}{2(n - 1)},
\]
where \(r\) is the sample within-pair correlation.
To examine a distribution-based alternative, we also considered a folded-normal plug-in estimator. Specifically, if the signed standardized mean difference is approximated by
\[
\hat d \sim \mathcal{N}(\mu_d, v_d)
\] then the corresponding magnitude \(|\hat d|\) has a folded-normal distribution. Its mean is:

\[
\mathbb{E}\,|\hat d|
= \sqrt{v_d}\,\sqrt{\frac{2}{\pi}}\,
\exp\!\left(-\frac{\mu_d^2}{2 v_d}\right)
+ \mu_d\!\left[2\,\Phi\!\left(\frac{\mu_d}{\sqrt{v_d}}\right) - 1\right],
\]

and its plug-in sampling variance is \(\mu_d^2 + v_d - \left(\mathbb{E}\,|\hat d|\right)^2\), with \(\hat d\) and its estimated sampling variance substituted for \(\mu_d\) and \(v_d\).
For each parameter combination, we repeated the simulation \(K = 10^5\) times. For every replicate, we recorded: (i) the signed standardized mean difference \(\hat d\), (ii) the naive absolute-value estimator \(|\hat d|_{\mathrm{naive}}\), (iii) its sampling variance from the equations above, (iv) the folded-normal estimator \(|\hat d|_{\mathrm{FN}}\), and (v) its sampling variance from the folded-normal plug-in formula.
We then evaluated both point estimators against the folded-normal target above. For each estimator \(|\hat d|_l \in \{|\hat d|_{\mathrm{naive}}, |\hat d|_{\mathrm{FN}}\}\), we calculated empirical bias as:

\[
\mathrm{Bias}\!\left(|\hat d|_l\right)
= \frac{1}{K} \sum_{k=1}^{K} |\hat d|_{l,k}
- |d|_{\mathrm{FN,true}}.
\]
We also calculated root mean squared error (RMSE), nominal 95% Wald-interval coverage, and Monte Carlo standard errors (MCSE). In this supplementary simulation, the folded-normal target is the appropriate benchmark because both the naive absolute-value estimator and the folded-normal estimator aim to estimate an unsigned magnitude induced by folding a signed effect-size distribution, rather than the simpler quantity \(|d_{\mathrm{true}}|\).
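Putting the pieces together, a single replicate of this supplementary simulation (independent-group case) can be sketched as follows; the object names are ours, not those of the simulation code:

```r
set.seed(123)
n1 <- 20; n2 <- 20; theta <- 0.5

# (i) signed SMD from one simulated dataset
x1 <- rnorm(n1, 0, 1); x2 <- rnorm(n2, theta, 1)
s_pool <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
d_hat <- (mean(x2) - mean(x1)) / s_pool
v_hat <- (n1 + n2) / (n1 * n2) + d_hat^2 / (2 * (n1 + n2 - 2))

# (ii)-(iii) naive magnitude estimator and its (unchanged) sampling variance
abs_naive <- abs(d_hat)
v_naive   <- v_hat

# (iv)-(v) folded-normal plug-in estimator and its variance
z <- d_hat / sqrt(v_hat)
abs_fn <- sqrt(v_hat) * sqrt(2 / pi) * exp(-0.5 * z^2) +
  d_hat * (2 * pnorm(z) - 1)
v_fn <- d_hat^2 + v_hat - abs_fn^2
```

Note two properties that hold by construction: the folded-normal point estimate is never below the naive \(|\hat d|\) (Jensen's inequality for \(|\cdot|\)), and the folded-normal variance is never above the signed-SMD variance.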
Simulation results for |d|
Bias of point estimators
Figure 8 shows the empirical bias of the two |d| estimators across the simulation grid. The folded-normal estimator generally reduced bias relative to the naive absolute-value estimator when standardized mean differences were small, especially for independent-group designs. However, both approaches remained positively biased because the target itself is an unsigned folded quantity.
Figure 8: Empirical bias of the two |d| estimators across the simulation grid for independent and paired designs. The folded-normal estimator is shown in blue and the naive absolute-value estimator in red.
Relative bias of variance estimators
Figure 9 shows the relative bias of the variance estimators for the two |d| approaches. For independent-group designs, both variance estimators tended to overestimate variance when \(\theta\) was very small and then underestimate it over much of the remaining range. For paired designs, patterns differed more clearly with sample size, but the folded-normal and naive variance estimators remained broadly similar.
Figure 9: Relative bias of the variance estimators for the naive and folded-normal |d| estimators across the simulation grid.
Coverage of 95% Wald intervals
Figure 10 shows the coverage of nominal 95% Wald intervals for the two |d| estimators. Coverage was often close to nominal overall, but the folded-normal estimator showed a clearer dip below 95% in some independent-group scenarios with small \(\theta\), whereas paired designs were generally slightly conservative.
Figure 10: Coverage of nominal 95% Wald intervals for the naive and folded-normal |d| estimators across the simulation grid.
Root mean squared error
Figure 11 summarizes overall estimator performance in terms of RMSE. Differences between the two |d| estimators were generally modest, with the folded-normal estimator showing slightly lower RMSE in some scenarios, especially when \(\theta\) was small.
Figure 11: Root mean squared error (RMSE) of the naive and folded-normal |d| estimators across the simulation grid.
Direct comparison with lnM
To facilitate comparison with the main simulation for lnM, Figure 12 and Figure 13 show the corresponding bias and variance-estimation patterns for |d| and lnM together. These combined plots make the main contrast visually clear: lnM can exhibit much larger point-estimator and variance-estimator problems near the boundary where between-group separation is weak, whereas |d|-based estimators remain defined but target a different folded quantity.
Figure 12: Empirical bias of |d| and lnM estimators across the simulation grid.
Figure 13: Relative bias of variance estimators for |d| and lnM across the simulation grid.
Monte Carlo error diagnostics
For completeness, Figure 14 and Figure 15 show Monte Carlo standard errors for the estimated bias and variance summaries. These diagnostics indicate that the reported simulation summaries were estimated with high precision across the parameter grid.
Figure 14: Monte Carlo standard errors of estimated bias for the |d| estimators.
Figure 15: Monte Carlo standard errors of the mean estimated sampling variances for the |d| estimators.