Assembling mammal trait databases for phylogenetic comparative models • prepR4pcm

Comparative analyses often begin well before model fitting. Trait data are commonly assembled from multiple databases, each with its own species names, column names, measurement conventions, and taxonomic coverage. A phylogenetic tree adds one more requirement: the final trait table must contain species that can be matched to the tree, and the table and tree must be aligned before they can be used in a phylogenetic model.

This vignette focuses on that database-assembly step. We combine three mammal trait datasets, reconcile their species names with a phylogenetic tree, and collapse the matched records into a species-level table. The goal is not to fit a model here, but to show how several databases and a tree can be brought into a single, clean object ready for downstream comparative analyses.

Setup

The worked example draws on five packages: dplyr for the table manipulations, readr for tolerant number parsing, stringr for tidying raw species strings, ape for the phylogeny, and prepR4pcm for reconciling species names against the tree.

library(dplyr)
library(readr)
library(stringr)
library(ape)
library(prepR4pcm)

Load the example sources

The package bundles three small source tables and one small tree. The source tables were sampled from real database structures, using only the columns needed for this example. They mirror three common inputs for mammal comparative work: the Amniote life-history database (Myhrvold et al. 2015), PanTHERIA (Jones et al. 2009), and TetrapodTraits (Moura et al. 2024). The bundled tree is a subset of the VertLife mammal phylogeny (Upham et al. 2019); the next code chunk reports its exact tip count. For analysis-grade trees download the full credible set from https://vertlife.org/phylosubsets/.

data(mammal_amniote_example)
data(mammal_pantheria_example)
data(mammal_tetrapodtraits_example)
data(mammal_tree_example)

cat(sprintf("Amniote-like source:        %d rows\n", nrow(mammal_amniote_example)))
#> Amniote-like source:        4953 rows
cat(sprintf("PanTHERIA-like source:      %d rows\n", nrow(mammal_pantheria_example)))
#> PanTHERIA-like source:      5416 rows
cat(sprintf("TetrapodTraits-like source: %d rows\n", nrow(mammal_tetrapodtraits_example)))
#> TetrapodTraits-like source: 5911 rows
cat(sprintf("Tree:                       %d tips\n", ape::Ntip(mammal_tree_example)))
#> Tree:                       5987 tips

Step 1: Compare the source tables

The three sources contain related information, but they do not use the same column names. This is typical when assembling trait data from independent databases.

source_columns <- tibble::tibble(
  source = c("Amniote", "PanTHERIA", "TetrapodTraits"),
  n_rows = c(
    nrow(mammal_amniote_example),
    nrow(mammal_pantheria_example),
    nrow(mammal_tetrapodtraits_example)
  ),
  n_columns = c(
    ncol(mammal_amniote_example),
    ncol(mammal_pantheria_example),
    ncol(mammal_tetrapodtraits_example)
  ),
  species_column = c("name", "MSW05_Binomial", "Scientific.Name")
)

knitr::kable(source_columns)

source	n_rows	n_columns	species_column
Amniote	4953	5	name
PanTHERIA	5416	4	MSW05_Binomial
TetrapodTraits	5911	3	Scientific.Name

Step 2: Standardise the sources

We first convert each source into the same long-format structure. The result keeps one row per source record, plus a source column so the provenance of each value is retained.

amniote_std <- prep_source(
  mammal_amniote_example,
  source_name     = "AMNIOTE",
  species_col     = "name",
  female_mass_col = "female_body_mass_g",
  adult_mass_col  = "adult_body_mass_g",
  litter_size_col = "litter_or_clutch_size_n",
  litter_y_col    = "litters_or_clutches_per_y"
)

pantheria_std <- prep_source(
  mammal_pantheria_example,
  source_name     = "PANTHERIA",
  species_col     = "MSW05_Binomial",
  adult_mass_col  = "5-1_AdultBodyMass_g",
  litter_size_col = "15-1_LitterSize",
  litter_y_col    = "16-1_LittersPerYear"
)

tetrapodtraits_std <- prep_source(
  mammal_tetrapodtraits_example,
  source_name     = "TETRAPODTRAITS",
  species_col     = "Scientific.Name",
  adult_mass_col  = "BodyMass_g",
  litter_size_col = "LitterSize"
)

db_long_raw <- bind_rows(
  amniote_std,
  pantheria_std,
  tetrapodtraits_std
)

knitr::kable(slice_head(db_long_raw, n = 10))

source	row_in_source	species	female_mass_g	adult_mass_g	litter_size_n	litters_per_year_n
AMNIOTE	1	Amblysomus_corriae	NA	64.80	1.93	2
AMNIOTE	2	Amblysomus_hottentotus	NA	64.80	1.93	2
AMNIOTE	3	Amblysomus_marleyi	NA	64.80	1.93	2
AMNIOTE	4	Amblysomus_robustus	NA	64.80	1.93	2
AMNIOTE	5	Amblysomus_septentrionalis	NA	64.80	1.93	2
AMNIOTE	6	Calcochloris_obtusirostris	NA	24.05	2.00	NA
AMNIOTE	7	Carpitalpa_arendsi	NA	52.34	NA	NA
AMNIOTE	8	Chlorotalpa_duthieae	NA	31.34	2.00	NA
AMNIOTE	9	Chlorotalpa_sclateri	NA	38.30	2.00	NA
AMNIOTE	10	Chrysochloris_asiatica	NA	36.93	3.12	NA

We can now check how much information each source contributes.

source_coverage <- db_long_raw |>
  group_by(source) |>
  summarise(
    n_records           = n(),
    n_species           = n_distinct(species),
    adult_mass_records  = sum(!is.na(adult_mass_g)),
    female_mass_records = sum(!is.na(female_mass_g)),
    litter_size_records = sum(!is.na(litter_size_n)),
    litter_y_records    = sum(!is.na(litters_per_year_n)),
    .groups = "drop"
  )

knitr::kable(source_coverage)

source	n_records	n_species	adult_mass_records	female_mass_records	litter_size_records	litter_y_records
AMNIOTE	4953	4953	4651	438	3511	2146
PANTHERIA	5416	5416	3542	0	2501	894
TETRAPODTRAITS	5911	5911	5911	0	3547	0

In this table n_records is the number of rows the source contributes and n_species the number of distinct species; each *_records column counts the rows where that trait has a non-missing, positive value. Those counts sit well below n_records — no single database measures every trait for every species, which is exactly why combining sources is worthwhile.

Step 3: Reconcile species names with the tree

Name reconciliation is done on the unique source names, not on every row of the trait database. This creates one matching decision per raw species name.

species_lookup <- db_long_raw |>
  distinct(species) |>
  rename(species_raw = species)

knitr::kable(slice_head(species_lookup, n = 10))

species_raw
Amblysomus_corriae
Amblysomus_hottentotus
Amblysomus_marleyi
Amblysomus_robustus
Amblysomus_septentrionalis
Calcochloris_obtusirostris
Carpitalpa_arendsi
Chlorotalpa_duthieae
Chlorotalpa_sclateri
Chrysochloris_asiatica

We now reconcile the source names against the tree. External synonym lookup is turned off here so the vignette remains fast and reproducible.

rec0 <- reconcile_tree(
  x         = species_lookup,
  tree      = mammal_tree_example,
  x_species = "species_raw",
  authority = NULL,
  fuzzy     = FALSE,
  quiet     = TRUE
)

reconcile_summary(rec0, detail = "brief")
#> 
#> === Reconciliation Report ===
#> Type: data_tree
#> Timestamp: 2026-06-16 10:10:43
#> Package: prepR4pcm 0.4.0.9000
#> Authority: NONE (version: latest)
#> Rank: species
#> 
#> --- Match Summary ---
#>   Exact:       4677 / 11304
#>   Normalized:  1234 / 11304
#>   Synonym:     0 / 11304
#>   Fuzzy:       0 / 11304
#>   Manual:      0 / 11304
#>   Unresolved:  5393 (x only) + 76 (y only)
#>

The mapping table records which source names matched the tree and which need review. In it, name_x is the raw source name, name_y the tree tip it resolved to, match_type records how the two were linked, and in_x / in_y flag whether the name is present in the data and in the tree.

mapping0 <- reconcile_mapping(rec0)

mapping_preview <- mapping0 |>
  select(any_of(c("name_x", "name_y", "match_type", "in_x", "in_y"))) |>
  arrange(match_type, name_x) |>
  slice_head(n = 15)

knitr::kable(mapping_preview)

name_x	name_y	match_type	in_x	in_y
Abditomys_latidens	Abditomys_latidens	exact	TRUE	TRUE
Abeomelomys_sevia	Abeomelomys_sevia	exact	TRUE	TRUE
Abrawayaomys_ruschii	Abrawayaomys_ruschii	exact	TRUE	TRUE
Abrocoma_bennettii	Abrocoma_bennettii	exact	TRUE	TRUE
Abrocoma_boliviensis	Abrocoma_boliviensis	exact	TRUE	TRUE
Abrocoma_budini	Abrocoma_budini	exact	TRUE	TRUE
Abrocoma_cinerea	Abrocoma_cinerea	exact	TRUE	TRUE
Abrocoma_famatina	Abrocoma_famatina	exact	TRUE	TRUE
Abrocoma_shistacea	Abrocoma_shistacea	exact	TRUE	TRUE
Abrocoma_vaccarum	Abrocoma_vaccarum	exact	TRUE	TRUE
Abrothrix_andinus	Abrothrix_andinus	exact	TRUE	TRUE
Abrothrix_hershkovitzi	Abrothrix_hershkovitzi	exact	TRUE	TRUE
Abrothrix_illuteus	Abrothrix_illuteus	exact	TRUE	TRUE
Abrothrix_jelskii	Abrothrix_jelskii	exact	TRUE	TRUE
Abrothrix_lanosus	Abrothrix_lanosus	exact	TRUE	TRUE

Names that remain unresolved or flagged can be inspected separately. We also ask for suggested matches. These suggestions are not applied automatically; they are candidates for manual review.

review_names <- mapping0 |>
  filter(in_x, match_type %in% c("unresolved", "flagged")) |>
  arrange(match_type, name_x)

if (nrow(review_names) == 0) {
  cat("No unresolved or flagged names in this example.\n")
} else {
  cat(sprintf(
    "Showing 10 of %d unresolved or flagged names.\n\n",
    nrow(review_names)
  ))
  knitr::kable(slice_head(review_names, n = 10) |>
                 select(any_of(c("name_x", "name_y", "match_type",
                                 "in_x", "in_y"))))
}
#> Showing 10 of 5393 unresolved or flagged names.

name_x	name_y	match_type	in_x	in_y
Abditomys latidens	NA	unresolved	TRUE	FALSE
Abeomelomys sevia	NA	unresolved	TRUE	FALSE
Abrawayaomys ruschii	NA	unresolved	TRUE	FALSE
Abrocoma bennettii	NA	unresolved	TRUE	FALSE
Abrocoma boliviensis	NA	unresolved	TRUE	FALSE
Abrocoma budini	NA	unresolved	TRUE	FALSE
Abrocoma cinerea	NA	unresolved	TRUE	FALSE
Abrocoma famatina	NA	unresolved	TRUE	FALSE
Abrocoma shistacea	NA	unresolved	TRUE	FALSE
Abrocoma vaccarum	NA	unresolved	TRUE	FALSE


suggestions0 <- reconcile_suggest(rec0, n = 3, threshold = 0.9)
#> ✔ Found suggestions for 4683 of 5393 unresolved species.

suggestions_to_review <- suggestions0 |>
  transmute(
    name_x = unresolved,
    name_y = suggestion,
    score  = score
  ) |>
  filter(score >= 0.9, score < 1) |>
  arrange(desc(score))

if (nrow(suggestions_to_review) == 0) {
  cat("No high-confidence, non-perfect suggestions were found.\n")
} else {
  cat(
    "Showing up to 10 high-confidence suggested matches with score below 1.\n\n",
    sep = ""
  )
  knitr::kable(slice_head(suggestions_to_review, n = 10), digits = 3)
}
#> Showing up to 10 high-confidence suggested matches with score below 1.

name_x	name_y	score
Tamiops mcclellandii	Tamiops_macclellandii	0.969
Tamiops_mcclellandii	Tamiops_macclellandii	0.969
Rattus arfakiensis	Rattus_arfakienis	0.964
Rattus_arfakiensis	Rattus_arfakienis	0.964
Galeopterus variegates	Galeopterus_variegatus	0.960
Galeopterus_variegates	Galeopterus_variegatus	0.960
Herpestes edwardsi	Herpestes_edwardsii	0.956
Herpestes_edwardsi	Herpestes_edwardsii	0.956
Neophascogale lorentzi	Neophascogale_lorentzii	0.956
Neophascogale_lorentzi	Neophascogale_lorentzii	0.956

Step 4: Add manual corrections

Automated reconciliation is useful, but some names still need human review. The suggestion table above helps identify likely matches. Manual corrections are stored as a small editable table.

The important rule is simple: name_x must be a name from the trait database, and name_y must be an exact tip label in the tree.

manual_overrides <- suggestions_to_review |>
  slice_head(n = 2) |>
  mutate(user_note = "Accepted from high-confidence reconciliation suggestion") |>
  select(name_x, name_y, user_note)

if (nrow(manual_overrides) == 0) {
  cat("No manual corrections were added in this example.\n")
} else {
  knitr::kable(manual_overrides, digits = 3)
}

name_x	name_y	user_note
Tamiops mcclellandii	Tamiops_macclellandii	Accepted from high-confidence reconciliation suggestion
Tamiops_mcclellandii	Tamiops_macclellandii	Accepted from high-confidence reconciliation suggestion

We then apply any manual corrections to the reconciliation table. If manual_overrides is empty, the automated mapping is kept unchanged.

mapping_final <- mapping0 |>
  left_join(
    manual_overrides |>
      rename(manual_name_y = name_y, manual_note = user_note),
    by = "name_x"
  ) |>
  mutate(
    species_tree    = coalesce(manual_name_y, name_y),
    matched_to_tree = species_tree %in% mammal_tree_example$tip.label,
    match_type      = if_else(!is.na(manual_name_y), "manual", match_type),
    notes           = manual_note
  ) |>
  select(-manual_name_y, -manual_note)

final_reconciliation_summary <- mapping_final |>
  filter(in_x) |>
  count(match_type, name = "n_names") |>
  arrange(desc(n_names), match_type)

knitr::kable(final_reconciliation_summary)

match_type	n_names
unresolved	5391
exact	4677
normalized	1234
manual	2

The corrected mapping can now be joined back to the full source-level trait table.

name_map <- mapping_final |>
  filter(in_x) |>
  transmute(
    species_raw     = name_x,
    species_tree    = species_tree,
    matched_to_tree = matched_to_tree,
    match_type      = match_type,
    notes           = notes
  )

db_full <- db_long_raw |>
  rename(species_raw = species) |>
  left_join(name_map, by = "species_raw") |>
  relocate(source, row_in_source, species_raw, species_tree,
           matched_to_tree, match_type)

db_tree_matched <- db_full |>
  filter(matched_to_tree, !is.na(species_tree))

knitr::kable(
  db_tree_matched |>
    select(source, species_raw, species_tree, match_type,
           adult_mass_g, female_mass_g, litter_size_n,
           litters_per_year_n) |>
    slice_head(n = 10),
  digits = 3
)

source	species_raw	species_tree	match_type	adult_mass_g	female_mass_g	litter_size_n	litters_per_year_n
AMNIOTE	Amblysomus_corriae	Amblysomus_corriae	exact	64.80	NA	1.93	2
AMNIOTE	Amblysomus_hottentotus	Amblysomus_hottentotus	exact	64.80	NA	1.93	2
AMNIOTE	Amblysomus_marleyi	Amblysomus_marleyi	exact	64.80	NA	1.93	2
AMNIOTE	Amblysomus_robustus	Amblysomus_robustus	exact	64.80	NA	1.93	2
AMNIOTE	Amblysomus_septentrionalis	Amblysomus_septentrionalis	exact	64.80	NA	1.93	2
AMNIOTE	Calcochloris_obtusirostris	Calcochloris_obtusirostris	exact	24.05	NA	2.00	NA
AMNIOTE	Carpitalpa_arendsi	Carpitalpa_arendsi	exact	52.34	NA	NA	NA
AMNIOTE	Chlorotalpa_duthieae	Chlorotalpa_duthieae	exact	31.34	NA	2.00	NA
AMNIOTE	Chlorotalpa_sclateri	Chlorotalpa_sclateri	exact	38.30	NA	2.00	NA
AMNIOTE	Chrysochloris_asiatica	Chrysochloris_asiatica	exact	36.93	NA	3.12	NA

Step 5: Collapse to one row per species

The source-level database can now be summarised to one record per matched species. Here we use the median trait value across available source records and keep simple provenance columns.

db_species_summary <- db_tree_matched |>
  group_by(species_tree) |>
  summarise(
    n_sources_total       = n_distinct(source),
    sources               = safe_sources(source),
    adult_mass_g          = safe_median(adult_mass_g),
    female_mass_g         = safe_median(female_mass_g),
    litter_size_n         = safe_median(litter_size_n),
    litters_per_year_n    = safe_median(litters_per_year_n),
    adult_mass_n_records  = sum(!is.na(adult_mass_g)),
    female_mass_n_records = sum(!is.na(female_mass_g)),
    litter_size_n_records = sum(!is.na(litter_size_n)),
    litter_y_n_records    = sum(!is.na(litters_per_year_n)),
    .groups = "drop"
  ) |>
  mutate(annual_offspring_n = litter_size_n * litters_per_year_n)

trait_coverage <- db_species_summary |>
  summarise(
    n_species               = n(),
    adult_mass_species      = sum(!is.na(adult_mass_g)),
    female_mass_species     = sum(!is.na(female_mass_g)),
    litter_size_species     = sum(!is.na(litter_size_n)),
    litters_per_year_species= sum(!is.na(litters_per_year_n)),
    annual_offspring_species= sum(!is.na(annual_offspring_n))
  )

knitr::kable(trait_coverage)

n_species	adult_mass_species	female_mass_species	litter_size_species	litters_per_year_species	annual_offspring_species
5911	5634	395	3446	2000	1975

Each *_species column counts the species with a non-missing value for that trait. Some species still carry NA: even with three sources combined, not every species has every trait measured. Downstream comparative models handle these gaps through imputation or complete-case analysis.

Step 6: Align the database and the tree

For phylogenetic comparative models, the data and tree must refer to the same species. Here we prune the tree and order the data rows to match the tree tips.

matched_tips <- intersect(
  mammal_tree_example$tip.label,
  db_species_summary$species_tree
)

tree_pcm <- keep.tip(mammal_tree_example, matched_tips)

pcm_data <- db_species_summary |>
  filter(species_tree %in% tree_pcm$tip.label) |>
  mutate(species = species_tree) |>
  arrange(match(species, tree_pcm$tip.label)) |>
  relocate(species)

stopifnot(identical(pcm_data$species, tree_pcm$tip.label))

alignment_check <- tibble::tibble(
  object          = c("pcm_data", "tree_pcm"),
  species_or_tips = c(nrow(pcm_data), ape::Ntip(tree_pcm)),
  aligned         = c(
    identical(pcm_data$species, tree_pcm$tip.label),
    identical(pcm_data$species, tree_pcm$tip.label)
  )
)

knitr::kable(alignment_check)

object	species_or_tips	aligned
pcm_data	5911	TRUE
tree_pcm	5911	TRUE

Final database ready for model fitting

The final objects are pcm_data and tree_pcm. The table below shows the first 10 rows of the assembled database. This is the object that would be passed to downstream phylogenetic comparative models.

knitr::kable(
  pcm_data |>
    select(species, adult_mass_g, litter_size_n,
           litters_per_year_n, annual_offspring_n,
           n_sources_total, sources) |>
    slice_head(n = 10),
  digits = 3
)

species	adult_mass_g	litter_size_n	litters_per_year_n	annual_offspring_n	n_sources_total	sources
Zaglossus_bartoni	8951.71	1	1.00	1.00	1	AMNIOTE
Zaglossus_bruijnii	16500.00	1	NA	NA	1	TETRAPODTRAITS
Zaglossus_attenboroughi	3000.00	NA	NA	NA	1	AMNIOTE
Tachyglossus_aculeatus	3169.50	1	0.58	0.58	1	AMNIOTE
Ornithorhynchus_anatinus	1225.00	2	1.10	2.20	1	AMNIOTE
Hydrodamalis_gigas	8950000.00	1	NA	NA	2	AMNIOTE; TETRAPODTRAITS
Trichechus_manatus	387500.00	1	0.40	0.40	1	AMNIOTE
Trichechus_senegalensis	477000.00	1	NA	NA	1	AMNIOTE
Trichechus_inunguis	294000.59	1	0.50	0.50	1	AMNIOTE
Dugong_dugon	360000.00	1	0.22	0.22	1	AMNIOTE

The helper functions used in this vignette (prep_source(), pull_number_or_na(), safe_sources(), safe_median()) are local to this document; they are deliberately minimal so you can copy and adapt them for your own assembly script. If a generic version proves useful in practice we can graduate them to exported package functions.

References

Jones, Kate E., Jon Bielby, Marcel Cardillo, et al. 2009. “PanTHERIA: A Species‐level Database of Life History, Ecology, and Geography of Extant and Recently Extinct Mammals.” Ecology 90 (September): 2648–48. https://doi.org/10.1890/08-1494.1.

Moura, Mario R., Karoline Ceron, Jhonny J. M. Guedes, et al. 2024. “A Phylogeny-Informed Characterisation of Global Tetrapod Traits Addresses Data Gaps and Biases.” PLOS Biology 22 (July): e3002658. https://doi.org/10.1371/journal.pbio.3002658.

Myhrvold, Nathan P., Elita Baldridge, Benjamin Chan, Dhileep Sivam, Daniel L. Freeman, and S. K. Morgan Ernest. 2015. “An Amniote Life-History Database to Perform Comparative Analyses with Birds, Mammals, and Reptiles.” Ecology 96 (November): 3109–9. https://doi.org/10.1890/15-0846R.1.

Upham, Nathan S., Jacob A. Esselstyn, and Walter Jetz. 2019. “Inferring the Mammal Tree: Species-Level Sets of Phylogenies for Questions in Ecology, Evolution, and Conservation.” PLOS Biology 17: e3000494. https://doi.org/10.1371/journal.pbio.3000494.