Skip to contents

Comparative analyses often begin well before model fitting. Trait data are commonly assembled from multiple databases, each with its own species names, column names, measurement conventions, and taxonomic coverage. A phylogenetic tree adds one more requirement: the final trait table must contain species that can be matched to the tree, and the table and tree must be aligned before they can be used in a phylogenetic model.

This vignette focuses on that database-assembly step. We combine three mammal trait datasets, reconcile their species names with a phylogenetic tree, and collapse the matched records into a species-level table. The goal is not to fit a model here, but to show how several databases and a tree can be brought into a single, clean object ready for downstream comparative analyses.

Setup

The worked example draws on five packages: dplyr for the table manipulations, readr for tolerant number parsing, stringr for tidying raw species strings, ape for the phylogeny, and prepR4pcm for reconciling species names against the tree.

Load the example sources

The package bundles three small source tables and one small tree. The source tables were sampled from real database structures, using only the columns needed for this example. They mirror three common inputs for mammal comparative work: the Amniote life-history database (Myhrvold et al. 2015), PanTHERIA (Jones et al. 2009), and TetrapodTraits (Moura et al. 2024). The bundled tree is a subset of the VertLife mammal phylogeny (Upham et al. 2019); the next code chunk reports its exact tip count. For analysis-grade trees download the full credible set from https://vertlife.org/phylosubsets/.

data(mammal_amniote_example)
data(mammal_pantheria_example)
data(mammal_tetrapodtraits_example)
data(mammal_tree_example)

cat(sprintf("Amniote-like source:        %d rows\n", nrow(mammal_amniote_example)))
#> Amniote-like source:        4953 rows
cat(sprintf("PanTHERIA-like source:      %d rows\n", nrow(mammal_pantheria_example)))
#> PanTHERIA-like source:      5416 rows
cat(sprintf("TetrapodTraits-like source: %d rows\n", nrow(mammal_tetrapodtraits_example)))
#> TetrapodTraits-like source: 5911 rows
cat(sprintf("Tree:                       %d tips\n", ape::Ntip(mammal_tree_example)))
#> Tree:                       5987 tips

Step 1: Compare the source tables

The three sources contain related information, but they do not use the same column names. This is typical when assembling trait data from independent databases.

source_columns <- tibble::tibble(
  source = c("Amniote", "PanTHERIA", "TetrapodTraits"),
  n_rows = c(
    nrow(mammal_amniote_example),
    nrow(mammal_pantheria_example),
    nrow(mammal_tetrapodtraits_example)
  ),
  n_columns = c(
    ncol(mammal_amniote_example),
    ncol(mammal_pantheria_example),
    ncol(mammal_tetrapodtraits_example)
  ),
  species_column = c("name", "MSW05_Binomial", "Scientific.Name")
)

knitr::kable(source_columns)
source n_rows n_columns species_column
Amniote 4953 5 name
PanTHERIA 5416 4 MSW05_Binomial
TetrapodTraits 5911 3 Scientific.Name

Step 2: Standardise the sources

We first convert each source into the same long-format structure. The result keeps one row per source record, plus a source column so the provenance of each value is retained.

amniote_std <- prep_source(
  mammal_amniote_example,
  source_name     = "AMNIOTE",
  species_col     = "name",
  female_mass_col = "female_body_mass_g",
  adult_mass_col  = "adult_body_mass_g",
  litter_size_col = "litter_or_clutch_size_n",
  litter_y_col    = "litters_or_clutches_per_y"
)

pantheria_std <- prep_source(
  mammal_pantheria_example,
  source_name     = "PANTHERIA",
  species_col     = "MSW05_Binomial",
  adult_mass_col  = "5-1_AdultBodyMass_g",
  litter_size_col = "15-1_LitterSize",
  litter_y_col    = "16-1_LittersPerYear"
)

tetrapodtraits_std <- prep_source(
  mammal_tetrapodtraits_example,
  source_name     = "TETRAPODTRAITS",
  species_col     = "Scientific.Name",
  adult_mass_col  = "BodyMass_g",
  litter_size_col = "LitterSize"
)

db_long_raw <- bind_rows(
  amniote_std,
  pantheria_std,
  tetrapodtraits_std
)

knitr::kable(slice_head(db_long_raw, n = 10))
source row_in_source species female_mass_g adult_mass_g litter_size_n litters_per_year_n
AMNIOTE 1 Amblysomus_corriae NA 64.80 1.93 2
AMNIOTE 2 Amblysomus_hottentotus NA 64.80 1.93 2
AMNIOTE 3 Amblysomus_marleyi NA 64.80 1.93 2
AMNIOTE 4 Amblysomus_robustus NA 64.80 1.93 2
AMNIOTE 5 Amblysomus_septentrionalis NA 64.80 1.93 2
AMNIOTE 6 Calcochloris_obtusirostris NA 24.05 2.00 NA
AMNIOTE 7 Carpitalpa_arendsi NA 52.34 NA NA
AMNIOTE 8 Chlorotalpa_duthieae NA 31.34 2.00 NA
AMNIOTE 9 Chlorotalpa_sclateri NA 38.30 2.00 NA
AMNIOTE 10 Chrysochloris_asiatica NA 36.93 3.12 NA

We can now check how much information each source contributes.

source_coverage <- db_long_raw |>
  group_by(source) |>
  summarise(
    n_records           = n(),
    n_species           = n_distinct(species),
    adult_mass_records  = sum(!is.na(adult_mass_g)),
    female_mass_records = sum(!is.na(female_mass_g)),
    litter_size_records = sum(!is.na(litter_size_n)),
    litter_y_records    = sum(!is.na(litters_per_year_n)),
    .groups = "drop"
  )

knitr::kable(source_coverage)
source n_records n_species adult_mass_records female_mass_records litter_size_records litter_y_records
AMNIOTE 4953 4953 4651 438 3511 2146
PANTHERIA 5416 5416 3542 0 2501 894
TETRAPODTRAITS 5911 5911 5911 0 3547 0

In this table n_records is the number of rows the source contributes and n_species the number of distinct species; each *_records column counts the rows where that trait has a non-missing, positive value. Those counts sit well below n_records — no single database measures every trait for every species, which is exactly why combining sources is worthwhile.

Step 3: Reconcile species names with the tree

Name reconciliation is done on the unique source names, not on every row of the trait database. This creates one matching decision per raw species name.

species_lookup <- db_long_raw |>
  distinct(species) |>
  rename(species_raw = species)

knitr::kable(slice_head(species_lookup, n = 10))
species_raw
Amblysomus_corriae
Amblysomus_hottentotus
Amblysomus_marleyi
Amblysomus_robustus
Amblysomus_septentrionalis
Calcochloris_obtusirostris
Carpitalpa_arendsi
Chlorotalpa_duthieae
Chlorotalpa_sclateri
Chrysochloris_asiatica

We now reconcile the source names against the tree. External synonym lookup is turned off here so the vignette remains fast and reproducible.

rec0 <- reconcile_tree(
  x         = species_lookup,
  tree      = mammal_tree_example,
  x_species = "species_raw",
  authority = NULL,
  fuzzy     = FALSE,
  quiet     = TRUE
)

reconcile_summary(rec0, detail = "brief")
#> 
#> === Reconciliation Report ===
#> Type: data_tree
#> Timestamp: 2026-06-16 10:10:43
#> Package: prepR4pcm 0.4.0.9000
#> Authority: NONE (version: latest)
#> Rank: species
#> 
#> --- Match Summary ---
#>   Exact:       4677 / 11304
#>   Normalized:  1234 / 11304
#>   Synonym:     0 / 11304
#>   Fuzzy:       0 / 11304
#>   Manual:      0 / 11304
#>   Unresolved:  5393 (x only) + 76 (y only)
#> 

The mapping table records which source names matched the tree and which need review. In it, name_x is the raw source name, name_y the tree tip it resolved to, match_type records how the two were linked, and in_x / in_y flag whether the name is present in the data and in the tree.

mapping0 <- reconcile_mapping(rec0)

mapping_preview <- mapping0 |>
  select(any_of(c("name_x", "name_y", "match_type", "in_x", "in_y"))) |>
  arrange(match_type, name_x) |>
  slice_head(n = 15)

knitr::kable(mapping_preview)
name_x name_y match_type in_x in_y
Abditomys_latidens Abditomys_latidens exact TRUE TRUE
Abeomelomys_sevia Abeomelomys_sevia exact TRUE TRUE
Abrawayaomys_ruschii Abrawayaomys_ruschii exact TRUE TRUE
Abrocoma_bennettii Abrocoma_bennettii exact TRUE TRUE
Abrocoma_boliviensis Abrocoma_boliviensis exact TRUE TRUE
Abrocoma_budini Abrocoma_budini exact TRUE TRUE
Abrocoma_cinerea Abrocoma_cinerea exact TRUE TRUE
Abrocoma_famatina Abrocoma_famatina exact TRUE TRUE
Abrocoma_shistacea Abrocoma_shistacea exact TRUE TRUE
Abrocoma_vaccarum Abrocoma_vaccarum exact TRUE TRUE
Abrothrix_andinus Abrothrix_andinus exact TRUE TRUE
Abrothrix_hershkovitzi Abrothrix_hershkovitzi exact TRUE TRUE
Abrothrix_illuteus Abrothrix_illuteus exact TRUE TRUE
Abrothrix_jelskii Abrothrix_jelskii exact TRUE TRUE
Abrothrix_lanosus Abrothrix_lanosus exact TRUE TRUE

Names that remain unresolved or flagged can be inspected separately. We also ask for suggested matches. These suggestions are not applied automatically; they are candidates for manual review.

review_names <- mapping0 |>
  filter(in_x, match_type %in% c("unresolved", "flagged")) |>
  arrange(match_type, name_x)

if (nrow(review_names) == 0) {
  cat("No unresolved or flagged names in this example.\n")
} else {
  cat(sprintf(
    "Showing 10 of %d unresolved or flagged names.\n\n",
    nrow(review_names)
  ))
  knitr::kable(slice_head(review_names, n = 10) |>
                 select(any_of(c("name_x", "name_y", "match_type",
                                 "in_x", "in_y"))))
}
#> Showing 10 of 5393 unresolved or flagged names.
name_x name_y match_type in_x in_y
Abditomys latidens NA unresolved TRUE FALSE
Abeomelomys sevia NA unresolved TRUE FALSE
Abrawayaomys ruschii NA unresolved TRUE FALSE
Abrocoma bennettii NA unresolved TRUE FALSE
Abrocoma boliviensis NA unresolved TRUE FALSE
Abrocoma budini NA unresolved TRUE FALSE
Abrocoma cinerea NA unresolved TRUE FALSE
Abrocoma famatina NA unresolved TRUE FALSE
Abrocoma shistacea NA unresolved TRUE FALSE
Abrocoma vaccarum NA unresolved TRUE FALSE

suggestions0 <- reconcile_suggest(rec0, n = 3, threshold = 0.9)
#>  Found suggestions for 4683 of 5393 unresolved species.

suggestions_to_review <- suggestions0 |>
  transmute(
    name_x = unresolved,
    name_y = suggestion,
    score  = score
  ) |>
  filter(score >= 0.9, score < 1) |>
  arrange(desc(score))

if (nrow(suggestions_to_review) == 0) {
  cat("No high-confidence, non-perfect suggestions were found.\n")
} else {
  cat(
    "Showing up to 10 high-confidence suggested matches with score below 1.\n\n",
    sep = ""
  )
  knitr::kable(slice_head(suggestions_to_review, n = 10), digits = 3)
}
#> Showing up to 10 high-confidence suggested matches with score below 1.
name_x name_y score
Tamiops mcclellandii Tamiops_macclellandii 0.969
Tamiops_mcclellandii Tamiops_macclellandii 0.969
Rattus arfakiensis Rattus_arfakienis 0.964
Rattus_arfakiensis Rattus_arfakienis 0.964
Galeopterus variegates Galeopterus_variegatus 0.960
Galeopterus_variegates Galeopterus_variegatus 0.960
Herpestes edwardsi Herpestes_edwardsii 0.956
Herpestes_edwardsi Herpestes_edwardsii 0.956
Neophascogale lorentzi Neophascogale_lorentzii 0.956
Neophascogale_lorentzi Neophascogale_lorentzii 0.956

Step 4: Add manual corrections

Automated reconciliation is useful, but some names still need human review. The suggestion table above helps identify likely matches. Manual corrections are stored as a small editable table.

The important rule is simple: name_x must be a name from the trait database, and name_y must be an exact tip label in the tree.

manual_overrides <- suggestions_to_review |>
  slice_head(n = 2) |>
  mutate(user_note = "Accepted from high-confidence reconciliation suggestion") |>
  select(name_x, name_y, user_note)

if (nrow(manual_overrides) == 0) {
  cat("No manual corrections were added in this example.\n")
} else {
  knitr::kable(manual_overrides, digits = 3)
}
name_x name_y user_note
Tamiops mcclellandii Tamiops_macclellandii Accepted from high-confidence reconciliation suggestion
Tamiops_mcclellandii Tamiops_macclellandii Accepted from high-confidence reconciliation suggestion

We then apply any manual corrections to the reconciliation table. If manual_overrides is empty, the automated mapping is kept unchanged.

mapping_final <- mapping0 |>
  left_join(
    manual_overrides |>
      rename(manual_name_y = name_y, manual_note = user_note),
    by = "name_x"
  ) |>
  mutate(
    species_tree    = coalesce(manual_name_y, name_y),
    matched_to_tree = species_tree %in% mammal_tree_example$tip.label,
    match_type      = if_else(!is.na(manual_name_y), "manual", match_type),
    notes           = manual_note
  ) |>
  select(-manual_name_y, -manual_note)
final_reconciliation_summary <- mapping_final |>
  filter(in_x) |>
  count(match_type, name = "n_names") |>
  arrange(desc(n_names), match_type)

knitr::kable(final_reconciliation_summary)
match_type n_names
unresolved 5391
exact 4677
normalized 1234
manual 2

The corrected mapping can now be joined back to the full source-level trait table.

name_map <- mapping_final |>
  filter(in_x) |>
  transmute(
    species_raw     = name_x,
    species_tree    = species_tree,
    matched_to_tree = matched_to_tree,
    match_type      = match_type,
    notes           = notes
  )

db_full <- db_long_raw |>
  rename(species_raw = species) |>
  left_join(name_map, by = "species_raw") |>
  relocate(source, row_in_source, species_raw, species_tree,
           matched_to_tree, match_type)

db_tree_matched <- db_full |>
  filter(matched_to_tree, !is.na(species_tree))

knitr::kable(
  db_tree_matched |>
    select(source, species_raw, species_tree, match_type,
           adult_mass_g, female_mass_g, litter_size_n,
           litters_per_year_n) |>
    slice_head(n = 10),
  digits = 3
)
source species_raw species_tree match_type adult_mass_g female_mass_g litter_size_n litters_per_year_n
AMNIOTE Amblysomus_corriae Amblysomus_corriae exact 64.80 NA 1.93 2
AMNIOTE Amblysomus_hottentotus Amblysomus_hottentotus exact 64.80 NA 1.93 2
AMNIOTE Amblysomus_marleyi Amblysomus_marleyi exact 64.80 NA 1.93 2
AMNIOTE Amblysomus_robustus Amblysomus_robustus exact 64.80 NA 1.93 2
AMNIOTE Amblysomus_septentrionalis Amblysomus_septentrionalis exact 64.80 NA 1.93 2
AMNIOTE Calcochloris_obtusirostris Calcochloris_obtusirostris exact 24.05 NA 2.00 NA
AMNIOTE Carpitalpa_arendsi Carpitalpa_arendsi exact 52.34 NA NA NA
AMNIOTE Chlorotalpa_duthieae Chlorotalpa_duthieae exact 31.34 NA 2.00 NA
AMNIOTE Chlorotalpa_sclateri Chlorotalpa_sclateri exact 38.30 NA 2.00 NA
AMNIOTE Chrysochloris_asiatica Chrysochloris_asiatica exact 36.93 NA 3.12 NA

Step 5: Collapse to one row per species

The source-level database can now be summarised to one record per matched species. Here we use the median trait value across available source records and keep simple provenance columns.

db_species_summary <- db_tree_matched |>
  group_by(species_tree) |>
  summarise(
    n_sources_total       = n_distinct(source),
    sources               = safe_sources(source),
    adult_mass_g          = safe_median(adult_mass_g),
    female_mass_g         = safe_median(female_mass_g),
    litter_size_n         = safe_median(litter_size_n),
    litters_per_year_n    = safe_median(litters_per_year_n),
    adult_mass_n_records  = sum(!is.na(adult_mass_g)),
    female_mass_n_records = sum(!is.na(female_mass_g)),
    litter_size_n_records = sum(!is.na(litter_size_n)),
    litter_y_n_records    = sum(!is.na(litters_per_year_n)),
    .groups = "drop"
  ) |>
  mutate(annual_offspring_n = litter_size_n * litters_per_year_n)
trait_coverage <- db_species_summary |>
  summarise(
    n_species               = n(),
    adult_mass_species      = sum(!is.na(adult_mass_g)),
    female_mass_species     = sum(!is.na(female_mass_g)),
    litter_size_species     = sum(!is.na(litter_size_n)),
    litters_per_year_species= sum(!is.na(litters_per_year_n)),
    annual_offspring_species= sum(!is.na(annual_offspring_n))
  )

knitr::kable(trait_coverage)
n_species adult_mass_species female_mass_species litter_size_species litters_per_year_species annual_offspring_species
5911 5634 395 3446 2000 1975

Each *_species column counts the species with a non-missing value for that trait. Some species still carry NA: even with three sources combined, not every species has every trait measured. Downstream comparative models handle these gaps through imputation or complete-case analysis.

Step 6: Align the database and the tree

For phylogenetic comparative models, the data and tree must refer to the same species. Here we prune the tree and order the data rows to match the tree tips.

matched_tips <- intersect(
  mammal_tree_example$tip.label,
  db_species_summary$species_tree
)

tree_pcm <- keep.tip(mammal_tree_example, matched_tips)

pcm_data <- db_species_summary |>
  filter(species_tree %in% tree_pcm$tip.label) |>
  mutate(species = species_tree) |>
  arrange(match(species, tree_pcm$tip.label)) |>
  relocate(species)

stopifnot(identical(pcm_data$species, tree_pcm$tip.label))
alignment_check <- tibble::tibble(
  object          = c("pcm_data", "tree_pcm"),
  species_or_tips = c(nrow(pcm_data), ape::Ntip(tree_pcm)),
  aligned         = c(
    identical(pcm_data$species, tree_pcm$tip.label),
    identical(pcm_data$species, tree_pcm$tip.label)
  )
)

knitr::kable(alignment_check)
object species_or_tips aligned
pcm_data 5911 TRUE
tree_pcm 5911 TRUE

Final database ready for model fitting

The final objects are pcm_data and tree_pcm. The table below shows the first 10 rows of the assembled database. This is the object that would be passed to downstream phylogenetic comparative models.

knitr::kable(
  pcm_data |>
    select(species, adult_mass_g, litter_size_n,
           litters_per_year_n, annual_offspring_n,
           n_sources_total, sources) |>
    slice_head(n = 10),
  digits = 3
)
species adult_mass_g litter_size_n litters_per_year_n annual_offspring_n n_sources_total sources
Zaglossus_bartoni 8951.71 1 1.00 1.00 1 AMNIOTE
Zaglossus_bruijnii 16500.00 1 NA NA 1 TETRAPODTRAITS
Zaglossus_attenboroughi 3000.00 NA NA NA 1 AMNIOTE
Tachyglossus_aculeatus 3169.50 1 0.58 0.58 1 AMNIOTE
Ornithorhynchus_anatinus 1225.00 2 1.10 2.20 1 AMNIOTE
Hydrodamalis_gigas 8950000.00 1 NA NA 2 AMNIOTE; TETRAPODTRAITS
Trichechus_manatus 387500.00 1 0.40 0.40 1 AMNIOTE
Trichechus_senegalensis 477000.00 1 NA NA 1 AMNIOTE
Trichechus_inunguis 294000.59 1 0.50 0.50 1 AMNIOTE
Dugong_dugon 360000.00 1 0.22 0.22 1 AMNIOTE

The helper functions used in this vignette (prep_source(), pull_number_or_na(), safe_sources(), safe_median()) are local to this document; they are deliberately minimal so you can copy and adapt them for your own assembly script. If a generic version proves useful in practice we can graduate them to exported package functions.

References

Jones, Kate E., Jon Bielby, Marcel Cardillo, et al. 2009. “PanTHERIA: A Species‐level Database of Life History, Ecology, and Geography of Extant and Recently Extinct Mammals.” Ecology 90 (September): 2648–48. https://doi.org/10.1890/08-1494.1.
Moura, Mario R., Karoline Ceron, Jhonny J. M. Guedes, et al. 2024. “A Phylogeny-Informed Characterisation of Global Tetrapod Traits Addresses Data Gaps and Biases.” PLOS Biology 22 (July): e3002658. https://doi.org/10.1371/journal.pbio.3002658.
Myhrvold, Nathan P., Elita Baldridge, Benjamin Chan, Dhileep Sivam, Daniel L. Freeman, and S. K. Morgan Ernest. 2015. “An Amniote Life-History Database to Perform Comparative Analyses with Birds, Mammals, and Reptiles.” Ecology 96 (November): 3109–9. https://doi.org/10.1890/15-0846R.1.
Upham, Nathan S., Jacob A. Esselstyn, and Walter Jetz. 2019. “Inferring the Mammal Tree: Species-Level Sets of Phylogenies for Questions in Ecology, Evolution, and Conservation.” PLOS Biology 17: e3000494. https://doi.org/10.1371/journal.pbio.3000494.