Assembling mammal trait databases for phylogenetic comparative models
Source:vignettes/db-assembly-workflow_mammals.Rmd
db-assembly-workflow_mammals.RmdComparative analyses often begin well before model fitting. Trait data are commonly assembled from multiple databases, each with its own species names, column names, measurement conventions, and taxonomic coverage. A phylogenetic tree adds one more requirement: the final trait table must contain species that can be matched to the tree, and the table and tree must be aligned before they can be used in a phylogenetic model.
This vignette focuses on that database-assembly step. We combine three mammal trait datasets, reconcile their species names with a phylogenetic tree, and collapse the matched records into a species-level table. The goal is not to fit a model here, but to show how several databases and a tree can be brought into a single, clean object ready for downstream comparative analyses.
Setup
The worked example draws on five packages: dplyr for the table manipulations, readr for tolerant number parsing, stringr for tidying raw species strings, ape for the phylogeny, and prepR4pcm for reconciling species names against the tree.
Load the example sources
The package bundles three small source tables and one small tree. The source tables were sampled from real database structures, using only the columns needed for this example. They mirror three common inputs for mammal comparative work: the Amniote life-history database (Myhrvold et al. 2015), PanTHERIA (Jones et al. 2009), and TetrapodTraits (Moura et al. 2024). The bundled tree is a subset of the VertLife mammal phylogeny (Upham et al. 2019); the next code chunk reports its exact tip count. For analysis-grade trees download the full credible set from https://vertlife.org/phylosubsets/.
data(mammal_amniote_example)
data(mammal_pantheria_example)
data(mammal_tetrapodtraits_example)
data(mammal_tree_example)
cat(sprintf("Amniote-like source: %d rows\n", nrow(mammal_amniote_example)))
#> Amniote-like source: 4953 rows
cat(sprintf("PanTHERIA-like source: %d rows\n", nrow(mammal_pantheria_example)))
#> PanTHERIA-like source: 5416 rows
cat(sprintf("TetrapodTraits-like source: %d rows\n", nrow(mammal_tetrapodtraits_example)))
#> TetrapodTraits-like source: 5911 rows
cat(sprintf("Tree: %d tips\n", ape::Ntip(mammal_tree_example)))
#> Tree: 5987 tipsStep 1: Compare the source tables
The three sources contain related information, but they do not use the same column names. This is typical when assembling trait data from independent databases.
source_columns <- tibble::tibble(
source = c("Amniote", "PanTHERIA", "TetrapodTraits"),
n_rows = c(
nrow(mammal_amniote_example),
nrow(mammal_pantheria_example),
nrow(mammal_tetrapodtraits_example)
),
n_columns = c(
ncol(mammal_amniote_example),
ncol(mammal_pantheria_example),
ncol(mammal_tetrapodtraits_example)
),
species_column = c("name", "MSW05_Binomial", "Scientific.Name")
)
knitr::kable(source_columns)| source | n_rows | n_columns | species_column |
|---|---|---|---|
| Amniote | 4953 | 5 | name |
| PanTHERIA | 5416 | 4 | MSW05_Binomial |
| TetrapodTraits | 5911 | 3 | Scientific.Name |
Step 2: Standardise the sources
We first convert each source into the same long-format structure. The
result keeps one row per source record, plus a source
column so the provenance of each value is retained.
amniote_std <- prep_source(
mammal_amniote_example,
source_name = "AMNIOTE",
species_col = "name",
female_mass_col = "female_body_mass_g",
adult_mass_col = "adult_body_mass_g",
litter_size_col = "litter_or_clutch_size_n",
litter_y_col = "litters_or_clutches_per_y"
)
pantheria_std <- prep_source(
mammal_pantheria_example,
source_name = "PANTHERIA",
species_col = "MSW05_Binomial",
adult_mass_col = "5-1_AdultBodyMass_g",
litter_size_col = "15-1_LitterSize",
litter_y_col = "16-1_LittersPerYear"
)
tetrapodtraits_std <- prep_source(
mammal_tetrapodtraits_example,
source_name = "TETRAPODTRAITS",
species_col = "Scientific.Name",
adult_mass_col = "BodyMass_g",
litter_size_col = "LitterSize"
)
db_long_raw <- bind_rows(
amniote_std,
pantheria_std,
tetrapodtraits_std
)
knitr::kable(slice_head(db_long_raw, n = 10))| source | row_in_source | species | female_mass_g | adult_mass_g | litter_size_n | litters_per_year_n |
|---|---|---|---|---|---|---|
| AMNIOTE | 1 | Amblysomus_corriae | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 2 | Amblysomus_hottentotus | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 3 | Amblysomus_marleyi | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 4 | Amblysomus_robustus | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 5 | Amblysomus_septentrionalis | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 6 | Calcochloris_obtusirostris | NA | 24.05 | 2.00 | NA |
| AMNIOTE | 7 | Carpitalpa_arendsi | NA | 52.34 | NA | NA |
| AMNIOTE | 8 | Chlorotalpa_duthieae | NA | 31.34 | 2.00 | NA |
| AMNIOTE | 9 | Chlorotalpa_sclateri | NA | 38.30 | 2.00 | NA |
| AMNIOTE | 10 | Chrysochloris_asiatica | NA | 36.93 | 3.12 | NA |
We can now check how much information each source contributes.
source_coverage <- db_long_raw |>
group_by(source) |>
summarise(
n_records = n(),
n_species = n_distinct(species),
adult_mass_records = sum(!is.na(adult_mass_g)),
female_mass_records = sum(!is.na(female_mass_g)),
litter_size_records = sum(!is.na(litter_size_n)),
litter_y_records = sum(!is.na(litters_per_year_n)),
.groups = "drop"
)
knitr::kable(source_coverage)| source | n_records | n_species | adult_mass_records | female_mass_records | litter_size_records | litter_y_records |
|---|---|---|---|---|---|---|
| AMNIOTE | 4953 | 4953 | 4651 | 438 | 3511 | 2146 |
| PANTHERIA | 5416 | 5416 | 3542 | 0 | 2501 | 894 |
| TETRAPODTRAITS | 5911 | 5911 | 5911 | 0 | 3547 | 0 |
In this table n_records is the number of rows the source
contributes and n_species the number of distinct species;
each *_records column counts the rows where that trait has
a non-missing, positive value. Those counts sit well below
n_records — no single database measures every trait for
every species, which is exactly why combining sources is worthwhile.
Step 3: Reconcile species names with the tree
Name reconciliation is done on the unique source names, not on every row of the trait database. This creates one matching decision per raw species name.
species_lookup <- db_long_raw |>
distinct(species) |>
rename(species_raw = species)
knitr::kable(slice_head(species_lookup, n = 10))| species_raw |
|---|
| Amblysomus_corriae |
| Amblysomus_hottentotus |
| Amblysomus_marleyi |
| Amblysomus_robustus |
| Amblysomus_septentrionalis |
| Calcochloris_obtusirostris |
| Carpitalpa_arendsi |
| Chlorotalpa_duthieae |
| Chlorotalpa_sclateri |
| Chrysochloris_asiatica |
We now reconcile the source names against the tree. External synonym lookup is turned off here so the vignette remains fast and reproducible.
rec0 <- reconcile_tree(
x = species_lookup,
tree = mammal_tree_example,
x_species = "species_raw",
authority = NULL,
fuzzy = FALSE,
quiet = TRUE
)
reconcile_summary(rec0, detail = "brief")
#>
#> === Reconciliation Report ===
#> Type: data_tree
#> Timestamp: 2026-06-16 10:10:43
#> Package: prepR4pcm 0.4.0.9000
#> Authority: NONE (version: latest)
#> Rank: species
#>
#> --- Match Summary ---
#> Exact: 4677 / 11304
#> Normalized: 1234 / 11304
#> Synonym: 0 / 11304
#> Fuzzy: 0 / 11304
#> Manual: 0 / 11304
#> Unresolved: 5393 (x only) + 76 (y only)
#> The mapping table records which source names matched the tree and
which need review. In it, name_x is the raw source name,
name_y the tree tip it resolved to, match_type
records how the two were linked, and in_x /
in_y flag whether the name is present in the data and in
the tree.
mapping0 <- reconcile_mapping(rec0)
mapping_preview <- mapping0 |>
select(any_of(c("name_x", "name_y", "match_type", "in_x", "in_y"))) |>
arrange(match_type, name_x) |>
slice_head(n = 15)
knitr::kable(mapping_preview)| name_x | name_y | match_type | in_x | in_y |
|---|---|---|---|---|
| Abditomys_latidens | Abditomys_latidens | exact | TRUE | TRUE |
| Abeomelomys_sevia | Abeomelomys_sevia | exact | TRUE | TRUE |
| Abrawayaomys_ruschii | Abrawayaomys_ruschii | exact | TRUE | TRUE |
| Abrocoma_bennettii | Abrocoma_bennettii | exact | TRUE | TRUE |
| Abrocoma_boliviensis | Abrocoma_boliviensis | exact | TRUE | TRUE |
| Abrocoma_budini | Abrocoma_budini | exact | TRUE | TRUE |
| Abrocoma_cinerea | Abrocoma_cinerea | exact | TRUE | TRUE |
| Abrocoma_famatina | Abrocoma_famatina | exact | TRUE | TRUE |
| Abrocoma_shistacea | Abrocoma_shistacea | exact | TRUE | TRUE |
| Abrocoma_vaccarum | Abrocoma_vaccarum | exact | TRUE | TRUE |
| Abrothrix_andinus | Abrothrix_andinus | exact | TRUE | TRUE |
| Abrothrix_hershkovitzi | Abrothrix_hershkovitzi | exact | TRUE | TRUE |
| Abrothrix_illuteus | Abrothrix_illuteus | exact | TRUE | TRUE |
| Abrothrix_jelskii | Abrothrix_jelskii | exact | TRUE | TRUE |
| Abrothrix_lanosus | Abrothrix_lanosus | exact | TRUE | TRUE |
Names that remain unresolved or flagged can be inspected separately. We also ask for suggested matches. These suggestions are not applied automatically; they are candidates for manual review.
review_names <- mapping0 |>
filter(in_x, match_type %in% c("unresolved", "flagged")) |>
arrange(match_type, name_x)
if (nrow(review_names) == 0) {
cat("No unresolved or flagged names in this example.\n")
} else {
cat(sprintf(
"Showing 10 of %d unresolved or flagged names.\n\n",
nrow(review_names)
))
knitr::kable(slice_head(review_names, n = 10) |>
select(any_of(c("name_x", "name_y", "match_type",
"in_x", "in_y"))))
}
#> Showing 10 of 5393 unresolved or flagged names.| name_x | name_y | match_type | in_x | in_y |
|---|---|---|---|---|
| Abditomys latidens | NA | unresolved | TRUE | FALSE |
| Abeomelomys sevia | NA | unresolved | TRUE | FALSE |
| Abrawayaomys ruschii | NA | unresolved | TRUE | FALSE |
| Abrocoma bennettii | NA | unresolved | TRUE | FALSE |
| Abrocoma boliviensis | NA | unresolved | TRUE | FALSE |
| Abrocoma budini | NA | unresolved | TRUE | FALSE |
| Abrocoma cinerea | NA | unresolved | TRUE | FALSE |
| Abrocoma famatina | NA | unresolved | TRUE | FALSE |
| Abrocoma shistacea | NA | unresolved | TRUE | FALSE |
| Abrocoma vaccarum | NA | unresolved | TRUE | FALSE |
suggestions0 <- reconcile_suggest(rec0, n = 3, threshold = 0.9)
#> ✔ Found suggestions for 4683 of 5393 unresolved species.
suggestions_to_review <- suggestions0 |>
transmute(
name_x = unresolved,
name_y = suggestion,
score = score
) |>
filter(score >= 0.9, score < 1) |>
arrange(desc(score))
if (nrow(suggestions_to_review) == 0) {
cat("No high-confidence, non-perfect suggestions were found.\n")
} else {
cat(
"Showing up to 10 high-confidence suggested matches with score below 1.\n\n",
sep = ""
)
knitr::kable(slice_head(suggestions_to_review, n = 10), digits = 3)
}
#> Showing up to 10 high-confidence suggested matches with score below 1.| name_x | name_y | score |
|---|---|---|
| Tamiops mcclellandii | Tamiops_macclellandii | 0.969 |
| Tamiops_mcclellandii | Tamiops_macclellandii | 0.969 |
| Rattus arfakiensis | Rattus_arfakienis | 0.964 |
| Rattus_arfakiensis | Rattus_arfakienis | 0.964 |
| Galeopterus variegates | Galeopterus_variegatus | 0.960 |
| Galeopterus_variegates | Galeopterus_variegatus | 0.960 |
| Herpestes edwardsi | Herpestes_edwardsii | 0.956 |
| Herpestes_edwardsi | Herpestes_edwardsii | 0.956 |
| Neophascogale lorentzi | Neophascogale_lorentzii | 0.956 |
| Neophascogale_lorentzi | Neophascogale_lorentzii | 0.956 |
Step 4: Add manual corrections
Automated reconciliation is useful, but some names still need human review. The suggestion table above helps identify likely matches. Manual corrections are stored as a small editable table.
The important rule is simple: name_x must be a name from
the trait database, and name_y must be an exact tip label
in the tree.
manual_overrides <- suggestions_to_review |>
slice_head(n = 2) |>
mutate(user_note = "Accepted from high-confidence reconciliation suggestion") |>
select(name_x, name_y, user_note)
if (nrow(manual_overrides) == 0) {
cat("No manual corrections were added in this example.\n")
} else {
knitr::kable(manual_overrides, digits = 3)
}| name_x | name_y | user_note |
|---|---|---|
| Tamiops mcclellandii | Tamiops_macclellandii | Accepted from high-confidence reconciliation suggestion |
| Tamiops_mcclellandii | Tamiops_macclellandii | Accepted from high-confidence reconciliation suggestion |
We then apply any manual corrections to the reconciliation table. If
manual_overrides is empty, the automated mapping is kept
unchanged.
mapping_final <- mapping0 |>
left_join(
manual_overrides |>
rename(manual_name_y = name_y, manual_note = user_note),
by = "name_x"
) |>
mutate(
species_tree = coalesce(manual_name_y, name_y),
matched_to_tree = species_tree %in% mammal_tree_example$tip.label,
match_type = if_else(!is.na(manual_name_y), "manual", match_type),
notes = manual_note
) |>
select(-manual_name_y, -manual_note)
final_reconciliation_summary <- mapping_final |>
filter(in_x) |>
count(match_type, name = "n_names") |>
arrange(desc(n_names), match_type)
knitr::kable(final_reconciliation_summary)| match_type | n_names |
|---|---|
| unresolved | 5391 |
| exact | 4677 |
| normalized | 1234 |
| manual | 2 |
The corrected mapping can now be joined back to the full source-level trait table.
name_map <- mapping_final |>
filter(in_x) |>
transmute(
species_raw = name_x,
species_tree = species_tree,
matched_to_tree = matched_to_tree,
match_type = match_type,
notes = notes
)
db_full <- db_long_raw |>
rename(species_raw = species) |>
left_join(name_map, by = "species_raw") |>
relocate(source, row_in_source, species_raw, species_tree,
matched_to_tree, match_type)
db_tree_matched <- db_full |>
filter(matched_to_tree, !is.na(species_tree))
knitr::kable(
db_tree_matched |>
select(source, species_raw, species_tree, match_type,
adult_mass_g, female_mass_g, litter_size_n,
litters_per_year_n) |>
slice_head(n = 10),
digits = 3
)| source | species_raw | species_tree | match_type | adult_mass_g | female_mass_g | litter_size_n | litters_per_year_n |
|---|---|---|---|---|---|---|---|
| AMNIOTE | Amblysomus_corriae | Amblysomus_corriae | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Amblysomus_hottentotus | Amblysomus_hottentotus | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Amblysomus_marleyi | Amblysomus_marleyi | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Amblysomus_robustus | Amblysomus_robustus | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Amblysomus_septentrionalis | Amblysomus_septentrionalis | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Calcochloris_obtusirostris | Calcochloris_obtusirostris | exact | 24.05 | NA | 2.00 | NA |
| AMNIOTE | Carpitalpa_arendsi | Carpitalpa_arendsi | exact | 52.34 | NA | NA | NA |
| AMNIOTE | Chlorotalpa_duthieae | Chlorotalpa_duthieae | exact | 31.34 | NA | 2.00 | NA |
| AMNIOTE | Chlorotalpa_sclateri | Chlorotalpa_sclateri | exact | 38.30 | NA | 2.00 | NA |
| AMNIOTE | Chrysochloris_asiatica | Chrysochloris_asiatica | exact | 36.93 | NA | 3.12 | NA |
Step 5: Collapse to one row per species
The source-level database can now be summarised to one record per matched species. Here we use the median trait value across available source records and keep simple provenance columns.
db_species_summary <- db_tree_matched |>
group_by(species_tree) |>
summarise(
n_sources_total = n_distinct(source),
sources = safe_sources(source),
adult_mass_g = safe_median(adult_mass_g),
female_mass_g = safe_median(female_mass_g),
litter_size_n = safe_median(litter_size_n),
litters_per_year_n = safe_median(litters_per_year_n),
adult_mass_n_records = sum(!is.na(adult_mass_g)),
female_mass_n_records = sum(!is.na(female_mass_g)),
litter_size_n_records = sum(!is.na(litter_size_n)),
litter_y_n_records = sum(!is.na(litters_per_year_n)),
.groups = "drop"
) |>
mutate(annual_offspring_n = litter_size_n * litters_per_year_n)
trait_coverage <- db_species_summary |>
summarise(
n_species = n(),
adult_mass_species = sum(!is.na(adult_mass_g)),
female_mass_species = sum(!is.na(female_mass_g)),
litter_size_species = sum(!is.na(litter_size_n)),
litters_per_year_species= sum(!is.na(litters_per_year_n)),
annual_offspring_species= sum(!is.na(annual_offspring_n))
)
knitr::kable(trait_coverage)| n_species | adult_mass_species | female_mass_species | litter_size_species | litters_per_year_species | annual_offspring_species |
|---|---|---|---|---|---|
| 5911 | 5634 | 395 | 3446 | 2000 | 1975 |
Each *_species column counts the species with a
non-missing value for that trait. Some species still carry
NA: even with three sources combined, not every species has
every trait measured. Downstream comparative models handle these gaps
through imputation or complete-case analysis.
Step 6: Align the database and the tree
For phylogenetic comparative models, the data and tree must refer to the same species. Here we prune the tree and order the data rows to match the tree tips.
matched_tips <- intersect(
mammal_tree_example$tip.label,
db_species_summary$species_tree
)
tree_pcm <- keep.tip(mammal_tree_example, matched_tips)
pcm_data <- db_species_summary |>
filter(species_tree %in% tree_pcm$tip.label) |>
mutate(species = species_tree) |>
arrange(match(species, tree_pcm$tip.label)) |>
relocate(species)
stopifnot(identical(pcm_data$species, tree_pcm$tip.label))
alignment_check <- tibble::tibble(
object = c("pcm_data", "tree_pcm"),
species_or_tips = c(nrow(pcm_data), ape::Ntip(tree_pcm)),
aligned = c(
identical(pcm_data$species, tree_pcm$tip.label),
identical(pcm_data$species, tree_pcm$tip.label)
)
)
knitr::kable(alignment_check)| object | species_or_tips | aligned |
|---|---|---|
| pcm_data | 5911 | TRUE |
| tree_pcm | 5911 | TRUE |
Final database ready for model fitting
The final objects are pcm_data and
tree_pcm. The table below shows the first 10 rows of the
assembled database. This is the object that would be passed to
downstream phylogenetic comparative models.
knitr::kable(
pcm_data |>
select(species, adult_mass_g, litter_size_n,
litters_per_year_n, annual_offspring_n,
n_sources_total, sources) |>
slice_head(n = 10),
digits = 3
)| species | adult_mass_g | litter_size_n | litters_per_year_n | annual_offspring_n | n_sources_total | sources |
|---|---|---|---|---|---|---|
| Zaglossus_bartoni | 8951.71 | 1 | 1.00 | 1.00 | 1 | AMNIOTE |
| Zaglossus_bruijnii | 16500.00 | 1 | NA | NA | 1 | TETRAPODTRAITS |
| Zaglossus_attenboroughi | 3000.00 | NA | NA | NA | 1 | AMNIOTE |
| Tachyglossus_aculeatus | 3169.50 | 1 | 0.58 | 0.58 | 1 | AMNIOTE |
| Ornithorhynchus_anatinus | 1225.00 | 2 | 1.10 | 2.20 | 1 | AMNIOTE |
| Hydrodamalis_gigas | 8950000.00 | 1 | NA | NA | 2 | AMNIOTE; TETRAPODTRAITS |
| Trichechus_manatus | 387500.00 | 1 | 0.40 | 0.40 | 1 | AMNIOTE |
| Trichechus_senegalensis | 477000.00 | 1 | NA | NA | 1 | AMNIOTE |
| Trichechus_inunguis | 294000.59 | 1 | 0.50 | 0.50 | 1 | AMNIOTE |
| Dugong_dugon | 360000.00 | 1 | 0.22 | 0.22 | 1 | AMNIOTE |
The helper functions used in this vignette
(prep_source(), pull_number_or_na(),
safe_sources(), safe_median()) are local to
this document; they are deliberately minimal so you can copy and adapt
them for your own assembly script. If a generic version proves useful in
practice we can graduate them to exported package functions.