Reconcile species names between a dataset and a phylogenetic tree

Match the species in a trait data frame (x) to the tip labels of a phylogenetic tree (tree). The result is a reconciliation object that reconcile_apply() can turn into an aligned data frame and pruned tree for PGLS, phylogenetic GLMMs, ancestral state reconstruction, or any other phylogenetic comparative method (PCM). This is typically the first function you call in a prepR4pcm workflow.

Usage

reconcile_tree(
  x,
  tree,
  x_species = NULL,
  authority = "col",
  rank = c("species", "subspecies"),
  overrides = NULL,
  db_version = NULL,
  fuzzy = FALSE,
  fuzzy_threshold = 0.9,
  flag_threshold = 0.95,
  resolve = c("flag", "first"),
  quiet = FALSE,
  x_label = NULL
)

Arguments

x

A data frame containing the trait data. Must contain a column of scientific names (see x_species).

tree

An ape::phylo object, or a length-1 character vector giving the path to a Newick (.nwk, .tre, .tree) or Nexus (.nex, .nexus) file. File format is auto-detected.

x_species

A length-1 character vector. Name of the column in x that contains the scientific names (called "species names" elsewhere on this help page). When NULL, the column is auto-detected from a short list of common labels (e.g. species, Species1, scientific_name). The list is not exhaustive, so pass the column name explicitly if your data uses a non-standard label.

authority

A length-1 character vector, or NULL. Taxonomic authority used for synonym resolution (stage 3 of the cascade). One of:

"col" (default): Catalogue of Life — broad, curated, frequently updated. A sensible default for most taxa.
"itis": Integrated Taxonomic Information System — strong for North American vertebrates and plants.
"gbif": Global Biodiversity Information Facility backbone. Wider coverage; includes more recent synonymy.
"ncbi": NCBI Taxonomy — best when working with sequence data.
"ott": Open Tree of Life synthetic taxonomy. Useful when your downstream phylogeny is from the Open Tree synthesis.
"itis_test": A small bundled subset of ITIS, cached locally with taxadb for testing. Intended for examples and unit tests; not for analysis.
"gnverifier": HTTP-backed verification against ~100 sources via the Global Names verifier; no local database download. See vignette("getting-started") for the trade-off (wider coverage, requires network and the httr2 package).
NULL: Skip the synonym stage entirely. Useful for quick checks or when taxadb is unavailable. Stages 1 and 2 still run, as does stage 4 when fuzzy = TRUE.

Five authority codes that earlier versions of the package advertised — "iucn", "tpl", "fb", "slb", "wd" — are no longer accepted. Empirical testing against taxadb v22.12 showed that iucn errors with a schema mismatch and the others are not taxadb providers at all. Passing one of those values now produces a helpful migration error.

rank

A length-1 character vector. Controls how trinomials are handled during normalisation:

"species" (default): Strip infraspecific epithets so that "Parus major major" becomes "Parus major" before matching.
"subspecies": Keep trinomials intact. Use this when your analysis operates at subspecies level.

overrides

Optional pre-built corrections. Either a data frame with at least columns name_x and name_y (plus an optional user_note column), or a file path to a CSV with the same columns. Any name listed here bypasses the cascade and is recorded as match_type = "manual". Useful for applying published crosswalks (see reconcile_crosswalk()) or for locking down decisions made in a previous run.
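
A minimal sketch of an overrides table, assuming only the column names given above (name_x, name_y, user_note). The species names are placeholders invented for illustration, and whether name_y must match the tip label character for character is not stated here, so check that against your own tree.

```r
# Column names follow the argument description; the rows are placeholders.
overrides <- data.frame(
  name_x    = c("Genus alpha", "Genus betta"),
  name_y    = c("Genus alphus", "Genus beta"),
  user_note = c("illustrative synonym", "illustrative typo fix"),
  stringsAsFactors = FALSE
)

# Save as CSV so the same corrections can be supplied as a file path.
overrides_path <- tempfile(fileext = ".csv")
write.csv(overrides, overrides_path, row.names = FALSE)

# Round-trip check: the CSV carries the expected columns.
names(read.csv(overrides_path))
```

Either the data frame or the file path could then be passed as overrides.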

db_version

A length-1 character vector, or NULL. taxadb database snapshot to use (e.g. "22.12"). NULL (default) uses the latest available.

fuzzy

Logical. Enables the fuzzy-matching stage when TRUE. Default FALSE. Turn this on to catch likely typos (Corvus brachyrhnchos -> Corvus brachyrhynchos). When FALSE, stages 1–3 still run.

fuzzy_threshold

Numeric in [0, 1]. Minimum genus-weighted similarity score for a fuzzy match to be accepted. Default 0.9 (roughly "no more than ~10% of characters differ"). Lower values (e.g. 0.7) are more permissive but produce more false positives; always review fuzzy matches with reconcile_suggest() or reconcile_review() before trusting them.
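
To get a feel for the scale of the threshold, the sketch below scores the typo example from the fuzzy argument with a plain normalised edit distance. This is an assumption for illustration only: edit_similarity() is a hypothetical helper, and the package's genus-weighted score may give different values.

```r
# Assumption: plain normalised Levenshtein similarity via utils::adist().
# This is NOT the package's genus-weighted score.
edit_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]          # number of single-character edits
  1 - d / max(nchar(a), nchar(b))        # rescale to [0, 1]
}

s <- edit_similarity("Corvus brachyrhnchos", "Corvus brachyrhynchos")
s >= 0.9   # does this pair clear the default fuzzy_threshold?
```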

flag_threshold

Numeric in [0, 1]. When resolve = "flag", fuzzy matches with a score below this value are recorded as match_type = "flagged" rather than "fuzzy", marking them for manual review. Default 0.95. Must be >= fuzzy_threshold to have any effect.
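
The interaction of the two thresholds under resolve = "flag" can be sketched as a three-way split. classify_fuzzy() is a hypothetical helper that mirrors the rules described above, not package code, and the "unresolved" label for rejected scores is an assumption.

```r
# Hypothetical helper mirroring the documented rules for resolve = "flag".
classify_fuzzy <- function(score,
                           fuzzy_threshold = 0.9,
                           flag_threshold  = 0.95) {
  ifelse(score < fuzzy_threshold, "unresolved",   # rejected outright
         ifelse(score < flag_threshold,
                "flagged",                        # accepted, needs review
                "fuzzy"))                         # accepted as-is
}

classify_fuzzy(c(0.85, 0.92, 0.97))
```

Setting flag_threshold below fuzzy_threshold leaves the middle band empty, which is why the argument then has no effect.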

resolve

A length-1 character vector. What to do with borderline matches:

"flag" (default): Mark low-confidence fuzzy matches (score below flag_threshold) and names with indirect taxadb synonymy as match_type = "flagged" so you can audit them with reconcile_review() or reconcile_suggest().
"first": Accept the highest-scoring candidate silently, without flagging. Faster but riskier; use only when you have already reviewed the ambiguities.

quiet

Logical. Suppresses progress messages when TRUE. Default FALSE.

x_label

A length-1 character vector or NULL. Human-readable label for source x, stored in the reconciliation metadata and shown by print() / format(). Defaults to the expression passed as x (via deparse(substitute(x))). Set this explicitly when calling reconcile_tree() inside another function so the label reflects the real data source rather than the local argument name.

Value

A reconciliation object with meta$type == "data_tree". The mapping tibble has one row per unique name: matched species (in_x & in_y), data-only orphans (in_x & !in_y, candidates for reconcile_augment()), and tree-only orphans (!in_x & in_y, candidates for reconcile_apply() to prune).
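
The three groups can be separated with ordinary logical indexing on in_x and in_y. The sketch below uses a mock data frame rather than a real reconciliation object: the name column and all rows are invented, and only the in_x / in_y columns are taken from the description above.

```r
# Mock stand-in for the mapping table (rows invented for illustration).
mapping <- data.frame(
  name = c("sp_a", "sp_b", "sp_c", "sp_d"),
  in_x = c(TRUE, TRUE, FALSE, TRUE),
  in_y = c(TRUE, FALSE, TRUE, TRUE)
)

matched      <- mapping[ mapping$in_x &  mapping$in_y, ]  # in data and tree
data_orphans <- mapping[ mapping$in_x & !mapping$in_y, ]  # data only
tree_orphans <- mapping[!mapping$in_x &  mapping$in_y, ]  # tree only

c(matched   = nrow(matched),
  data_only = nrow(data_orphans),
  tree_only = nrow(tree_orphans))
```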

Details

Internally, reconcile_tree() treats the tree's tip labels as the y argument of reconcile_data() and runs the same four-stage matching cascade (exact -> normalized -> synonym -> fuzzy). Tip labels typically differ from data names only in formatting (underscores, capitalisation, authority strings), so even with authority = NULL you usually recover most matches at the normalized stage. Turn on fuzzy = TRUE to also catch spelling mistakes.
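
As a rough picture of why the normalized stage recovers most tip labels, the sketch below matches names after replacing underscores and harmonising case. normalise_name() is a hypothetical helper, not the package's implementation, and it does not attempt to strip authority strings.

```r
# Hypothetical normaliser: NOT the package's implementation.
normalise_name <- function(x) {
  x <- gsub("_", " ", x, fixed = TRUE)   # underscores -> spaces
  x <- gsub(" +", " ", trimws(x))        # collapse repeated spaces
  # Capitalise the first letter, lower-case the rest.
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

data_names <- c("Acanthiza apicalis", "acanthiza nana", "Gerygone citrina")
tip_labels <- c("Acanthiza_apicalis", "Acanthiza_nana")

# Index of the matching tip for each data name (NA = no match at this stage).
match(normalise_name(data_names), normalise_name(tip_labels))
```

Names left as NA here are the ones a synonym or fuzzy stage would still need to resolve.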

After reconciliation, the typical workflow is:

Inspect with reconcile_summary() or reconcile_plot().
Investigate unresolved names with reconcile_suggest() and fix them with reconcile_override() or reconcile_override_batch().
Produce an aligned data frame and pruned tree via reconcile_apply().
Optionally, graft orphan species onto the tree with reconcile_augment() (exploratory only; always run sensitivity analyses).

References

Paradis, E. & Schliep, K. (2019) ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35:526–528. doi:10.1093/bioinformatics/bty633

Examples

# Reconcile the bundled AVONET subset against the Jetz et al. (2012)
# bird tree. `authority = NULL` keeps the example offline; in a real
# analysis you would usually set `authority = "col"` (Catalogue of
# Life) to pick up taxonomic synonyms.
data(avonet_subset)
data(tree_jetz)

rec <- reconcile_tree(
  avonet_subset, tree_jetz,
  x_species = "Species1",
  authority = NULL,
  fuzzy     = TRUE          # also catch typos
)
#> ℹ Reconciling 919 data names vs 657 tree tips
#> ℹ Matching 919 x 657 names through 3 stages...
#> ℹ Stage 1/3: Exact matching...
#> ℹ Stage 2/3: Normalised matching (0 matched so far)...
#> ℹ Stage 3/3: Fuzzy matching (657 matched so far)...
#> ✔ Matched 657/919 data names to tree tips
rec                         # one-line status
#> 
#> ── Reconciliation: data vs tree ────────────────────────────────────────────────
#>   Source x: avonet_subset
#>   Source y: phylo (657 tips)
#>   Authority: none
#>   Timestamp: 2026-06-16 10:09:58
#> ℹ Match coverage: [█████████████████████░░░░░░░░░] 71% (657/919)
#> 
#> ── Match summary ──
#> 
#> • Exact: 0 ( 0.0%)
#> • Normalized: 657 (71.5%)
#> • Synonym: 0 ( 0.0%)
#> • Fuzzy: 0 ( 0.0%)
#> • Manual: 0 ( 0.0%)
#> ! Unresolved (x only):262 (28.5%)
#> ! Unresolved (y only):0
#> ! Flagged for review: 0
#> ℹ Use `reconcile_summary()` for details, `reconcile_mapping()` for the full table.
reconcile_summary(rec)      # full breakdown by match type
#> 
#> === Reconciliation Report ===
#> Type: data_tree
#> Timestamp: 2026-06-16 10:09:58
#> Package: prepR4pcm 0.4.0.9000
#> Authority: NONE (version: latest)
#> Rank: species
#> 
#> --- Match Summary ---
#>   Exact:       0 / 919
#>   Normalized:  657 / 919
#>   Synonym:     0 / 919
#>   Fuzzy:       0 / 919
#>   Manual:      0 / 919
#>   Unresolved:  262 (x only) + 0 (y only)
#> 
#> --- Normalized Matches (657) ---
#>   "Acanthiza apicalis" -> "Acanthiza_apicalis"  ['Acanthiza apicalis' normalised to 'Acanthiza apicalis']
#>   "Acanthiza chrysorrhoa" -> "Acanthiza_chrysorrhoa"  ['Acanthiza chrysorrhoa' normalised to 'Acanthiza chrysorrhoa']
#>   "Acanthiza ewingii" -> "Acanthiza_ewingii"  ['Acanthiza ewingii' normalised to 'Acanthiza ewingii']
#>   "Acanthiza inornata" -> "Acanthiza_inornata"  ['Acanthiza inornata' normalised to 'Acanthiza inornata']
#>   "Acanthiza iredalei" -> "Acanthiza_iredalei"  ['Acanthiza iredalei' normalised to 'Acanthiza iredalei']
#>   "Acanthiza katherina" -> "Acanthiza_katherina"  ['Acanthiza katherina' normalised to 'Acanthiza katherina']
#>   "Acanthiza lineata" -> "Acanthiza_lineata"  ['Acanthiza lineata' normalised to 'Acanthiza lineata']
#>   "Acanthiza murina" -> "Acanthiza_murina"  ['Acanthiza murina' normalised to 'Acanthiza murina']
#>   "Acanthiza nana" -> "Acanthiza_nana"  ['Acanthiza nana' normalised to 'Acanthiza nana']
#>   "Acanthiza pusilla" -> "Acanthiza_pusilla"  ['Acanthiza pusilla' normalised to 'Acanthiza pusilla']
#>   "Acanthiza reguloides" -> "Acanthiza_reguloides"  ['Acanthiza reguloides' normalised to 'Acanthiza reguloides']
#>   "Acanthiza robustirostris" -> "Acanthiza_robustirostris"  ['Acanthiza robustirostris' normalised to 'Acanthiza robustirostris']
#>   "Acanthiza uropygialis" -> "Acanthiza_uropygialis"  ['Acanthiza uropygialis' normalised to 'Acanthiza uropygialis']
#>   "Acanthornis magna" -> "Acanthornis_magna"  ['Acanthornis magna' normalised to 'Acanthornis magna']
#>   "Aphelocephala leucopsis" -> "Aphelocephala_leucopsis"  ['Aphelocephala leucopsis' normalised to 'Aphelocephala leucopsis']
#>   "Aphelocephala nigricincta" -> "Aphelocephala_nigricincta"  ['Aphelocephala nigricincta' normalised to 'Aphelocephala nigricincta']
#>   "Aphelocephala pectoralis" -> "Aphelocephala_pectoralis"  ['Aphelocephala pectoralis' normalised to 'Aphelocephala pectoralis']
#>   "Calamanthus campestris" -> "Calamanthus_campestris"  ['Calamanthus campestris' normalised to 'Calamanthus campestris']
#>   "Calamanthus fuliginosus" -> "Calamanthus_fuliginosus"  ['Calamanthus fuliginosus' normalised to 'Calamanthus fuliginosus']
#>   "Crateroscelis murina" -> "Crateroscelis_murina"  ['Crateroscelis murina' normalised to 'Crateroscelis murina']
#>   ... and 637 more
#> 
#> --- Unresolved: In x But Not In y (262) ---
#>   Acanthiza cinerea
#>   Calamanthus cautus
#>   Calamanthus montanellus
#>   Calamanthus pyrrhopygius
#>   Gerygone citrina
#>   Pyrrholaemus sagittatus
#>   Artamus leucoryn
#>   Cracticus argenteus
#>   Melloria quoyi
#>   Ceblepyris caesius
#>   Ceblepyris cinereus
#>   Ceblepyris cucullatus
#>   Ceblepyris graueri
#>   Ceblepyris pectoralis
#>   Celebesica abbotti
#>   Coracina dobsoni
#>   Coracina panayensis
#>   Coracina welchmani
#>   Cyanograucalus azureus
#>   Edolisoma anale
#>   Edolisoma ceramense
#>   Edolisoma coerulescens
#>   Edolisoma dispar
#>   Edolisoma dohertyi
#>   Edolisoma grayi
#>   Edolisoma holopolium
#>   Edolisoma incertum
#>   Edolisoma insperatum
#>   Edolisoma melas
#>   Edolisoma meyerii
#>   ... and 232 more
#>  

# Produce aligned data + pruned tree ready for PGLS / PGLMM
aligned <- reconcile_apply(rec,
                           data = avonet_subset,
                           tree = tree_jetz,
                           species_col = "Species1",
                           drop_unresolved = TRUE)
#> ! Dropped 262 rows with unresolved species from data
#> ℹ Tree has 657 tips after alignment
nrow(aligned$data)
#> [1] 657
ape::Ntip(aligned$tree)
#> [1] 657