Match the species column of one data frame (x) to the species column of another (y), returning a reconciliation object that records how every name was resolved. Use this when combining trait datasets, range datasets, or any other species-level tables that may use slightly different taxonomies or spellings.

Usage

reconcile_data(
  x,
  y,
  x_species = NULL,
  y_species = NULL,
  authority = "col",
  rank = c("species", "subspecies"),
  overrides = NULL,
  db_version = NULL,
  fuzzy = FALSE,
  fuzzy_threshold = 0.9,
  flag_threshold = 0.95,
  resolve = c("flag", "first"),
  quiet = FALSE,
  x_label = NULL,
  y_label = NULL
)

Arguments

x

The source data frame: the one whose species names are to be matched.

y

The target data frame: the one whose species names x is matched against (typically the "reference" taxonomy or the dataset you want to merge with).

x_species

A length-1 character vector. Name of the column in x containing scientific names. Auto-detected (e.g. species, Species1, scientific_name) when NULL.

y_species

A length-1 character vector. Name of the column in y containing scientific names. Auto-detected when NULL.

authority

A length-1 character vector, or NULL. Taxonomic authority used for synonym resolution (stage 3 of the cascade). One of:

"col" (default)

Catalogue of Life — broad, curated, frequently updated. A sensible default for most taxa.

"itis"

Integrated Taxonomic Information System — strong for North American vertebrates and plants.

"gbif"

Global Biodiversity Information Facility backbone. Wider coverage; includes more recent synonymy.

"ncbi"

NCBI Taxonomy — best when working with sequence data.

"ott"

Open Tree of Life synthetic taxonomy. Useful when your downstream phylogeny is from the Open Tree synthesis.

"itis_test"

A small bundled subset of ITIS, cached locally with taxadb for testing. Intended for examples and unit tests; not for analysis.

"gnverifier"

HTTP-backed verification against ~100 sources via the Global Names verifier; no local database download. See vignette("getting-started") for the trade-off (wider coverage, requires network and the httr2 package).

NULL

Skip the synonym stage entirely. Useful for quick checks or when taxadb is unavailable. Stages 1 and 2 still run, as does stage 4 when fuzzy = TRUE.

Five authority codes that earlier versions of the package advertised ("iucn", "tpl", "fb", "slb", "wd") are no longer accepted. Testing against taxadb v22.12 showed that "iucn" fails with a schema mismatch and that the others are not taxadb providers at all. Passing any of these values now raises an error with migration guidance.

rank

A length-1 character vector. Controls how trinomials are handled during normalisation:

"species" (default)

Strip infraspecific epithets so that "Parus major major" becomes "Parus major" before matching.

"subspecies"

Keep trinomials intact. Use this when your analysis operates at subspecies level.
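As a rough illustration of what the two rank settings imply, the sketch below trims a name to its first two or three whitespace-separated tokens. The helper trim_to_rank() is hypothetical and is not the package's normaliser: it does not handle authority strings, hybrid markers, or rank abbreviations such as "subsp.".

```r
# Hypothetical helper, for illustration only: keep the first two tokens
# (genus + specific epithet) for rank = "species", or the first three
# tokens for rank = "subspecies".
trim_to_rank <- function(name, rank = c("species", "subspecies")) {
  rank <- match.arg(rank)
  n_keep <- if (rank == "species") 2L else 3L
  vapply(strsplit(trimws(name), "\\s+"), function(tokens) {
    paste(utils::head(tokens, n_keep), collapse = " ")
  }, character(1))
}

trim_to_rank("Parus major major")                       # "Parus major"
trim_to_rank("Parus major major", rank = "subspecies")  # "Parus major major"
```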

overrides

Optional pre-built corrections. Either a data frame with at least columns name_x and name_y (plus an optional user_note column), or a file path to a CSV with the same columns. Any name listed here bypasses the cascade and is recorded as match_type = "manual". Useful for applying published crosswalks (see reconcile_crosswalk()) or for locking down decisions made in a previous run.
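A minimal sketch of an overrides table in the documented shape, using base R only. The two rows are illustrative, not a published crosswalk, and the reconcile_data() calls are left as comments because x and y stand for whatever data frames you are reconciling.

```r
# Illustrative overrides table with the documented columns:
# name_x, name_y, and the optional user_note.
overrides <- data.frame(
  name_x    = c("Corvus brachyrhnchos", "Parus major major"),
  name_y    = c("Corvus brachyrhynchos", "Parus major"),
  user_note = c("typo in source table", "collapsed to species"),
  stringsAsFactors = FALSE
)

# The same table can be written to a CSV and supplied by path instead.
csv_path <- tempfile(fileext = ".csv")
write.csv(overrides, csv_path, row.names = FALSE)
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
names(round_trip)

# Either form would then be passed through the overrides argument:
# rec <- reconcile_data(x, y, overrides = overrides)
# rec <- reconcile_data(x, y, overrides = csv_path)
```

Keeping the table in a CSV under version control is one way to make the decisions from a previous run repeatable.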

db_version

A length-1 character vector. taxadb database snapshot to use (e.g. "22.12"). NULL (default) uses the latest available.

fuzzy

Logical. Enables the fuzzy-matching stage when TRUE. Default FALSE. Turn this on to catch likely typos (Corvus brachyrhnchos -> Corvus brachyrhynchos). When FALSE, stages 1–3 still run.

fuzzy_threshold

Numeric in [0, 1]. Minimum genus-weighted similarity score for a fuzzy match to be accepted. Default 0.9 (roughly "no more than ~10% of characters differ"). Lower values (e.g. 0.7) are more permissive but produce more false positives; always review fuzzy matches with reconcile_suggest() or reconcile_review() before trusting them.

flag_threshold

Numeric in [0, 1]. When resolve = "flag", fuzzy matches with a score below this value are recorded as match_type = "flagged" rather than "fuzzy", marking them for manual review. Default 0.95. Must be >= fuzzy_threshold to have any effect.

resolve

A length-1 character vector. What to do with borderline matches:

"flag" (default)

Mark low-confidence fuzzy matches (score below flag_threshold) and names with indirect taxadb synonymy as match_type = "flagged" so you can audit them with reconcile_review() or reconcile_suggest().

"first"

Accept the highest-scoring candidate silently, without flagging. Faster but riskier; use only when you have already reviewed the ambiguities.
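The interaction between fuzzy_threshold, flag_threshold and resolve can be sketched as below. classify_fuzzy() is a hypothetical helper written from the argument descriptions, not code from the package. It assumes that a score equal to fuzzy_threshold is accepted, and that a candidate below it ends up "unresolved" because no other stage matched the name.

```r
# Hypothetical helper: label fuzzy similarity scores according to the
# documented meaning of fuzzy_threshold, flag_threshold and resolve.
classify_fuzzy <- function(score,
                           fuzzy_threshold = 0.9,
                           flag_threshold = 0.95,
                           resolve = c("flag", "first")) {
  resolve <- match.arg(resolve)
  label <- rep("unresolved", length(score))     # below fuzzy_threshold: rejected
  label[score >= fuzzy_threshold] <- "fuzzy"    # accepted as a fuzzy match
  if (resolve == "flag") {
    # Accepted, but below flag_threshold: marked for manual review
    flagged <- score >= fuzzy_threshold & score < flag_threshold
    label[flagged] <- "flagged"
  }
  label
}

scores <- c(0.85, 0.92, 0.97)
classify_fuzzy(scores)                        # "unresolved" "flagged" "fuzzy"
classify_fuzzy(scores, resolve = "first")     # "unresolved" "fuzzy"   "fuzzy"
classify_fuzzy(scores, flag_threshold = 0.8)  # "unresolved" "fuzzy"   "fuzzy"
```

The last call shows why flag_threshold has no effect when it is set below fuzzy_threshold: no accepted score can fall under it.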

quiet

Logical. Suppresses progress messages when TRUE. Default FALSE.

x_label

A length-1 character vector or NULL. Human-readable label for source x stored in the reconciliation metadata and shown in print() / format(). Defaults to the expression passed as x (via deparse(substitute())). Set this explicitly when calling reconcile_data() inside another function so the label reflects the real data source rather than the local argument name.

y_label

A length-1 character vector or NULL. Same as x_label, for source y.
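The reason for setting x_label and y_label explicitly inside wrappers follows from how deparse(substitute()) behaves. The sketch below reproduces that behaviour in base R; default_label() and both wrappers are hypothetical stand-ins, not functions from the package.

```r
# Hypothetical stand-in for the default-label logic described for x_label.
default_label <- function(x, x_label = NULL) {
  if (is.null(x_label)) x_label <- deparse(substitute(x))
  x_label
}

traits <- data.frame(species = "Corvus corax")

# Called directly, the label is the name of the data object.
default_label(traits)                      # "traits"

# Called through a wrapper, the label is the wrapper's argument name ...
wrapper_default <- function(df) default_label(df)
wrapper_default(traits)                    # "df"

# ... unless the wrapper forwards an explicit label.
wrapper_labelled <- function(df, label) default_label(df, x_label = label)
wrapper_labelled(traits, "AVONET subset")  # "AVONET subset"
```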

Value

A reconciliation object. The accompanying mapping tibble, match-type counts, provenance metadata, and applied / unused override slots are documented in reconciliation. See "After the call" under Details for the most common next steps.

Details

Names are passed through a four-stage matching cascade, and the first stage that returns a match is recorded in match_type:

  1. exact — verbatim string equality.

  2. normalized — after stripping underscores, authority strings ("Corvus corax Linnaeus, 1758"), diacritics, and case/whitespace differences.

  3. synonym — lookup in a local taxonomic database via taxadb (Catalogue of Life, GBIF, ITIS, NCBI, ...). Skipped if authority = NULL.

  4. fuzzy — character-level similarity (opt-in via fuzzy = TRUE). Uses a genus-weighted Levenshtein score (60% genus, 40% specific epithet) with a genus pre-filter so that only plausibly similar genera are compared.

Names that remain unmatched after all four stages are labelled unresolved. Any entries supplied through overrides take precedence over the cascade.
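To make the genus-weighted score in stage 4 concrete, here is a minimal sketch using utils::adist() from base R. Only the 60/40 weighting comes from the description above. The helper names, and the choice to convert each edit distance to a similarity by dividing by the length of the longer string, are assumptions; the package's own formula may differ, and the genus pre-filter is not reproduced here.

```r
# Assumption: per-part similarity = 1 - Levenshtein distance / longer length.
part_similarity <- function(a, b) {
  1 - drop(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

# Hypothetical scoring helper: 60% genus, 40% specific epithet.
# Expects binomials written as "Genus epithet".
genus_weighted_score <- function(name_a, name_b, w_genus = 0.6) {
  a <- strsplit(name_a, " ", fixed = TRUE)[[1]]
  b <- strsplit(name_b, " ", fixed = TRUE)[[1]]
  w_genus * part_similarity(a[1], b[1]) +
    (1 - w_genus) * part_similarity(a[2], b[2])
}

# One missing letter in the epithet, identical genus
score_epithet_typo <- genus_weighted_score("Corvus brachyrhnchos",
                                           "Corvus brachyrhynchos")

# One missing letter in the genus, identical epithet
score_genus_typo <- genus_weighted_score("Corvs brachyrhynchos",
                                         "Corvus brachyrhynchos")

round(c(epithet_typo = score_epithet_typo, genus_typo = score_genus_typo), 3)
```

Under these assumptions the epithet typo scores about 0.971 and the genus typo about 0.9, so a single-letter slip in the genus costs more than the same slip in the epithet.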

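After the call. A reconciliation object is the input to most other functions in the package. Common next steps include reconcile_summary() for a full breakdown, reconcile_mapping() for the full mapping table, reconcile_review() or reconcile_suggest() to audit flagged matches, and reconcile_merge() to join the two datasets on the reconciled names.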

References

Norman, K.E., Chamberlain, S. & Boettiger, C. (2020) taxadb: A high-performance local taxonomic database interface. Methods in Ecology and Evolution 11:1153–1159. doi:10.1111/2041-210X.13440

Examples

# Merge AVONET morphology with nest-site data. Both datasets use
# slightly different taxonomies; authority = NULL keeps the example
# offline (no taxadb download).
data(avonet_subset)
data(nesttrait_subset)

rec <- reconcile_data(avonet_subset, nesttrait_subset,
                      x_species = "Species1",
                      y_species = "Scientific_name",
                      authority = NULL)
#>  Reconciling 919 names (x) vs 916 names (y)
#>  Matching 919 x 916 names through 2 stages...
#>  Stage 1/2: Exact matching...
#>  Stage 2/2: Normalised matching (916 matched so far)...
#>  Matched 916/919 names from x
rec                      # concise print method
#> 
#> ── Reconciliation: data vs data ────────────────────────────────────────────────
#>   Source x: avonet_subset
#>   Source y: nesttrait_subset
#>   Authority: none
#>   Timestamp: 2026-06-16 10:09:48
#>  Match coverage: [██████████████████████████████] 100% (916/919)
#> 
#> ── Match summary ──
#> 
#>  Exact: 916 (99.7%)
#>  Normalized: 0 ( 0.0%)
#>  Synonym: 0 ( 0.0%)
#>  Fuzzy: 0 ( 0.0%)
#>  Manual: 0 ( 0.0%)
#> ! Unresolved (x only):3 ( 0.3%)
#> ! Unresolved (y only):0
#> ! Flagged for review: 0
#>  Use `reconcile_summary()` for details, `reconcile_mapping()` for the full table.
reconcile_summary(rec)   # full breakdown
#> 
#> === Reconciliation Report ===
#> Type: data_data
#> Timestamp: 2026-06-16 10:09:48
#> Package: prepR4pcm 0.4.0.9000
#> Authority: NONE (version: latest)
#> Rank: species
#> 
#> --- Match Summary ---
#>   Exact:       916 / 919
#>   Normalized:  0 / 919
#>   Synonym:     0 / 919
#>   Fuzzy:       0 / 919
#>   Manual:      0 / 919
#>   Unresolved:  3 (x only) + 0 (y only)
#> 
#> --- Unresolved: In x But Not In y (3) ---
#>   Myzomela irianawidodoae
#>   Myzomela prawiradilagae
#>   Myzomela wahe
#>  

# Join the two datasets on the reconciled species key
merged <- reconcile_merge(rec, avonet_subset, nesttrait_subset,
                          species_col_x = "Species1",
                          species_col_y = "Scientific_name")
#>  Merged 916 species (inner join)
head(merged[, c("species_resolved", "Family1", "Common_name")])
#>           species_resolved      Family1              Common_name
#> 1 Acanthagenys rufogularis Meliphagidae Spiny-cheeked Honeyeater
#> 2       Acanthiza apicalis Acanthizidae         Inland Thornbill
#> 3    Acanthiza chrysorrhoa Acanthizidae  Yellow-rumped Thornbill
#> 4        Acanthiza cinerea Acanthizidae           Grey Thornbill
#> 5        Acanthiza ewingii Acanthizidae      Tasmanian Thornbill
#> 6       Acanthiza inornata Acanthizidae        Western Thornbill