Normalise scientific names to a canonical form

Apply a sequence of deterministic text transformations so that scientific names that differ only in formatting compare equal. This is the same routine used by stage 2 of the matching cascade in reconcile_data() and reconcile_tree(). Use it directly when you want to clean a column of names without running a full reconciliation, for example when building a crosswalk by hand.

Usage

pr_normalize_names(
  names,
  rank = c("species", "subspecies"),
  parser = c("internal", "gnparser")
)

Arguments

names

A character vector of scientific names (any length; each element is a single name). NA values are preserved as NA.

rank

A length-1 character vector. Taxonomic rank to normalise to:

"species" (default): Strip infraspecific epithets so trinomials become binomials (Parus major major -> Parus major).
"subspecies": Keep trinomials intact.

parser

A length-1 character vector. Which parsing engine to use:

"internal" (default): The package's own regex-based cascade, described under Details. No external dependency.
"gnparser": Delegates parsing to rgnparser::gn_parse_tidy(), which wraps the gnparser Go binary (part of the Global Names Architecture). Handles hybrid signs, complex multi-author and year strings, and trailing parentheticals (Open Tree homonym / rank flags) more robustly than the internal cascade. Requires both the rgnparser R package and the gnparser binary on the system PATH; the function raises an informative error if either is missing. Returns the same shape and normalisation_log attribute as the internal path, so the two parsers are interchangeable.

Value

A character vector of normalised names, the same length as names, with an attribute "normalisation_log": a tibble for auditing with one row per input name, giving the original name, the normalised name, and a logical flag for whether the name changed.
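
The log travels with the result as an ordinary R attribute, so it can be read with attr(). The snippet below is a minimal sketch: it builds a stand-in for the return value by hand (a plain data.frame instead of a tibble, with made-up rows that follow the column layout shown under Examples) so that it runs without the package installed.

```r
# Hand-built stand-in for a pr_normalize_names() result (assumption:
# same attribute name and columns as in the Examples output).
res <- structure(
  c("Homo sapiens", "Parus major", "Corvus corax"),
  normalisation_log = data.frame(
    original   = c("Homo_sapiens", "Parus major major", "Corvus corax"),
    normalised = c("Homo sapiens", "Parus major", "Corvus corax"),
    changed    = c(TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE
  )
)

# Pull out the audit log and keep only the names that were altered.
norm_log     <- attr(res, "normalisation_log")
changed_rows <- norm_log[norm_log$changed, ]

# as.vector() drops the attribute when only the bare names are wanted.
clean <- as.vector(res)
```

Filtering on the changed column is a quick way to review what the normalisation actually did before the names are used for joins.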

Details

The transformations, applied in order, are:

Replace underscores and runs of whitespace with a single space (Homo_sapiens -> Homo sapiens).
Strip authority strings and year, including multi-author and parenthetical forms (Corvus corax (Linnaeus, 1758) -> Corvus corax).
Strip any other trailing parenthetical qualifier, such as the Open Tree of Life homonym / rank flags that rotl returns (Prunella (genus in kingdom Archaeplastida) -> Prunella).
Fold diacritics to ASCII: accented characters are replaced by their unaccented equivalents, while names already in ASCII, such as Passer domesticus, are unchanged.
Standardise case: genus capitalised, epithet lowercase.
Strip infraspecific epithets if rank = "species".
Trim whitespace and collapse leftover empty tokens.
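
To make the order of the steps above concrete, here is a simplified base-R approximation. It is a sketch under stated assumptions, not the package's implementation: sketch_normalise() is a hypothetical name, the regular expressions are deliberately crude, authority stripping is reduced to "drop everything from the first capitalised or parenthesised token after the genus and any lower-case epithets", and diacritic folding is left out because base-R transliteration varies by platform.

```r
# Hypothetical helper, for illustration only.
sketch_normalise <- function(names, rank = c("species", "subspecies")) {
  rank <- match.arg(rank)
  out  <- as.character(names)
  ok   <- !is.na(out)          # NA values are passed through untouched
  x    <- out[ok]

  # Step 1: underscores and runs of whitespace become a single space.
  x <- gsub("[_[:space:]]+", " ", x)

  # Steps 2-3 (simplified): keep the first token plus any following
  # lower-case-initial tokens; drop the rest once a token starts with
  # an upper-case letter or "(" (authority, year, trailing qualifier).
  x <- sub("^(\\S+(?:\\s+[a-z]\\S*)*)\\s+[A-Z(].*$", "\\1", x, perl = TRUE)

  # Step 4 (diacritic folding) is omitted in this sketch.

  # Step 5: lower-case everything, then capitalise the first letter.
  x <- tolower(x)
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))

  # Step 6: for rank = "species", keep only the first two tokens.
  if (rank == "species") {
    x <- sub("^(\\S+\\s+\\S+)\\s.*$", "\\1", x, perl = TRUE)
  }

  # Step 7: trim and collapse any leftover spaces.
  x <- trimws(gsub(" +", " ", x))

  out[ok] <- x
  out
}

sketch_normalise(c("Homo_sapiens", "homo  sapiens", "Parus major major",
                   "Corvus corax (Linnaeus, 1758)",
                   "Prunella (genus in kingdom Archaeplastida)", NA))
#> [1] "Homo sapiens" "Homo sapiens" "Parus major"  "Corvus corax" "Prunella"
#> [6] NA
```

The order matters: authorities are removed before case is standardised because the sketch relies on capitalisation to tell an author name from an epithet.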

Note

On the spelling: the title and prose use the British English normalise, consistent with the package's Language: en-GB declaration. The function identifier pr_normalize_names() keeps the American English z because that is the spelling most R users expect in function names. The difference is intentional.

Examples

pr_normalize_names(c("Homo_sapiens",
                     "homo sapiens",
                     "Parus major major",
                     "Corvus corax (Linnaeus, 1758)"))
#> [1] "Homo sapiens" "Homo sapiens" "Parus major"  "Corvus corax"
#> attr(,"normalisation_log")
#> # A tibble: 4 × 3
#>   original                      normalised   changed
#>   <chr>                         <chr>        <lgl>  
#> 1 Homo_sapiens                  Homo sapiens TRUE   
#> 2 homo sapiens                  Homo sapiens TRUE   
#> 3 Parus major major             Parus major  TRUE   
#> 4 Corvus corax (Linnaeus, 1758) Corvus corax TRUE   

# Keep trinomials
pr_normalize_names("Parus major major", rank = "subspecies")
#> [1] "Parus major major"
#> attr(,"normalisation_log")
#> # A tibble: 1 × 3
#>   original          normalised        changed
#>   <chr>             <chr>             <lgl>  
#> 1 Parus major major Parus major major FALSE
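
Because parser = "gnparser" needs both the rgnparser package and the gnparser binary, a script that must run on machines without them may want to check first and fall back to the internal parser. The following is a minimal sketch of such a check; pick_parser() is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: choose "gnparser" only when both requirements
# named under the parser argument are met, otherwise use "internal".
pick_parser <- function() {
  has_pkg <- requireNamespace("rgnparser", quietly = TRUE)
  # Sys.which() returns "" when the executable is not on the PATH.
  has_bin <- unname(nzchar(Sys.which("gnparser")))
  if (has_pkg && has_bin) "gnparser" else "internal"
}

parser <- pick_parser()
```

The chosen value can then be passed on, as in pr_normalize_names(names, parser = parser).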