Apply a sequence of deterministic text transformations so that
scientific names which differ only in formatting compare equal.
This is the same routine used by stage 2 of the matching cascade in
reconcile_data() and reconcile_tree(). Use it directly when you
want to clean a column of names without running a full
reconciliation — for example, when building a crosswalk by hand.
Arguments
- names
A character vector of scientific names (any length; each element is a single name).
NA values are preserved as NA.
- rank
A length-1 character vector. Taxonomic rank to normalise to:
"species" (default): Strip infraspecific epithets so trinomials become binomials (Parus major major -> Parus major).
"subspecies": Keep trinomials intact.
- parser
A length-1 character vector. Which parsing engine to use:
"internal" (default): The package's own regex-based cascade described under Details. No external dependency.
"gnparser": Delegates parsing to rgnparser::gn_parse_tidy(), which wraps the gnparser Go binary (part of the Global Names Architecture). It handles hybrid signs, complex multi-author year strings, and trailing parentheticals (Open Tree homonym / rank flags) more robustly than the internal cascade. It requires both the rgnparser R package and the gnparser binary on the system PATH; the function stops with an informative error if either is missing. It returns the same shape and normalisation_log attribute as the internal path, so the two parsers are drop-in interchangeable.
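Because the "gnparser" option needs both an R package and an external binary, it can be convenient to check for them before choosing a parser. The following is a minimal sketch using only base R; choose_parser() is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): pick a value for `parser`
# depending on what is available on this machine.
choose_parser <- function() {
  has_pkg <- requireNamespace("rgnparser", quietly = TRUE)  # R package installed?
  has_bin <- nzchar(Sys.which("gnparser")[[1]])             # binary on the PATH?
  if (has_pkg && has_bin) "gnparser" else "internal"
}

parser <- choose_parser()
parser
```

The result could then be passed on as pr_normalize_names(names, parser = parser), falling back to the internal cascade when gnparser is not installed.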
Value
A character vector of normalised names, the same length as
names, with an attribute "normalisation_log": a tibble with one row
per input name, recording the original name, the normalised name, and
whether it changed, for auditing.
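To show how the log might be inspected, the sketch below builds a mock return value by hand rather than calling the function. It assumes the log has the columns original, normalised, and changed seen in the Examples, and uses a plain data.frame in place of the tibble so that it runs without any packages.

```r
# Mock return value: a character vector carrying a "normalisation_log"
# attribute. The values here are made up for illustration.
cleaned <- structure(
  c("Homo sapiens", "Parus major", "Corvus corax"),
  normalisation_log = data.frame(
    original   = c("Homo_sapiens", "Parus major major", "Corvus corax"),
    normalised = c("Homo sapiens", "Parus major", "Corvus corax"),
    changed    = c(TRUE, TRUE, FALSE)
  )
)

# Pull the log out and keep only the rows that were actually altered.
norm_log <- attr(cleaned, "normalisation_log")
changed_rows <- norm_log[norm_log$changed, ]

# Subsetting with `[` drops custom attributes, so extract the log
# before slicing the vector.
log_survives <- !is.null(attr(cleaned[1:2], "normalisation_log"))
log_survives
#> [1] FALSE
```

The last line reflects general R behaviour for atomic vectors, not anything specific to this package.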
Details
The transformations, applied in order, are:
1. Replace underscores and multiple whitespace with a single space (Homo_sapiens -> Homo sapiens).
2. Strip authority strings and year, including multi-author and parenthetical forms (Corvus corax (Linnaeus, 1758) -> Corvus corax).
3. Strip any other trailing parenthetical qualifier, such as the Open Tree of Life homonym / rank flags that rotl returns (Prunella (genus in kingdom Archaeplastida) -> Prunella).
4. Fold diacritics to ASCII (Passer domesticus stays as Passer domesticus; accented characters are simplified).
5. Standardise case: genus capitalised, epithet lowercase.
6. Strip infraspecific epithets if rank = "species".
7. Trim whitespace and collapse leftover empty tokens.
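The steps above can be approximated in base R as follows. This is a rough sketch for orientation only, not the package's implementation: sketch_normalise() is a hypothetical name, it handles only parenthetical authorities and qualifiers (unparenthesised authorities such as "Corvus corax Linnaeus, 1758" are left alone), it does not build a log, and the result of the diacritic folding depends on the platform's iconv.

```r
# Hypothetical sketch of the cascade; NOT the package's own code.
sketch_normalise <- function(names, rank = c("species", "subspecies")) {
  rank <- match.arg(rank)
  out <- as.character(names)
  ok <- !is.na(out)          # NA values pass through untouched
  x <- out[ok]

  # 1. Underscores and runs of whitespace become a single space.
  #    (Trimmed here too, a choice of this sketch, so step 5 sees the genus first.)
  x <- trimws(gsub("[_[:space:]]+", " ", x))

  # 2-3. Drop trailing parentheticals: authorities such as "(Linnaeus, 1758)"
  #      and qualifiers such as "(genus in kingdom Archaeplastida)".
  x <- gsub("([[:space:]]*\\([^()]*\\))+[[:space:]]*$", "", x)

  # 4. Fold diacritics to ASCII. Transliteration is platform dependent.
  x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")

  # 5. Lowercase everything, then capitalise the first letter (the genus).
  x <- tolower(x)
  x <- paste0(toupper(substr(x, 1, 1)), substring(x, 2))

  # 6. For rank = "species", keep only the first two tokens.
  if (rank == "species") {
    x <- sub("^([^ ]+ [^ ]+) .*$", "\\1", x)
  }

  # 7. Collapse leftover spaces and trim.
  x <- trimws(gsub(" +", " ", x))

  out[ok] <- x
  out
}

sketch_normalise(c("Homo_sapiens", "homo sapiens", "Parus major major",
                   "Corvus corax (Linnaeus, 1758)", NA))
#> [1] "Homo sapiens" "Homo sapiens" "Parus major"  "Corvus corax" NA
```

Running the case step after the authority step mirrors the documented order; a real implementation would need more careful rules for authorities, hybrids, and abbreviations than this sketch attempts.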
Note
On the spelling: the title and prose use British English
normalise, consistent with the package's
Language: en-GB declaration. The function identifier
pr_normalize_names() keeps the American-English z because
R-package function names conventionally use ASCII identifiers
in the form most R users expect. The two spellings are
equivalent and intentional.
See also
reconcile_data() and reconcile_tree() for the full
four-stage matching cascade; pr_extract_tips() for pulling tip
labels out of a tree prior to normalising them.
Other name utilities:
pr_extract_tips()
Examples
pr_normalize_names(c("Homo_sapiens",
                     "homo sapiens",
                     "Parus major major",
                     "Corvus corax (Linnaeus, 1758)"))
#> [1] "Homo sapiens" "Homo sapiens" "Parus major"  "Corvus corax"
#> attr(,"normalisation_log")
#> # A tibble: 4 × 3
#>   original                      normalised   changed
#>   <chr>                         <chr>        <lgl>
#> 1 Homo_sapiens                  Homo sapiens TRUE
#> 2 homo sapiens                  Homo sapiens TRUE
#> 3 Parus major major             Parus major  TRUE
#> 4 Corvus corax (Linnaeus, 1758) Corvus corax TRUE
# Keep trinomials
pr_normalize_names("Parus major major", rank = "subspecies")
#> [1] "Parus major major"
#> attr(,"normalisation_log")
#> # A tibble: 1 × 3
#>   original          normalised        changed
#>   <chr>             <chr>             <lgl>
#> 1 Parus major major Parus major major FALSE