vignettes/taxonomic_ranks.Rmd
Taxonomic ranks are a “whole thing”, as I like to say when something is complicated or messy. A taxonomic rank is the relative placement of a taxon in a taxonomic hierarchy. If there were only one taxonomic hierarchy with one set of rank names, life would be so much easier for all of us. Unfortunately, this is not the case. Nearly every source of taxonomic information uses a slightly different set of rank names, even if many names are shared across sources. This article highlights some of the sticky messes in taxonomic ranks.
NCBI taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy) is an outlier among the taxonomic data sources. It uses a lot of rank names that hold no information about the placement of a taxon within a taxonomic hierarchy. The most common of these is “no rank”. If you’ve retrieved taxonomic information from NCBI you may have noticed this supposed rank name. When you get data from NCBI in taxize you will almost certainly run into this rank; just know that it doesn’t tell you where the taxon falls within a taxonomic hierarchy. You can, however, use classification() to get the full lineage, which will hopefully include other rank names that do hold actual information.
Another rank name that holds no information, and which has become more common with NCBI, is “clade”. It has started to replace “no rank” in at least some cases. As far as we can tell, this rank name also holds no information about its placement within a taxonomic hierarchy.
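As a rough sketch of how to set these labels aside, the snippet below drops rows whose rank is “no rank” or “clade” from a data frame shaped like one element of a classification() result (columns name, rank, id). The lineage here is a hand-written placeholder, not a real NCBI record, and informative_ranks() is a hypothetical helper rather than a taxize function.

```r
# Rank labels that the text above describes as carrying no positional information.
uninformative <- c("no rank", "clade")

# Hand-written stand-in for one element of a classification() result.
# The taxon names and ids are placeholders, not real NCBI records.
lineage <- data.frame(
  name = c("Taxon A", "Taxon B", "Taxon C", "Taxon D", "Taxon E"),
  rank = c("no rank", "kingdom", "clade", "phylum", "species"),
  id   = c("1", "2", "3", "4", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose rank says something about placement.
informative_ranks <- function(df, drop = uninformative) {
  df[!(tolower(df$rank) %in% drop), , drop = FALSE]
}

informative_ranks(lineage)$rank
#> [1] "kingdom" "phylum"  "species"
```

The same filter could be applied to each element of a real classification() result, assuming the element is a data frame with a rank column.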
downstream() and related functions in taxize require a rank to be given to the downto parameter, which says you want taxonomic names down to a particular rank.
Some ranks are not useful for the downto parameter. They include “no rank”, “unspecified”, “unranked”, and “clade”; there are possibly more. They can’t be used because they have no rank placement in a hierarchy: they can be anywhere in the hierarchy.
As a workaround, try looking at the taxonomic hierarchy for your taxon of interest using classification() or on the data provider’s website, and see if there’s a taxonomic rank below the rank you want. Use that lower rank as downto in combination with setting intermediate=TRUE in your downstream() function call. From the intermediate data you can then pull out the data you need.
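A minimal sketch of that workaround follows. The wrapper fetch_with_intermediate() and its argument names are hypothetical; only downstream() and its downto and intermediate arguments come from the text above. The taxon (“Ursidae”) and database (“gbif”) are arbitrary example choices, and the call needs network access, so it is guarded and no output is shown.

```r
# Hypothetical wrapper around taxize::downstream(). It asks for names down to a
# rank *below* the one you actually want and keeps the intermediate levels.
fetch_with_intermediate <- function(taxon, db, below_rank) {
  if (!requireNamespace("taxize", quietly = TRUE)) {
    stop("The taxize package is required for this sketch.")
  }
  taxize::downstream(taxon, db = db, downto = below_rank, intermediate = TRUE)
}

# Needs network access and may prompt for a taxon match, so it only runs in an
# interactive session. The taxon and database are arbitrary example choices.
if (interactive()) {
  res <- fetch_with_intermediate("Ursidae", db = "gbif", below_rank = "species")
  # Inspect the structure to find the intermediate levels you are after.
  str(res, max.level = 3)
}
```

Column names in the returned data frames differ between data sources, so it is safer to inspect the result with str() before writing code that picks out a particular level.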
taxize maintains a taxonomic hierarchy reference within the package. There are two:
rank_ref: data.frame with two columns and around 45 rows (may increase or decrease through time as we incorporate or take away some ranks). See ?rank_ref for more information.
rank_ref_zoo: same as above but more specifically for zoological rank systems; as of this writing only used for WoRMS; rank_ref is used for all other data sources. See ?rank_ref_zoo for more information.
head(rank_ref)
#> rankid ranks
#> 1 01 domain
#> 2 05 superkingdom
#> 3 10 kingdom
#> 4 20 subkingdom
#> 5 25 infrakingdom,superphylum
#> 6 30 phylum,division
We reference this rank information in classification(), tax_name(), and all of the exported downstream functions. We use the numeric rankid column to be able to filter data using comparison operators such as < and >.
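The comparison described above can be sketched without loading taxize. The small table below copies only the six rows printed by head(rank_ref); rank_to_id() and is_below() are hypothetical helpers, not taxize functions. Because the printed rankid values carry leading zeros, the sketch assumes they are stored as strings and converts them with as.numeric() before comparing.

```r
# Local copy of the six rows shown by head(rank_ref); not the full dataset.
ranks_tbl <- data.frame(
  rankid = c("01", "05", "10", "20", "25", "30"),
  ranks  = c("domain", "superkingdom", "kingdom", "subkingdom",
             "infrakingdom,superphylum", "phylum,division"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a rank name to its numeric id, or NA if the name is
# not in the table. Some rows hold several comma-separated names.
rank_to_id <- function(rank, ref = ranks_tbl) {
  hit <- vapply(
    strsplit(ref$ranks, ",", fixed = TRUE),
    function(x) tolower(rank) %in% x,
    logical(1)
  )
  if (!any(hit)) return(NA_real_)
  as.numeric(ref$rankid[hit][1])
}

# Hypothetical helper: TRUE when rank `a` sits lower in the hierarchy than `b`.
is_below <- function(a, b) rank_to_id(a) > rank_to_id(b)

rank_to_id("division")
#> [1] 30
is_below("phylum", "kingdom")
#> [1] TRUE
rank_to_id("clade")
#> [1] NA
```

A name like “clade” has no id in this table, so any comparison involving it yields NA, which mirrors why such names cannot be placed relative to other ranks.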
We often reference rank information inside the helper functions which_rank() and prune_too_low(), utility functions to help filter results from data providers by rank information.
We manage these two datasets using the script at inst/ignore/rank_ref_script.R in the package, which ends with updating each dataset in the data/ folder of the package.
Let us know by opening an issue (https://github.com/ropensci/taxize/issues) if you think you have run into any problems with rank information. The most common issues we see with ranks are with the downstream() function and the data source specific functions that support it.