# A global gazetteer of biodiversity institutions

## Background

Most of the geographic species occurrence records publicly available from aggregated databases such as the Global Biodiversity Information Facility (GBIF) are based either on collected specimens stored in a museum, university, botanical garden, herbarium or zoo, or on human observations, e.g. vegetation surveys or citizen science projects. A relatively common error in the geographic information of these records is that the coordinates refer to the physical location of the institution hosting the specimen. Reasons for these errors include, among others, individuals escaped from horticulture, specimens erroneously geo-referenced to their physical storage location, and records based on pictures taken by laypeople in zoos or botanical gardens. These records are problematic because the conditions at these locations do not represent the species’ natural habitat and might in fact differ considerably from it.

To identify these records, CoordinateCleaner includes a novel geo-referenced global database of biodiversity institutions, defined here as institutions that are generally concerned with biodiversity research and/or host collections of living or mounted biological specimens. We implement a cleaning check using this database as a gazetteer in the `cc_inst` function and the `institutions` argument of the `clean_coordinates` function of the CoordinateCleaner R package. Furthermore, we hope that this database can prove useful beyond cleaning geographic records, for instance to assess sampling biases in biological collections.
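As a minimal sketch of how such a check might be called, the example below tests two made-up records against the gazetteer. The records, their column names and the 500 m buffer are assumptions for illustration only. The call assumes a recent CoordinateCleaner version in which `cc_inst` accepts the `lon`, `lat`, `buffer`, `value` and `verbose` arguments and interprets `buffer` in meters.

```r
# Assumes CoordinateCleaner is installed, e.g. install.packages("CoordinateCleaner")
library(CoordinateCleaner)

# Hypothetical occurrence records, made up for illustration.
# The first point lies roughly at the Royal Botanic Gardens, Kew;
# the second is an arbitrary location.
records <- data.frame(
  species = c("Quercus robur", "Quercus robur"),
  decimalLongitude = c(-0.2956, 10.0000),
  decimalLatitude = c(51.4787, 50.0000),
  stringsAsFactors = FALSE
)

# Test each record against the institutions gazetteer.
# With value = "flagged" the function returns a logical vector in which
# TRUE means the record passed and FALSE means it was flagged.
passed <- cc_inst(
  records,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  buffer = 500,
  value = "flagged",
  verbose = FALSE
)

# Keep only the records that passed the check
records_clean <- records[passed, ]
```

Which records are flagged depends on the gazetteer version and the chosen buffer, so the result should be inspected rather than assumed.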

## Data compilation

We compiled names of biodiversity institutions from six different sources (Figure 1) (BGCI 2017; Index Herbariorum 2017; The Global Registry of Biodiversity Repositories 2017; Wikipedia 2017; Global Biodiversity Information Facility 2017; GeoNames 2017) and geo-referenced them using the Google Maps API via the ggmap package in R (Kahle and Wickham 2013), querying first by institution name and, if this yielded no results, by the institution’s address. For those records that still did not yield any results, we used OpenCage via the opencage R package (Salmon 2017). We manually geo-referenced those institutions that could not be geo-referenced automatically (c. 50%) using the WWW and Google Earth (Google Inc 2017). In total, the database comprises almost 9700 geo-referenced institutions (and another 2500 entries for which geo-referencing was not possible, due either to problems with non-English names or to geographic ambiguities).

The spatial extent of the database is global (Figure 2), but we acknowledge that there is a focus on English-speaking countries and countries using the Roman alphabet (see Figure 3 and Figure 4). This is partly a bias due to the data compilation process. We hope that this bias can be overcome by future contributions to the database from researchers in non-English-speaking and non-Roman-alphabet countries. In general, we acknowledge that the database may not be complete and have created a web form (http://biodiversity-institutions.surge.sh/) where researchers can easily submit their institution or comment on an existing institution. The webpage also includes an overview of the institutions included in the dataset.
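The fallback order of the geo-referencing steps can be sketched as follows. This is not the code used to build the database: `geocode_primary` and `geocode_secondary` are hypothetical stand-ins for the two geocoding services, backed here by small made-up lookup tables so that the example runs without API keys. Whether the second service was queried by name, address or both is also an assumption.

```r
# Hypothetical stand-ins for real geocoding calls. Each takes a query string
# and returns c(lon = ..., lat = ...), or NULL when nothing is found.
geocode_primary <- function(query) {
  known <- list("Example Herbarium" = c(lon = 11.97, lat = 57.68))
  known[[query]]
}
geocode_secondary <- function(query) {
  known <- list("1 Garden Road, Exampletown" = c(lon = 18.06, lat = 59.33))
  known[[query]]
}

# Try each service and query in turn; stop at the first hit.
geocode_institution <- function(name, address = NA_character_) {
  attempts <- list(
    list(fun = geocode_primary,   query = name,    label = "primary:name"),
    list(fun = geocode_primary,   query = address, label = "primary:address"),
    list(fun = geocode_secondary, query = name,    label = "secondary:name"),
    list(fun = geocode_secondary, query = address, label = "secondary:address")
  )
  for (a in attempts) {
    if (is.na(a$query)) next          # skip missing names or addresses
    hit <- a$fun(a$query)
    if (!is.null(hit)) {
      return(data.frame(name = name, lon = hit[["lon"]], lat = hit[["lat"]],
                        method = a$label, stringsAsFactors = FALSE))
    }
  }
  # Nothing found automatically: leave coordinates empty for manual work
  data.frame(name = name, lon = NA_real_, lat = NA_real_,
             method = "manual", stringsAsFactors = FALSE)
}

# Made-up input table
inst <- data.frame(
  name = c("Example Herbarium", "Example Garden", "Unknown Museum"),
  address = c(NA, "1 Garden Road, Exampletown", NA),
  stringsAsFactors = FALSE
)

result <- do.call(rbind, Map(geocode_institution, inst$name, inst$address))
print(result)
```

Recording the method for each row makes it easy to see afterwards which entries still need manual geo-referencing.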

## Data structure

In addition to the name and geographic coordinates of each institution, the database includes information on the type of institution (“type”, e.g. “herbarium” or “university”, see Figure 3), the source from which we obtained the name of the institution (“source”), the precision of the coordinates (“geocoding.precision.m” and “geocoding.issue”) as well as the city and address (when available, “city” and “address”). The quality of the metadata might vary among sources. Furthermore, the database includes a column identifying whether the respective institution is located within a protected area (UNEP-WCMC and IUCN 2017) and, if this is the case, the World Database on Protected Areas (WDPA) ID of the respective protected area (shape file available at: https://www.protectedplanet.net/). We include this flag because biodiversity institutions within protected areas might or might not be relevant for coordinate cleaning, depending on downstream analyses.
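To illustrate how these columns might be used, the sketch below builds a small made-up table with some of the fields described above and drops entries that lie within a protected area or have coarse coordinates. All values are invented. The column name `protected` and the 1000 m threshold are assumptions, and the actual column names should be checked with `names()` on the shipped dataset.

```r
# Made-up gazetteer excerpt using column names described in the text;
# "protected" is a hypothetical name for the protected-area flag.
gaz <- data.frame(
  name = c("Example Herbarium", "Example Field Station", "Example Zoo"),
  type = c("herbarium", "university", "zoo"),
  source = c("source A", "source B", "source C"),
  geocoding.precision.m = c(100, 100, 10000),
  protected = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep institutions outside protected areas whose coordinates are
# precise to 1000 m or better (threshold chosen for illustration).
keep <- !gaz$protected & gaz$geocoding.precision.m <= 1000
gaz_ref <- gaz[keep, ]

print(gaz_ref$name)
```

Whether institutions inside protected areas should be dropped is a judgment that depends on the downstream analysis, as noted above.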

## Data accessibility

The database is open-source and available as an R data file (.rda) as part of the CoordinateCleaner package, either from CRAN or from GitHub, under a CC-BY license. We acknowledge that this database is not complete and can constantly be improved; any feedback can be provided via the GitHub page of CoordinateCleaner (https://github.com/ropensci/CoordinateCleaner/).
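Once the package is installed, the gazetteer can be loaded and inspected roughly as follows. This sketch assumes the dataset is exported under the name `institutions` and contains the `type` column described above.

```r
# Assumes CoordinateCleaner is installed, e.g. install.packages("CoordinateCleaner")
library(CoordinateCleaner)

# Load the gazetteer shipped with the package
data("institutions")

# Inspect its size and column names
dim(institutions)
names(institutions)

# Count entries per institution type, including missing values
table(institutions$type, useNA = "ifany")
```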

## References

BGCI. 2017. “Botanic Gardens Conservation International.” https://www.bgci.org/.

GeoNames. 2017. “GeoNames.” www.geonames.org.

Global Biodiversity Information Facility. 2017. “List of data publishers.” www.gbif.org/publisher/search.