Cleans geographic coordinates by running multiple empirical tests that flag potentially erroneous coordinates, addressing issues common in biological collection databases.

clean_coordinates(x, lon = "decimallongitude", lat = "decimallatitude",
species = "species", countries = NULL, tests = c("capitals",
"centroids", "equal", "gbif", "institutions", "outliers", "seas",
"zeros"), capitals_rad = 10000, centroids_rad = 1000,
centroids_detail = "both", inst_rad = 100,
outliers_method = "quantile", outliers_mtp = 5, outliers_td = 1000,
outliers_size = 7, range_rad = 0, zeros_rad = 0.5,
capitals_ref = NULL, centroids_ref = NULL, country_ref = NULL,
inst_ref = NULL, range_ref = NULL, seas_ref = NULL,
seas_scale = 50, urban_ref = NULL, value = "spatialvalid",
verbose = TRUE, report = FALSE)

Arguments

x data.frame. Containing geographical coordinates and species names.

lon character string. The column with the longitude coordinates. Default = “decimallongitude”.

lat character string. The column with the latitude coordinates. Default = “decimallatitude”.

species a character string. A vector of the same length as rows in x, with the species identity for each record. If missing, the outliers test is skipped.

countries a character string. The column with the country assignment of each record in three letter ISO code. Default = “countrycode”. If missing, the countries test is skipped.

tests a vector of character strings, indicating which tests to run. See details for all tests available. Default = c("capitals", "centroids", "equal", "gbif", "institutions", "outliers", "seas", "zeros").

capitals_rad numeric. The radius around capital coordinates in meters. Default = 10000.

centroids_rad numeric. The radius around centroid coordinates in meters. Default = 1000.

centroids_detail a character string. If set to ‘country’ only country (adm-0) centroids are tested, if set to ‘provinces’ only province (adm-1) centroids are tested. Default = ‘both’.

inst_rad numeric. The radius around biodiversity institutions coordinates in meters. Default = 100.

outliers_method The method used for outlier testing. See details.

outliers_mtp numeric. The multiplier for the interquartile range of the outlier test. If NULL, outliers_td is used. Default = 5.

outliers_td numeric. The minimum distance of a record to all other records of a species to be identified as outlier, in km. Default = 1000.

outliers_size numeric. The minimum number of records in a dataset to run the taxon-specific outlier test. Default = 7.

range_rad buffer around natural ranges. Default = 0.

zeros_rad numeric. The radius around 0/0 in degrees. Default = 0.5.

capitals_ref a data.frame with alternative reference data for the country capitals test. If missing, the countryref dataset is used. Alternatives must be identical in structure.

centroids_ref a data.frame with alternative reference data for the centroid test. If NULL, the countryref dataset is used. Alternatives must be identical in structure.

country_ref a SpatialPolygonsDataFrame as alternative reference for the countries test. If NULL, the rnaturalearth::ne_countries('medium') dataset is used.

inst_ref a data.frame with alternative reference data for the biodiversity institution test. If NULL, the institutions dataset is used. Alternatives must be identical in structure.

range_ref a SpatialPolygonsDataFrame of species natural ranges. Required to include the 'ranges' test. See cc_iucn for details.

seas_ref a SpatialPolygonsDataFrame as alternative reference for the seas test. If NULL, the rnaturalearth::ne_download(scale = 110, type = 'land', category = 'physical') dataset is used.

seas_scale The scale of the default landmass reference. Must be one of 10, 50, 110. Lower numbers equal higher detail. Default = 50.

urban_ref a SpatialPolygonsDataFrame as alternative reference for the urban test. If NULL, the test is skipped. See details for reference gazetteers.

value a character string defining the output value, one of ‘spatialvalid’, ‘flagged’, ‘clean’. See the value section for details. Default = ‘spatialvalid’.

verbose logical. If TRUE reports the name of the test and the number of records flagged.

report logical or character. If TRUE a report file is written to the working directory, summarizing the cleaning results. If a character, the path to which the file should be written. Default = FALSE.

Value

Depending on the value argument:

“spatialvalid”

an object of class spatialvalid similar to x with one column added for each test. TRUE = clean coordinate entry, FALSE = potentially problematic coordinate entries. The .summary column is FALSE if any test flagged the respective coordinate.

“flagged”

a logical vector with the same order as the input data, summarizing the results of all tests. TRUE = clean coordinate, FALSE = potentially problematic (= at least one test failed).

“clean”

a data.frame similar to x with potentially problematic records removed
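
How the three outputs relate to each other can be sketched in base R. This is an illustration only, not the package's implementation: the flags are computed by hand on a toy data.frame, and the column name .zer and all object names ending in _like are hypothetical.

```r
# Toy input using the default column names.
x <- data.frame(
  species = c("a", "b", "c"),
  decimallongitude = c(45, 0, 47),
  decimallatitude = c(-19, 0, -47)
)

# Hand-made per-test flags (TRUE = clean, FALSE = potentially problematic).
# .equ mimics the "equal" test; .zer is a hypothetical plain-zeros check.
.equ <- abs(x$decimallongitude) != abs(x$decimallatitude)
.zer <- !(x$decimallongitude == 0 & x$decimallatitude == 0)

# "spatialvalid"-like: x plus one column per test and a .summary column
# that is FALSE as soon as any test flagged the record.
spatialvalid_like <- cbind(x, .equ = .equ, .zer = .zer, .summary = .equ & .zer)

# "flagged"-like: only the summary vector, in the order of the input.
flagged_like <- spatialvalid_like$.summary

# "clean"-like: x with the potentially problematic records removed.
clean_like <- x[flagged_like, ]
```

In this toy case only the first record passes both checks, so clean_like keeps a single row.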

Details

The function needs all coordinates to be formally valid according to WGS84. If the data contains invalid coordinates, the function will stop and return a vector flagging the invalid records. TRUE = non-problematic coordinate, FALSE = potentially problematic coordinates.

• capitals tests a radius around adm-0 capitals. The radius is capitals_rad.

• centroids tests a radius around country centroids. The radius is centroids_rad.

• countries tests if coordinates are from the country indicated in the country column. Switched off by default.

• duplicates tests for duplicate records. This checks for identical coordinates or, if a species vector is provided, for identical coordinates within a species. All but the first record are flagged as duplicates. Switched off by default.

• equal tests for equal absolute longitude and latitude.

• gbif tests a one-degree radius around the GBIF headquarters in Copenhagen, Denmark.

• institutions tests a radius around known biodiversity institutions from institutions. The radius is inst_rad.

• outliers tests each species for outlier records. Depending on the outliers_mtp and outliers_td arguments, it either flags records that are a minimum distance away from all other records of this species (outliers_td) or records that are outside a multiple of the interquartile range of minimum distances to the next neighbour of this species (outliers_mtp). Three methods are available, selected with outliers_method. If “quantile”, a boxplot method is used: a record is flagged as an outlier if its mean distance to all other records of the same species is larger than outliers_mtp * the interquartile range of the mean distances of all records of this species. If “mad”, the median absolute deviation is used: a record is flagged as an outlier if its mean distance to all other records of the same species is larger than the median of the mean distances of all records plus/minus the mad of these mean distances * outliers_mtp. If “distance”, a record is flagged as an outlier if the minimum distance to the next record of the species is > outliers_td.

• ranges tests if records fall within provided natural range polygons on a per species basis. See cc_iucn for details.

• seas tests if coordinates fall into the ocean.

• urban tests if coordinates are from urban areas. Switched off by default.

• validity checks if coordinates correspond to a lat/lon coordinate reference system. This test is always on, since all records need to pass for any other test to run.

• zeros tests for plain zeros, equal latitude and longitude and a radius around the point 0/0. The radius is zeros_rad.
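
The boxplot, mad and distance rules described for the outliers test can be sketched in base R for a single species. This is a rough illustration under stated assumptions, not the package's code: haversine_km and flag_outliers are hypothetical helpers, distances assume a spherical Earth of radius 6371 km, and the boxplot threshold is assumed to be the upper quartile plus the multiplier times the interquartile range (the usual boxplot convention), with the mad threshold taken on the "plus" side only.

```r
# Pairwise great-circle distances in km (haversine formula, spherical Earth).
haversine_km <- function(lon, lat, radius_km = 6371) {
  phi <- lat * pi / 180
  lambda <- lon * pi / 180
  dphi <- outer(phi, phi, "-")
  dlambda <- outer(lambda, lambda, "-")
  a <- sin(dphi / 2)^2 + outer(cos(phi), cos(phi)) * sin(dlambda / 2)^2
  2 * radius_km * asin(pmin(sqrt(a), 1))
}

# Returns TRUE for clean records and FALSE for flagged ones.
# mtp plays the role of outliers_mtp, td the role of outliers_td (km).
flag_outliers <- function(lon, lat,
                          method = c("quantile", "mad", "distance"),
                          mtp = 5, td = 1000) {
  method <- match.arg(method)
  d <- haversine_km(lon, lat)
  diag(d) <- NA  # ignore the distance of each record to itself
  if (method == "distance") {
    # Flag records whose nearest neighbour is further away than td.
    nearest <- apply(d, 1, min, na.rm = TRUE)
    is_outlier <- nearest > td
  } else {
    mean_d <- rowMeans(d, na.rm = TRUE)
    if (method == "quantile") {
      # Assumed boxplot rule: upper quartile + mtp * IQR.
      limit <- unname(stats::quantile(mean_d, 0.75)) + mtp * stats::IQR(mean_d)
    } else {
      # Assumed mad rule: median + mtp * median absolute deviation.
      limit <- stats::median(mean_d) + mtp * stats::mad(mean_d)
    }
    is_outlier <- mean_d > limit
  }
  !is_outlier
}

# Eight records around Madagascar and one in central Europe.
lon <- c(45.0, 45.5, 46.0, 46.5, 47.0, 45.2, 46.3, 46.8, 10.0)
lat <- c(-19.0, -18.5, -19.5, -18.0, -19.2, -20.0, -18.8, -19.9, 50.0)

flag_outliers(lon, lat, method = "distance", td = 1000)
flag_outliers(lon, lat, method = "quantile", mtp = 5)
flag_outliers(lon, lat, method = "mad", mtp = 5)
```

With these coordinates the European record lies several thousand kilometres from every other record, so all three rules flag it.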

Note

Always tests for coordinate validity: non-numeric or missing coordinates and coordinates exceeding the global extent (lon/lat, WGS84). See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
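
A validity check of this kind can be approximated in base R as follows. This is a minimal sketch, not the package's implementation; the helper name is_valid_coord is hypothetical.

```r
# TRUE if a coordinate pair is numeric, non-missing and within the
# global lon/lat extent; FALSE otherwise.
is_valid_coord <- function(lon, lat) {
  # Coerce via character so that factors and strings are handled alike;
  # non-numeric entries become NA.
  lon <- suppressWarnings(as.numeric(as.character(lon)))
  lat <- suppressWarnings(as.numeric(as.character(lat)))
  !is.na(lon) & !is.na(lat) & abs(lon) <= 180 & abs(lat) <= 90
}

# One valid record, one longitude out of range, one missing, one non-numeric.
is_valid_coord(lon = c("45", "200", NA, "abc"), lat = c(-19, 10, 5, 0))
```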

See also

Other Wrapper functions: clean_dataset, clean_fossils

Examples



exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
                    decimallongitude = runif(250, min = 42, max = 51),
                    decimallatitude = runif(250, min = -26, max = -11))

test <- clean_coordinates(x = exmpl,
                          tests = c("equal"))
#> Testing coordinate validity
#> Flagged 0 records.
#> Testing equal lat/lon
#> Flagged 0 records.
#> Flagged 0 of 250 records, EQ = 0.
# NOT RUN {
#run more tests
test <- clean_coordinates(x = exmpl,
tests = c("capitals",
"centroids","equal",
"gbif", "institutions",
"outliers", "seas",
"zeros"))
# }

summary(test)
#>     .val     .equ .summary
#>        0        0        0