Removes or flags records that are outliers in geographic space according
to the method defined via the method argument. Geographic outliers
often represent erroneous coordinates, for example due to data entry errors,
imprecise geo-references, or individuals in horticulture/captivity.
cc_outl(
x,
lon = "decimalLongitude",
lat = "decimalLatitude",
species = "species",
method = "quantile",
mltpl = 5,
tdi = 1000,
value = "clean",
sampling_thresh = 0,
verbose = TRUE,
min_occs = 7,
thinning = FALSE,
thinning_res = 0.5
)
x: data.frame. Containing geographical coordinates and species names.
lon: character string. The column with the longitude coordinates. Default = “decimalLongitude”.
lat: character string. The column with the latitude coordinates. Default = “decimalLatitude”.
species: character string. The column with the species name. Default = “species”.
method: character string. Defining the method for outlier selection. See details. One of “distance”, “quantile”, “mad”. Default = “quantile”.
mltpl: numeric. The multiplier of the interquartile range (method == 'quantile') or median absolute deviation (method == 'mad') to identify outliers. See details. Default = 5.
tdi: numeric. The minimum absolute distance (method == 'distance') of a record to all other records of a species to be identified as outlier, in km. See details. Default = 1000.
value: character string. Defining the output value. See value.
sampling_thresh: numeric. Cut-off threshold for the sampling correction. Indicates the quantile of sampling in which outliers should be ignored. For instance, if sampling_thresh == 0.25, records in the 25% quantile of sampling are not flagged as outliers. Default = 0 (no sampling correction).
verbose: logical. If TRUE, reports the name of the test and the number of records flagged.
min_occs: Minimum number of geographically unique datapoints needed for a species to be tested. This is necessary for reliable outlier estimation. Species with fewer than min_occs records will not be tested and the output value will be 'TRUE'. Default = 7. If method == 'distance', consider a lower threshold.
thinning: logical. If TRUE, forces a raster approximation for the distance calculation. This is routinely used for species with more than 10,000 records for computational reasons, but can be enforced for smaller datasets, which is recommended when sampling is very uneven. Default = FALSE.
thinning_res: The resolution for the spatial thinning in decimal degrees. Default = 0.5.
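The min_occs threshold described above counts geographically unique records, not rows. The following is a minimal base-R sketch of that idea, assuming "geographically unique" means a distinct longitude/latitude pair within a species; the toy data and the names uniq, n_unique, tested and skipped are illustrative only and are not part of the package.

```r
# Toy data: species "a" has 8 distinct locations; species "b" has 3 rows
# but only 2 distinct locations (one coordinate pair is duplicated).
x <- data.frame(
  species = c(rep("a", 8), rep("b", 3)),
  decimalLongitude = c(1:8, 10, 10, 12),
  decimalLatitude = c(1:8, 50, 50, 52)
)

min_occs <- 7  # same default as in the usage above

# Drop duplicated species/coordinate combinations, then count per species.
uniq <- unique(x[, c("species", "decimalLongitude", "decimalLatitude")])
n_unique <- table(uniq$species)

# Species reaching the threshold would be tested; the others would be skipped.
tested <- names(n_unique)[n_unique >= min_occs]
skipped <- names(n_unique)[n_unique < min_occs]
```

A check like this, run before cleaning, may help to see which species are too sparse to be tested at the chosen min_occs.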
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
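The two return forms carry the same information: subsetting the input with the logical vector keeps the records that passed. A small sketch of how a flag vector might be used, with a hand-written vector (flags) standing in for the result of cc_outl(x, value = "flagged"); the data and the flag values are made up for illustration.

```r
x <- data.frame(
  species = rep("a", 4),
  decimalLongitude = c(10.1, 10.4, 10.9, 150.0),
  decimalLatitude = c(50.2, 50.5, 50.1, -40.0)
)

# Hypothetical flag vector (TRUE = test passed, FALSE = potentially problematic).
flags <- c(TRUE, TRUE, TRUE, FALSE)

x_clean <- x[flags, ]     # records that passed the test
x_suspect <- x[!flags, ]  # records to inspect before discarding them
```

Keeping the flagged rows in a separate object makes it possible to review them rather than dropping them silently.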
The method for outlier identification depends on the method argument.
If “quantile”: a boxplot method is used and records are flagged as
outliers if their mean distance to all other records of the same
species is larger than mltpl * the interquartile range of the mean distance
of all records of this species.
If “mad”: the median absolute deviation is used. In this case a record is
flagged as an outlier if its mean distance to all other records of the same
species is larger than the median of the mean distance of all points
plus/minus the mad of the mean distances of all records of the species * mltpl.
If “distance”: records are flagged as outliers if the minimum distance to
the next record of the species is > tdi.
For species with records from > 10000 unique locations a random sample of
1000 records is used for the distance matrix calculation. The test skips
species with fewer than min_occs geographically unique records.
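The three rules above can be sketched for a single species in base R. This is an illustration under stated assumptions, not the package's implementation: distances are haversine great-circle distances on a sphere of radius 6371 km, the “quantile” fence is taken as the upper quartile plus mltpl * IQR (the usual boxplot rule), and the “mad” fence as the median plus mltpl * MAD. The helpers haversine_km and flag_outliers_sketch are hypothetical names.

```r
# Pairwise great-circle (haversine) distances in km, base R only.
haversine_km <- function(lon, lat, radius = 6371) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  dlon <- outer(lon, lon, "-")
  dlat <- outer(lat, lat, "-")
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  a[a > 1] <- 1  # guard against rounding slightly above 1
  2 * radius * asin(sqrt(a))
}

# Returns TRUE for records that pass and FALSE for suspected outliers.
flag_outliers_sketch <- function(lon, lat, method = "quantile",
                                 mltpl = 5, tdi = 1000) {
  d <- haversine_km(lon, lat)
  diag(d) <- NA  # ignore each record's distance to itself
  if (method == "distance") {
    # Distance to the nearest other record, compared against tdi (km).
    nearest <- apply(d, 1, min, na.rm = TRUE)
    outlier <- nearest > tdi
  } else {
    mean_d <- rowMeans(d, na.rm = TRUE)
    if (method == "quantile") {
      # Assumption: fence = upper quartile + mltpl * IQR.
      limit <- unname(stats::quantile(mean_d, 0.75)) + mltpl * stats::IQR(mean_d)
    } else if (method == "mad") {
      # Assumption: fence = median + mltpl * MAD.
      limit <- stats::median(mean_d) + mltpl * stats::mad(mean_d)
    } else {
      stop("method must be one of 'distance', 'quantile', 'mad'")
    }
    outlier <- mean_d > limit
  }
  !outlier
}

# Nine records clustered in central Europe plus one record far away.
lon <- c(10 + (0:8) * 0.1, 150)
lat <- c(50 + (0:8) * 0.1, -40)
res_quantile <- flag_outliers_sketch(lon, lat, method = "quantile")
res_mad <- flag_outliers_sketch(lon, lat, method = "mad")
res_distance <- flag_outliers_sketch(lon, lat, method = "distance", tdi = 1000)
```

In this toy setup the distant record is the one each rule is expected to single out. With real data the three methods can disagree, which is one reason to compare them before settling on a setting.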
The likelihood of occurrence records being erroneous outliers is linked to the sampling effort in any given location. To account for this, the sampling correction (see sampling_thresh) fetches the number of occurrence records available from www.gbif.org per country as a proxy of sampling effort. The outlier test (the mean distance) for each record is then weighted by the log-transformed number of records per square kilometre in this country. See https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13152 for an example and further explanation of the outlier test.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
x <- data.frame(species = letters[1:10],
decimalLongitude = runif(100, -180, 180),
decimalLatitude = runif(100, -90,90))
cc_outl(x)
#> Testing geographic outliers
#> Removed 0 records.
#> species decimalLongitude decimalLatitude
#> 1 a -166.734140 67.7983944
#> 2 b 98.309641 -30.2983092
#> 3 c -93.650464 77.8124338
#> 4 d 167.349719 17.0580427
#> 5 e -104.442430 -35.6467802
#> 6 f 114.178918 31.5713251
#> 7 g 73.278478 -2.5544471
#> 8 h 178.860480 -8.4248014
#> 9 i -74.951180 41.9592909
#> 10 j 138.319755 -72.9571236
#> 11 a -39.404879 -25.9041824
#> 12 b -155.415490 -34.4565346
#> 13 c 163.392163 -81.8338668
#> 14 d 171.900505 12.3856285
#> 15 e 140.365106 16.7903421
#> 16 f 22.368051 31.8395472
#> 17 g 115.062829 -85.0786525
#> 18 h -147.382517 -45.4319912
#> 19 i 134.812375 -5.6636119
#> 20 j 142.923047 -60.5312263
#> 21 a 113.209576 17.0677688
#> 22 b 164.443335 41.8427002
#> 23 c -46.808064 -82.3436117
#> 24 d 54.980445 88.1917448
#> 25 e -134.371561 -14.0223085
#> 26 f 1.099822 -56.7685619
#> 27 g -101.460269 -88.3876949
#> 28 h 76.292558 77.3638814
#> 29 i 139.765357 16.7961550
#> 30 j -43.471362 75.3359226
#> 31 a 147.978006 78.0082618
#> 32 b -120.053240 -85.3209297
#> 33 c 85.398925 -79.8288402
#> 34 d 38.620958 65.6755897
#> 35 e -111.461286 50.2022438
#> 36 f 37.586563 -72.6932916
#> 37 g 61.909352 -89.1298452
#> 38 h 92.566249 32.1167114
#> 39 i 10.489470 -61.7511275
#> 40 j -46.718138 62.0432867
#> 41 a 170.554377 6.4895156
#> 42 b -25.664982 32.0149804
#> 43 c -166.438845 73.6694749
#> 44 d -58.110353 40.2955630
#> 45 e 54.340525 -78.1959907
#> 46 f -8.744069 -21.8429594
#> 47 g 132.346343 -59.3009500
#> 48 h -84.960212 51.0002726
#> 49 i 139.504670 -33.0669822
#> 50 j -102.067557 43.7191747
#> 51 a 165.350528 39.0488048
#> 52 b 59.655976 -74.3048462
#> 53 c -31.964427 -72.2516459
#> 54 d 80.999259 -18.8016596
#> 55 e -132.296646 22.8351028
#> 56 f -45.193013 -74.8497798
#> 57 g 87.986337 7.2154532
#> 58 h 172.347444 43.8970930
#> 59 i -49.091666 65.2350802
#> 60 j -77.346975 78.8476939
#> 61 a 26.982772 -9.5217459
#> 62 b 158.238673 -22.8168754
#> 63 c 9.132454 -80.4072736
#> 64 d 68.665172 75.1819960
#> 65 e -116.568908 -63.0196669
#> 66 f -117.988640 -69.0043274
#> 67 g 115.195135 40.1280645
#> 68 h 160.915815 3.0927944
#> 69 i -20.096711 -88.2364526
#> 70 j 72.613967 -82.0420930
#> 71 a -153.845346 -15.3968361
#> 72 b -15.487557 80.6981480
#> 73 c 30.974348 -2.1761179
#> 74 d -141.993606 -49.3856963
#> 75 e -78.949258 -36.8678680
#> 76 f -27.735618 69.2380178
#> 77 g 87.427949 38.3464553
#> 78 h -17.626052 -17.1217376
#> 79 i -5.420146 -47.3429105
#> 80 j 98.025516 86.9011125
#> 81 a -171.171004 31.6211660
#> 82 b -98.434968 -21.4210422
#> 83 c -27.200701 -71.9649185
#> 84 d -56.936156 12.9681117
#> 85 e 137.204372 -0.0337002
#> 86 f 80.614949 -86.7830849
#> 87 g 79.610152 81.7146947
#> 88 h -34.941441 14.2288230
#> 89 i -118.943267 40.2566450
#> 90 j -122.767857 -83.5447125
#> 91 a -170.979889 46.0867570
#> 92 b 150.704208 17.4628800
#> 93 c 35.644877 -23.8294777
#> 94 d 106.891720 -62.1885668
#> 95 e 107.680817 25.6034603
#> 96 f -57.598225 27.2663416
#> 97 g 27.322493 66.1210746
#> 98 h -149.129202 -26.4568585
#> 99 i -139.945984 51.2361376
#> 100 j -164.129052 37.7751713
cc_outl(x, method = "quantile", value = "flagged")
#> Testing geographic outliers
#> Flagged 0 records.
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
cc_outl(x, method = "distance", value = "flagged", tdi = 10000)
#> Testing geographic outliers
#> Flagged 0 records.
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
cc_outl(x, method = "distance", value = "flagged", tdi = 1000)
#> Testing geographic outliers
#> Flagged 89 records.
#> [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [25] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#> [37] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [49] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#> [61] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [97] FALSE FALSE FALSE FALSE
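The uniformly random coordinates above have no spatial structure, so they say little about how the test behaves on real distributions. As a complementary sketch, the data below mimic a species with a compact range plus one distant record. The data are simulated, the object names occ and flags are illustrative, and the call only runs if CoordinateCleaner is installed; no output is shown because it has not been verified here.

```r
set.seed(1)

# Twelve simulated records around 10°E, 50°N plus one distant record (row 13).
occ <- data.frame(
  species = "a",
  decimalLongitude = c(rnorm(12, mean = 10, sd = 0.5), 150),
  decimalLatitude = c(rnorm(12, mean = 50, sd = 0.5), -40)
)

# Run the test only when the package is available.
if (requireNamespace("CoordinateCleaner", quietly = TRUE)) {
  flags <- CoordinateCleaner::cc_outl(occ, method = "quantile",
                                      value = "flagged")
  print(occ[!flags, ])  # records considered potentially problematic
}
```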