`Background_dataset_level_cleaning.Rmd`

Some problems with biological collection data are not apparent from individual records but are linked to properties of an entire data set. The CleanCoordinatesDS function can use dataset properties to flag two types of potential problems, under the assumption that these problems will affect many (but not necessarily all) records in a dataset from the same source:

An erroneous conversion of coordinates from degree-minute notation into decimal degrees, where the degree sign is erroneously translated into the decimal delimiter, e.g. 10°30’ to 10.3°. This problem has been observed in particular for older data sets.

A periodicity in the decimals of a data set, as arises when coordinates are either rounded or recorded in a raster format, so that coordinates represent raster cell centres. This problem reflects low precision rather than error, but can still be fatal if it goes undetected and the coordinates are taken as actual localities, for example in distribution modelling.

`cd_ddmm`

Geographic coordinates in a longitude/latitude coordinate reference system can be noted in different ways. The most common notations are degrees, minutes and seconds (ddmm, e.g. 38°54’22’’) and decimal degrees (dd.dd, e.g. 38.90611°). A hybrid notation of degrees with decimal minutes is also sometimes used (e.g. 38°54.367’). However, analyses using distribution data rely almost exclusively on the machine-readable decimal-degree format. The diversity of formats is therefore challenging for databases composed of coordinate records from different sources, which potentially use different notations. Systematic errors can arise, especially if old data are combined and digitized automatically without appropriate conversion. A problem reported repeatedly is the misinterpretation of the degree sign (°) as the decimal delimiter (e.g. 38°54’ converted to 38.54°), which leads to biased geographic occurrence information. A particular caveat in identifying these problems post hoc in a database compiled from many sources is that biased records might be mixed with unproblematic records. For instance, specimens from a certain herbarium might have been digitized on different occasions, such that some of the records are biased whereas others are not.
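The difference between a correct and an erroneous conversion can be made concrete with a short base-R sketch. The helper names `dm_to_dd` and `dm_to_dd_wrong` are hypothetical and serve only as illustration; they are not part of the package.

```r
# Correct conversion: minutes are sixtieths of a degree
dm_to_dd <- function(deg, min) deg + min / 60

# Erroneous conversion: the degree sign is read as the decimal delimiter,
# so the minutes end up directly behind the decimal point (as hundredths)
dm_to_dd_wrong <- function(deg, min) deg + min / 100

dm_to_dd(38, 54)        # 38.9
dm_to_dd_wrong(38, 54)  # 38.54

# Minutes range from 0 to 59, so erroneously converted coordinates
# never have decimals of 0.6 or larger
max(dm_to_dd_wrong(0, 0:59))  # 0.59
```

This upper bound on the decimals is the signal that a dataset-level test can exploit.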

As part of this study we present a novel algorithm to identify data sets potentially biased by erroneous conversion from ddmm to dd.dd due to the misinterpretation of the degree sign as decimal delimiter.

The binomial test and frequency comparison described in the main text are based on an analysis matrix that records the distribution of coordinate decimals in the longitude/latitude space (Figure 1). The analysis matrix does not record the number of records in a cell, but only presence/absence, to account for clustered sampling (i.e. a large number of records with similar decimals is most likely not related to conversion error, but rather to multiple samples from the same location or to coordinate rounding, see section 3).
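A presence/absence matrix of this kind can be sketched in a few lines of base R. This is a minimal illustration under assumptions, not the package's implementation: the helper `decimal_matrix` and its binning scheme are hypothetical.

```r
# Hypothetical helper: presence/absence matrix of coordinate decimals
decimal_matrix <- function(lon, lat, mat_size = 100) {
  # decimal part of each coordinate, in [0, 1)
  lon_dec <- abs(lon) %% 1
  lat_dec <- abs(lat) %% 1
  # assign each decimal to one of mat_size bins per axis
  lon_bin <- pmin(floor(lon_dec * mat_size) + 1, mat_size)
  lat_bin <- pmin(floor(lat_dec * mat_size) + 1, mat_size)
  mat <- matrix(0L, nrow = mat_size, ncol = mat_size)
  # a cell is marked as occupied no matter how many records fall into it
  mat[cbind(lat_bin, lon_bin)] <- 1L
  mat
}

set.seed(1)
lon <- c(runif(50, 0, 90), rep(10.25, 500))  # 500 records from one locality
lat <- c(runif(50, 0, 90), rep(45.75, 500))
m <- decimal_matrix(lon, lat)
sum(m)  # number of occupied cells: at most 51, despite 550 records
```

The 500 repeated records occupy a single cell, so heavily sampled localities do not dominate the comparison of decimals below and above 0.6.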

The test is implemented in the `cd_ddmm` function and the `ddmm` argument of the `clean_dataset` wrapper function. The input follows the general package design and is a `data.frame` with at least three columns: the decimal longitude, the decimal latitude and a data set identifier for each record. The names of the columns can be specified via the `lon`, `lat` and `ds` arguments. Three additional arguments customize the test, in particular its sensitivity to the fraction of a data set that is biased. The `pvalue` argument controls the cut-off p-value for significance of the one-sided binomial test. The `diff` argument controls the threshold difference for the `ddmm` test and indicates by which fraction the records with decimals below 0.6 must outnumber the records with decimals above 0.6. The size of the analysis matrix can be adapted using the `mat_size` argument.
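How a p-value cut-off and a difference threshold can combine into a single flag is sketched below. This is one possible reading of the description above, not a copy of the package code: the function `ddmm_decision`, its inputs (counts of occupied matrix cells with decimals below and above 0.6) and the way the expected share is derived from the matrix size are assumptions made for illustration.

```r
# Hypothetical decision rule; argument defaults are chosen for illustration only
ddmm_decision <- function(n_low, n_high, mat_size = 1000,
                          pvalue = 0.025, diff = 1) {
  # assumed expectation: share of matrix cells with both decimals below 0.6
  p_expected <- floor(0.599 * mat_size)^2 / mat_size^2
  # one-sided binomial test: are low-decimal cells over-represented?
  bt <- stats::binom.test(n_low, n = n_low + n_high, p = p_expected,
                          alternative = "greater")
  # relative excess of the observed share over the expected share
  rel_diff <- unname((bt$estimate - p_expected) / p_expected)
  data.frame(binomial.pvalue = bt$p.value,
             perc.difference = rel_diff,
             pass = !(bt$p.value < pvalue & rel_diff > diff))
}

ddmm_decision(n_low = 100, n_high = 0, mat_size = 100)  # only low decimals: flagged
ddmm_decision(n_low = 37, n_high = 63, mat_size = 100)  # close to expectation: passes
```

Under these assumptions both criteria must be met for a flag, so a statistically significant but small excess of low decimals does not flag a dataset on its own.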

We used simulations to assess (A) the effect of varying the `diff` parameter and (B) the effect of data set size on the performance of the `ddmm` test.

We simulated 100,000 datasets of species occurrences with varying numbers of records and degrees of sample clustering. For each iteration we first drew a random number \(N \sim \Gamma(\alpha = 2, \beta = 1) *500\) for the number of records. We then simulated N latitude and longitude coordinates between 0° and 90° using \(K \in [1,5)\) truncated normal distributions with \(\mu_i \sim \mathcal{U}(0,90)\) and \(\sigma \sim \mathcal{U}(0.1,5)\). We then added a bias by replacing between 0 and 80% of the samples with records whose decimals were sampled from \(\mathcal{U}(0, 0.599)\). Finally, we analysed the simulated data using `cd_ddmm` with `diff` parameters between 0.1 and 1.
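One iteration of such a simulation can be sketched in base R as follows. The helpers `rtnorm` and `simulate_ddmm_dataset` are hypothetical, and several details are assumptions: K is read as an integer from 1 to 4, cluster centres are drawn separately for longitude and latitude, each cluster shares one standard deviation for both axes, and the truncated normal is drawn by simple rejection sampling.

```r
# Hypothetical helper: rejection sampler for a truncated normal distribution
rtnorm <- function(n, mean, sd, lower = 0, upper = 90) {
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(n, mean, sd)
    out <- c(out, draw[draw >= lower & draw <= upper])
  }
  out[seq_len(n)]
}

# Hypothetical helper: one simulated dataset with a fraction rho of biased records
simulate_ddmm_dataset <- function(rho) {
  n <- max(1, round(rgamma(1, shape = 2, rate = 1) * 500))  # number of records
  k <- sample(1:4, 1)              # number of clusters (assumed integer in [1, 5))
  mu_lon <- runif(k, 0, 90)        # cluster centres
  mu_lat <- runif(k, 0, 90)
  sigma <- runif(k, 0.1, 5)        # cluster spread
  cl <- sample(seq_len(k), n, replace = TRUE)  # cluster membership per record
  lon <- lat <- numeric(n)
  for (i in seq_len(k)) {
    idx <- which(cl == i)
    lon[idx] <- rtnorm(length(idx), mu_lon[i], sigma[i])
    lat[idx] <- rtnorm(length(idx), mu_lat[i], sigma[i])
  }
  # replace a fraction rho of the records by records with decimals below 0.6
  bias <- sample.int(n, round(rho * n))
  lon[bias] <- floor(lon[bias]) + runif(length(bias), 0, 0.599)
  lat[bias] <- floor(lat[bias]) + runif(length(bias), 0, 0.599)
  data.frame(decimallongitude = lon, decimallatitude = lat, dataset = "sim")
}

set.seed(42)
sim <- simulate_ddmm_dataset(rho = 0.5)
mean(sim$decimallongitude %% 1 < 0.6)  # share of longitude decimals below 0.6
```

A data frame of this shape has the three columns the test expects, so many such datasets with different `rho` values can be stacked and analysed together.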

Figure 2 shows the effect of the `diff` parameter on the sensitivity of the `cd_ddmm` test and Figure 3 the effect of dataset size. In general, the test identifies the presence of a bias with a rate > 0.1 (10% bias) well for datasets with more than 100 individual occurrence records. However, for empirical data we recommend a more conservative `diff` threshold of 0.5-1 to identify datasets with more than 30% bias, because smaller bias rates might be caused by irregular sampling rather than conversion errors. The advisable `diff` threshold depends on downstream analyses, and higher `diff` values might be necessary (Figure 2). We suggest manually checking the decimal distribution in flagged datasets to avoid data loss. This can be done by visually inspecting the analysis matrix, which is plotted when the `diagnostic` argument of `cd_ddmm` is set to TRUE.

## Expected failure rate

The simulations suggest that the expected false-positive rate for a given `diff` (p-value constant at 0.025) is low.

diff | false.positives | n | false.positive.rate |
---|---|---|---|
0.1 | 56 | 901 | 0.06 |
0.3 | 5 | 838 | 0.01 |
0.5 | 0 | 864 | 0.00 |
1.0 | 0 | 827 | 0.00 |
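The rate column follows directly from the two count columns, assuming the rate is the number of false positives divided by `n` and rounded to two decimals:

```r
false_positives <- c(56, 5, 0, 0)
n <- c(901, 838, 864, 827)

# false-positive rate per diff value (0.1, 0.3, 0.5, 1.0)
round(false_positives / n, 2)  # 0.06 0.01 0.00 0.00
```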

The conversion error test is implemented in the `cd_ddmm` function and can be run with a few lines of code. By default, the test requires a two-degree span for each tested dataset to avoid false flags due to habitat restrictions, for instance on islands or in patchy species distributions such as islands of forest in grassland. Nevertheless, we consider the `cd_ddmm` test a tool to identify potentially problematic datasets rather than a means of automatic filtering, and we recommend double-checking flagged datasets using summary statistics and the diagnostic plots.

```
library(CoordinateCleaner)

# unbiased dataset: decimals are uniformly distributed
clean <- data.frame(species = letters[1:10],
                    decimallongitude = runif(100, -180, 180),
                    decimallatitude = runif(100, -90, 90),
                    dataset = "clean")

# problematic dataset: all decimals are below 0.6
# (integer parts are capped at 179 and 89 to keep coordinates in a valid range)
lon <- sample(0:179, size = 100, replace = TRUE) + runif(100, 0, 0.59)
lat <- sample(0:89, size = 100, replace = TRUE) + runif(100, 0, 0.59)
biased <- data.frame(species = letters[1:10],
                     decimallongitude = lon,
                     decimallatitude = lat,
                     dataset = "biased")

dat <- rbind(clean, biased)

# with diagnostic plots; small matrix due to small dataset size and for visualization
par(mfrow = c(1, 2))
cd_ddmm(x = dat, lon = "decimallongitude", lat = "decimallatitude", ds = "dataset",
        diagnostic = TRUE, value = "dataset", mat_size = 100)
```

```
## binomial.pvalue perc.difference pass
## clean 0.358 0.063 TRUE
## biased 0.000 1.873 FALSE
```

```
# inspect the geographic extent of the flagged dataset to exclude island or patchy habitat
min(biased$decimallongitude)
## [1] 3.274181
max(biased$decimallongitude)
## [1] 175.1158
```

The diagnostic plots clearly show the biased distribution of decimals in the biased dataset and the large geographic extent excludes islands or patchy habitats as cause of this distribution.

`cd_round`

Species occurrence records with coordinates often do not represent point occurrences but are either derived from rasterized collection designs (e.g. presence/absence in a 100 km x 100 km grid cell) or have been subject to strong decimal rounding. If the relevant metadata are missing, these issues cannot be identified on the record level, especially if the records have been combined with precise GPS-based point occurrences into data sets of mixed precision. However, knowledge of coordinate precision and awareness of a large number of imprecise records in a data set can be crucial for downstream analyses. For instance, coordinates collected in a 100 km x 100 km raster might be unsuitable for species distribution modelling on the local and regional scale.
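The periodicity that rounding or a raster design leaves in the decimals can be seen in a small base-R sketch. The data are made up, and a half-degree raster is assumed purely for illustration.

```r
set.seed(1)
precise <- runif(1000, 0, 90)       # GPS-like point records
gridded <- round(precise * 2) / 2   # the same records snapped to a 0.5 degree raster

# number of distinct decimal values in each set of coordinates
length(unique(precise %% 1))
length(unique(gridded %% 1))  # 2 (only .0 and .5 remain)
```

Individually, each gridded coordinate looks valid; the pattern only becomes visible when the decimals of the whole dataset are considered together.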

The sensitivity of the algorithms presented in the main text can be customized based on the raster regularity and the fraction of biased records in a dataset (`T1`), the geographic extent of the raster (`reg.dist.min` and `reg.dist.max`) and the number of raster nodes (`reg.out.thresh`).

```
## Testing for rasterized collection
```

We simulated 100,000 datasets of species occurrences with varying numbers of records and degrees of sample clustering. For each iteration we first drew a random number \(N \sim \Gamma(\alpha = 2, \beta = 1) *500\) for the number of records. We then simulated N latitude and longitude coordinates between 0° and 90° using \(K \in [1,5)\) truncated normal distributions with \(\mu_i \sim \mathcal{U}(0,90)\) and \(\sigma \sim \mathcal{U}(0.1,5)\). We then added a fraction \(\rho\) of biased records, where \(\rho \in[0,0.6]\) (= 0-60%). We assumed a rasterized bias with five nodes (i.e. a raster with five rows and five columns). We first sampled a random coordinate as the origin of the raster and then placed the four remaining nodes at a resolution \(\tau \in [0.1, 2]\). We then analysed the simulated data using the `cd_round` function with `T1` parameters between 3 and 13. Figure 5 shows the effect of the `T1` parameter on the sensitivity of the `cd_round` test.
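The rasterized bias described above can be sketched in base R. The helper `add_raster_bias` is hypothetical, and some details are assumptions: the origin is drawn so that all nodes stay between 0° and 90°, the same resolution is used for both axes, and biased records are assigned to raster nodes uniformly at random.

```r
# Hypothetical helper: move a fraction rho of the records onto a regular raster
add_raster_bias <- function(lon, lat, rho, n_nodes = 5) {
  tau <- runif(1, 0.1, 2)                          # raster resolution in degrees
  origin <- runif(2, 0, 90 - (n_nodes - 1) * tau)  # raster origin (lon, lat)
  nodes_lon <- origin[1] + (seq_len(n_nodes) - 1) * tau
  nodes_lat <- origin[2] + (seq_len(n_nodes) - 1) * tau
  # pick the records to bias and assign each one to a random raster node
  biased <- sample.int(length(lon), round(rho * length(lon)))
  lon[biased] <- nodes_lon[sample.int(n_nodes, length(biased), replace = TRUE)]
  lat[biased] <- nodes_lat[sample.int(n_nodes, length(biased), replace = TRUE)]
  list(lon = lon, lat = lat, nodes_lon = nodes_lon, nodes_lat = nodes_lat)
}

set.seed(7)
sim <- add_raster_bias(runif(1000, 0, 90), runif(1000, 0, 90), rho = 0.3)
mean(sim$lon %in% sim$nodes_lon)  # share of records sitting on a raster node
```

Five node positions per axis give a raster with five rows and five columns, as in the simulation setup.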

The following example illustrates the use of `cd_round`.
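A call producing output of this form could look like the following minimal sketch. The data are made up (precise records mixed with records on a one-degree grid), so the numbers will differ from the output shown here. It is assumed that `cd_round` accepts the `lon`, `lat`, `ds` and `value` arguments in the same way as `cd_ddmm`; all other arguments are left at their defaults.

```r
set.seed(3)

# made-up illustration data: a precise dataset and a dataset with 10% gridded records
clean <- data.frame(decimallongitude = runif(1000, 0, 15),
                    decimallatitude = runif(1000, 0, 15),
                    dataset = "clean")
biased <- data.frame(
  decimallongitude = c(runif(900, 0, 15), sample(3:8, 100, replace = TRUE)),
  decimallatitude = c(runif(900, 0, 15), sample(3:8, 100, replace = TRUE)),
  dataset = "biased"
)
dat <- rbind(clean, biased)

# run the test only if the package is installed
if (requireNamespace("CoordinateCleaner", quietly = TRUE)) {
  CoordinateCleaner::cd_round(x = dat,
                              lon = "decimallongitude",
                              lat = "decimallatitude",
                              ds = "dataset",
                              value = "dataset")
}
```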

```
## dataset lon.n.outliers lon.n.regular.distance lon.regular.distance
## biased biased 8 4 1
## clean clean 0 0 NA
## summary
## biased FALSE
## clean TRUE
```

If `value = "dataset"`, the output of `cd_round` is a table indicating the number of regular outliers (i.e. the raster nodes) and which datasets have been flagged. Additionally, the diagnostic plots clearly show the rasterized pattern in the biased dataset and confirm the automated flag.