Flags datasets with periodicity patterns indicative of a rasterized (lattice) collection scheme, as often obtain from e.g. atlas data. Using a combination of autocorrelation and sliding-window outlier detection to identify periodicity patterns in the data. See https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13152 for further details and a description of the algorithm

cd_round(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ds = "dataset",
  T1 = 7,
  reg_out_thresh = 2,
  reg_dist_min = 0.1,
  reg_dist_max = 2,
  min_unique_ds_size = 4,
  graphs = TRUE,
  test = "both",
  value = "clean",
  verbose = TRUE
)

Arguments

x

data.frame. Containing geographical coordinates and species names.

lon

character string. The column with the longitude coordinates. Default = “decimalLongitude”.

lat

character string. The column with the latitude coordinates. Default = “decimalLatitude”.

ds

a character string. The column with the dataset of each record. In case x should be treated as a single dataset, identical for all records. Default = “dataset”.

T1

numeric. The threshold for outlier detection in a in an interquantile range based test. This is the major parameter to specify the sensitivity of the test: lower values, equal higher detection rate. Values between 7-11 are recommended. Default = 7.

reg_out_thresh

numeric. Threshold on the number of equal distances between outlier points. See details. Default = 2.

reg_dist_min

numeric. The minimum detection distance between outliers in degrees (the minimum resolution of grids that will be flagged). Default = 0.1.

reg_dist_max

numeric. The maximum detection distance between outliers in degrees (the maximum resolution of grids that will be flagged). Default = 2.

min_unique_ds_size

numeric. The minimum number of unique locations (values in the tested column) for datasets to be included in the test. Default = 4.

graphs

logical. If TRUE, diagnostic plots are produced. Default = TRUE.

test

character string. Indicates which column to test. Either “lat” for latitude, “lon” for longitude, or “both” for both. In the latter case datasets are only flagged if both test are failed. Default = “both”

value

character string. Defining the output value. See value.

verbose

logical. If TRUE reports the name of the test and the number of records flagged.

Value

Depending on the ‘value’ argument, either a data.frame

with summary statistics and flags for each dataset (“dataset”) or a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

See also

Other Datasets: cd_ddmm()

Examples


#simulate bias grid, one degree resolution, 10% error on a 1000 records dataset
#simulate biased fraction of the data, grid resolution = 1 degree
#simulate non-biased fraction of the data
  bi <- sample(3 + 0:5, size = 100, replace = TRUE)
  mu <- runif(3, 0, 15)
  sig <- runif(3, 0.1, 5)
  cl <- rnorm(n = 900, mean = mu, sd = sig)
  lon <- c(cl, bi)
  
  bi <- sample(9:13, size = 100, replace = TRUE)
  mu <- runif(3, 0, 15)
  sig <- runif(3, 0.1, 5)
  cl <- rnorm(n = 900, mean = mu, sd = sig)
  lat <- c(cl, bi)
  
  #add biased data
  
  inp <- data.frame(decimalLongitude = lon,
                    decimalLatitude = lat,
                    dataset = "test")
            
          
  #run test
  if (FALSE) {
  cd_round(inp, value = "dataset")
  }