Introduction

This vignette compares storage and retrieval of data by git2rdata with other standard R functionality. We consider write.table() and read.table() for data stored in a plain text format. saveRDS() and readRDS() use a compressed binary format.

To get meaningful results, we will use the nassCDS dataset from the DAAG package. We’ll avoid the dependency on the package by directly downloading the data. The code below refers to this dataset as airbag.

Data Storage

On a File System

We start by writing the dataset as is with write.table(), saveRDS(), write_vc() and write_vc() without storage optimization. Note that write_vc() uses optimization by default. Since write_vc() creates two files for each data set, we take their combined file size into account.

library(git2rdata)
# `airbag` is the nassCDS data frame, downloaded beforehand (step not shown)
root <- tempfile("git2rdata-efficient")
dir.create(root)

# plain text with base R
write.table(airbag, file.path(root, "base_R.tsv"), sep = "\t")
base_size <- file.size(file.path(root, "base_R.tsv"))

# compressed binary with base R
saveRDS(airbag, file.path(root, "base_R.rds"))
rds_size <- file.size(file.path(root, "base_R.rds"))

# write_vc() creates two files per dataset, hence the sum() of their sizes
fn <- write_vc(airbag, "airbag_optimize", root, sorting = "X")
optim_size <- sum(file.size(file.path(root, fn)))

fn <- write_vc(airbag, "airbag_verbose", root, sorting = "X", optimize = FALSE)
verbose_size <- sum(file.size(file.path(root, fn)))

Since the data is highly compressible, saveRDS() yields the smallest file, at the cost of a binary file format. Both write_vc() formats yield smaller files than write.table(), partly because write_vc() doesn’t store row names and only uses quotes when needed. The difference between the optimized and verbose versions of write_vc() is, in this case, solely due to the way factors are stored in the data (tsv) file: the optimized version stores the indices of the factor, whereas the verbose version stores the levels. For example, airbag$dvcat has 5 levels with fairly short labels (on average 5 characters), yet storing the index requires only 1 character, which results in more compact files.
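The difference between storing labels and storing indices can be sketched in base R. The labels below are hypothetical stand-ins with an average length of 5 characters, and the character counts only approximate what ends up in the tsv file. This is an illustration, not a reproduction of what write_vc() does internally.

```r
# Hypothetical labels with an average length of 5 characters (an assumption,
# chosen to mimic a factor with 5 short levels such as airbag$dvcat).
labels <- c("1-9km/h", "10-24", "25-39", "40-54", "55+")
set.seed(42)
x <- factor(sample(labels, 1000, replace = TRUE), levels = labels)

# Characters needed when every observation stores its label (verbose style).
chars_labels <- sum(nchar(as.character(x)))
# Characters needed when every observation stores its index (optimized style).
chars_index <- sum(nchar(as.character(as.integer(x))))

c(labels = chars_labels, index = chars_index)
```

With five levels every index is a single digit, so the index column needs exactly one character per observation, whereas the label column needs about five.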

Resulting file sizes (in kB) and file sizes relative to the size of write.table().

method                   file_size (kB)   relative
saveRDS()                        313.13       0.12
write_vc(), optimized           1739.99       0.64
write_vc(), verbose             2467.20       0.91
write.table()                   2716.93       1.00

The reduction in file size from storing factors in optimized form depends on the length of the labels, the number of levels and the number of observations. The figure below illustrates the large gain as soon as the level labels contain a few characters. The gain is less pronounced when the factor has a large number of levels. The optimization fails only in extreme cases with very short factor labels and a high number of levels.

Effect of the label length on the efficiency of storing optimized factors, assuming 1000 observations

The effect of the number of observations is mainly due to the overhead of storing the metadata. The importance of this overhead increases when the number of observations is small.
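To make the trade-off concrete, here is a deliberately crude size model in base R. It is an assumption for illustration only and not the calculation behind the figures. It counts characters, assumes all levels occur equally often, charges a hypothetical per-level metadata cost (meta_overhead) to the optimized form only, and ignores everything else in the files.

```r
# Crude size model: an assumption for illustration only, not git2rdata code.
# n_obs: number of observations, n_levels: number of factor levels,
# label_length: characters per label, meta_overhead: hypothetical number of
# extra characters needed per level in the metadata.
relative_size <- function(n_obs, n_levels, label_length, meta_overhead = 2) {
  # average number of digits needed to write an index between 1 and n_levels
  index_length <- mean(floor(log10(seq_len(n_levels))) + 1)
  verbose <- n_obs * label_length
  optimized <- n_obs * index_length + n_levels * (label_length + meta_overhead)
  optimized / verbose
}

# many observations, few levels, long labels: large gain
many_obs <- relative_size(n_obs = 1000, n_levels = 5, label_length = 10)
# few observations: the metadata overhead weighs more heavily
few_obs <- relative_size(n_obs = 10, n_levels = 5, label_length = 10)
# very short labels and many levels: the optimized form becomes larger
extreme <- relative_size(n_obs = 1000, n_levels = 1000, label_length = 1)

c(many_obs = many_obs, few_obs = few_obs, extreme = extreme)
```

Values below 1 mean that the optimized form is smaller under this model. Values above 1 correspond to the extreme cases in which the optimization fails.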

Effect of the number of observations on the efficiency of storing optimized factors, assuming labels with 10 characters

In Git Repositories

Here we will simulate how much space the data requires when the history is stored in a git repository. We will create a git repository for each method and store several subsets of the same data. Each commit contains a new version of the data. Each version is a random sample containing 90% of the observations of the airbag data. Two consecutive versions of the subset will have about 90% of the observations in common. 10% of the observations will be replaced by other observations.
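The claim about the overlap between consecutive versions can be checked with a small simulation. The sketch below uses a hypothetical number of observations (n) instead of the airbag data and draws two independent versions in the same way as the simulation that follows.

```r
set.seed(2024)
n <- 10000  # hypothetical number of observations, not the size of airbag

# Each version keeps every observation independently with probability 0.9.
version_1 <- rbinom(n, size = 1, prob = 0.9) == 1
version_2 <- rbinom(n, size = 1, prob = 0.9) == 1

# Share of the observations in version 2 that were already in version 1.
shared <- sum(version_1 & version_2) / sum(version_2)
shared
```

Because the two draws are independent, the expected share is 0.9, which is why about 10% of the observations change between two commits.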

After writing each version, we commit the file, perform garbage collection (git gc) on the git repository to minimize its size and then calculate the size of the git history (git count-objects -v).

library(git2r)
tmp_repo <- function() {
  root <- tempfile("git2rdata-efficient-git")
  dir.create(root)
  repo <- init(root)
  config(repo, user.name = "me", user.email = "me@me.com")
  return(repo)
}
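commit_and_size <- function(repo, filename) {
  # stage and commit the new version of the file
  add(repo, filename)
  commit(repo, "test", session = TRUE)
  # compact the repository, then ask git for the size of the packed objects
  git_size <- system(
    sprintf("cd %s\ngit gc\ngit count-objects -v", dirname(repo$path)),
    intern = TRUE
  )
  # keep the "size-pack" line and extract its value (git reports it in KiB)
  git_size <- git_size[grep("size-pack", git_size)]
  as.integer(gsub(".*: (.*)", "\\1", git_size))
}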

repo_wt <- tmp_repo()
repo_wts <- tmp_repo()
repo_rds <- tmp_repo()
repo_wvco <- tmp_repo()
repo_wvcv <- tmp_repo()

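repo_size <- replicate(
  100, {
    # keep each observation with a probability of 90%
    observed_subset <- rbinom(nrow(airbag), size = 1, prob = 0.9) == 1
    # worst case: shuffle both the observations and the variables
    this <- airbag[
      sample(which(observed_subset)),
      sample(ncol(airbag))
    ]
    # stable case: keep the original order of observations and variables
    this_sorted <- airbag[observed_subset, ]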
    fn_wt <- file.path(workdir(repo_wt), "base_R.tsv")
    write.table(this, fn_wt, sep = "\t")
    fn_wts <- file.path(workdir(repo_wts), "base_R.tsv")
    write.table(this_sorted, fn_wts, sep = "\t")
    fn_rds <- file.path(workdir(repo_rds), "base_R.rds")
    saveRDS(this, fn_rds)
    fn_wvco <- write_vc(this, "airbag_optimize", repo_wvco, sorting = "X")
    fn_wvcv <- write_vc(
      this, "airbag_verbose", repo_wvcv, sorting = "X", optimize = FALSE
    )
    c(
      write.table = commit_and_size(repo_wt, fn_wt),
      write.table.sorted = commit_and_size(repo_wts, fn_wts),
      saveRDS = commit_and_size(repo_rds, fn_rds), 
      write_vc.optimized = commit_and_size(repo_wvco, fn_wvco), 
      write_vc.verbose = commit_and_size(repo_wvcv, fn_wvcv)
    )
})

Each version of the data deliberately has a random order of observations and variables. This is what would happen in a worst case scenario, as it would generate the largest possible diff. We also test write.table() with a stable ordering of the observations and variables.

The randomised write.table() yields the largest git repository, converging to about 6.5 times the size of a git repository based on the sorted write.table(). saveRDS() yields a 26% reduction in repository size compared to the randomised write.table(), but is still 4.8 times larger than the sorted write.table(). Note that the gain of storing binary files in a git repository is much smaller than the gain in individual file size, because git compresses the repository too. Relative to the sorted write.table(), the optimized write_vc() starts at 84% and converges towards 73%, whereas the verbose version starts at 89% and converges towards 116%. Using write_vc() with optimization gives a clear gain in both storage size and the availability of metadata. The verbose option of write_vc() lacks the gain in storage size but still has the metadata advantage.
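The relative sizes quoted above can be derived from a matrix shaped like repo_size, with one row per method and one column per commit. The sketch below assumes that the sorted write.table() is the baseline. The numbers are made-up placeholders that only show the calculation; they are not results from this vignette.

```r
# Placeholder sizes for three commits; these numbers are made up purely to
# show the calculation and are NOT results from the simulation.
toy_size <- rbind(
  write.table = c(100, 200, 300),
  write.table.sorted = c(50, 60, 70),
  write_vc.optimized = c(40, 45, 50)
)

# Divide every row by the sorted write.table() row, commit by commit.
relative <- sweep(toy_size, 2, toy_size["write.table.sorted", ], "/")
round(relative, 2)
```

Each column of relative then expresses the repository sizes after that commit as a multiple of the baseline.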

Size of the git history using the different storage methods.

Relative size of the git repository when compared to write.table().

Timings

The code below runs a microbenchmark on the four methods. By default, microbenchmark() runs each expression a hundred times and yields a distribution of timings for each expression.
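For readers unfamiliar with the idea, the principle can be approximated in base R. The helper time_repeatedly() below is hypothetical and much cruder than the microbenchmark package, which uses a more precise timer. It only illustrates how repeated runs produce a distribution of timings.

```r
# Time a function repeatedly; returns the elapsed times in milliseconds.
# `time_repeatedly` is a hypothetical helper, not part of microbenchmark.
time_repeatedly <- function(fun, times = 100) {
  vapply(
    seq_len(times),
    function(i) {
      start <- proc.time()[["elapsed"]]
      fun()
      (proc.time()[["elapsed"]] - start) * 1e3
    },
    numeric(1)
  )
}

# An arbitrary workload, used only as an example.
timings <- time_repeatedly(function() sort(runif(1e5)), times = 20)
summary(timings)
```

Looking at the whole distribution rather than a single run reduces the influence of incidental delays, such as other processes using the disk.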

Writing Data

library(microbenchmark)
mb <- microbenchmark(
  write.table = write.table(airbag, file.path(root, "base_R.tsv"), sep = "\t"),
  saveRDS = saveRDS(airbag, file.path(root, "base_R.rds")),
  write_vc.optim = write_vc(airbag, "airbag_optimize", root, sorting = "X"),
  write_vc.verbose = write_vc(airbag, "airbag_verbose", root, sorting = "X", 
                              optimize = FALSE)
)
# microbenchmark reports nanoseconds; convert to milliseconds
mb$time <- mb$time / 1e6

write_vc() takes 91% to 112% more time than write.table() because it needs to prepare the metadata and sort the observations and variables. When overwriting existing data, the new data is checked against the existing metadata. saveRDS() requires only 46% of the time that write.table() needs.

Boxplot of the write timings for the different methods.

Reading Data

mb <- microbenchmark(
  read.table = read.table(file.path(root, "base_R.tsv"), header = TRUE, 
                          sep = "\t"),
  readRDS = readRDS(file.path(root, "base_R.rds")),
  read_vc.optim = read_vc("airbag_optimize", root),
  read_vc.verbose = read_vc("airbag_verbose", root)
)
# microbenchmark reports nanoseconds; convert to milliseconds
mb$time <- mb$time / 1e6

The timings for reading the data are another story. Reading the binary format takes about 8% of the time needed to read the standard plain text format using read.table(). read_vc() takes about 102% (optimized) and 117% (verbose) of the time needed by read.table(). This small difference may seem surprising, because read_vc() calls read.table() to read the files and has some extra work to convert the variables to the correct data type. The explanation is that read_vc() knows the required data types a priori and passes this information to read.table(). Otherwise, read.table() has to guess the correct data type from the file.
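The difference between guessing and declaring data types can be shown with read.table() alone. The sketch below uses a small made-up data frame and the colClasses argument of read.table(). It illustrates the principle and is not the actual read_vc() implementation.

```r
set.seed(1)
# A small made-up dataset, written as a tab separated file.
df <- data.frame(
  id = 1:1000,
  weight = runif(1000),
  group = sample(letters[1:3], 1000, replace = TRUE),
  stringsAsFactors = FALSE
)
tmp <- tempfile(fileext = ".tsv")
write.table(df, tmp, sep = "\t", row.names = FALSE)

# read.table() has to inspect the content to guess the type of each column.
guessed <- read.table(tmp, header = TRUE, sep = "\t")

# With colClasses the types are declared up front, so no guessing is needed.
declared <- read.table(
  tmp, header = TRUE, sep = "\t",
  colClasses = c("integer", "numeric", "character")
)
unlink(tmp)

vapply(declared, class, character(1))
```

Declaring the types also guarantees that every column has the intended class, whatever the content of the file suggests.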

Boxplots for the read timings for the different methods.
