Introduction

This vignette describes a suggested workflow for storing a snapshot of dataframes as git2rdata objects under version control. The workflow comes in two flavours:

  1. A single repository holding both the data and the analysis code. The single repository set-up is simple. A single reference (e.g. commit) points to both the data and the code.
  2. One repository holding the data and a second repository holding the code. The data and the code have an independent history under a two repository set-up. Documenting the analysis requires one reference to each repository. Such a set-up is useful for repeating the same analysis (stable code) on updated data.

In this vignette we use a git2r::repository() object as the root. This adds git functionality, provided by the git2r package, to write_vc() and read_vc(), allowing us to pull, stage, commit and push from within R.

Each commit in the data git repository describes a complete snapshot of the data at the time of the commit. The difference between two commits can consist of changes in existing git2rdata objects (updated observations, new observations, deleted observations or updated metadata). Besides updating existing git2rdata objects, we can also add new git2rdata objects or remove existing ones. Such higher-level additions and deletions need to be tracked as well.

We illustrate the workflow with a mock analysis on the datasets::beaver1 and datasets::beaver2 datasets.

Setup

We start by initializing a git repository. git2rdata assumes this has already been done, so we use the git2r functions to do it. We first create a local bare repository to act as the upstream; in practice this would be a remote on an external server (GitHub, GitLab, Bitbucket, …). Then we clone it into a local git repository. The example below creates a local git repository with an upstream git repository; any other workflow that creates a similar structure is fine.
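
A minimal sketch of such a set-up, using temporary directories as stand-ins for a real remote (the paths, user name and e-mail address are placeholders):

# create a local bare repository to act as the upstream remote
remote <- tempfile("git2rdata-workflow-remote")
dir.create(remote)
git2r::init(remote, bare = TRUE)
# clone the remote into a local working repository
path <- tempfile("git2rdata-workflow")
dir.create(path)
repo <- git2r::clone(remote, path, progress = FALSE)
# set the committer information for this repository
git2r::config(repo, user.name = "me", user.email = "me@example.com")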

Structuring Git2rdata Objects Within a Project

git2rdata imposes very little structure. Both the .tsv and the .yml file need to be in the same folder. That’s it. For the sake of simplicity, in this vignette we dump all git2rdata objects at the root of the repository.

However, this might not be a good idea for a real project. We recommend using at least a separate directory tree for each import script. This directory can go into the root of a data-only repository, into the data directory in the case of a combined data and code repository, or into the inst directory in the case of an R package.

Your project might need a different directory structure. Feel free to implement the most relevant data structure for your project.

Storing Dataframes ad Hoc into a Git Repository

First Commit

In the first commit we use datasets::beaver1. We connect to the git repository using repository(). Note that this assumes that path points to an existing git repository. Now we can write the dataset as a git2rdata object in the repository. If the root argument of write_vc() is a git_repository, write_vc() gains two additional arguments: stage and force. Setting stage = TRUE automatically stages the files written by write_vc().

library(git2rdata)
repo <- repository(path)
fn <- write_vc(beaver1, "beaver", repo, sorting = "time", stage = TRUE)

We can use status() to check that the required files are written and staged. Then we commit() the changes.
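
The corresponding calls could look like this (a sketch; the commit message is a placeholder and the cm1 object name is chosen to match the checkout example further down):

# check that beaver.tsv and beaver.yml are written and staged
status(repo)
# store the first snapshot in the git history
cm1 <- commit(repo, message = "First commit")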

Second Commit

The second commit adds beaver2.

fn <- write_vc(beaver2, "extra_beaver", repo, sorting = "time", stage = TRUE)
status(repo)
#> working directory clean

Notice that extra_beaver is not listed in the status(), although it was written to the repository. The reason is that we set a .gitignore containing "*extra*", so any git2rdata object with a name containing “extra” is ignored. We can force it to be staged by setting force = TRUE.
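
For reference, such an ignore rule can be created with a single line along these lines (a sketch; the exact content and timing of the .gitignore used in this vignette are assumptions):

# ignore every file whose name contains "extra"
writeLines("*extra*", file.path(path, ".gitignore"))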

status(repo, ignored = TRUE)
#> Ignored files:
#>  Ignored:    extra_beaver.tsv
#>  Ignored:    extra_beaver.yml
fn <- write_vc(beaver2, "extra_beaver", repo, sorting = "time", stage = TRUE, 
               force = TRUE)
status(repo)
#> Staged changes:
#>  New:        extra_beaver.tsv
#>  New:        extra_beaver.yml
cm2 <- commit(repo, message = "Second commit")

Third Commit

At this point we decide that a single git2rdata object containing the data of both beavers is more relevant. We add an ID variable for each of the animals. This requires updating the sorting to eliminate ties, and setting strict = FALSE to update the metadata. The “extra_beaver” git2rdata object is no longer needed, so we remove it. We use all = TRUE in commit() to stage the removal of “extra_beaver” while committing the changes.
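
A sketch of what this third commit could look like (the name of the ID variable, the sorting variables and the commit message are choices made for this illustration, not requirements of the package):

# add an identifier for each animal and combine both datasets
beaver1$beaver <- 1
beaver2$beaver <- 2
body_temp <- rbind(beaver1, beaver2)
# overwrite the existing "beaver" object; the extra variable and the new
# sorting change the metadata, hence strict = FALSE
fn <- write_vc(
  body_temp, "beaver", repo, sorting = c("beaver", "time"),
  strict = FALSE, stage = TRUE
)
# the combined object supersedes "extra_beaver", so remove its files
file.remove(list.files(path, "extra_beaver", full.names = TRUE))
# all = TRUE also stages the removal of the tracked "extra_beaver" files
cm3 <- commit(repo, message = "Third commit", all = TRUE)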

Scripted Workflow for Storing Dataframes

We strongly recommend adding git2rdata objects through an import script instead of adding them ad hoc. Store this script in the (analysis) repository: it documents the creation of the git2rdata objects. Rerun this script whenever updated data becomes available.

Old versions of the import script and the associated git2rdata objects remain available through the version control history. Remove obsolete git2rdata objects from the import script. This keeps both the import script and the working directory tidy and minimal.

Basically, the import script should create all git2rdata objects within a given directory tree. The advantage is that we can start the import script by clearing any existing git2rdata objects in this directory. Any git2rdata object which is no longer created by the import script gets removed, without the need to track which git2rdata objects existed in the previous version.

The brute force method of removing all files or all .tsv / .yml pairs is not a good idea. This removes the existing metadata which we need for efficient storage (see vignette("efficiency", package = "git2rdata")). A better solution is to use rm_data() on the directory at the start of the import script. This removes all .tsv files which have valid metadata. The existing metadata remains untouched at this point.

Then write all git2rdata objects and stage them. Unchanged objects will not lead to a diff, even if we first deleted and then recreated them. The script won’t recreate the .tsv file of obsolete git2rdata objects. Use prune_meta() to remove any leftover metadata files.

Commit and push the changes at the end of the script.

Below is an example script recreating the “beaver” git2rdata object from the third commit.
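
A sketch of such an import script (the pull/push calls, the commit message and the assumption that the script runs from the repository root are illustrative):

library(git2rdata)
repo <- repository(".")  # assumes the working directory is the repository root
pull(repo)               # start from the latest version of the data
# remove the data (.tsv) files of all existing git2rdata objects,
# keeping their metadata for efficient storage
rm_data(repo, path = ".")
# recreate the combined dataset
beaver1$beaver <- 1
beaver2$beaver <- 2
body_temp <- rbind(beaver1, beaver2)
write_vc(
  body_temp, "beaver", repo, sorting = c("beaver", "time"), stage = TRUE
)
# remove leftover metadata of objects that were not recreated
prune_meta(repo, path = ".")
# all = TRUE also stages the removals
commit(repo, message = "Import body temperature data", all = TRUE)
push(repo)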

R Package Workflow for Storing Dataframes

We recommend a two repository set-up in case of recurring analyses. These are relatively stable analyses which have to be run with some frequency on updated data (e.g. once a month). Then it is worthwhile to convert the analyses into an R package. Long scripts can be converted into a set of shorter functions which are much easier to document and maintain. An R package offers lots of functionality out of the box to check the quality of your code.

The example below converts the import script above into a function. We illustrate how you can use Roxygen2 tags (see vignette("roxygen2", package = "roxygen2")) to document the function and to list its dependencies. Note that we added session = TRUE to commit(). This appends the sessionInfo() at the time of the commit to the commit message, thus documenting all loaded R packages and their versions. Since your analysis code resides in a dedicated package with its own version number, this also documents the code used to create the git2rdata objects. We strongly recommend running the import from a fresh R session, so that the sessionInfo() at commit time is limited to those packages which are strictly required for the import. Consider running the import from the command line, e.g. Rscript -e 'mypackage::import_body_temp("path/to/root")'.
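
A sketch of what such a function could look like (the roxygen2 tags, the explicit datasets:: references and the commit message are illustrative; the name import_body_temp matches the Rscript call above):

#' Import the beaver body temperature data as a git2rdata object
#' @param path the root of the git repository holding the data
#' @importFrom git2rdata repository pull rm_data write_vc prune_meta commit push
#' @export
import_body_temp <- function(path) {
  repo <- repository(path)
  pull(repo)
  # clear the data files, keeping the metadata
  rm_data(repo, path = ".")
  # recreate the combined dataset
  beaver1 <- datasets::beaver1
  beaver2 <- datasets::beaver2
  beaver1$beaver <- 1
  beaver2$beaver <- 2
  body_temp <- rbind(beaver1, beaver2)
  write_vc(
    body_temp, "beaver", repo, sorting = c("beaver", "time"), stage = TRUE
  )
  # remove leftover metadata of objects that were not recreated
  prune_meta(repo, path = ".")
  # session = TRUE appends the sessionInfo() to the commit message
  commit(
    repo, message = "Import body temperature data", all = TRUE, session = TRUE
  )
  push(repo)
}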

Analysis Workflow with Reproducible Data

The example below is a small, trivial standardized analysis in which the source of the data is documented by the name of the data, the repository URL and the commit. We can use this information when reporting the results, which makes the data underlying the results traceable.
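
The helper functions used below could be sketched as follows (the names analysis() and report(), the linear model and the plain-text report format are assumptions; the exact formatting that produced the output shown further down may differ):

# fit a model on one git2rdata object and record its provenance
analysis <- function(ds_name, repo) {
  ds <- read_vc(ds_name, repo)
  list(
    dataset = ds_name,
    repository = git2r::remote_url(repo),
    commit = recent_commit(ds_name, repo, data = TRUE),
    model = lm(temp ~ activ, data = ds)
  )
}
# turn one analysis into a small plain-text report
report <- function(x) {
  c(
    sprintf("dataset: %s", x$dataset),
    sprintf("commit: %s", x$commit$commit),
    sprintf("repository: %s", x$repository),
    capture.output(coef(summary(x$model)))
  )
}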

In this case we can run every analysis by looping over the list of datasets in the repository.

repo <- repository(path)
current <- lapply(list_data(repo), analysis, repo = repo)
names(current) <- list_data(repo)
result <- lapply(current, report)
junk <- lapply(result, print)
dataset: beaver
commit: 6c4ba3b308dee1c7d91ea9f428e452486ff02c23
repository: /var/folders/nz/vv4_9tw56nv9k3tkvyszvwg80000gn/T//RtmpApJ1v1/git2rdata-workflow-remote1516c7ec3733
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) 36.9084247  0.0198546 1858.93938        0
activ        0.9346636  0.0352219   26.53644        0

The example below does exactly the same thing for the first and second commit.

# checkout first commit
git2r::checkout(cm1)
# do analysis
previous <- lapply(list_data(repo), analysis, repo = repo)
names(previous) <- list_data(repo)
result <- lapply(previous, report)
junk <- lapply(result, print)
dataset: beaver
commit: 30f1b6e7ff0765a154572d308bb638b73825d90c
repository: /var/folders/nz/vv4_9tw56nv9k3tkvyszvwg80000gn/T//RtmpApJ1v1/git2rdata-workflow-remote1516c7ec3733
              Estimate Std. Error     t value Pr(>|t|)
(Intercept) 36.8421296  0.0167694 2196.987569    0e+00
activ        0.3812037  0.0730961    5.215107    8e-07
# checkout second commit
git2r::checkout(cm2)
# do analysis
previous <- lapply(list_data(repo), analysis, repo = repo)
names(previous) <- list_data(repo)
result <- lapply(previous, report)
junk <- lapply(result, print)
dataset: beaver
commit: 30f1b6e7ff0765a154572d308bb638b73825d90c
repository: /var/folders/nz/vv4_9tw56nv9k3tkvyszvwg80000gn/T//RtmpApJ1v1/git2rdata-workflow-remote1516c7ec3733
              Estimate Std. Error     t value Pr(>|t|)
(Intercept) 36.8421296  0.0167694 2196.987569    0e+00
activ        0.3812037  0.0730961    5.215107    8e-07
dataset: extra_beaver
commit: 0d69367112ee05455c4b1c028c33474191ab99eb
repository: /var/folders/nz/vv4_9tw56nv9k3tkvyszvwg80000gn/T//RtmpApJ1v1/git2rdata-workflow-remote1516c7ec3733
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) 37.0968421  0.0345624 1073.32955        0
activ        0.8062224  0.0438943   18.36736        0

If you inspect the reported results carefully, you’ll notice that all the output (coefficients and commit hash) for the “beaver” object is identical for the first and the second commit. This makes sense, since the “beaver” object didn’t change in the second commit. The output for the current (third) commit is different because the dataset changed.

Long Running Analysis

Imagine the case where an individual analysis takes quite a while to run. We store the most recent version of each analysis and add the information from recent_commit(). When preparing the analysis, you can run recent_commit() again on the dataset and compare the commit hash with that of the currently available analysis. If the commit hashes match, the data hasn’t changed, so there is no need to rerun the analysis1, saving valuable computing resources and time.
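
A minimal sketch of such a check, assuming the stored analysis object keeps the output of recent_commit() as in the analysis() sketch above (update_analysis() is a hypothetical helper):

# rerun the analysis only when the underlying data changed
update_analysis <- function(ds_name, repo, previous) {
  latest <- recent_commit(ds_name, repo, data = TRUE)
  if (latest$commit == previous$commit$commit) {
    return(previous)  # same commit hash: reuse the stored analysis
  }
  analysis(ds_name, repo = repo)  # data changed: rerun
}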


  1. assuming the code for running the analysis didn’t change.