## Introduction

This vignette motivates why we wrote git2rdata and illustrates how you can use it to store dataframes as plain text files.

### Maintaining variable classes

R has several options to store dataframes as plain text files from R. Base R has write.table() and its companions like write.csv(). Some other options are data.table::fwrite(), readr::write_delim(), readr::write_csv() and readr::write_tsv(). Each of them writes a dataframe as a plain text file by converting all variables into characters. After reading the file, the conversion is reversed. However, the distinction between character and factor is lost in translation. read.table() converts by default all strings to factors, readr::read_csv() keeps by default all strings as character. The factor levels are another thing which is lost. These functions determine factor levels based on the observed levels in the plain text file. Hence factor levels without observations will disappear. The order of the factor levels is also determined by the available levels in the plain text file, which can be different from the original order.

The write_vc() and read_vc() functions from git2rdata keep track of the class of each variable and, in case of a factor, also of the factor levels and their order. Hence this function pair preserves the information content of the dataframe. The vc suffix stands for version control as these functions use their full capacity in combination with a version control system. Efficiency in terms of storage and time ### Optimizing file storage

Plain text files require more disk space than binary files. This is the price we have to pay for a readable file format. The default option of write_vc() is to minimize file size as much as possible prior to writing. Since we use a tab delimited file format, we can omit quotes around character variables. This saves 2 bytes per row for each character variable. Quotes are added automatically in the exceptional cases when they are needed, e.g. to store a string that contains tab or newline characters. In such cases, quotes are only used in row-variable combinations where the exception occurs.

Since we store the class of each variable, further file size reductions can be achieved by following rules:

• logical is written as 0 (FALSE), 1 (TRUE) or NA to the data
• factor is stored as its indices in the data. The index and labels of levels and their order are stored in the metadata.
• POSIXct is written as a numeric to the data. The class and the origin are stored in the metadata. Timestamps are always stored and returned as UTC.
• Date is written as an integer to the data. The class and the origin are stored in the metadata.

Storing the factors, POSIXct and Date as their index, makes them less user readable. The user can turn off this optimization when user readability is more important than file size.

### Optimized for version control

Another main goal of git2rdata is to optimise the storage of the plain text files under version control. write_vc() and read_vc() has methods for interacting with git repositories using the git2r framework. Users who want to use git without git2r or use a different version control system (e.g. Subversion, Mercurial), still can use git2rdata to write the files to disk and uses their preferred workflow on version control.

Hence, write_vc() will always perform checks to look for changes which potentially lead to large diffs. More details on this in vignette("version_control", package = "git2rdata"). Some problems will always yield a warning. Other problems will yield an error by default. The user can turn these errors into warnings by setting the strict = FALSE argument.

As this vignette ignores the part on version control, we will always use write_vc(strict = FALSE) and hide the warnings to improve the readability.

## Basic Usage

Let’s start by setting up the environment. We need a directory to store the data and a dataframe to store.

# Create a directory in tempdir
path <- tempfile(pattern = "git2r-")
dir.create(path)
# Create dummy data
set.seed(20190222)
x <- data.frame(
x = sample(LETTERS),
y = factor(
sample(c("a", "b", NA), 26, replace = TRUE),
levels = c("a", "b", "c")
),
z = c(NA, 1:25),
abc = c(rnorm(25), NA),
def = sample(c(TRUE, FALSE, NA), 26, replace = TRUE),
timestamp = seq(
as.POSIXct("2018-01-01"),
as.POSIXct("2019-01-01"),
length = 26
),
stringsAsFactors = FALSE
)
str(x)
#> 'data.frame':    26 obs. of  6 variables:
#>  $x : chr "V" "U" "Z" "W" ... #>$ y        : Factor w/ 3 levels "a","b","c": 1 2 NA NA 1 NA 2 1 NA 1 ...
#>  $z : int NA 1 2 3 4 5 6 7 8 9 ... #>$ abc      : num  -0.382 -0.42 -0.917 0.387 -0.992 ...
#>  $def : logi TRUE FALSE NA FALSE NA NA ... #>$ timestamp: POSIXct, format: "2018-01-01 00:00:00" "2018-01-15 14:24:00" ...

## Storing Optimized

Use write_vc() to store the dataframe. The root argument refers to the base directory where the data is stored. The file argument is used as the base name of the files. The data file gets a .tsv extension, the metadata file a .yml extension. file can include a relative path starting from root.

library(git2rdata)
write_vc(x = x, file = "first_test", root = path, strict = FALSE)
#> aae8d62188d95d09dd4a12db05212f2d0f180b40 c209fccae21b2f38a75f250a2c630e43c0b1fdf6
#>                         "first_test.tsv"                         "first_test.yml"

write_vc() returns a vector of relative paths to the raw data and metadata files. The hashes of these files are used as names of the vector. We can have a look at both files. We’ll only display the first 10 rows of the raw data. Notice that the YAML format of the metadata has the benefit of being both human and machine readable.

print_file <- function(file, root, n = -1) {
fn <- file.path(root, file)
data <- readLines(fn, n = n)
cat(data, sep = "\n")
}
print_file("first_test.tsv", path, 10)
#> x    y   z   abc def timestamp
#> V    1   NA  -0.382010380419258  1   1514764800
#> U    2   1   -0.420347607856041  0   1516026240
#> Z    NA  2   -0.916731402237418  NA  1517287680
#> W    NA  3   0.387455128525654   0   1518549120
#> L    1   4   -0.992354993526956  NA  1519810560
#> C    NA  5   0.0228713954429028  NA  1521072000
#> R    2   6   -0.947557467717088  1   1522333440
#> S    1   7   -0.16302914628615   NA  1523594880
#> O    NA  8   0.523643352634392   1   1524856320
print_file("first_test.yml", path)
#> ..generic:
#>   git2rdata: 0.1.0.9000
#>   optimize: yes
#>   NA string: NA
#>   hash: f8350dc218051af4bafcd8872d92b1a29cbb4f31
#>   data_hash: aae8d62188d95d09dd4a12db05212f2d0f180b40
#> x:
#>   class: character
#> 'y':
#>   class: factor
#>   labels:
#>   - a
#>   - b
#>   - c
#>   index:
#>   - 1
#>   - 2
#>   - 3
#>   ordered: no
#> z:
#>   class: integer
#> abc:
#>   class: numeric
#> def:
#>   class: logical
#> timestamp:
#>   class: POSIXct
#>   origin: 1970-01-01 00:00:00
#>   timezone: UTC

## Storing Verbose

Adding optimize = FALSE to write_vc() will keep the raw data in a human readable format. The metadata file is slightly different. The most obvious is the optimize: no tag and the different hash. Another difference is the metadata for POSIXct and Date classes. They will no longer have an origin tag but a format tag.

write_vc(x = x, file = "verbose", root = path, optimize = FALSE, strict = FALSE)
#> 335cca97a0e4b36bf8312f5b1f24068d20b693c4 59104897699409d0da2d884896db1ca909dffb27
#>                            "verbose.tsv"                            "verbose.yml"
print_file("verbose.tsv", path, 10)
#> x    y   z   abc def timestamp
#> V    a   NA  -0.382010380419258  TRUE    2018-01-01T00:00:00Z
#> U    b   1   -0.420347607856041  FALSE   2018-01-15T14:24:00Z
#> Z    NA  2   -0.916731402237418  NA  2018-01-30T04:48:00Z
#> W    NA  3   0.387455128525654   FALSE   2018-02-13T19:12:00Z
#> L    a   4   -0.992354993526956  NA  2018-02-28T09:36:00Z
#> C    NA  5   0.0228713954429028  NA  2018-03-15T00:00:00Z
#> R    b   6   -0.947557467717088  TRUE    2018-03-29T14:24:00Z
#> S    a   7   -0.16302914628615   NA  2018-04-13T04:48:00Z
#> O    NA  8   0.523643352634392   TRUE    2018-04-27T19:12:00Z
print_file("verbose.yml", path)
#> ..generic:
#>   git2rdata: 0.1.0.9000
#>   optimize: no
#>   NA string: NA
#>   hash: 9fc40476f0ba0c4cb4225d5313d56760a6d3b065
#>   data_hash: 335cca97a0e4b36bf8312f5b1f24068d20b693c4
#> x:
#>   class: character
#> 'y':
#>   class: factor
#>   labels:
#>   - a
#>   - b
#>   - c
#>   index:
#>   - 1
#>   - 2
#>   - 3
#>   ordered: no
#> z:
#>   class: integer
#> abc:
#>   class: numeric
#> def:
#>   class: logical
#> timestamp:
#>   class: POSIXct
#>   format: '%Y-%m-%dT%H:%M:%SZ'
#>   timezone: UTC

## Efficiency in Terms of File Storage

Storing dataframes optimized or verbose has an impact on the required file size. A comparison can be found in the efficiency vignette.

## Reading Data

The data can be retrieved with read_vc(). This function will reinstate the variables to their original state.

y <- read_vc(file = "first_test", root = path)
all.equal(x, y, check.attributes = FALSE)
#> [1] TRUE
y2 <- read_vc(file = "verbose", root = path)
all.equal(x, y2, check.attributes = FALSE)
#> [1] TRUE

As read_vc() requires the meta data, it can only read dataframes which were stored by write_vc().

## Missing Values

write_vc() has an na argument which specifies the string which is used to indicate missing values. Because we avoid using quotes, this string must be different from any character value in the data. This includes factor labels when the data is stored verbose. This is checked and will always return an error, even with strict = FALSE.

write_vc(x, "custom_na", path, strict = FALSE, na = "X", optimize = FALSE)
#> Error: one of the strings matches the NA string ('X')
#> Please use a different NA string or consider using a factor.
write_vc(x, "custom_na", path, strict = FALSE, na = "b", optimize = FALSE)
#> Error: one of the levels matches the NA string ('b').
#> Please use a different NA string or use optimize = TRUE
write_vc(x, "custom_na", path, strict = FALSE, na = "X")
#> Error: one of the strings matches the NA string ('X')
#> Please use a different NA string or consider using a factor.
write_vc(x, "custom_na", path, strict = FALSE, na = "b")
#> fe59643653a210bf8818866615db42052406cc19 c05270c916255ddf2c239aa255d94e66ba9bf877
#>                          "custom_na.tsv"                          "custom_na.yml"

Please note that a single NA string is used for the entire dataset, thus for every variable.

print_file("custom_na.tsv", path, 10)
#> x    y   z   abc def timestamp
#> V    1   b   -0.382010380419258  1   1514764800
#> U    2   1   -0.420347607856041  0   1516026240
#> Z    b   2   -0.916731402237418  b   1517287680
#> W    b   3   0.387455128525654   0   1518549120
#> L    1   4   -0.992354993526956  b   1519810560
#> C    b   5   0.0228713954429028  b   1521072000
#> R    2   6   -0.947557467717088  1   1522333440
#> S    1   7   -0.16302914628615   b   1523594880
#> O    b   8   0.523643352634392   1   1524856320
print_file("custom_na.yml", path, 4)
#> ..generic:
#>   git2rdata: 0.1.0.9000
#>   optimize: yes
#>   NA string: b

The default string for missing values is "NA". We recommend to keep this default, as long as the dataset permits it. A first good alternative is an empty string (""). If that won’t work either, you’ll have to use your imagination. Try to keep it short, clear and robust1.

write_vc(x, "custom_na", path, strict = FALSE, na = "")
#> 5110e7a69abf23c85768ecf695dc300a75f49622 f6912a8f9066e62bab4de998f5618efe3ad34902
#>                          "custom_na.tsv"                          "custom_na.yml"

1. robust in the sense that you won’t need to change it later