R offers several ways to store dataframes as plain text files. Base R has
`write.table()` and its companions like
`write.csv()`. Other options include
`readr::write_tsv()`. Each of them writes a dataframe as a plain text file by converting all variables into characters. After reading the file, the conversion is reversed. However, the distinction between
`character` and `factor` is lost in translation.
`read.table()` converted all strings to factors by default before R 4.0.0 and has kept them as character since then, while
`readr::read_csv()` keeps all strings as character by default. Either way, the file itself does not record which of the two classes a variable had. The factor levels are lost as well. These functions determine the factor levels from the levels observed in the plain text file, so levels without observations disappear. The order of the levels is also derived from the observed values and can differ from the original order.
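
The loss of factor levels can be reproduced with base R alone. The sketch below is an illustration, not part of the original text: the dataframe `original` and its `size` variable are made up for the example. It writes a factor with an unobserved level and a non-alphabetical level order to a CSV file and reads it back as a factor.

```r
# Hypothetical example data: "large" is a level without observations and
# the level order (small, medium, large) is not alphabetical.
original <- data.frame(
  id = 1:3,
  size = factor(
    c("medium", "small", "medium"),
    levels = c("small", "medium", "large")
  )
)

path <- tempfile(fileext = ".csv")
write.csv(original, path, row.names = FALSE)

# Ask explicitly for factors so the result does not depend on the R version.
as_factor <- read.csv(path, stringsAsFactors = TRUE)

levels(original$size)   # "small" "medium" "large"
levels(as_factor$size)  # "medium" "small"
```

The level `large` is gone and the remaining levels are sorted alphabetically, because the reader only sees the values present in the file.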

The `write_vc()` and `read_vc()` functions from
`git2rdata` keep track of the class of each variable and, in case of a factor, also of the factor levels and their order. Hence this function pair preserves the information content of the dataframe. The
`vc` suffix stands for version control, as these functions reach their full potential in combination with a version control system.

## Efficiency in terms of storage and time

### Optimizing file storage

Plain text files require more disk space than binary files. This is the price we have to pay for a readable file format. By default,
`write_vc()` minimizes the file size as much as possible prior to writing. Since we use a tab-delimited file format, we can omit the quotes around character variables. This saves 2 bytes per row for each character variable. Quotes are added automatically in the exceptional cases where they are needed, e.g. to store a string that contains tab or newline characters. In such cases, quotes are only used for the row-variable combinations where the exception occurs.
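
The saving of 2 bytes per row can be checked with base R. This is only an illustration using `write.table()`, not the code `write_vc()` runs internally; the dataframe `df` is hypothetical and the header is left out so that only the data rows are compared.

```r
# Hypothetical data: one character variable and one integer variable.
df <- data.frame(
  label = c("a", "b", "c"),
  value = 1:3,
  stringsAsFactors = FALSE
)

quoted <- tempfile(fileext = ".tsv")
unquoted <- tempfile(fileext = ".tsv")

# Same tab-delimited content, with and without quotes around strings.
write.table(df, quoted, sep = "\t", quote = TRUE,
            row.names = FALSE, col.names = FALSE)
write.table(df, unquoted, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Difference in bytes: 2 per row for the single character variable.
saved <- file.size(quoted) - file.size(unquoted)
saved == 2 * nrow(df)
```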
Since we store the class of each variable, the following rules reduce the file size further:

- `logical` is written as 0 (FALSE), 1 (TRUE) or NA to the data.
- `factor` is stored as its indices in the data. The index and labels of the levels and their order are stored in the metadata.
- `POSIXct` is written as a numeric to the data. The class and the origin are stored in the metadata. Timestamps are always stored and returned as UTC.
- `Date` is written as an integer to the data. The class and the origin are stored in the metadata.

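
The rules above can be mimicked in base R to see which numbers end up in the data and what is needed to restore the original values. This is a sketch of the idea, not the package's implementation. It assumes an origin of 1970-01-01, and all object names are hypothetical.

```r
# logical: stored as 0, 1 or NA
flag <- c(TRUE, FALSE, NA)
flag_stored <- as.integer(flag)

# factor: the data holds the indices, the labels and their order are metadata
size <- factor(c("medium", "small"), levels = c("small", "medium", "large"))
size_index <- as.integer(size)
size_labels <- levels(size)
size_restored <- factor(size_labels[size_index], levels = size_labels)

# POSIXct: seconds since the origin, handled as UTC
stamp <- as.POSIXct("2020-01-01 12:00:00", tz = "UTC")
stamp_stored <- as.numeric(stamp)
stamp_restored <- as.POSIXct(stamp_stored, origin = "1970-01-01", tz = "UTC")

# Date: days since the origin
day <- as.Date("2020-01-01")
day_stored <- as.integer(day)
day_restored <- as.Date(day_stored, origin = "1970-01-01")
```

In each case the stored value is a plain number, and the class, labels or origin kept alongside it are enough to rebuild the original variable.
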

Storing `factor`, `POSIXct` and `Date` variables as numbers makes the data less readable for the user. The user can turn off this optimization when readability is more important than file size.
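
A minimal sketch of a round trip with the optimization turned off follows. It assumes that `write_vc()` accepts an `optimize` argument, a `sorting` argument and a plain directory path as `root`, as described in the `git2rdata` reference documentation. The dataframe `sizes` and the file name are hypothetical, and the code only runs when the package is installed.

```r
if (requireNamespace("git2rdata", quietly = TRUE)) {
  # Hypothetical data with an unobserved level and a custom level order.
  sizes <- data.frame(
    id = 1:3,
    size = factor(
      c("medium", "small", "medium"),
      levels = c("small", "medium", "large")
    )
  )

  # A plain directory serves as root here; a git repository is not required.
  root <- tempfile("git2rdata-demo")
  dir.create(root)

  # optimize = FALSE favours readability of the data file over its size.
  git2rdata::write_vc(
    sizes, file = "sizes", root = root, sorting = "id", optimize = FALSE
  )
  restored <- git2rdata::read_vc(file = "sizes", root = root)

  # Class, levels and level order should survive the round trip.
  levels(restored$size)
}
```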