Introduction
git2rdata supports extra metadata since version 0.4.1. Metadata is stored in a separate file with the same name as the data file, but with the extension .yml. The metadata file is a YAML file with a specific structure. It contains a generic section and a section for each field in the data file. The generic section contains information about the data file as a whole. The field sections contain information about the individual fields in the data file. The metadata file is stored in the same directory as the data file.
The generic section contains the following mandatory properties, automatically created by git2rdata:
- git2rdata: the version of git2rdata used to create the metadata.
- datahash: the hash of the data file.
- hash: the hash of the metadata file.
- optimize: a logical indicating whether the data file is optimized for git2rdata.
- sorting: a character vector with the names of the fields used to sort the data file.
- split_by: a character vector with the names of the fields used to split the data file.
- NA string: the string used to represent missing values in the data file.
The generic section can contain the following optional properties:
- table name: the name of the dataset.
- title: the title of the dataset.
- description: a description of the dataset.
The field sections contain the following mandatory properties, automatically created by git2rdata:
- type: the type of the field.
- class: the class of the field.
- levels: the levels of the field (for factors).
- index: the index of the field (for factors).
- NA string: the string used to represent missing values in the field.
The field sections can contain the following optional properties:
- description: a description of the field.
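To see how these sections look on disk, the metadata file can be printed as plain text. The sketch below is a minimal illustration using only write_vc() and base R. The directory name in root is arbitrary, and the iris.yml file name is assumed from the naming rule described above.

```r
library(git2rdata)

# Hypothetical scratch directory for this sketch.
root <- tempfile("git2rdata-structure")
dir.create(root)

# Write the data; this creates a data file and iris.yml in root.
# Sorting on a single variable with ties triggers a warning, silenced here.
suppressWarnings(
  write_vc(iris, file = "iris", root = root, sorting = "Sepal.Length")
)

# The metadata file shares the data file's name, with the .yml extension.
meta_file <- file.path(root, "iris.yml")

# Print the raw YAML to inspect the generic section and the field sections.
writeLines(readLines(meta_file))
```

Reading the file as text avoids any assumption about the exact key names inside the YAML file.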
Adding metadata
write_vc() only stores the mandatory properties in the metadata file.
library(git2rdata)
root <- tempfile("git2rdata-metadata")
dir.create(root)
write_vc(iris, file = "iris", root = root, sorting = "Sepal.Length")
## Warning: Sorting on 'Sepal.Length' results in ties.
## Add extra sorting variables to ensure small diffs.
## 9282fad022c924c16a76bd8b3c174e71fc4515fe
## "iris.tsv"
## 0d434e56d22a710c99c5b912e8624d52abd41aaf
## "iris.yml"
Reading metadata
read_vc() reads the metadata file and adds it as attributes to the data.frame. print() and summary() alert the user to the display_metadata() function. This function displays the metadata of a git2rdata object. Missing optional metadata results in an NA value in the output of display_metadata().
my_iris <- read_vc("iris", root = root)
str(my_iris)
## Classes 'git2rdata' and 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 4.3 4.4 4.4 4.4 4.5 4.6 4.6 4.6 4.6 4.7 ...
## $ Sepal.Width : num 3 2.9 3 3.2 2.3 3.1 3.4 3.6 3.2 3.2 ...
## $ Petal.Length: num 1.1 1.4 1.3 1.3 1.3 1.5 1.4 1 1.4 1.3 ...
## $ Petal.Width : num 0.1 0.2 0.2 0.2 0.3 0.2 0.3 0.2 0.2 0.2 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "source")= Named chr [1:2] "/tmp/Rtmpk3umea/git2rdata-metadata170c5e01e260/iris.tsv" "/tmp/Rtmpk3umea/git2rdata-metadata170c5e01e260/iris.yml"
## ..- attr(*, "names")= chr [1:2] "9282fad022c924c16a76bd8b3c174e71fc4515fe" "0d434e56d22a710c99c5b912e8624d52abd41aaf"
## - attr(*, "optimize")= logi TRUE
## - attr(*, "sorting")= chr "Sepal.Length"
print(head(my_iris))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.3 3.0 1.1 0.1 setosa
## 2 4.4 2.9 1.4 0.2 setosa
## 3 4.4 3.0 1.3 0.2 setosa
## 4 4.4 3.2 1.3 0.2 setosa
## 5 4.5 2.3 1.3 0.3 setosa
## 6 4.6 3.1 1.5 0.2 setosa
##
## Use `display_metadata()` to view the metadata.
summary(my_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
##
## Use `display_metadata()` to view the metadata.
display_metadata(my_iris)
## Table name: NA
## Title: NA
## Description: NA
## Path: /tmp/Rtmpk3umea/git2rdata-metadata170c5e01e260/iris.tsv (9282fad022c924c16a76bd8b3c174e71fc4515fe), /tmp/Rtmpk3umea/git2rdata-metadata170c5e01e260/iris.yml (0d434e56d22a710c99c5b912e8624d52abd41aaf)
## Sorting order: Sepal.Length
## Optimized storage: TRUE
## Variables:
## - Sepal.Length: NA
## - Sepal.Width: NA
## - Petal.Length: NA
## - Petal.Width: NA
## - Species: NA
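Because the metadata is attached as attributes, individual properties can also be read with attr(). The following is a minimal sketch that assumes the attribute names shown in the str() output above ("sorting", "optimize" and "source"). The directory name is arbitrary.

```r
library(git2rdata)

# Hypothetical scratch directory for this sketch.
root <- tempfile("git2rdata-attributes")
dir.create(root)
suppressWarnings(
  write_vc(iris, file = "iris", root = root, sorting = "Sepal.Length")
)
my_iris <- read_vc("iris", root = root)

# Attribute names as they appear in the str() output of the object.
attr(my_iris, "sorting")
attr(my_iris, "optimize")

# Paths of the data file and the metadata file, named by their hashes.
attr(my_iris, "source")
```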
Updating the optional metadata
To add metadata to a git2rdata object, use the update_metadata() function. This function allows you to add or update the optional metadata of a git2rdata object. Setting an argument to NA or an empty string will remove the corresponding property from the metadata. The function only updates the metadata file, not the data file. To see the changes, read the object again before using display_metadata(). Note that all the metadata is available in the data.frame as attributes.
update_metadata(
  file = "iris", root = root, name = "iris", title = "Iris dataset",
  description =
"The Iris dataset is a multivariate dataset introduced by the British
statistician and biologist Ronald Fisher in his 1936 paper The use of multiple
measurements in taxonomic problems.",
  field_description = c(
    Sepal.Length = "The length of the sepal in cm",
    Sepal.Width = "The width of the sepal in cm",
    Petal.Length = "The length of the petal in cm",
    Petal.Width = "The width of the petal in cm",
    Species = "The species of the iris"
  )
)
my_iris <- read_vc("iris", root = root)
display_metadata(my_iris)
## Table name: iris
## Title: Iris dataset
## Description: The Iris dataset is a multivariate dataset introduced by the British
## statistician and biologist Ronald Fisher in his 1936 paper The use of multiple
## measurements in taxonomic problems.
## Path: /tmp/Rtmpk3umea/git2rdata-metadata170c5e01e260/iris.tsv (9282fad022c924c16a76bd8b3c174e71fc4515fe), /tmp/Rtmpk3umea/git2rdata-metadata170c5e01e260/iris.yml (1d7f38f89830a2b7996e759044d7dfd59e5da4f6)
## Sorting order: Sepal.Length
## Optimized storage: TRUE
## Variables:
## - Sepal.Length: The length of the sepal in cm
## - Sepal.Width: The width of the sepal in cm
## - Petal.Length: The length of the petal in cm
## - Petal.Width: The width of the petal in cm
## - Species: The species of the iris
str(my_iris)
## Classes 'git2rdata' and 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 4.3 4.4 4.4 4.4 4.5 4.6 4.6 4.6 4.6 4.7 ...
## ..- attr(*, "description")= chr "The length of the sepal in cm"
## $ Sepal.Width : num 3 2.9 3 3.2 2.3 3.1 3.4 3.6 3.2 3.2 ...
## ..- attr(*, "description")= chr "The width of the sepal in cm"
## $ Petal.Length: num 1.1 1.4 1.3 1.3 1.3 1.5 1.4 1 1.4 1.3 ...
## ..- attr(*, "description")= chr "The length of the petal in cm"
## $ Petal.Width : num 0.1 0.2 0.2 0.2 0.3 0.2 0.3 0.2 0.2 0.2 ...
## ..- attr(*, "description")= chr "The width of the petal in cm"
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "description")= chr "The species of the iris"
## - attr(*, "source")= Named chr [1:2] "/tmp/Rtmpk3umea/git2rdata-metadata170c5e01e260/iris.tsv" "/tmp/Rtmpk3umea/git2rdata-metadata170c5e01e260/iris.yml"
## ..- attr(*, "names")= chr [1:2] "9282fad022c924c16a76bd8b3c174e71fc4515fe" "1d7f38f89830a2b7996e759044d7dfd59e5da4f6"
## - attr(*, "table name")= chr "iris"
## - attr(*, "title")= chr "Iris dataset"
## - attr(*, "description")= chr "The Iris dataset is a multivariate dataset introduced by the British\nstatistician and biologist Ronald Fisher "| __truncated__
## - attr(*, "optimize")= logi TRUE
## - attr(*, "sorting")= chr "Sepal.Length"
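Removing a property works the same way as adding one. The sketch below sets a title and then removes it again with an empty string, as described above. It assumes that update_metadata() accepts each optional argument on its own, without the others. The directory name and the object names with_title and without_title are arbitrary.

```r
library(git2rdata)

# Hypothetical scratch directory for this sketch.
root <- tempfile("git2rdata-remove")
dir.create(root)
suppressWarnings(
  write_vc(iris, file = "iris", root = root, sorting = "Sepal.Length")
)

# Add a title, then read the object again to pick up the change.
update_metadata(file = "iris", root = root, title = "Iris dataset")
with_title <- read_vc("iris", root = root)
attr(with_title, "title")

# Remove the title again by passing an empty string.
update_metadata(file = "iris", root = root, title = "")
without_title <- read_vc("iris", root = root)
display_metadata(without_title)
```

Reading the object again after each call matters, because update_metadata() changes only the metadata file and not an object already in memory.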