The jst_get_* functions from jstor all work on a single file. Data from DfR, however, contains many single files: up to 25,000 when using the self-service functions, and up to several hundreds of thousands when requesting a large dataset via an agreement.
jstor offers two implementations to import many files: jst_import_zip() and jst_import(). The first one lets you import data directly from zip archives; the second works on file paths, so you need to unzip the archive first. I will first introduce jst_import_zip() and discuss the approach with jst_import() afterwards.
Unpacking and working with many files directly is impractical for at least three reasons. One of them: before importing, you need to list all files, for example via list.files() or system("find...") on UNIX. Depending on the size of your data, this can take some time.
Importing directly from the zip archive simplifies all those tasks with a single function: jst_import_zip(). For the following demonstration, we will use a small sample archive that comes with the package.
As a first step, we should take a look at the archive and its content. This is made easy with jst_preview_zip():
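A minimal sketch of this preview step, assuming the sample archive is located with jst_example() as in the import call further below (the names archive and preview are only illustrative):

```r
library(jstor)

# path to the sample archive that ships with the package
archive <- jst_example("pseudo_dfr.zip")

# summarise the content of the archive without unpacking it
preview <- jst_preview_zip(archive)
preview
```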
We can see that we have a simple archive with three metadata files and one ngram file. Before we can use jst_import_zip(), we first need to think about what we actually want to import: all of the data, or just parts? What kind of data do we want to extract from articles, books and pamphlets? We can specify this via jst_define_import():
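Such a specification could look like the following sketch. The name import_spec is the one used in the call to jst_import_zip() further down; the mapping of functions to document types is an assumption derived from the processing messages shown there.

```r
library(jstor)

# one entry per document type; several functions can be combined with c()
import_spec <- jst_define_import(
  article = c(jst_get_article, jst_get_authors),  # metadata plus authors
  book = jst_get_book,                            # general data on books
  ngram1 = jst_get_ngram                          # unigrams
)
import_spec
```

Types that are not mentioned in the specification are simply not imported.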
In this case, we want to import data on articles (standard metadata plus information on the authors), general data on books, and unigrams (ngram1). This specification can then be used with jst_import_zip():
```r
# set up a temporary folder for output files
tmp <- tempdir()

# extract the content and write output to disk
jst_import_zip(jst_example("pseudo_dfr.zip"),
               import_spec = import_spec,
               out_file = "my_test",
               out_path = tmp)
#> Processing files for book_chapter with functions jst_get_book
#> Processing files for journal_article with functions jst_get_article, jst_get_authors
#> Processing files for ngram1 with functions jst_get_ngram
```
We can take a look at the files within tmp with list.files():
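A short sketch of that listing. The pattern "my_test" assumes the out_file prefix from the call above; output_files is just an illustrative name.

```r
# tempdir() returns the same folder for the whole R session
tmp <- tempdir()

# list the output files whose names contain the out_file prefix
output_files <- list.files(tmp, pattern = "my_test")
output_files
```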
As you can see, jst_import_zip() automatically creates file names based on the string you supplied to out_file to delineate the different types of output.
If we want to re-import the data for further analysis, we can either use functions like readr::read_csv(), or jst_re_import(), a small helper function from the package which determines and sets the column types correctly:
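The following sketch re-imports the article metadata with jst_re_import(). How the file is located (searching the temporary folder for a name containing "jst_get_article") is an assumption about the naming scheme, so adjust it to the files you actually find on disk.

```r
library(jstor)

tmp <- tempdir()

# assumption: the output file for article metadata carries the function name
article_files <- list.files(tmp, pattern = "jst_get_article",
                            full.names = TRUE)

# re-import the first matching file, if the import above has been run
if (length(article_files) > 0) {
  articles <- jst_re_import(article_files[1])
  print(articles)
}
```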
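| file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| journal-article-standard_case | NA | kewbulletin | NA | Kew Bulletin | 10.2307/4117222 | NA | NA | research-article | Two New Species of Ischaemum | 5 | 2 | eng | 1 | 1 | 1950 | 187 | 188 | 187-188 |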
A side note on ngrams: for larger archives, importing all ngrams can take a very long time. It is thus advisable to only import ngrams for articles which you want to analyse, i.e. most likely a subset of the initial request. The function jst_subset_ngrams() helps you with this (see also the section on importing bigrams in the case study).
Since the above process might take a while for larger archives (files have to be unpacked, read and parsed), it can pay off to execute the function in parallel. Both jst_import() and jst_import_zip() use furrr at their core, so adding parallelism is very easy. Just add the following lines at the beginning of your script, and the import will use all available cores:
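A sketch of these lines, assuming a current version of the future package. plan(multisession) is my choice here because it works on all platforms; older scripts may use a different plan.

```r
library(future)

# run futures in parallel background R sessions, by default one per
# available core
plan(multisession)
```

To return to sequential processing later in the script, call plan(sequential).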
You can find out more about futures by reading the vignettes of the future package.
The above approach of importing directly from zip archives is very convenient, but in some cases you might want more control over how data is imported. For example, if you run into problems because the output from any of the functions provided with the package looks corrupted, you might want to look at the original files. For this, you can unzip the archive and work with the files directly, which I will demonstrate in the following sections.
For simple purposes it might be sensible to unzip to a temporary directory (with unzip()), but for my research I simply extracted files to an external SSD, since I a) lacked disk space, b) needed to read them fast, and c) wanted to be able to look at specific files for debugging.
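For the simple case, unzipping to a temporary directory could be sketched as follows. The sample archive and the folder name dfr_unzipped are only illustrative; for a real dataset you would point unzip() at your own archive and target location.

```r
library(jstor)

# sample archive shipped with the package, standing in for a real download
archive <- jst_example("pseudo_dfr.zip")

# hypothetical target folder inside the session's temporary directory
unzip_dir <- file.path(tempdir(), "dfr_unzipped")

# extract all files, keeping the folder structure of the archive
unzip(archive, exdir = unzip_dir)

# inspect what was extracted
unzipped_files <- list.files(unzip_dir, recursive = TRUE)
head(unzipped_files)
```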
For demonstration purposes I use files contained in jstor, which can be accessed via system.file():
```r
# locate the example files shipped with the package
example_dir <- system.file("extdata", package = "jstor")

# list all XML files with their full paths; pattern is a regular expression
file_names_listed <- list.files(path = example_dir, full.names = TRUE,
                                pattern = "\\.xml$")
file_names_listed
#> [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/jstor/extdata/article_with_footnotes.xml"
#> [2] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/jstor/extdata/article_with_references.xml"
#> [3] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/jstor/extdata/book.xml"
#> [4] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/jstor/extdata/parsed_references.xml"
```
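On UNIX, the list can alternatively be built with find via system():

```r
library(stringr)

# assumption: file_names holds the output of `find`, run inside example_dir
file_names <- system(
  paste0("cd ", shQuote(example_dir), "; find . -name '*.xml' -type f"),
  intern = TRUE
)

# strip the leading "./" and prepend the directory to get full paths
file_names_system <- file_names %>%
  str_replace("^\\.\\/", "") %>%
  str_c(example_dir, "/", .)

file_names_system
#> [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_with_footnotes.xml"
#> [2] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_book.xml"
#> [3] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_with_references.xml"
```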
In this case the two approaches give the same kind of result, a character vector of full file paths. The main difference seems to be that list.files() sorts the output, whereas find does not. For a large number of files (200,000) this makes list.files() slower; for smaller datasets the difference shouldn't have an impact.
Once the file list is generated, we can apply any of the jst_get_* functions to the list. A good and simple way for small to moderate numbers of files is to use purrr::map_df():
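A sketch of this step for the two example articles. Selecting files by the "article_" prefix is an assumption based on the file names listed above; article_paths and articles are illustrative names.

```r
library(jstor)
library(purrr)

# example files shipped with the package
example_dir <- system.file("extdata", package = "jstor")
file_names_listed <- list.files(example_dir, pattern = "\\.xml$",
                                full.names = TRUE)

# only work with journal articles, since jst_get_article() expects them
is_article <- grepl("^article_", basename(file_names_listed))
article_paths <- file_names_listed[is_article]

# parse each file and bind the results into one data frame
articles <- map_df(article_paths, jst_get_article)
articles
```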
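| file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| article_with_footnotes | NA | washhistq | NA | The Washington Historical Quarterly | NA | NA | 40428382 | research-article | The Nisqually Journal (Continued) | 13 | 2 | eng | 1 | 4 | 1922 | 131 | 141 | NA |
| article_with_references | NA | tranamermicrsoci | NA | Transactions of the American Microscopical Society | 10.2307/3221896 | NA | NA | research-article | On the Protozoa Parasitic in Frogs | 41 | 2 | eng | 1 | 4 | 1922 | 59 | 76 | 59-76 |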
This works well if 1) there are no errors and 2) there is only a moderate number of files. For larger numbers of files, jst_import() can streamline the process for you. This function works very similarly to jst_import_zip(), the main difference being that it needs file paths as input and can only handle one type of output.
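A call to jst_import() could look like the following sketch. The output name "my_second_test" and the selection of article files by their "article_" prefix are assumptions made for illustration.

```r
library(jstor)

# output folder and example article files shipped with the package
tmp <- tempdir()
example_dir <- system.file("extdata", package = "jstor")
article_paths <- list.files(example_dir, pattern = "^article_.*\\.xml$",
                            full.names = TRUE)

# apply one function (.f) to all paths and write the result to disk
jst_import(article_paths, out_file = "my_second_test",
           .f = jst_get_article, out_path = tmp)
```

Because only one function can be supplied via .f, importing authors or references for the same files requires a separate call with a different out_file.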
The result is again written to disk, as can be seen below:
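A sketch of how to check this, assuming the output was written to the temporary folder used throughout:

```r
# tempdir() returns the same folder for the whole R session
tmp <- tempdir()

# list all csv files that the import functions have written so far
csv_files <- list.files(tmp, pattern = "\\.csv$")
csv_files
```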