1. Download GenBank

In the first section of this introduction to restez we will explain how to download parts of GenBank on to your computer and from these downloads, construct a local database. GenBank can be downloaded from NCBI using the File Transfer Protocol (FTP), the online portal for these downloads can be viewed from your web browser, FTP NCBI GenBank. GenBank is hosted at this portal as 1000s of compressed sequence files called flatfiles. New versions of these flatfiles are released once a month as new sequence information is uploaded. For information on the latest GenBank release, see the NCBI statistics page.

restez downloads a selection of these flatfiles and unpacks them into an SQL-like database (specifically, MonetDB) that can be queried using restez’s functions. Because there are potentially huge amounts of sequence information, all of which is unlikely to be of interest to a user, restez allows you to select the sets of flatfiles to download based on GenBank’s domains. For the most part these sets of flatfile types are determined by taxonomy, for example, a user can download all primate or plant sequences.

Here, we will explain how to set up restez, download genbank and then create a database from the downloaded files.

1.1 Setting up

Before we can download anything, we need to tell restez where we would like to store the downloaded files and the eventual GenBank database. To do this we use the restez_path_set function. With this function a user can provide a system file path to a folder within which the function will create a new folder called ‘restez’. In this example, we will create a restez folder in a temporary folder.

library(restez)
#> -------------
#> restez v1.0.0
#> -------------
#> Remember to restez_path_set() and, then, restez_connect()
# rstz_pth <- tempdir()
restez_path_set(filepath = rstz_pth)
#> ... Creating '/var/folders/ps/g89999v12490dmp0jnsfmykm0043m3/T//RtmpPSdaDl/restez'
#> ... Creating '/var/folders/ps/g89999v12490dmp0jnsfmykm0043m3/T//RtmpPSdaDl/restez/downloads'

A user must always set the restez path in order to use the restez library and we are reminded to do this whenever we load the package. (We are also reminded to run restez_connect but we’ll come to that a bit later.) Additionally, by running restez_path_set on a new folder messages are printed to console telling us that the path was created and that a downloads subfolder was created.

Note, whenever you intend to use restez you will need to specify the restez path. If you are using it regularly, it might be a good idea to set it using your .Rprofile.

1.2 Download

Now we can download GenBank files using db_download(). This is an interactive function, it looks up the latest GenBank release, parses the release notes and prints to console the available sets of flatfiles that can be downloaded. For this example, we will select the smallest available domain which is ‘Unannotated’ and can be pre-specified with ‘20’. Normally, you would run this function without the preselection argument and just wait for the function to prompt you for a domain selection.

After launching, the download function will ask you whether it is OK to download and set up a database for the selected sequence files. If the selection is likely to be large you can always quit the process using either Esc or Ctrl+c. Please note that the diskspace estimates are preliminary. The package trys to be conservative in its estimates.

db_download(preselection = '20')
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> Looking up latest GenBank release ...
#> ... release number 228
#> [1] ""
#> ... found 3232 sequence files
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> Which sequence file types would you like to download?
#> Choose from those listed below:
#> ● 1  - 'Bacterial'
#>         542 files and 123 GB
#> ● 2  - 'EST (expressed sequence tag)'
#>         486 files and 243 GB
#> ● 3  - 'Constructed'
#>         372 files and 84.6 GB
#> ● 4  - 'Patent'
#>         340 files and 77.9 GB
#> ● 5  - 'GSS (genome survey sequence)'
#>         308 files and 117 GB
#> ● 6  - 'TSA (transcriptome shotgun assembly)'
#>         234 files and 53.9 GB
#> ● 7  - 'Plant sequence entries (including fungi and algae)'
#>         232 files and 62.8 GB
#> ● 8  - 'HTGS (high throughput genomic sequencing)'
#>         155 files and 36.7 GB
#> ● 9  - 'Invertebrate'
#>         112 files and 24.8 GB
#> ● 10 - 'Environmental sampling'
#>         103 files and 24.1 GB
#> ● 11 - 'Other vertebrate'
#>         94 files and 19.4 GB
#> ● 12 - 'Primate'
#>         59 files and 13.7 GB
#> ● 13 - 'Viral'
#>         58 files and 13.1 GB
#> ● 14 - 'Other mammalian'
#>         55 files and 9.44 GB
#> ● 15 - 'Rodent'
#>         31 files and 7.32 GB
#> ● 16 - 'STS (sequence tagged site)'
#>         20 files and 4.45 GB
#> ● 17 - 'HTC (high throughput cDNA sequencing)'
#>         15 files and 3.43 GB
#> ● 18 - 'Synthetic and chimeric'
#>         10 files and 2.42 GB
#> ● 19 - 'Phage'
#>         5 files and 1.12 GB
#> ● 20 - 'Unannotated'
#>         1 files and 0.00111 GB
#> Provide one or more numbers separated by spaces.
#> e.g. to download all Mammalian sequences, type: "12 14 15" followed by Enter
#> 
#> Which files would you like to download?
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> You've selected a total of 1 file(s) and 0.00111 GB of uncompressed data.
#> These represent: 
#> ● 'Unannotated'
#> 
#> Based on stated GenBank files sizes, we estimate ... 
#> ... 0.000222 GB for  compressed, downloaded files
#> ... 0.00135 GB for the SQL database
#> Leading to a total of 0.00158 GB
#> 
#> Please note, the real sizes of the database and its downloads cannot be
#> accurately predicted beforehand.
#> These are just estimates, actual sizes may differ by up to a factor of two.
#> 
#> Is this OK?
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> Downloading ...
#> ... 'gbuna1.seq' (1/1)
#> Done. Enjoy your day.
#> [1] TRUE

Now the download has begun, we just need to wait for it to finish. For the largest domains, this can take quite a while and may be prone to server errors. If you ever encounter a failed downloaded file it is possible to re-run the db_download function again and – so long as the same selection is made – the function will only download the missing files.

1.3 Create a database

After the download has completed, we need to use db_create() to create our local database. This looks in the downloads folder of our restez path, breaks these files up into separate GenBank records and adds them to the SQL-like database. Again, for very large collections of sequences this can take quite a while.

Before running the db_create function, however, we must first establish a connection to the database in the restez path using the restez_connect function. As with SQL databases we must always make sure to connect and then disconnect after we have interacted with a database.

restez_connect()
#> Remember to run `restez_disconnect()`
db_create()
#> Adding 1 file(s) to the database ...
#> ... [32m'gbuna1.seq.gz'[39m([34m1 / 1[39m)
#> Done.
restez_disconnect()

db_create() allows a user to specify minimum and maximum sequence lengths. It’s always a good idea to limit the number of sequences in a database so that look-up times are faster. If you know you are only interested in certain lengths of sequences it is a good idea to limit the sequences in the database at this stage. You can always run db_create() again to change the limits. You will simply need to delete the original database first with db_delete.

1.4 Checking the setup

After the download and the database creation steps are complete, we can confirm our setup using restez_status.

library(restez)
restez_path_set(rstz_pth)
restez_connect()
#> Remember to run `restez_disconnect()`
restez_status()
#> Checking setup status at  ...
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> Restez path ...
#> ... Path '/var/folders/ps/g89999v12490dmp0jnsfmykm0043m3/T//RtmpPSdaDl/restez'
#> ... Does path exist? 'Yes'
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> Download ...
#> ... Path '/var/folders/ps/g89999v12490dmp0jnsfmykm0043m3/T//RtmpPSdaDl/restez/downloads'
#> ... Does path exist? 'Yes'
#> ... N. files 2
#> ... N. GBs 0
#> ... GenBank division selections 'Unannotated'
#> ... GenBank Release 228
#> ... Last updated '2018-11-26 22:09:08'
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> Database ...
#> ... Path '/var/folders/ps/g89999v12490dmp0jnsfmykm0043m3/T//RtmpPSdaDl/restez/sql_db'
#> ... Does path exist? 'Yes'
#> ... N. GBs 0
#> ... Is database connected? 'Yes'
#> ... Does the database have data? 'Yes'
#> ... Number of sequences 448
#> ... Min. sequence length 0
#> ... Max. sequence length Inf
#> ... Last_updated '2018-11-26 22:09:11'
restez_disconnect()

The status function allows a user to always touch base with their restez set-up. It can be run at any point, before and/or after download or database creation. It is also designed to provide useful messages to a user for what they need to do next in order for them to make queries to their database. If you ever get confused, run restez_status to see what is happening.

Additionally, if you are developing your own functions that depend on restez, you can use restez_ready which will simply return TRUE or FALSE on whether the database can be queried.

library(restez)
restez_path_set(rstz_pth)
restez_connect()
#> Remember to run `restez_disconnect()`
if (restez_ready()) {
  print('Database is ready to be queried!')
} else {
  print('Database cannot be queried :-(')
}
#> [1] "Database is ready to be queried!"
restez_disconnect()

2. Query

Once a restez database has been set up we can query the database using the gb_*_get() functions. These functions allow us to retrieve specific columns in the SQL-like database: ‘sequence’, ‘definition’, ‘accession’, ‘version’ and ‘organism’. Also, they allow us to get the whole text formatted record and sequence data in fasta format. In this example, we can use list_db_ids() to identify accession numbers in the database. We could also use entrez_search() provided the database contains sequences of interest to us, see ‘search and fetch’.

Always remember to disconnect after you have finished your querying!

library(restez)
restez_path_set(rstz_pth)
restez_connect()
#> Remember to run `restez_disconnect()`
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
#> [1] "AF298084"
# sequence
str(gb_sequence_get(id))
#>  Named chr "GATCCTTCCGATGTCTTCGGCCTGGAAGAGATCCGGAAGAGAACGGGGTTCATGGCCGAGCCCCTCGTGGCGCCGGGTATGCTCATCGACTGGGCTTTGGAGAAGTATTAT"| __truncated__
#>  - attr(*, "names")= chr "AF298084"
# definition
(gb_definition_get(id))
#>                                                    AF298084 
#> "Unidentified clone A22 DNA sequence from ocean beach sand"
# version
(gb_version_get(id))
#>     AF298084 
#> "AF298084.1"
# organism
(gb_organism_get(id))
#>       AF298084 
#> "unidentified"
# fasta
cat(gb_fasta_get(id))
#> >AF298084.1 Unidentified clone A22 DNA sequence from ocean beach sand
#> GATCCTTCCGATGTCTTCGGCCTGGAAGAGATCCGGAAGAGAACGGGGTTCATGGCCGAGCCCCTCGTGG
#> CGCCGGGTATGCTCATCGACTGGGCTTTGGAGAAGTATTATCCGGAACCGGACAACGGGAGCAGCTCGGA
#> GAACGTGGTTGCCACCGCCCCTCAGGCCTCTCCTGCCACCCAGGACCAAGCCGTCCAGGGCGGCCCCGCG
#> CCGACGGAGGTAGAGCCTCTCTCCCCTCCCGGGCAACCCGTGGAAGGAGGCGTCTTCGGAGCCGTTCCCG
#> AGATCGTGGCGCCCCAACCCCCGGGAGAGGGAGCGAGTGACATCAAGGTGGCCGATTTCGAGGACTCGGA
#> GGCTGAGATCGTGGCCGGCGCANCGCCGGCGCCCGTTCAACCTGAAGTCCCCGCGGCCACGGAGATCCCG
#> GCCTTCGAGGAGTATGACGGA
# Note, for large databases these requests can take a long time to complete.
# Always try and limit the size of the database by specifying min and max
# sequence lengths with db_create()
# Also note, if an id is not present in the database nothing is returned
cat(gb_fasta_get(id = c(id, 'notanid')))
#> >AF298084.1 Unidentified clone A22 DNA sequence from ocean beach sand
#> GATCCTTCCGATGTCTTCGGCCTGGAAGAGATCCGGAAGAGAACGGGGTTCATGGCCGAGCCCCTCGTGG
#> CGCCGGGTATGCTCATCGACTGGGCTTTGGAGAAGTATTATCCGGAACCGGACAACGGGAGCAGCTCGGA
#> GAACGTGGTTGCCACCGCCCCTCAGGCCTCTCCTGCCACCCAGGACCAAGCCGTCCAGGGCGGCCCCGCG
#> CCGACGGAGGTAGAGCCTCTCTCCCCTCCCGGGCAACCCGTGGAAGGAGGCGTCTTCGGAGCCGTTCCCG
#> AGATCGTGGCGCCCCAACCCCCGGGAGAGGGAGCGAGTGACATCAAGGTGGCCGATTTCGAGGACTCGGA
#> GGCTGAGATCGTGGCCGGCGCANCGCCGGCGCCCGTTCAACCTGAAGTCCCCGCGGCCACGGAGATCCCG
#> GCCTTCGAGGAGTATGACGGA
# Always remember to disconnect
restez_disconnect()

Additionally, for more flexibility and options for extracting sequence record information see GenBank record parsing.

3. Entrez

Entrez wrappers are part of the restez package. These allow a user to make use of the local GenBank using functions that were built for rentrez. This minimises the amount of coding changes required for any Entrez dependent code.

Currently, only entrez_fetch() is available with restez and only text formatted rettypes are allowed.

library(restez)
restez_path_set(rstz_pth)
restez_connect()
#> Remember to run `restez_disconnect()`
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
#> [1] "AF148959"
# get fasta sequences with entrez
res <- restez::entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
#> >AF148959.1 PCR artifactual sequence for diagnosis of avian malaria
#> CATGAATAAATCCTCGTATTCAGGTGAGCCTCCCTGTCCTCTCGCCTCCCCCTTTGTGTCTGTCTCTGGG
#> AAAAGAAAAAGGTTGAAAAACCCCGGAGCACGAGGTGCGCAAGCCCTCCTGGCTGCGAGCGCTCTGCGGA
#> GGAGTGAGCGGCTGGTTCGCTGTGTATAAACAAGTGGAAAAGGCTTAAAAAGCAAAGCAAACGCGGCGGG
#> GCAGCTGGTTCCAGGCAGAGCCCGGCACTGGGGGCACGGAGCTTGTTATCTGAGGGCACCTGTGCCAGCA
#> GGGGGTGAGATCCATCGCCAAGTGACAGCGTGGCATGGGAACAGGACCGTGGGGTGTGTGTCTGAGTGTG
#> ACACTGGGCTGCAGGCATTTCCAAATCCCCTAATGCCGAGGGATTCTCTTCTGCCTTCTCCTGTCTGGCT
#> TGCCAGTTTGGCCCTACCGGGTGAGGGCATTTGCCCTCTGCTCGGGCAGCTCCTCCTCCCCGGCTGGCAC
#> AGAATGTGCAGCCACCCTG
# entrez_fetch will also search via Entrez for any ids not in the db
plant_sequence <- 'AY952423'
res <- restez::entrez_fetch(db = 'nucleotide', id = c(id, plant_sequence),
                            rettype = 'fasta')
#> [1] id(s) are unavailable locally, searching online.
cat(res)
#> >AF148959.1 PCR artifactual sequence for diagnosis of avian malaria
#> CATGAATAAATCCTCGTATTCAGGTGAGCCTCCCTGTCCTCTCGCCTCCCCCTTTGTGTCTGTCTCTGGG
#> AAAAGAAAAAGGTTGAAAAACCCCGGAGCACGAGGTGCGCAAGCCCTCCTGGCTGCGAGCGCTCTGCGGA
#> GGAGTGAGCGGCTGGTTCGCTGTGTATAAACAAGTGGAAAAGGCTTAAAAAGCAAAGCAAACGCGGCGGG
#> GCAGCTGGTTCCAGGCAGAGCCCGGCACTGGGGGCACGGAGCTTGTTATCTGAGGGCACCTGTGCCAGCA
#> GGGGGTGAGATCCATCGCCAAGTGACAGCGTGGCATGGGAACAGGACCGTGGGGTGTGTGTCTGAGTGTG
#> ACACTGGGCTGCAGGCATTTCCAAATCCCCTAATGCCGAGGGATTCTCTTCTGCCTTCTCCTGTCTGGCT
#> TGCCAGTTTGGCCCTACCGGGTGAGGGCATTTGCCCTCTGCTCGGGCAGCTCCTCCTCCCCGGCTGGCAC
#> AGAATGTGCAGCCACCCTG
#> 
#> >AY952423.1 Livistona chinensis tRNA-Lys (trnK) gene, partial sequence; and matK gene, complete sequence; chloroplast
#> ATTGGGGTTGCTAACTCAACGGTAGAGTACTCGGCTTTTAAGTGCGACTATCATCTTTTACACATTTGGA
#> TGAAGTAAGGAATTCGTCCAGACTATTGGTAGAGTCTATAAGACCACGACTGATCCTGAAAGGTAATGAA
#> TGGAAAAAATAGCATGTCGTACGTAATACAATGAGAAACTTGTAATTTCTTATTGTAATTTTTTAAGTAG
#> AACTTTGAGTTTATCCTTACTGGATCATTACAAAAATATTGTATTTTATTTTTGGAAGGGGACGAAAAAA
#> AGGAAATTCCCAACATTTATTGTTTGGTCTAATGAATAAATGGATAGGGGCCTAGGGTAGGGCCCAATTT
#> TTGTAAAACAAAAAGCAACGAGCTTATGTTCTTAATTTGAATAATTACCCGATCTAATTAGATGTTAAAA
#> ATAAATTAGTGCCAGATGTGGTAAAGGGTTCTACTGTAAGTGGACCTTTTTTTTTTTTTTTTATGAATCC
#> TACCTATTATCTATTATGGATTAAAGATGGATGTGTATAAGAAGAAGTATACTGATAAAGAGAATTTTTC
#> CAAAGTCAAAAGAGCAATCGGGTTGCAAAAATAAAGGATTTTTACCTCCGAGAATTATAAATTAATTGGA
#> TCAAAAGGAGAGGAAAAAGTCTGTGATTGGACTCCTTCTATCCGCGGGTATGGGTATATAGTAGGTATAT
#> ATGTATATTTGTATACTATATAAATTACATGCCCTGTTCTGACCGTATTGCACTATGTATTATTTGATAA
#> TCCAAGAAATGCCTCCTACTTCTGGTTCAAGTAGAAATGAAAATAGAAGAATTACAAGAATATTTAGAAA
#> AAGATAGATCTCGGCAACAACACTTCCTATACCCACTTTTCTTTCAGGAGTATATTTATGCACTTGCTCA
#> TGATTATGGGTTTAAAGGGTTCGATTTTTTACGAACCTATGGAAATTGGGGGTTATGATAATAAATCTAG
#> TTCAGTACTTGTAAAACATTTAATTACTCGAATGTATCAACAGAATTATTTGATTTATTCTGTTAATGAA
#> TCTAACCAAAATCGATTGATTGAGCATAACAATTCTTTTTATTCTCAAATGATATCTGAAGTTTTTGCGA
#> TCATTGCAGAAATTCCATTCTCTCAGCAATTACTATTTTCTCTTCGAGGAAAAAAGAATACCAAAATCTC
#> AGACTTTACGATCTATTCATTCAATATTTCCCTTTTTAGAAGACAAATTATCACATTTAAACTATGTGTC
#> AGATATATTAATACCCTATCCCATCCATTTGGAAATCTTGGTGCAAATTCTTCAATGCTGGATCCAAGAT
#> GTTTCTTCTTTGCATTTATTGCGATTCTTTCTCCACGAACATCATAATGGGAATAGTTTTTTTTTTCCAA
#> AGAAATCCTTTTCAAAAGAAAATAAAAGACTCTTTCGATTCCTATATAATTCTTATGTATCTGAATGTGA
#> ATTTGTCTTAGTGTTTCTTCGTAAACAATCCTCTTATTTACAATCAAAATCCTATGGAATCTTTCTTGAG
#> CGAACACATTTCTATGGAAGAATGGAACATCTTATAGTAGTGTGTCATAATTATTGTCAGAAGGCCTTTT
#> GGGTCTTCAAGGATCCTTTTATGCATTATGTTCGATATCAAGGAAAAGCAATTCTGGCATCAAAAGGATC
#> TTATCTTTTGATGAAGAAATGGAGATGTCATCTTGTCAATTTCTGGCAATATTATTTTCATTTTTGGGCT
#> CAGCCTTACAGAATTTCAATAAACCAATTAGGAAATCATTCCTTCTATTTTCTCGGTTATCTTTCAAGTG
#> TATTAAAAAATACTTCGTCTGTAAGGAATCAAATGCTAGAGAATTCCTTTTTAATAGATACTATTACTAA
#> TAAATTGGATACCATAGTCCCAGTTCTTCCTCTTATTGGATCTTTGTCTAAAGCTAAATTTTGTACCGTA
#> TCCGGGCATCCTAGTAGTAAGCCAATCTGGACGGATTTATCGGATTCTGATATTATTGATAGATTTGGTC
#> GGATATGTAGAAATCTTTCTCATTATTATAGTGGATCCTCAAAAAAACAGAGCTTATATCGAATAAGGTA
#> TATACTTCGACTTTCTTGTGCTAGAACTTTAGCTCGTAAACATAAAAGTACAGTACGTGCTTTTTTGCAA
#> AGATTAGGTTCGGAATTATTAGAAGAATTCTTTACAGAAGAAGAAGGAGTTGTTTTTTTGATTTCCCAAA
#> AGAACAAAACCTCTTTTCCTCTCTATAGGTCACATAGAGAACGCATTTGGTATTTGGATATTATCCATAT
#> TAATGAATTGGTGAATTCATTTATGATGGGGCGATAAGCCCCTATAAAATAAGAAATATAAATTTTTTCT
#> AATGTCTAATAAATAGACGACAAATTCATTAATTTTCATTCTGAAATGCTCATCTAGTAGTGTAGTGATT
#> GAATCAACTGAGTATTCAAAATTTTTAGACAAACTTCTAGGGATAGAAGTTTGTTTTATCTGTATACATA
#> GGTAAAGTCGTGTGCAATGAAAAATGCAAGCACGATTTGGGGAGAGATAATTTTCTCTATTGTAACAAAT
#> AAAAATTATCTACTCCATCCGACTAGTTAATCG
restez_disconnect()