The NCBI shares a lot of data. At the time this document was compiled, there were 26 million papers in PubMed, including 4 million full-text records available in PubMed Central. The NCBI Nucleotide Database (which includes GenBank) has data for 201.9 million different sequences, and dbSNP describes 774.1 million different genetic variants. All of these records can be cross-referenced with the 1.14 million species in the NCBI taxonomy or 24.4 thousand disease-associated records in OMIM.
The NCBI makes this data available through a web interface,
an FTP server and through a REST API called the
Entrez Utilities (Eutils for
short). This package provides functions to use that API, allowing users to
gather and combine data from multiple NCBI databases in the comfort of an R
session or script.
### Finding cross-references : `entrez_link()`:
One of the strengths of the NCBI databases is the degree to which records of one
type are connected to other records within the NCBI or to external data
sources. The function `entrez_link()` allows users to discover these links
between records.
####My god, it's full of links
To get an idea of the degree to which records in the NCBI are cross-linked we
can find all NCBI data associated with a single gene (in this case the
Amyloid Beta Precursor gene, the product of which is associated with the
plaques that form in the brains of Alzheimer's Disease patients).
The function `entrez_link()` can be used to find cross-referenced records. In
the most basic case we need to provide an ID (`id`), the database from which this
ID comes (`dbfrom`) and the name of a database in which to find linked records (`db`).
If we set this last argument to 'all' we can find links in multiple databases:
```r
all_the_links <- entrez_link(dbfrom='gene', id=351, db='all')
all_the_links
```
```
#> elink object with contents:
#> $links: IDs for linked records from NCBI
#>
```
Just as with `entrez_search` the returned object behaves like a list, and we can
learn a little about its contents by printing it. In the case, all of the
information is in `links` (and there's a lot of them!):
```r
all_the_links$links
```
```
#> elink result with information from 53 databases:
#> [1] gene_bioconcepts gene_biosystems
#> [3] gene_biosystems_all gene_clinvar
#> [5] gene_dbvar gene_gene_h3k4me3
#> [7] gene_genome gene_gtr
#> [9] gene_medgen_diseases gene_pcassay
#> [11] gene_pcassay_alltarget_list gene_pcassay_alltarget_summary
#> [13] gene_pcassay_rnai gene_pcassay_target
#> [15] gene_probe gene_structure
#> [17] gene_bioproject gene_books
#> [19] gene_cdd gene_gene_neighbors
#> [21] gene_genereviews gene_genome2
#> [23] gene_geoprofiles gene_homologene
#> [25] gene_nuccore gene_nuccore_mgc
#> [27] gene_nuccore_pos gene_nuccore_refseqgene
#> [29] gene_nuccore_refseqrna gene_nucest
#> [31] gene_nucest_clust gene_nucleotide
#> [33] gene_nucleotide_clust gene_nucleotide_mgc
#> [35] gene_nucleotide_mgc_url gene_nucleotide_pos
#> [37] gene_omim gene_pcassay_proteintarget
#> [39] gene_pccompound gene_pcsubstance
#> [41] gene_pmc gene_pmc_nucleotide
#> [43] gene_protein gene_protein_refseq
#> [45] gene_pubmed gene_pubmed_citedinomim
#> [47] gene_pubmed_pmc_nucleotide gene_pubmed_rif
#> [49] gene_snp gene_snp_geneview
#> [51] gene_taxonomy gene_unigene
#> [53] gene_varview
```
The names of the list elements are in the format `[source_database]_[linked_database]`
and the elements themselves contain a vector of linked-IDs. So, if we want to
find open access publications associated with this gene we could get linked records
in PubMed Central:
```r
all_the_links$links$gene_pmc[1:10]
```
```
#> [1] "4648968" "4528011" "4502972" "4496119" "4476499" "4467552" "4449029"
#> [8] "4440096" "4418866" "4410718"
```
Or if were interested in this gene's role in diseases we could find links to clinVar:
```r
all_the_links$links$gene_clinvar
```
```
#> [1] "160886" "155682" "155309" "155093" "155053" "154360" "154063"
#> [8] "153438" "152839" "151388" "150018" "149551" "149418" "149160"
#> [15] "149035" "148411" "148262" "148180" "146125" "145984" "145474"
#> [22] "145468" "145332" "145107" "144677" "144194" "127268" "98242"
#> [29] "98241" "98240" "98239" "98238" "98237" "98236" "98235"
#> [36] "59247" "59246" "59245" "59243" "59226" "59224" "59223"
#> [43] "59222" "59221" "59010" "59005" "59004" "37145" "32099"
#> [50] "18106" "18105" "18104" "18103" "18102" "18101" "18100"
#> [57] "18099" "18098" "18097" "18096" "18095" "18094" "18093"
#> [64] "18092" "18091" "18090" "18089" "18088" "18087"
```
####Narrowing our focus
If we know beforehand what sort of links we'd like to find , we can
to use the `db` argument to narrow the focus of a call to `entrez_link`.
For instance, say we are interested in knowing about all of the
RNA transcripts associated with the Amyloid Beta Precursor gene in humans.
Transcript sequences are stored in the nucleotide database (referred
to as `nuccore` in EUtils), so to find transcripts associated with a given gene
we need to set `dbfrom=gene` and `db=nuccore`.
```r
nuc_links <- entrez_link(dbfrom='gene', id=351, db='nuccore')
nuc_links
```
```
#> elink object with contents:
#> $links: IDs for linked records from NCBI
#>
```
```r
nuc_links$links
```
```
#> elink result with information from 5 databases:
#> [1] gene_nuccore gene_nuccore_mgc gene_nuccore_pos
#> [4] gene_nuccore_refseqgene gene_nuccore_refseqrna
```
The object we get back contains links to the nucleotide database generally, but
also to special subsets of that database like [refseq](http://www.ncbi.nlm.nih.gov/refseq/).
We can take advantage of this narrower set of links to find IDs that match unique
transcripts from our gene of interest.
```r
nuc_links$links$gene_nuccore_refseqrna
```
```
#> [1] "324021747" "324021746" "324021739" "324021737" "324021735"
#> [6] "228008405" "228008404" "228008403" "228008402" "228008401"
```
We can use these ids in calls to `entrez_fetch()` or `entrez_summary()` to learn
more about the transcripts they represent.
####External links
In addition to finding data within the NCBI, `entrez_link` can turn up
connections to external databases. Perhaps the most interesting example is
finding links to the full text of papers in PubMed. For example, when I wrote
this document the first paper linked to Amyloid Beta Precursor had a unique ID of
`25500142`. We can find links to the full text of that paper with `entrez_link`
by setting the `cmd` argument to 'llinks':
```r
paper_links <- entrez_link(dbfrom="pubmed", id=25500142, cmd="llinks")
paper_links
```
```
#> elink object with contents:
#> $linkouts: links to external websites
```
Each element of the `linkouts` object contains information about an external
source of data on this paper:
```r
paper_links$linkouts
```
```
#> $ID_25500142
#> $ID_25500142[[1]]
#> Linkout from Elsevier Science
#> $Url: http://linkinghub.elsevier ...
#>
#> $ID_25500142[[2]]
#> Linkout from Europe PubMed Central
#> $Url: http://europepmc.org/abstr ...
#>
#> $ID_25500142[[3]]
#> Linkout from Ovid Technologies, Inc.
#> $Url: http://ovidsp.ovid.com/ovi ...
#>
#> $ID_25500142[[4]]
#> Linkout from PubMed Central
#> $Url: http://www.ncbi.nlm.nih.go ...
#>
#> $ID_25500142[[5]]
#> Linkout from Genetics Home Reference
#> $Url: https://ghr.nlm.nih.gov/ge ...
#>
#> $ID_25500142[[6]]
#> Linkout from MedlinePlus Health Information
#> $Url: https://www.nlm.nih.gov/me ...
#>
#> $ID_25500142[[7]]
#> Linkout from Mouse Genome Informatics (MGI)
#> $Url: http://www.informatics.jax ...
```
Each of those linkout objects contains quite a lot of information, but the URL
is probably the most useful. For that reason, `rentrez` provides the
function `linkout_urls` to make extracting just the URL simple:
```r
linkout_urls(paper_links)
```
```
#> $ID_25500142
#> [1] "http://linkinghub.elsevier.com/retrieve/pii/S0014-4886(14)00393-8"
#> [2] "http://europepmc.org/abstract/MED/25500142"
#> [3] "http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=25500142.ui"
#> [4] "http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/25500142/"
#> [5] "https://ghr.nlm.nih.gov/gene=APP"
#> [6] "https://www.nlm.nih.gov/medlineplus/alzheimersdisease.html"
#> [7] "http://www.informatics.jax.org/marker/reference/25500142"
```
The full list of options for the `cmd` argument are given in in-line
documentation (`?entrez_link`). If you are interested in finding full text
records for a large number of articles checkout the package
[fulltext](https://github.com/ropensci/fulltext) which makes use of multiple
sources (including the NCBI) to discover the full text articles.
####Using more than one ID
It is possible to pass more than one ID to `entrez_link()`. By default, doing so
will give you a single elink object containing the complete set of links for
_all_ of the IDs that you specified. So, if you were looking for protein IDs
related to specific genes you could do:
```r
all_links_together <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
all_links_together
```
```
#> elink object with contents:
#> $links: IDs for linked records from NCBI
#>
```
```r
all_links_together$links$gene_protein
```
```
#> [1] "768043930" "767953815" "558472750" "194394158" "166221824"
#> [6] "154936864" "148697547" "148697546" "119602646" "119602645"
#> [11] "119602644" "119602643" "119602642" "81899807" "74215266"
#> [16] "74186774" "37787317" "37787309" "37787307" "37787305"
#> [21] "37589273" "33991172" "31982089" "26339824" "26329351"
#> [26] "21619615" "10834676"
```
Although this behaviour might sometimes be useful, it means we've lost track of
which `protein` ID is linked to which `gene` ID. To retain that information we
can set `by_id` to `TRUE`. This gives us a list of elink objects, each once
containing links from a single `gene` ID:
```r
all_links_sep <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
all_links_sep
```
```
#> List of 2 elink objects,each containing
#> $links: IDs for linked records from NCBI
#>
```
```r
lapply(all_links_sep, function(x) x$links$gene_protein)
```
```
#> [[1]]
#> [1] "768043930" "767953815" "558472750" "194394158" "166221824"
#> [6] "154936864" "119602646" "119602645" "119602644" "119602643"
#> [11] "119602642" "37787309" "37787307" "37787305" "33991172"
#> [16] "21619615" "10834676"
#>
#> [[2]]
#> [1] "148697547" "148697546" "81899807" "74215266" "74186774"
#> [6] "37787317" "37589273" "31982089" "26339824" "26329351"
```
### Getting summary data: `entrez_summary()`
Having found the unique IDs for some records via `entrez_search` or `entrez_link()`, you are
probably going to want to learn something about them. The `Eutils` API has two
ways to get information about a record. `entrez_fetch()` returns 'full' records
in varying formats and `entrez_summary()` returns less information about each
record, but in relatively simple format. Very often the summary records have the information
you are after, so `rentrez` provides functions to parse and summarise summary
records.
####The summary record
`entrez_summary()` takes a vector of unique IDs for the samples you want to get
summary information from. Let's start by finding out something about the paper
describing [Taxize](https://github.com/ropensci/taxize), using its PubMed ID:
```r
taxize_summ <- entrez_summary(db="pubmed", id=24555091)
taxize_summ
```
```
#> esummary result with 43 items:
#> [1] uid pubdate epubdate
#> [4] source authors lastauthor
#> [7] title sorttitle volume
#> [10] issue pages lang
#> [13] nlmuniqueid issn essn
#> [16] pubtype recordstatus pubstatus
#> [19] articleids history references
#> [22] attributes pmcrefcount fulljournalname
#> [25] elocationid viewcount doctype
#> [28] srccontriblist booktitle medium
#> [31] edition publisherlocation publishername
#> [34] srcdate reportnumber availablefromurl
#> [37] locationlabel doccontriblist docdate
#> [40] bookname chapter sortpubdate
#> [43] sortfirstauthor
```
Once again, the object returned by `entrez_summary` behaves like a list, so you can extract
elements using `$`. For instance, we could convert our PubMed ID to another
article identifier...
```r
taxize_summ$articleids
```
```
#> idtype idtypen value
#> 1 doi 3 10.12688/f1000research.2-191.v2
#> 2 pubmed 1 24555091
#> 3 pmc 8 PMC3901538
#> 4 eid 8 24555091
#> 5 rid 8 24563765
#> 6 version 8 2
#> 7 version-id 8 2
#> 8 pmcid 5 pmc-id: PMC3901538;
```
...or see how many times the article has been cited in PubMed Central papers
```r
taxize_summ$pmcrefcount
```
```
#> [1] 6
```
####Dealing with many records
If you give `entrez_summary()` a vector with more than one ID you'll get a
list of summary records back. Let's get those _Plasmodium vivax_ papers we found
in the `entrez_search()` section back, and fetch some summary data on each paper:
```r
vivax_search <- entrez_search(db = "pubmed",
term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)
```
`rentrez` provides a helper function, `extract_from_esummary()` that takes one
or more elements from every summary record in one of these lists. Here it is
working with one...
```r
extract_from_esummary(multi_summs, "fulljournalname")
```
```
#> 24861816
#> "Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases"
#> 24145518
#> "Antimicrobial agents and chemotherapy"
#> 24007534
#> "Malaria journal"
#> 23230341
#> "The Korean journal of parasitology"
#> 23043980
#> "Experimental parasitology"
#> 20810806
#> "The American journal of tropical medicine and hygiene"
#> 20412783
#> "Acta tropica"
#> 19597012
#> "Clinical microbiology reviews"
#> 17556611
#> "The American journal of tropical medicine and hygiene"
#> 17519409
#> "JAMA"
#> 17368986
#> "Trends in parasitology"
#> 12374849
#> "Proceedings of the National Academy of Sciences of the United States of America"
```
... and several elements:
```r
date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount", "title"))
knitr::kable(head(t(date_and_cite)), row.names=FALSE)
```
|pubdate |pmcrefcount |title |
|:----------|:-----------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|2014 Aug | |Prevalence of mutations in the antifolates resistance-associated genes (dhfr and dhps) in Plasmodium vivax parasites from Eastern and Central Sudan. |
|2014 |3 |Prevalence of polymorphisms in antifolate drug resistance molecular marker genes pvdhfr and pvdhps in clinical isolates of Plasmodium vivax from Kolkata, India. |
|2013 Sep 5 |2 |Prevalence and patterns of antifolate and chloroquine drug resistance markers in Plasmodium vivax across Pakistan. |
|2012 Dec |9 |Prevalence of drug resistance-associated gene mutations in Plasmodium vivax in Central China. |
|2012 Dec |6 |Novel mutations in the antifolate drug resistance marker genes among Plasmodium vivax isolates exhibiting severe manifestations. |
|2010 Sep |13 |Mutations in the antifolate-resistance-associated genes dihydrofolate reductase and dihydropteroate synthase in Plasmodium vivax isolates from malaria-endemic countries. |
###Fetching full records: `entrez_fetch()`
As useful as the summary records are, sometimes they just don't have the
information that you need. If you want a complete representation of a record you
can use `entrez_fetch`, using the argument `rettype` to specify the format you'd
like the record in.
####Fetch DNA sequences in fasta format
Let's extend the example given in the `entrez_link()` section about finding
transcript for a given gene. This time we will fetch cDNA sequences of those
transcripts.We can start by repeating the steps in the earlier example
to get nucleotide IDs for refseq transcripts of two genes:
```r
gene_ids <- c(351, 11647)
linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
linked_transripts <- linked_seq_ids$links$gene_nuccore_refseqrna
head(linked_transripts)
```
```
#> [1] "755509222" "755509221" "755509220" "568930486" "563317858" "563317856"
```
Now we can get our sequences with `entrez_fetch`, setting `rettype` to "fasta"
(the list of formats available for [each database is give in this table](http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/)):
```r
all_recs <- entrez_fetch(db="nuccore", id=linked_transripts, rettype="fasta")
class(all_recs)
```
```
#> [1] "character"
```
```r
nchar(all_recs)
```
```
#> [1] 55475
```
Congratulations, now you have a really huge character vector! Rather than
printing all those thousands of bases we can take a peak at the top of the file:
```r
cat(strwrap(substr(all_recs, 1, 500)), sep="\n")
```
```
#> >gi|755509222|ref|XM_006538499.2| PREDICTED: Mus musculus alkaline
#> phosphatase, liver/bone/kidney (Alpl), transcript variant X3, mRNA
#> CCGGGTGACTGCAGAGGATCGGAACGTCAATTAACGTCAATTAACATCTGACGCTGCCCCCCCCCCCCTC
#> TTCCCACCATCTGGGCTCCAGCGAGGGACGAATCTCAGGGTACACCATGATCTCACCATTTTTAGTACTG
#> GCCATCGGCACCTGCCTTACCAACTCTTTTGTGCCAGAGAAAGAGAGAGACCCCAGTTACTGGCGACAGC
#> AAGCCCAAGAGACCTTGAAAAATGCCCTGAAACTCCAAAAGCTCAACACCAATGTAGCCAAGAATGTCAT
#> CATGTTCCTGGGAGATGGTATGGGCGTCTCCACAGTAACCGCTGCCCGAATCCTTAAGGGCCAGCTACAC
#> CACAACACGGG
```
If we wanted to use these sequences in some other application we could write them
to file:
```r
write(all_recs, file="my_transcripts.fasta")
```
Alternatively, if you want to use them within an R session
we could write them to a temporary file then read that. In this case I'm using `read.dna()` from the
pylogenetics package ape (but not executing the code block in this vignette, so
you don't have to install that package):
```r
temp <- tempfile()
write(all_recs, temp)
parsed_recs <- ape::read.dna(all_recs, temp)
```
####Fetch a parsed XML document
Most of the NCBI's databases can return records in XML format. In additional to
downloading the text-representation of these files, `entrez_fetch()` can return
objects parsed by the `XML` package. As an example, we can check out the Taxonomy
database's record for (did I mention they are amazing....) _Tetrahymena
thermophila_, specifying we want the result to be parsed by setting
`parsed=TRUE`:
```r
Tt <- entrez_search(db="taxonomy", term="(Tetrahymena thermophila[ORGN]) AND Species[RANK]")
tax_rec <- entrez_fetch(db="taxonomy", id=Tt$ids, rettype="xml", parsed=TRUE)
class(tax_rec)
```
```
#> [1] "XMLInternalDocument" "XMLAbstractDocument"
```
The package XML (which you have if you have installed `rentrez`) provides
functions to get information from these files. For relatively simple records
like this one you can use `XML::xmlToList`:
```r
tax_list <- XML::xmlToList(tax_rec)
tax_list$Taxon$GeneticCode
```
```
#> $GCId
#> [1] "6"
#>
#> $GCName
#> [1] "Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear"
```
For more complex records, which generate deeply-nested lists, you can use
[XPath expressions](https://en.wikipedia.org/wiki/XPath) along with the function
`XML::xpathSApply` or the extraction operatord `[` and `[[` to extract specific parts of the
file. For instance, we can get the scientific name of each taxon in _T.
thermophila_'s lineage by specifying a path through the XML
```r
tt_lineage <- tax_rec["//LineageEx/Taxon/ScientificName"]
tt_lineage[1:4]
```
```
#> [[1]]
#>