• How to search for and fetch sequences
  • " />

    A downside of the restez approach is we are unable to then search the local database for sequences of interest as the database cannot possibly contain sufficient metadata (most notably taxonomic information) to perform as rich a search as online via NCBI. As a result we must perform the sequence discovery process independently from restez. In this tutorial we will demonstrate how to perform online searches using the rentrez package and then use the search results to look up sequences in a restez database.

    To run all the code in this tutorial you will need to have already set up the rodents database, see Build a database of all rodents.

    Search NCBI

    Let’s pretend we’re interested in COI sequences for a group of rodents called the Sciuromorpha. We can create an NCBI search term; use entrez_search to perform the search; and then retrieve the accession IDs using entrez_fetch with the acc rettype. entrez_fetch returns a single text that will need to be split up by the newline character. Finally, we should drop any version number after the downloaded accessions. This is not strictly necessary, but it makes it easier to check our results from restez later.

    # 33553 - Sciuromorpha - squirrel-like things
    search_term <- 'txid33553[Organism:exp] AND COI [GENE] AND 100:1000[SLEN]'
    search_object <- rentrez::entrez_search(db = 'nucleotide', term = search_term,
                                            use_history = TRUE, retmax = 0)
    accessions <- rentrez::entrez_fetch(db = 'nucleotide',
                                        web_history = search_object$web_history,
                                        rettype = 'acc')
    accessions <- strsplit(x = accessions, split = '\\n')[[1]]
    accessions <- sub(pattern = '\\.[0-9]+', replacement = '', x = accessions)
    #> [1] 455
    #>  [1] "KX859270" "KX859268" "KX859266" "KX859265" "KX859257" "KX859256"
    #>  [7] "KX859255" "KX859254" "JF457164" "JF457162"

    Retrieve sequences

    To fetch the sequences from the rodents database, we can just use the gb_fasta_get function.

    #> -------------
    #> restez v1.0.0
    #> -------------
    #> Remember to restez_path_set() and, then, restez_connect()
    #> Remember to run `restez_disconnect()`
    coi_sequences <- gb_fasta_get(id = accessions)
    #>  chr ">FJ808614.1 Glis glis isolate A1g02_E11-F cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\n"| __truncated__
    # Are all accessions in results?
    all(accessions %in% names(coi_sequences))
    #> [1] TRUE

    Comparing to Entrez

    Can we not just use rentrez to do the fetching as well? Yes, but restez can be a lot faster. NCBI limits the number of requests per user, often to as little as 100 items per request with varying time delays. Additionally for any programmatic retrieval of sequences using an online server can never be as reliable as a local copy.

    # time via restez
    system.time(expr = {
      coi_sequences <- gb_fasta_get(id = accessions)
    #>    user  system elapsed 
    #>   0.968   2.366   5.865
    # time via Entrez
    system.time(expr = {
      coi_sequences_p1 <- rentrez::entrez_fetch(db = 'nucleotide',
                                                id = accessions[1:100],
                                                rettype = 'fasta')
      coi_sequences_p2 <- rentrez::entrez_fetch(db = 'nucleotide',
                                                id = accessions[101:200],
                                                rettype = 'fasta')
      coi_sequences_p3 <- rentrez::entrez_fetch(db = 'nucleotide',
                                                id = accessions[201:300],
                                                rettype = 'fasta')
      coi_sequences_p4 <- rentrez::entrez_fetch(db = 'nucleotide',
                                                id = accessions[301:400],
                                                rettype = 'fasta')
      coi_sequences_p5 <- rentrez::entrez_fetch(db = 'nucleotide',
                                                id = accessions[401:456],
                                                rettype = 'fasta')
    #>    user  system elapsed 
    #>   0.157   0.020   6.908
    # always disconnect