To export or not export phruta outputs
Source:vignettes/Exporting_data_phruta.Rmd
Exporting_data_phruta.Rmd
Table of Contents
To export or not export from phruta
Assembling a molecular dataset for particular target taxa with
phruta
can be performed almost entirely by saving objects
into your workspace. This topic was covered in the introductory vignette
to phruta
(Using the phruta
R package).
However, in some situations, it might be desirable to save the outputs
of different phruta
functions to particular folders. This
tutorial will cover that specific situation. Specifically, we will be
reviewing how phruta
can be used to export
.fasta
and .csv
files that are generated in
different steps of the pipeline.
From taxonomic names to sequence alignments, exporting data
In the introductory vignette to phruta
(Using the
phruta
R package), we assembled a basic molecular dataset
for inferring the phylogeny among three mammal genera. Let’s recreate
the same tutorial, but in this case, exporting the intermediate files
that are created after using several of the functions. Note that the
structure of this tutorial closely follows that of “Using the
phruta
R
package”. You should be able to
follow this tutorial even without having reviewed the introductory
vignette.
Please assume that we are interested in building a phylogenetic tree for the following three genera: Felis, Vulpes, and Phoca. All these three genera are classified within the Carnivora, a mammalian order. Both Felis and Vulpes are classified in different superfamilies within the Fissipedia. Finally, Phoca is part of another suborder, Pinnipedia. We’re going to root our tree with another mammal species, a Chinese Pangolin (Manis pentadactyla). Users can select additional target species and clades. However, for simplicity, we will run the analyses using three genera in the ingroup and a single outgroup species.
So far, we have decided the taxonomic make of our analyses in
phruta
. We will also need to determine the gene regions to
be used in our analyses. Fortunately, mammals are extensively studied
and a comprehensive list of potential gene regions to be analyzed is
already available. For instance, we could use same gene regions sampled
in Upham
et al (2009). However, for this tutorial, we will simply try to find
the gene regions are well sampled for the target taxa. I believe that
figuring out the best sampled gene regions in genbank, instead of
providing gene names, is potentially more valuable when working with
poorly studied groups (e.g. invertebrates). Before we move on, please
make sure that you you have set a working directory for
this project. All the files will be saved to this directory.
phruta
will politely ask before writing files to your local
directories.
Let’s start by loading phruta
!
Now, let’s look for the gene regions that are sampled for our target
taxa. Again, this step is not always necessary. In some groups,
gene-level sampling is very standard (e.g. COI, 12S). However, the
structure of gene sampling sometimes becomes more blurry as you zoom out
taxonomically. For instance, genes A and B can be extensively sampled in
genus 1. However, genus 2 in the same family has mainly been studied
using genes Y and Z. The idea here is that phruta
will try
to find those gene regions that are extensively sampled across species
in the target taxa. We will use the
gene.sampling.retrieve()
function in phruta
.
The resulting data.frame
, named gs.seqs
in
this example, will contain the list of full names for genes sampled in
genbank for the target taxa.
gs.seqs <- gene.sampling.retrieve(organism = c("Felis", "Vulpes", "Phoca", "Manis_pentadactyla"),
speciesSampling = TRUE)
For the search terms used above, phruta
was able to
retrieve the names for 1594 gene regions. In the table below I summarize
a few of those genes, with sampling frequency calculated at the level of
species (see speciesSampling = TRUE
argument above).
Gene | Sampled in N species | PercentOfSampledSpecies |
---|---|---|
cytochrome b | 25 | 75.75758 |
NADH dehydrogenase subunit 5 | 14 | 42.42424 |
12S ribosomal RNA | 11 | 33.33333 |
cytochrome oxidase subunit 1 | 11 | 33.33333 |
growth hormone receptor | 11 | 33.33333 |
interphotoreceptor retinoid-binding protein | 10 | 30.30303 |
Thus, the gene.sampling.retrieve()
function provides an
estimate of the number of species in genbank that matches the taxonomic
criteria and have sequences for a given gene region. Note that the
estimates recovered by gene.sampling.retrieve()
are only as
good as the annotations that other researchers have provided for
sequences deposited in genbank.
From here, we will generate a preliminary table summarizing accession
numbers for the combination of taxa and gene regions that we’re
interested in sampling. However, note that not all these accession
numbers are expected to be in the final (curated) molecular dataset. For
instance, several sequences might be dropped later after taxonomic
information is curated. Now, we will assemble a species-level summary of
accession numbers using the acc.table.retrieve()
function.
For simplicity, this tutorial will focus on sampling gene regions that
are sampled in >30% of the species (targetGenes
data.frame
).
targetGenes <- gs.seqs[gs.seqs$PercentOfSampledSpecies > 30,]
acc.table <- acc.table.retrieve(
clades = c('Felis', 'Vulpes', 'Phoca'),
species = 'Manis_pentadactyla' ,
genes = targetGenes$Gene,
speciesLevel = TRUE
)
The acc.table
object is a data.frame
that
will be used below for downloading the relevant gene sequences. In this
case, the dataset includes the following information:
Species | Ti | Acc | gene |
---|---|---|---|
Felis silvestris | Felis silvestris silvestris isolate FS_101 NADH dehydrogenase subunit 5 (ND5) gene, partial cds; NADH dehydrogenase subunit 6 (ND6) gene, complete cds; tRNA-Glu gene, complete sequence; and cytochrome b (cytb) gene, partial cds; mitochondrial | OL654361 | cytochrome b |
Felis catus | Felis catus MKRaS008 mitochondrial gene for cytochrome b, partial cds | LC649705 | cytochrome b |
Felis chaus | Felis chaus isolate Jungle Cat 5 cytochrome b (cytb) gene, partial cds; mitochondrial | MN370575 | cytochrome b |
Felis environmental | Felis environmental sample isolate Kw-170 cytochrome b (cytb) gene, partial cds; mitochondrial | MK510873 | cytochrome b |
Felis margarita | Felis margarita haplotype AH NADH dehydrogenase subunit 5 (ND5) and cytochrome b (cytb) genes, partial cds; and tRNA-Thr gene and D-loop, partial sequence; mitochondrial | MK606132 | cytochrome b |
Felis bieti | Felis bieti cytochrome b gene, partial cds; mitochondrial | AY773081 | cytochrome b |
F.domesticus mitochondrial | F.domesticus mitochondrial cytochrome b gene | X82296 | cytochrome b |
Vulpes vulpes | Vulpes vulpes isolate LH198 haplotype FOX14 cytochrome b (CYTB) gene, partial cds; mitochondrial | MK244493 | cytochrome b |
Vulpes corsac | Vulpes corsac isolate SH21 cytochrome b (CYTB) gene, complete cds; mitochondrial | MT795179 | cytochrome b |
Vulpes zerda | Vulpes zerda isolate X161349 cytochrome b (Cytb) gene, partial cds; mitochondrial | MH854561 | cytochrome b |
Vulpes cana | Vulpes cana isolate B.F.Y3 cytochrome b (cytb) gene, partial cds; mitochondrial | KU378587 | cytochrome b |
Vulpes rueppellii | Vulpes rueppellii isolate R.F.Y6 cytochrome b (cytb) gene, partial cds; mitochondrial | KU378373 | cytochrome b |
Vulpes lagopus | Vulpes lagopus haplotype 5 cytochrome b (cytb) gene, partial cds; mitochondrial | KX093945 | cytochrome b |
Vulpes ferrilata | Vulpes ferrilata haplotype 1 cytochrome b (cytb) gene, partial cds; mitochondrial | EU872065 | cytochrome b |
Vulpes macrotis | Vulpes macrotis cytochrome b (cytb) gene, mitochondrial gene encoding mitochondrial protein, partial cds | AF028157 | cytochrome b |
Vulpes pallida | Vulpes pallida haplotype PMa2 cytochrome b (cytb) gene, partial cds; mitochondrial | KJ597964 | cytochrome b |
V.vulpes mitochondrial | V.vulpes mitochondrial DNA for cytochrome b (complete sequence) | X94929 | cytochrome b |
Phoca largha | Phoca largha PLCBRe4 mitochondrial cytb gene for cytochrome b, partial cds | LC466149 | cytochrome b |
Pagophilus groenlandicus | Pagophilus groenlandicus cytochrome b gene, partial cds; mitochondrial gene for mitochondrial product | AF200491 | cytochrome b |
Phoca groenlandica | Phoca groenlandica cytochrome b (cytb) gene, complete cds; mitochondrial | GU174609 | cytochrome b |
Phoca fasciata | Phoca fasciata cytochrome b (cytb) gene, complete cds; mitochondrial | GU167294 | cytochrome b |
Phoca vitulina | Phoca vitulina mitochondrial cytochrome b gene, partial cds | L19127 | cytochrome b |
P.vitulina mitochondrial | P.vitulina mitochondrial cytochrome b gene | X82306 | cytochrome b |
P.largha mitochondrial | P.largha mitochondrial cytochrome b gene | X82305 | cytochrome b |
P.groenlandica mitochondrial | P.groenlandica mitochondrial cytochrome b gene | X82303 | cytochrome b |
P.fasciata mitochondrial | P.fasciata mitochondrial cytochrome b gene | X82302 | cytochrome b |
Manis pentadactyla | Manis pentadactyla isolate ST08 cytochrome b (cytb) gene, partial cds; mitochondrial | MW197469 | cytochrome b |
Felis silvestris | Felis silvestris silvestris isolate FS_101 NADH dehydrogenase subunit 5 (ND5) gene, partial cds; NADH dehydrogenase subunit 6 (ND6) gene, complete cds; tRNA-Glu gene, complete sequence; and cytochrome b (cytb) gene, partial cds; mitochondrial | OL654361 | NADH dehydrogenase subunit 5 |
Felis catus | Felis catus isolate 4709-K NADH dehydrogenase subunit 5 (ND5) gene, partial cds; NADH dehydrogenase subunit 6 (ND6) gene, complete cds; tRNA-Glu gene, complete sequence; and cytochrome b (CYTB) gene, partial cds; mitochondrial | MN313781 | NADH dehydrogenase subunit 5 |
Felis margarita | Felis margarita haplotype AH NADH dehydrogenase subunit 5 (ND5) and cytochrome b (cytb) genes, partial cds; and tRNA-Thr gene and D-loop, partial sequence; mitochondrial | MK606132 | NADH dehydrogenase subunit 5 |
Felis chaus | Felis chaus isolate JCAIZ003 NADH dehydrogenase subunit 5 (ND5) gene, partial cds; mitochondrial | GU561700 | NADH dehydrogenase subunit 5 |
Vulpes lagopus | Vulpes lagopus ATP synthase F0 subunit 6 (ATP6), ATP synthase F0 subunit 8 (ATP8), cytochrome c oxidase subunit I (COX1), cytochrome c oxidase subunit II (COX2), cytochrome c oxidase subunit III (COX3), cytochrome b (CYTB), NADH dehydrogenase subunit 1 (ND1), NADH dehydrogenase subunit 2 (ND2), NADH dehydrogenase subunit 3 (ND3), NADH dehydrogenase subunit 4 (ND4), NADH dehydrogenase subunit 4L (ND4L), and NADH dehydrogenase subunit 5 (ND5) genes, complete cds; mitochondrial | AH014073 | NADH dehydrogenase subunit 5 |
Phoca groenlandica | Phoca groenlandica NADH dehydrogenase subunit 5 (ND5) gene, complete cds; mitochondrial gene for mitochondrial product | AY377376 | NADH dehydrogenase subunit 5 |
Phoca fasciata | Phoca fasciata NADH dehydrogenase subunit 5 (ND5) gene, complete cds; mitochondrial | GU167331 | NADH dehydrogenase subunit 5 |
Felis catus | Felis catus voucher N22b 12S ribosomal RNA gene, partial sequence; mitochondrial | KX786344 | 12S ribosomal RNA |
Felis chaus | Felis chaus isolate G 12S ribosomal RNA gene, partial sequence; mitochondrial | KU963205 | 12S ribosomal RNA |
Felis silvestris | Felis silvestris 12S ribosomal RNA gene, partial sequence; mitochondrial | KX002032 | 12S ribosomal RNA |
Felis bieti | Felis bieti 12S ribosomal RNA gene, partial sequence; mitochondrial | AY773084 | 12S ribosomal RNA |
Vulpes vulpes | Vulpes vulpes Vv1 mitochondrial gene for 12S ribosomal RNA, partial sequence | LC424764 | 12S ribosomal RNA |
Vulpes lagopus | Vulpes lagopus isolate FRT12 12S ribosomal RNA gene, partial sequence; mitochondrial | KM224240 | 12S ribosomal RNA |
Phoca fasciata | Phoca fasciata isolate 5888 12S ribosomal RNA gene, partial sequence; mitochondrial | GU174595 | 12S ribosomal RNA |
Phoca largha | Phoca largha isolate 06spotted03 12S ribosomal RNA gene, partial sequence; mitochondrial | GU174591 | 12S ribosomal RNA |
Manis pentadactyla | Manis pentadactyla 12S ribosomal RNA gene, partial sequence; and tRNA-Val gene, complete sequence; mitochondrial | AY012154 | 12S ribosomal RNA |
Felis catus | Felis catus voucher Cat_KU cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | MN124254 | cytochrome oxidase subunit 1 |
Felis nigripes | Felis nigripes voucher NZG:BWP38761 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | KX012677 | cytochrome oxidase subunit 1 |
Felis margarita | Felis margarita voucher 198_Fe_mar cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | KF297765 | cytochrome oxidase subunit 1 |
Vulpes vulpes | Vulpes vulpes voucher BIOUG |
JF443560 | cytochrome oxidase subunit 1 |
Vulpes chama | Vulpes chama voucher NZG:BWP38701 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | KX012672 | cytochrome oxidase subunit 1 |
Vulpes lagopus | Vulpes lagopus voucher HBL008485 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443554 | cytochrome oxidase subunit 1 |
Vulpes velox | Vulpes velox voucher ROM 105399 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443557 | cytochrome oxidase subunit 1 |
Phoca vitulina | Phoca vitulina voucher HBL008389 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443364 | cytochrome oxidase subunit 1 |
Phoca largha | Phoca largha voucher HBL008423 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443363 | cytochrome oxidase subunit 1 |
Phoca groenlandica | Phoca groenlandica voucher HBL008364 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443362 | cytochrome oxidase subunit 1 |
Manis pentadactyla | Manis pentadactyla isolate KFBG_HZ0050 cytochrome oxidase subunit 1 gene, partial cds; mitochondrial | KT428152 | cytochrome oxidase subunit 1 |
Felis catus | Felis catus growth hormone receptor (GHR) gene, partial cds | DQ205829 | growth hormone receptor |
Vulpes vulpes | Vulpes vulpes growth hormone receptor gene, exon 10 and partial cds | AY885401 | growth hormone receptor |
Vulpes macrotis | Vulpes macrotis growth hormone receptor gene, exon 10 and partial cds | AY885400 | growth hormone receptor |
Vulpes corsac | Vulpes corsac growth hormone receptor gene, exon 10 and partial cds | AY885399 | growth hormone receptor |
Vulpes zerda | Vulpes zerda growth hormone receptor gene, exon 10 and partial cds | AY885393 | growth hormone receptor |
Alopex lagopus | Alopex lagopus growth hormone receptor gene, exon 10 and partial cds | AY885379 | growth hormone receptor |
Vulpes velox | Vulpes velox growth hormone receptor (GHR) gene, partial cds | DQ205838 | growth hormone receptor |
Vulpes lagopus | Vulpes lagopus growth hormone receptor (GHR) gene, partial cds | DQ205837 | growth hormone receptor |
Phoca vitulina | Phoca vitulina growth hormone receptor (GHR) gene, partial cds | GU931127 | growth hormone receptor |
Phoca largha | Phoca largha growth hormone receptor (GHR) gene, partial cds | DQ205827 | growth hormone receptor |
Phoca groenlandica | Phoca groenlandica growth hormone receptor (GHR) gene, partial cds | DQ205825 | growth hormone receptor |
Manis pentadactyla | Manis pentadactyla growth hormone receptor (GHR) gene, exon 10 and partial cds | EU448992 | growth hormone receptor |
Felis catus | Felis catus interphotoreceptor retinoid binding protein gene, exon 1 | Z11811 | interphotoreceptor retinoid-binding protein |
Vulpes velox | Vulpes velox interphotoreceptor retinoid binding protein gene, partial cds | AF179293 | interphotoreceptor retinoid-binding protein |
Manis pentadactyla | Manis pentadactyla interphotoreceptor retinoid binding protein (IRBP) gene, exon 1 and partial cds | JN414784 | interphotoreceptor retinoid-binding protein |
Felis silvestris | Felis silvestris haplotype W23 tRNA-Pro gene and control region, partial sequence; mitochondrial | MF353436 | tRNA-Pro |
Felis catus | Felis catus mitochondrial tRNA-Pro gene and control region, partial sequence | AF348642 | tRNA-Pro |
Vulpes vulpes | Vulpes vulpes isolate VvUAE056 tRNA-Thr gene, partial sequence; tRNA-Pro gene, complete sequence; and D-loop, partial sequence; mitochondrial | MT955893 | tRNA-Pro |
Vulpes rueppellii | Vulpes rueppellii isolate VrKSAA0005 tRNA-Thr gene, partial sequence; tRNA-Pro gene, complete sequence; and D-loop, partial sequence; mitochondrial | MT955814 | tRNA-Pro |
Vulpes lagopus | Vulpes lagopus haplotype H11 tRNA-Pro gene and D-loop, partial sequence; mitochondrial | KX093931 | tRNA-Pro |
Vulpes macrotis | Vulpes macrotis haplotype Vmac1 tRNA-Pro gene and D-loop, partial sequence; mitochondrial | KJ846673 | tRNA-Pro |
Vulpes zerda | Vulpes zerda haplotype Vzer1 tRNA-Pro gene and D-loop, partial sequence; mitochondrial | KJ846672 | tRNA-Pro |
Vulpes ferrilata | Vulpes ferrilata haplotype 4 D-loop, partial sequence; tRNA-Pro gene, complete sequence; and tRNA-Thr gene, partial sequence; mitochondrial | JF520840 | tRNA-Pro |
Phoca largha | Phoca largha isolate pl27 tRNA-Thr gene, partial sequence; tRNA-Pro gene, complete sequence; and D-loop, partial sequence; mitochondrial | OM967017 | tRNA-Pro |
Phoca vitulina | Phoca vitulina isolate N tRNA-Thr (trnT) gene, partial sequence; tRNA-Pro (trnP) gene, complete sequence; and D-loop, partial sequence; mitochondrial | HQ702987 | tRNA-Pro |
Manis pentadactyla | Manis pentadactyla pentadactyla isolate MPP5 tRNA-Pro gene, partial sequence; D-loop, complete sequence; and tRNA-Phe gene, partial sequence; mitochondrial | GQ232081 | tRNA-Pro |
Feel free to review this dataset, make changes, add new species, samples, etc. The integrity of this dataset is critical for the next steps so please take your time and review it carefully. For instance, let’s just make some minor changes to our dataset:
acc.table$Species <- sub("P.", "Phoca ", acc.table$Species, fixed = TRUE)
acc.table$Species <- sub("F.", "Felis ", acc.table$Species, fixed = TRUE)
acc.table$Species <- sub("V.", "Vulpes ", acc.table$Species, fixed = TRUE)
acc.table$Species <- sub("mitochondrial", "", acc.table$Species)
row.names(acc.table) <- NULL
Let’s check how the new table looks now…
Species | Ti | Acc | gene |
---|---|---|---|
Felis silvestris | Felis silvestris silvestris isolate FS_101 NADH dehydrogenase subunit 5 (ND5) gene, partial cds; NADH dehydrogenase subunit 6 (ND6) gene, complete cds; tRNA-Glu gene, complete sequence; and cytochrome b (cytb) gene, partial cds; mitochondrial | OL654361 | cytochrome b |
Felis catus | Felis catus MKRaS008 mitochondrial gene for cytochrome b, partial cds | LC649705 | cytochrome b |
Felis chaus | Felis chaus isolate Jungle Cat 5 cytochrome b (cytb) gene, partial cds; mitochondrial | MN370575 | cytochrome b |
Felis environmental | Felis environmental sample isolate Kw-170 cytochrome b (cytb) gene, partial cds; mitochondrial | MK510873 | cytochrome b |
Felis margarita | Felis margarita haplotype AH NADH dehydrogenase subunit 5 (ND5) and cytochrome b (cytb) genes, partial cds; and tRNA-Thr gene and D-loop, partial sequence; mitochondrial | MK606132 | cytochrome b |
Felis bieti | Felis bieti cytochrome b gene, partial cds; mitochondrial | AY773081 | cytochrome b |
Felis domesticus | F.domesticus mitochondrial cytochrome b gene | X82296 | cytochrome b |
Vulpes vulpes | Vulpes vulpes isolate LH198 haplotype FOX14 cytochrome b (CYTB) gene, partial cds; mitochondrial | MK244493 | cytochrome b |
Vulpes corsac | Vulpes corsac isolate SH21 cytochrome b (CYTB) gene, complete cds; mitochondrial | MT795179 | cytochrome b |
Vulpes zerda | Vulpes zerda isolate X161349 cytochrome b (Cytb) gene, partial cds; mitochondrial | MH854561 | cytochrome b |
Vulpes cana | Vulpes cana isolate B.F.Y3 cytochrome b (cytb) gene, partial cds; mitochondrial | KU378587 | cytochrome b |
Vulpes rueppellii | Vulpes rueppellii isolate R.F.Y6 cytochrome b (cytb) gene, partial cds; mitochondrial | KU378373 | cytochrome b |
Vulpes lagopus | Vulpes lagopus haplotype 5 cytochrome b (cytb) gene, partial cds; mitochondrial | KX093945 | cytochrome b |
Vulpes ferrilata | Vulpes ferrilata haplotype 1 cytochrome b (cytb) gene, partial cds; mitochondrial | EU872065 | cytochrome b |
Vulpes macrotis | Vulpes macrotis cytochrome b (cytb) gene, mitochondrial gene encoding mitochondrial protein, partial cds | AF028157 | cytochrome b |
Vulpes pallida | Vulpes pallida haplotype PMa2 cytochrome b (cytb) gene, partial cds; mitochondrial | KJ597964 | cytochrome b |
Vulpes vulpes | V.vulpes mitochondrial DNA for cytochrome b (complete sequence) | X94929 | cytochrome b |
Phoca largha | Phoca largha PLCBRe4 mitochondrial cytb gene for cytochrome b, partial cds | LC466149 | cytochrome b |
Pagophilus groenlandicus | Pagophilus groenlandicus cytochrome b gene, partial cds; mitochondrial gene for mitochondrial product | AF200491 | cytochrome b |
Phoca groenlandica | Phoca groenlandica cytochrome b (cytb) gene, complete cds; mitochondrial | GU174609 | cytochrome b |
Phoca fasciata | Phoca fasciata cytochrome b (cytb) gene, complete cds; mitochondrial | GU167294 | cytochrome b |
Phoca vitulina | Phoca vitulina mitochondrial cytochrome b gene, partial cds | L19127 | cytochrome b |
Phoca vitulina | P.vitulina mitochondrial cytochrome b gene | X82306 | cytochrome b |
Phoca largha | P.largha mitochondrial cytochrome b gene | X82305 | cytochrome b |
Phoca groenlandica | P.groenlandica mitochondrial cytochrome b gene | X82303 | cytochrome b |
Phoca fasciata | P.fasciata mitochondrial cytochrome b gene | X82302 | cytochrome b |
Manis pentadactyla | Manis pentadactyla isolate ST08 cytochrome b (cytb) gene, partial cds; mitochondrial | MW197469 | cytochrome b |
Felis silvestris | Felis silvestris silvestris isolate FS_101 NADH dehydrogenase subunit 5 (ND5) gene, partial cds; NADH dehydrogenase subunit 6 (ND6) gene, complete cds; tRNA-Glu gene, complete sequence; and cytochrome b (cytb) gene, partial cds; mitochondrial | OL654361 | NADH dehydrogenase subunit 5 |
Felis catus | Felis catus isolate 4709-K NADH dehydrogenase subunit 5 (ND5) gene, partial cds; NADH dehydrogenase subunit 6 (ND6) gene, complete cds; tRNA-Glu gene, complete sequence; and cytochrome b (CYTB) gene, partial cds; mitochondrial | MN313781 | NADH dehydrogenase subunit 5 |
Felis margarita | Felis margarita haplotype AH NADH dehydrogenase subunit 5 (ND5) and cytochrome b (cytb) genes, partial cds; and tRNA-Thr gene and D-loop, partial sequence; mitochondrial | MK606132 | NADH dehydrogenase subunit 5 |
Felis chaus | Felis chaus isolate JCAIZ003 NADH dehydrogenase subunit 5 (ND5) gene, partial cds; mitochondrial | GU561700 | NADH dehydrogenase subunit 5 |
Vulpes lagopus | Vulpes lagopus ATP synthase F0 subunit 6 (ATP6), ATP synthase F0 subunit 8 (ATP8), cytochrome c oxidase subunit I (COX1), cytochrome c oxidase subunit II (COX2), cytochrome c oxidase subunit III (COX3), cytochrome b (CYTB), NADH dehydrogenase subunit 1 (ND1), NADH dehydrogenase subunit 2 (ND2), NADH dehydrogenase subunit 3 (ND3), NADH dehydrogenase subunit 4 (ND4), NADH dehydrogenase subunit 4L (ND4L), and NADH dehydrogenase subunit 5 (ND5) genes, complete cds; mitochondrial | AH014073 | NADH dehydrogenase subunit 5 |
Phoca groenlandica | Phoca groenlandica NADH dehydrogenase subunit 5 (ND5) gene, complete cds; mitochondrial gene for mitochondrial product | AY377376 | NADH dehydrogenase subunit 5 |
Phoca fasciata | Phoca fasciata NADH dehydrogenase subunit 5 (ND5) gene, complete cds; mitochondrial | GU167331 | NADH dehydrogenase subunit 5 |
Felis catus | Felis catus voucher N22b 12S ribosomal RNA gene, partial sequence; mitochondrial | KX786344 | 12S ribosomal RNA |
Felis chaus | Felis chaus isolate G 12S ribosomal RNA gene, partial sequence; mitochondrial | KU963205 | 12S ribosomal RNA |
Felis silvestris | Felis silvestris 12S ribosomal RNA gene, partial sequence; mitochondrial | KX002032 | 12S ribosomal RNA |
Felis bieti | Felis bieti 12S ribosomal RNA gene, partial sequence; mitochondrial | AY773084 | 12S ribosomal RNA |
Vulpes vulpes | Vulpes vulpes Vv1 mitochondrial gene for 12S ribosomal RNA, partial sequence | LC424764 | 12S ribosomal RNA |
Vulpes lagopus | Vulpes lagopus isolate FRT12 12S ribosomal RNA gene, partial sequence; mitochondrial | KM224240 | 12S ribosomal RNA |
Phoca fasciata | Phoca fasciata isolate 5888 12S ribosomal RNA gene, partial sequence; mitochondrial | GU174595 | 12S ribosomal RNA |
Phoca largha | Phoca largha isolate 06spotted03 12S ribosomal RNA gene, partial sequence; mitochondrial | GU174591 | 12S ribosomal RNA |
Manis pentadactyla | Manis pentadactyla 12S ribosomal RNA gene, partial sequence; and tRNA-Val gene, complete sequence; mitochondrial | AY012154 | 12S ribosomal RNA |
Felis catus | Felis catus voucher Cat_KU cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | MN124254 | cytochrome oxidase subunit 1 |
Felis nigripes | Felis nigripes voucher NZG:BWP38761 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | KX012677 | cytochrome oxidase subunit 1 |
Felis margarita | Felis margarita voucher 198_Fe_mar cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | KF297765 | cytochrome oxidase subunit 1 |
Vulpes vulpes | Vulpes vulpes voucher BIOUG |
JF443560 | cytochrome oxidase subunit 1 |
Vulpes chama | Vulpes chama voucher NZG:BWP38701 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | KX012672 | cytochrome oxidase subunit 1 |
Vulpes lagopus | Vulpes lagopus voucher HBL008485 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443554 | cytochrome oxidase subunit 1 |
Vulpes velox | Vulpes velox voucher ROM 105399 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443557 | cytochrome oxidase subunit 1 |
Phoca vitulina | Phoca vitulina voucher HBL008389 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443364 | cytochrome oxidase subunit 1 |
Phoca largha | Phoca largha voucher HBL008423 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443363 | cytochrome oxidase subunit 1 |
Phoca groenlandica | Phoca groenlandica voucher HBL008364 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial | JF443362 | cytochrome oxidase subunit 1 |
Manis pentadactyla | Manis pentadactyla isolate KFBG_HZ0050 cytochrome oxidase subunit 1 gene, partial cds; mitochondrial | KT428152 | cytochrome oxidase subunit 1 |
Felis catus | Felis catus growth hormone receptor (GHR) gene, partial cds | DQ205829 | growth hormone receptor |
Vulpes vulpes | Vulpes vulpes growth hormone receptor gene, exon 10 and partial cds | AY885401 | growth hormone receptor |
Vulpes macrotis | Vulpes macrotis growth hormone receptor gene, exon 10 and partial cds | AY885400 | growth hormone receptor |
Vulpes corsac | Vulpes corsac growth hormone receptor gene, exon 10 and partial cds | AY885399 | growth hormone receptor |
Vulpes zerda | Vulpes zerda growth hormone receptor gene, exon 10 and partial cds | AY885393 | growth hormone receptor |
Alopex lagopus | Alopex lagopus growth hormone receptor gene, exon 10 and partial cds | AY885379 | growth hormone receptor |
Vulpes velox | Vulpes velox growth hormone receptor (GHR) gene, partial cds | DQ205838 | growth hormone receptor |
Vulpes lagopus | Vulpes lagopus growth hormone receptor (GHR) gene, partial cds | DQ205837 | growth hormone receptor |
Phoca vitulina | Phoca vitulina growth hormone receptor (GHR) gene, partial cds | GU931127 | growth hormone receptor |
Phoca largha | Phoca largha growth hormone receptor (GHR) gene, partial cds | DQ205827 | growth hormone receptor |
Phoca groenlandica | Phoca groenlandica growth hormone receptor (GHR) gene, partial cds | DQ205825 | growth hormone receptor |
Manis pentadactyla | Manis pentadactyla growth hormone receptor (GHR) gene, exon 10 and partial cds | EU448992 | growth hormone receptor |
Felis catus | Felis catus interphotoreceptor retinoid binding protein gene, exon 1 | Z11811 | interphotoreceptor retinoid-binding protein |
Vulpes velox | Vulpes velox interphotoreceptor retinoid binding protein gene, partial cds | AF179293 | interphotoreceptor retinoid-binding protein |
Manis pentadactyla | Manis pentadactyla interphotoreceptor retinoid binding protein (IRBP) gene, exon 1 and partial cds | JN414784 | interphotoreceptor retinoid-binding protein |
Felis silvestris | Felis silvestris haplotype W23 tRNA-Pro gene and control region, partial sequence; mitochondrial | MF353436 | tRNA-Pro |
Felis catus | Felis catus mitochondrial tRNA-Pro gene and control region, partial sequence | AF348642 | tRNA-Pro |
Vulpes vulpes | Vulpes vulpes isolate VvUAE056 tRNA-Thr gene, partial sequence; tRNA-Pro gene, complete sequence; and D-loop, partial sequence; mitochondrial | MT955893 | tRNA-Pro |
Vulpes rueppellii | Vulpes rueppellii isolate VrKSAA0005 tRNA-Thr gene, partial sequence; tRNA-Pro gene, complete sequence; and D-loop, partial sequence; mitochondrial | MT955814 | tRNA-Pro |
Vulpes lagopus | Vulpes lagopus haplotype H11 tRNA-Pro gene and D-loop, partial sequence; mitochondrial | KX093931 | tRNA-Pro |
Vulpes macrotis | Vulpes macrotis haplotype Vmac1 tRNA-Pro gene and D-loop, partial sequence; mitochondrial | KJ846673 | tRNA-Pro |
Vulpes zerda | Vulpes zerda haplotype Vzer1 tRNA-Pro gene and D-loop, partial sequence; mitochondrial | KJ846672 | tRNA-Pro |
Vulpes ferrilata | Vulpes ferrilata haplotype 4 D-loop, partial sequence; tRNA-Pro gene, complete sequence; and tRNA-Thr gene, partial sequence; mitochondrial | JF520840 | tRNA-Pro |
Phoca largha | Phoca largha isolate pl27 tRNA-Thr gene, partial sequence; tRNA-Pro gene, complete sequence; and D-loop, partial sequence; mitochondrial | OM967017 | tRNA-Pro |
Phoca vitulina | Phoca vitulina isolate N tRNA-Thr (trnT) gene, partial sequence; tRNA-Pro (trnP) gene, complete sequence; and D-loop, partial sequence; mitochondrial | HQ702987 | tRNA-Pro |
Manis pentadactyla | Manis pentadactyla pentadactyla isolate MPP5 tRNA-Pro gene, partial sequence; D-loop, complete sequence; and tRNA-Phe gene, partial sequence; mitochondrial | GQ232081 | tRNA-Pro |
Now, since we’re going to retrieve sequences from genbank using an
existing preliminary accession numbers table, we will use the
sq.retrieve.indirect()
function in phruta
. I’m
going to spend some time in here to explain the differences between the
two versions of sq.retrieve.*
in phruta
. The
one that we’re using in this tutorial,
sq.retrieve.indirect()
, retrieves sequences “indirectly”
because it follows the initial step of generating a table summarizing
accession numbers (see the acc.table.retrieve()
function
above). I present the information in this vignette using
sq.retrieve.indirect()
instead of
sq.retrieve.direct()
because the first function is way more
flexible and allows for correcting issues prior to download any
sequence. For instance, you can add new sequences, species, populations
to the resulting data.frame from acc.table.retrieve()
.
Additionally, you could even manually assemble your own dataset of
accession numbers to be retrieved using
sq.retrieve.indirect()
. Instead,
sq.retrieve.direct()
does its best to directly
(i.e. without potential input from the user) retrieve sequences for a
target set of taxa and set of gene regions. In short, you should be able
to catch errors using sq.retrieve.indirect()
but mistakes
will be harder to spot and fix if you’re using
sq.retrieve.direct()
. Note that the functionality of
sq.retrieve.direct()
is outlined in the “Using
phruta
with defined target genes” vignette.
We still need to retrieve all the sequences from the accessions table
that was generated avobe using acc.table
. The
sq.retrieve.indirect()
function will write all the
resulting fasta
files into a newly created folder
0.Sequences
located in our working directory (please check
the download.sqs = TRUE
argument).
sq.retrieve.indirect(acc.table, download.sqs = TRUE)
Next, we’re going to make sure that we include only sequences that
are reliable and from species that we are actually interested in
analyzing. For this, we will be using the sq.curate()
function. We need to provide a list of taxonomic names to filter out
incorrect sequences (filterTaxonomicCriteria
argument). For
simplicity, our criteria can be the genera that we’re interested in
analyzing. Note that the outgroup’s name should also be included in the
list. If the taxonomic information for a sequence retrieved from genbank
does not match with any of these strings, this species will be dropped.
You will have to specify whether sampling is for animals or plants
(kingdom
argument). Finally, you might have already noticed
that the same gene regions can have different names. For instance,
sometimes our searches retrieve both “cytochrome oxidase subunit 1” and
“cytochrome c oxidase subunit I” as widely sampled genes for the target
species. In that case, we can combine the sequences in these two files
into a single file name COI
. To merge gene files, you will
have to provide a named list to the mergeGeneFiles
argument
of the sq.curate
function. This named list
(tb.merged
below) will have a length that corresponds to
the number of final files that should be constructed.
tb.merged <- list('COI' = c("cytochrome oxidase subunit 1", "cytochrome c oxidase subunit I"))
sq.curate(filterTaxonomicCriteria = 'Felis|Vulpes|Phoca|Manis',
mergeGeneFiles = tb.merged,
kingdom = 'animals',
folder = '0.Sequences',
removeOutliers = FALSE)
Running the line of code above will create the a folder
1.CuratedSequences
containing (1) the curated sequences
with original names, (2) the curated sequences with species-level names
(renamed_*
prefix), (3) a table of accession numbers
(0.AccessionTable.csv
), and (4) a summary of the taxonomic
information for all the species sampled in the files
(1.Taxonomy.csv
). We’ll use the renamed_*
and
1.Taxonomy.csv
files in the next steps. Let’s take a look
at the sampling per gene region in the 0.AccessionTable.csv
table.
OriginalNames | AccN | Species | file | OldSpecies |
---|---|---|---|---|
KX786344 Felis catus | KX786344 | Felis_catus | 12S ribosomal RNA.fasta | Felis_catus |
KU963205 Felis chaus | KU963205 | Felis_chaus | 12S ribosomal RNA.fasta | Felis_chaus |
KX002032 Felis silvestris | KX002032 | Felis_silvestris | 12S ribosomal RNA.fasta | Felis_silvestris |
AY773084 Felis bieti | AY773084 | Felis_bieti | 12S ribosomal RNA.fasta | Felis_bieti |
LC424764 Vulpes vulpes | LC424764 | Vulpes_vulpes | 12S ribosomal RNA.fasta | Vulpes_vulpes |
KM224240 Vulpes lagopus | KM224240 | Vulpes_lagopus | 12S ribosomal RNA.fasta | Vulpes_lagopus |
GU174595 Phoca fasciata | GU174595 | Histriophoca_fasciata | 12S ribosomal RNA.fasta | Phoca_fasciata |
GU174591 Phoca largha | GU174591 | Phoca_largha | 12S ribosomal RNA.fasta | Phoca_largha |
AY012154 Manis pentadactyla | AY012154 | Manis_pentadactyla | 12S ribosomal RNA.fasta | Manis_pentadactyla |
OL654361 Felis silvestris | OL654361 | Felis_silvestris | cytochrome b.fasta | Felis_silvestris |
LC649705 Felis catus | LC649705 | Felis_catus | cytochrome b.fasta | Felis_catus |
MN370575 Felis chaus | MN370575 | Felis_chaus | cytochrome b.fasta | Felis_chaus |
MK606132 Felis margarita | MK606132 | Felis_margarita | cytochrome b.fasta | Felis_margarita |
AY773081 Felis bieti | AY773081 | Felis_bieti | cytochrome b.fasta | Felis_bieti |
MK244493 Vulpes vulpes | MK244493 | Vulpes_vulpes | cytochrome b.fasta | Vulpes_vulpes |
MT795179 Vulpes corsac | MT795179 | Vulpes_corsac | cytochrome b.fasta | Vulpes_corsac |
MH854561 Vulpes zerda | MH854561 | Vulpes_zerda | cytochrome b.fasta | Vulpes_zerda |
KU378587 Vulpes cana | KU378587 | Vulpes_cana | cytochrome b.fasta | Vulpes_cana |
KU378373 Vulpes rueppellii | KU378373 | Vulpes_rueppellii | cytochrome b.fasta | Vulpes_rueppellii |
KX093945 Vulpes lagopus | KX093945 | Vulpes_lagopus | cytochrome b.fasta | Vulpes_lagopus |
EU872065 Vulpes ferrilata | EU872065 | Vulpes_ferrilata | cytochrome b.fasta | Vulpes_ferrilata |
AF028157 Vulpes macrotis | AF028157 | Vulpes_macrotis | cytochrome b.fasta | Vulpes_macrotis |
KJ597964 Vulpes pallida | KJ597964 | Vulpes_pallida | cytochrome b.fasta | Vulpes_pallida |
LC466149 Phoca largha | LC466149 | Phoca_largha | cytochrome b.fasta | Phoca_largha |
GU174609 Phoca groenlandica | GU174609 | Pagophilus_groenlandicus | cytochrome b.fasta | Phoca_groenlandica |
GU167294 Phoca fasciata | GU167294 | Histriophoca_fasciata | cytochrome b.fasta | Phoca_fasciata |
L19127 Phoca vitulina | L19127 | Phoca_vitulina | cytochrome b.fasta | Phoca_vitulina |
MW197469 Manis pentadactyla | MW197469 | Manis_pentadactyla | cytochrome b.fasta | Manis_pentadactyla |
MN124254 Felis catus | MN124254 | Felis_catus | cytochrome oxidase subunit 1.fasta | Felis_catus |
KX012677 Felis nigripes | KX012677 | Felis_nigripes | cytochrome oxidase subunit 1.fasta | Felis_nigripes |
KF297765 Felis margarita | KF297765 | Felis_margarita | cytochrome oxidase subunit 1.fasta | Felis_margarita |
JF443560 Vulpes vulpes | JF443560 | Vulpes_vulpes | cytochrome oxidase subunit 1.fasta | Vulpes_vulpes |
KX012672 Vulpes chama | KX012672 | Vulpes_chama | cytochrome oxidase subunit 1.fasta | Vulpes_chama |
JF443554 Vulpes lagopus | JF443554 | Vulpes_lagopus | cytochrome oxidase subunit 1.fasta | Vulpes_lagopus |
JF443557 Vulpes velox | JF443557 | Vulpes_velox | cytochrome oxidase subunit 1.fasta | Vulpes_velox |
JF443364 Phoca vitulina | JF443364 | Phoca_vitulina | cytochrome oxidase subunit 1.fasta | Phoca_vitulina |
JF443363 Phoca largha | JF443363 | Phoca_largha | cytochrome oxidase subunit 1.fasta | Phoca_largha |
JF443362 Phoca groenlandica | JF443362 | Pagophilus_groenlandicus | cytochrome oxidase subunit 1.fasta | Phoca_groenlandica |
KT428152 Manis pentadactyla | KT428152 | Manis_pentadactyla | cytochrome oxidase subunit 1.fasta | Manis_pentadactyla |
DQ205829 Felis catus | DQ205829 | Felis_catus | growth hormone receptor.fasta | Felis_catus |
AY885401 Vulpes vulpes | AY885401 | Vulpes_vulpes | growth hormone receptor.fasta | Vulpes_vulpes |
AY885400 Vulpes macrotis | AY885400 | Vulpes_macrotis | growth hormone receptor.fasta | Vulpes_macrotis |
AY885399 Vulpes corsac | AY885399 | Vulpes_corsac | growth hormone receptor.fasta | Vulpes_corsac |
AY885393 Vulpes zerda | AY885393 | Vulpes_zerda | growth hormone receptor.fasta | Vulpes_zerda |
AY885379 Alopex lagopus | AY885379 | Vulpes_lagopus | growth hormone receptor.fasta | Alopex_lagopus |
DQ205838 Vulpes velox | DQ205838 | Vulpes_velox | growth hormone receptor.fasta | Vulpes_velox |
GU931127 Phoca vitulina | GU931127 | Phoca_vitulina | growth hormone receptor.fasta | Phoca_vitulina |
DQ205827 Phoca largha | DQ205827 | Phoca_largha | growth hormone receptor.fasta | Phoca_largha |
DQ205825 Phoca groenlandica | DQ205825 | Pagophilus_groenlandicus | growth hormone receptor.fasta | Phoca_groenlandica |
EU448992 Manis pentadactyla | EU448992 | Manis_pentadactyla | growth hormone receptor.fasta | Manis_pentadactyla |
OL654361 Felis silvestris | OL654361 | Felis_silvestris | NADH dehydrogenase subunit 5.fasta | Felis_silvestris |
MN313781 Felis catus | MN313781 | Felis_catus | NADH dehydrogenase subunit 5.fasta | Felis_catus |
MK606132 Felis margarita | MK606132 | Felis_margarita | NADH dehydrogenase subunit 5.fasta | Felis_margarita |
GU561700 Felis chaus | GU561700 | Felis_chaus | NADH dehydrogenase subunit 5.fasta | Felis_chaus |
AH014073 Vulpes lagopus | AH014073 | Vulpes_lagopus | NADH dehydrogenase subunit 5.fasta | Vulpes_lagopus |
AY377376 Phoca groenlandica | AY377376 | Pagophilus_groenlandicus | NADH dehydrogenase subunit 5.fasta | Phoca_groenlandica |
GU167331 Phoca fasciata | GU167331 | Histriophoca_fasciata | NADH dehydrogenase subunit 5.fasta | Phoca_fasciata |
MF353436 Felis silvestris | MF353436 | Felis_silvestris | tRNA-Pro.fasta | Felis_silvestris |
AF348642 Felis catus | AF348642 | Felis_catus | tRNA-Pro.fasta | Felis_catus |
MT955893 Vulpes vulpes | MT955893 | Vulpes_vulpes | tRNA-Pro.fasta | Vulpes_vulpes |
MT955814 Vulpes rueppellii | MT955814 | Vulpes_rueppellii | tRNA-Pro.fasta | Vulpes_rueppellii |
KX093931 Vulpes lagopus | KX093931 | Vulpes_lagopus | tRNA-Pro.fasta | Vulpes_lagopus |
KJ846673 Vulpes macrotis | KJ846673 | Vulpes_macrotis | tRNA-Pro.fasta | Vulpes_macrotis |
KJ846672 Vulpes zerda | KJ846672 | Vulpes_zerda | tRNA-Pro.fasta | Vulpes_zerda |
JF520840 Vulpes ferrilata | JF520840 | Vulpes_ferrilata | tRNA-Pro.fasta | Vulpes_ferrilata |
OM967017 Phoca largha | OM967017 | Phoca_largha | tRNA-Pro.fasta | Phoca_largha |
HQ702987 Phoca vitulina | HQ702987 | Phoca_vitulina | tRNA-Pro.fasta | Phoca_vitulina |
GQ232081 Manis pentadactyla | GQ232081 | Manis_pentadactyla | tRNA-Pro.fasta | Manis_pentadactyla |
We’ll now align the sequences that we just curated. For this, we just
use sq.aln()
with default parameters. We need to indicate
that we’re interested in aligning only the "renamed"
fasta
files in our 1.CuratedSequences
folder.
sq.aln(folder = '1.CuratedSequences', FilePatterns = "renamed")
The resulting multiple sequence alignments will be saved to the
2.Alignments
folder. In that new folder, we will have two
types of files: (1) raw alignments (same file names as in
1.CuratedSequences
) and (2) alignments with ambiguous sites
removed (Masked_*
prefix). Masked alignments are only
created if the mask
argument in sq.aln
is set
to TRUE
. In that case, one additional .csv
file is created for each of the alignments
(0.Masked.Information_*
). Each of these datasets list the
number of sites in the masked alignment that (1) are not gaps
(NonGaps
column), (2) if the sequence was removed due to
the elevated number of gaps (removedPerGaps
; controlled
using the threshold
argument in sq.aln
), or
(3) if it was removed directly in the masking step
(removedMasking
). Note that, for some gene regions, making
can fail. In that case, only the original alignment file is saved to the
2.Alignments
folder.
Note that we could use these resulting alignments directly to infer
our phylogenies. We cover these steps within phruta
in
another vignette: “Phylogenetics with the phruta
R
package”. For now, let’s wrap up and plot one of our (cool) alignments.
Let’s first check the raw alignments!
Now, the masked alignments…!!
And we’re done for now!! Thanks for following this tutorial…:)
In total, this vignette took 11 minutes to render in my local
machine. You can now try to run phruta
using your favorite
groups organisms! Don’t forget to check the other tutorials and get in
touch if you find any issues…Buena suerte!