The first step to running a phylogenetic analysis is the identification of overlapping sequences. Often orthology is determined by pairing sequences whose gene names match (e.g. COI sequences with COI sequences, rbcl sequences with rbcl sequences). Problems can arise however if gene names differ between authors, if different gene sections are represented or if sequences are mislabelled. These issues can be especially problematic for large-scale analyses where individual errors cannot be detected.
PhyLoTa is a pipeline that uses an alignment search tool to identify orthologous sequences without the need for gene name matching. For a given parental taxonomic group, the pipeline will search through available sequences hosted on GenBank and identify orthologous sequence clusters. A user is then able to survey the identified clusters and select the ones which best suit their phylogenetic analysis needs, e.g. by selecting the clusters that maximise the number of taxonomic groups.
This R pacakge,
phylotaR, is an R implementation of this pipeline. In this vignette we will demonstrate how to run PhyLoTa using a small taxonomic group. The pipeline is composed of four automated stages (taxise, download, cluster, cluster2) and a final user-performed stage of cluster selection.
The PhyLoTa pipeline uses BLAST to identify orthologous sequence clusters. In order to run phylotaR, a local copy of the BLAST software must be installed on your computer. Installing the phylotaR package does not install BLAST, it must be installed separately. To install BLAST+, please see the NCBI website’s installation instructions.
For demonstration purposes we will run the pipeline on a small taxonomic group. Because they are charismatic and relatively well-studied, we will select the Night Monkey genus, Aotus. Now that we have decided on a taxonomic group we need to find out its unique taxonomic ID. This can be looked up by navigating to the NCBI taxonomy webpage and searching ‘Aotus’. Doing this, we can see that Aotus ID is 9504. We will need this number for specifying the parameters in our pipeline. (Notice, that there is also a plant genus called Aotus.)
To begin a run, we will need to create a new folder that will contain all the output files generated by the
phylotaR pipeline. Since we are running the analysis on the Aotus genus, let’s call the folder
aotus/. Now we have our working directory folder created, we can now open R and run the following code.
The above imports the
phylotaR package and initiates a cache that will contain the pipeline parameters. For this tutorial we will keep the parameters as their default. See the function
parameters() for a complete list and description of all the parameters and their default values. For more detailed information on the parameters please see the publication, phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R.
wd must be a file path to the folder we called
ncbi_dr must be a file path to the folder containing all the NCBI BLAST+ tools – see above ‘Installing NCBI BLAST+ Tools’. Depending on your system and how you installed the tools, they may be in your system path in which case you can simply supply ‘.’ to the
ncbi_dr argument. On my computer I provide the path to the where the
blastn executable is located, e.g.
setup() will verify whether the BLAST tools are installed correctly.
setup() has been run we can run the pipeline with the following command.
run(wd = wd)
This will run all the automated stages of the pipeline: taxise, download, cluster and cluster2. The first of these stages looks up all the taxonomic information available on the descendants of the parent ID provided,
txid. The second downloads representative sequences for all identified descendants. No additional arguments are required other than
wd which specifies the working directory that contains the cache and all parameters as set up by
setup(). In this folder you will also find a
log.txt that reports detailed information on the progression of the pipeline as well as all the output files generated by each stage. Additionally, you will see session info and a blast version text files. These files, along with the log, can help debugging if any errors occur. The whole pipeline can complete in around 2 minutes for Aotus using default parameters. Aotus, however, is a genus of only 13 taxa, larger clades will take much longer particularly during the download stage.
The pipeline can be halted and restarted. The cache records all downloaded and generated data by the pipeline. If there is a system crash or the user wishes to halt the program, the pipeline can be restarted from the same point it stopped with the function
restart(). Additionally, due to the potential random nature of the pipeline, a user may wish to re-run the pipeline from certain stages. This can be achived by first using
reset() followed by
restart(). For example, in the code below a completed pipeline is reset to ‘cluster’ and then restarted. After running these commands, the pipeline will run as if it has only just completed the download stage. Note, all resets and restarts are recorded in the log.
Paramaters can always be set by a user at the initiation of a folder with the
setup() function. To change the parameter values after a folder has already been set up, a user can use
parameters_reset(). For example, if the download stage is taking particularly long, the
btchsz could be increased. This would raise the number of sequences downloaded per request. (Note, too high a
btchsz may cause your NCBI Entrez access being limited.)
After a pipeline has completed, the identified clusters can be interrogated. We can generate a phylota object using
read_phylota() but in the code below we will load a pre-existing phylota object from the package data. The phylota object contains cluster, sequence and taxonomic information on all the clusters. It has 6 data slots: cids, sids, txids, txdct, sqs, clstrs, prnt_id and prnt_nm. Each of these slots can be accessed with
@, see ?`Phylota-class` for more information. The
phylotaR package has a range of functions for probing clusters in a phylota object. For example, if we want to know how many different taxonomic groups are represented by each cluster we can use
##  "Phylota Table (Aotus)\n-  clusters\n-  sequences\n-  source taxa\n"
# otherwise, run: # all_clusters <- read_phylota(wd) cids <- all_clusters@cids n_taxa <- get_ntaxa(phylota = all_clusters, cid = cids)
We can then drop all the clusters with fewer than 6 taxa and create a new phylota object using the
drop_cls() function. Let’s then take a peak of the now smaller object’s clusters using
keep <- cids[n_taxa > 6] selected <- drop_clstrs(phylota = all_clusters, cid = keep) smmry <- summary(selected) print(smmry)
## ID Type Seed Parent N_taxa N_seqs Med_sql MAD ## 1 4 subtree U36844.1 9504 9 41 549 0.8397112 ## 2 9 subtree AJ489745.1 9504 10 29 1140 1.0000000 ## 3 29 subtree AF338374.1/177..829 9504 8 10 653 0.9658499 ## Definition Feature ## 1 gene (0.1), mitochondrial (0.1) coii (1) ## 2 aotus (0.1), cytochrome (0.1) cytb (0.4), cytochrome (0.3) ## 3 sry (0.2), aotus (0.1) sry (1)
This summary provides information on each cluster in the phylota object, such as median sequence length, MAD score (the mean alignment density, values closer to 1 indicate all the sequences are of a similar length), most common words in the sequence descriptions and feature names. Let’s select the second ID in this table for further investigation. We can extract its cluster and sequences records in the following way.
cid <- smmry[2, 'ID'] # get the cluster record cluster_record <- selected@clstrs[[cid]] # use the seq. IDs to get the sequence records seq_records <- selected@sqs[cluster_record@sids] # extract a single record seq_record <- seq_records[[seq_records@ids[]]] summary(seq_record)
## SeqRec [ID: AJ489745.1]
##  "atgacttctccccgcaaaacacacccactaacaaagatcattaacgaatcattcattgatctacccacaccacccaacat"
We could extract and write out each of the sequences for a cluster in the above manner. Handily, however,
phylotaR comes with some tools to make outputting sequences easier. First because there are multiple sequences per taxon, we need to select a single representative sequence. We can do this with the
drop_by_rank() function. With this function we choose a taxonomic rank at which we would like our sequences to be represented. The function then chooses the ‘best’ sequence representing each taxon for that rank using a range of criteria. With this new phylota object, we can then extract the scientific names and write out the sequences.
# choose best sequence per species reduced <- drop_by_rank(phylota = selected, rnk = 'species', n = 1) # get txids at the species level for each sequence txids <- get_txids(phylota = reduced, cid = cid, rnk = 'species') # look up name for txids scientific_names <- get_tx_slot(phylota = reduced, txid = txids, slt_nm = 'scnm') # clean the names scientific_names <- gsub('\\.', '', scientific_names) scientific_names <- gsub('\\s+', '_', scientific_names) print(scientific_names)
## AJ489745.1 DQ098865.1 DQ098869.1 ## "Aotus_nancymaae" "Aotus_azarai" "Aotus_griseimembra" ## DQ098873.1 HQ005497.1 HQ005502.1 ## "Aotus_trivirgatus" "Aotus_nigriceps" "Aotus_vociferans" ## HQ005506.1 KR528418.1/1..1140 ## "Aotus_lemurinus" "Aotus_sp"
We can sanity check our cluster sequences by running a very quick phylogenetic analysis using mafft and raxml. The below code will use the cluster to generate an alignment and a tree through R. In order for the code to run, it requires the installation of mafft and raxml and, additionally, may require tweaking to work on your system.