What really goes into sausages?
Research Problem Background
General problem with any metagenomics study is that the sequenced sample consists of multiple, sometimes very numerous, complex organisms. Given that sequence reads are relatively short and organisms may be closely related, it becomes clear that accurate mapping is needed in order to determine which read originates from which organism.
Business Problem Background
Malicious or accidental crisis in food processing industry nowadays have cross-regional, national or international reach, and are certainly followed by significant risk to public health and safety along with substantial financial loss to consumers or businesses. Biosurveillance emerged as the first line of defense against threats of potential disease outbreak. But it comes with a cost. Enormous expert man power and computational resources are needed, since multiple DNA references need to be checked against all reads. This is often a very slow process since reference index must be created for each genome of interest, which is both computationally and financially expensive.
Here we demonstrate use of alternative approach, utilizing free Centrifuge toolkit, to map reads to all given references at the same time. Contrary to what one might believe, this process is very fast since this toolkit uses one single index for all references, which is very small (~4GB for all given organisms in this task).
Case
We developed a straightforward workflow for this case (see diagram below). We want to demonstrate that running a well structured underlying pipeline can easily be done in the cloud. We do this by building a web interface for successfully building and executing tools for this case.
Sequenced reads, provided in file Case2.fasta were mapped to hybrid index created by Centrifuge-build tool from six genomes obtained from NCBI. Additional files we used:
- names.dmp and nodes.dmp, obtained from this link.
- taxid2seqid.map, manually created for six genomes. Python jupyter notebook used for generating taxid2seqid.map is presented below link
Following command line was used for this step:
centrifuge-build -p <num_threads> --conversion-table <taxid2seqid.map> --taxonomy-tree <nodes.dmp> --name-table <names.dmp> <fasta.list> <output_base_name>
Once built, we used index to map reads from input fasta file to it. This is done in the matter of minutes. The tool produces classification file along with summary metrics. Command line for running centrifuge is below.
centrifuge --report-file <input_file-1.Centrifuge_report.tsv> -x $basename
-U </path/to/input_file-1.ext>,</path/to/input_file-2.ext> -S <input_file-1.Classification_result.txt>
Results
Classification results file shows that even though majority of reads has been uniquely mapped to specific reference, large number of them are multimapped. We split these results into two sets. One with uniquely mapped reads and other with ambiguous ones. Below are shown read counts from output of Centrifuge classifier.
To improve accuracy of results, we preformed BLAST search of all reads against reference genomes. Ones from set2 were than cross-referenced with BLAST results and this information was used to confirm the correct origin of each read. Finally, we used BLAST report information to further reduce the number of unclassified (unassigned) reads. We observed a more than 3 fold reduction of unassigned reads, thus showing great improvement in analysis sensitivity. The finale summary metrics are shown below.
Python code used for this analysis is available here:
We also provide full bash script for running this workflow.
A Hidden Gem
In today’s increasing computational requirements of bioinformatics tools, we recognize the need for fast and scalable way of executing tasks on large computing instances. Therefore, we made little extra effort to build a simple but scalable web platform that interacts with computing instances required for running tools for this case. The website is available at https://datathon2018.herokuapp.com/. You can find all files and tools we used during Datathon. Please note that due to our limited budget, we have restricted use of storage and resources. However, we have enabled usage of computing resources for the final analysis, which is used for parsing and evaluating data obtained from Centrifuge and BLAST (see section above).
If you have problem with the link above, try deleting “https://” from the address.
Data Section
Analysis Section
Results Section
Links
[2] Centrifuge Paper
[3] Blast Homepage
[4] Blast Paper
6 thoughts on “DNA Case team Backstabbers”
Excellent presentation, I like the diagrams and barplots. Shows good command of visualization tools – such an important aspect of data analysis. Results are plausible. I tried the website, but could not figure out the credentials and it was too late to ask.
Hi, thanks!
Credentials are username: demo; pass: demo
Very good article. Clear, explanatory, high level. The model diagram helps a lot to the understanding; the online demo and the demonstrative video are very impressive and a great addition. The model itself looks very promising. Great work, guys!
Thanks Liad!
Great job guys! Besides covering the task requirements, you took time to add some extra spice.
I especially enjoy the URL: sausage-dna.science! 😀
Thanks man! 🙂