Business Understanding
The food industry is governed by strict laws and regulations, which provide certainty that each product meets health and safety standards. To that end, food inspection is a necessary step that should have two objectives – show that certain product contains its main ingredients in the reported ratios and show that the sample is not contaminated with pathogens, pests and harmful bacteria. Consumers should be aware of each ingredient present in the product. On the other hand, if contaminants are present in the product, it must be properly disposed of. We recommend using a metagenomic approach to identifying all organisms present in the food product, in this case a sausage. Metagenomic analysis is an approach where reads are mapped to multiple genomes of organisms suspected to be present in the sample. The result of the analysis is a ratio of each organism present in the product.
Data Understanding
Next generation sequencing (NGS) of DNA material present in the product is stored in FASTA (or FASTQ) files. It consists of short reads (sequences of DNA nucleotides, A – adenine, C – cytosine, G – guanine, T – thymine). Length of those reads varies between ~30-200 base pairs (bp), and in the case at hand it is 51bp. A total of 1001 reads are present in the supplied file (case2.fasta). Number of unique reads is exactly 1000, due to the fact that one read is duplicated. This dataset is a subsample of an NCBI SRR1745839 dataset, which consists of 181422 reads.
Sampled reads are mapped to a number of reference genomes. As part of this study, six references are suggested:
- Sheep
- Cow
- Pig
- Cockroach
- Soybean
- E. coli
We based our analysis on these references, as well as on the following reference samples:
Other meat products
Horse, Turkey, Chicken
Possible animal contaminants
Human, Mouse, Rat
Invertebrates, plants, fungi, bacteria and viruses
Yeast, Corn, Rice, Potato, Viruses, Taenia solium, Taenia saginata, Echinococcus granulosus, Echinococcus multilocularis, Entamoeba histolytica, Trichinella spiralis, Ascaris suum, Fasciola hepatica, Cryptosporidium parvum, Aspergillus flavus, Aspergillus parasiticus, Staphylococcus aureus, Clostridium botulinum, Clostridium perfringens, Yersinia pestis, Salmonella enterica, Salmonella bongori, Shigella flexneri, Shigella boydii, Shigella sonnei, Shigella dysenteriae, Listeria monocytogenes, Naegleria fowleri, Naegleria gruberi
Data Preparation
Once the data is understood, it is necessary to perform various preparation steps. In case of the sample file, we restricted the analysis to the 1000 unique reads. In order to use genome references as basis for alignment, most software aligners require an index of the reference. In retrospect, building indices was the most time consuming part of the project.
Modeling
Our team devised three approaches to assessing the amount of each organism present in the sample.
BLAST
BLAST [2] toolkit finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.
Toolkit version used here is 2.2.31+. Blastn was used for alignment on custom made database generated by Makeblastdb. Reference used for database building contains among others recommended sequences (Sheep, Cow, Pig, Cockroach, Soybean, E.coli).
Perc_identity parameter was used so that percentage of identical matches during alignment is minimum 90. Numbers given in the evaluation section are obtained by using only the best alignment for each read.
Based on the results obtained by BLAST, the best represented organisms are Ovis aries, Bos taurus, Sus scrofa.
Centrifuge
Using Centrifuge toolkit, one of the most efficient classification tools used for metagenomic WGS analysis. Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species present in metagenomic samples [1]. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. The reference index is also built using Centrifuge, and it contains all the sequences recommended by this particular case (Sheep, Cow, Pig, Cockroach, Soybean, E. coli), but it also contains genomes of some species that are not traditionally found in sausages.
BWA MEM
BWA MEM [3] is one of the industry standards for alignment and it posed as a natural choice. We ran it with default parameters, and ended up with ~55% of unmapped reads. Tweaking the parameters only reduced the unmapped read percentage to around 53%. High percentage of unmapped reads made us also to try the Bowtie2 aligner. With Bowtie2, number of mapped reads was even lower, but as an advantage in this approach, number of uniquely mapped reads was higher. Decision was made to go back and use the BWA MEM with default parameters for further analysis. Using BWA MEM the reads were mapped to each reference separately, and the resulting BAM files were further analyzed in a custom Jupyter Notebook, using Python 3 with some standard libraries (numpy, pandas, pysam). The results were somewhat satisfactory, but the main issues with BWA MEM approach were long index building times and high percentage of unmapped reads. While building the indices takes a significant amount of time (~1-3 hours), it is a one time task and it can be reused.
Tools used in BWA MEM pipeline include BWA INDEX, BWA MEM (BWA 0.7.13), Sambamba View, Sambamba Sort (Sambamba 0.6.0).
Most represented organism in BWA MEM analysis was sheep (GCF_000298735.2_Oar_v4.0_genomic.fna.gz), with 276 total reads mapping to its genome, with 184 being unique.
datathon2018_case_telelink_team_sevenbridges_bwa
Evaluation
In order to provide reproducibility of the data processing flow we codes most of the analysis as Common Workflow Language JSON files. For the analysis we used the Rabix Suite. The editor used was open source Rabix Composer. The files can be executed locally at any machine using the open source Rabix Executor.
In this step we compared and merged our results.
Centrifuge | BLAST | BWA | ||||
Genome | Unique | Total | Unique | Total | Unique | Total |
Cow | 29 | 713 | 446 | 446 | 112 | 203 |
Pig | 67 | 81 | 74 | 74 | 32 | 36 |
Horse | 7 | 9 | 12 | 12 | 7 | 10 |
Turkey | 0 | 2 | ||||
Chicken | 0 | 1 | 0 | 2 | ||
Sheep | 450 | 604 | 428 | 428 | 184 | 276 |
Human | 0 | 2 | 2 | 4 | ||
Cockroach | ||||||
E.coli | 0 | 1 | ||||
Mouse | 0 | 1 | 1 | 1 | ||
Rat | 1 | 3 | 8 | 9 | ||
Taenia solium | 4 | 4 | ||||
Shigella flexneri | 0 | 1 | ||||
Shigella boydii | 1 | 1 | 0 | 1 | ||
Shigella sonnei | 0 | 1 | ||||
Shigella dysenteriae | 0 | 1 | ||||
Yeast | ||||||
Soybean | ||||||
Corn | ||||||
Rice | ||||||
Potato | ||||||
Viral | ||||||
Taenia saginata | ||||||
Echinococcus granulosus | ||||||
Echinococcus multilocularis | ||||||
Entamoeba histolytica | ||||||
Trichinella spiralis | ||||||
Ascaris suum | ||||||
Fasciola hepatica | ||||||
Cryptosporidium parvum | ||||||
Aspergillus flavus | ||||||
Aspergillus parasiticus | ||||||
Staphylococcus aureus | ||||||
Clostridium botulinum | ||||||
Clostridium perfringens | ||||||
Yersinia pestis | ||||||
Salmonella enterica | ||||||
Salmonella bongori | ||||||
Listeria monocytogenes | ||||||
Naegleria fowleri | ||||||
Naegleria gruberi | ||||||
Cryptosporidium Hominis | ||||||
Sum | 554 | 1414 | 962 | 962 | 349 | 551 |
Deployment
Deploying one or a combination of these analyses would require an NGS laboratory, capable of delivering consistent and timely results. Cost of the laboratory procedure and cluster/cloud computing resources would have to be taken into account. Currently available methods of discovering unwanted organic material in the food products would have to be used as a gold standard. The main advantage of metagenomic analysis is the ability to discover organic material while performing a single NGS procedure, and then do the analysis in silico – where the cost and time increases at most linearly when adding more references of different organisms.
References
[1] Centrifuge (https://ccb.jhu.edu/software/centrifuge/index.shtml)
[2] BLAST Command-line Interface (https://www.ncbi.nlm.nih.gov/books/NBK279690/)
[3] BWA MEM (https://github.com/lh3/bwa)
[4] All-Food-Seq (AFS): a quantifiable screen for species in biological samples by deep DNA sequencing (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4131036/)
9 thoughts on “Team Seven Bridges – Telelink Case: What Really Goes into Sausages?”
Great article! and I can see you’re still altering and adding sections to it. It seems you have dedicated a lot of attention to the tools used to create the model. It would be great to also write about problems you encountered and how you have solved them, as well as open questions/problems that still should be solved.
Looking great!!!
Thanks! We were in a time scramble at the end. This challenge is very interesting, and we think more time should be dedicated to shaping the final solution.
I really liked this article, because they went beyond the given setting and actually suggest that there are potentially other organisms in there, too. Interested whether the Taenia and Rat are true hits and if these few fragments are enough evidence for potential contamination. Great research skills!
Great Viz! 🙂
Congratulations! You have done an amazing job! The results are excellent! Also, you revealed the secret in the 1001th read. I think you deserve the victory.
Questions during the presentation:
1. @drceenish
Can this approach become a real product in the world which one day we can walk with in the supermarket? Your thoughts?
2. @liad
– How would you tackle gene mutations or unknown samples which were not in the gene bank? or (even worse) if the sample in the bank are not accurate
3. @kreez
What time did it take to analyse the whole sample? How did you find out the 1001th read, which was added manually?
For each index it took ~1-2hours, while alignment of the sample data to each reference genome took <30min.
In the attached BWA jupyter notebook, in #data understanding cell you can see the following list comprehension:
weird = [elem for elem in seqs if elem not in seqb]
it lists all the elements in the "small" sample (seqs) which are not present in the "big" sample (seqb).