BioinformaticsLearnTeam solutions

Team Seven Bridges – Telelink Case: What Really Goes into Sausages?

The food industry is governed by strict laws and regulations, which provide certainty that each product meets health and safety standards. In addition to existing biochemical food product analysis, we propose a metagenomic approach. Main benefit of this approach is the ability to perform next generation sequencing as a standard first step and then align the sampled data to genomes references of many organisms suspected to be present in the sample. Additionaly, if another organism is suspected at a later date, it is easy to reause the sampled data set to perform another analysis – in the biochemical analysis this would require expensive sample storage and performing more laboratory tests. We examined three approaches to metagenomic analysis – BLAST, Centrifuge and BWA MEM.


Business Understanding

The food industry is governed by strict laws and regulations, which provide certainty that each product meets health and safety standards. To that end, food inspection is a necessary step that should have two objectives – show that certain product contains its main ingredients in the reported ratios and show that the sample is not contaminated with pathogens, pests and harmful bacteria. Consumers should be aware of each ingredient present in the product. On the other hand, if contaminants are present in the product, it must be properly disposed of. We recommend using a metagenomic approach to identifying all organisms present in the food product, in this case a sausage. Metagenomic analysis is an approach where reads are mapped to multiple genomes of organisms suspected to be present in the sample. The result of the analysis is a ratio of each organism present in the product.

Data Understanding

Next generation sequencing (NGS) of DNA material present in the product is stored in FASTA (or FASTQ) files. It consists of short reads (sequences of DNA nucleotides, A – adenine, C – cytosine, G – guanine, T – thymine). Length of those reads varies between ~30-200 base pairs (bp), and in the case at hand it is 51bp. A total of 1001 reads are present in the supplied file (case2.fasta). Number of unique reads is exactly 1000, due to the fact that one read is duplicated. This dataset is a subsample of an NCBI SRR1745839 dataset, which consists of 181422 reads.

Sampled reads are mapped to a number of reference genomes. As part of this study, six references are suggested:

  1. Sheep
  2. Cow
  3. Pig
  4. Cockroach
  5. Soybean
  6. E. coli

We based our analysis on these references, as well as on the following reference samples:

Other meat products

Horse, Turkey, Chicken

Possible animal contaminants

Human, Mouse, Rat

Invertebrates, plants, fungi, bacteria and viruses

Yeast, Corn, Rice, Potato, Viruses, Taenia solium, Taenia saginata, Echinococcus granulosus, Echinococcus multilocularis, Entamoeba histolytica, Trichinella spiralis, Ascaris suum, Fasciola hepatica, Cryptosporidium parvum, Aspergillus flavus, Aspergillus parasiticus, Staphylococcus aureus, Clostridium botulinum, Clostridium perfringens, Yersinia pestis, Salmonella enterica, Salmonella bongori, Shigella flexneri, Shigella boydii, Shigella sonnei, Shigella dysenteriae, Listeria monocytogenes, Naegleria fowleri, Naegleria gruberi

Data Preparation

Once the data is understood, it is necessary to perform various preparation steps. In case of the sample file, we restricted the analysis to the 1000 unique reads. In order to use genome references as basis for alignment, most software aligners require an index of the reference. In retrospect, building indices was the most time consuming part of the project.


Our team devised three approaches to assessing the amount of each organism present in the sample.


BLAST [2] toolkit finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.

Toolkit version used here is 2.2.31+. Blastn was used for alignment on custom made database generated by Makeblastdb. Reference used for database building contains among others recommended sequences (Sheep, Cow, Pig, Cockroach, Soybean, E.coli).

Perc_identity parameter was used so that percentage of identical matches during alignment is minimum 90. Numbers given in the evaluation section are obtained by using only the best alignment for each read.

Based on the results obtained by BLAST, the best represented organisms are Ovis aries, Bos taurus, Sus scrofa.


Using Centrifuge toolkit, one of the most efficient classification tools used for metagenomic WGS analysis. Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species present in metagenomic samples [1]. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. The reference index is also built using Centrifuge, and it contains all the sequences recommended by this particular case (Sheep, Cow, Pig, Cockroach, Soybean, E. coli), but it also contains genomes of some species that are not traditionally found in sausages.


BWA MEM [3] is one of the industry standards for alignment and it posed as a natural choice. We ran it with default parameters, and ended up with ~55% of unmapped reads. Tweaking the parameters only reduced the unmapped read percentage to around 53%. High percentage of unmapped reads made us also to try the Bowtie2 aligner. With Bowtie2, number of mapped reads was even lower, but as an advantage in this approach, number of uniquely mapped reads was higher. Decision was made to go back and use the BWA MEM with default parameters for further analysis. Using BWA MEM the reads were mapped to each reference separately, and the resulting BAM files were further analyzed in a custom Jupyter Notebook, using Python 3 with some standard libraries (numpy, pandas, pysam). The results were somewhat satisfactory, but the main issues with BWA MEM approach were long index building times and high percentage of unmapped reads. While building the indices takes a significant amount of time (~1-3 hours), it is a one time task and it can be reused.

Tools used in BWA MEM pipeline include BWA INDEX, BWA MEM (BWA 0.7.13), Sambamba View, Sambamba Sort (Sambamba 0.6.0).

Most represented organism in BWA MEM analysis was sheep (GCF_000298735.2_Oar_v4.0_genomic.fna.gz), with 276 total reads mapping to its genome, with 184 being unique.



In order to provide reproducibility of the data processing flow we codes most of the analysis as Common Workflow Language JSON files. For the analysis we used the Rabix Suite. The editor used was open source Rabix Composer. The files can be executed locally at any machine using the open source Rabix Executor.

In this step we compared and merged our results.

Centrifuge BLAST BWA
Genome Unique Total Unique Total Unique Total
Cow 29 713 446 446 112 203
Pig 67 81 74 74 32 36
Horse 7 9 12 12 7 10
Turkey 0 2
Chicken 0 1 0 2
Sheep 450 604 428 428 184 276
Human 0 2 2 4
E.coli 0 1
Mouse 0 1 1 1
Rat 1 3 8 9
Taenia solium 4 4
Shigella flexneri 0 1
Shigella boydii 1 1 0 1
Shigella sonnei 0 1
Shigella dysenteriae 0 1
Taenia saginata
Echinococcus granulosus
Echinococcus multilocularis
Entamoeba histolytica
Trichinella spiralis
Ascaris suum
Fasciola hepatica
Cryptosporidium parvum
Aspergillus flavus
Aspergillus parasiticus
Staphylococcus aureus
Clostridium botulinum
Clostridium perfringens
Yersinia pestis
Salmonella enterica
Salmonella bongori
Listeria monocytogenes
Naegleria fowleri
Naegleria gruberi
Cryptosporidium Hominis
Sum 554 1414 962 962 349 551



Deploying one or a combination of these analyses would require an NGS laboratory, capable of delivering consistent and timely results. Cost of the laboratory procedure and cluster/cloud computing resources would have to be taken into account. Currently available methods of discovering unwanted organic material in the food products would have to be used as a gold standard. The main advantage of metagenomic analysis is the ability to discover organic material while performing a single NGS procedure, and then do the analysis in silico – where the cost and time increases at most linearly when adding more references of different organisms.


[1] Centrifuge (

[2] BLAST Command-line Interface (

[3] BWA MEM (

[4] All-Food-Seq (AFS): a quantifiable screen for species in biological samples by deep DNA sequencing (

Share this

9 thoughts on “Team Seven Bridges – Telelink Case: What Really Goes into Sausages?

  1. 2

    Great article! and I can see you’re still altering and adding sections to it. It seems you have dedicated a lot of attention to the tools used to create the model. It would be great to also write about problems you encountered and how you have solved them, as well as open questions/problems that still should be solved.

    1. 0

      Thanks! We were in a time scramble at the end. This challenge is very interesting, and we think more time should be dedicated to shaping the final solution.

  2. 3

    I really liked this article, because they went beyond the given setting and actually suggest that there are potentially other organisms in there, too. Interested whether the Taenia and Rat are true hits and if these few fragments are enough evidence for potential contamination. Great research skills!

  3. 1

    Congratulations! You have done an amazing job! The results are excellent! Also, you revealed the secret in the 1001th read. I think you deserve the victory.

  4. 0

    Questions during the presentation:
    1. @drceenish
    Can this approach become a real product in the world which one day we can walk with in the supermarket? Your thoughts?
    2. @liad
    – How would you tackle gene mutations or unknown samples which were not in the gene bank? or (even worse) if the sample in the bank are not accurate

    1. 0

      For each index it took ~1-2hours, while alignment of the sample data to each reference genome took <30min.

      In the attached BWA jupyter notebook, in #data understanding cell you can see the following list comprehension:

      weird = [elem for elem in seqs if elem not in seqb]

      it lists all the elements in the "small" sample (seqs) which are not present in the "big" sample (seqb).

Leave a Reply