The food industry is governed by strict laws and regulations intended to ensure that each product meets health and safety standards. To that end, food inspection is a necessary step with two objectives: to show that a given product contains its main ingredients in the reported ratios, and to show that the sample is not contaminated with pathogens, pests or harmful bacteria. Consumers should be aware of each ingredient present in the product, and a product found to contain contaminants must be properly disposed of. We recommend a metagenomic approach to identifying all organisms present in the food product, in this case a sausage. In metagenomic analysis, reads are mapped to the genomes of multiple organisms suspected to be present in the sample. The result of the analysis is the proportion of each organism present in the product.
The DNA present in the product is sequenced using next generation sequencing (NGS), and the output is stored in FASTA (or FASTQ) files. Such a file consists of short reads (sequences of DNA nucleotides, A – adenine, C – cytosine, G – guanine, T – thymine). Read length typically varies between ~30 and 200 base pairs (bp); in the case at hand it is 51 bp. The supplied file (case2.fasta) contains a total of 1001 reads, of which exactly 1000 are unique because one read is duplicated. This dataset is a subsample of the NCBI dataset SRR1745839, which consists of 181422 reads.
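The read counts above can be checked with a few lines of Python. The following is a minimal sketch, not the code used in the study: the helper names `read_fasta` and `summarize_reads` are hypothetical, and a three-record toy input stands in for case2.fasta.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def summarize_reads(handle):
    """Return total reads, number of distinct sequences, and a length histogram."""
    seq_counts = Counter(seq for _, seq in read_fasta(handle))
    lengths = Counter()
    for seq, n in seq_counts.items():
        lengths[len(seq)] += n
    return {
        "total": sum(seq_counts.values()),
        "unique": len(seq_counts),
        "lengths": dict(lengths),
    }


# Toy input standing in for case2.fasta: three records, one duplicated sequence.
toy = StringIO(">r1\nACGT\n>r2\nACGT\n>r3\nGGCA\n")
summary = summarize_reads(toy)
print(summary)
```

Passing an open handle to the real file instead of `toy` would give the total and unique read counts for the actual sample.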
Sampled reads are mapped to a number of reference genomes. As part of this study, six references are suggested:
- Sheep
- Cow
- Pig
- Cockroach
- Soybean
- E. coli
We based our analysis on these references, as well as on the following reference samples:
- Other meat products: Horse, Turkey, Chicken
- Possible animal contaminants: Human, Mouse, Rat
- Invertebrates, plants, fungi, bacteria and viruses: Yeast, Corn, Rice, Potato, Viruses, Taenia solium, Taenia saginata, Echinococcus granulosus, Echinococcus multilocularis, Entamoeba histolytica, Trichinella spiralis, Ascaris suum, Fasciola hepatica, Cryptosporidium parvum, Aspergillus flavus, Aspergillus parasiticus, Staphylococcus aureus, Clostridium botulinum, Clostridium perfringens, Yersinia pestis, Salmonella enterica, Salmonella bongori, Shigella flexneri, Shigella boydii, Shigella sonnei, Shigella dysenteriae, Listeria monocytogenes, Naegleria fowleri, Naegleria gruberi
Once the data is understood, several preparation steps are necessary. For the sample file, we restricted the analysis to the 1000 unique reads. To use reference genomes as the basis for alignment, most aligners require an index of the reference. In retrospect, building the indices was the most time-consuming part of the project.
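Because index building is a one-off step per reference, it can help to generate the commands programmatically. The sketch below only assembles the command lines for `bwa index` and `makeblastdb` and does not run them. The function name `index_commands`, the label `sheep`, the path `refs/sheep.fa` and the output directory `indices` are illustrative assumptions, not names from the study. Centrifuge index construction is left out of the sketch.

```python
import shlex


def index_commands(label, reference_fasta, out_dir="indices"):
    """Return index-building commands for one reference, without running them."""
    prefix = f"{out_dir}/{label}"
    return {
        # bwa index writes its index files under the given prefix (-p).
        "bwa": ["bwa", "index", "-p", prefix, reference_fasta],
        # makeblastdb builds a nucleotide BLAST database from the FASTA file.
        "blast": ["makeblastdb", "-in", reference_fasta,
                  "-dbtype", "nucl", "-out", prefix],
    }


# Hypothetical label and path, used only for illustration.
commands = index_commands("sheep", "refs/sheep.fa")
for tool, argv in commands.items():
    print(tool, "->", shlex.join(argv))
```

Each argument list can be handed to `subprocess.run(argv, check=True)` once the tools are installed and the output directory exists.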
Our team devised three approaches to assessing the amount of each organism present in the sample.
The BLAST toolkit finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches.
The toolkit version used here is 2.2.31+. blastn was used for alignment against a custom database generated with makeblastdb. The reference used to build the database contains, among others, the recommended sequences (Sheep, Cow, Pig, Cockroach, Soybean, E. coli).
The perc_identity parameter was set to 90, so that only alignments with at least 90% identical matches are reported. The numbers given in the evaluation section were obtained using only the best alignment for each read.
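Selecting the best alignment per read can be sketched as follows, assuming blastn was run with tabular output (`-outfmt 6`), where column 3 is the percent identity and column 12 the bit score. The source does not state the output format or the ranking criterion. Ranking by bit score, keeping the first hit on ties, and the names `best_hits` and `sheep_chr1` are all assumptions made for illustration.

```python
import csv
from collections import Counter
from io import StringIO


def best_hits(handle, min_identity=90.0):
    """Keep the highest-bitscore hit per read from BLAST tabular output."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        read_id, subject = row[0], row[1]
        identity, bitscore = float(row[2]), float(row[11])
        if identity < min_identity:
            continue  # mirrors the perc_identity threshold of 90
        # Strict '>' keeps the first hit seen when bit scores tie.
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (subject, bitscore)
    return best


# Toy hits with made-up subject names; r2 falls below the identity threshold.
toy = StringIO(
    "r1\tsheep_chr1\t100.0\t51\t0\t0\t1\t51\t100\t150\t1e-20\t95.3\n"
    "r1\tcow_chr1\t94.1\t51\t3\t0\t1\t51\t200\t250\t1e-15\t78.7\n"
    "r2\tpig_chr2\t89.0\t51\t5\t0\t1\t51\t300\t350\t1e-10\t60.0\n"
    "r3\tcow_chr1\t98.0\t51\t1\t0\t1\t51\t400\t450\t1e-18\t90.0\n"
)
hits = best_hits(toy)
counts = Counter(subject for subject, _ in hits.values())
print(counts)
```

In practice the subject identifiers would still need to be mapped to organisms before per-species counts can be reported.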
Based on the results obtained by BLAST, the best-represented organisms are Ovis aries (sheep), Bos taurus (cow) and Sus scrofa (pig).
We also used the Centrifuge toolkit, one of the most efficient classification tools for metagenomic WGS analysis. Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species present in metagenomic samples. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. The reference index was also built with Centrifuge. It contains all the sequences recommended for this case (Sheep, Cow, Pig, Cockroach, Soybean, E. coli), as well as genomes of some species that are not traditionally found in sausages.
BWA MEM is one of the industry standards for alignment and was a natural choice. We ran it with default parameters and ended up with ~55% of reads unmapped. Tweaking the parameters only reduced the percentage of unmapped reads to around 53%. The high percentage of unmapped reads prompted us to also try the Bowtie2 aligner. With Bowtie2 the number of mapped reads was even lower, although the number of uniquely mapped reads was higher. We decided to go back to BWA MEM with default parameters for further analysis. Using BWA MEM, the reads were mapped to each reference separately, and the resulting BAM files were analyzed further in a custom Jupyter Notebook, using Python 3 with some standard libraries (numpy, pandas, pysam). The results were somewhat satisfactory, but the main issues with the BWA MEM approach were long index building times and the high percentage of unmapped reads. Although building the indices takes a significant amount of time (~1-3 hours), it is a one-time task and the indices can be reused.
Tools used in the BWA MEM pipeline include BWA INDEX, BWA MEM (BWA 0.7.13), Sambamba View and Sambamba Sort (Sambamba 0.6.0).
The most represented organism in the BWA MEM analysis was sheep (GCF_000298735.2_Oar_v4.0_genomic.fna.gz), with 276 reads in total mapping to its genome, 184 of them unique.
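Per-reference totals of this kind can be derived from the sets of read names that mapped to each reference. The sketch below is not the notebook code from the study. It assumes that "unique" means reads that map to one reference and to no other, which the text does not state explicitly. The function name `summarize_mappings` and the toy read names are hypothetical. In a real run the sets could be filled by iterating over each BAM file (for example with pysam) and skipping unmapped reads.

```python
def summarize_mappings(mapped):
    """Count total and reference-exclusive reads per reference.

    mapped: dict of reference label -> set of read names mapped to it.
    """
    summary = {}
    for ref, reads in mapped.items():
        # Union of reads that mapped to any other reference.
        others = set().union(
            *(r for name, r in mapped.items() if name != ref)
        )
        summary[ref] = {
            "total": len(reads),
            "exclusive": len(reads - others),
        }
    return summary


# Toy read-name sets; real ones would come from the per-reference BAM files.
mapped = {
    "sheep": {"r1", "r2", "r3"},
    "cow": {"r2", "r3"},
    "pig": {"r4"},
}
result = summarize_mappings(mapped)
print(result)
```

Under a different reading of "unique" (for instance, reads with a single alignment position within one genome), the exclusive count would have to be replaced by a filter on the alignment records themselves.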
To make the data processing flow reproducible, we coded most of the analysis as Common Workflow Language JSON files. For the analysis we used the Rabix Suite: the workflows were edited in the open source Rabix Composer, and the files can be executed locally on any machine using the open source Rabix Executor.
In this step we compared and merged our results.
Deploying one or a combination of these analyses would require an NGS laboratory capable of delivering consistent and timely results. The cost of the laboratory procedure and of cluster/cloud computing resources would have to be taken into account. Currently available methods of discovering unwanted organic material in food products would have to be used as a gold standard. The main advantage of metagenomic analysis is the ability to discover organic material with a single NGS procedure and then perform the analysis in silico, where cost and time increase at most linearly as references for additional organisms are added.
 Centrifuge (https://ccb.jhu.edu/software/centrifuge/index.shtml)
 BLAST Command-line Interface (https://www.ncbi.nlm.nih.gov/books/NBK279690/)
 BWA MEM (https://github.com/lh3/bwa)
 All-Food-Seq (AFS): a quantifiable screen for species in biological samples by deep DNA sequencing (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4131036/)