Team name: Cheetahs
Case: Telelink (iGEM)
The task in the Telelink case is to identify the complete set of genome traces found in a single food sample, including ALL organisms that should not be present in it. The business needs a solution to this DNA sequence identification problem to improve quality control in supply chain supervision and in health care and protection.
The sample taken was statistically representative, i.e. the ratios between the numbers of genome sequence reads from different organisms are the same as the ratios between the amounts of the respective meat types used to produce the sausage.
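Under this assumption, the share of each meat type follows directly from the number of reads assigned to each organism. A minimal sketch of the arithmetic is shown below; the function name and the read counts are hypothetical and are not results from the case data.

```python
from typing import Dict


def meat_proportions(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-organism read counts into fractions of the total."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any organism")
    return {organism: count / total for organism, count in read_counts.items()}


# Hypothetical counts, for illustration only (not measured values).
example_counts = {"pig": 600, "cow": 300, "sheep": 100}
example_shares = meat_proportions(example_counts)
```

With these made-up counts, pig would account for 60% of the reads and therefore, by the assumption above, for 60% of the meat.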
All the provided files are in the FASTA format.
Full genome data for:
Cockroach (Blattella germanica) – GCA_000762945.2_Bger_2.0_genomic.fna.gz
Cow (Bos taurus) – GCF_000003055.6_Bos_taurus_UMD_3.1.1_genomic.fna.gz
Sheep (Ovis aries) – GCF_000298735.2_Oar_v4.0_genomic.fna.gz
Pig (Sus scrofa) – GCF_000003025.6_Sscrofa11.1_genomic.fna.gz
Soybean (Glycine max) – GCF_000004515.4_Glycine_max_v2.0_genomic.fna.gz
Escherichia coli (ASM) – GCF_000005845.2_ASM584v2_genomic.fna.gz
DNA from a sausage meat – case2.fasta
Each file is a plain-text file in the FASTA format that contains DNA sequences.
Every sequence record has fields such as:
Description: KZ614359.1 Blattella germanica strain American Cyanamid = Orlando Normal breed German cockroach unplaced genomic scaffold scaffold145, whole genome shotgun sequence
Number of features: 0
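Records like the one above can be read with a few lines of Python. The sketch below uses only the standard library so that it stays self-contained; Biopython's `SeqIO.parse(handle, "fasta")` offers the same kind of iteration. The helper names `read_fasta` and `open_fasta` are assumptions made for this illustration and are not taken from the attached Notebook.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from an open FASTA text handle."""
    description = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]  # drop the leading ">"
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def open_fasta(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTA file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Tiny in-memory example instead of a real genome file.
demo_text = ">KZ614359.1 demo scaffold\nACGT\nacgt\n>second record\nGG\n"
demo_records = list(read_fasta(io.StringIO(demo_text)))
```

For the real data, one would pass `open_fasta("case2.fasta")` or one of the `.fna.gz` files to `read_fasta` and iterate over the records one by one, since the full genomes are large.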
Every sequence is written with four different letters: A (adenine), T (thymine), C (cytosine) and G (guanine). The sequences may mix lower and upper case, which has to be taken into account during data preparation.
For all the data preparation and analysis, I used the Python programming language, and in particular the Biopython library.
To analyze the sequences, I first parse all files into strings and then convert all letters to upper case.
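The normalization step can be sketched as follows. Besides converting to upper case, the example also reports any characters other than A, C, G and T, because genome assemblies commonly use N for an unknown base. The function names are hypothetical, and the check for unexpected symbols is an addition for illustration rather than a step described in the Notebook.

```python
from collections import Counter
from typing import Dict

DNA_LETTERS = set("ACGT")


def normalize_sequence(sequence: str) -> str:
    """Remove whitespace and convert the sequence to upper case."""
    return "".join(sequence.split()).upper()


def unexpected_symbols(sequence: str) -> Dict[str, int]:
    """Count characters other than A, C, G, T in an upper-cased sequence."""
    counts = Counter(sequence)
    return {symbol: n for symbol, n in counts.items() if symbol not in DNA_LETTERS}


# Mixed-case toy input with a line break and two unknown bases.
cleaned = normalize_sequence("acgtNNac\nGT")
leftovers = unexpected_symbols(cleaned)
```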
A sausage should contain DNA from pig, cow and/or sheep (let’s call it “good DNA”). DNA from cockroach, soybean and/or Escherichia coli (let’s call it “bad DNA”) should not be found in the sausage.
These DNA sequences are assumed to be very similar to each other (above 90%), but there should be some differences between them.
The approach I have chosen is to compare every good DNA with every bad DNA and extract the genes that differ between them.
After that, I search the sausage DNA for the genes from the bad DNA that are not present in the good DNA.
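One possible way to make this comparison concrete is to work with k-mers (substrings of fixed length k) instead of whole genes. This is a swapped-in technique for illustration, not necessarily the method used in the Notebook. The sketch collects the k-mers that occur in the bad DNA but not in the good DNA and then flags sample reads that contain any of them. The value of k, the function names and the toy sequences are all assumptions. The sketch also ignores reverse-complement strands and would need a more memory-efficient structure for full genomes.

```python
from typing import Iterable, List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in the sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_to_bad(bad: Iterable[str], good: Iterable[str], k: int) -> Set[str]:
    """K-mers that occur in some bad sequence but in no good sequence."""
    good_kmers: Set[str] = set()
    for sequence in good:
        good_kmers |= kmers(sequence, k)
    bad_kmers: Set[str] = set()
    for sequence in bad:
        bad_kmers |= kmers(sequence, k)
    return bad_kmers - good_kmers


def flag_reads(reads: Iterable[str], markers: Set[str], k: int) -> List[str]:
    """Return the reads that contain at least one marker k-mer."""
    return [read for read in reads if kmers(read, k) & markers]


# Toy data, for illustration only.
good_dna = ["AAAACCCCGGGG"]
bad_dna = ["AAAACCCCTTTT"]
sample_reads = ["GGCCCCTTAA", "AAAACCCCGG"]

markers = unique_to_bad(bad_dna, good_dna, k=6)
flagged = flag_reads(sample_reads, markers, k=6)
```

In the toy data only the first read contains a k-mer that is specific to the bad sequence, so it is the only one flagged.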
*The code can be found in the attached Notebook. I did not have enough time to run all of the code and obtain the results.
Another thing that interested me was the distribution of the nucleotides in each DNA. There is a slight difference between them, but a deeper analysis is needed to draw a firmer conclusion.
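A nucleotide distribution of this kind can be computed by counting the four letters and dividing by their total. The sketch below is a minimal version with hypothetical function names; it ignores any symbol other than A, C, G and T, and also reports the combined G and C share, a common summary of base composition.

```python
from collections import Counter
from typing import Dict

BASES = "ACGT"


def base_frequencies(sequence: str) -> Dict[str, float]:
    """Return the fraction of each of A, C, G, T among those four letters."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in BASES)
    if total == 0:
        return {base: 0.0 for base in BASES}
    return {base: counts[base] / total for base in BASES}


def gc_content(sequence: str) -> float:
    """Return the combined fraction of G and C in the sequence."""
    frequencies = base_frequencies(sequence)
    return frequencies["G"] + frequencies["C"]


# Toy sequence in mixed case, for illustration only.
toy_frequencies = base_frequencies("aaccGGTT")
toy_gc = gc_content("aaccGGTT")
```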