Team name: Cheetahs
Case: Telelink (iGEM)
Provider: IBM
Business Understanding
The task in the Telelink case is to identify the complete set of genome traces present in a single food sample, including ALL organisms that should not be found in it. The business needs a solution to this DNA sequence identification problem to improve quality control in supply chain supervision and in health care and protection.
Data Understanding
The sample taken was statistically representative, i.e. the ratios between the numbers of genome sequence reads from different organisms are the same as the ratios between the amounts of the respective meat types used to produce the sausage.
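As a small worked illustration of that statement: if each read can be assigned to an organism, the share of reads per organism is read directly as the share of that ingredient. The function name and the counts below are made up for the example and are not results from case2.fasta.

```python
def meat_shares(read_counts):
    """Convert per-organism read counts into fractions of the whole sample."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any organism")
    return {organism: count / total for organism, count in read_counts.items()}


# Made-up counts, for illustration only (not results from case2.fasta).
example_counts = {"pig": 60, "cow": 30, "sheep": 10}
shares = meat_shares(example_counts)
for organism, share in shares.items():
    print(f"{organism}: {share:.0%}")
```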
All the provided files are in the FASTA format.
Full genome data for:
Cockroach (Blattella germanica) – GCA_000762945.2_Bger_2.0_genomic.fna.gz
Cow (Bos taurus) – GCF_000003055.6_Bos_taurus_UMD_3.1.1_genomic.fna.gz
Sheep (Ovis aries) – GCF_000298735.2_Oar_v4.0_genomic.fna.gz
Pig (Sus scrofa) – GCF_000003025.6_Sscrofa11.1_genomic.fna.gz
Soybean (Glycine max) – GCF_000004515.4_Glycine_max_v2.0_genomic.fna.gz
Escherichia coli (ASM) – GCF_000005845.2_ASM584v2_genomic.fna.gz
DNA from the sausage meat sample – case2.fasta
Each file is a plain-text file in FASTA format that contains DNA sequences.
Every sequence record has the following fields (example from the cockroach genome):
ID: KZ614359.1
Name: KZ614359.1
Description: KZ614359.1 Blattella germanica strain American Cyanamid = Orlando Normal breed German cockroach unplaced genomic scaffold scaffold145, whole genome shotgun sequence
Number of features: 0
Seq('GCATGCCGGGATATGTAGAATTGCCATTGAAACGGGGTATAACGTTGCAGGTAT…TTG', SingleLetterAlphabet())
Every sequence is written with four letters: A (adenine), T (thymine), C (cytosine) and G (guanine). A sequence may mix lower- and upper-case letters, which has to be taken into account during data preparation.
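To make the record structure concrete, here is a minimal, dependency-free sketch of a FASTA reader. It is not the Biopython code from the notebook; `read_fasta` and the toy record are hypothetical. The first word of the header line plays the role of the ID and the full header the role of the Description. The example also counts lower-case letters and reports any symbol other than A, C, G, T (assembled genomes commonly use N for unknown bases).

```python
import io


def read_fasta(handle):
    """Yield (record_id, description, sequence) for each record in a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    """Split the header into an ID (first word) and keep the full description."""
    fields = header.split()
    record_id = fields[0] if fields else ""
    return record_id, header, "".join(chunks)


# Toy record, invented for illustration.
example = io.StringIO(
    ">demo1 hypothetical record, mixed case\n"
    "GCATgccg\n"
    "GGATnn\n"
)
for record_id, description, sequence in read_fasta(example):
    lowercase = sum(1 for base in sequence if base.islower())
    other = sorted(set(sequence.upper()) - set("ACGT"))
    print(record_id, len(sequence), lowercase, other)
```

With Biopython installed, `SeqIO.parse(handle, "fasta")` covers the same ground and is the more robust choice.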
Data Preparation
For all data preparation and analysis I used the Python programming language, in particular the Biopython library.
To analyze the sequences, I first parse all files into strings and then convert all letters to upper case.
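The preparation step can be sketched as follows. This is an assumption about how such a step might look, written with the standard library only rather than the notebook's Biopython calls; `open_text` and `load_sequences_upper` are hypothetical names. It reads a plain or gzip-compressed FASTA file and returns every record as an upper-case string.

```python
import gzip


def open_text(path):
    """Open a FASTA file as text, transparently handling .gz compression."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def load_sequences_upper(path):
    """Return {record_id: upper-case sequence string} for every record in the file."""
    sequences = {}
    record_id, chunks = None, []
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if record_id is not None:
                    sequences[record_id] = "".join(chunks).upper()
                fields = line[1:].split()
                record_id = fields[0] if fields else ""
                chunks = []
            elif line and record_id is not None:
                chunks.append(line)
    if record_id is not None:
        sequences[record_id] = "".join(chunks).upper()
    return sequences
```

A call such as `load_sequences_upper("case2.fasta")` would then give the sausage reads as upper-case strings. Holding a whole mammalian genome in memory this way needs several gigabytes, so for the large reference files it may be more practical to process one record at a time.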
Analysis
A sausage should contain DNA from pig, cow and/or sheep (let’s call it “good DNA”). DNA from cockroach, soybean and/or Escherichia coli (let’s call it “bad DNA”) should not be found in the sausage.
The working assumption is that these genomes share many similar sequences (taken here to be above 90%), but some differences between them must remain.
The approach I have chosen is to compare every good genome with every bad genome and extract the sequences (genes) that differ.
After that, I search the sausage DNA for the sequences from the bad DNA that are not present in the good DNA.
*The code can be found in the attached notebook. I did not have enough time to run all of the code and obtain results.
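One way to make this comparison concrete is with k-mers (all substrings of a fixed length k). This is a swapped-in technique for illustration, not the notebook's method: the "genes" that differ are approximated by k-mers that occur in a bad genome but in none of the good ones, and a sausage read is flagged when it contains at least one of them. The function names, the value of k and the toy sequences are all assumptions. The sketch ignores reverse complements and sequencing errors.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of an upper-case sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def marker_kmers(bad_sequence, good_sequences, k):
    """K-mers that occur in the unwanted genome but in none of the expected ones."""
    markers = kmers(bad_sequence, k)
    for good in good_sequences:
        markers -= kmers(good, k)
    return markers


def flag_reads(reads, markers, k):
    """Return (read index, number of distinct marker hits) for reads with a hit."""
    flagged = []
    for index, read in enumerate(reads):
        hits = len(kmers(read, k) & markers)
        if hits:
            flagged.append((index, hits))
    return flagged


# Toy sequences, invented for illustration; real genomes are far longer.
good = ["ACGTACGTAC"]
bad = "ACGTTTGACA"
reads = ["ACGTACG", "GGTTTGAC"]
markers = marker_kmers(bad, good, k=4)
print(sorted(markers))
print(flag_reads(reads, markers, k=4))
```

On the real genomes a much larger k would be needed to avoid chance matches, and in-memory Python sets would not scale to billions of bases, so dedicated k-mer counting or alignment tools would likely be required.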
Team_Cheetahs_Telelink_case_final
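Another thing that interested me was the distribution of nucleotides in each genome. The three mammals are almost identical (about 21% G, 29% A, 29% T, 21% C), Escherichia coli is close to 25% for each base, and soybean and cockroach are richer in A and T (about 32.6% to 32.7% each). The differences are modest, so a deeper analysis is needed before drawing conclusions.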
Organism | G | A | T | C | Total
Cow | 554223892 (20.92%) | 769836909 (29.06%) | 771472991 (29.12%) | 553865587 (20.91%) | 2649399379
Sheep | 542305407 (20.96%) | 750934925 (29.02%) | 752347283 (29.08%) | 541919758 (20.94%) | 2587507373
Pig | 517706165 (20.94%) | 717891230 (29.04%) | 719048243 (29.09%) | 517402066 (20.93%) | 2472047704
Escherichia coli | 1177437 (25.37%) | 1142742 (24.62%) | 1141382 (24.59%) | 1180091 (25.42%) | 4641652
Soybean | 166165878 (17.38%) | 311753219 (32.61%) | 311830875 (32.62%) | 166175140 (17.38%) | 955925112
Cockroach | 304828903 (17.29%) | 577018232 (32.72%) | 576590607 (32.70%) | 304878080 (17.29%) | 1763315822
Sausage | 11230 (22.00%) | 14206 (27.83%) | 14107 (27.63%) | 11508 (22.54%) | 51051
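The figures in the table can be reproduced with a short counting routine. The following is a minimal sketch, not the notebook's code; `count_bases` and `percentages` are hypothetical names. It counts only A, C, G and T (case-insensitive) and expresses each count as a percentage of their sum, which matches the totals shown above.

```python
from collections import Counter


def count_bases(sequence):
    """Count G, A, T and C in a sequence, ignoring case and any other symbols."""
    counts = Counter(sequence.upper())
    return {base: counts[base] for base in "GATC"}


def percentages(counts):
    """Turn base counts into percentages of their total, rounded to 2 places."""
    total = sum(counts.values())
    return {base: round(100 * n / total, 2) for base, n in counts.items()}


# Sausage counts copied from the table above.
sausage = {"G": 11230, "A": 14206, "T": 14107, "C": 11508}
print(sum(sausage.values()), percentages(sausage))

# Toy sequence showing that lower case and N are handled.
print(count_bases("ggAATTccaaNN"))
```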
4 thoughts on “Team Cheetahs Case Telelink(iGEM)”
🙂 Looking good. Write a bit about your model: what you tried to do, which approaches you took and, if you have any, what the results were. Try to also add some ideas for future research or possible algorithmic approaches.
The nucleotide distributions show interesting results; you can easily see the similarities between the mammals compared to E. coli, for example.
From the first-person singular discourse, I gather that this was the effort of a one-member team. This is a hard case that requires a solid background in genetics and DNA alignment tools. I congratulate the team for the courage to approach such a specific domain. Data science is first about understanding the domain, and only after that about the ability to do modeling.
Nice tables for the distribution, very informative, but I miss the final result.