Team Cheetahs Case Telelink(iGEM)


Team name: Cheetahs

Case: Telelink (iGEM)

Provider: IBM

 

Business Understanding

The task in the Telelink case is to identify the complete set of genome traces present in a single food sample, including ALL organisms that should not be found in it. The business needs this DNA sequence identification for improved quality control in supply chain supervision and in health care and protection.

Data Understanding

The sample taken was statistically representative, i.e. the ratios between the numbers of genome sequence reads from different organisms are the same as the ratios between the amounts of the respective meat types used to produce the sausage.
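Under this assumption, the share of each meat type follows directly from the number of reads assigned to each organism. A minimal sketch, assuming every read has already been assigned to exactly one organism; the function name `estimate_shares` and the read counts are hypothetical and are not results from the case data:

```python
def estimate_shares(read_counts):
    """Convert per-organism read counts into fractions of all assigned reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any organism")
    return {organism: count / total for organism, count in read_counts.items()}


# Hypothetical read counts, for illustration only (not taken from case2.fasta).
example_counts = {"pig": 600, "cow": 300, "sheep": 100}
example_shares = estimate_shares(example_counts)

for organism, share in example_shares.items():
    print(f"{organism}: {share:.1%}")
```

With representative sampling, these fractions would be read as the estimated proportions of each meat in the sausage.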

All the provided files are in the FASTA format.

 

Full genome data for:

Cockroach (Blattella germanica) – GCA_000762945.2_Bger_2.0_genomic.fna.gz

Cow (Bos taurus) – GCF_000003055.6_Bos_taurus_UMD_3.1.1_genomic.fna.gz

Sheep (Ovis aries) – GCF_000298735.2_Oar_v4.0_genomic.fna.gz

Pig (Sus scrofa) – GCF_000003025.6_Sscrofa11.1_genomic.fna.gz

Soybean (Glycine max) – GCF_000004515.4_Glycine_max_v2.0_genomic.fna.gz

Escherichia coli (ASM) – GCF_000005845.2_ASM584v2_genomic.fna.gz

DNA from the sausage meat – case2.fasta

 

Each file is a plain-text file in FASTA format containing DNA sequences.

Every sequence record has the following fields (shown here for one cockroach scaffold, as printed by Biopython):

ID: KZ614359.1

Name: KZ614359.1

Description: KZ614359.1 Blattella germanica strain American Cyanamid = Orlando Normal breed German cockroach unplaced genomic scaffold scaffold145, whole genome shotgun sequence

Number of features: 0

Seq('GCATGCCGGGATATGTAGAATTGCCATTGAAACGGGGTATAACGTTGCAGGTAT…TTG', SingleLetterAlphabet())

 

Every sequence is made up of four letters: A (adenine), T (thymine), C (cytosine) and G (guanine). Sequences may contain both lower-case and upper-case letters, which has to be taken into account during data preparation.

 

Data Preparation

For all data preparation and analysis I used the Python programming language, in particular the Biopython library.

To analyze the sequences, I first parse all files into strings and then convert all letters to upper case.
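The parsing step can be sketched roughly as follows. This is not the notebook code: it uses only the standard library so that it runs without dependencies, whereas with Biopython the equivalent loop would iterate over `SeqIO.parse(handle, "fasta")`. The helper names `parse_fasta` and `read_fasta` are assumptions made for this illustration.

```python
import gzip


def parse_fasta(lines):
    """Yield (record_id, sequence) pairs from FASTA lines, upper-casing the sequence."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks).upper()
            parts = line[1:].split()
            # The ID is the first whitespace-delimited token of the header.
            record_id, chunks = (parts[0] if parts else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks).upper()


def read_fasta(path):
    """Parse a FASTA file; files ending in .gz are decompressed on the fly."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from parse_fasta(handle)
```

For example, `for record_id, seq in read_fasta("case2.fasta"): ...` would walk through the sausage reads one at a time instead of loading the whole file into memory.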

 

Analysis

A sausage should contain DNA from pig, cow and/or sheep (let’s call this “good DNA”). DNA from cockroach, soybean and/or Escherichia coli (let’s call this “bad DNA”) should not be found in the sausage.

The working assumption is that all these DNA sequences are highly similar to each other (above 90%), but that there should still be some differences between them.

The approach I have chosen is to compare every good genome with every bad genome and extract the sequences that differ.

 

After that, I search the sausage DNA for the sequences from the bad DNA that are not present in the good DNA.

*The code can be found in the attached notebook. I did not have enough time to run all of the code and obtain the results.

Team_Cheetahs_Telelink_case_final
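One way to make the comparison described above concrete is to work with k-mers (all substrings of a fixed length k). The sketch below is an assumption about how the idea could be implemented, not the notebook's code: it collects k-mers that occur in a bad genome but in no good genome (on either strand) and then counts how many of them appear in a sausage read. All function names are hypothetical, and plain Python sets like these would not fit full mammalian genomes in memory, so this only illustrates the logic on short sequences.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (other letters are kept)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def kmers(sequence, k):
    """Return the set of length-k substrings that consist only of A, C, G, T."""
    sequence = sequence.upper()
    valid = set("ACGT")
    return {
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid
    }


def marker_kmers(bad_sequences, good_sequences, k):
    """Return k-mers seen in the bad sequences but on neither strand of the good ones."""
    good = set()
    for seq in good_sequences:
        good |= kmers(seq, k) | kmers(reverse_complement(seq), k)
    markers = set()
    for seq in bad_sequences:
        markers |= kmers(seq, k) - good
    return markers


def count_marker_hits(read, markers, k):
    """Count distinct marker k-mers found in a read, checking both strands."""
    found = kmers(read, k) | kmers(reverse_complement(read), k)
    return len(found & markers)
```

A read with many marker hits would be a candidate trace of an unwanted organism; how many hits count as evidence, and which k to use, are choices this sketch leaves open. In practice a dedicated alignment tool is the usual route for data of this size.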

Another thing that I found interesting was the distribution of the nucleotides across all the genomes. There are slight differences, but a deeper analysis is needed before drawing any conclusion.

Organism           G                    A                    T                    C                    Total
Cow                554223892 (20.92%)   769836909 (29.06%)   771472991 (29.12%)   553865587 (20.91%)   2649399379
Sheep              542305407 (20.96%)   750934925 (29.02%)   752347283 (29.08%)   541919758 (20.94%)   2587507373
Pig                517706165 (20.94%)   717891230 (29.04%)   719048243 (29.09%)   517402066 (20.93%)   2472047704
Escherichia coli   1177437 (25.37%)     1142742 (24.62%)     1141382 (24.59%)     1180091 (25.42%)     4641652
Soybean            166165878 (17.38%)   311753219 (32.61%)   311830875 (32.62%)   166175140 (17.38%)   955925112
Cockroach          304828903 (17.29%)   577018232 (32.72%)   576590607 (32.70%)   304878080 (17.29%)   1763315822
Sausage            11230 (22.00%)       14206 (27.83%)       14107 (27.63%)       11508 (22.54%)       51051
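Per-organism counts like the ones above can be computed with a few lines of Python. A minimal sketch, assuming the sequences are available as plain strings; the function name `base_composition` is hypothetical, and letters other than G, A, T and C are left out of the total:

```python
def base_composition(sequences):
    """Return (counts, total, percentages) for G, A, T and C over all sequences."""
    counts = {base: 0 for base in "GATC"}
    for seq in sequences:
        seq = seq.upper()
        for base in counts:
            # str.count scans the string once per base.
            counts[base] += seq.count(base)
    total = sum(counts.values())
    percentages = {
        base: (100.0 * n / total if total else 0.0) for base, n in counts.items()
    }
    return counts, total, percentages


counts, total, percentages = base_composition(["acgtN", "GG"])
print(counts, total, {b: round(p, 2) for b, p in percentages.items()})
```

Passing a generator of sequences (one per FASTA record) keeps memory use bounded by the largest single record rather than the whole genome.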

 

4 thoughts on “Team Cheetahs Case Telelink(iGEM)”

  1. 🙂 Looking good. Write a bit about your model: what you tried to do, which approaches you were trying to take and, if you have any, what the results were. Try also to add some ideas for future research and possible algorithmic approaches.

    1. The nucleotide distributions show interesting results; you can easily see the similarities between the mammals compared to E. coli, for example.

  2. From the first-person singular discourse, I gather that this was the effort of a one-member team. This is a hard case that requires considerable background in genetics and DNA alignment tools. I congratulate the team for the courage to approach such a specific domain. Data science is first about understanding the domain, and only after that about the ability to do modeling.
