Datathon Telenor Solution – Exploratory Data & Predictive Analytics: An Analogy Between Game of Thrones and Telenor Telecommunications

Posted in Datathons Solutions

This dataset concerns a time-series analysis of the failure rate of ravens carrying messages from King's Landing to the North. It draws an analogy between Telenor telecommunications and Game of Thrones.
Sending ravens corresponds to one of the most fundamental parameters in mobile communications engineering.
For land-based mobile communications, the received raven variation is primarily the result of multipath fading caused by obstacles such as buildings (or clutter) or terrain irregularities, the distance between link end points, predatory animals, and interference among multiple transmissions (for example, wars).
This inevitable raven variation is the cause of dropped communication, one of the most significant quality-of-service measures in operative communication. For this reason, various techniques and schemes are employed in the planning, design, and optimization of raven networks to combat these propagation effects.
This normally covers the physical network configuration, which includes all aspects of network infrastructure deployment, such as the locations of base nests, additional food, and sometimes guards.
A typical example of these schemes and techniques is the use of models for flight prediction based on measured data.
Based on one month of data on flight failures, the participants have to perform a time-series analysis and predict the future number of failures.
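As a first step, that month of flight records can be aggregated into the daily failure counts on which the forecast will be built. A minimal sketch with pandas, assuming a hypothetical file and column names (raven_flights.csv, timestamp, failed):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical schema: one row per raven flight, with a timestamp
# and a flag marking whether the flight failed.
flights = pd.read_csv("raven_flights.csv", parse_dates=["timestamp"])

# Aggregate to a daily failure count: the series to be forecast.
daily_fails = (flights[flights["failed"] == 1]
               .set_index("timestamp")
               .resample("D")
               .size())

daily_fails.plot(title="Raven flight failures per day")
plt.show()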

Datathon NSI Mentors’ Guidelines – Economic Time Series Prediction

Posted in GD2018 Mentors, Mentors

In this article the mentors give some preliminary guidelines, advice, and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their text, so that there are no mix-ups with the other mentors. By the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS […]

Datathon Telenor Mentors’ Guidelines – On TelCo predictions

Posted in GD2018 Mentors, Mentors

In this article the mentors give some preliminary guidelines, advice, and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their text, so that there are no mix-ups with the other mentors. By the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS […]

Datathon Sofia Air Mentors’ Guidelines – On IOT Prediction

Posted in GD2018 Mentors, Mentors

In this article the mentors give some preliminary guidelines, advice, and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their text, so that there are no mix-ups with the other mentors. By the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS […]

Datathon Kaufland Mentors’ Guidelines – On Predictive Maintenance

Posted in GD2018 Mentors, Mentors

In this article, the mentors give some preliminary guidelines, advice, and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their text so that there are no mix-ups with the other mentors. By the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS […]

Datathon Telenor Solution – Ravens for Communication

Posted in Datathons Solutions

It is a well-known fact that exploratory data analysis is the cornerstone of data analysis.
The analysis of the data shows that the Brass Raven Birdy is the most failure-prone raven and the Metallic Raven Sunburst Polly is the most successful one. The Targaryen family has the most raven failures, whereas the Baelish family has the fewest; within the Baelish family, Petyr Baelish has the highest failure rate and Euron the fewest failures.
An ARIMA model is used to predict the number of failures for the next four days.
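A minimal sketch of such a forecast with statsmodels, assuming a hypothetical daily failure series; the file name, column names, and ARIMA order are illustrative, not the team's tuned choices:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical input: daily counts of failed raven flights.
df = pd.read_csv("raven_fails.csv", parse_dates=["date"], index_col="date")
series = df["fails"].asfreq("D", fill_value=0)

# Fit a simple ARIMA(1, 1, 1); the order is illustrative.
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

# Forecast the number of failures for the next 4 days.
print(fitted.forecast(steps=4))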

IDENTRICS – Team Vistula

Posted in Team solutions

IDENTRICS

1. Business Understanding

Identrics, as an R&D unit, concentrates on creating well-developed, advanced technology based on AI and automation. Its core activity is the creation of text-mining algorithms and tools that fulfil three major company aims: unmatchable quality, an optimal approach, and advanced flexibility.

The business case presented in this article is to extract keywords (tags) for entities' activities and properties, in context, from news articles. The result will help in making future predictions for the development of Identrics' customers. Success will be described as a list of entities, document IDs, and matched tags. The list will help to trace how an entity was described in the media over time: in a positive or negative way, and in connection with which major topics (such as stock-exchange movements, structural and political changes, redundancy or appointment of employees, and new-service announcements).

The main goal of the project is to create a parser that iteratively adds tags to the entities in each document. Success rests both on proper data processing and on the neural-coreference text-mining model.

The remaining parts of this paper cover data understanding and the way we used neural coreference algorithms. Next comes data preparation, which describes how we set up the technical environment and performed the data processing. Section 4 details the modelling process and the TextRank algorithm. Finally, the Evaluation and Deployment paragraphs give examples from the modelling and ideas for further development.

2. Data Understanding

The project data package consisted of two CSV files: Documents.csv and Entities.csv. The first covers all article/news texts; the second lists the entities for which we were looking for keywords in context.

The main challenge in data understanding was to discover the potential of entities hidden behind personal pronouns. The simplest example of how a successful conversion works:

Input: Ana has a dog. She loves him.
Output: Ana has a dog. Ana loves a dog.

Thanks to this, we could broaden the description available to the context algorithm from the first sentence alone to both sentences. So the keywords for 'Ana' become:

Before data transformation: Ana: ‘have’, ‘dog’
After data transformation: Ana: ‘have’, ‘dog’, ‘love’

That would likely produce a positive sentiment for the entity 'Ana', as 'love' is one of the strongest indicators of positive sentiment.

3. Data Preparation

Step 1:
Every document was sent to a neural-coreference server built as an EC2 instance on AWS using the tornado Python framework. Inside the server, a script cloned the https://github.com/huggingface/neuralcoref repository, and its one_shot_coref() and get_resolved_utterances() functions were run against the data. As a result, words such as "her", "his", and "it", and other context-dependent references, were replaced with the names of the entities in each document.
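For reference, the resolution step can be reproduced locally. The team used the repository's earlier server-side interface (one_shot_coref() and get_resolved_utterances()); the sketch below uses the later spaCy-integrated API of the same library, which yields an equivalent result:

import spacy
import neuralcoref  # from https://github.com/huggingface/neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("Ana has a dog. She loves him.")
# Pronouns are replaced with the mentions they refer to.
print(doc._.coref_resolved)  # "Ana has a dog. Ana loves a dog."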

Step 2:
The next step was to gather all the sentences in which an entity appeared, concatenate them into a single string, and collect all such strings into a list of entity contexts (sketched below, after Step 3).

Step 3:
Documents.csv and Entities.csv were also transformed into a format that is more machine-readable and much friendlier to loop over.
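A minimal sketch of Steps 2 and 3 with pandas; the column names are assumptions, since the actual schema is defined by the Datathon files:

import pandas as pd

documents = pd.read_csv("Documents.csv")  # assumed columns: doc_id, text
entities = pd.read_csv("Entities.csv")    # assumed column: entity

def sentences_for_entity(entity, texts):
    """Concatenate every sentence in which the entity appears
    into a single context string."""
    collected = []
    for text in texts:
        for sentence in text.split("."):
            if entity.lower() in sentence.lower():
                collected.append(sentence.strip())
    return ". ".join(collected)

contexts = {
    entity: sentences_for_entity(entity, documents["text"])
    for entity in entities["entity"]
}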

4. Modeling

We chose an algorithm implemented in the Python library 'summa'. The summa library uses the TextRank algorithm to choose keywords from given sentences. It is described in detail in the following repository: http://summanlp.github.io/textrank

The TextRank algorithm focuses mostly on nouns, which is why it fits perfectly with the idea of finding a connected web of different entities interlinked with one another.
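A minimal usage sketch; the input text is illustrative:

from summa import keywords  # pip install summa

context = ("Lukoil signed a crude transport agreement with Petrom. "
           "Lukoil operates refineries in Romania.")

# TextRank keyword extraction; returns newline-separated keywords.
print(keywords.keywords(context))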

If the idea of building sentiment-measurement tools were to be pursued, a different algorithm should be chosen.

5. Evaluation

Some of the models taken under consideration were the Stanford CoreNLP algorithm and a simpler approach using NLTK tokenization. The Stanford NLP algorithm is widely used and described in many formal articles; both its speed and its reliability are predictable.

The model tested against those algorithms was the neural coreference algorithm built on top of spaCy. It produced results that satisfy the business problem presented by Identrics.

6. Deployment

Time requirements:
The whole process of data preparation and computation should take no longer than 20 minutes.

Hardware requirements:
The algorithm was run on an EC2 machine with 4 cores and 16 GB of RAM. Each computation step could be separated out and run on a different EC2 machine. The algorithm could also be exposed as an API endpoint, so that a single document could be sent together with a list of entities; as a result, a file or a JSON response could be returned with the list of entities and their contextual words.
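A sketch of such an endpoint with tornado, which Step 1 already uses; the handler, route, port, and the extract_tags() helper are hypothetical stand-ins for the pipeline described above:

import json
import tornado.ioloop
import tornado.web
from summa import keywords

def extract_tags(document, entity):
    # Stand-in for the coreference + TextRank pipeline described above.
    sentences = [s for s in document.split(".") if entity.lower() in s.lower()]
    found = keywords.keywords(". ".join(sentences))
    return found.split("\n") if found else []

class TagHandler(tornado.web.RequestHandler):
    def post(self):
        payload = json.loads(self.request.body)
        result = {entity: extract_tags(payload["document"], entity)
                  for entity in payload["entities"]}
        self.write(json.dumps(result))

if __name__ == "__main__":
    app = tornado.web.Application([(r"/tags", TagHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()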

The bottleneck of the code might be Python itself or the neural net; this should be tested. A possible solution is increasing the size of the EC2 machine.

Sample output from the model:
Lukoil,petrom,signed,crude transport
TGN,gas,bse,company
Gazprom,gas,transgaz,old,romania

The model is able to discover not only places but also situations, such as the trade agreement in the Lukoil example.

The code for the analysis is attached to this article; it is divided into separate, documented steps.
The output data file is also attached, as final_output.txt.

Datathon Ontotext Mentors’ Guidelines – Text Mining Classification

Posted in GD2018 Mentors, Mentors

In this article the mentors give some preliminary guidelines, advice, and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their text, so that there are no mix-ups with the other mentors. By the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS […]

Team Seven Bridges – Telelink Case: What Really Goes into Sausages?

Posted in Bioinformatics, Learn, Team solutions

The food industry is governed by strict laws and regulations, which provide certainty that each product meets health and safety standards. In addition to existing biochemical food-product analysis, we propose a metagenomic approach. The main benefit of this approach is the ability to perform next-generation sequencing as a standard first step and then align the sampled data to the reference genomes of the many organisms suspected to be present in the sample. Additionally, if another organism is suspected at a later date, it is easy to reuse the sampled data set to perform another analysis; with biochemical analysis, this would require expensive sample storage and further laboratory tests. We examined three approaches to metagenomic analysis: BLAST, Centrifuge, and BWA MEM.