Datathon NSI Mentors’ Guidelines – Economic Time Series Prediction


In this article the mentors give some preliminary guidelines, advice and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their texts, so that there are no mix-ups with the other mentors. By the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS […]

Datathon Telenor Mentors’ Guidelines – On TelCo predictions


Datathon Sofia Air Mentors’ Guidelines – On IOT Prediction


Datathon Kaufland Mentors’ Guidelines – On Predictive Maintenance


Hack News Datathon Mentors’ Guidelines – Propaganda Detection


In this article, the mentors give some preliminary guidelines, advice, and suggestions to the participants for the Hack the News Datathon case. Every mentor should write their name and chat name at the beginning of their texts so that there are no mix-ups with the other mentors. Introduction to NLP Natural Language Processing (NLP) is […]

IDENTRICS – Team Vistula


IDENTRICS

1. Business Understanding

Identrics, as an R&D unit, concentrates on creating well-developed, advanced technology based on AI and automation. Its core activity is the creation of text-mining algorithms and tools that fulfil three major company aims: unmatchable quality, an optimal approach, and advanced flexibility.

The business case presented in this article is to extract keywords (tags) for entities' activity and properties, in context, from news. The result will help in making future predictions for the development of Identrics' customers. Success will be described as a list of entities, document IDs, and matched tags. The list will help to trace how an entity was described in the media over time – whether in a positive or negative way, and on which major topics (such as stock exchange movements, structural and political changes, redundancy or appointment of employees, and new service announcements) it appeared.

The main goal of the project is to create a successful parser that will iteratively add tags to the entities in each document. Success will rest both on proper data processing and on a neural-coreference text-mining model.

The further parts of this paper cover data understanding and the way we used neural coreference algorithms. Next comes the data preparation part, which describes how we set up the technical environment and how we performed data processing. Section 4 details the modelling process and the TextRank algorithm. Finally, the Evaluation and Deployment sections give examples from the modelling and ideas for further development.

2. Data Understanding

The project data package consisted of two CSV files: Documents.csv and Entities.csv. The first covered all article/news texts; the second listed the entities for which we were looking for keywords from context.

The main challenge in data understanding during the project was to discover the potential of entities hidden behind personal pronouns. The simplest example of how a successful conversion works:

Input: Ana has a dog. She loves him.
Output: Ana has a dog. Ana loves the dog.

Thanks to this, we could broaden the context available to the keyword algorithm from only the first sentence to both of them. So the keywords for 'Ana' become:

Before data transformation: Ana: ‘have’, ‘dog’
After data transformation: Ana: ‘have’, ‘dog’, ‘love’

That would likely create a positive sentiment for the entity 'Ana', as 'love' is one of the strongest indicators of positive sentiment.
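The effect of this substitution on keyword coverage can be sketched with a naive stand-in for coreference resolution. Note the antecedent mapping below is supplied by hand purely for illustration; a real system such as neuralcoref infers it from context:

```python
def resolve_pronouns(text, antecedents):
    """Replace each pronoun with its known antecedent.

    `antecedents` maps pronoun -> entity name; a real coreference
    system would infer this mapping itself from the surrounding text.
    """
    for pronoun, entity in antecedents.items():
        text = text.replace(pronoun, entity)
    return text

raw = "Ana has a dog. She loves him."
resolved = resolve_pronouns(raw, {"She": "Ana", "him": "the dog"})
# After resolution both sentences mention 'Ana', so the keyword
# 'love' can now be attributed to her as well.
```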

3. Data Preparation

Step 1:
Every document was sent to a neural coreference server built as an EC2 instance on AWS using the Tornado Python framework. Inside the server, a script was run to clone the https://github.com/huggingface/neuralcoref repository, and the one_shot_coref() and get_resolved_utterances() functions were run against the data. As a result, pronouns such as "her", "his", "it", and other words connected with context were replaced with the names of the entities.

Step 2:
The next step was to gather all the sentences in which an entity appeared, concatenate those sentences into a single string, and put all such strings into a list of contexts connected with entities.
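This gathering step can be sketched as follows (the regex sentence splitter is an assumption for illustration; the team's actual tokenization may differ):

```python
import re
from collections import defaultdict

def contexts_per_entity(documents, entities):
    """Gather every sentence mentioning each entity and join them
    into one context string per entity."""
    contexts = defaultdict(list)
    for doc in documents:
        # Naive sentence split on end punctuation followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            for entity in entities:
                if entity in sent:
                    contexts[entity].append(sent)
    return {e: " ".join(sents) for e, sents in contexts.items()}
```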

Step 3:
Documents.csv and Entities.csv were also transformed into a format that is more machine-readable and much friendlier for looping over.

4. Modeling

We chose an algorithm implemented in the Python library 'summa'. The summa library uses the TextRank algorithm to choose keywords from the given sentences. It is described in detail in the following GitHub repository: http://summanlp.github.io/textrank

The TextRank algorithm focuses mostly on nouns, which is why it fits well with the idea of finding a connected web of different entities linked together.
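The team used summa's implementation. Purely as an illustration of the underlying idea, the core of TextRank (PageRank over a word co-occurrence graph) can be sketched in a few lines; the window size, damping factor, and iteration count below are arbitrary choices, not summa's defaults:

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, iters=30, d=0.85, top=3):
    """Toy TextRank: rank words by PageRank over a co-occurrence graph."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    # Words appearing within `window` positions of each other share an edge.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # Iterate the PageRank update until (roughly) converged.
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        scores = {
            w: (1 - d) + d * sum(scores[u] / len(graph[u]) for u in graph[w])
            for w in graph
        }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top]]
```

Highly connected words (those co-occurring with many others) accumulate score, which is why frequent, central nouns surface as keywords.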

If the idea of building a sentiment-measurement tool were pursued, a different algorithm should be chosen.

5. Evaluation

Some of the models taken into consideration were the Stanford CoreNLP algorithm and a simpler approach using NLTK tokenization. The Stanford NLP algorithm is widely used and described in many formal articles – both its speed and reliability are predictable.

The model that was tested against those algorithms was a neural coreference algorithm built on top of spaCy. It produced results that could satisfy the business problem presented by Identrics.

6. Deployment

Time requirements:
The whole process of data preparation and computation shouldn't take longer than 20 minutes.

Hardware requirements:
The algorithm was run on an EC2 machine with 4 cores and 16 GB of RAM. Each step of the computation could be separated out to run on a different EC2 machine. The algorithm could also be set up as an API endpoint, so that a single document could be sent together with a list of entities; as a result, a file or a JSON response would be provided with the list of entities and their contextual words.
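One possible, hypothetical shape for such a JSON response, pairing each entity with its contextual keywords (the field names here are assumptions, not a specification):

```python
import json

def build_response(entity_keywords):
    """Serialize {entity: [keywords]} into a JSON payload for the
    hypothetical API endpoint described above."""
    return json.dumps(
        {"results": [{"entity": e, "keywords": kws}
                     for e, kws in sorted(entity_keywords.items())]}
    )

payload = build_response({"Lukoil": ["petrom", "signed", "crude transport"]})
```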

The bottleneck of the code might be Python itself or the neural network, which should be tested – one solution could be increasing the size of the EC2 machine.

Some outputs from the model were:
Lukoil,petrom,signed,crude transport
TGN,gas,bse,company
Gazprom,gas,transgaz,old,romania

The model has the ability to discover not only places but also situations, such as a trade agreement, as shown in the Lukoil example.

The code for the analysis is attached to this article. It is divided into separate, described steps.
The output data file is also attached, as final_output.txt.

Hack the News Datathon Case – Propaganda Detection


1. Business Problem Formulation The current political landscape is shaped by extreme polarization of opinions and by the proliferation of fake news. For example, a recent study published in Science has found that rumors and fake news tend to spread six times faster than truthful information. This situation both damages the reputation of respectable news outlets and […]

Datathon Ontotext Mentors’ Guidelines – Text Mining Classification


Ontotext case – Team _A


The objective of our task is to extract parent–subsidiary relationships from text. For example, a news item from TechCrunch says: 'Remember those rumors a few weeks ago that Google was looking to acquire the plug-and-play security camera company, Dropcam? Yep. It just happened.' From this sentence we can infer that Dropcam is a subsidiary of Google. But there are millions of companies and several million articles talking about them; a human being would tire of doing even 10! Trust me 😉 We have developed some cool machine learning models, spanning classical algorithms to deep neural networks, to do this for you. And there is a bonus! We do not just give you probabilities; we also give you the sentences that triggered the algorithm to make the inference. For instance, when it says Oracle Corp is the parent of Microsys, it can also return the sentence in its corpus that triggered its prediction: 'Oracle Corp's Microsys customer support portal was seen communicating with a server'.