T.R.O.L – Temporally Recurrent Optimal Learning – Case Telenor SNA


Article – SNA-Telenor – Team T.R.O.L. 1. Business Understanding, 2. Data Understanding, 3. Data Preparation TROL-0002, 4. Modeling T.R.O.L-0003, 4. Modeling, 5. Evaluation T.R.O.L-0004, 4. Modeling, 5. Evaluation T.R.O.L – 6. Deployment. #Case_Telenor #SNA Microsoft Azure 2 vouchers W62UL1PZZRFHVCBH9R WWZO8QDZDKJK8VG7QB

IDENTRICS – Team Vistula


IDENTRICS

1. Business Understanding

Identrics, as an R&D unit, concentrates on creating well-developed, advanced technology based on AI and automation. Its core activity is the creation of text mining algorithms and tools that fulfil three major company aims: unmatchable quality, an optimal approach and advanced flexibility.

The business case presented in this article is to extract keywords (tags) describing the activities and properties of entities from their context in news articles. The result will help in making future predictions about the development of Identrics’ customers. Success is defined as a list of entities, document IDs and matched tags. The list will help to reveal the trend in how an entity was described in the media over time: in a positive or negative way, and in connection with which major topics (such as stock exchange movements, structural and political changes, redundancy or appointment of employees, and new service announcements).

The main goal of the project is to create a parser that iteratively adds tags to entities in each document. Success depends both on proper data processing and on the neural coreference text mining model.

The remaining parts of this paper cover data understanding and the way we used neural coreference algorithms. The data preparation part then describes how we set up the technical environment and how we processed the data. Section 4 describes the modelling process and the TextRank algorithm in more detail. Finally, the Evaluation and Deployment sections give examples from the modelling and ideas for further development.

2. Data Understanding

The project data package consisted of two CSV files: Documents.csv and Entities.csv. The first contained all the article/news texts; the second contained the list of entities for which we were looking for keywords from context.

The main challenge in data understanding was to recover mentions of entities hidden behind personal pronouns. The simplest example of how a successful conversion works:

Input: Ana has dog. She loves him.
Output: Ana has dog. Ana loves dog.

Thanks to this, we could broaden the context available to the keyword algorithm from only the first sentence to both sentences. So the keywords for ‘Ana’ are:

Before data transformation: Ana: ‘have’, ‘dog’
After data transformation: Ana: ‘have’, ‘dog’, ‘love’

That would probably produce a positive sentiment for the entity ‘Ana’, as ‘love’ is one of the strongest indicators of positive sentiment.

3. Data Preparation

Step 1:
Every document was sent to a neural coreference server built as an EC2 instance on AWS using the Tornado Python framework. On the server, a script cloned the https://github.com/huggingface/neuralcoref repository. The one_shot_coref() and get_resolved_utterances() functions were run against the data. As a result, words such as “her”, “his”, “it” and others linked by context were replaced with the names of the entities they refer to.
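To make the substitution step concrete, here is a minimal sketch of what happens once a resolver has identified which pronouns refer to which entities. It does not call neuralcoref; the function name apply_coref_substitutions and the hand-written mention spans are assumptions used only for illustration, standing in for the output of a real coreference model.

```python
def apply_coref_substitutions(text, mentions):
    """Replace each (start, end, replacement) character span in text.

    Spans are applied from right to left so that the offsets of
    earlier spans stay valid after each replacement.
    """
    resolved = text
    for start, end, replacement in sorted(mentions, reverse=True):
        resolved = resolved[:start] + replacement + resolved[end:]
    return resolved


text = "Ana has dog. She loves him."
# Hypothetical resolver output: pronoun spans mapped to their antecedents.
mentions = [(13, 16, "Ana"), (23, 26, "dog")]
resolved_text = apply_coref_substitutions(text, mentions)
print(resolved_text)
```

With the spans above, the printed text matches the Input/Output example from the Data Understanding section.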

Step 2:
The next step was to gather all the sentences in which an entity appeared, concatenate them into a single string, and put all such strings into a list of sentences connected with entities.
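A minimal sketch of this gathering step, under some assumptions: sentences are split naively on end punctuation, entities are matched as whole words regardless of case, and the function name sentences_by_entity is hypothetical. The attached project code may do this differently.

```python
import re


def sentences_by_entity(resolved_text, entities):
    """Map each entity to one string holding every sentence that mentions it."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", resolved_text)
        if s.strip()
    ]
    grouped = {}
    for entity in entities:
        pattern = re.compile(r"\b" + re.escape(entity) + r"\b", re.IGNORECASE)
        grouped[entity] = " ".join(s for s in sentences if pattern.search(s))
    return grouped


entity_contexts = sentences_by_entity(
    "Ana has dog. Ana loves dog. Tom sleeps.", ["Ana", "Tom"]
)
# The list of concatenated strings, one per entity.
context_list = list(entity_contexts.values())
```

Running this on text that has already gone through coreference substitution is what lets the second sentence count towards ‘Ana’.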

Step 3:
Documents.csv and Entities.csv were also transformed into a format that is more machine-readable and much easier to loop over.
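One loop-friendly format is a plain dictionary keyed by document ID. The sketch below assumes column names id and text, which are hypothetical; the real column names in Documents.csv are not given in this article. An in-memory sample stands in for the file.

```python
import csv
import io


def load_documents(csv_file, id_column="id", text_column="text"):
    """Read a CSV file object into a {document_id: text} dictionary."""
    reader = csv.DictReader(csv_file)
    return {row[id_column]: row[text_column] for row in reader}


# In-memory stand-in for Documents.csv (made-up rows and column names).
sample = io.StringIO(
    'id,text\n1,"Ana has dog. She loves him."\n2,"Tom sleeps."\n'
)
documents = load_documents(sample)

for doc_id, doc_text in documents.items():
    print(doc_id, doc_text)
```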

4. Modeling

We chose an algorithm implemented in the Python library ‘summa’. The summa library uses the TextRank algorithm to choose keywords from given sentences. It is described in detail in the following GitHub repository: http://summanlp.github.io/textrank

The TextRank algorithm focuses mostly on nouns, which is why it fits well with the idea of finding a web of interconnected entities.

If the goal were to build a sentiment measurement tool, a different algorithm should be chosen.
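To show the idea behind TextRank, here is a simplified, self-contained sketch: words that co-occur within a small window are linked in a graph, and a PageRank-style iteration scores each word by how well connected it is. This is not the summa implementation and does not reproduce its preprocessing; the stopword list, window size and function name textrank_keywords are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not summa's list).
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "with", "for", "on"}


def textrank_keywords(text, top_n=3, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank score on an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Link every word to the next `window` words that follow it.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Power iteration: each word shares its score equally among its neighbours.
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[w])
            for w in neighbours
        }

    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


sample_text = "Gazprom sells gas. Transgaz buys gas. Romania imports gas."
top_keywords = textrank_keywords(sample_text, top_n=3)
print(top_keywords)
```

In the sample text, ‘gas’ co-occurs with every other word, so it receives the highest score; this is the sense in which the algorithm surfaces terms that sit at the centre of a web of related entities.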

5. Evaluation

The models taken under consideration included the Stanford CoreNLP algorithm and a simpler approach using NLTK tokenization. Stanford CoreNLP is widely used and described in many formal articles; both its speed and reliability are predictable.

The model tested against those algorithms was the neural coreference algorithm built on top of spaCy. It produced results that could satisfy the business problem presented by Identrics.

6. Deployment

Time requirements:
The whole process of data preparation and computation should not take longer than 20 minutes.

Hardware requirements:
The algorithm was run on an EC2 machine with 4 cores and 16 GB of RAM. Every step of the computation could be separated and run on a different EC2 machine. The algorithm could also be exposed as an API endpoint, so that a single document could be sent together with a list of entities. In response, a file or a JSON response could be provided with a list of entities and contextual words.

The bottleneck of the code might be Python itself or the neural net, which should be tested; one solution could be to increase the size of the EC2 machine.

Some output from the model:
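A minimal sketch of what the request and response of such an endpoint could look like, independent of any web framework. The JSON field names (document, entities, keywords), the function handle_request and the toy extractor are all hypothetical; in a real deployment the extractor would be the coreference plus TextRank pipeline.

```python
import json


def handle_request(body, extract_keywords):
    """Turn a JSON request body into a JSON response of keywords per entity."""
    payload = json.loads(body)
    document = payload["document"]
    result = {
        entity: extract_keywords(document, entity)
        for entity in payload["entities"]
    }
    return json.dumps({"keywords": result})


def toy_extractor(document, entity):
    """Stand-in for the real pipeline: every word except the entity itself."""
    words = document.lower().replace(".", "").split()
    return [w for w in words if w != entity.lower()]


request_body = json.dumps({"document": "Ana loves dog.", "entities": ["Ana"]})
response_body = handle_request(request_body, toy_extractor)
print(response_body)
```

Keeping the handler a plain function makes it straightforward to wrap in whichever server framework is used, and to test without a running server.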
Lukoil,petrom,signed,crude transport
TGN,gas,bse,company
Gazprom,gas,transgaz,old,romania

The model is able to discover not only places but also situations, such as the trade agreement in the Lukoil example.

The code for the analysis is attached to this article. It is divided into separate, described steps.
The output data file is also attached as final_output.txt.

case_onto_text_team_a_vicky


Although the data given to us has several snippets corresponding to each parent-subsidiary pair, only some of the snippets reveal an actual parent-subsidiary relationship. We therefore felt that concatenating the snippets for each pair into one single article and then training on it could give the model more information about which text snippet actually reveals the parent-subsidiary relationship. A bidirectional GRU encodes each sentence into a sentence vector, and then two attention networks try to figure out the important words in each sentence and the important sentences in each document. In addition to returning the probability of company 2 being a subsidiary of company 1, the model also returns the important sentences that triggered its prediction. For instance, when it says Oracle Corp is the parent of Microsys, it can also return that
‘Oracle Corp’s Microsys customer support portal was seen communicating with a server known to be used by the carbanak gang’ is the sentence that triggered its prediction.
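The way attention can point back to a triggering sentence can be sketched in a few lines. This is a simplified illustration, not the team's model: it scores each sentence vector with a plain dot product against a context vector and normalises the scores with a softmax, whereas a trained attention network would learn both the vectors and an extra projection. All numbers and names below are made up.

```python
import math


def attention_weights(sentence_vectors, context_vector):
    """Softmax over dot-product scores between each sentence and the context."""
    scores = [
        sum(h * u for h, u in zip(vec, context_vector))
        for vec in sentence_vectors
    ]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_important_sentence(sentences, sentence_vectors, context_vector):
    """Return the sentence with the largest attention weight, plus all weights."""
    weights = attention_weights(sentence_vectors, context_vector)
    best = max(range(len(sentences)), key=weights.__getitem__)
    return sentences[best], weights


sentences = ["The company reported earnings.", "X's Y portal was seen online."]
# Made-up sentence vectors and context vector, standing in for learned values.
vectors = [[0.1, 0.0], [2.0, 1.0]]
context = [1.0, 1.0]
top_sentence, weights = most_important_sentence(sentences, vectors, context)
print(top_sentence)
```

The weights sum to one, so they can be read as the share of the prediction attributed to each sentence.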

Antelope SAP


The current paper examines the factors that influence the increase of the
sales volume of a retailer. The aim of the study is to create an accurate model with high explanatory
power which accounts for the promotional and competitor effects on the quantity sold as well
as to identify the main volume uplift drivers. That information could be useful when designing
marketing strategies in order to gain a competitive advantage over the other market players.

Case_VMWare TEAM anteater


Tools: R (rvest, text2vec, Matrix, textcat, irlba, NNMF). Business Understanding: facilitate topic identification for Knowledge Base articles. Data Understanding: the Knowledge Base consists of 34,646 HTML files which have a mostly homogeneous structure (example below). The articles are highly domain-specific and contain many terms that are not present in standard language dictionaries. The […]

Tiny smart data modelled with a not-so-tiny smart model – the Case of SAP


Contents: Introduction, Metadata, Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment, Conclusion
Metadata
Case: The SAP Case – Analyze Sales
Team: Chameleon
Project URL: https://github.com/Bugzey/Chameleon-SAP
Members: Stefan Panev (stephen.panev@gmail.com), Metodi Nikolov (metodi.nikolov@gmail.com), Ivan Vrategov (ivanvrategov@gmail.com), Radoslav Dimitrov (rdimitrov@indeavr.com)
Mentors: Alexander Efremov (aefremov@gmail.com), Agamemnon Baltagiannis (agamemnon.baltagiannis@sap.com)
Team Toolset: […]

Price and promotion optimization for FMCG


Introduction: The data provided consists of 3 years of weekly sales volume, the price of the product in question, the prices of the main competitors and the promotion calendar for an FMCG product. The data is provided by SAP. The task is to identify the volume uplift drivers, measure the promotional effectiveness and measure the cannibalization effect from the main competitors. […]

Ontotext case – Team _A


The objective of our task is to extract parent-subsidiary relationships from text. For example, a news item from TechCrunch says this: ‘Remember those rumors a few weeks ago that Google was looking to acquire the plug-and-play security camera company, Dropcam? Yep. It just happened.’ From this we can infer that Dropcam is a subsidiary of Google. But there are millions of companies and several million articles talking about them. A human being can get tired of doing even 10! Trust me 😉 We have developed some cool machine learning models, spanning from classical algorithms to deep neural networks, to do this for you. There is a bonus! We do not just give you probabilities. We also give out the sentences that triggered the algorithm to make the inference! For instance, when it says Oracle Corp is the parent of Microsys, it can also return that the sentence in its corpus, ‘Oracle Corp’s Microsys customer support portal was seen communicating with a server’, triggered its prediction.