IDENTRICS – Team Vistula
IDENTRICS
1. Business Understanding
Identrics, as an R&D unit, concentrates on creating advanced, well-engineered technology based on AI and automation. Its core activity is the creation of text mining algorithms and tools that fulfil three major company aims: unmatched quality, an optimal approach and advanced flexibility.
The business case presented in this article is to extract keywords (tags) describing entities' activity and properties, in context, from news articles. The result will help Identrics' customers make predictions about their future development. Success will be measured as a list of entities, document ids and matched tags. This list will help to identify trends in how an entity was described in the media over time: whether in a positive or negative way, and on which major topics it appeared (such as stock exchange movements, structural and political changes, redundancies or appointments of employees, and new service announcements).
The main goal of the project is to create a parser that iteratively adds tags to entities in each document. Success rests both on proper data processing and on the neural coreference text mining model.
The remaining parts of this paper cover data understanding and the way we used neural coreference algorithms. Next comes data preparation, where we describe how we set up the technical environment and how we processed the data. Section 4 details the modelling process and the TextRank algorithm. Finally, the Evaluation and Deployment sections present examples produced by the model and ideas for further development.
2. Data Understanding
The project data package consisted of two CSV files: Documents.csv and Entities.csv. The first contained all article/news texts; the second contained the list of entities for which we were looking for keywords from context.
The main challenge in data understanding was to discover the potential hidden in entities referenced by personal pronouns. A simple example shows how a successful conversion works:
Input: Ana has a dog. She loves him.
Output: Ana has a dog. Ana loves a dog.
Thanks to this transformation, we could broaden the description available to the context algorithm from only the first sentence to both sentences. The keywords for 'Ana' then become:
Before data transformation: Ana: ‘have’, ‘dog’
After data transformation: Ana: ‘have’, ‘dog’, ‘love’
This would most likely yield a positive sentiment for the entity 'Ana', as 'love' is one of the strongest indicators of positive sentiment.
3. Data Preparation
Step 1:
Every document was sent to a neural coreference server built as an EC2 instance on AWS, using the Tornado Python framework. On the server, a script cloned the https://github.com/huggingface/neuralcoref repository, and the one_shot_coref() and get_resolved_utterances() functions were run against the data. As a result, words such as "her", "his", "it" and other context-dependent references were replaced with the names of the entities in each document.
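A minimal sketch of this resolution step is shown below, assuming the standalone Coref interface that the neuralcoref repository provided at the time (the function names follow its README; the Tornado server wrapping and model download are omitted):

# Sketch of Step 1: pronoun resolution with the standalone neuralcoref Coref class.
# Assumes the neuralcoref repository has been cloned and its English model is available.
from neuralcoref import Coref

coref = Coref()

def resolve_document(text):
    """Return the document text with pronouns replaced by the entities they refer to."""
    coref.one_shot_coref(utterances=text)       # run the neural coreference model
    resolved = coref.get_resolved_utterances()  # utterances with mentions substituted
    return " ".join(resolved)

print(resolve_document(u"Ana has a dog. She loves him."))
# Expected output (approximately): Ana has a dog. Ana loves a dog.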
Step 2:
The next step was to gather all the sentences in which an entity appeared, concatenate those sentences into a single string, and collect all such strings into a list keyed by entity, as sketched below.
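The sketch assumes the resolved documents are plain strings and that a case-insensitive substring match is enough to detect an entity mention (variable names are illustrative):

# Sketch of Step 2: collect, per entity, every sentence that mentions it.
import re

def sentences_for_entities(resolved_documents, entities):
    """Map each entity to a single string containing all sentences that mention it."""
    entity_context = {entity: [] for entity in entities}
    for doc_id, text in resolved_documents.items():
        # naive sentence split on ., ! or ? followed by whitespace
        sentences = re.split(r"(?<=[.!?])\s+", text)
        for sentence in sentences:
            for entity in entities:
                if entity.lower() in sentence.lower():
                    entity_context[entity].append(sentence)
    # concatenate each entity's sentences into one string
    return {entity: " ".join(sents) for entity, sents in entity_context.items()}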
Step 3:
Documents.csv and Entities.csv were also transformed into a format that is more machine-readable and much friendlier for looping over.
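One possible loading step is sketched below with pandas; the column names document_id, text and entity are assumptions, as the actual headers in the provided files may differ:

# Sketch of Step 3: load the two CSV files into structures that are easy to loop over.
import pandas as pd

documents_df = pd.read_csv("Documents.csv")
entities_df = pd.read_csv("Entities.csv")

# dictionary: document id -> article text (column names are assumed)
documents = dict(zip(documents_df["document_id"], documents_df["text"]))

# plain list of entity names to look up in the resolved documents
entities = entities_df["entity"].dropna().unique().tolist()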
4. Modeling
We chose an algorithm implemented in the Python library 'summa'. The summa library uses the TextRank algorithm to choose keywords from the given sentences. It is described in detail at the following repository page: http://summanlp.github.io/textrank
The TextRank algorithm focuses mostly on nouns, which is why it fits well with the idea of building a connected web of interlinked entities.
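A minimal sketch of the keyword extraction, fed with the per-entity context strings built in Step 2 (entity_context is the assumed output of that step, and the number of keywords per entity is an illustrative parameter):

# Sketch of the modelling step: TextRank keyword extraction with the summa library.
from summa import keywords

def tags_for_entities(entity_context, words_per_entity=5):
    """Return a mapping entity -> list of TextRank keywords taken from its context."""
    tags = {}
    for entity, context in entity_context.items():
        if not context:
            continue
        tags[entity] = keywords.keywords(context, words=words_per_entity, split=True)
    return tags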
If the idea of building a sentiment measurement tool were pursued, another algorithm should be chosen.
5. Evaluation
Some of the models taken under consideration were the Stanford CoreNLP algorithm and a simpler approach using NLTK tokenization. The Stanford CoreNLP algorithm is widely used and described in many formal articles, so both its speed and reliability are predictable.
The model tested against those algorithms was the neural coreference algorithm built on top of spaCy. It produced results that satisfy the business problem presented by Identrics.
6. Deployment
Time requirements:
The whole process of data preparation and computation should not take longer than 20 minutes.
Hardware requirements:
The algorithm was run on an EC2 machine with 4 cores and 16 GB of RAM. Each step of the computation could be separated and run on a different EC2 machine. The algorithm could also be exposed as an API endpoint, so that a single document could be sent together with a list of entities; as a result, a file or a JSON response would be returned with the list of entities and their contextual words.
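A sketch of such an endpoint, assuming the Tornado framework already used for the coreference server and the helper functions sketched in the previous sections (handler name, port and JSON field names are illustrative):

# Sketch of a Tornado endpoint returning contextual keywords for posted entities.
import json
import tornado.ioloop
import tornado.web

class TagHandler(tornado.web.RequestHandler):
    def post(self):
        payload = json.loads(self.request.body)
        document = payload["document"]   # a single news text
        entities = payload["entities"]   # list of entity names
        resolved = resolve_document(document)                          # Step 1 sketch
        context = sentences_for_entities({"doc": resolved}, entities)  # Step 2 sketch
        tags = tags_for_entities(context)                              # modelling sketch
        self.write(json.dumps(tags))

def make_app():
    return tornado.web.Application([(r"/tags", TagHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()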
The bottleneck might be Python itself or the neural network, which should be tested; one possible remedy would be to increase the size of the EC2 machine.
Some example outputs from the model:
Lukoil,petrom,signed,crude transport
TGN,gas,bse,company
Gazprom,gas,transgaz,old,romania
The model is able to discover not only places but also events such as a trade agreement, as shown in the Lukoil example.
The code for the analysis is attached to this article. It is divided into separate, described steps.
The output data file is also attached as final_output.txt.