The task is to extract words representative of an entity’s activities and properties within a specified context of interest in news articles.
Identrics is a dedicated research and development unit focused on AI and automation, and part of A Data Pro group and its 400-strong team.
Through Identrics’ technological advancements, clients get more out of data, do more with it, and access it more easily. By training the technologies on clients’ own datasets, our team can reach precision and quality rates of up to 95%, usually achievable only through manual work.
Identrics extracts knowledge from unstructured text. We apply classifiers to differentiate the content between a set of predefined classes. A common client request is not just to classify the full text, but also the named entities mentioned within the documents.
This problem extends beyond classifying entities into basic types such as person, organisation, or geographical location. Entities can be classified with custom categories such as positive/negative sentiment towards persons or products, financial growth detection for companies, or simply identification of the entities with a central role in the described events.
To achieve this kind of analysis, a sophisticated preprocessing step is needed: recognising and collecting the set of words that directly describe an entity’s behaviour in the context of a document. This set of words represents the context of that entity’s mentions in the document. These contextually related word sets can then serve as training instances for the machine learning process once people manually annotate them with the relevant classes.
Right now we are looking for a solution that can efficiently extract the entities’ contexts so that we can apply it in our machine learning process. Context extraction involves several concrete steps: tokenization, lexical and syntactic analysis, coreference chain resolution, and dependency parsing. We challenge you to dive deeper into semantic analysis and the complexity of language structure.
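To make the pipeline concrete, here is a minimal, purely illustrative sketch of the early stages: naive sentence splitting, tokenization, and a toy coreference step that maps aliases (e.g. "the company") back to a canonical entity name. The alias table, entity name, and sample text are all invented for illustration; a real solution would use a trained coreference resolver rather than a lookup table.

```python
import re

# Assumed (hand-made) coreference output: alias -> canonical entity name.
ALIASES = {"the company": "Acme Corp", "it": "Acme Corp"}

def sentences(text):
    # Naive sentence splitter, good enough for the sketch.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def mention_sentences(text, entity):
    """Return the sentences that mention the entity, after resolving aliases."""
    hits = []
    for sent in sentences(text):
        resolved = sent
        for alias, canonical in ALIASES.items():
            # Word-boundary match so "it" does not fire inside other words.
            resolved = re.sub(r"\b" + re.escape(alias) + r"\b",
                              canonical, resolved, flags=re.IGNORECASE)
        if entity.lower() in resolved.lower():
            hits.append(resolved)
    return hits

doc = ("Acme Corp announced a new service. "
       "The company also appointed a new CEO.")
print(mention_sentences(doc, "Acme Corp"))
# Both sentences are returned, because the alias resolves to the entity.
```

The point of the sketch is only the shape of the data flow: without the coreference step, the second sentence would be lost, and with it the appointment signal the context set needs.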
Important Problem Specifics
In this Datathon, we are providing you with a database containing news in English, as well as the extracted entities of an “Organisation” type. The desired output is a dataset with contextually relevant words, referring to the individual entities.
Your goal will be to create instances for machine learning from the words related to each entity. The entity type is “Organisation”, and the context of interest covers stock exchange movements, structural and political changes, redundancies or appointments of employees, and new service announcements.
This context would give us the ability to make future predictions about a company’s development. Every instance in this dataset should be built from the best coreference-chained sentences for the organisation in the document, keeping only the context-relevant words that depend on the entity. Most documents mention two or more companies, and some review articles contain dozens of entities of the “Organisation” type, each with its own, different context.
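A naive baseline for building such instances, sketched below on toy data, is a bag of content words per organisation: for every sentence, attribute its non-stopword, non-entity tokens to each organisation mentioned in it. The sentences, stopword list, and entity sets are all invented. Note how the shared third sentence credits "joint" and "service" to both companies, which is exactly the multi-entity ambiguity the task asks you to resolve better than this.

```python
# Toy stopword list, purely for illustration.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "after", "also"}

# (sentence, organisations mentioned in it) -- hand-made sample data.
SENTENCES = [
    ("Acme Corp shares fell after the redundancy announcement.", {"Acme Corp"}),
    ("Globex appointed a new chief executive.", {"Globex"}),
    ("Acme Corp and Globex announced a joint service.", {"Acme Corp", "Globex"}),
]

def instances(sentences):
    """Build one bag-of-context-words training instance per organisation."""
    out = {}
    for sent, entities in sentences:
        words = [w.lower() for w in sent.rstrip(".").split()]
        for ent in entities:
            ent_tokens = {t.lower() for t in ent.split()}
            context = [w for w in words
                       if w not in STOPWORDS and w not in ent_tokens]
            out.setdefault(ent, []).extend(context)
    return out

print(instances(SENTENCES))
```

The baseline deliberately ignores syntax, so words from a shared sentence leak into every co-mentioned entity's instance; a dependency-aware filter is the natural next refinement.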
Effective and precise NLP is essential to the task. The English Universal Dependencies treebank is a good starting point. Dependency parsing and an understanding of the relations between the words in a sentence will tell us which of them are valuable for machine learning.
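To show what a dependency-based filter buys over a bag of words, here is a sketch over a hand-annotated parse in (id, form, head, deprel) style, loosely following CoNLL-U conventions. The sentence, tree, and the "entity's head plus its subtree" heuristic are illustrative assumptions, not the Datathon's required method; in practice the parse would come from a parser trained on Universal Dependencies.

```python
# Hand-annotated parse of: "Acme announced strong quarterly results."
# Columns: (token id, surface form, head id, dependency relation);
# head id 0 marks the root, as in CoNLL-U.
PARSE = [
    (1, "Acme",      2, "nsubj"),
    (2, "announced", 0, "root"),
    (3, "strong",    5, "amod"),
    (4, "quarterly", 5, "amod"),
    (5, "results",   2, "obj"),
]

def related_words(parse, entity_id):
    """Collect the entity's syntactic head plus every token whose
    ancestor chain passes through that head (a simple heuristic)."""
    heads = {tid: head for tid, _, head, _ in parse}
    forms = {tid: form for tid, form, _, _ in parse}
    head = heads[entity_id]
    related = {head} if head != 0 else set()
    for tid in heads:
        t = tid
        while t != 0:
            if t == head and tid != entity_id:
                related.add(tid)
                break
            t = heads[t]
    return [forms[t] for t in sorted(related)]

print(related_words(PARSE, 1))
# -> ['announced', 'strong', 'quarterly', 'results']
```

Because the filter walks the tree rather than the surface string, it keeps the predicate and its modifiers attached to "Acme" while naturally excluding material that hangs off a different entity's subtree in longer, multi-company sentences.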