On Coreference Extraction from Identrics' Documents
Identrics would like to extract knowledge from unstructured text. One use case is automating information extraction from news articles. Some of Identrics' clients are in the finance industry and need to understand the impact of news on the valuation of certain companies listed on a stock index. To help their clients' analysts arrive at insights faster, Identrics' NLP team uses coreference resolution to extract all mentions of an entity (such as a company or person) in a news article and presents this information back to the analysts.
To approach the problem outlined above, we are given two files: documents.csv and entities.csv. Each row of documents.csv contains an unprocessed document text and the corresponding document ID. entities.csv contains a list of manually tagged mentions indicating the entities present in the documents, keyed by document ID. The entities.csv file includes the following columns:
entity_id – a generated integer identifier for the entity
doc_id – a generated document identifier that references the same doc_id in table `
word – the string representation of the mentioned entity
type – the type of the mentioned entity. In this dump all entities are of type “ORGANIZATION”
start_offset – the position of the first character of the entity in the document text
end_offset – the position of the last character of the entity in the document text
Column entity_id carries the primary key constraint for the table. Column doc_id does not have a foreign key constraint to doc_id in `
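To make the schema above concrete, here is a minimal sketch that reads both files and checks that each mention's offsets actually point at the expected substring of the document text. The file names come from the task; the inline sample rows are invented for illustration only.

```python
import csv
import io

# Inline stand-ins for documents.csv and entities.csv (sample content
# invented for illustration; the real files follow the same column layout).
documents_csv = """doc_id,text
1,Acme Corp announced record profits. Acme shares rose 5%.
"""

entities_csv = """entity_id,doc_id,word,type,start_offset,end_offset
1,1,Acme Corp,ORGANIZATION,0,8
2,1,Acme,ORGANIZATION,36,39
"""

# Load documents into a doc_id -> text mapping.
docs = {row["doc_id"]: row["text"]
        for row in csv.DictReader(io.StringIO(documents_csv))}

# Verify that each tagged mention's offsets match the document text.
# end_offset is the position of the LAST character, so the slice ends at end + 1.
for row in csv.DictReader(io.StringIO(entities_csv)):
    text = docs[row["doc_id"]]
    start, end = int(row["start_offset"]), int(row["end_offset"])
    assert text[start:end + 1] == row["word"], row
print("all offsets consistent")
```

A check like this is a cheap way to confirm the inclusive/exclusive convention of the offsets before building anything on top of them.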
The use case Identrics outlined is indeed not a typical NLP task, and the team showed a lot of courage in picking it up. Using a NeuralCoref server together with spaCy is an interesting approach and a good starting point for this use case.
While using the NeuralCoref library is a good choice, it would have been even better to fine-tune the model on the data for this specific use case. Examining model performance graphs and working out how to optimize the NLP pipeline so that it performs best on this problem are further things the team could have tried. In addition, some of the suggested sections of the CRISP-DM methodology are missing.
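As a first step toward those performance graphs, the predicted mention spans could be scored against the manually tagged offsets in entities.csv with exact-span precision and recall. A minimal sketch follows; the function name and the sample spans are hypothetical, not taken from the team's code.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 for (doc_id, start, end) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: gold mentions from entities.csv vs. model output.
gold = [(1, 0, 8), (1, 36, 39), (2, 10, 14)]
pred = [(1, 0, 8), (1, 36, 39), (2, 0, 3)]
p, r, f1 = span_prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Tracking these numbers while varying the model's parameters would give exactly the kind of optimization evidence the review asks for.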
In conclusion, the team did a fantastic job of bootstrapping available resources in minimal time to get an outcome, but missed some important machine learning optimization steps.
Keep up the great work and please feel free to reach out to me if you have specific questions.