Team solutions

CASE Ontotext, Team CENTROIDA

This paper presents a DNN-based approach to learning entity relations from distantly-labeled free text. The approach includes task-specific data cleaning which, while effective at removing textual noise, preserves the semantics necessary for training. The cleaned-up dataset is then used to build a number of bLSTM attention-based DNN models, hyper-tuned with recall as the optimization objective. The resulting models are joined into an ensemble that delivers our best result.


The Team
Stefan Lazov
Rim Mustafin
Alexander Kolev

Source Code
The case implementation can be found here: https://github.com/Centroida/case_ontotext

Business Understanding
Clients of Ontotext often have large text collections that need to be searched efficiently. They are particularly interested in key concepts such as organizations, people, and locations, but also in the relations between them. ML methods can learn how to extract relations expressed in text from already-annotated examples. These methods, however, need large amounts of expert annotations, which are quite expensive.

The current case explores whether an AI algorithm can be trained to infer these relations from a distantly-labeled set auto-generated via DBPedia.

Data Understanding
The dataset provided contained the following structure:

Company1 | Company2 | TextSnippet | IsParent
Centene Corporation | Health Net | Centene closed the deal with Health Net Inc. | TRUE
Health Net | Centene Corporation | Centene closed the deal with Health Net Inc. | FALSE
Aetna Inc. | Health Net | Aetna and Health Net are competitors. | FALSE

The overall set contained roughly 89,000 records and was clearly biased towards negative examples, so we knew we had to do something about this.

We also rendered a length histogram of the provided snippets to get a feel for what the distribution looked like and to adapt the inputs of our models accordingly.
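
As a minimal first-inspection sketch in Pandas (the file name, delimiter, and column handling are our illustrative assumptions about the dataset layout):

```python
import pandas as pd

# Hypothetical file name and delimiter; adjust to the actual dataset.
df = pd.read_csv("train.tsv", sep="\t",
                 names=["Company1", "Company2", "TextSnippet", "IsParent"])

# How skewed towards negative examples is the set?
print(df["IsParent"].value_counts(normalize=True))

# Snippet-length distribution, used later to pick the model input length.
lengths = df["TextSnippet"].str.split().str.len()
print(lengths.describe())
lengths.hist(bins=50)
```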

Generally, this was one of the tasks with the highest-quality data.

Data Preparation
Looking at the snippets column and browsing through its contents, we realized we had to do a number of things (a rough sketch follows the list):

  • Do a general clean-up: remove new lines, end-of-line characters, punctuation, etc.
  • Substitute all company names with predefined tokens, so the model could focus solely on the task at hand – extracting & classifying the relationship context
  • Preserve some of the text that supports the relationship we had to model – i.e. text like Microsoft-owned or Microsoft's
  • Do stop-word filtering, carefully keeping the stop words that carry information related to the task at hand – e.g. its, his, has, etc.
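
A minimal sketch of the clean-up (the entity tokens, stop-word list, and regexes are our illustrative assumptions, not the exact ones we used):

```python
import re

# Illustrative stop-word subset; relation-bearing words such as
# "its", "his", "has" are deliberately kept out of this list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to"}

def clean_snippet(snippet, company1, company2):
    text = re.sub(r"\s+", " ", snippet)        # strip newlines, collapse whitespace
    text = text.replace(company1, "COMPANY1")  # hypothetical entity tokens
    text = text.replace(company2, "COMPANY2")
    text = re.sub(r"[^\w\s'-]", " ", text)     # drop punctuation, keep ' and -
    tokens = text.split()                      # "COMPANY1-owned", "COMPANY1's" survive
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# clean_snippet("Centene closed the deal with Health Net Inc.",
#               "Centene", "Health Net")
# -> ['COMPANY1', 'closed', 'deal', 'with', 'COMPANY2', 'Inc']
```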

Overall, we inserted or preserved a total of 329,317 tokens in our ‘clean’ dataset, which after processing consists of 2,152,379 words.

Once we had the clean data, we trained 100-dimensional word2vec embeddings on top of Wikipedia and then fine-tuned them on our clean set, extending the vocabulary with our special tokens and task-specific corpora.
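
A minimal sketch of this two-stage training with Gensim (the corpus variables are placeholders; parameter names follow Gensim 4.x, where vector_size replaced the older size):

```python
from gensim.models import Word2Vec

# Stage 1: train 100-dimensional embeddings on tokenized Wikipedia text.
# wiki_sentences is a placeholder iterable of token lists.
model = Word2Vec(sentences=wiki_sentences, vector_size=100,
                 window=5, min_count=5, workers=4)

# Stage 2: extend the vocabulary with our special tokens and fine-tune
# on the cleaned, task-specific snippets (clean_sentences is a placeholder).
model.build_vocab(clean_sentences, update=True)
model.train(clean_sentences, total_examples=model.corpus_count, epochs=5)

model.save("word2vec_finetuned.model")
```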

Modeling
For building our model, we were mostly influenced by this paper: Context-Aware Representations for Knowledge Base Relation Extraction, Sorokin D., Gurevych I.

Link to the paper: http://bit.ly/2nTzdrO
Link to the paper’s implementation: https://github.com/ukplab

The ideas we used from the paper were:

  • Use an LSTM cell as an encoder
  • Enrich the model's text input with the entity positions

Based on that knowledge, we built a number of models, each consuming two inputs – the snippet data and the entity positions. Although one of these models closely resembles the baseline LSTM from the paper above, our best-performing models use stacked bLSTM cells instead and combine them with an attention layer for additional gains, especially helpful in the context of the rather long input sequences.

Our attention implementation is inspired by this paper: Hierarchical Attention Networks for Document Classification, Yang Z., Yang D., Dyer C., He X., Smola A., Hovy E.

Based on our data analysis, we concluded that our input sequences should be 75 words long, as that captures the distribution we observed well. To combat overfitting, all our models make use of dropout.
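
A minimal Keras sketch of one such model (the layer sizes, position-marker vocabulary, and the simple additive attention are our illustrative assumptions; the actual values came out of hyper-tuning):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_LEN = 75         # input sequence length from our data analysis
VOCAB_SIZE = 50000   # placeholder vocabulary size
EMB_DIM = 100        # matches our word2vec embeddings

tokens = layers.Input(shape=(MAX_LEN,), name="snippet")
positions = layers.Input(shape=(MAX_LEN,), name="entity_positions")

# Token embeddings would be initialized from the fine-tuned word2vec
# weights; position markers get a small learned embedding.
tok_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)
pos_emb = layers.Embedding(4, 8)(positions)  # e.g. pad / other / entity1 / entity2
x = layers.Concatenate()([tok_emb, pos_emb])

# Stacked bidirectional LSTMs with dropout against overfitting.
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.3))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.3))(x)

# Simple additive attention: score each timestep, softmax over time,
# then take the weighted sum of the bLSTM outputs.
scores = layers.Dense(1, activation="tanh")(x)
weights = layers.Softmax(axis=1)(scores)
context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])

out = layers.Dense(1, activation="sigmoid", name="is_parent")(context)
model = models.Model(inputs=[tokens, positions], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```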

Other than the input sequence length, the rest of the hyperparameters were the result of a hyperparameter search using TPE (hyperas).
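
A minimal sketch of how such a TPE search is wired up with hyperas (the data loader, search space, and the average_recall helper are illustrative placeholders):

```python
import numpy as np
from hyperopt import Trials, STATUS_OK, tpe
from hyperas import optim
from hyperas.distributions import choice, uniform

def data():
    # Placeholder loader: swap in the real padded arrays and labels.
    x_train, y_train = np.load("x_train.npy"), np.load("y_train.npy")
    x_val, y_val = np.load("x_val.npy"), np.load("y_val.npy")
    return x_train, y_train, x_val, y_val

def create_model(x_train, y_train, x_val, y_val):
    from tensorflow.keras import layers, models
    # hyperas rewrites the {{...}} templates into TPE-sampled values.
    model = models.Sequential([
        layers.Embedding(50000, 100, input_length=75),
        layers.Bidirectional(layers.LSTM({{choice([64, 128, 256])}})),
        layers.Dropout({{uniform(0.1, 0.5)}}),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=10, batch_size=64, verbose=0)
    # We optimized for recall; hyperopt minimizes, so negate it.
    recall = average_recall(model, x_val, y_val)  # hypothetical helper
    return {'loss': -recall, 'status': STATUS_OK, 'model': model}

best_run, best_model = optim.minimize(model=create_model, data=data,
                                      algo=tpe.suggest, max_evals=50,
                                      trials=Trials())
```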

[Figure: convergence of one of our models during hyper-tuning]

Evaluation
Here are the results our ensemble achieves:

 Negative Recall: 0.9557852882703778
 Positive Recall: 0.8780861244019139
 Average Recall: 0.9169357063361459
 Accuracy:  0.9329775280898877
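
For reference, the average recall above is the unweighted mean of the two class recalls: (0.9558 + 0.8781) / 2 ≈ 0.9169. A minimal sketch of computing these numbers, e.g. with scikit-learn (array names are placeholders):

```python
from sklearn.metrics import accuracy_score, recall_score

# y_true: gold is_parent labels; y_pred: binarized ensemble predictions.
neg_recall = recall_score(y_true, y_pred, pos_label=0)
pos_recall = recall_score(y_true, y_pred, pos_label=1)

print("Negative Recall:", neg_recall)
print("Positive Recall:", pos_recall)
print("Average Recall:", (neg_recall + pos_recall) / 2)
print("Accuracy:", accuracy_score(y_true, y_pred))
```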

Final submission set: https://github.com/Centroida/case_ontotext/blob/master/final_csv_centroida.csv
The model's output is in the is_parent column.

Deployment
Our ensemble can easily be pruned and distilled into a single, resource-efficient model that’s deployable and servable at scale.
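
A minimal sketch of one way to do that distillation, following the knowledge-distillation idea of Hinton et al. (the model and array names are placeholders):

```python
import numpy as np

# Average the ensemble members' probabilistic outputs into soft labels.
soft_labels = np.mean(
    [m.predict([x_tokens, x_positions]) for m in ensemble_models], axis=0)

# Train a single, smaller student model on the soft labels instead of
# the hard TRUE/FALSE annotations.
student.compile(optimizer="adam", loss="binary_crossentropy")
student.fit([x_tokens, x_positions], soft_labels, epochs=5, batch_size=64)
```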

Tech Stack
To solve this case and build the models described above, we’ve extensively used the following frameworks and tools:

  • Keras
  • TensorFlow
  • Pandas
  • Hyperas
  • Gensim
  • PySpark


7 thoughts on “CASE Ontotext, Team CENTROIDA”

    1. We could, but… we're honestly quite tired, so we'll skip this for now and leave it for later. It's not relevant for the overall task anyway:

      From the Ontotext case – ” … The teams will only need to identify if there is a relation of type _is parent of_ . … “.

We've done that – the is_parent column (like in the training data) is present 🙂

  1. Good understanding of the business problem and a good choice of algorithm to tackle the task. I would have liked to learn a bit more about the nitty-gritty of the solution from the article – what was the structure of the best-performing model, how was the data split, and what data were recall and accuracy calculated on?

I'm looking forward to seeing how well your approach did on the test data, as the results you've gotten are quite promising.

  2. Nice work! You found a very relevant paper by a world-top research group and extended it with ideas from another paper, improvements to the network architecture, and an exploration of hyper-parameter values. The model uses deep learning and state-of-the-art tools and techniques.

    How were the company names normalized exactly?

Do you do anything special to handle the asymmetry of the relation?

The accuracy is very high, but what is the baseline? Also, what is the F1?
Any results from cross-validation on the training dataset for different choices of the network architecture's hyperparameters?

    Any thought what can be done next to further improve the model?

    1. Thanks for the feedback, Preslav. Much appreciated.

Normalization – we replace the column1/column2 occurrences in the snippet with a predefined token; all our extra vocabulary follows this syntax.

We do nothing special for the asymmetry other than the model architectures used (the bLSTMs should contribute quite a bit to tackling this subtask).

We'll post the F1 and other data in a separate blog post… we didn't have much time to crunch these numbers as well.

As for improvement – yup. Based on gut-feeling confidence, we think we can squeeze at least another 2-3% out of this, mostly via a larger hyperparameter search.

      Cheers,
      A

  3. Great work! I agree with the mentors above 🙂 You built a great model and achieved a very high score! Great productivity for a group of 3 people. I would also like to see more graphics, scores, etc.
    Here are some notes I made reading your article:
    – the dataset is not that biased; negative examples do not outnumber positive ones by orders of magnitude
    – normalizing the company names is a very good idea
    – stop words may bring value in some cases; always test and verify whether removing them actually helps
    – A_team noted that in many examples the text doesn't hold enough information about the relation between the two companies (producing erroneous examples). Did you observe that, and did you try to handle it?
    – If the results are on the test set, these are great results! A very good application of neural networks.
