Our team name is: Team A
Members: @anie @radpet @kuzman @vicky @rseoane @greenhat
The Ontotext Case – ‘Luke, I am Your Father’
“Well, if droids could think, there’d be none of us here, would there?” — Obi-Wan Kenobi
Abstract
The objective of our task is to extract parent-subsidiary relationships from text. For example, a news article from TechCrunch says: ‘Remember those rumors a few weeks ago that Google was looking to acquire the plug-and-play security camera company, Dropcam? Yep. It just happened.’ From this sentence we can infer that Dropcam is a subsidiary of Google. But there are millions of companies and several million articles talking about them. A human being gets tired after doing even 10! Trust me 😉 We have developed some cool machine learning models, spanning from classical algorithms to deep neural networks, to do this for you. And there is a bonus! We do not just give you probabilities. We also give out the sentences that triggered the algorithm to make the inference! For instance, when it says Oracle Corp is the parent of Microsys, it can also return the sentence in its corpus that triggered its prediction: ‘Oracle Corp’s Microsys customer support portal was seen communicating with a server’.
Introduction
The aim of the current paper is to explore the idea of optimizing searching within large text collections by teaching an AI to read text and identify parent-subsidiary relations between relevant entities. To achieve this goal the data provided by Ontotext is analyzed and incorporated into models in an attempt to extract simple principles and to offer accurate predictions.
1. Business understanding
At present it takes a lot of resources to assist Ontotext’s clients in their search experience with large text collections – it is both expensive and time-consuming to prepare the necessary annotations. The end goal of the paper is to present a simplified process for obtaining text with annotations, which will enable clients to perform searches more efficiently and will ultimately reduce the cost of attaining such data selections.
Another important question that has to be answered is whether precision or recall matters more for the current business. If the business needs very high recall and someone can manually verify the relationships suggested by the algorithm, one of our final models also gives Ontotext the opportunity to retrieve the sentences which triggered the algorithm’s prediction; this will result in a cost cut even if manual intervention is needed.
One of the main lessons that we learned from this project is that the training data is considerably different from the test data, especially in terms of the average number of articles per parent-subsidiary pair and the issues discussed, which created a lot of trouble in modeling RNNs. Therefore it is better to have a considerable amount of training data labeled from the actual test data source itself (a private source that only Ontotext has access to), maybe via Amazon Mechanical Turk, to quickly improve the model. This would cut a considerable cost for the company, as the data science team is more expensive than AMT 😉.
2. Project Plan
The core idea of the team is to start with simpler models which can be used as a baseline, and then to build more complicated models on top of them.
The project spans over multiple stages, the main ones being:
- Data exploration and Preprocessing
- Augmentation of the existing set with the use of NLG APIs as well as Coreference resolution of entities
- Classical machine learning techniques (TF-IDF, Logistic Regression, SVM, Naive Bayes)
- Neural Networks
3. Data description
The given data consists of 4 columns with information on a parent and subsidiary company, a text snippet containing information on the two companies and a final column with true and false indicators, showing whether there is a positive or negative parent-subsidiary relationship between the entities considered.
Data exploration
There are a total of 89 452 observations in the training dataset. In terms of entities – 451 unique companies are included in the dataset as a whole and there is a difference in the number of unique parent companies relative to that of subsidiary ones:
| All companies | Parent companies | Subsidiaries |
| --- | --- | --- |
| 451 | 440 | 447 |
Several steps were taken in order to clean the data.
Firstly, duplicate rows were addressed – their total number is 10 069. For the purposes of the analysis only unique rows were considered and thus these observations were excluded.
Some issues were discovered in the snippet column – company names sometimes appear concatenated with other words, which leads to the formation of non-existent words that pollute the data. The issue was resolved by inserting extra spaces before and after every company name within the snippet column.
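A minimal sketch of this cleanup step, assuming pandas and the column names `parent`, `subsidiary` and `snippet` (the actual names in the file may differ):

```python
import re
import pandas as pd

def pad_company_mentions(row):
    """Insert spaces around company mentions that got glued to neighbouring words."""
    snippet = row["snippet"]
    for company in (row["parent"], row["subsidiary"]):
        snippet = snippet.replace(company, f" {company} ")
    return re.sub(r"\s+", " ", snippet).strip()  # collapse the extra whitespace again

df = pd.read_csv("train.csv")  # hypothetical file name
df["snippet"] = df.apply(pad_company_mentions, axis=1)
```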
In addition, 200 instances of articles with mention of only one company within the context of a given pair were identified. For simplification, they were ignored when considering their position as a feature.
The length distribution of each snippet is examined – both in terms of characters and words.
There are 62 881 instances with a negative relationship among companies while in 26 571 cases parent-subsidiary relations score as positive.
We did not manage to implement all the data insights we wanted, but we created a short notebook to illustrate the ideas. You can visit it using the link below.
https://dss-www-production.s3.amazonaws.com/uploads/2018/02/DataVisualization.ipynb
4. Data preparation
Data selection
Part of the data selection process involves augmentation of the existing set via NLG APIs. Given a data row, the values in the first three columns are maintained while the fourth column (text snippet) is processed through a “spintext” tool to generate variations that maintain the same relationships between entities, but where sentence order and nouns have been changed. By following this approach the NN may be trained to understand the relationships without relying on specific words. Another appealing approach, which we did not try, was to use various translation APIs to obtain paraphrased sentences with more or less the same meaning. For example, round trips such as en->fr->en, en->de->en and en->es->en would generate enough diversity. We can also apply this to a sample from the test set and create additional variants of the sample that may be recognized by the classifier more easily. If you are interested you can read more at
https://github.com/radpet/team_a/blob/master/Ruben/Increasing%2BTraining%2BDataset%2B.ipynb
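As an illustration of the back-translation idea (something we did not actually run), here is a sketch with a placeholder `translate` helper that any real translation API could back:

```python
def translate(text, src, dst):
    """Placeholder – plug in any real translation API here."""
    raise NotImplementedError

def back_translate(snippet, pivots=("fr", "de", "es")):
    """Paraphrase a snippet via en -> pivot -> en round trips."""
    variants = []
    for pivot in pivots:
        pivoted = translate(snippet, src="en", dst=pivot)
        variants.append(translate(pivoted, src=pivot, dst="en"))
    return variants

# Each variant keeps the original (parent, subsidiary, label) columns of its source row.
```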
Co-reference resolution of entities was planned to be an additional feature for our models but we have not reached that stage yet.
The idea is to use coreference resolution in the preprocessing in order to replace coreferences like “they” with the respective company names. For example:
“To sweeten its new unlimited data offering, AT&T said that existing DirecTV or U-Verse customers who don’t subscribe to its wireless service can get $500 if they switch to the AT&T Unlimited Plan with an eligible trade-in, and buy a smartphone on AT&T Next.”
could be viewed as:
“To sweeten AT&T new unlimited data offering, AT&T said that existing DirecTV or U-Verse customers who don’t subscribe to AT&T wireless service can get $500 if DirecTV or U-Verse customers who don’t subscribe to its wireless service switch to the AT&T Unlimited Plan with an eligible trade-in, and buy a smartphone on AT&T Next.”
That way the model can detect relationships not only based on the company entity but also based on references elsewhere in the text.
In order to achieve this, a Stanford CoreNLP server (https://stanfordnlp.github.io/CoreNLP/corenlp-server.html) can be utilized. Furthermore, coreferences that do not point to companies are more or less pointless and can be discarded.
https://github.com/radpet/team_a/blob/master/kuzman/coref_resolution.py
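A minimal sketch of talking to such a server over HTTP (assuming it runs locally on port 9000; the linked script performs the actual replacement of mentions):

```python
import json
import requests

CORENLP_URL = "http://localhost:9000"

def get_coref_chains(text):
    """Return the coreference chains CoreNLP finds in a snippet."""
    props = {"annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref",
             "outputFormat": "json"}
    resp = requests.post(CORENLP_URL, params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    resp.raise_for_status()
    return resp.json().get("corefs", {})

# Chains whose representative mention is not a company name can simply be skipped.
```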
5. Classical machine learning
For the purposes of the analysis and training our models, the data was split into three parts – a train set (with 70% of the data), on which models are trained; a dev set (20% of the data), on which hyperparameter optimization is performed; and a test set (10% of the data), on which the final test is run.
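Such a 70/20/10 split takes only a couple of lines with scikit-learn (a sketch, assuming the cleaned data frame `df` with a `label` column):

```python
from sklearn.model_selection import train_test_split

# 70% train, then the remaining 30% is split 2:1 into dev (20%) and test (10%).
train_df, rest_df = train_test_split(df, test_size=0.30, random_state=42,
                                     stratify=df["label"])
dev_df, test_df = train_test_split(rest_df, test_size=1/3, random_state=42,
                                   stratify=rest_df["label"])
```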
6. The Attentive Recurrent Neural Network
Data Understanding
Although the data given to us has several snippets corresponding to each parent-subsidiary pair, only some of the snippets reveal the actual parent-subsidiary relationship. Therefore we felt that concatenating the snippets corresponding to each pair into one single article and then training could give the model more information about which text snippet actually reveals the parent-subsidiary relationship (please check Data Preparation, step 1, for more info). Therefore, from 79 383 relationship rows in the training data we end up with 952 rows.
In each text snippet, irrespective of the real names of the companies, the context around them offers information on the directed parent-subsidiary relationship. Hence, in each text snippet we replaced the corresponding parent company and subsidiary company with the aliases ‘company1’ and ‘company2’.
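A minimal pandas sketch of both steps – grouping the snippets into one document per pair and replacing the concrete names with the aliases (same assumed column names as before):

```python
def anonymize(row):
    """Replace the concrete company names with the aliases company1/company2."""
    return (row["snippet"]
            .replace(row["parent"], "company1")
            .replace(row["subsidiary"], "company2"))

df["snippet"] = df.apply(anonymize, axis=1)

# One row ("document") per pair; all snippets of a pair share the same label.
docs = (df.groupby(["parent", "subsidiary"])
          .agg(document=("snippet", " ".join), label=("label", "first"))
          .reset_index())
```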
Data Exploration
It is to be noted that the number of text snippets corresponding to each pair in the training data varied widely, from pairs like Google and YouTube with approximately 4 000 snippets to smaller companies with only 2 or 3. Such a huge variance created big trouble on the test data, which will be explained later.
Data Preparation
- Each snippet corresponding to a parent-subsidiary relationship is considered as a single sentence (although it may grammatically contain more than one sentence), and all the snippets (sentences) corresponding to a pair are considered as a document. Therefore this problem is now transformed into a document classification problem.
- As RNNs in TensorFlow need sequences of equal length, the sequences are padded/truncated to a constant length.
- Word vectors are trained using gensim from the training corpus sentences.
- We use 750 documents (pairs) for training and 200 for validation (a short preprocessing sketch follows this list).
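Continuing from the grouping sketch above, the preprocessing could look roughly like this (gensim 4 API; for brevity each document is treated as one long token sequence here, whereas the real model keeps the snippets as separate sentences – see the linked notebooks):

```python
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 100  # constant sequence length (illustrative)

# Simple whitespace tokenization of the concatenated documents built above.
tokenized = [doc.split() for doc in docs["document"]]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2)

# Map words to integer ids (0 is kept for padding/unknown) and pad to MAX_LEN.
vocab = {word: i + 1 for i, word in enumerate(w2v.wv.index_to_key)}
sequences = [[vocab.get(w, 0) for w in sent] for sent in tokenized]
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post", truncating="post")
```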
Modelling
- Each word in a sentence is represented by the corresponding word vector, and the word vectors of a sentence are fed sequentially as the input to a bidirectional GRU.
- The GRU outputs a vector at each timestep (in our case for each word).
- Now we use an attention network which learns weights for a linear combination of the GRU outputs. The output of the attention network is a sentence vector which represents the sentence.
- Now all the sentence vectors from a document are combined with another attention network to form a document vector. Note that there is no RNN at the second level because the order of sentences does not influence the parent-subsidiary relationship (a minimal sketch of the word-level attention follows this list).
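The following is a minimal Keras sketch of the word-level half of this architecture – a bidirectional GRU followed by a learned softmax weighting of its outputs; the sentence-level attention repeats the same pattern over sentence vectors, and in the real model the final sigmoid sits on the document vector. Layer sizes are illustrative and this is not the exact competition code:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, EMBED_DIM, VOCAB_SIZE = 100, 100, 20000   # illustrative sizes

word_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")        # padded word indices of one sentence
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(word_ids)         # could be initialized from word2vec
h = layers.Bidirectional(layers.GRU(64, return_sequences=True))(emb)

# Attention: one score per timestep, softmax over timesteps, weighted sum of GRU outputs.
scores = layers.Dense(1)(h)                                     # (batch, MAX_LEN, 1)
alpha = layers.Softmax(axis=1)(scores)                          # attention weight per word
sentence_vec = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, alpha])   # (batch, 128) sentence vector

out = layers.Dense(1, activation="sigmoid")(sentence_vec)       # P(company2 is subsidiary of company1)
model = Model(word_ids, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```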
From the two snippets ‘Google acquired Youtube’ and ‘Google DSS hackathon to watch its YouTube live’, we know that for this particular task the model should learn that the first sentence is more important than the second one. Hence a thicker arrow (meaning a larger weight) has to be given to the first sentence, and within the first sentence the word ‘acquired’ is more important than the other words.
The objective of the neural network is to model this.
Evaluation
The model is capable of taking all the articles corresponding to two companies as input, and it gives out the probability that company 2 is a subsidiary of company 1 based on the text snippets.
In addition to this, the model also returns the important sentences which triggered its prediction. For example, from the test set we can see a sample output.
We can see that the model predicted the important sentence amidst several irrelevant sentences.
Results
The predicted parent-subsidiary pairs are:
| # | company1 | company2 |
| --- | --- | --- |
| 0 | Danaher_Corporation | Pall_Corporation |
| 1 | Alibaba_Group | AutoNavi |
| 2 | Berkshire_Hathaway | NV_Energy |
| 3 | UniCredit | HypoVereinsbank |
| 4 | Oracle_Corporation | MICROS_Systems |
| 5 | Boeing | Liquid_Robotics |
| 6 | RightNow_Technologies | Oracle_Corporation |
| 7 | Walmart | Walmart_de_México_y_Centroamérica |
| 8 | Boeing | Pfizer |
| 9 | CareFusion | Becton_Dickinson |
| 10 | Unocal_Corporation | Chevron_Corporation |
| 11 | Marathon_Oil | U.S._Steel |
| 12 | Citigroup | Boeing |
| 13 | Citigroup | Bank_Handlowy |
| 14 | Comcast | Paramount_Pictures |
| 15 | Morgan_Stanley | AT&T |
| 16 | Disney_Vacation_Club | The_Walt_Disney_Company |
| 17 | Tencent | HBO |
| 18 | AT&T | Fullscreen_(company) |
| 19 | Singapore_Airlines | Boeing |
| 20 | Alere | Abbott_Laboratories |
And the results from the NN that (@radpet) created can be seen here:
This confusion matrix is based on the predictions that were made on the split used for testing (10% of the whole train data), which the network has not seen. *Sadly, I lost the tokenizer of this model, so I attach the pretrained model without change (only random luck with the weight initialization makes the model score a bit better on the test split).
7. Modeling
Model selection
Before model selection took place, two approaches were considered for expressing any given company pair – either to present the parent and subsidiary as an ordered pair of 0 and 1, specifying that a different order translates to a different pair; or to display them as “company1” and “company2” and use this representation within the snippet portion.
Model types:
Classical machine learning – the key idea is to vectorize the text (with TF-IDF scores) and include it as a feature; company names are removed from the text and are instead assigned specific labels such as X and Y (in fact, this is the chosen solution to the pair dilemma described above); they are then included and it is examined whether overfitting ensues as a result.
Approaches:
- Bag of words or TF-IDF with SVM
- TF-IDF with Naive-Bayes
- TF-IDF with Logistic Regression
More info at https://github.com/radpet/team_a/blob/master/George/TF-IDF_Classical_ML.ipynb
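As a flavour of what these notebooks do, here is a minimal sketch of the TF-IDF + Logistic Regression variant with scikit-learn (the other approaches only swap the vectorizer or the classifier); `train_df`/`dev_df` and the column names follow the split sketched in Section 5:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                    LogisticRegression(max_iter=1000))
clf.fit(train_df["snippet"], train_df["label"])
print("dev F1:", f1_score(dev_df["label"], clf.predict(dev_df["snippet"])))
```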
Neural networks
Two architectures were applied:
- Architecture 1: each pair of companies and its respective snippet text is considered as a separate sample, which is fed into the NN (@radpet). More info at https://github.com/radpet/team_a/blob/master/Radi/LSTM%26CNN.ipynb
- Architecture 2: grouping is performed – the snippet portions are concatenated based on their underlying unique company pairs and this is how NN results are returned (@vicky). More info at https://github.com/radpet/team_a/tree/master/vicky
8. Evaluation
In order to evaluate the results obtained, the F1 score is considered rather than accuracy. This way the imbalanced ratio of negative to positive responses does not have as significant an effect as it does on accuracy. Most of the models implemented come with a metric evaluation consisting of a confusion matrix and a score.
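Both metrics take only a couple of lines with scikit-learn (here `y_true` and `y_pred` stand for the labels and predictions on the held-out split):

```python
from sklearn.metrics import confusion_matrix, f1_score

print("F1:", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # raw counts; divide by row sums for the normalized view
```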
9. Our Contributors
Model Implementers:
- The state of the art man: @vicky
- Iterative approach with classical machine learning techniques to find a fast baseline and neural network enthusiast: @radpet
- The hyperparameter tuner of Logistic Regression and LinearSVM: @greenhat
- Data scrapers: @rseoane and @kuzman
- Data support & article expert: @anie
Warning: This article is the result of Team-A’s collective efforts
13 thoughts on “Ontotext case – Team _A”
Overall you show great understanding of the problem and the data itself. The structure of the article is well formed. However there are a few areas in which you can improve.
First, in your evaluation don’t just add the list of pairs. It is very difficult to tell which is the parent of which just by looking at the list, especially if you’ve never heard of Danaher_Corporation and Pall_Corporation. Who’s the parent and who’s the subsidiary? Was your model correct in predicting this relationship? Add, for instance, your precision or recall values, or add a confusion matrix. Also, among these pairs, which did the algorithm misclassify and, in your opinion, why?
Second, and this one is really minor so don’t worry much about it: if you add an abstract, don’t make it too detailed. Abstracts are for people who are not experts and are not familiar with NLP. It should be a simple layman’s description of what the problem is and how you propose to solve it. Leave the detailed description for the Introduction.
Hello Toney! The main reason why I didn’t update the validation scores of the RNN-Attention model is that initially I trained the model and the best validation score was 93% accuracy, but when I used it on the test set the predictions were so terrible that even a layman would ignore them. Later I realized that the problem was that in the training set each pair had an average of 30-40 text snippets, whereas in the test set it was 1-2. Hence the validation accuracy was not translating. Instead, I reduced the number of articles in the training set to fewer than 10 per pair and retrained the model; it scored a validation accuracy of 83.2%, and the top 20 pairs you see in the article are from this model, and the results look plausible. What I am essentially trying to say is that, unlike other datasets, this particular dataset makes it tricky to report the actual validation scores.
Really good work and a very detailed description of your efforts. Seems like you got the teamwork aspect down really well because you’ve managed to accomplish an impressive number of tasks over a single weekend.
The existence of so many duplicate entries surprised me but I discussed it with Laura and apparently there really are whole snippets that frequently repeat in articles verbatim when discussing parent-subsidiary companies. Overall the data analysis was good and detailed and it was good to see that you discovered and resolved issues before feeding it to the algorithm.
The task itself can come in two formats in the wild: either with a large corpus of text that needs to be searched for relations, or as a streaming platform where each individual snippet is judged as it is received. Your solution is aimed more at the first case but can be applied to the second. I am satisfied that it is a solid algorithm that produces surprisingly good results. Looking through the pairs you’ve extracted from the (true) test set, the F-score you calculated seems to be supported.
My only criticism is that I really would have liked to see something about how the traditional ML approaches fared in comparison to your state-of-the-art algorithm. It’s great to know you tried them but I am still not sure how they rank.
Similarly the idea to generate more training samples with that service is interesting but reading the article, I am not sure if it got anywhere. Did you manage to generate extra training samples? Were they good? Were they used?
A really minor note, but the article mentions the parent_of relation isn’t transitive. It (sort of) is, and I believe you meant it is anti-symmetric (i.e. A is parent_of B means B is NOT parent_of A).
Great job, guys 🙂
Yes, Andrey! We meant anti-symmetric; 40 hours of sleeplessness 😛 True, I also regret that we did not spend more time critically comparing the results of the classical machine learning models with the NNs. Somehow these neural networks are very fancy and quickly grab the attention, like our attention model 😉 It was your and Laura’s continued support that helped us reach the finals! Thanks to you 🙂
You have done a terrific job at analyzing the data in various ways and at designing a reasonable, directed neural network model for the task. The model uses deep learning and state-of-the-art tools and techniques (but TF.IDF-based SVM solutions have also been tried for comparison).
What is the baseline F1? Also, what is the accuracy?
Any results on cross-validation based on the training dataset for different choices of the hyperparameters of the network architecture?
Any thought what can be done next to further improve the model? Maybe combine TF.IDF with deep learning? Or perform system combination? Did the different systems perform similarly on the training set (e.g., using cross-validation)?
Also, your confusion matrix is non-standard: it should show the raw counts. I wanted to calculate accuracy, but I cannot do it from this matrix.
BTW, it is nice that the network can give an explanation about what triggered the decision.
Yes, Preslav, you are right: our confusion matrix shows normalized scores between 0 and 1; however, it can easily be changed via a parameter inside the notebooks we shared.
I updated the paper with the confusion matrix without normalization. The result on the test set is different because I attached the pretrained model due to an issue with the Keras tokenizer.
Preslav, every single question of yours makes sense and I think we have to address them properly! On my part, I will try to rerun the algorithm, try to understand the science behind the combination of hyperparameters, and write to you about the results that I find. Thank you!
Great work guys! I’m really happy to see so many graphics and experiments! It’s really important to visualize the data and experiment (I would say more important than achieving top scores) and you did a great work!
Here are some notes I made while reading the article:
– good analysis and visualisation of the data
– data augmentation is a good idea when not enough data is provided, or when training complex NN models, but 80k snippets seems like a big enough corpus already. I wouldn’t give that a high priority.
– you claim there are differences in the text in the train and test sets? It would be nice to see some graphics about accuracy comparisons on the dev set and test set, or some other form of proof.
– coreference resolution was in Identrics’ case, also some dependency parsing, perhaps you could have used their notebooks 🙂
– I don’t understand this: The first one was using function from R*R -> R that holds h(a,b) != h(b,a) and add this as feature.
– normalizing the company names is a very good idea, especially if you only have 400 companies in all examples
– “Now lets preprocess the unlabeled test set in order to use it as corpus for more words and prepare it for input in the models”. You should be very careful not to transfer some knowledge from the test set in the training phase, even through w2v embeddings.
– Your understanding that there are examples in the training data which don’t hold information about the relation between the two mentioned companies, and are yet in the training set, is a serious problem (if the task is to detect relations on sentence level). Also, kudos for finding this! Concatenating the examples to solve the business problem is one option, yes. Also, you could try to handle the problem on its own; I would suggest using different training sets (from the web), clustering of the training examples or any other analysis which would actually clean up the training data. If this is also valid for the test set, it would be very hard to evaluate any model, not knowing which of the test examples actually hold information about the parent-subsidiary relation.
– “It is to be noted that the number of text snippets corresponding to each pair in the training data varied largely from some companies like Google and YouTube having approximately 4000 snippets to smaller companies having 2 or 3 snippets. Such a huge variance created big troubles in the test data which will be explained later.”
– It looks like you introduced this problem yourself by concatenating all training examples for company pairs in single documents 🙂
– great set of useful experiments and results in the linked notebooks
– Also, I agree with Tony about the abstract, keep it simple and let Ontotext sell their case to the audience 🙂
I will try to explain what I meant by “The first one was using function from R*R -> R that holds h(a,b) != h(b,a) and add this as feature”. The idea was to create a function that converts an ordered pair of labels into a single value and use it as a feature, because some of the classifiers we tried create more features based on combinations of the initial ones, and thus I decided that those would only create noise.
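For illustration, one such non-symmetric function is the Cantor pairing function (a possible choice, not necessarily the exact one in our notebook):

```python
def pair_feature(a, b):
    """Map an ordered pair of non-negative label ids to a single integer.
    Cantor pairing is not symmetric: pair_feature(a, b) != pair_feature(b, a) when a != b."""
    return (a + b) * (a + b + 1) // 2 + b
```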
Hello Yasen!
Regarding your question on the train and test data differences we noted:
Regarding the RNN-Attention model: initially I trained the model and the best validation score was 93% accuracy, but when I used it on the test set the predictions were so terrible that even a layman would ignore them. Later I realized that the problem was that in the training set each pair had an average of 30-40 text snippets, whereas in the test set it was 1-2. Hence the validation accuracy was not translating. Instead, I reduced the number of articles in the training set to fewer than 10 per pair and retrained the model; it scored a validation accuracy of 83.2%, and the top 20 pairs you see in the article are from this model, and the results look plausible. What I am essentially trying to say is that, unlike other datasets, this particular dataset makes it tricky to report the actual validation scores. Yes, by concatenating the text snippets we introduced the problem of discrepancy ourselves, but we were trying to be creative and thought we would see if it works. BTW, I just realized that you are Laura 😛 Thanks a lot for your mentorship!
Greetings 😛
You may access our presentation on the Ontotext case here – https://prezi.com/view/zY5oEyTGXow7M8PGN0xs/