The Attentive Recurrent Neural Network
Data Understanding
Although the data given to us has several snippets for each parent-subsidiary pair, only some of the snippets reveal an actual parent-subsidiary relationship. We therefore concatenated the snippets for each pair into a single article before training, so that the model can learn which text snippets actually reveal the relationship (see Data Preparation, step 1). As a result, the 79383 relationship rows in the training data are reduced to 952 relationship rows.
In each text snippet, the context around the company names carries the information about the directed parent-subsidiary relationship, irrespective of the real names of the companies involved. Hence, in each text snippet we replaced the parent company and the subsidiary company with the aliases ‘company1’ and ‘company2’.
Note that the number of text snippets per pair in the training data varies widely: pairs such as Google and YouTube have approximately 4000 snippets, while smaller companies have only 2 or 3. This large variance caused significant problems on the test data, which will be explained later.
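The aliasing and grouping described above can be sketched as follows. This is a minimal illustration, not the original preprocessing code: the function names and the `(parent, subsidiary, snippet)` row layout are assumptions. The longer name is replaced first so that a name contained in the other one (for example a parent whose name is a prefix of its subsidiary's name) is not aliased incorrectly.

```python
from collections import defaultdict

def alias_snippet(snippet, parent, subsidiary):
    """Replace the two company names with the neutral aliases (hypothetical helper)."""
    replacements = [(parent, "company1"), (subsidiary, "company2")]
    # Replace the longer name first in case one name contains the other.
    replacements.sort(key=lambda pair: len(pair[0]), reverse=True)
    for name, alias in replacements:
        snippet = snippet.replace(name, alias)
    return snippet

def group_by_pair(rows):
    """Collect the aliased snippets into one document per (parent, subsidiary) pair."""
    documents = defaultdict(list)
    for parent, subsidiary, snippet in rows:
        documents[(parent, subsidiary)].append(
            alias_snippet(snippet, parent, subsidiary)
        )
    return dict(documents)

# Assumed row layout: (parent, subsidiary, snippet)
rows = [
    ("Google", "YouTube", "Google acquired YouTube"),
    ("Google", "YouTube", "Google DSS hackathon to watch its YouTube live"),
]
documents = group_by_pair(rows)
```

Here `documents` holds a single entry for the Google/YouTube pair whose value is the list of aliased snippets, which is how many snippet rows collapse into far fewer pair rows.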
Data Preparation
- Each snippet corresponding to a parent-subsidiary relationship is considered a single sentence (although it may contain more than one grammatical sentence), and all the snippets (sentences) corresponding to a pair are considered a document. The problem is thereby transformed into a document classification problem.
- As RNNs in TensorFlow need sequences of equal length, the sequences are padded or truncated to a constant length.
- Word vectors are trained with gensim on the sentences of the training corpus.
- We use 750 documents (pairs) for training and 200 for validation.
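The padding/truncation step can be sketched like this. It is an illustrative version under stated assumptions: words are already mapped to integer ids, id 0 is reserved for padding, and the limits `max_sentences` and `max_words` are hypothetical parameters (the values actually used are not given here).

```python
def pad_sequence(tokens, length, pad_id=0):
    """Truncate or right-pad a list of word ids to exactly `length` items."""
    return (list(tokens) + [pad_id] * length)[:length]

def pad_document(sentences, max_sentences, max_words, pad_id=0):
    """Bring a document (list of sentences) to a fixed max_sentences x max_words shape."""
    padded = [pad_sequence(s, max_words, pad_id) for s in sentences[:max_sentences]]
    # Add all-padding sentences if the document has too few snippets.
    while len(padded) < max_sentences:
        padded.append([pad_id] * max_words)
    return padded

doc = [[4, 9, 2], [7, 1, 3, 8, 5, 6]]  # two snippets of word ids
fixed = pad_document(doc, max_sentences=3, max_words=4)
# fixed == [[4, 9, 2, 0], [7, 1, 3, 8], [0, 0, 0, 0]]
```

With pairs ranging from 2 or 3 snippets to several thousand, the choice of `max_sentences` decides how much of the large documents is discarded and how much of the small ones is pure padding.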
Modelling
- Each word in a sentence is represented by its word vector, and the word vectors of a sentence are fed in order, as a sequence, to the bidirectional GRU.
- The GRU outputs a vector at each timestep (in our case, for each word).
- An attention network then learns weights to form a linear combination of the GRU outputs. The output of the attention network is a sentence vector, which represents the sentence.
- All the sentence vectors of a document are then combined by a second attention network to form a document vector. Note that there is no RNN at the second level, because the order of the sentences does not influence the parent-subsidiary relationship.
Consider the two snippets ‘Google acquired Youtube’ and ‘Google DSS hackathon to watch its YouTube live’. For this particular task, the model should learn that the first sentence is more important than the second, so a thicker arrow (meaning more weight) has to be given to the first sentence. Within the first sentence, the word ‘acquired’ is more important than the other words.
The objective of the neural network is to model this.
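The two attention levels can be illustrated with a small pure-Python sketch. This is a simplified stand-in, not the trained network: it assumes the attention score of each vector is its dot product with a context vector, followed by a softmax, and the toy GRU outputs and context vectors below are invented for illustration (in the real model they would be learned).

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, context):
    """Return the attention-weighted sum of `vectors` and the weights used."""
    scores = [sum(x * c for x, c in zip(vec, context)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

# Toy per-word GRU outputs (2-dimensional for readability; values are made up).
sentence_a = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]  # e.g. company1 / acquired / company2
sentence_b = [[0.1, 0.1], [0.0, 0.2]]
word_context = [1.0, 1.0]      # hypothetical word-level context vector
sentence_context = [1.0, 0.5]  # hypothetical sentence-level context vector

# Level 1: words -> sentence vectors.
sentence_vectors, word_weights = [], []
for sentence in (sentence_a, sentence_b):
    vector, weights = attention_pool(sentence, word_context)
    sentence_vectors.append(vector)
    word_weights.append(weights)

# Level 2: sentence vectors -> document vector (no RNN, order is ignored).
document_vector, sentence_weights = attention_pool(sentence_vectors, sentence_context)
```

In this toy setup the second word of the first sentence and the first sentence itself receive the largest weights, mirroring the ‘acquired’ example. The document vector would then be passed to a classifier that outputs the relationship probability.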
Evaluation
The model takes all the articles corresponding to two companies as input and outputs the probability that company2 is a subsidiary of company1, based on the text snippets.
In addition, the model also returns the important sentences that triggered its prediction. For example, in a sample output from the test set,
we can see that the model picked out the important sentence amidst several irrelevant sentences.
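One plausible way to surface the important sentences is to rank the snippets of a document by their sentence-level attention weights. The sketch below assumes those weights are available as a list aligned with the snippets; the function name and the example weights are hypothetical.

```python
def top_sentences(snippets, weights, k=1):
    """Return the k snippets with the largest attention weights, highest first."""
    ranked = sorted(zip(weights, snippets), key=lambda pair: pair[0], reverse=True)
    return [(snippet, weight) for weight, snippet in ranked[:k]]

snippets = [
    "company1 DSS hackathon to watch its company2 live",
    "company1 acquired company2",
]
weights = [0.2, 0.8]  # made-up sentence-level attention weights
best = top_sentences(snippets, weights, k=1)
```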
Results
The predicted parent-subsidiary pairs are:
    company1                         company2
0   Danaher_Corporation              Pall_Corporation
1   Alibaba_Group                    AutoNavi
2   Berkshire_Hathaway               NV_Energy
3   UniCredit                        HypoVereinsbank
4   Oracle_Corporation               MICROS_Systems
5   Boeing                           Liquid_Robotics
6   RightNow_Technologies            Oracle_Corporation
7   Walmart                          Walmart_de_México_y_Centroamérica
8   Boeing                           Pfizer
9   CareFusion                       Becton_Dickinson
10  Unocal_Corporation               Chevron_Corporation
11  Marathon_Oil                     U.S._Steel
12  Citigroup                        Boeing
13  Citigroup                        Bank_Handlowy
14  Comcast                          Paramount_Pictures
15  Morgan_Stanley                   AT&T
16  Disney_Vacation_Club             The_Walt_Disney_Company
17  Tencent                          HBO
18  AT&T                             Fullscreen_(company)
19  Singapore_Airlines               Boeing
20  Alere                            Abbott_Laboratories
21  AT&T                             CBS_Interactive
22  Chesapeake_Energy                Williams_Companies
23  Ingersoll_Rand                   Trane
24  Simmons_&_Company_International  Piper_Jaffray
25  The_Walt_Disney_Company          General_Motors
26  The_Walt_Disney_Company          The_Walt_Disney_Company_Italy
27  Apple_Inc.                       Paramount_Pictures
28  Amazon.com                       Disneyland_Resort
29  Medivation                       Allergan
30  Apple_Inc.                       Allergan
31  Apple_Inc.                       Disneyland_Resort
32  Camping_World                    Gander_Mountain
33  Berkshire_Hathaway               General_Motors
34  Walmart                          HBO
35  Textron                          Boeing
36  EnerNOC                          Enel
37  PayPal                           Boeing
38  Boeing                           Standard_&_Poor's
39  American_Capital                 Ares_Management
40  Centene_Corporation              Health_Net
41  Ford_Motor_Company               HBO
42  ExxonMobil                       Standard_&_Poor's
43  Teva_Pharmaceutical_Industries   Cephalon
44  General_Electric                 Standard_&_Poor's
45  Rosneft                          Yukos
46  Axtel                            ALFA_(Mexico)
47  British_American_Tobacco         Reynolds_American