Team solutions

case_onto_text_team_a_vicky

Although the data given to us has several snippets for each parent-subsidiary pair, only some of the snippets reveal an actual parent-subsidiary relationship. We therefore felt that concatenating the snippets for each pair into one single article and then training would give the model more information about which snippet actually reveals the relationship. A bidirectional GRU models each sentence as a sentence vector, and two attention networks then try to figure out the important words in each sentence and the important sentences in each document. In addition to returning the probability that company 2 is a subsidiary of company 1, the model also returns the important sentences that triggered its prediction. For instance, when it says Oracle Corp is the parent of Microsys, it can also report that ‘Oracle Corp’s Microsys customer support portal was seen communicating with a server known to be used by the carbanak gang’ is the sentence that triggered its prediction.


The Attentive Recurrent Neural Network

Data Understanding

Although the data given to us has several snippets for each parent-subsidiary pair, only some of the snippets reveal an actual parent-subsidiary relationship. We therefore felt that concatenating the snippets for each pair into one single article and then training would give the model more information about which snippet actually reveals the relationship. (Please check Data Preparation step 1 for more info.) As a result, the 79383 relationship rows in the training data reduce to 952 rows.

In each text snippet, it is the context around the company names, not the real names themselves, that carries information about the directed parent-subsidiary relationship. Hence in each snippet we replaced the parent company and the subsidiary company with the aliases ‘company1’ and ‘company2’.
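The alias replacement can be sketched as follows. This is only an illustration under assumptions: the function name `mask_companies` is hypothetical, and case-insensitive exact-string matching is a guess at how the names were located in the text.

```python
import re

def mask_companies(snippet, parent, subsidiary):
    """Replace the two company names with role aliases (case-insensitive)."""
    pairs = [(parent, "company1"), (subsidiary, "company2")]
    # Replace the longer name first, so that a name contained in the other
    # (e.g. a parent name that is a prefix of the subsidiary name) is handled.
    pairs.sort(key=lambda p: len(p[0]), reverse=True)
    for name, alias in pairs:
        snippet = re.sub(re.escape(name), alias, snippet, flags=re.IGNORECASE)
    return snippet

masked = mask_companies("Google acquired Youtube", "Google", "YouTube")
print(masked)
```

After masking, the model only sees the roles and the surrounding context, so it cannot simply memorise which real company names tend to be parents.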

Note that the number of snippets per pair in the training data varied widely, from pairs like Google and YouTube with approximately 4000 snippets to smaller companies with only 2 or 3 snippets. This huge variance caused big trouble on the test data, which will be explained later.

Data Preparation

  1. Each snippet corresponding to a parent-subsidiary relationship is considered a single sentence (although grammatically it may contain more than one sentence), and all the snippets (sentences) corresponding to a pair are considered a document. The problem is thereby transformed into a document classification problem.
  2. As RNNs in TensorFlow need sequences of equal length, the sequences are padded or truncated to a constant length.
  3. Word vectors are trained with gensim on the sentences of the training corpus.
  4. We use 750 documents (pairs) for training and 200 for validation.
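Steps 1 and 2 can be sketched in plain Python as below. This is an illustration only: the helper names, the `<pad>` token, whitespace tokenisation, and the tiny lengths are assumptions, not the values used in our model.

```python
from collections import defaultdict

def group_snippets(rows):
    """Step 1: collect all snippets of a (parent, subsidiary) pair into one document."""
    docs = defaultdict(list)
    for parent, subsidiary, snippet in rows:
        docs[(parent, subsidiary)].append(snippet.split())  # naive whitespace tokens
    return dict(docs)

def pad_or_truncate(seq, length, pad):
    """Step 2: force a sequence to exactly `length` items."""
    return (seq + [pad] * length)[:length]

def to_fixed_shape(doc, max_sents, max_words, pad="<pad>"):
    """Pad/truncate every sentence to max_words and the document to max_sents."""
    sents = [pad_or_truncate(s, max_words, pad) for s in doc]
    empty_sentence = [pad] * max_words
    return pad_or_truncate(sents, max_sents, empty_sentence)

rows = [
    ("Google", "YouTube", "company1 acquired company2"),
    ("Google", "YouTube", "company1 streams on company2 live today"),
    ("A", "B", "company1 owns company2"),
]
docs = group_snippets(rows)
fixed = to_fixed_shape(docs[("Google", "YouTube")], max_sents=3, max_words=4)
for sentence in fixed:
    print(sentence)
```

Every document ends up with the same (sentences x words) shape, which is what allows pairs with very different snippet counts to be batched together.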

Modelling

  1. Each word in a sentence is represented by its word vector, and the word vectors of a sentence are fed in order as the input sequence to the bidirectional GRU.
  2. The GRU outputs a vector at each timestep (in our case, for each word).
  3. An attention network then learns the weights for a linear combination of the GRU outputs. The output of this attention network is a sentence vector that represents the sentence.
  4. All the sentence vectors of a document are then combined by a second attention network to form a document vector. Note that there is no RNN at the second level, because the order of the sentences doesn’t influence the parent-subsidiary relationship.
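The two attention levels in steps 3 and 4 can be sketched in plain Python (not TensorFlow) as below. The scoring rule used here, a dot product between the tanh of each vector and a context vector followed by a softmax, is an assumed simplification and not necessarily the exact form in our network. In a trained model the context vectors are learned; here they are set by hand, and the numbers standing in for GRU outputs are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, context):
    """Return (weighted sum of vectors, attention weights)."""
    # Score each vector against the context vector (assumed scoring rule).
    scores = [sum(math.tanh(h) * u for h, u in zip(vec, context)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights

# Hypothetical GRU outputs: one document, two sentences, 2-dimensional vectors.
doc = [
    [[0.1, 0.2], [2.0, 1.5], [0.1, 0.3]],  # e.g. 'company1 acquired company2'
    [[0.2, 0.1], [0.1, 0.0]],              # an irrelevant snippet
]
word_context = [1.0, 0.0]  # learned in a real model; fixed here
sent_context = [0.0, 1.0]  # learned in a real model; fixed here

sent_vecs, word_weights = [], []
for sentence in doc:
    vec, w = attention_pool(sentence, word_context)  # word-level attention
    sent_vecs.append(vec)
    word_weights.append(w)

# Sentence-level attention: no RNN here, so sentence order does not matter.
doc_vec, sent_weights = attention_pool(sent_vecs, sent_context)
print(word_weights[0], sent_weights)
```

The same pooling function serves both levels; only the inputs differ (word-level GRU outputs versus sentence vectors). The sentence-level weights are what the model can later report as the importance of each snippet.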

 

Take the two snippets ‘Google acquired Youtube’ and ‘Google DSS hackathon to watch its YouTube live’. For this particular task, the model should learn that the first sentence is more important than the second, so a thicker arrow (meaning more weight) has to be given to the first sentence. Within the first sentence, the word ‘acquired’ is more important than the other words.

The objective of the neural network is to model this.

Evaluation

The model is capable of taking all the articles corresponding to two companies as input, and it outputs the probability that company 2 is a subsidiary of company 1 based on the text snippets.

In addition, the model also returns the important sentences that triggered its prediction. For example, here is a sample output from the test set:

We can see that the model picked out the important sentence from among several irrelevant sentences.
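Returning the triggering sentences amounts to ranking the snippets of a pair by their sentence-level attention weight. A minimal sketch, assuming the weights are already available; the function name `top_sentences` is hypothetical and the weights below are made up for illustration, not model output.

```python
def top_sentences(snippets, weights, k=1):
    """Return the k snippets with the highest attention weight."""
    ranked = sorted(zip(weights, snippets), key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in ranked[:k]]

snippets = [
    "Google DSS hackathon to watch its YouTube live",
    "Google acquired Youtube",
]
weights = [0.1, 0.9]  # made-up sentence-level attention weights
top = top_sentences(snippets, weights)
print(top)
```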

Results

The predicted parent-subsidiary pairs are:

                           company1                           company2
0               Danaher_Corporation                   Pall_Corporation
1                     Alibaba_Group                           AutoNavi
2                Berkshire_Hathaway                          NV_Energy
3                         UniCredit                    HypoVereinsbank
4                Oracle_Corporation                     MICROS_Systems
5                            Boeing                    Liquid_Robotics
6             RightNow_Technologies                 Oracle_Corporation
7                           Walmart  Walmart_de_México_y_Centroamérica
8                            Boeing                             Pfizer
9                        CareFusion                   Becton_Dickinson
10               Unocal_Corporation                Chevron_Corporation
11                     Marathon_Oil                         U.S._Steel
12                        Citigroup                             Boeing
13                        Citigroup                      Bank_Handlowy
14                          Comcast                 Paramount_Pictures
15                   Morgan_Stanley                               AT&T
16             Disney_Vacation_Club            The_Walt_Disney_Company
17                          Tencent                                HBO
18                             AT&T               Fullscreen_(company)
19               Singapore_Airlines                             Boeing
20                            Alere                Abbott_Laboratories
21                             AT&T                    CBS_Interactive
22                Chesapeake_Energy                 Williams_Companies
23                   Ingersoll_Rand                              Trane
24  Simmons_&_Company_International                      Piper_Jaffray
25          The_Walt_Disney_Company                     General_Motors
26          The_Walt_Disney_Company      The_Walt_Disney_Company_Italy
27                       Apple_Inc.                 Paramount_Pictures
28                       Amazon.com                  Disneyland_Resort
29                       Medivation                           Allergan
30                       Apple_Inc.                           Allergan
31                       Apple_Inc.                  Disneyland_Resort
32                    Camping_World                    Gander_Mountain
33               Berkshire_Hathaway                     General_Motors
34                          Walmart                                HBO
35                          Textron                             Boeing
36                          EnerNOC                               Enel
37                           PayPal                             Boeing
38                           Boeing                  Standard_&_Poor's
39                 American_Capital                    Ares_Management
40              Centene_Corporation                         Health_Net
41               Ford_Motor_Company                                HBO
42                       ExxonMobil                  Standard_&_Poor's
43   Teva_Pharmaceutical_Industries                           Cephalon
44                 General_Electric                  Standard_&_Poor's
45                          Rosneft                              Yukos
46                            Axtel                      ALFA_(Mexico)
47         British_American_Tobacco                  Reynolds_American
