A simple solution to the problem of company name deduplication. No machine learning, just data prep.
Overview of the data flow.
Data => Parser => TTM => TF-IDF => Model => Document Clusters
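The flow above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the original implementation: the tokenizer, the smoothed IDF formula, the greedy threshold clustering standing in for the "Model" step, and the names `parse`, `tfidf_vectors`, `cosine` and `cluster` are all hypothetical. The TTM step is omitted because the text does not describe it.

```python
import math
import re
from collections import Counter


def parse(text):
    """Parser step (assumed): lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [parse(d) for d in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = {
            t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cluster(docs, threshold=0.3):
    """Greedy clustering (assumed stand-in for the model): a document joins
    the first cluster whose seed document is similar enough, otherwise it
    starts a new cluster. Returns lists of document indices."""
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Calling `cluster(["apple banana", "apple banana", "car truck"])` groups the two identical documents and leaves the third on its own. The threshold of 0.3 is an arbitrary starting point and would need tuning on real data.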
Although the data given to us contains several snippets for each parent-subsidiary pair, only some of those snippets reveal an actual parent-subsidiary relationship. We therefore felt that concatenating the snippets for each pair into a single article before training could give the model more information about which snippet actually reveals the relationship. A bidirectional GRU encodes each sentence into a sentence vector, and two attention networks then identify the important words in each sentence and the important sentences in each document. In addition to returning the probability that company 2 is a subsidiary of company 1, the model also returns the important sentences that triggered its prediction. For instance, when it says Orcale Corp is the parent of Microsys, it can also report that the following sentence triggered its prediction:
"Orcale Corp’s Microsys customer support portal was seen communicating with a server known to be used by the carbanak gang."
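To make the sentence-level attention step concrete, here is a minimal sketch in plain Python. It assumes the sentence vectors have already been produced by the encoder. The scoring function (a dot product between a context vector and `tanh` of each sentence vector), the context vector itself, and the names `attention_pool` and `top_sentence` are illustrative assumptions; in the described model these would be learned parameters of a trained network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Score each vector against a context vector (assumed to be learned),
    then return the attention-weighted sum and the attention weights."""
    scores = [
        sum(c * math.tanh(x) for c, x in zip(context, vec))
        for vec in vectors
    ]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(dim)
    ]
    return pooled, weights


def top_sentence(sentences, sentence_vectors, context):
    """Return the sentence with the highest attention weight and that weight."""
    _, weights = attention_pool(sentence_vectors, context)
    best = max(range(len(sentences)), key=weights.__getitem__)
    return sentences[best], weights[best]
```

The same pooling function could be applied one level down, over word vectors within a sentence, which is how the two attention networks would stack. The pooled document vector would then feed a classifier that outputs the parent-subsidiary probability.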
This paper presents a DNN-based approach to learning entity relations from distantly labeled free text. The proposed approach uses task-specific data cleaning which, while effective at removing textual noise, preserves the semantics necessary for training. The cleaned-up dataset is then used to build a number of attention-based bLSTM DNN models, with hyperparameters tuned using recall as the optimization objective. The resulting models are then joined into an ensemble that delivers our best result.
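As a rough illustration of the last two steps, the sketch below averages the predicted probabilities of several models and scores the result by recall. The abstract does not say how the models are combined, so simple probability averaging, the 0.5 decision threshold, and the names `ensemble_predict` and `recall` are assumptions made for this example only.

```python
def ensemble_predict(model_probs, threshold=0.5):
    """Average per-example probabilities across models (assumed combination
    rule) and convert them to 0/1 labels using the given threshold."""
    n_models = len(model_probs)
    averaged = [sum(col) / n_models for col in zip(*model_probs)]
    return [1 if p >= threshold else 0 for p in averaged]


def recall(y_true, y_pred):
    """Recall = true positives / all actual positives."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0


# Usage with made-up probabilities from two hypothetical models.
model_probs = [
    [0.9, 0.2, 0.6],  # model A
    [0.7, 0.4, 0.3],  # model B
]
labels = [1, 0, 1]
predictions = ensemble_predict(model_probs)
score = recall(labels, predictions)
```

Recall on its own is maximised trivially by predicting every pair as positive, so in practice it would be tracked together with precision or under a fixed decision threshold.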
A brief beginner’s guide for Natural Language Processing (NLP)