|For how many years have you been experimenting with data?
Popular articles by andreytagarev
Popular comments by andreytagarev
Really good work and a very detailed description of your efforts. Seems like you got the teamwork aspect down really well because you’ve managed to accomplish an impressive number of tasks over a single weekend.
The existence of so many duplicate entries surprised me but I discussed it with Laura and apparently there really are whole snippets that frequently repeat in articles verbatim when discussing parent-subsidiary companies. Overall the data analysis was good and detailed and it was good to see that you discovered and resolved issues before feeding it to the algorithm.
The task itself can come in two formats in the wild- either with a large corpus of text that needs to be searched for relations or as a streaming platform where each individual snippet is judged as it is received. Your solution is aimed more at the first case but can be applied to the second. I am satisfied that it is a solid algorithm that produces surprisingly good results. Looking through the pairs you’ve extracted from the (true) test set, the F-score you calculated seems to be supported.
My only criticism is that I really would have liked to see something about how the traditional ML approaches fared in comparison to your state-of-the-art algorithm. It’s great to know you tried them but I am still not sure how they rank.
Similarly the idea to generate more training samples with that service is interesting but reading the article, I am not sure if it got anywhere. Did you manage to generate extra training samples? Were they good? Were they used?
A really minor note, but the article mentions the parent_of relation isnt’ transitive. It (sort of) is and I believe you meant it is anti-symmetric (i.e. A is parent_of B means B is NOT parent_of A)
Great job, guys 🙂
Good understanding of the business problem and a good choice for an algorithm that can tackle the task. I would have liked to learn a bit more about the nitty-gritty of the solution from the article- what was the structure of the best performing model, how was data split, what data were recall and accuracy calculated on.
I’m looking forward to see how well your approach did on the test data as the results you’ve gotten are quite promising.