Popular comments by yasen
Great work, guys! I’m really happy to see so many graphics and experiments! It’s really important to visualize the data and experiment (I would say more important than achieving top scores), and you did great work!
Here are some notes I made while reading the article:
– good analysis and visualisation of the data
– data augmentation is a good idea when not enough data is provided, or when training complex NN models, but 80k snippets seems like a big enough corpus already. I wouldn’t give that a high priority.
– you claim there are differences in the text in the train and test sets? It would be nice to see some graphics about accuracy comparisons on the dev set and test set, or some other form of proof.
– coreference resolution was covered in Identrics’ case, along with some dependency parsing; perhaps you could have used their notebooks 🙂
– I don’t understand this: “The first one was using function from R*R -> R that holds h(a,b) != h(b,a) and add this as feature.”
– normalizing the company names is a very good idea, especially if you only have 400 companies in all examples
– “Now lets preprocess the unlabeled test set in order to use it as corpus for more words and prepare it for input in the models”. You should be very careful not to transfer some knowledge from the test set in the training phase, even through w2v embeddings.
– Your observation that some training examples hold no information about the relation between the two mentioned companies, yet are still in the training set, points to a serious problem (if the task is to detect relations at sentence level). Also, kudos for finding this! Concatenating the examples to solve the business problem is one option, yes. You could also try to handle the problem directly: I would suggest using different training sets (from the web), clustering the training examples, or any other analysis that would actually clean up the training data. If this also holds for the test set, it would be very hard to evaluate any model without knowing which of the test examples actually hold information about the parent-subsidiary relation.
– “It is to be noted that the number of text snippets corresponding to each pair in the training data varied largely from some companies like Google and YouTube having approximately 4000 snippets to smaller companies having 2 or 3 snippets. Such a huge variance created big troubles in the test data which will be explained later.”
– It looks like you introduced this problem yourself by concatenating all training examples for each company pair into a single document 🙂
– great set of useful experiments and results in the linked notebooks
– Also, I agree with Tony about the abstract, keep it simple and let Ontotext sell their case to the audience 🙂
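To make the note about the test set concrete, here is a minimal sketch of my own (not your pipeline): anything that gets fitted, a vocabulary here or w2v embeddings in your case, sees only the training texts, and test tokens unseen in training are mapped to a placeholder. The names `UNK`, `build_vocab` and `encode`, as well as the toy sentences, are assumptions made for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for tokens never seen in training


def build_vocab(train_texts, min_count=1):
    """Build the vocabulary from the training texts only."""
    counts = Counter(tok for text in train_texts for tok in text.lower().split())
    return {tok for tok, n in counts.items() if n >= min_count}


def encode(text, vocab):
    """Replace every token outside the training vocabulary with UNK."""
    return [tok if tok in vocab else UNK for tok in text.lower().split()]


# Toy data, invented for this example.
train_texts = ["Google acquired YouTube", "YouTube is owned by Google"]
test_texts = ["Facebook acquired Instagram"]

vocab = build_vocab(train_texts)  # the test texts are never passed in here
encoded_test = [encode(t, vocab) for t in test_texts]
```

The same discipline would apply to embeddings: train them on the training corpus (or on an external corpus), then look up the test tokens afterwards.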
Great work! I agree with the mentors above 🙂 You built a great model and achieved a very high score! Great productivity for a group of 3 people. I would also like to see more graphics, scores, etc.
Here are some notes I made reading your article:
– the dataset is not that biased; negative examples do not outnumber positive ones by orders of magnitude
– normalizing the company names is a very good idea
– stopwords may bring value in some cases, always test and verify if removing them actually helps
– A_team noted that in many examples the text doesn’t hold enough information about the relation between the two companies (producing erroneous examples). Did you observe that and did you try to handle it?
– If the results are on the test set, great results! Very good application of neural networks.
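On the name normalization note above, a minimal sketch of what I have in mind, assuming that variants differ only in case, punctuation and trailing legal suffixes. The suffix list and the function name `normalize_company` are hypothetical and would need to be adapted to the actual company list.

```python
import re

# Hypothetical list of legal suffixes; extend it for the real data.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "co", "plc"}


def normalize_company(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Keep at least one token so a name is never reduced to an empty string.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


variants = ["Google Inc.", "google", "GOOGLE, INC"]
canonical = {normalize_company(v) for v in variants}
```

With these three variants, `canonical` collapses to a single key, which is the effect you want before counting snippets per company pair.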
An interesting and original idea to apply clustering for price prediction. It would be nice to elaborate more on the results.
Thank you for the good introduction. Also, the data overview and exploratory analysis look good.
Question: how did you split the data into train / test in terms of time periods? Did you sample random points or hold out a longer period for testing?
It would be nice to compare the reported results to a simple baseline, for a better perspective of the achieved results.
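To illustrate both points, here is a minimal sketch under my own assumptions: the prices form a time-ordered list, the last stretch is held out for testing, and the baseline simply predicts each value as the previous one. The function names and the toy numbers are invented for the example and are not taken from the article.

```python
def chronological_split(series, test_fraction=0.2):
    """Hold out the final part of a time-ordered series for testing."""
    cut = len(series) - int(len(series) * test_fraction)
    return series[:cut], series[cut:]


def persistence_mae(train, test):
    """Mean absolute error of predicting each test value as the one before it.

    Assumes both train and test are non-empty.
    """
    previous = [train[-1]] + test[:-1]
    return sum(abs(p - y) for p, y in zip(previous, test)) / len(test)


# Toy prices, ordered in time (invented for this example).
prices = [10.0, 10.5, 10.2, 10.8, 11.0, 11.4, 11.1, 11.9, 12.0, 12.3]

train, test = chronological_split(prices)
baseline_mae = persistence_mae(train, test)
```

If the reported model does not clearly beat such a baseline on the held-out period, the result is hard to interpret.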
For image similarity it would be nice to present some results with a sample set of images.
It is good that you tried different algorithms. It is, however, not clear how the data was split into train / test; based on the results, it looks like there is data leakage between the train and test sets.
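One quick way to check for this kind of leakage, as a minimal sketch: count the test examples that also appear verbatim in the training set. The row format (an identifier plus a label) and the toy values are assumptions for illustration; near-duplicates or examples sharing the same underlying entity would need a stricter, group-based check.

```python
def leaked_examples(train_rows, test_rows):
    """Return the test rows that also appear verbatim in the training rows."""
    seen = set(train_rows)
    return [row for row in test_rows if row in seen]


# Toy rows, invented for this example.
train_rows = [("item_01", "a"), ("item_02", "b"), ("item_03", "a")]
test_rows = [("item_02", "b"), ("item_04", "c")]

overlap = leaked_examples(train_rows, test_rows)
leak_rate = len(overlap) / len(test_rows)
```

A non-zero `leak_rate` would be a strong hint that the split should be redone before comparing algorithms.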