Datathon 2018- Receipt Bank Solution

3 thoughts on “Datathon 2018- Receipt Bank Solution”

  1. Good job. I really like most of the work you’ve done, especially considering the somewhat difficult dataset and task. Here are a few remarks that I hope are constructive:

    * Why are you dropping specific words like ‘insertsomenumber’, ‘ce’, ‘cid’, ‘skype’, ‘www’, ‘com’? If these are very common words that you want removed, I think a better approach would be to tune the `max_df` parameter of TfidfVectorizer, which drops terms that appear in more than a given share of the documents.
    * You fit the tfidf on the entire dataset and only then split it for cross-validation. This is not very clean, because the tfidf vectorizer learns IDF weights from the data (unlike a hashing vectorizer, for example, which is stateless). By fitting first and splitting afterwards, the vectorizer sees part of the test data as well, so your results are probably slightly optimistic.
    * I guess this wasn’t included in the submission template, but I would have liked to see a `Future Ideas` section or something similar.
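
To make the two vectorizer remarks concrete, here is a minimal sketch using scikit-learn. It is not the team's code: the toy `texts` and `labels`, the `max_df=0.8` threshold, and the choice of `LogisticRegression` are all assumptions made for illustration. The idea is to put the vectorizer inside a pipeline so that it is refit on the training part of every fold, and to let `max_df` drop the boilerplate tokens instead of listing them by hand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real documents and labels.
texts = [
    "invoice total due www com",
    "receipt paid cash www com",
    "invoice amount payable www com",
    "receipt card payment www com",
    "invoice net thirty www com",
    "receipt change given www com",
]
labels = [0, 1, 0, 1, 0, 1]

# max_df=0.8 (an assumed value) ignores terms found in more than 80% of the
# documents the vectorizer is fitted on, so ubiquitous tokens are dropped
# without a hand-written removal list.
pipeline = make_pipeline(
    TfidfVectorizer(max_df=0.8),
    LogisticRegression(),
)

# cross_val_score clones and refits the whole pipeline on each training
# split, so the IDF weights never see the held-out documents.
scores = cross_val_score(pipeline, texts, labels, cv=3)

# Fitting once on all texts, only to inspect which tokens max_df removes.
vocabulary = TfidfVectorizer(max_df=0.8).fit(texts).vocabulary_
print(sorted(vocabulary))
print(scores.mean())
```

A suitable `max_df` depends on the corpus, so in practice it would be tuned along with the other hyperparameters, for example with a grid search over the same pipeline.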

    Again, good work. I hope you enjoyed it.
