# Datathon 2018- Receipt Bank Solution

#### 3 thoughts on “Datathon 2018- Receipt Bank Solution”

Good job. I really like most of the work you’ve done, especially considering the somewhat difficult dataset and task. Here are a few, I hope constructive, remarks:

* Why are you dropping specific words like ‘insertsomenumber’, ‘ce’, ‘cid’, ‘skype’, ‘www’, ‘com’? If these are very common words that you want removed, I think a better approach would be to tweak the max_df parameter of TfidfVectorizer, which ignores terms that appear in more than a given share of the documents.
* You fit the tfidf on the entire dataset and only then split for the cross-validation. This is not very clean, as the tfidf actually learns document-frequency statistics from the data (unlike a plain count or hashing vectorizer, whose feature values do not depend on the other documents). By fitting first and splitting afterwards you are probably getting slightly better results than you should, because the tfidf vectorizer has already seen the test folds.
* I guess this wasn’t included in the submission template but I would have liked to see Future Ideas or something like this.
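To make the first two remarks concrete, here is a minimal sketch of what I have in mind. The texts, labels, the 0.9 threshold and the LogisticRegression classifier are all made-up placeholders of mine, not taken from your submission. The idea is only that max_df prunes the ubiquitous tokens, and that putting the vectorizer inside a pipeline makes cross_val_score refit it on the training part of every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); swap in the real documents and labels.
texts = [
    "invoice total amount due www com",
    "invoice payment received www com",
    "invoice overdue amount reminder www com",
    "invoice tax amount payable www com",
    "receipt coffee shop purchase www com",
    "receipt grocery store purchase www com",
    "receipt taxi ride fare www com",
    "receipt restaurant dinner bill www com",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# max_df=0.9 ignores terms found in more than 90% of the documents,
# so 'www' and 'com' disappear without a hand-written drop list.
vectorizer = TfidfVectorizer(max_df=0.9)
vectorizer.fit(texts)
vocabulary = set(vectorizer.vocabulary_)

# With the vectorizer inside the pipeline, each cross-validation split
# fits the tf-idf on the training fold only; the held-out fold is
# merely transformed, so no test statistics leak into the features.
pipeline = make_pipeline(
    TfidfVectorizer(max_df=0.9),
    LogisticRegression(),  # placeholder classifier, an assumption of mine
)
scores = cross_val_score(pipeline, texts, labels, cv=2)
print(scores.mean())
```

The right max_df value depends on the corpus, so I would treat it as one more hyperparameter to tune rather than take 0.9 as a recommendation.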

Again, good work. I hope you enjoyed it.

