By Asma Afzal ([email protected])
Mentor: Marin Delchev
Platform: Python Jupyter notebook, Scikit-learn
Business Understanding
Receipt Bank manages bookkeeping for businesses. It uses powerful machine learning algorithms to extract useful information from receipts and invoices in many different formats.
Data Understanding
Client data often arrives as a large collection of receipts combined in a single Portable Document Format (PDF) file. To log the data correctly, the individual receipts in this document must be identified so that they can be processed. Our task was to calculate the probability that a given page of the PDF file is the beginning of a receipt.
The labeled data set given to us contained receipts both as text and as scanned images. To carry out this task we chose an NLP approach: supervised learning based on the features embedded in the text of the receipts.
Data Preparation
- Stage 1: Text extraction and labeling of each page
- Go through each page of a PDF file and use the pdfminer.six library to obtain the text from pages whose text is stored as ASCII or Unicode strings.
- Where no text is found, use the textract library to obtain the text via OCR of the page images.
- A page is labeled 1 if it is the beginning of a receipt and 0 otherwise (read from the provided JSON file).
- Extract the raw text from each PDF file and store it in a data frame.
- Stage 2: Feature extraction from the text of each page
- Remove stop words from the text
- Use the tf-idf vectorizer (TfidfVectorizer) from Scikit-learn to extract the important words for every page
- Splitting the dataset
- The final data set is split into training and test sets using Scikit-learn's train_test_split function from model_selection.
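The feature extraction and split described above can be sketched as follows. This is a minimal sketch rather than the notebook code: the toy `pages` and `labels` lists, the 25% test size, the built-in English stop-word list, and the fixed `random_state` are all assumptions made for illustration. The sketch also splits first and fits the vectorizer on the training pages only, which is a choice made here and not necessarily the order used in the notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one string of raw text per PDF page, and a label
# that is 1 when the page begins a receipt and 0 otherwise.
pages = [
    "invoice number 1001 total due 25.00",
    "continued from previous page item list",
    "receipt store 42 thank you for shopping",
    "page 2 of 2 subtotal and tax",
    "invoice number 1002 bill to customer",
    "terms and conditions continued",
    "receipt cash payment change due",
    "appendix signature page",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first (test size and seed are illustrative assumptions).
train_text, test_text, y_train, y_test = train_test_split(
    pages, labels, test_size=0.25, stratify=labels, random_state=0
)

# Remove stop words and compute tf-idf weights. The vocabulary and idf values
# are learned from the training pages; the test pages are only transformed.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns, one per word in the learned vocabulary, so a classifier trained on `X_train` can be applied directly to `X_test`.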
Modeling
- Classification using a neural network
- We used the Multi-Layer Perceptron (MLP) classifier, MLPClassifier, from Scikit-learn in Python. We chose a 3-layer neural network with the same number of neurons as there are features in our data set. The model was fit on the training set, and the test set was used to predict the probability that each page is the beginning of a receipt.
- Classification using a random forest
- As with the MLP classifier, we trained a random forest classifier on the training set, using 100 estimators and the usual settings for the remaining parameters.
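The two models above might be set up roughly as follows. This is a sketch under stated assumptions: synthetic data from `make_classification` stands in for the tf-idf features, "3 layer" is read here as three hidden layers that are each as wide as the feature count (one possible reading of the description), and `max_iter` and `random_state` are illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the tf-idf feature matrix (assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

n_features = X_train.shape[1]

models = {
    # Assumed reading: three hidden layers, each with n_features neurons.
    "MLP": MLPClassifier(
        hidden_layer_sizes=(n_features,) * 3, max_iter=500, random_state=0
    ),
    # 100 trees, other parameters left at scikit-learn's defaults.
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

probabilities = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Column 1 of predict_proba is the probability of class 1,
    # i.e. "this page begins a receipt" in the setting described above.
    probabilities[name] = model.predict_proba(X_test)[:, 1]
    print(name, probabilities[name][:5].round(3))
```

Keeping the per-page probabilities, rather than hard 0/1 predictions, is what makes a log-loss evaluation possible afterwards.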
Evaluation
- Precision, recall and F1 score
- Both classification models achieve precision and recall above 90%, with the random forest outperforming the MLP classifier for the given number of layers and hidden nodes.
- Log-loss
- The log-loss score ranges between 0.1 and 0.2, a significant improvement over the baseline log-loss of 0.69 (ln 2), which corresponds to predicting both classes as equally likely.
- k-fold cross-validation
- We performed 5-fold cross-validation to study the accuracy of the models.
- Feature importance ranking for random forest classifier
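To make the metrics above concrete, here is a small worked example in plain Python. The labels and probabilities are made-up values, not results from the project, and the helper names are hypothetical. It shows how precision, recall and F1 follow from thresholded predictions, and why the baseline log-loss is about 0.69: predicting 0.5 for every page gives exactly ln 2.

```python
import math


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-likelihood of the true labels (binary case)."""
    total = 0.0
    for t, p in zip(y_true, proba):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Made-up labels and predicted probabilities for eight pages.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
proba = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.95, 0.05]
y_pred = [1 if p >= 0.5 else 0 for p in proba]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)

# Baseline: every page gets probability 0.5, so the loss is ln 2 (about 0.693).
baseline = log_loss(y_true, [0.5] * len(y_true))
print(baseline, log_loss(y_true, proba))
```

Scikit-learn offers equivalents in `sklearn.metrics` (for example `precision_recall_fscore_support` and `log_loss`), which would normally be used instead of hand-written helpers.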
Deployment
We have written a script that takes a folder path as input, goes through all the PDF files in it, processes each one, and runs predictions on its pages using our trained model.
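The overall shape of such a script could look like the sketch below. It is not the submitted script: the function names `extract_page_texts` and `predict_folder` are hypothetical, the model is assumed to be an object whose `predict_proba` accepts a list of raw page texts (for example a saved vectorizer plus classifier pipeline), and the OCR fallback for image-only pages is left out.

```python
from pathlib import Path


def extract_page_texts(pdf_path):
    """Return one text string per page using pdfminer.six (assumed installed).

    Pages that contain only scanned images yield an empty string here;
    an OCR fallback would be needed for those and is omitted in this sketch.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    texts = []
    for page_layout in extract_pages(str(pdf_path)):
        parts = [
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        texts.append("".join(parts))
    return texts


def predict_folder(folder, model, extractor=extract_page_texts):
    """Map each PDF file name in `folder` to its per-page probabilities.

    `model` is assumed to expose predict_proba(list_of_texts) and to return
    one [p(class 0), p(class 1)] row per page.
    """
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        texts = extractor(pdf_path)
        if not texts:
            results[pdf_path.name] = []
            continue
        rows = model.predict_proba(texts)
        # Keep the probability that each page begins a receipt.
        results[pdf_path.name] = [float(row[1]) for row in rows]
    return results
```

Passing the extractor in as an argument keeps the folder loop independent of the PDF library, so it can be exercised with a stub extractor and a stub model before real files are involved.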
Jupyter Notebooks
There are 3 notebooks.
- Extract Text from PDF file
- Extract Features and Train Model
- Deployment Using Saved Model
3 thoughts on “Datathon 2018- Receipt Bank Solution”
Good job. I really like most of the work you’ve done especially when considering the somewhat difficult dataset and task. Here are a few, I hope constructive remarks:
* Why are you dropping specific words like 'insertsomenumber', 'ce', 'cid', 'skype', 'www', 'com'? If these are very common words that you want removed, I think a better approach would be to tweak the `max_df` parameter of TfidfVectorizer.
* You fit the tfidf on the entire dataset and only then split to do the cross-validation. This is not very clean, as the tfidf actually learns things about the data (unlike a count/hashing vectorizer, for example), and by fitting and then splitting you are probably getting slightly better results because the tfidf vectoriser is seeing part of the test data as well.
* I guess this wasn’t included in the submission template but I would have liked to see `Future Ideas` or something like this.
Again, good work. I hope you enjoyed it.
Thank you. I couldn’t spend much time on refining Tfidf implementation but I would definitely like to explore it further.
Very well done! Clean work and clean presentation!