By Asma Afzal
Mentor: Marin Delchev
Platform: Python Jupyter notebook, Scikit-learn
Receipt Bank manages bookkeeping for businesses. It uses machine learning algorithms to extract useful information from receipts and invoices in many different formats.
Much of the client data arrives as a collection of receipts combined in a single Portable Document Format (PDF) file. To log the data correctly, the individual receipts in this file must be identified so that each can be processed separately. Our task was to calculate the probability that a given page of the PDF file is the beginning of a receipt.
The labeled data set given to us contained receipts both as text and as scanned images. To carry out this task we took an NLP approach: supervised learning based on the features embedded in the text of the receipts.
- Stage 1- Text extraction from each page
- Go through each page of a PDF file and use the pdfminer.six library to obtain text from pages whose text is stored as ASCII or Unicode strings.
- Where no text is found, use the textract library to obtain text via OCR of the page images.
- A page is labeled 1 if it is the beginning of a receipt and 0 otherwise (the labels are read from the provided JSON file).
- Extract the raw text from each PDF file and collect it in a data frame together with the page labels.
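The Stage 1 steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the function names, the `first_pages` argument (a set of page indices that start a receipt, standing in for the JSON labels whose format is not given here) and the record fields are all assumptions. The OCR step is left as a caller-supplied function because the way textract was invoked per page is not stated.

```python
from typing import Callable, Dict, List, Optional, Set


def extract_text_layer(pdf_path: str, page_index: int) -> str:
    """Return the embedded text of one page using pdfminer.six."""
    # Imported lazily so the rest of the sketch works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    # page_numbers takes zero-based page indices.
    return extract_text(pdf_path, page_numbers=[page_index])


def build_page_records(
    pdf_path: str,
    n_pages: int,
    first_pages: Set[int],  # hypothetical label format: indices of pages that start a receipt
    text_layer: Callable[[str, int], str] = extract_text_layer,
    ocr: Optional[Callable[[str, int], str]] = None,  # hypothetical OCR fallback
) -> List[Dict[str, object]]:
    """Build one record per page: file, page index, text and 0/1 label."""
    records = []
    for i in range(n_pages):
        text = text_layer(pdf_path, i).strip()
        if not text and ocr is not None:
            # No text layer on this page: fall back to OCR.
            text = ocr(pdf_path, i).strip()
        records.append(
            {"file": pdf_path, "page": i, "text": text, "label": int(i in first_pages)}
        )
    return records
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(records)` to obtain one row per page.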
- Stage 2- Feature extraction from the text in each page
- Remove stop words from the text
- Use the tf-idf vectorizer from Scikit-learn to weight the important words on every page
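As a small illustration of Stage 2, the sketch below runs Scikit-learn's `TfidfVectorizer` with its built-in English stop-word list on a few made-up page texts. The sample texts are invented, and the use of `stop_words="english"` is an assumption; the original notebook may have removed stop words in a separate step or with a different list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up page texts standing in for the extracted receipt pages.
pages = [
    "Invoice number 1001 for the total of 45.00",
    "item description quantity price continued on the next page",
    "Receipt number 2002 total paid in cash",
]

# stop_words="english" removes common English stop words before weighting.
vectorizer = TfidfVectorizer(stop_words="english")

# Sparse matrix with one row per page and one column per remaining term.
X = vectorizer.fit_transform(pages)

print(X.shape)
print(sorted(vectorizer.vocabulary_))  # the terms that survived stop-word removal
```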
- Splitting dataset
- The final data set is split into training and test sets using the train_test_split function from Scikit-learn's model_selection module.
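A minimal sketch of the split is shown here. The 80/20 ratio, the fixed `random_state` and the `stratify` argument are assumptions made for the example; the source does not state which settings were used.

```python
from sklearn.model_selection import train_test_split

# Toy data: ten page texts with made-up 0/1 labels.
texts = [f"page {i}" for i in range(10)]
labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# stratify keeps the share of "first page" labels similar in both sets (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

print(len(X_train), len(X_test))
```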
- Classification using a neural network
- We used the Multi-Layer Perceptron (MLP) classifier from Scikit-learn in Python. We chose a 3-layer neural network with the same number of neurons as there are features in our data set. The model is fit to the training set, and the test set is then used to predict the probability that each page is the beginning of a receipt.
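The following sketch shows one way such a model could be set up. It uses synthetic data from `make_classification` in place of the tf-idf features, and it reads "3 layer" as three hidden layers of `n_features` neurons each; the source could equally mean a single hidden layer between input and output. `max_iter` and `random_state` are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the tf-idf feature matrix and the 0/1 page labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

n_features = X_train.shape[1]

# Assumed reading: three hidden layers, each as wide as the feature vector.
mlp = MLPClassifier(
    hidden_layer_sizes=(n_features,) * 3, max_iter=500, random_state=0
)
mlp.fit(X_train, y_train)

# Column 1 of predict_proba holds P(label == 1), i.e. "page starts a receipt".
p_start = mlp.predict_proba(X_test)[:, 1]
```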
- Classification using a random forest
- As with the MLP classifier, we trained a random forest classifier on the training set, using 100 estimators and the usual values for the remaining parameters.
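A comparable sketch for the random forest is given below, again on synthetic data rather than the real features. Only `n_estimators=100` comes from the text; "the usual parameters" is interpreted here as Scikit-learn's defaults, which is an assumption, and `random_state` is added only to make the example repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tf-idf feature matrix and the 0/1 page labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 100 trees as stated in the text; all other parameters left at their defaults.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Probability that each test page is the beginning of a receipt.
p_start = rf.predict_proba(X_test)[:, 1]
```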
- Precision, recall and F1 score
- Both classification models achieve over 90% precision and recall, with the random forest outperforming the MLP classifier for the given number of layers and hidden nodes.
- The log-loss score ranges between 0.1 and 0.2, a significant improvement over the baseline log-loss of 0.69 (ln 2) obtained when both classes are predicted as equally likely.
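To make the 0.69 baseline concrete: predicting a probability of 0.5 for every page gives a log-loss of -ln(0.5) = ln 2 ≈ 0.693, whatever the true labels are. The sketch below checks this and computes precision, recall and F1 on a handful of made-up predictions; the labels, probabilities and the 0.5 decision threshold are invented for illustration and are not the project's results.

```python
import math

from sklearn.metrics import log_loss, precision_recall_fscore_support

# Made-up true labels and predicted probabilities of "page starts a receipt".
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
p_model = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7, 0.1, 0.6]

# Threshold the probabilities at 0.5 to obtain hard 0/1 predictions.
y_pred = [int(p >= 0.5) for p in p_model]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

# Baseline: both classes equally likely for every page.
baseline_loss = log_loss(y_true, [0.5] * len(y_true))
model_loss = log_loss(y_true, p_model)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"baseline log-loss={baseline_loss:.3f} (ln 2 = {math.log(2):.3f})")
print(f"model log-loss={model_loss:.3f}")
```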
- k-fold cross-validation
- We performed 5-fold cross-validation to study the accuracy of the models.
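A 5-fold cross-validation of this kind can be sketched with `cross_val_score`. The data is synthetic, and the choice of the random forest and of `scoring="accuracy"` is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the tf-idf feature matrix and the 0/1 page labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# cv=5 trains on four fifths of the data and scores on the remaining fifth, five times.
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")

print(f"accuracy per fold: {scores}")
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```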
- Feature importance ranking for random forest classifier
We have written a script that takes a folder path as input, goes through all the PDF files in that folder, processes each one and uses our trained model to make predictions for its pages.
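The overall shape of such a script might look like the sketch below. It is not the actual deployment notebook: `predict_folder`, `extract_pages` and `predict_proba` are hypothetical names, and the text extraction and the trained model are passed in as functions so that the folder-walking logic stays independent of how they are implemented.

```python
from pathlib import Path
from typing import Callable, Dict, List, Sequence


def predict_folder(
    folder: str,
    extract_pages: Callable[[Path], List[str]],  # hypothetical: PDF path -> text per page
    predict_proba: Callable[[List[str]], Sequence[float]],  # hypothetical: texts -> P(start)
) -> Dict[str, List[float]]:
    """Return, for every PDF in the folder, the per-page probability of starting a receipt."""
    results: Dict[str, List[float]] = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        page_texts = extract_pages(pdf_path)
        # Skip prediction for files from which no pages could be read.
        results[pdf_path.name] = list(predict_proba(page_texts)) if page_texts else []
    return results
```

If the saved model is a Scikit-learn pipeline stored with joblib, `predict_proba` could for example be `lambda texts: model.predict_proba(texts)[:, 1]`, where `model = joblib.load(...)` points at whatever file the training notebook wrote.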
The project consists of three notebooks:
- Extract Text from PDF file
- Extract Features and Train Model
- Deployment Using Saved Model