Business problem

Receipt Bank provides technology that unlocks the value of accounting data and automates the bookkeeping process. Our AI and automation technologies are used by over 5,000 accounting and bookkeeping firms and tens of thousands of small business customers globally. The large volume of documents that our clients produce, and the diversity of those documents, introduce complex machine learning challenges. To extract the valuable information contained in the documents efficiently, we must know how many items are present in a client’s file. For example, a single image may contain 2 receipts, and a 6-page PDF file may contain 4 different invoices. Applying an ML algorithm to each item independently is more efficient and reduces business costs.
In this competition, your task is to develop an algorithm that detects how many documents are contained in a PDF file. More precisely, you need to build a model that outputs, for each page in a PDF file, a probability score that the page is the beginning of a new document.
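To make the link between per-page scores and the document count concrete, here is a minimal sketch in Python. It is only an illustration, not the competition's scoring method: the function names, the 0.5 threshold, and the rule that the first page always starts a document are all assumptions made for this example.

```python
from typing import List


def find_document_starts(page_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return the 0-based indices of pages predicted to start a new document.

    Assumption: the first page of a file always starts a document,
    regardless of its score. The 0.5 threshold is also an assumption.
    """
    if not page_scores:
        return []
    starts = [0]
    for index, score in enumerate(page_scores[1:], start=1):
        if score >= threshold:
            starts.append(index)
    return starts


def split_into_documents(page_scores: List[float], threshold: float = 0.5) -> List[List[int]]:
    """Group page indices into documents using the predicted start pages."""
    starts = find_document_starts(page_scores, threshold)
    # Each document runs from its start page up to the next start page.
    boundaries = starts + [len(page_scores)]
    return [list(range(begin, end)) for begin, end in zip(boundaries, boundaries[1:])]


# Hypothetical scores for a 6-page PDF, one score per page.
example_scores = [0.98, 0.10, 0.91, 0.85, 0.07, 0.76]
documents = split_into_documents(example_scores)
document_count = len(documents)
```

With these made-up scores, pages 0, 2, 3 and 5 are treated as document starts, so the 6-page file is split into 4 documents, matching the invoice example above. In practice the threshold would be tuned on validation data rather than fixed at 0.5.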

