Case study


Design and development of end-to-end receipt analysis using OCR and machine learning

The company

PolyBox Solutions is a small technology start-up set up by two former students of Newcastle University Business School. Their aim is to bring receipts into the digital era by doing away with paper copies.

Founded by Digital Business MSc graduates Nikolaos Benopoulos and Rojin Yarahmadi, PolyBox provides practical support for retail brands and their customers by offering an "all-in-one" solution, including a reporting system to aid customer engagement and retention, and a receipt function to give their customers budgeting and analysis tools.

The goal

The goal of this project was to support businesses by creating a scalable reporting system that would allow them to have a more thorough understanding of their expenses and budgeting. The idea was to achieve this by applying machine learning techniques.

There are more than 11 billion receipts printed in the UK each year. Many of these will be used for expense returns, and this is before we’ve accounted for online invoices and e-receipts. Solutions in this area are currently scarce and mostly limited to financial applications that do not include real-time reporting and analysis.

PolyBox sought to improve the expense returns process for businesses by minimising the amount of manual work involved and by automatically scanning receipts using Optical Character Recognition (OCR) technology.

"PolyBox provided a real-world data challenge with a need to analyse paper receipts in an automated fashion and in real-time. Our team analysed the technical requirements and then designed and implemented an end-to-end scalable solution built on the Azure platform with serverless compute to optimise the costs and provide businesses with a robust solution."

Peter Michalák, Edge Solution Specialist, NICD


The results

The primary focus of this project was to find the best data analysis process and the best method for scanning receipts. Thanks to the significant data science expertise brought to the table by our team here at the National Innovation Centre for Data, PolyBox was able to develop a deeper understanding of how OCR systems work, and what the best options for a scalable solution with suitable databases and data analysis might be.

We helped PolyBox to achieve this by:

  • Analysing and testing different OCR readers
  • Designing a scalable serverless cloud architecture
  • Reviewing different databases and data processing techniques
  • Testing the OCR model on different kinds of receipts
  • Developing an end-to-end receipt processing solution

"Without the National Innovation Centre for Data, we could never have reached the technical structure solution in this timeframe. Their team are friendly, focussed and accurate. We are lucky to have them in the North East."

Rojin Yarahmadi, Co-Founder, PolyBox Solutions

Collecting and analysing data

PolyBox tested a low-code minimum viable product (MVP) before starting their project with us, gathering data from ten different companies and analysing it carefully. Based on this data collection and analysis, together with interviews and other market research, PolyBox knew they needed more assistance from data specialists.

PolyBox initially approached the Arrow programme at Newcastle University, which pairs fledgling businesses seeking to innovate and grow with the specific kind of expertise they need, and they were quickly matched with our team at the National Innovation Centre for Data.

This project was led by our Edge Solution Specialist Dr Peter Michalák, who is particularly skilled in the areas PolyBox needed assistance with.


Data solution

The ultimate aim of this project was to design and rapidly prototype a scalable solution that would automate the analysis of pictures of paper receipts. The format of the receipts analysed varied greatly and encompassed everything from more standard supermarket and restaurant receipts to tickets for events, and many more.

The technical solution we prepared was built on several cloud services, including Azure Blob Storage, Azure Functions, Service Bus, and a PostgreSQL database with PostgREST. The key component in the receipt analysis, however, was Form Recognizer from Azure Cognitive Services, an AI service that uses an advanced machine learning algorithm to extract text from documents.
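To give a flavour of how data behind PostgREST can be read, here is a minimal sketch of building a query URL. PostgREST exposes database tables over HTTP and expresses filters as query-string parameters such as `total=gte.10`. The table name `receipts`, the column names, and the base URL are illustrative assumptions, not details of the PolyBox prototype.

```python
from urllib.parse import quote, urlencode


def build_receipt_query(base_url, table="receipts", merchant=None,
                        min_total=None,
                        columns=("merchant", "total", "purchased_on"),
                        limit=20):
    """Build a PostgREST-style query URL.

    The table and column names are hypothetical; only the filter
    syntax (select=, eq., gte., limit=) follows PostgREST conventions.
    """
    params = {"select": ",".join(columns), "limit": str(limit)}
    if merchant is not None:
        params["merchant"] = f"eq.{merchant}"    # exact match filter
    if min_total is not None:
        params["total"] = f"gte.{min_total}"     # greater-than-or-equal filter
    # Percent-encode values, but keep commas readable in the select list.
    query = urlencode(params, quote_via=quote, safe=",")
    return f"{base_url.rstrip('/')}/{table}?{query}"


if __name__ == "__main__":
    print(build_receipt_query("http://localhost:3000",
                              merchant="Corner Shop", min_total=10))
```

The resulting URL could then be requested with any HTTP client, and PostgREST would return the matching rows as JSON.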

The prototype also included two custom-built Python scripts: (i) receipt upload, and (ii) data retrieval. The first script allows the user to upload a picture of a receipt to a blob store and submit a receipt processing request via the message bus. The solution, which is scalable and based on a serverless architecture, then automatically extracts the relevant information from the receipt and stores it in a database. The second script demonstrates how the user can query the database and retrieve the extracted receipt details.
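The flow described above (upload, processing request, extraction, storage, retrieval) can be sketched in a few lines of Python. This is a local simulation only, not the project's code: a dictionary stands in for Azure Blob Storage, `queue.Queue` for Service Bus, an in-memory SQLite database for PostgreSQL, and `extract_fields` is a hypothetical placeholder for the Form Recognizer call that simply returns fixed values.

```python
import json
import queue
import sqlite3
import uuid

blob_store = {}               # stand-in for Azure Blob Storage
message_bus = queue.Queue()   # stand-in for Service Bus


def make_database():
    """Create an in-memory database (stand-in for PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE receipts (blob_name TEXT PRIMARY KEY, "
        "merchant TEXT, total REAL)"
    )
    return conn


def upload_receipt(image_bytes):
    """Script (i): store the image and submit a processing request."""
    blob_name = f"{uuid.uuid4()}.jpg"
    blob_store[blob_name] = image_bytes
    message_bus.put(json.dumps({"blob_name": blob_name}))
    return blob_name


def extract_fields(image_bytes):
    """Hypothetical placeholder for the OCR step; returns fixed values."""
    return {"merchant": "Example Store", "total": 12.50}


def process_pending(conn):
    """Serverless-style worker: handle every queued request once."""
    processed = 0
    while not message_bus.empty():
        request = json.loads(message_bus.get())
        fields = extract_fields(blob_store[request["blob_name"]])
        conn.execute(
            "INSERT INTO receipts (blob_name, merchant, total) "
            "VALUES (?, ?, ?)",
            (request["blob_name"], fields["merchant"], fields["total"]),
        )
        processed += 1
    conn.commit()
    return processed


def fetch_receipt(conn, blob_name):
    """Script (ii): retrieve the extracted details for one receipt."""
    return conn.execute(
        "SELECT merchant, total FROM receipts WHERE blob_name = ?",
        (blob_name,),
    ).fetchone()


if __name__ == "__main__":
    db = make_database()
    name = upload_receipt(b"fake image bytes")
    process_pending(db)
    print(fetch_receipt(db, name))
```

Decoupling the upload from the processing through a queue is what lets the processing side scale independently: more workers can be started when more requests are waiting.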

During the project, we also identified several areas for potential further improvement of the implemented solution, which would likely focus on improving automated receipt quality assessment and analysis performance through use of a set of pre-trained and fine-tuned deep learning models.

Explore the next chapter in PolyBox’s innovation story and uncover the outcomes of their partnership with the National Innovation Centre for Data.
Read the full story here.

To find out more about PolyBox, visit their website. You can also find out more about Newcastle University's Arrow programme here.

