Track Travel Expenses From Train Tickets

Extract information from train ticket billing receipts

Hey there,

Welcome to the second episode of the AI For BI Rocks! Newsletter. Each week I'll share a practical example of how AI can be used to improve analytics.

This week I wrote a post about how I updated my AI for BI framework. Here's what the framework looks like:

As you can see from the picture, AI can support BI across three main themes:

  • READ: Make more data accessible for analysis, e.g., by using computer vision to extract information from PDF documents.

  • SPEED: Reduce time to insight by using machine learning to automatically extract patterns and signals from data.

  • LEAD: Let AI take over automated decision making by becoming the core of a recommendation service.

In today's example, I'll show you how we can use AI to extract information from train ticket billing receipts (READ) to use this data for further downstream analysis.

Problem Statement

Imagine we work for a mid-sized consulting firm that wants to figure out what drives travel costs. Let's assume that most travel is done via high-speed trains, called InterCity Express (ICE) in Germany.

The process for business travel looks like this:

  • Sales reps book their train tickets over a self-service portal from the train operator (Deutsche Bahn).

  • These tickets are paid for with a company credit card.

  • The travel management team monitors the entire credit card statement at the end of each month.

  • However, the credit card statements don't contain any details other than the amount spent and the date.

The travel management team needs more data to better understand which business trips are costing the most, including a breakdown of:

  • booking date

  • trip origin

  • trip destination

  • ticket price

Now we'll try to extract this information automatically from all booking receipts.

We've received a sample of 162 receipts from a single sales rep to work on a first prototype. A receipt looks like this:

Solution Overview

To quickly build a prototype and test feasibility, we're using an existing computer vision AI service to extract structured text elements from the PDF files (see Resources for a list of such services).

The overall use case architecture looks like this:

The solution architecture includes the following steps:

  1. Deploy an out-of-the-box AI service—in this case, Cognitive Services for Computer Vision on Microsoft Azure (analysis layer).

  2. Load the data into a staging area—in this case, again using Azure Blob Storage (data layer).

  3. Prepare a small ETL script that loads the data from the staging area, applies the AI service, and transforms it to a flat CSV (data layer).

  4. Upload the CSV file to a location from where it can be easily accessed and visualized with our BI tool, in this case Power BI (user layer).

Let’s go!

Walk-Through

We'll walk through these layers step by step.

Analysis Layer

Enable the AI service, in this case Azure Cognitive Services for OCR (see the resources below for alternatives), and note your credentials (key and endpoint).

That’s it!

Data Layer

First, we upload the PDF files (see resources at the bottom) to an Azure Blob Storage container so we can access them easily.

All PDF files should be unzipped and stored in a dedicated “pdfs” container on Azure Blob Storage.

Next, we’ll apply some ETL steps (find the full script at the end):

  • Download the PDF files from Azure Blob Storage.

  • Call the AI service.

  • Fetch the results from the analysis as an asynchronous operation.

What is an asynchronous operation?

While last week we retrieved the results immediately from the API, in this computer vision case the process is more complex and the results may take a few seconds to compute.

Therefore, for each document, we need to make two API calls: a POST request to send the document to the AI service, and a GET request to fetch the results.
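The POST-then-poll pattern can be sketched as a small helper. This is a minimal illustration, not the actual script: `submit` and `fetch` are hypothetical callables standing in for the real POST and GET requests to the AI service.

```python
import time

def poll_until_done(submit, fetch, interval=1.0, max_tries=30):
    """Two-call pattern for an asynchronous OCR operation.

    `submit` sends the document (the POST request) and returns an
    operation id; `fetch` (the GET request) returns a status string
    and, once finished, the result.
    """
    op_id = submit()
    for _ in range(max_tries):
        status, result = fetch(op_id)
        if status == "succeeded":
            return result
        time.sleep(interval)  # wait before polling again
    raise TimeoutError("analysis did not finish in time")
```

In practice `fetch` may report a "running" status several times before the result is ready, which is why the loop keeps polling with a short pause in between.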

Given the free plan's limit of 20 requests per minute, analyzing all 162 PDF documents takes about 30 minutes.

Restrictions: Check the GDPR/data protection requirements for your data and your region. It's generally advisable to remove ANY PII from documents or obtain permission to process this information.

Data Preprocessing

Calling the AI service has been pretty easy so far.

Extracting the information from the response, however, can be a bit tricky.

To understand why, let's see what the AI result looks like (truncated for brevity):

Compare this to the original document where I highlighted the elements that we are interested in (Date, Price, Origin, Destination):

You can see that the AI parsed the document line by line.

It starts with “DB” and “Online-Ticket” from the top left and ends with “Seite 1/1” on the bottom right.

Each line of text is an element of the result object, and the numbers that follow are the bounding boxes for these lines, indicating where the line was positioned in the document.

The bounding boxes are represented by x and y coordinates starting at the top left of the document with coordinates x = 0 and y = 0. Each bounding box is mapped as [x1, y1, x2, y2, x3, y3, x4, y4] like so:
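A quick sketch of how such a corner list can be tested against a page region (the page dimensions and region are assumptions for illustration):

```python
def box_inside(box, region):
    """True if every corner of `box` lies inside the rectangular `region`.

    `box` is the API's [x1, y1, x2, y2, x3, y3, x4, y4] corner list;
    `region` is (x_min, y_min, x_max, y_max). The origin is the top
    left of the page, so y grows downward.
    """
    xs, ys = box[0::2], box[1::2]  # even slots are x, odd slots are y
    x_min, y_min, x_max, y_max = region
    return (x_min <= min(xs) and max(xs) <= x_max
            and y_min <= min(ys) and max(ys) <= y_max)

# Example: the left half of an assumed 8.5 x 11 inch page.
left_half = (0, 0, 8.5 / 2, 11)
```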

We'll use these coordinates along with some keywords to extract the desired information from the document. For this purpose, there are three functions in the code:

get_text_by_keyword(result, keyword, n) → Returns the next n words after a keyword

get_text_between_keywords(result, keyword1, keyword2) → Returns everything between two keywords

get_text_by_position(result, filter_box) → Returns all text inside a bounding box (for example, to get text expected to be in the left half of the document)

These three functions will be enough to extract almost all the information we want from the PDF.
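To make the keyword-based functions concrete, here is a simplified sketch that operates on a flat list of recognized words rather than the full OCR result object; the function names mirror those above, but the bodies and the sample words are illustrative assumptions.

```python
def get_text_by_keyword(words, keyword, n):
    """Return the n words that follow the first occurrence of `keyword`."""
    if keyword in words:
        i = words.index(keyword)
        return words[i + 1 : i + 1 + n]
    return []

def get_text_between_keywords(words, keyword1, keyword2):
    """Return every word strictly between the two keywords."""
    try:
        i, j = words.index(keyword1), words.index(keyword2)
    except ValueError:
        return []
    return words[i + 1 : j]
```

For example, on a hypothetical word list like `["Datum", "14.02.2022", "Von", "Hamburg", "Hbf", "Nach", "Leipzig"]`, the first function pulls the date after the "Datum" label and the second pulls the origin between "Von" and "Nach".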

Now we just need to merge the results from the individual PDFs, combine them into a single table, and write a flat CSV file so the result looks something like this:

We can upload this CSV to another container on Azure Blob Storage from where we can easily pull it into any BI tool we want.
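The flattening step might look like the following sketch, which turns one dict per receipt into CSV text using only the standard library; the column names are assumptions based on the four fields the travel team asked for.

```python
import csv
import io

def to_csv(records):
    """Flatten per-receipt dicts into one CSV string.

    `records` is a list of dicts, one per analyzed receipt, with the
    four fields extracted from the PDFs (column names are illustrative).
    """
    fields = ["booking_date", "origin", "destination", "price"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

The resulting string can then be written to a file or uploaded directly to a second Blob Storage container for the BI tool to pick up.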

User Layer

Let's get to the fun part.

Open your favorite BI tool (I chose Power BI) and load the CSV we just created to get more insights.

I imported the file using PowerQuery and made sure that the data was parsed correctly with the right data types:

Now it's easy to create a simple report that shows not only the total travel expenses per year, but also the most popular destinations and the most popular routes (combinations of origins and destinations).

I chose a treemap as a visual, with sizes corresponding to the money spent:

With these visuals, the travel management team can clearly see the overall cost trend and absolute trip numbers.

The team can also see which routes contribute the most to overall travel costs (in our case, it's the connection from Hamburg to Leipzig and back).

We can also see that most trips start in Hamburg, while Bonn and Frankfurt are the second and third most popular destinations from there.

With this little exercise, we were able to create a quick prototype, probably within a day or two.

Of course, this is still a prototype and there are still many things to consider, but at least we can validate the following points:

  1. Is this really the information the travel team is looking for?

  2. Did we find anything surprising?

  3. Would we be able to act on this information?

Check out the resources below to create this dashboard yourself!

Did you like this use case? Drop me a reply and let me know!

And feel free to forward this newsletter to your colleagues!

See you next week!

Resources

Computer Vision AI Services

Data preparation code

Power BI dashboard


This use case was adapted from my book AI-Powered Business Intelligence (O’Reilly). You can read it in full detail here: https://www.aipoweredbi.com
