How To Predict Flight Delays with AutoML

Or: Will your plane land on time?

Read time: 7 minutes

Hey there,

Get ready for takeoff!

Today we take a look under the hood of a use case that helps airlines predict whether or not their planes will land on time.

We use Azure AutoML to train a machine learning model, but the same concepts work on other cloud platforms like GCP or AWS as well.

Let's go!

Want to find good data & AI use cases for your business? Then join my upcoming live cohort in December: Book a free 1:1 call with me for more details!

Problem Statement

Imagine we're business analysts at an airline such as American Airlines (AA), and our task is to maintain a dashboard for reporting on a business-critical metric.

In our case, this metric is the percentage of flights that are currently in the air and are expected to be delayed by more than 15 minutes.

For our example, let's assume that every hour we receive data with all flights that have departed in the last 60 minutes.

These hourly batches contain information about the flight details (origin, destination, flight number, etc.) and the flight delay at departure.

So, for example, the dataset we receive at 9:00 will contain the flights that departed between 8:00 and 8:59.

To make this example less complicated, we ignore time zones and assume that all timestamps are in coordinated universal time (UTC).

Our current approach to calculate the proportion of delayed flights at arrival is to simply take the proportion of delayed flights at departure.

After all, if a plane took off late, it'll probably arrive late, right?

Well, not always.

It turns out that this approach isn't bad, but not very accurate either.

If we use only the departure delay data to predict whether there will be a delay in arrival, we achieve a precision of 74% and a recall of 70%, or an F1-score of 0.72 when we combine the two.

Our goal is to come up with a new classifier that manages to beat this 0.72 F1-score, so we can make even more accurate predictions about flight delays on arrival.
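That 0.72 isn't magic, by the way; it follows directly from the harmonic mean of the stated precision and recall:

```python
# Baseline quality: combine the stated precision and recall into an F1-score.
precision = 0.74
recall = 0.70

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.72
```

This is the number our new classifier has to beat.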

Solution Overview

To improve the prediction for late arrival, we use a machine learning model trained by Auto ML on historical data.

Here's a conceptual overview of the architecture for this use case:

In the Data layer we have two data sources:

  1. Historical flight data provided as a CSV file

  2. Live data in the form of hourly batches that are consumed by our BI system

In the Analysis layer, we use Azure ML Studio's AutoML service to learn patterns from historical flight data and use this model to classify new data points (hourly flight batches) as likely to be delayed or not.

At the User layer, we use a small Python (or R) script in Power Query to retrieve the online predictions from the Azure ML Studio API, feed them back into our data model, and present the results in a Power BI report.

Note: Power BI and Azure ML can also be integrated natively, without R or Python programming. However, this requires a Pro or Premium Power BI licence. Also, the R/Python script gives you more flexibility if you want to connect another AI/AutoML service that's not hosted on Azure.

Model training with AutoML

If you want to follow along, check out the resource links below!

To train our model on historic data, we'll visit Azure ML Studio and start a new Automated ML job:

To get started, we need to provide some data.

On Azure, data is organized in data assets that are linked to your workspace.

Data assets (datasets) will ensure that your data is correctly formatted and can be consumed by ML jobs:

You can import data from various sources. In this case, we will upload a local file from CSV.

After verifying the CSV import settings (delimiter, encoding, and column headers), we need to define the data schema.

The schema is important for two reasons. It allows you to define:

  1. the columns (features) that should be considered by the AutoML algorithm

  2. the data type of each column

Both are difficult for an AutoML system to guess automatically.

The data type of your target variable also determines which AutoML task is possible for your data: classification, regression, or time series forecasting.

In our example, we don't need all the data from the dataset - some values don't give any information at all and would only increase the complexity (e.g. the year).

Here's a list of all the columns we're considering for our use case:

  • DayofWeek: The flight's weekday

  • Origin: The flight's origin airport

  • Dest: The flight's destination airport

  • DepDelay: Flag if the flight's departure delay was > 0

  • DepDelayMinutes: The flight's departure delay in minutes

  • DepDel15: Flag if the flight's departure delay was >= 15 minutes

  • DepartureDelayGroups: Different departure delay bins

  • DepTimeBlk: Departure time window

  • TaxiOut: Time in minutes between a flight's off-block time and actual take-off time

  • ArrDel15: Flag if the flight's arrival delay was >= 15 minutes (our target!)

  • ArrTimeBlk: Arrival time window

  • Distance: Flight distance in miles

  • DistanceGroup: Different flight distance bins
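If you prepare the CSV yourself, the column selection above can be sketched in pandas. The column names follow the list; the helper function is our own, not part of Azure:

```python
import pandas as pd

# Feature columns from the list above, plus the target ArrDel15
FEATURES = [
    "DayofWeek", "Origin", "Dest", "DepDelay", "DepDelayMinutes",
    "DepDel15", "DepartureDelayGroups", "DepTimeBlk", "TaxiOut",
    "ArrTimeBlk", "Distance", "DistanceGroup",
]
TARGET = "ArrDel15"

def select_training_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop everything (e.g. Year) except the features and the target."""
    return df[FEATURES + [TARGET]]
```

Dropping uninformative columns before the upload keeps the data asset small and the AutoML search space focused.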

Now, we'll set up an experiment for our training.

An experiment in Azure AutoML is a training job on a specific dataset. It helps us to keep things organized.

We can select a very cheap ($0.6/hour) compute resource here as our dataset isn't really that big.

Now it is time to specify what the AutoML job should do:

In our case we select classification because we want to predict a categorical variable (ArrDel15 - Yes/No).

Since most of our flights actually land on time, we'll choose a primary metric that works for imbalanced datasets, such as the precision score.
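To see why plain accuracy would be misleading here, consider a toy example (the 87/13 split is illustrative, roughly mirroring the share of delayed flights in our data):

```python
# Illustrative split: roughly 87% of flights on time (0), 13% delayed (1)
labels = [0] * 87 + [1] * 13

# A naive model that always predicts "on time"
predictions = [0] * 100

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
delays_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(accuracy)       # 0.87 -- looks decent
print(delays_caught)  # 0 -- yet not a single delay is caught
```

A metric focused on the positive class, like precision, doesn't reward this kind of lazy majority-class guessing.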

In general, the way Auto ML works is that it keeps trying to find better models until (A) there are no more performance improvements or (B) the time budget is used up.

Here we set the time budget to a maximum of 1 hour. In practice, however, it should take only about 15 minutes.

We can confirm everything with Save, and our AutoML job should be running.

While Auto ML is trying to find the best model for us, let's check a feature called Data Guardrails.

Data guardrails on Azure AutoML are a series of automated checks over the training data to increase the chances of a high-quality training outcome.

These checks include splitting the data into training and validation sets automatically, or checking for missing values in the training columns.

Azure AutoML also checks for some assumptions related to the training task at hand—for example, it recognizes the imbalanced classes in the label and flags them as a potential problem area.

The suggested action here is to choose a performance metric that can work with imbalanced classes, such as precision - glad we did that!

Evaluate the Model

When our Auto ML training is complete, we'll see a list with all the models that were trained:

In our case, the best model was a Voting Ensemble with a Precision score of 96.69%.

Sounds promising!

For the best model, we can take a look at "View explanation".

The model explanations help us understand how our model works, especially which features were important.

From here, we can see that besides DepDelay, the variables TaxiOut, the destination airport (Dest), and the origin airport (Origin) also had a significant effect.

That's interesting and makes sense intuitively as busy airports probably increase the risk of arrival delay.

Let's explore our model performance using the Metrics tab:

From here we can inspect the confusion matrix that shows the true data labels compared to the model's predicted labels on our validation data.

We can compare this table with the confusion matrix of our baseline classifier (using only DepDel15 as an input feature):

As you can see, our AutoML model makes significant gains over the baseline for flights that were predicted to be delayed (1) but actually arrived on time (0).

The AutoML model misclassified only 0.71% of these observations, whereas our baseline had an error of 2.9% in this category.

Likewise, for flights that were predicted to be on time (0) but actually landed with more than a 15-minute delay (1), we could reduce the baseline's error from 30.0% to 24.94%.

The resulting F1-score of our Auto ML model is 83% compared to 72% of our baseline.

As you can see, we could improve our baseline prediction by 11 percentage points, thanks to the AutoML approach.

With DepDelay, TaxiOut, Dest, and Origin being the most important variables influencing the prediction, the AutoML process can play to its strengths by learning rules for the many combinations of departure and arrival airports that would be tedious to handcraft.

Note: Don't be fooled by the much higher F1_score_macro metric that Azure ML Studio outputs above. The macro F1 score is defined as the unweighted average of the per-class F1 scores, so in this case that of the positive and negative classes: (0.83 + 0.98) / 2 = 0.905.
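To see why the macro score runs so much higher on imbalanced data, here is a toy calculation with hypothetical counts (not our actual confusion matrix): the easy majority class drags the average up.

```python
def f1(tp, fp, fn):
    """F1 score for one class from its true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: 90 on-time flights, 10 delayed; the model catches only 5 delays
f1_delayed = f1(tp=5, fp=0, fn=5)         # ~0.67 -- mediocre on the rare class
f1_on_time = f1(tp=90, fp=5, fn=0)        # ~0.97 -- easy majority class
f1_macro = (f1_delayed + f1_on_time) / 2  # ~0.82 -- looks far better than 0.67
```

When delays are what you care about, report the F1 score of the positive class, not the macro average.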

Let's deploy the best model so we can use it in Power BI!

Model Deployment

Model deployment in Azure ML Studio is really easy.

With just one click on "Deploy → Deploy to web service" we can quickly create a REST endpoint and make our model available to authorized users - that's one benefit of having training and prediction workflows integrated in one platform.

When the model is deployed, we get a REST endpoint we can use to run predictions.

In fact, let's try it out right inside Azure ML Studio:

As a further convenience, Azure also provides the appropriate code snippets for R and Python to make these predictions from anywhere outside the platform.
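A minimal sketch of such a call in Python, assuming a hypothetical scoring URI and API key (copy the real values from your deployed endpoint in Azure ML Studio; the exact payload shape depends on your deployment):

```python
import json
import urllib.request

# Hypothetical placeholders -- replace with your endpoint's real values
SCORING_URI = "http://<your-endpoint>.azurecontainer.io/score"
API_KEY = "<your-api-key>"

def build_request(records: list) -> urllib.request.Request:
    """Wrap flight records (dicts of feature columns) into a JSON scoring request."""
    body = json.dumps({"data": records}).encode("utf-8")
    return urllib.request.Request(
        SCORING_URI,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )

def predict(records: list) -> list:
    """Send the records to the endpoint and return the predicted labels."""
    with urllib.request.urlopen(build_request(records)) as response:
        return json.loads(response.read())

# Example call (requires a live endpoint):
# predict([{"DayofWeek": 5, "Origin": "IAH", "Dest": "DFW"}])
```

The records must carry the same feature columns the model was trained on, which is exactly what our hourly batches contain.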

Let's do that from Power BI!

Get Model Predictions Within Power BI

Time to put our knowledge into action in Power BI.

Our approach is quite simple: we take the R/Python snippet from Azure and plug it into Power Query to send the respective data to the API, fetch the prediction, and integrate it back into our data model:
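The Power Query step might look roughly like the sketch below. `dataset` is the name Power BI gives the current table inside a "Run Python script" step; `score_flights` stands in for whatever function calls the scoring endpoint:

```python
import pandas as pd

def add_predictions(dataset: pd.DataFrame, score_flights) -> pd.DataFrame:
    """Send each row to the scoring function and attach the results
    as a new ArrDel15_Prediction column."""
    records = dataset.to_dict(orient="records")
    out = dataset.copy()
    out["ArrDel15_Prediction"] = score_flights(records)
    return out

# In Power Query, the final DataFrame expression becomes the query output:
# dataset = add_predictions(dataset, score_flights=predict)
```

Keeping the API call behind a function argument also makes it easy to swap in a different AI/AutoML service later.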

When Power Query runs this pipeline, we will see a new column in our data table that indicates the flag for the predicted arrival delay:

With that, let's update our dashboard!

Building the Dashboard in Power BI

Since we already have a dashboard with an existing baseline, we only need to change the way the "Proportion of Delayed Flights" metric is calculated.

So click on the respective visual on the dashboard and update the field from "Average of DepDel15" to "Average of ArrDel15_Prediction" - the new column we just created in Power Query.

After that, the value for that metric should update immediately:

Remember that the original metric was 0.1308.

Based on the total number of delayed flights alone, we might intuitively assume that the predictions of our base model and the AI model aren't that different.

But don't fall into this trap!

Although the AutoML model predicts just one more delayed flight in total, the individual flights it flags are quite different.

Look at the flight table on the left:

In the AI-driven dashboard, we see some flights flagged as delayed on arrival even though they did not have a departure delay of 15 minutes.

Examples include flight AA2378 from IAH to DFW, flight AA2776 from OKC to DFW, and flight AA2828 from AUS to DFW, which even departed well ahead of schedule!

From the demo dataset shown below, we can confirm that flights AA2378 and AA2828 were in fact delayed on arrival by more than 15 minutes.

Flight AA2776 did not have the ArrivalDelay15 flag, but the plane landed 12 minutes late, despite having a head start of 10 minutes.

While technically the AI prediction here was wrong, it was definitely a good guess.


I hope you enjoyed this use case!

Did you see how Auto ML can help you to make better predictions for your dataset?

Remember to keep track of a baseline and see if and how well Auto ML can possibly outperform it.

Feel free to recreate this exercise using the resources provided below.

See you next Friday!


    AI-Powered Business Intelligence Book Cover

    This content was adapted from my book AI-Powered Business Intelligence (O’Reilly). You can read it in full detail here: