Keeping Your AI Alive Using the S-O-T Approach

A quick guide to production AI monitoring and observability

Everyone can bring an AI solution into production, but few can keep it there. Unlike classical IT systems, AI doesn’t crash with a bang. It drifts quietly over time. Predictions get a little worse. Answers get funky. And often, the first to notice… are your customers. Oops!

So how do you know if your AI system is actually performing well? One critical element that’s often missing is proper monitoring. But not just any monitoring. AI needs a different approach. One that goes beyond server uptime or latency dashboards.

Today, I’ll walk you through a simple framework that I’ve been using to keep AI solutions healthy — whether they’re powering real-time chatbots or monthly churn predictions. And the best part: You don’t need any fancy tech to get started with it today.

Let’s dive in!

🚫 I’m currently at full capacity and not taking on new clients.

But I’m still on a mission to help 100 people add $10K in profit with AI — which is why I built something new:

👉 Introducing the AI 10K Club
The easiest way to access the exact tools, templates, and techniques I’ve used over the past 5 years to launch AI solutions that drive $10K+ in recurring profit — monthly, quarterly, or annually depending on your business.

Inside, you’ll get full access to the AI 10K System — my proprietary 6-phase method for launching profitable AI use cases.

Every template, tool, and framework is included.
These are the same resources I’ve used to:

  • Uncover $400K+ in annual AI opportunity in a single day (enterprise client)

  • Save 5-figures per month without writing a single line of code (SaaS)

  • Generate $40K/month in new business through AI-powered lead gen (B2B service)

  • Help solo founders replace $50K/year in outsourced work with AI systems

🎁 Plus: Get access to my entire workshop library, 10+ pre-built AI solutions, weekly office hours, and many other perks.

50+ members already joined. Only 100 spots available for 2025.
New enrollments are open until May 18.

Why watching AI systems hits differently

Before we dive into the framework, let's clarify an important difference: monitoring vs. observability.

Monitoring asks basic questions like "Is this system running?" and "Is it responding within time limits?" Traditional monitoring focuses on tracking predefined metrics and alerting when thresholds are crossed. This works fine for conventional software where behavior is mostly deterministic, but it falls short for AI.

Observability goes deeper by answering questions like "Why is the system behaving this way?" It gives you the ability to investigate issues you didn't anticipate when building the system. This distinction becomes crucial for AI, where models can silently drift or degrade in unexpected ways while still appearing to function normally.

For AI systems, we need to consider four key pillars:

  1. Metrics – Numerical measurements of system performance

  2. Logs – Detailed records of events and decisions

  3. Traces – Request paths through your system

  4. Data – Input distributions, predictions, and ground truth

That fourth pillar — data — is what makes AI observability unique. Your standard software dashboards might show all green lights even while your model is quietly making increasingly poor predictions. They're completely blind to model drift, data quality issues, or adversarial attacks.

Think of it this way: traditional software fails loudly with crashes and errors. AI systems often fail silently, continuing to run perfectly well while gradually becoming less accurate or relevant.
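To make that fourth pillar concrete, here’s a minimal sketch of what logging the data pillar can look like: one record per prediction, appended to a file, so you can later join it with ground truth. The file name, field names, and example features are placeholders, not a prescribed schema.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("prediction_log.jsonl")  # placeholder location for the prediction log

def log_prediction(features: dict, prediction, confidence: float, model_version: str) -> None:
    """Append one prediction record so it can later be joined with ground truth."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,        # inputs, so you can check input distributions later
        "prediction": prediction,    # the model's output
        "confidence": confidence,    # useful for confidence-distribution checks
        "ground_truth": None,        # filled in once the real outcome is known
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Example: log a single (hypothetical) churn prediction
log_prediction({"tenure_months": 14, "plan": "pro"}, prediction="churn", confidence=0.71, model_version="v3")
```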

So just call your BI team? Not really. Most BI dashboards track outcomes (e.g., revenue, churn) after the fact. AI observability, however, tracks model behavior in real time — helping you catch issues before they impact those outcomes. Like a smoke detector that goes off before your house is on fire.

That’s why we need a different approach.

The S-O-T Framework – A Simple Approach for Complex Systems

To tackle the unique challenges of AI observability, I use a lightweight but powerful framework: S-O-T, which looks at an AI solution through a Strategic, Operational, and Technical lens. Each lens gives you a different view into your AI system’s health, and together they help you move from reactive firefighting to proactive oversight.

Strategic

This lens focuses on outcomes that matter to the business. Think metrics like:

  • Conversion rate

  • Customer satisfaction

  • Campaign ROI

  • Lift

These are the “so what?” metrics — they tell you if your AI is actually moving the needle in ways that justify its existence (i.e., hitting your 10K Threshold).

Operational

Here, we're looking at the system’s day-to-day behavior:

  • Are inference jobs finishing on time?

  • Is uptime consistent?

  • Are costs spiking?

This lens helps you understand whether your AI system is reliable and sustainable in production.

Technical

The final lens gets into the model’s internals:

  • Data drift

  • Confidence distribution

  • Security guardrails for LLMs

  • Calibration error

  • etc.

These are your checks to ensure that your model stays in touch with reality — and they prompt you to make changes (e.g., retrain the model, update the prompt) even if no operational alarms are going off yet.
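As an illustration of the first item, here’s a minimal drift check using a two-sample Kolmogorov–Smirnov test from SciPy. It assumes you have numerical values of one feature from training time and from the latest batch; the synthetic data and the 0.05 threshold are purely illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test

def feature_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.05) -> bool:
    """Flag drift if the current feature distribution differs significantly from the reference."""
    statistic, p_value = ks_2samp(reference, current)
    print(f"KS statistic={statistic:.3f}, p-value={p_value:.3f}")
    return p_value < alpha  # True -> distributions differ, worth a closer look

# Example with synthetic data: the current batch has shifted upwards
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time feature values
current = rng.normal(loc=0.4, scale=1.0, size=1_000)     # latest batch
if feature_drift(reference, current):
    print("Drift detected: consider retraining, or at least investigate the input data.")
```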

The beauty of the S-O-T framework is that it works for both real-time and batch systems — you just need to tailor the metrics slightly depending on the context.

Here’s a simple comparison to illustrate:

| Lens | Real-Time AI (Ex: Chatbot) | Batch AI Service (Ex: Churn Prediction) |
| --- | --- | --- |
| Strategic | Case resolution rate | Churn prevention |
| Operational | Response time, uptime | Job completion time |
| Technical | ROUGE score | Feature drift |

Tip: Start with just one or two metrics per lens. The goal isn’t to boil the ocean — it’s to track what’s most critical for your use case. Over time, you can layer in more detail as your system and needs evolve.

So how often should you review or update these metrics?
Here’s a simple rule of thumb:

  • Review operational and technical metrics weekly, or after each batch run

  • Review strategic metrics monthly or quarterly, aligned with business reporting cycles

  • Update your metrics when you retrain the model, shift business goals, or discover new risks

  • Set alerts for critical outliers — depending on what “normal” looks like in your system
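What “normal” means is up to your system, but a simple starting point is to compare the latest value of a metric against its recent history. The sketch below flags anything more than three standard deviations from the recent mean; the example numbers and the threshold are placeholders you’d tune to your own baseline.

```python
import statistics

def is_outlier(history: list[float], latest: float, k: float = 3.0) -> bool:
    """Flag the latest value if it deviates more than k standard deviations from recent history."""
    if len(history) < 5:  # not enough data yet to define "normal"
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) > k * stdev

# Example: weekly batch job completion times in minutes
completion_times = [42, 45, 41, 44, 43, 46, 44]
if is_outlier(completion_times, latest=95):
    print("Alert: job completion time is far outside its normal range.")
```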

Getting Started – Practical Next Steps

You don’t need a massive observability stack or a team of MLOps engineers to make progress here. In fact, the best way to start is by picking one or two metrics per S-O-T lens that make sense for your use case — and tracking them consistently.

1. Start Small

  • Strategic: Choose a metric that reflects the business outcome your AI system supports. This might relate to revenue impact, customer satisfaction, conversion lift, or process efficiency.

  • Operational: Pick a metric that helps you understand system performance and reliability. Common ones include latency, uptime, cost per run, or data freshness.

  • Technical: Monitor for model quality and safety. This could mean tracking drift, confidence distribution, relevance scores, or security guardrails depending on your model type.

The main goal is to build the habit — and make model behavior visible before it becomes a problem.
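If it helps, here’s a minimal sketch of that habit in code: one row per review, one column per lens, appended to a CSV. The file name and example values are made up; a plain spreadsheet works just as well.

```python
import csv
from datetime import date
from pathlib import Path

TRACKER = Path("sot_metrics.csv")  # placeholder file, one row per review

def record_sot_metrics(strategic: float, operational: float, technical: float) -> None:
    """Append this week's value for one metric per S-O-T lens."""
    new_file = not TRACKER.exists()
    with TRACKER.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "strategic", "operational", "technical"])
        writer.writerow([date.today().isoformat(), strategic, operational, technical])

# Example values: conversion lift (%), p95 latency (ms), feature drift score
record_sot_metrics(strategic=4.2, operational=850, technical=0.08)
```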

2. Use Simple Tools

You likely already have what you need:

  • Spreadsheets for tracking

  • Alerts via Slack or email using scripts or cron jobs (see the sketch below)

  • Visualization tools like Grafana, or even a simple Jupyter notebook

The key is visibility, not perfection.
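For the Slack route, a small script like the sketch below is usually enough. It assumes you’ve created an incoming webhook in your Slack workspace; the URL and message are placeholders. Schedule it with cron after each batch run and you have a basic alerting loop.

```python
import requests  # pip install requests

# Placeholder: create an Incoming Webhook in your Slack workspace and paste its URL here
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def send_alert(message: str) -> None:
    """Post a short alert message to a Slack channel via an incoming webhook."""
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()

# Example: call this from a cron job whenever one of your metrics crosses a threshold
send_alert(":warning: Churn model: feature drift detected in last night's batch run.")
```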

3. Scale Gradually

As your system grows, so will your observability needs. You can:

  • Add a second metric per lens

  • Set alert thresholds

  • Integrate monitoring tools like Prometheus or instrumentation frameworks like OpenTelemetry (see the sketch below)

Just like your model, your monitoring stack should evolve with usage.
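If you do graduate to Prometheus, exposing your S-O-T metrics can be as small as the sketch below, using the official prometheus_client library. The metric names and the dummy values are placeholders; in practice you’d set the gauges from your actual checks.

```python
from prometheus_client import Gauge, start_http_server  # pip install prometheus-client
import random
import time

# One gauge per lens metric; the names here are illustrative
drift_score = Gauge("model_feature_drift", "Latest feature drift score for the churn model")
latency_ms = Gauge("model_p95_latency_ms", "p95 inference latency in milliseconds")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        # In a real setup these values would come from your monitoring checks
        drift_score.set(random.uniform(0.0, 0.2))
        latency_ms.set(random.uniform(600, 900))
        time.sleep(60)
```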

4. Learn and Iterate

Treat observability as a living system. Use every incident as a feedback loop to improve your metrics and understanding of the system’s behavior.

Conclusion

AI systems don’t fail like traditional software — they degrade, drift, and quietly mislead. That’s why building in monitoring and observability from day one is essential to keeping them alive and useful in the real world — even if that means starting with a simple spreadsheet to track inputs and outputs.

The Strategic-Operational-Technical framework helps you to cut through metric overload and focus on what actually matters: not just whether your AI is running, but whether it’s delivering the outcomes you built it for.

If you haven’t already, pick one metric per S-O-T lens and start tracking this week.

Keep your AI systems close!

See you next Friday,
Tobias
