Keeping Your AI Alive Using the S-O-T Approach

A quick guide to production AI monitoring and observability

Everyone can bring an AI solution into production, but few can keep it there. Unlike classical IT systems, AI doesn’t crash with a bang. It drifts quietly over time. Predictions get a little worse. Answers get funky. And often, the first to notice… are your customers. Oops!

So how do you know if your AI system is actually performing well? One critical element that’s often missing is proper monitoring. But not just any monitoring. AI needs a different approach. One that goes beyond server uptime or latency dashboards.

Today, I’ll walk you through a simple framework that I’ve been using to keep AI solutions healthy — whether they’re powering real-time chatbots or monthly churn predictions. And the best part: You don’t need any fancy tech to get started with it today.

Let’s dive in!

🚫 I’m currently at full capacity and not taking on new clients.

But I’m still on a mission to help 100 people add $10K in profit with AI — which is why I built something new:

👉 Introducing the AI 10K Club
The easiest way to access the exact tools, templates, and techniques I’ve used over the past 5 years to launch AI solutions that drive $10K+ in recurring profit — monthly, quarterly, or annually depending on your business.

Inside, you’ll get full access to the AI 10K System — my proprietary 6-phase method for launching profitable AI use cases.

Every template, tool, and framework is included.
These are the same resources I’ve used to:

  • Uncover $400K+ in annual AI opportunity in a single day (enterprise client)

  • Save 5-figures per month without writing a single line of code (SaaS)

  • Generate $40K/month in new business through AI-powered lead gen (B2B service)

  • Help solo founders replace $50K/year in outsourced work with AI systems

🎁 Plus: Get access to my entire workshop library, 10+ pre-built AI solutions, weekly office hours, and many other perks.

50+ members already joined. Only 100 spots available for 2025.
New enrollments are open until May 18.

Why watching AI systems hits differently

Before we dive into the framework, let's clarify an important difference: monitoring vs. observability.

Monitoring asks basic questions like "Is this system running?" and "Is it responding within time limits?" Traditional monitoring focuses on tracking predefined metrics and alerting when thresholds are crossed. This works fine for conventional software where behavior is mostly deterministic, but it falls short for AI.

Observability goes deeper by answering questions like "Why is the system behaving this way?" It gives you the ability to investigate issues you didn't anticipate when building the system. This distinction becomes crucial for AI, where models can silently drift or degrade in unexpected ways while still appearing to function normally.

For AI systems, we need to consider four key pillars:

  1. Metrics – Numerical measurements of system performance

  2. Logs – Detailed records of events and decisions

  3. Traces – Request paths through your system

  4. Data – Input distributions, predictions, and ground truth

That fourth pillar — data — is what makes AI observability unique. Your standard software dashboards might show all green lights even while your model is quietly making increasingly poor predictions. They're completely blind to model drift, data quality issues, or adversarial attacks.

Think of it this way: traditional software fails loudly with crashes and errors. AI systems often fail silently, continuing to run perfectly well while gradually becoming less accurate or relevant.
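To make that fourth pillar concrete, here’s a minimal sketch of what logging the data pillar can look like: one record per prediction, appended to a file, so you can later join it with ground truth. The file name, field names, and example features are placeholders, not a prescribed schema.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("prediction_log.jsonl")  # placeholder location for the prediction log

def log_prediction(features: dict, prediction, confidence: float, model_version: str) -> None:
    """Append one prediction record so it can later be joined with ground truth."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,        # inputs, so you can check input distributions later
        "prediction": prediction,    # the model's output
        "confidence": confidence,    # useful for confidence-distribution checks
        "ground_truth": None,        # filled in once the real outcome is known
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Example: log a single (hypothetical) churn prediction
log_prediction({"tenure_months": 14, "plan": "pro"}, prediction="churn", confidence=0.71, model_version="v3")
```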

So just call your BI team? Not really. Most BI dashboards track outcomes (e.g., revenue, churn) after the fact. AI observability, however, tracks model behavior in real time — helping you catch issues before they impact those outcomes. Like a smoke detector that goes off before your house is on fire.

That’s why we need a different approach.

The S-O-T Framework – A Simple Approach for Complex Systems

To tackle the unique challenges of AI observability, I use a lightweight but powerful framework: S-O-T, which looks at an AI solution through a Strategic, Operational, and Technical lens. Each lens gives you a different view into your AI system’s health, and together they help you move from reactive firefighting to proactive oversight.

Strategic

This lens focuses on outcomes that matter to the business. Think metrics like:

  • Conversion rate

  • Customer satisfaction

  • Campaign ROI

  • Lift

These are the “so what?” metrics — they tell you if your AI is actually moving the needle in ways that justify its existence (i.e., hitting your 10K Threshold).

Operational

Here, we're looking at the system’s day-to-day behavior:

  • Are inference jobs finishing on time?

  • Is uptime consistent?

  • Are costs spiking?

This lens helps you understand whether your AI system is reliable and sustainable in production.

Technical

The final lens gets into the model’s internals:

  • Data drift

  • Confidence distribution

  • Security guardrails for LLMs

  • Calibration error

  • etc.

These are your checks to ensure that your model stays in touch with reality — and they prompt you to make changes (e.g., retrain the model, update the prompt) even if no operational alarms are going off yet.
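As an illustration of the first item, here’s a minimal drift check using a two-sample Kolmogorov–Smirnov test from SciPy. It assumes you have numerical values of one feature from training time and from the latest batch; the synthetic data and the 0.05 threshold are purely illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test

def feature_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.05) -> bool:
    """Flag drift if the current feature distribution differs significantly from the reference."""
    statistic, p_value = ks_2samp(reference, current)
    print(f"KS statistic={statistic:.3f}, p-value={p_value:.3f}")
    return p_value < alpha  # True -> distributions differ, worth a closer look

# Example with synthetic data: the current batch has shifted upwards
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time feature values
current = rng.normal(loc=0.4, scale=1.0, size=1_000)     # latest batch
if feature_drift(reference, current):
    print("Drift detected: consider retraining, or at least investigate the input data.")
```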

The beauty of the S-O-T framework is that it works for both real-time and batch systems — you just need to tailor the metrics slightly depending on the context.

Here’s a simple comparison to illustrate:

| Lens | Real-Time AI (Ex: Chatbot) | Batch AI Service (Ex: Churn Prediction) |
| --- | --- | --- |
| Strategic | Case resolution rate | Churn prevention |
| Operational | Response time, uptime | Job completion time |
| Technical | ROUGE score | Feature drift |

Tip: Start with just one or two metrics per lens. The goal isn’t to boil the ocean — it’s to track what’s most critical for your use case. Over time, you can layer in more detail as your system and needs evolve.

So how often should you review or update these metrics?
Here’s a simple rule of thumb:

  • Review operational and technical metrics weekly, or after each batch run

  • Review strategic metrics monthly or quarterly, aligned with business reporting cycles

  • Update your metrics when you retrain the model, shift business goals, or discover new risks

  • Set alerts for critical outliers — depending on what “normal” looks like in your system
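What “normal” means is up to your system, but a simple starting point is to compare the latest value of a metric against its recent history. The sketch below flags anything more than three standard deviations from the recent mean; the example numbers and the threshold are placeholders you’d tune to your own baseline.

```python
import statistics

def is_outlier(history: list[float], latest: float, k: float = 3.0) -> bool:
    """Flag the latest value if it deviates more than k standard deviations from recent history."""
    if len(history) < 5:  # not enough data yet to define "normal"
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) > k * stdev

# Example: weekly batch job completion times in minutes
completion_times = [42, 45, 41, 44, 43, 46, 44]
if is_outlier(completion_times, latest=95):
    print("Alert: job completion time is far outside its normal range.")
```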

Getting Started – Practical Next Steps

You don’t need a massive observability stack or a team of MLOps engineers to make progress here. In fact, the best way to start is by picking one or two metrics per S-O-T lens that make sense for your use case — and tracking them consistently.

1. Start Small

  • Strategic: Choose a metric that reflects the business outcome your AI system supports. This might relate to revenue impact, customer satisfaction, conversion lift, or process efficiency.

  • Operational: Pick a metric that helps you understand system performance and reliability. Common ones include latency, uptime, cost per run, or data freshness.

  • Technical: Monitor for model quality and safety. This could mean tracking drift, confidence distribution, relevance scores, or security guardrails depending on your model type.

The main goal is to build the habit — and make model behavior visible before it becomes a problem.
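If it helps, here’s a minimal sketch of that habit in code: one row per review, one column per lens, appended to a CSV. The file name and example values are made up; a plain spreadsheet works just as well.

```python
import csv
from datetime import date
from pathlib import Path

TRACKER = Path("sot_metrics.csv")  # placeholder file, one row per review

def record_sot_metrics(strategic: float, operational: float, technical: float) -> None:
    """Append this week's value for one metric per S-O-T lens."""
    new_file = not TRACKER.exists()
    with TRACKER.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "strategic", "operational", "technical"])
        writer.writerow([date.today().isoformat(), strategic, operational, technical])

# Example values: conversion lift (%), p95 latency (ms), feature drift score
record_sot_metrics(strategic=4.2, operational=850, technical=0.08)
```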

2. Use Simple Tools

You likely already have what you need:

  • Spreadsheets for tracking

  • Alerts via Slack or email using scripts or cron jobs (see the sketch below)

  • Visualization tools like Grafana, or even a simple Jupyter notebook

The key is visibility, not perfection.
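For the Slack route, a small script like the sketch below is usually enough. It assumes you’ve created an incoming webhook in your Slack workspace; the URL and message are placeholders. Schedule it with cron after each batch run and you have a basic alerting loop.

```python
import requests  # pip install requests

# Placeholder: create an Incoming Webhook in your Slack workspace and paste its URL here
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def send_alert(message: str) -> None:
    """Post a short alert message to a Slack channel via an incoming webhook."""
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()

# Example: call this from a cron job whenever one of your metrics crosses a threshold
send_alert(":warning: Churn model: feature drift detected in last night's batch run.")
```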

3. Scale Gradually

As your system grows, so will your observability needs. You can:

  • Add a second metric per lens

  • Set alert thresholds

  • Integrate monitoring tools like Prometheus or instrumentation frameworks like OpenTelemetry (see the sketch below)

Just like your model, your monitoring stack should evolve with usage.
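If you do graduate to Prometheus, exposing your S-O-T metrics can be as small as the sketch below, using the official prometheus_client library. The metric names and the dummy values are placeholders; in practice you’d set the gauges from your actual checks.

```python
from prometheus_client import Gauge, start_http_server  # pip install prometheus-client
import random
import time

# One gauge per lens metric; the names here are illustrative
drift_score = Gauge("model_feature_drift", "Latest feature drift score for the churn model")
latency_ms = Gauge("model_p95_latency_ms", "p95 inference latency in milliseconds")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        # In a real setup these values would come from your monitoring checks
        drift_score.set(random.uniform(0.0, 0.2))
        latency_ms.set(random.uniform(600, 900))
        time.sleep(60)
```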

4. Learn and Iterate

Treat observability as a living system. Use every incident as a feedback loop to improve your metrics and understanding of the system’s behavior.

Conclusion

AI systems don’t fail like traditional software — they degrade, drift, and quietly mislead. That’s why building in monitoring and observability from day one is essential to keeping them alive and useful in the real world — even if that means starting with a simple spreadsheet to track inputs and outputs.

The Strategic-Operational-Technical framework helps you to cut through metric overload and focus on what actually matters: not just whether your AI is running, but whether it’s delivering the outcomes you built it for.

If you haven’t already, pick one metric per S-O-T lens and start tracking this week.

Keep your AI systems close!

See you next Friday,
Tobias
