I get between 80 and 150 emails a day.
These emails live in six different inboxes across business, personal, and partnership accounts. Most messages don't really matter. Some matter a lot: a client request, an inbound lead, or just a notification about a failed payment because a credit card expired.
And the effort to keep on top of everything keeps climbing.
Because it's one thing to separate important from unimportant (I've been doing this automatically for a while). But really keeping track (which items are pending, which are done, where I need to follow up) is a whole different challenge.
And for the longest time, building and maintaining an AI system for this simply wasn't worth the effort (because in the end, it's still a productivity use case for me). Until now. Because a) the tech got better and b) I changed the way I think about this use case.
Today, I want to share this with you.
Let's dive in!
How I thought about it
The obvious move for solving my email overload was Gemini for Google Workspace, because my main account lives on Gmail. But two problems quickly disqualified this idea:
First, Gemini only covered one of my six inboxes.
And second, on that one inbox it was pretty underwhelming, because I couldn't really customize it to how I work.
Ok, I thought. Fine. I'll build it myself. I'd built dozens of email classification systems before, so why not just extend what I had and plug different workflows together?
But the "stitching a few workflows together" part quickly turned out to be harder than I thought.
Solving my email problems wasn't just about "is this important." I needed to know whether a message had been answered: whether I'd replied, pinged someone else, or was still waiting on a reply I'd asked for. I needed to track who owes whom across a thread with five people in it.
I wasn't looking at a pure classification problem anymore. I was looking at a management problem.
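To make the difference concrete, here is a minimal sketch of the "who owes whom" question for a single thread. It is only an illustration, not how my system decides: the `Message` fields, the status names, and the rule "whoever sent the last message is waiting on the other side" are all assumptions, and real threads with five people need more than this heuristic.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Message:
    # Hypothetical minimal shape of one email inside a thread.
    thread_id: str
    sender: str
    sent_at: datetime


def thread_status(messages, my_addresses):
    """Classify one thread by who sent the most recent message.

    Assumed heuristic: if I sent the last message, I'm waiting on
    someone else; otherwise the next move is mine.
    """
    if not messages:
        return "empty"
    mine = {address.lower() for address in my_addresses}
    last = max(messages, key=lambda m: m.sent_at)
    if last.sender.lower() in mine:
        return "waiting_on_others"
    return "i_owe_reply"
```

A classifier looks at one email at a time. This function already needs the whole thread plus the list of my own addresses, which is why the data has to be prepared before any AI gets involved.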
The obvious version for tackling this is "reasoning over my inbox". Hand an AI agent my IMAP logins and let it run the whole thing. Read, sort, reply, send, delete.
It's also the last thing I would ever do.
Because when the agent fires off a bad reply to a client, the damage is done. And when it deletes the wrong thread, I might not even notice. Email is far too important for me to just hand it off to a black-box AI system I don't supervise.
So I was stuck with two things that didn't fit. The problem screamed agentic AI. And I refused to let an agent loose on my live inboxes.
What changed
Two things flipped this for me.
First, the technical capabilities. Six months ago, out-of-the-box ChatGPT and Claude were simple chatbots. Today, they're extremely capable agentic systems (or at least can be used that way if you know how). Claude Code in particular feels less like a chatbot these days and more like an orchestrator, where the LLM really just controls a bunch of non-AI tools, such as a grep search to find relevant information across my files. (There's a good breakdown here on what Claude does behind the scenes – a lot isn't actually AI.)

What an agentic system like Claude is actually doing under the hood
(adapted from VILA-Lab)
Second, non-AI was really the missing piece. In order to give an agentic system like Claude Code access to my inbox, I had to replicate my inbox (or inboxes) into a "sandbox", a safe space that the agent has full access to.
And for this replication I didn't need AI. I needed a simple, stupid workflow on a platform I've been using anyway for a long time. In my case, that's n8n.
How I think about this now
So I split the job into two.
Part 1 became the data layer. n8n fetches my emails from the different providers and brings them into a shared table. In there, context is built: thread, recipients, inbound or outbound direction, and so on. There's also a little script that skips every email with an "Unsubscribe" in it. Emails from mailing lists don't have to be managed and aren't worth AI tokens.
This context-rich, high-quality dataset is the basis for the agentic system on top. (If you wonder what "you need high data quality for high-performing AI systems" means, this is it.)
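My data layer lives in n8n, so the following is not my actual workflow. It's a rough Python sketch of the same idea: plain deterministic code turns raw emails into context-rich rows and drops mailing-list mail before any AI sees it. The field names (`account`, `thread_id`, `from`, `to`, `body`), the `Row` type, and the case-insensitive "unsubscribe" check are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical schema for one row of the shared table.
    account: str
    thread_id: str
    sender: str
    recipients: tuple
    direction: str  # "inbound" or "outbound"
    subject: str
    body: str


def looks_like_mailing_list(body):
    # Assumed rule: any mention of "unsubscribe" marks a mailing list.
    return "unsubscribe" in body.lower()


def build_rows(raw_emails, my_addresses):
    """Turn raw email dicts into rows, skipping mailing-list mail."""
    mine = {address.lower() for address in my_addresses}
    rows = []
    for raw in raw_emails:
        if looks_like_mailing_list(raw["body"]):
            continue  # not worth managing, not worth AI tokens
        sender = raw["from"].lower()
        rows.append(
            Row(
                account=raw["account"],
                thread_id=raw["thread_id"],
                sender=sender,
                recipients=tuple(r.lower() for r in raw["to"]),
                direction="outbound" if sender in mine else "inbound",
                subject=raw["subject"],
                body=raw["body"],
            )
        )
    return rows
```

Nothing in here is smart, and that's the point. It's cheap, it's predictable, and it runs the same way every time.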
Part 2 was the agent. I picked Claude because I already had a subscription and, in my opinion, it's the best agentic platform right now. But the same thing could be done with Codex or ChatGPT.
All I had to do now was write the main system instructions in a CLAUDE.md file and define some tools, like fetching emails or updating status, that would be served by n8n.
The agent never works on anything other than the sandboxed table. It reads it, makes notes, keeps a log, drafts a reply when I ask. It cannot delete anything from my real mailbox. It cannot send. Those actions aren't within its reach.
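Here is a minimal sketch of what "not within its reach" can look like. In my setup the tools are served by n8n, so this Python version is purely illustrative, and every name in it (`SandboxTools`, `call_tool`, `ALLOWED_TOOLS`, the row fields) is hypothetical. The idea it shows: the agent can only call tools on an explicit list, and send or delete simply aren't on it.

```python
class SandboxTools:
    """Tools that only ever touch the sandboxed copy of the inboxes."""

    def __init__(self, table):
        self._table = table  # maps email_id -> row dict

    def fetch_emails(self, status=None):
        # Return copies so callers can't mutate the table by accident.
        rows = [dict(row, id=email_id) for email_id, row in self._table.items()]
        if status is None:
            return rows
        return [row for row in rows if row.get("status") == status]

    def update_status(self, email_id, status):
        self._table[email_id]["status"] = status
        return self._table[email_id]["status"]

    def save_draft(self, email_id, text):
        # A draft is just text in the table. A human copies and sends it.
        self._table[email_id]["draft"] = text
        return text


# The complete list of things the agent may do. No send, no delete.
ALLOWED_TOOLS = {"fetch_emails", "update_status", "save_draft"}


def call_tool(tools, name, **kwargs):
    """Dispatch a tool call, refusing anything outside the allow-list."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not exposed to the agent")
    return getattr(tools, name)(**kwargs)
```

If the agent asks for `send_email`, it gets an error, no matter how it phrases the request. There's no sending code behind the tools for a clever prompt to reach.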
That's the whole trick. Give an agentic system a clean dataset with all the context, then give it room to figure things out, inside a box where it can't break anything.
Capability and blast radius are two separate dials. Most setups turn them with one knob: more power means more access. They don't have to move together. I want my agent maximally free to reason and maximally unable to destroy.

So the question for every part of the system isn't "can AI do this." It's "should this part be AI at all." Fetching mail: no. Structuring it: no. Deciding what matters today and why: yes. Sending: never.
The payoff is mostly time. Sorting that used to eat an hour a day now takes ten minutes. I wouldn't pay $10K for this system every year. But I also wouldn't want to be without it.
The whole thing now runs on my laptop as my personal AI Mailroom – with no new subscriptions stacked on top. Every morning I type /organize and /brief to know what's up. Whenever I'm overwhelmed I type /prio and Claude tells me the top 3 things to do now. Every now and then I type /follow-up to check if there are some open loops to close. Claude gives me a draft that I just need to copy and send.

My AI Mailroom in action
Three design choices carry the system
Separate the context layer from the reasoning layer. Cheap deterministic code builds the dataset. The expensive, fallible part only ever sees a clean copy.
Give the agent a scratchpad. Notes and a running log are what let it reason across threads instead of judging each email cold. Memory makes all the difference between classification and management.
Hard-code what it can't touch. Not "please don't send." Cannot send. The boundary lives in the structure, not in an instruction the model might decide to ignore.
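The scratchpad from the second design choice can be as simple as an append-only file. This is a sketch under assumptions, not my exact setup: the `Scratchpad` class, the JSON Lines format, and the field names are made up for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


class Scratchpad:
    """Append-only notes the agent can read back in later sessions."""

    def __init__(self, path):
        self.path = Path(path)

    def note(self, thread_id, text):
        entry = {
            "at": datetime.now(timezone.utc).isoformat(),
            "thread_id": thread_id,
            "text": text,
        }
        # Append one JSON object per line; existing notes are never rewritten.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    def notes_for(self, thread_id):
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            entries = [json.loads(line) for line in f if line.strip()]
        return [e for e in entries if e["thread_id"] == thread_id]
```

With something like this, a note such as "asked the client for the signed contract on Monday" is still there the next morning, and that's what turns a cold judgment on a single email into tracking an open loop.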
Before you point AI at a messy, important system, split the work in two. The boring part prepares the ground so the "cool" part can play to its strengths.
Isn't that the case everywhere?
See you next Saturday,
Tobias