What Breaks When You Let AI Run Without You

Sunday morning. I’m at brunch. My phone buzzes. Telegram: “Agent failed: content_pipeline.”

I ignore it. Probably transient. I’ll check later.

Two hours later: “Agent failed: calendar_sync.” Then: “Agent failed: scoring_pipeline.” Then silence. Not the good kind — the kind where everything is broken and nothing is left to report the breakage.

I spent that afternoon on my laptop at the kitchen table — kids watching cartoons in the next room, coffee going cold — restarting 14 agents one by one, figuring out which outputs were corrupted, which were salvageable, which needed to be re-run from scratch. Four hours. On a Sunday.

The Sunday cascade: one notification snowballs into a full outage over six hours.

That was the first month. It got worse before it got better.

The blank page

My local model had been humming along for weeks. Simple tasks — format this, classify that. No problems.

Then I asked it something harder. Compare two options, think it through, recommend one.

It came back empty. No error. No crash. The system said “success” and handed me nothing.

I spent five days on this. Rewrote prompts. Reinstalled the model. Cleared caches. Questioned my career at 1 AM on a Tuesday.

The problem was one number.

These newer AI models think before they answer — like a person who mutters to themselves before speaking. That thinking burns through their word budget.

I’d given the model 200 words. It spent all 200 thinking. Zero left for the answer.

I changed 200 to 800. Everything worked.

Five days. One number. The system never showed an error because technically there wasn’t one. It did its best within the budget I gave it. I just gave it a budget that was too small for the job.

The month I didn’t notice

This one still bothers me.

I had multiple agents sharing the same conversation thread. Calendar agent, product scorer, content writer — all running in sequence, same context window. Efficient, right? Less overhead. Clever.

One Tuesday I’m reading a calendar summary and I see the phrase “prediction tier: A.”

My calendar agent doesn’t know what a prediction tier is. That’s vocabulary from my product scoring agent. But because they shared the same conversation, my calendar agent had picked up leftover context from the scoring run before it.

I started checking other outputs. The content writer was referencing meeting times. The accountability scorer was using phrases from content drafts. Everyone was leaking into everyone else.

The outputs looked professional. Good formatting, confident tone. You had to read carefully to notice that paragraph two mentioned something from a completely different system.

Shared context causes contamination between agents. Isolated conversations keep each agent's work clean.

I’d been sending these reports to myself for a month. A month of contaminated data that I trusted because it looked clean.

The fix was expensive and non-negotiable: every agent gets its own fresh conversation. No sharing. Ever. My API costs went up about 15%. My outputs became actually correct.

The six-hour Sunday

Back to that brunch disaster. What actually happened was simpler than it seemed.

One AI provider went down. Returned errors for six hours straight. Half my agents hit the wall and just… stopped.

No fallback. No plan B. They tried the one provider I’d configured and gave up.

So I built a failover chain. Four levels: Ollama running on my laptop (free, instant), then Gemini Flash (free tier), then DeepSeek (fractions of a cent), then Claude (the expensive one). If one breaks, the next one catches it.

The failover chain: Local Model (free) → Free Cloud → Paid Cloud → Frontier Model, each catching the one above.

Paranoid? Sure. But in the fourteen months since, I’ve had zero complete outages. Individual levels go down all the time. The whole system hasn’t stopped once.

The silent poison

This one took six months to catch.

I needed training data for my prediction engine. Hand-scoring each case takes an hour. I had 200 cases and needed more. So I gave an AI my rubric, showed it examples, and asked it to generate 400 more.

The output was impressive. Coherent analysis. Reasonable numbers. Plausible rankings.

Six months later I spot-checked an AI score against one of mine. User frustration: I’d said 50. The AI said 70. I checked another. Mine: 45. AI: 68. Another. Mine: 52. AI: 74.

Every AI-generated score was inflated by 15-20 points. Consistently. Across hundreds of cases. That blank response? It took me five days to find. The contamination took a month. But this one — six months of poisoned training data that I trusted because the numbers looked plausible and the reasoning sounded professional.

Why we let this happen

Here’s the part I keep thinking about.

Every one of these failures had a simple fix. Check the word budget. Isolate conversations. Build failover. Spot-check AI data.

None of this is complicated. I could have caught each one on day one.

I didn’t — because the outputs looked professional.

Clean formatting made me trust content I should have verified. Confident tone made me skip spot-checks I should have run. A system that says “success” made me assume success had actually occurred.

We confuse presentation with accuracy. AI hits this blind spot every single time, without even trying.

A professional-looking output with one wrong line buried inside. The output looks perfect. One line is wrong.

We do this with people too. A well-dressed advisor, a polished slide deck, a confident handshake — same blind spot. AI just triggers it on every single output, without even trying.

Script crashes restart from step 1. State machines resume from exactly where they stopped.

The actual goal

Running AI agents in production is mostly boring now. I check the dashboard, see green lights, and close the laptop.

That’s the goal. Not a demo that impresses people for five minutes at a meetup. A quiet system that works at 3 AM on a Sunday when I’m asleep and my phone doesn’t buzz.

Getting there cost me a Sunday brunch, five sleepless nights, a month of contaminated data, and six months of poisoned training scores. Not because the problems were hard. Because I trusted outputs I should have checked.

Building something with AI and hitting weird problems? I’ve probably hit the same ones — mo@fadaly.net.