February 2026

Single Error Boundary: A Radical Approach to Debugging AI Systems

Why we use one catch block in the entire codebase, and why it's the most productive debugging approach we've found

Standard advice for production systems: handle errors gracefully. Catch exceptions at every level. Provide fallbacks. Degrade elegantly. Never let the user see a stack trace.

We did the opposite.

One catch block in the entire codebase. Everything else raises. When something breaks, it breaks loudly, with a full stack trace, at a single predictable location.

This sounds reckless. It's actually the most productive debugging approach we've found for AI systems in active development.

[Diagram: AI system debugging challenges: long tool chains, partial failures, debugging archaeology, LLM unpredictability, swallowed errors, silent failures]

Why standard error handling fails for AI systems

The try/catch-everywhere pattern makes sense for stable systems with well-understood failure modes. For AI systems under active development, it creates a debugging nightmare.

Here's what happens with distributed error handling:

Tool A fails → catch & log → return None
Tool B gets None → does something weird → catch & log
Tool C gets empty input → mysterious wrong output

Original error buried in logs. Final output wrong. Nobody knows why.

  • Errors get swallowed and logged, not surfaced. The original exception is caught, logged, and converted to a return value. The calling code doesn't know something went wrong.
  • Silent failures compound. Tool A fails, returns None. Tool B receives None, does something weird, catches that exception, returns an empty result. Tool C gets empty input. The final output is wrong, but why? (The sketch after this list shows the pattern in code.)
  • Debugging becomes archaeology. You grep through log files. You reconstruct the call sequence. You find the original error buried under three layers of "handled gracefully."
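
Here's that anti-pattern in code. Everything in this sketch is hypothetical (the tool functions, the simulated timeout), but the shape is exactly the one described above:

import logging

logging.basicConfig()
logger = logging.getLogger(__name__)

def fetch_records(query: str) -> list[str]:
    raise TimeoutError("API timed out")       # simulate the original failure

def tool_a(query: str):
    try:
        return fetch_records(query)
    except Exception as e:
        logger.error(f"tool_a failed: {e}")   # the real error ends its life here
        return None                           # caller can't tell anything broke

def tool_b(records):
    try:
        return [r.upper() for r in records]   # records is None -> TypeError
    except Exception as e:
        logger.error(f"tool_b failed: {e}")   # a second, derived error
        return []                             # the failure is now an empty list

def tool_c(summaries):
    return {"report": summaries}              # "succeeds" on empty input

print(tool_c(tool_b(tool_a("latest"))))       # {'report': []}, and nobody knows why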

AI systems are particularly vulnerable because:

  • LLM responses are unpredictable. The model might return malformed JSON, unexpected content, or complete nonsense. You can't anticipate every failure mode.
  • Tool chains are long. A typical agentic workflow involves multiple steps: data loading, planning, generation, validation, possibly multiple revisions. Errors anywhere corrupt everything downstream.
  • Partial failures are common. The plan is fine but generation fails. Generation succeeds but validation fails. Validation passes but post-processing makes things worse.

When something goes wrong (and with LLMs, things go wrong in surprising ways), you need to know immediately and completely. Not "something failed somewhere, check the logs."

The single boundary pattern

Our approach: one location catches exceptions. Everything else raises.

Tools (all raise):    load_context, create_plan, generate, validate, post_process
Services (all raise): External API, Storage, LLM Client

                      ↓ every exception propagates up to ↓

Single boundary (agent/core.py:291):
  ↓ Log full stack trace
  ↓ Convert to ToolResult
  ↓ Agent reasons about failure

The boundary is in the agent's _act() method, the place where the agent executes a tool. Here's the simplified pattern:

# agent/core.py - THE ONLY TRY/CATCH IN THE SYSTEM
import logging
import traceback

logger = logging.getLogger(__name__)

async def _act(self, action: str, parameters: dict) -> ToolResult:
    tool = self.tools.get(action)
    try:
        result = await tool["execute"](parameters, self.context)
        return result
    except Exception as e:
        # Log EVERYTHING, including the full stack trace
        tb = traceback.format_exc()
        logger.error(f"Tool {action} failed: {e}")
        logger.error(tb)

        # Convert to a structured result the agent can reason about
        return ToolResult(
            success=False,
            error=str(e),
            traceback=tb
        )

That's it. One boundary. Everything upstream (tools, services, utilities) just raises:

# tools/generate/tool.py - NO TRY/CATCH
async def execute(params: dict, context) -> ToolResult:
    prompt = params["prompt"]                   # assumes the prompt arrives in params
    response = await context.llm.call(prompt)   # might raise
    output = parse_response(response)           # might raise
    context.state.save_artifact(output)         # might raise
    return ToolResult(success=True, data={"output": output})

If llm.call() fails, the exception propagates. If parse_response() fails, the exception propagates. The boundary catches it, logs everything, and converts it to a result the agent can reason about.
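
ToolResult itself can stay small. A minimal sketch, assuming a dataclass; the field names match how it's used above, but the exact shape is up to you:

from dataclasses import dataclass, field

@dataclass
class ToolResult:
    success: bool
    data: dict = field(default_factory=dict)
    error: str | None = None
    traceback: str | None = None

The point is that failure becomes a first-class value the agent inspects, not an exception it never sees.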

What happens when something breaks

Let's trace through a real failure: an external API times out mid-request.

With distributed error handling:

[DEBUG] Processing item 4 of 12
[ERROR] API timeout on item 4, retrying...
[ERROR] Retry failed, skipping item
[DEBUG] Processing item 5 of 12
...
[INFO] Processing complete
[INFO] Output saved to results.json

The result: output with mysterious gaps or corrupted data. The error was "handled." The item was skipped. Nobody noticed until a user reported something was wrong.

With single error boundary:

[ERROR] Tool 'process' failed: HTTPTimeoutError: Request timed out after 30s
[ERROR] Traceback:
  File "agent/core.py", line 295, in _act
    result = await tool["execute"](parameters, self.context)
  File "tools/process/tool.py", line 87, in execute
    response = await self._call_api(item)
  File "tools/process/tool.py", line 143, in _call_api
    return await self.client.request(...)
  File "httpx/_client.py", line 412, in request
    raise HTTPTimeoutError("Request timed out after 30s")

HTTPTimeoutError: Request timed out after 30s

The result: the agent knows processing failed. It can decide whether to retry, abort, or try a different approach. There's no mysterious partial output. The failure is visible, complete, and actionable.

The debugging experience is completely different. With distributed handling, you're searching logs for clues. With single boundary, the error is right there, complete with the exact line that failed.
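
What "the agent can reason about it" looks like depends on your loop. A minimal sketch, assuming the failed ToolResult is appended to the agent's context as an observation (add_observation is a hypothetical method):

async def step(self, action: str, parameters: dict) -> None:
    result = await self._act(action, parameters)
    if result.success:
        self.context.add_observation(action, result.data)
    else:
        # The error and stack trace go back into the model's context,
        # so the next reasoning step can choose to retry, switch tools,
        # or abort, instead of continuing on corrupted state.
        self.context.add_observation(action, {
            "error": result.error,
            "traceback": result.traceback,
        })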

The intentional evolution

This isn't permanent architecture. It's phase-appropriate architecture.

Current: Active Development

  • Goal: Understand failure modes
  • Approach: Visibility over resilience
  • Result: Fast debugging, clear understanding

Future: Production Hardening

  • Goal: Handle known failure modes
  • Approach: Resilience where proven
  • Result: Graceful degradation

The key insight: you can't add graceful handling until you understand what fails. And you can't understand what fails if errors are being swallowed and logged.

The trap is adding "resilience" before you understand failure modes. You end up catching errors you don't understand, handling them in ways that seem reasonable but create subtle bugs. Silent failures. Mysterious behaviors. Users reporting bugs you can't reproduce.

We add retry logic for API timeouts once we know they happen. We add fallbacks for specific LLM parsing failures once we've seen them. But we don't add generic try/catch blocks that hide problems we haven't discovered yet.
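
When we do add that targeted handling, it lives right at the flaky call and re-raises when exhausted, so anything unexpected still reaches the boundary. A sketch, assuming httpx and illustrative retry counts:

import asyncio
import httpx

async def call_flaky_api(client: httpx.AsyncClient, url: str, retries: int = 2) -> httpx.Response:
    # Targeted handling: retry ONLY the timeout we've actually observed.
    for attempt in range(retries + 1):
        try:
            return await client.get(url, timeout=30.0)
        except httpx.TimeoutException:
            if attempt == retries:
                raise  # exhausted: let it surface at the single boundary
            await asyncio.sleep(2 ** attempt)  # simple backoff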

When to use this pattern

Good fit:

  • Early development of complex systems
  • AI/ML pipelines with many failure modes
  • Systems where silent failures are worse than crashes
  • Teams that need to understand behavior before hardening

Bad fit (for now):

  • User-facing production APIs
  • Well-understood systems with known failure modes
  • Situations where partial results beat no results

The qualifier "for now" is important. As we move toward production, we add targeted error handling. But we add it based on observed failures, not anticipated ones.

The implementation

If you want to adopt this pattern:

  1. Identify your boundary. Where does the orchestrator call into components? That's where you catch.
  2. Remove all other try/catch blocks. This feels scary. Do it anyway. Leave only cleanup patterns (try/finally for resource management); see the sketch after the code below.
  3. Log completely at the boundary. Full stack trace. All parameters. Timestamp. Everything you'll need to understand what happened.
  4. Return structured errors. The orchestrator needs to reason about failures. Give it typed error results, not just boolean success flags.
  5. Track your failures. Every time you debug something, note the failure mode. When you've seen the same failure multiple times, that's when you add targeted handling.

# The pattern, self-contained: imports and the error type included
import traceback
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class StructuredError:
    action: str
    error: str
    traceback: str
    timestamp: datetime

async def orchestrator_boundary(action, params):
    try:
        return await execute_action(action, params)  # your dispatch function
    except Exception as e:
        log_complete_failure(action, params, e)      # your logging helper
        return StructuredError(
            action=action,
            error=str(e),
            traceback=traceback.format_exc(),
            timestamp=datetime.now(timezone.utc),
        )
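
One clarification on step 2: try/finally stays, because it cleans up without deciding anything about the error. A minimal example of the one construct that survives the purge:

def read_artifact(path: str) -> str:
    f = open(path)
    try:
        return f.read()    # may raise; the exception still propagates
    finally:
        f.close()          # cleanup always runs, nothing is swallowed

(with blocks are the same idea and equally fine.)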

The deeper lesson

Error handling strategy should match your development phase.

"Fail gracefully" is advice for stable systems with well-understood failure modes. You know what breaks. You know how to handle it. Graceful degradation improves user experience.

But unstable systems (systems under active development, systems where you're still discovering failure modes) need different medicine. Fail loudly. Fail completely. Make failures impossible to miss.

You can always add grace later. You can't retroactively add visibility to failures that were silently swallowed months ago.

We've been running this pattern across multiple AI projects. Every failure we see, we understand completely. We know exactly which API calls time out, which LLM responses parse incorrectly, and which edge cases cause validation failures.

That knowledge is invaluable. And we only have it because we refused to hide failures behind try/catch blocks.

The Bottom Line

This is how we build AI systems: with approaches that match the development phase, not cargo-culted "best practices" that hide problems. If you're building AI systems and struggling with mysterious failures, we should talk.