Token Limits: The Hidden Cost of Building Production-Grade AI Workflows

I built an AI orchestration system that produces amazing code, and I knew token limits would eventually become a constraint. What I didn't anticipate was how quickly I'd hit that wall once the system matured. After 25 years of building software, I've learned to plan for bottlenecks: database constraints, memory leaks, network latency, you name it. But the timing caught me off guard. Just as my workflow started producing consistently excellent output, I found myself racing against API rate limits faster than I'd projected. The system's success accelerated my timeline for dealing with this problem.

Here's what happened. I built an increasingly sophisticated agentic workflow for code generation that includes quality gates on the front end to verify tickets are properly specified, execution agents to build the actual functionality, review agents to critique the code, and verification steps to ensure everything meets our standards. The system loops until all checks pass, with orchestration sessions managing the entire workflow end-to-end. Multiple agents spin up simultaneously, each churning through their designated tasks. It works beautifully—the output quality is genuinely impressive, and the quality gates have been a massive improvement over previous iterations. But here's the kicker: every single step consumes tokens, and Anthropic has been getting increasingly conservative with their limits right when I need them most.

The irony is sharp. I spent months refining this orchestration approach, layering in safeguards and checks that ensure the code meets production standards. The very improvements that made the system valuable are the same ones consuming tokens at an unsustainable rate. It's like finally building the sports car of your dreams and then discovering there's a fuel shortage. I knew I'd need to optimize token usage eventually, but the system's success moved "eventually" much closer than I'd planned. This collision between my system's maturity and tightening token constraints has created an urgent need to understand exactly where my tokens are going and how to optimize without sacrificing the quality that makes the whole endeavor worthwhile.

The Observability Problem

The first challenge I'm tackling is visibility. When you're running a complex multi-agent workflow with orchestration, execution, review, quality gates, and verification all happening in sequence and sometimes in parallel, you lose track of which steps are the token hogs. Without observability, optimization is just guesswork. I could make changes that feel logical but actually move the needle backward, or I could obsess over optimizing steps that represent a tiny fraction of overall consumption. You can't optimize what you can't measure, and right now I don't have the measurement infrastructure I need to make smart decisions about where to invest optimization effort.

So I'm building observability into every step of the workflow. This means instrumenting each phase—orchestration decisions, execution attempts, review cycles, quality gate evaluations, and verification checks—to capture exact token usage. The goal is to create a clear picture of where the budget actually goes. Is it the initial planning? The code generation itself? The review loops that happen when something doesn't pass? Once I have real data, I can make informed decisions about where to invest optimization effort. This isn't just about cutting costs; it's about understanding the economics of AI-assisted development well enough to make this approach sustainable at scale.
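As a minimal sketch of what that instrumentation might look like, here is a per-phase token ledger. The phase names and the `record_usage`/`report` helpers are hypothetical; in a real workflow the input/output counts would come from the model API's usage metadata on each response.

```python
from collections import defaultdict

# Hypothetical per-phase token ledger. Real counts would come from the
# model API's usage metadata, not hard-coded numbers.
ledger = defaultdict(lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0})

def record_usage(phase, input_tokens, output_tokens):
    """Attribute one model call's token usage to a workflow phase."""
    entry = ledger[phase]
    entry["calls"] += 1
    entry["input_tokens"] += input_tokens
    entry["output_tokens"] += output_tokens

def report():
    """Print phases sorted by total tokens, so the hogs surface first."""
    rows = sorted(ledger.items(),
                  key=lambda kv: kv[1]["input_tokens"] + kv[1]["output_tokens"],
                  reverse=True)
    for phase, e in rows:
        total = e["input_tokens"] + e["output_tokens"]
        print(f"{phase:<14} calls={e['calls']:>2} tokens={total:>7}")

# Simulated run of one ticket, with invented numbers:
record_usage("quality_gate", 1_200, 300)
record_usage("execution", 8_000, 4_500)
record_usage("review", 9_500, 1_100)
record_usage("execution", 7_800, 4_200)  # retry after a failed review
report()
```

Even a ledger this crude answers the key question: which phase to optimize first, and by how much a retry loop inflates a ticket's total cost.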

The observability layer also helps me identify patterns I wouldn't see otherwise. For instance, certain types of tickets might trigger more review loops, consuming disproportionately more tokens than others. Or perhaps the verification steps are more expensive than I realized because they're re-analyzing large contexts repeatedly. Without measurement, I'm flying blind. With it, I can turn token optimization into an engineering problem rather than a guessing game. The data will reveal which optimizations deliver the biggest impact and which ones are just nibbling around the edges.

Model Selection as a Strategic Decision

Once I have visibility into token consumption, the next lever is model selection. Not every step in this workflow requires the same level of reasoning capability. Using Claude Opus for everything is like hiring a senior architect to pour the concrete foundation—technically they could do it, but it's expensive overkill. The breakthrough insight for me has been recognizing that different phases of the workflow have different cognitive requirements. This realization opens up significant optimization opportunities without compromising quality where it matters.

I've started with a hypothesis that's showing real promise after a couple of days of testing: use Opus for the upfront planning and thinking, where the hard cognitive work happens, then switch to Sonnet for execution. The logic is straightforward. If I invest in thorough planning with quality gates ensuring the ticket is properly specified and the approach is sound, then the actual code generation becomes more mechanical. Sonnet handles this beautifully—it can one-shot the implementation without problems when the groundwork has been laid properly. The expensive thinking has already happened; now we're just translating well-formed plans into code. This approach lets me deploy the most expensive model only where its advanced reasoning capabilities are truly necessary.

This tiered approach extends further down the capability stack too. Some verification steps and simpler workflow tasks don't even need Sonnet—Haiku can handle them just fine. Checking that a file was created correctly, verifying basic formatting standards, or confirming that certain prerequisite steps completed successfully don't require advanced reasoning. They're binary checks or simple pattern matching. By routing these tasks to Haiku, I can optimize token usage even more aggressively without compromising the quality of the work that actually matters. The key is matching the model's capability to the task's cognitive demand, creating a graduated system where each component uses the minimum necessary reasoning power to accomplish its function reliably.
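The tiered approach above can be sketched as a simple routing table. The task categories and the model-name strings here are illustrative assumptions, not a fixed API; the point is the shape of the decision, cheapest capable model per task.

```python
# Sketch of capability-tiered model routing. Task names and model-name
# strings are illustrative assumptions.
PLANNING_TASKS = {"ticket_analysis", "architecture_plan"}
EXECUTION_TASKS = {"code_generation", "code_review"}
CHECK_TASKS = {"file_exists_check", "format_check", "prereq_check"}

def pick_model(task: str) -> str:
    """Route each workflow task to the cheapest model that handles it reliably."""
    if task in PLANNING_TASKS:
        return "claude-opus"      # expensive reasoning, used only up front
    if task in EXECUTION_TASKS:
        return "claude-sonnet"    # translates a well-formed plan into code
    if task in CHECK_TASKS:
        return "claude-haiku"     # binary checks and simple pattern matching
    return "claude-sonnet"        # conservative default for unknown tasks

print(pick_model("architecture_plan"))  # claude-opus
print(pick_model("code_generation"))    # claude-sonnet
print(pick_model("format_check"))       # claude-haiku
```

The default branch matters: when a new task type appears, it falls to the mid-tier model rather than silently getting either the most expensive or the least capable one.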

Quality Gates as Token Savers

Here's something counterintuitive I've learned: the quality gates I added to the front end aren't just improving output quality—they're actually helping with token efficiency. At first glance, adding more checks seems like it would consume more tokens, and in isolation it does. But the system-level effect is the opposite. By ensuring tickets are well-specified before any execution begins, I dramatically reduce the number of review-and-retry loops downstream. This upfront investment pays for itself multiple times over by preventing expensive rework cycles later in the workflow.

Think about it this way. If an execution agent starts working on a poorly specified ticket, it might generate code that technically works but doesn't actually solve the right problem. The review agent catches this, and we loop back. That loop is expensive—we've consumed tokens for execution, review, and now we'll consume them again for re-execution and re-review. If that cycle happens two or three times, the token cost multiplies. But if I spend tokens upfront ensuring the ticket is crystal clear about requirements, acceptance criteria, and technical constraints, the execution agent hits the target on the first try. One execution, one review, done. The math is compelling: spending 10% more tokens on specification to avoid 50% of rework loops is an excellent trade.
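The trade described above can be made concrete with back-of-the-envelope numbers. All the per-phase costs and retry rates below are invented for illustration; the structure of the calculation is what matters.

```python
# Back-of-the-envelope comparison: expected tokens per ticket with and
# without a specification quality gate. All numbers are invented.
SPEC_GATE = 2_000   # upfront cost of verifying the ticket
EXECUTION = 12_000  # one execution attempt
REVIEW = 3_000      # one review pass

def expected_tokens(gate: bool, retry_rate: float, max_loops: int = 3) -> float:
    """Expected token cost when each failed review triggers a full redo."""
    cost = SPEC_GATE if gate else 0.0
    p_reach = 1.0  # probability we reach this loop iteration
    for _ in range(max_loops):
        cost += p_reach * (EXECUTION + REVIEW)
        p_reach *= retry_rate  # chance this attempt fails and we loop again
    return cost

# Without the gate, assume 40% of reviews fail; with it, assume 10%.
without = expected_tokens(gate=False, retry_rate=0.40)
with_gate = expected_tokens(gate=True, retry_rate=0.10)
print(f"without gate: {without:,.0f} tokens per ticket")   # 23,400
print(f"with gate:    {with_gate:,.0f} tokens per ticket")  # 18,650
```

Under these assumed numbers, the gate pays for itself on the first ticket, and the gap widens as the ungated retry rate climbs.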

This front-loaded investment in clarity pays dividends throughout the entire workflow. It's similar to the old software engineering wisdom that finding bugs early in the development cycle is exponentially cheaper than finding them in production. In this case, finding ambiguity or gaps in the ticket specification before execution starts saves massive token expenditure downstream. The quality gates aren't overhead—they're insurance against expensive rework loops. They shift token consumption from reactive correction to proactive prevention, which is always more efficient in any system.

The Path Forward

I'm still early in this optimization journey, but the direction is clear. The combination of observability, strategic model selection, and front-loaded quality gates represents a coherent approach to making AI-assisted development economically sustainable. This isn't about cutting corners or accepting lower quality output. It's about being intentional with where we deploy expensive cognitive resources and where we can get away with more efficient options. The goal is to build a system that's both high-quality and cost-effective, not to compromise one for the other.

The urgency is real—I'm running through token limits fast enough that this isn't a "nice to have" optimization. It's a requirement for this workflow to remain viable. But that pressure is pushing me to think more systematically about the architecture of agentic workflows. How do we design systems that are both high-quality and cost-effective? How do we instrument them well enough to optimize intelligently? How do we match computational resources to task requirements? These aren't just my problems to solve; they're emerging as fundamental questions for anyone building production-grade AI workflows.

These questions feel fundamental to the next phase of AI-assisted software development. As these tools mature and more developers build sophisticated workflows, we're all going to hit these same constraints. The teams that figure out how to navigate token economics while maintaining quality will have a significant advantage. This isn't just about saving money—it's about building workflows that can scale to handle real production workloads without bankrupting the operation. The technical capability exists; now we need to develop the engineering discipline to deploy it sustainably.

Lessons for Builders

If you're building AI workflows, here's what I'd recommend based on what I'm learning. First, build observability in from the start. Don't wait until you hit token limits to understand where your consumption is happening. Instrument every step and track costs religiously. Second, think hard about model selection. The most powerful model isn't always the right choice—match capability to task complexity. Third, invest in quality at the input stage. Clear specifications and well-formed problems reduce expensive downstream loops. These aren't just cost-saving measures; they're architectural decisions that determine whether your workflow can scale sustainably.

Most importantly, treat token consumption as a first-class architectural concern, not an afterthought. Just as we learned to design for scalability, security, and performance, we now need to design for token efficiency. The constraints are different, but the discipline is the same: measure, analyze, optimize, and repeat. The goal isn't to build the cheapest system—it's to build the most effective system within economic constraints that make it sustainable. This requires the same engineering rigor we apply to any other critical system component, with the same commitment to measurement, iteration, and continuous improvement.

The AI tooling landscape is evolving rapidly, and token pricing and limits will change over time. But the fundamental principle will remain: building production-grade AI workflows requires the same engineering rigor we apply to any other critical system. We need to understand the costs, optimize intelligently, and build systems that deliver value sustainably over the long term. The developers who treat token economics as seriously as they treat performance optimization or security will be the ones building systems that can actually scale to production workloads and stay there.