Back to Being a Bottleneck: Why I'm Okay With Slowing Down My AI Software Factory
I've been iterating on Sandstorm Desktop, my AI-powered software factory project, for quite some time now. The original concept was simple: build a system that could spin up multiple AI agents in isolated Docker containers to tackle development tickets in parallel. I didn't want to pay for yet another cloud-based service when I could build something that worked exactly how I needed it to work. For a while, this approach was incredibly successful. I could grab tons of tickets, spawn agents to work on them simultaneously, and generate pull requests in parallel. The velocity was exhilarating—it felt like I'd unlocked a superpower that every developer dreams about.
But there was a problem I couldn't ignore. While the quantity of output was impressive, the quality wasn't consistently meeting the standards I needed. Code reviews became battlegrounds of requested changes and refinements. I was moving fast, but I wasn't necessarily moving well. This created a fork-in-the-road moment: continue optimizing for parallel execution and speed, or pivot to optimizing for quality and accuracy. After dealing with too many pull requests that required significant rework, I chose quality.
The Quality Gate Experiment
My first attempt at improving quality involved adding internal review processes before code ever reached a pull request. I implemented a system where an execution agent would complete the work, then a review agent would evaluate that work against our standards. If the review agent found issues, it would pass the work back to the execution agent for fixes. This internal iteration loop happened before any code left the sandbox environment. Think of it as a pre-flight check—catching problems before they become public embarrassments in your actual repository.
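The internal loop described above can be sketched in a few lines. This is a minimal illustration, not Sandstorm's actual implementation: the `execute` and `review` functions here are stand-ins for real agent calls, and the "review" is just a trivial TODO check so the loop is runnable.

```python
from dataclasses import dataclass


@dataclass
class ReviewResult:
    approved: bool
    feedback: str


def execute(ticket: str, feedback: str = "") -> str:
    # Stand-in for the execution agent producing a diff for the ticket.
    code = f"# implements: {ticket}\n"
    if not feedback:
        # Simulate an imperfect first attempt that the reviewer will flag.
        code += "# TODO: handle edge cases\n"
    return code


def review(code: str) -> ReviewResult:
    # Stand-in for the review agent evaluating work against standards.
    if "TODO" in code:
        return ReviewResult(False, "unresolved TODO left in the diff")
    return ReviewResult(True, "")


def run_with_internal_review(ticket: str, max_rounds: int = 3) -> str:
    """Iterate execution and review inside the sandbox before opening a PR."""
    feedback = ""
    for _ in range(max_rounds):
        code = execute(ticket, feedback)
        result = review(code)
        if result.approved:
            return code  # only approved work leaves the sandbox
        feedback = result.feedback
    raise RuntimeError("review loop exhausted; ticket needs human attention")
```

The key design point is the bounded loop: feedback flows from the reviewer back to the executor, but after a fixed number of rounds the ticket escalates to a human instead of burning cycles indefinitely.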
This helped somewhat, but it wasn't enough. I realized I was still treating the symptoms rather than the disease. The real problem wasn't just in the execution phase—it was upstream in the planning and ticket definition phase. When tickets were ambiguous or lacked sufficient context, even the best execution agents would make incorrect assumptions. Garbage in, garbage out, as the old saying goes. So I added another layer: ticket quality gates. Before an agent could even begin execution, the ticket itself had to meet certain quality thresholds. It needed clear acceptance criteria, specific implementation details, context about which parts of the codebase would be affected, and explicit testing requirements.
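A ticket quality gate like the one described can be expressed as a simple check before dispatch. The section names below are illustrative, not Sandstorm's actual schema:

```python
# The sections a ticket must fill in before an agent may pick it up.
# Names are hypothetical placeholders for the real quality thresholds.
REQUIRED_SECTIONS = (
    "acceptance_criteria",
    "implementation_notes",
    "affected_areas",
    "testing_requirements",
)


def passes_quality_gate(ticket: dict) -> list[str]:
    """Return the sections still missing or empty; an empty list means
    the ticket clears the gate and can go to an execution agent."""
    return [section for section in sorted(REQUIRED_SECTIONS)
            if not ticket.get(section)]
```

A draft with only acceptance criteria would come back with three gaps to fill, which is exactly the refinement conversation that now happens before execution instead of during code review.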
The result of these changes has been a noticeable improvement in first-attempt success rates. But—and this is the critical part—it's also created a new bottleneck, and that bottleneck is me. Instead of casually creating tickets in 10-15 seconds and letting agents figure out the details, I'm now spending 2-5 minutes upfront using Sandstorm itself to refine each ticket until it's comprehensive and unambiguous. I'm doing the thinking work in advance, clarifying intent, eliminating ambiguity, and creating a solid plan before any code gets written. Yes, this means I can't spawn as many parallel agents. Yes, this means my total throughput has decreased. And that's when I realized: I'm not actually solving the right problem here.
This Isn't Waterfall (But It's Not Cowboy Coding Either)
Some people might look at this approach and claim I've just reinvented waterfall development. I disagree, and here's why: waterfall involves extensive upfront planning for work that spans weeks or months. What I'm doing is spending a few extra minutes on planning for work that might take an agent 15-30 minutes to execute. The scale is completely different. I'm not writing detailed design documents or creating comprehensive specifications for an entire feature. I'm simply being clear and thorough about what needs to happen for a single, focused task.
The philosophy here is simple: do the work where it makes the most sense. Iterating on pull requests is expensive—it involves context switching, review cycles, and potential impacts on other team members. Iterating during the planning phase is cheap. It's just me and Sandstorm's refine skills, working through the problem before committing resources to a solution. By front-loading the thinking, I ensure that when execution happens, it happens right the first time. The goal is the one-shot solution: an agent executes a ticket, produces high-quality code that conforms to our conventions, and the resulting pull request gets approved without requested changes.
When that doesn't happen—when a pull request comes back with feedback—I treat it as a learning opportunity. I examine why the one-shot failed. Was the ticket ambiguous? Did I miss important context that should have been in our Claude.md file? Did the internal review process fail to catch something? Every piece of feedback becomes data that informs improvements to the system. This iterative refinement of the process itself is where real progress happens—and it's also what revealed the real problem I'm facing.
The Planning Problem: It's the Harness, Not the Model
Here's what I've discovered: current AI models actually can plan reasonably well for complex tasks when given the right context. Sonnet, for example, writes excellent code when given a well-defined ticket. Give it clear instructions, proper context, and explicit requirements, and it'll knock the implementation out of the park. But getting to that well-defined ticket is the hard part—and that's where I initially misdiagnosed the problem. I thought the models weren't good enough at planning. Opus is better at planning than Sonnet, but it still requires human refinement. There are always little ambiguities that need clarification, small tweaks that make a big difference, and context that needs to be made explicit rather than left implicit.
But after spending all this time refining tickets, I've realized the model can actually make decent decisions once it has all the information it needs. The problem isn't the model's planning capability—it's that the harness does a poor job of discovering and surfacing the information the model needs to plan effectively. When I manually refine a ticket, what am I really doing? I'm identifying which parts of the codebase are relevant, surfacing past decisions and patterns, providing explicit context about architectural constraints, and connecting the current task to related work. That's not planning—that's context discovery and assembly. The model could do that planning work if the harness gave it the right information to work with.
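That context-discovery-and-assembly step can be made concrete. The sketch below is a deliberately naive version of what a smarter harness would do, using plain keyword matching where a real system would need semantic search and real project history; everything here is a hypothetical illustration:

```python
from pathlib import Path


def assemble_context(task: str, repo: Path, decisions: dict[str, str]) -> str:
    """Gather what an agent needs before it plans: relevant source files
    (naive keyword match) and any recorded past decisions that mention
    the same keywords."""
    keywords = {word.lower() for word in task.split() if len(word) > 3}
    relevant_files = [
        path for path in sorted(repo.rglob("*.py"))
        if keywords & set(path.read_text().lower().split())
    ]
    relevant_decisions = [
        f"{title}: {body}" for title, body in decisions.items()
        if keywords & set(body.lower().split())
    ]
    return "\n".join(
        ["# Task", task, "# Relevant files"]
        + [str(path) for path in relevant_files]
        + ["# Past decisions"]
        + relevant_decisions
    )
```

Crude as it is, this captures the shape of the work I do by hand today: turning "refactor the property validation" into an explicit bundle of files, patterns, and prior decisions the model can actually plan against.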
This realization completely changes the problem I'm solving. I'm not a bottleneck because I'm better at planning than AI—I'm a bottleneck because I'm currently the best context-discovery mechanism in my workflow. I'm the one who can take a rough idea, understand what information is needed to execute it properly, find that information across the codebase and project history, and assemble it into a coherent context for the agent. The harness should be doing this work, but current harnesses—including the ones I'm using—simply aren't sophisticated enough.
The Context Accumulation Challenge: Beyond Orchestration
What I really want is something that mimics how humans work together over time. When you work with a team week after week, you don't start every conversation from scratch. You build on previous discussions, reference shared context, and communicate efficiently because everyone understands the broader picture. If I say "let's refactor the property validation logic like we did with the tenant validation," my human colleagues would immediately understand the pattern I'm referencing and what I'm asking for.
AI agents, at least in my current setup, don't work this way. Each ticket exists somewhat in isolation. The agent doesn't carry forward the accumulated context of every conversation, every decision, every pattern we've established. Claude Code tries to address this with its Claude.md file and a memory feature that leaves little notes along the way, but neither is an ideal solution. The memory is limited and doesn't capture the nuanced understanding that comes from working on a project over time, and the Claude.md file requires me to explicitly document everything I want the agent to remember, which means I'm constantly making judgment calls about what's important enough to include. These features haven't solved the problem; they've just created a slightly better workaround.
This context accumulation challenge is really a harness problem, not an orchestration problem. Sandstorm Desktop is an agent orchestration system—it manages how agents are spawned, how they execute tickets, how they coordinate their work. But the harness is what provides the actual interface between the agent and the code, and that's where context needs to live. The canned approaches we get from current AI code harnesses aren't solving this problem well. Right now, I'm the translation layer between my rough, conversational ideas and the explicit, detailed instructions that agents need. I take my shorthand—"do that property thing like we did before"—and use Sandstorm's refine skills to convert it into structured tickets with all the context spelled out. It's manual work, and it's time-consuming, but it's necessary given the current state of both the models and the harnesses.
The ideal future state would involve a harness sophisticated enough to maintain rich contextual memory, understand project patterns, and interpret conversational intent without requiring me to explicitly document every detail. A harness that could automatically identify relevant code sections, surface related past work, understand architectural patterns from the codebase itself, and assemble all of that into context for the model to use in planning. That would eliminate me as the bottleneck entirely—not by making the models better at planning, but by giving them the context they need to plan effectively. Of course, there's also the question of whether more work needs to be done in defining things better in a Claude.md file or leveraging memory features more effectively. That's part of the exploration—figuring out what combination of approaches might actually move the needle.
Time to Investigate the Harness Layer (But Not Rebuild the World)
This brings me to where I am now, and it's not a place of comfortable acceptance. I'm not okay with being a permanent bottleneck, even if the current workflow produces better results than the previous high-velocity chaos. The solution isn't to wait for the next generation of models—Mythos or whatever comes next—hoping they'll be better at interpreting rough intent without context. The solution is to fix the actual problem: the harness's inability to provide rich, relevant context for planning.
That means it's time to start investigating whether there are better approaches to the harness layer. Now, I'm well aware that building an entire harness from scratch would be a massive undertaking; it would require significant research and effort up front before I could even begin. I'm not jumping into a full custom harness build, and I'm definitely not committing to one right now. It's a possibility I'm exploring as part of understanding the landscape, and it would be a different ball game entirely.
The more practical approach—and probably the one I'll take first—is experimenting with different harnesses for different phases of the workflow. For example, maybe I could use a different harness like Aider for the planning phase, where context discovery and ticket refinement happen, and then continue using Claude Code for execution. That's probably going to be the easiest path forward. I could test whether other harnesses handle context management better than what I'm currently using, without having to rebuild everything from the ground up. It's about finding the right combination of tools rather than necessarily building everything myself.
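The mix-and-match idea reduces to a small pipeline: one harness owns planning, another owns execution. The callables below are placeholders for real harness invocations (wrapping, say, Aider for planning and Claude Code for execution); nothing here depends on either tool's actual CLI:

```python
from typing import Callable


def two_phase_pipeline(
    rough_idea: str,
    plan: Callable[[str], str],     # wrapper around a planning harness
    execute: Callable[[str], str],  # wrapper around an execution harness
) -> str:
    """Route planning and execution through different harnesses: the
    planner turns a rough idea into a refined ticket, and the executor
    turns the refined ticket into code."""
    ticket = plan(rough_idea)
    return execute(ticket)
```

The point of the seam is that either side can be swapped independently, which is exactly what makes this kind of experimentation cheap compared with building a harness from scratch.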
This is about investigation and experimentation, not making massive commitments to build custom infrastructure for problems that someone else might have already solved. Do existing harnesses like Aider handle context discovery better than what I'm currently using? Are there open-source harness projects with interesting approaches to context accumulation that I could learn from or integrate? Can I mix and match different harnesses for different workflow phases to get better overall results? These are the questions I need to answer before deciding whether custom harness work is even necessary.
What I've Learned About AI-Assisted Development
Building Sandstorm Desktop has taught me that AI agent workflows for software development are absolutely viable, but they require different thinking than traditional development. You can't just throw poorly-defined work at AI agents and expect great results, no matter how many agents you spawn. Quality inputs produce quality outputs—but ensuring quality inputs isn't about better planning from the model, it's about better context from the harness. It's also revealing the boundaries between different problem domains—orchestration versus harness, execution versus planning, speed versus quality—and forcing me to think carefully about where the actual bottlenecks are.
I've also learned that optimizing for the wrong metric can be counterproductive. Early on, I was optimizing for parallelization and throughput: how many tickets could I work on simultaneously? But the real metric that matters is successful, high-quality completions that require no rework. It's better to complete five tickets perfectly than to complete twenty tickets that each need multiple rounds of revision. This shift in perspective has implications beyond just my workflow; it's changing how I think about what problems are worth solving in the AI tooling space.
But the most important lesson is this: when you identify a bottleneck, don't just accept it and optimize around it. Investigate whether you're actually looking at the right problem. For weeks, I thought I was the bottleneck because I was better at planning than AI. That led me down a path of accepting manual ticket refinement as necessary overhead. Only by really examining what I was doing during that refinement did I realize I wasn't planning—I was doing context discovery. And context discovery is absolutely something a well-designed harness should handle.
Moving Forward: The Harness Investigation Begins
The software factory vision is still alive and well, but it's evolved significantly. It started as "spin up as many agents as possible in parallel." It matured into "ensure every agent execution produces production-ready code on the first attempt." Now it's becoming "build the infrastructure that enables agents to plan effectively by giving them the context they need." That's a more sophisticated vision, and it requires solving harder problems than just orchestration.
The immediate next step is investigating harness solutions—not building from scratch, but exploring what's already out there. That means experimenting with different harnesses like Aider to understand their approaches to context management. It means testing whether using one harness for planning and another for execution produces better results than a single-harness approach. It means diving into what Claude Code's memory and Claude.md features can actually do when pushed harder, and whether there are ways to leverage them more effectively. This isn't about committing to a massive rebuild—it's about research and experimentation to find the best path forward.
I'm being realistic about what I'm taking on here. A full custom harness would be a huge undertaking, and I'm not rushing into that. But I am committed to finding better solutions to the context problem, whether that means mixing different tools, using existing harnesses in smarter ways, or potentially building custom components if that turns out to be necessary. The key is to investigate first, experiment second, and only commit to big builds when I'm sure there's no better alternative.
The frontier keeps moving, and that's exactly where I want to be. Not using AI tools as they're given to me, but finding the combination of tools and approaches that make AI-assisted development actually work at the level of quality and efficiency I know is possible. The harness problem is the next challenge, and I'm ready to tackle it—methodically, experimentally, and without biting off more than I can chew.