[{"data":1,"prerenderedAt":111},["ShallowReactive",2],{"blog-regression-tests":3},{"id":4,"title":5,"body":6,"date":102,"description":103,"extension":104,"meta":105,"navigation":106,"path":107,"seo":108,"stem":109,"__hash__":110},"blog/blog/regression-tests.md","Regression Tests for Non-Deterministic Skills: Why Your AI Code Reviews Need Behavioral Testing",{"type":7,"value":8,"toc":92},"minimark",[9,13,16,21,24,27,31,34,37,40,44,47,50,53,57,60,63,66,70,73,76,79,83,86,89],[10,11,12],"p",{},"I've been working on improving automated code reviews in my agent orchestration systems, and I keep running into the same frustrating problem: every single automated code review tool misses things that we think it should have caught. It doesn't matter which project or what requirements we're working with—the issue persists. This isn't just an occasional hiccup; it's a fundamental challenge that reveals something important about how we're building AI-powered development tools.",[10,14,15],{},"If you're careful and paying attention, you'll notice this pattern everywhere in agentic development. When we get output that's wrong or suboptimal, we follow a predictable process: we try to identify why the output was produced incorrectly, then we figure out how to prevent it from happening again. We modify prompts, adjust our Claude configurations, tweak skills, and bring that learning back into the system. It's an iterative process that feels productive in the moment, but there's a hidden trap waiting for us.",[17,18,20],"h2",{"id":19},"the-typical-code-review-learning-cycle","The Typical Code Review Learning Cycle",[10,22,23],{},"Here's how this plays out with code reviews specifically. We get a pull request with an associated commit, and that pull request goes through our automated review process. 
Sometimes we identify problems with the pull request—maybe the AI allowed all this logic to be crammed into one spot when it should have suggested breaking it apart and moving it to a better location. When this happens, we take that learning back into our system. We modify our code review prompt to incorporate this new knowledge, hoping to catch similar issues next time and prevent poorly organized code from slipping through.",[10,25,26],{},"This is just one example among many, but the pattern is consistent. Generally speaking, we're getting really good at code reviews through this iterative improvement process. But recently, I noticed two issues that came up on the same day that should have been caught by our code review system, and they weren't. This could potentially be related to model regressions from updates, but regardless of the cause, the result was the same: I found myself sitting down to modify the code review prompt yet again.",[17,28,30],{"id":29},"the-realization-that-changed-everything","The Realization That Changed Everything",[10,32,33],{},"Then something occurred to me that seems obvious in hindsight but isn't widely discussed: we can't just modify the code review prompt and assume it works. We need to verify that the new code review prompt would have actually caught the original problems we identified. So I started running the modified prompt through the scenarios that had failed, and guess what? The changes I was going to make wouldn't have actually caught the problem. The new prompt failed the same way the old one did.",[10,35,36],{},"This kicked off an iterative process where I kept adjusting the prompt over and over again until I found a change that would have caught the problem. Only then could I say, \"Okay, let's bring this in and take that learning.\" Now I had a very specific change with a demonstrated improvement in the code review capability. 
This felt like progress: real, measurable progress backed by evidence rather than hope.",[10,38,39],{},"But when I thought about this process playing out over time, I realized we had an even bigger problem lurking. Here's what happens: we make a prompt change and prove that it fixes a specific issue. Later, we find another issue with a different pull request. We do some iterating and discover we need to tweak the prompt a little bit to fix that new issue. Then, whoops, we've accidentally created a regression. The prompt no longer catches the original problem that we first fixed. We've essentially robbed Peter to pay Paul, solving one problem while reintroducing another.",[17,41,43],{"id":42},"why-we-need-behavioral-regression-tests-for-code-reviews","Why We Need Behavioral Regression Tests for Code Reviews",[10,45,46],{},"This example demonstrates a need that I don't see many people talking about in the agentic AI development space: we need behavioral regression tests for our code reviews themselves. Not evals in the traditional sense, but something more specific and targeted to the actual output behavior we expect from our AI agents.",[10,48,49],{},"The way this process would work is straightforward but powerful. We would pin a specific pull request, and more precisely a specific commit within it, that we've identified as containing something the review should have caught. For instance, a class that the code review should have flagged as needing abstraction or relocation to a different architectural layer. We document this: this specific pull request, this specific commit, and this specific flag that should have been raised about the bad code.",[10,51,52],{},"When we start building a list of these documented cases, each one capturing a pull request, a commit, and an expected flag, we're essentially creating regression tests for our pull request review system.
These aren't tests of our application code; they're tests of the AI agent's behavior itself. Over time, as we modify our prompts to make our pull request reviews more aligned with our expectations and higher in quality, we have a safety net that prevents us from creating regressions by drifting away from fixes we've already implemented.",[17,54,56],{"id":55},"beyond-traditional-evals","Beyond Traditional Evals",[10,58,59],{},"There may also be a need for improving the prompt itself in other ways—making it more concise, for example, or addressing inefficiencies in how it's structured. But what I'm describing here isn't quite the same as a traditional eval. We're not evaluating whether the specific words we use are triggering the skill correctly. Instead, we're evaluating the output of a skill to verify that the output remains consistent over time as we modify the underlying prompt.",[10,61,62],{},"This is behavioral regression testing for non-deterministic systems. It acknowledges that LLMs don't produce identical output every time (hence \"non-deterministic\"), but it also recognizes that we can and should test for consistency in the types of issues they catch and flag. If our code review agent caught a poorly abstracted class three weeks ago after we tuned the prompt, it should still catch that same pattern today after we've made additional modifications for other issues.",[10,64,65],{},"The implications of this approach extend beyond just code reviews. Any AI agent or skill that we're continuously improving through prompt engineering needs this kind of behavioral regression testing. 
Whether you're building agents that write documentation, analyze security vulnerabilities, or generate test cases, the same problem applies: iterative improvements without regression testing lead to two steps forward, one step back—or sometimes one step forward, two steps back.",[17,67,69],{"id":68},"building-systems-that-actually-improve","Building Systems That Actually Improve",[10,71,72],{},"What makes this particularly interesting from a software engineering perspective is that it forces us to think about AI agents the same way we think about traditional software systems. We wouldn't dream of continuously modifying application code without a comprehensive test suite to catch regressions. Yet somehow, many teams are doing exactly that with their AI agents and prompts, making changes based on the most recent failure without systematically verifying that previous fixes remain intact.",[10,74,75],{},"In my work with agent orchestration systems, I've seen this pattern repeatedly: teams achieve impressive initial results with AI-powered tools, then struggle to maintain and improve those systems over time because they lack the proper testing infrastructure. They're essentially flying blind, making changes reactively without the feedback loops that would tell them whether they're actually moving forward or just trading one set of problems for another.",[10,77,78],{},"The solution I'm currently working on builds exactly this kind of behavioral regression test framework for code reviews. Each time we identify a code review that should have caught something but didn't, we add that case to our regression test suite. 
Before any prompt modification goes into production, it has to pass all existing regression tests—proving not just that it fixes the new problem, but that it maintains all the previous capabilities we've built up.",[17,80,82],{"id":81},"the-path-forward","The Path Forward",[10,84,85],{},"This is an interesting idea that I haven't heard many people talking about yet, but I believe it's going to become increasingly critical as more teams adopt AI agents in their development workflows. As these systems become more sophisticated and more deeply integrated into how we build software, we need equally sophisticated ways to ensure they improve over time without regressing.",[10,87,88],{},"The reality is that AI is fundamentally transforming the software development lifecycle and engineering practice. But transformation doesn't mean we abandon the principles that make software reliable and maintainable. If anything, working with non-deterministic AI systems requires us to be even more rigorous about testing, validation, and systematic improvement. Behavioral regression tests for AI agents are one piece of that puzzle—a way to bring the discipline of traditional software engineering to the new frontier of agentic development.",[10,90,91],{},"Right now, I'm actively building this regression test framework for our code review agents, and I'm documenting what works and what doesn't. The early results are promising: we're catching prompt regressions before they hit production, and we're building confidence that our improvements actually stick. 
More importantly, we're developing a methodology that can be applied to any AI agent or skill that requires ongoing refinement and improvement.",{"title":93,"searchDepth":94,"depth":94,"links":95},"",2,[96,97,98,99,100,101],{"id":19,"depth":94,"text":20},{"id":29,"depth":94,"text":30},{"id":42,"depth":94,"text":43},{"id":55,"depth":94,"text":56},{"id":68,"depth":94,"text":69},{"id":81,"depth":94,"text":82},"2026-06-12","Every fix to an AI code-review prompt risks silently breaking a previous one. Here's the case for behavioral regression tests — pinning a PR, a commit, and an expected flag — so your agents actually improve over time instead of trading one caught issue for another.","md",{},true,"/blog/regression-tests",{"title":5,"description":103},"blog/regression-tests","aGMfGtYTIUPMBmnQSmI8zcrJDjXN-6Oy3pujRuCWT2I",1781404852561]