Build the guardrails
The most important code you write when working with a language model is not the feature. It is the system that tells you whether the feature works. Without it, you are optimising on faith.
Early in a recent project I needed to build a library for processing Tailwind CSS classes. Parsing, simplifying, resolving conflicts, merging. The kind of work that is fiddly, combinatorial, and unforgiving of edge cases. A perfect candidate for LLM-assisted development — and a perfect candidate for disaster if approached carelessly.
So before I wrote a single line of application code, I went looking for test suites.
I found open-source projects with MIT licences that solved adjacent problems. I studied their tests. Not their implementations — their expectations. What inputs did they consider interesting? What edge cases did they bother to specify? What did correctness look like, expressed not as code but as a contract?
From this I assembled a conformance suite. Hundreds of examples. Inputs and expected outputs, harvested from the accumulated judgment of other engineers who had thought carefully about the same problem. This suite became the spine of the project. Every change, every generation, every refactoring — all of it ran against those hundreds of cases before I would even look at it.
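To make that concrete, here is the shape such a suite takes, sketched with Vitest. The `mergeClasses` function and the expectations are illustrative stand-ins of my own, not cases lifted from any real project:

```typescript
import { describe, expect, it } from "vitest";
import { mergeClasses } from "../src/merge"; // hypothetical function under test

// Each case is a contract: input in, expected output out.
// The real suite holds hundreds of these, harvested from adjacent projects.
const cases: Array<{ input: string; expected: string }> = [
  // A later declaration wins a conflict on the same property.
  { input: "p-2 p-4", expected: "p-4" },
  // Non-conflicting utilities pass through untouched.
  { input: "p-2 text-sm", expected: "p-2 text-sm" },
  // A broader utility later in the list overrides a narrower one.
  { input: "px-2 p-4", expected: "p-4" },
];

describe("conformance: mergeClasses", () => {
  it.each(cases)("$input -> $expected", ({ input, expected }) => {
    expect(mergeClasses(input)).toBe(expected);
  });
});
```

The value is in the table, not the harness. Each row is a judgment call another engineer already made.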
The language model did not write this suite. I did. That distinction matters.
When you work with a model, you face a problem that is easy to ignore and fatal to forget: the model will produce code that looks correct. It will name variables sensibly. It will follow patterns it has seen in training. It will pass the three tests you wrote because you only thought of three cases. And it will fail — silently, invisibly — on the fourth case, the one you did not think of, the one that only matters in production on a Tuesday afternoon when nobody is watching.
The guardrails are what save you. Not the model’s judgment. Yours, encoded in advance, expressed as a system that can evaluate without mercy.
A conformance suite is one guardrail. There are others.
I added a benchmark. Not for premature optimisation — for detection. When the model refactors a function, does it still perform within acceptable bounds? Without a benchmark you will not know until a customer tells you. The benchmark does not need to be sophisticated. It needs to exist, it needs to run automatically, and it needs to scream when performance drifts outside those bounds.
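Mine amounted to little more than this: a timing loop with a budget. The budget, the input, and the `mergeClasses` call are illustrative assumptions, and in practice the budget should be calibrated against a known-good build rather than plucked from the air as it is here:

```typescript
import { performance } from "node:perf_hooks";
import { mergeClasses } from "../src/merge"; // hypothetical function under test

const INPUT = "flex items-center px-2 p-4 text-sm text-gray-700";
const RUNS = 1_000;
const BUDGET_MS = 0.5; // per-call budget, an assumption for this sketch

// Run the hot path many times and take the median, which is less
// noisy than the mean on a shared CI machine.
const samples: number[] = [];
for (let i = 0; i < RUNS; i++) {
  const start = performance.now();
  mergeClasses(INPUT);
  samples.push(performance.now() - start);
}
samples.sort((a, b) => a - b);
const median = samples[Math.floor(RUNS / 2)];

if (median > BUDGET_MS) {
  // Exit non-zero so CI screams instead of shrugging.
  console.error(`benchmark FAILED: median ${median.toFixed(3)}ms > ${BUDGET_MS}ms`);
  process.exit(1);
}
console.log(`benchmark ok: median ${median.toFixed(3)}ms`);
```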
I added generative tests. Property-based testing, if you prefer the formal name — tests that produce random inputs according to rules and verify that invariants hold. These find the cases you did not think of. They are particularly devastating when aimed at AI-generated code, because the model’s failure modes are not the same as yours. You would never write a function that breaks on empty input. The model might, confidently, with a helpful comment explaining why.
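Here is a sketch using fast-check, a property-based testing library for TypeScript. The two invariants, idempotence and never inventing classes, are plausible properties for class merging; I am choosing them for illustration, not quoting the real suite:

```typescript
import fc from "fast-check";
import { expect, test } from "vitest";
import { mergeClasses } from "../src/merge"; // hypothetical function under test

// An arbitrary that builds plausible-looking utility class strings,
// including empty fragments so that whitespace edge cases get exercised.
const classArb = fc
  .array(
    fc.constantFrom("p-2", "p-4", "px-2", "m-1", "text-sm", "text-lg", "flex", ""),
    { maxLength: 12 }
  )
  .map((parts) => parts.join(" "));

test("merging is idempotent", () => {
  fc.assert(
    fc.property(classArb, (input) => {
      const once = mergeClasses(input);
      // Merging an already-merged string should change nothing.
      expect(mergeClasses(once)).toBe(once);
    })
  );
});

test("output never invents classes", () => {
  fc.assert(
    fc.property(classArb, (input) => {
      const inputSet = new Set(input.split(/\s+/).filter(Boolean));
      for (const cls of mergeClasses(input).split(/\s+/).filter(Boolean)) {
        expect(inputSet.has(cls)).toBe(true);
      }
    })
  );
});
```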
I added mutation testing. Take the code the model wrote. Change a condition, flip a sign, delete a line. If the tests still pass, the tests are lying to you. Mutation testing finds the gaps in your safety net. When you are generating code at speed, your safety net had better be honest.
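In practice a tool such as StrykerJS generates and runs the mutants for you. Hand-rolled, the idea looks like this; the `clamp` function and its deliberately thin tests are contrived so that one mutant survives:

```typescript
// Mutation testing, hand-rolled for illustration only.
type Clamp = (value: number, min: number, max: number) => number;

const original: Clamp = (v, min, max) => Math.min(Math.max(v, min), max);

// Each mutant makes one small, deliberate change to the original.
const mutants: Array<{ name: string; fn: Clamp }> = [
  { name: "swap min/max", fn: (v, min, max) => Math.max(Math.min(v, min), max) },
  { name: "drop lower bound", fn: (v, _min, max) => Math.min(v, max) },
  { name: "off by one", fn: (v, min, max) => Math.min(Math.max(v, min), max - 1) },
];

// A deliberately thin suite: it never probes the lower bound.
const tests: Array<(clamp: Clamp) => boolean> = [
  (clamp) => clamp(5, 0, 10) === 5,
  (clamp) => clamp(15, 0, 10) === 10,
];

for (const { name, fn } of mutants) {
  const survived = tests.every((t) => t(fn));
  console.log(`${name}: ${survived ? "SURVIVED, tests are lying" : "killed"}`);
}
```

Run it and "drop lower bound" survives, because no test ever passes a value below the minimum. The surviving mutant is the point: the suite cannot tell broken code from working code, so the suite is the thing that needs fixing.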
Each of these guardrails has the same purpose: to make quality observable. You cannot review every line of AI-generated code with the same attention you would give to code you wrote yourself. That is the truth of working at this speed. But you can build a system that reviews it for you — not with understanding, but with rigour. The model generates. The guardrails evaluate. You make the judgment call based on evidence, not hope.
This is where the scientific method enters the practice, though I will save that argument for the next post. For now, the principle is simpler: before you ask a model to build anything, build the system that will tell you whether it worked. Invest in the test suite before the implementation. Invest in the benchmark before the optimisation. Invest in the contract before the code.
The model is prolific. It will produce more than you can personally verify. That is the point — and that is the danger. The guardrails are how you scale your judgment to match the model’s output. Without them you are not engineering. You are generating, and hoping, and calling it progress.
Build the guardrails first. Then let the machine run.
This is the fourth in a series on LLM-assisted engineering practices. Previously: Your Screen Is the Bottleneck, Generate Less, and Stop Typing.
Next and final: why the scientific method is the only framework that survives contact with AI-assisted engineering — and why everything else is cargo cult.
Find me on LinkedIn or at [email protected].