The Scientific Method

Every framework for AI-assisted engineering will be obsolete within a year. The scientific method will not. It is the only discipline that survives contact with a field that changes faster than anyone can document.

The tools will change. I say this with the certainty of someone who has watched four generations of AI coding assistants arrive, be declared transformative, and recede into the background noise of daily practice — all within two years. The model you use today will not be the model you use in six months. The interface will shift. The capabilities will expand in directions nobody predicted and fail to expand in directions everybody expected. Whatever specific workflow you have built will need rebuilding. This is not a complaint. It is the condition.

So what survives?

Not prompting techniques. Not tool-specific configurations. Not the particular arrangement of panes in your editor, useful as that is. What survives is the method beneath the practice — the habit of mind that remains constant while everything else churns.

I have, throughout this series, described a set of practices. See what the model produces. Generate less per cycle. Speak your reasoning rather than compress it. Build evaluation systems before you build features. Each of these is practical and specific and — I must be honest — each of these may be wrong within a year. The tools may evolve to make large generations safe. Voice interfaces may become native to every editor. Test generation may become so reliable that hand-built conformance suites feel quaint.

But the method that produced these practices will still work. And the method is not complicated. It is the oldest reliable technique we have for finding out what is true.

Observe. Hypothesise. Test. Measure. Revise.¹

That is all. That is everything.

When I began working with LLM-assisted development, I did not start with best practices. There were none worth trusting. I started with questions. What happens if I generate a larger change? What happens if I generate a smaller one? Is the code better when I describe the reasoning or when I give terse instructions? Can I measure the difference, or am I just narrating a preference?

Each question became an experiment. Each experiment produced evidence. The evidence accumulated into practice — not because someone authoritative recommended it, but because I tested it against reality and reality cooperated.

The conformance suite from the previous post is a perfect expression of this. I did not decide that hundreds of test cases would be the right approach. I hypothesised that a rich test suite would allow the model to iterate more effectively. I built it. I measured the results — not in feelings, in passing rates across generations of code. The hypothesis held. So I kept the practice.
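To make the shape of this concrete: a conformance suite is nothing more than many small, explicit cases with a measurable passing rate. The function and cases below are hypothetical stand-ins, a minimal sketch rather than the actual suite from the previous post:

```python
# Hypothetical function under test: something small the model iterates on.
def slugify(text: str) -> str:
    return "-".join(text.lower().split())

# Each case is a concrete target. A failing case tells the model exactly
# what to fix on the next iteration; the passing rate is the measurement.
CASES = [
    ("Hello World", "hello-world"),
    ("  leading spaces", "leading-spaces"),
    ("Multiple   inner   spaces", "multiple-inner-spaces"),
    ("already-slugged", "already-slugged"),
]

def run_suite() -> tuple[int, int]:
    # Returns (passed, total) so you can compare rates across generations
    # of model-produced code, not just eyeball individual failures.
    passed = sum(1 for text, expected in CASES if slugify(text) == expected)
    return passed, len(CASES)

print(run_suite())
```

The point is not the specific cases; it is that "the hypothesis held" means a number moved, and the number came from somewhere like this.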

The benchmark is another. I hypothesised that models would sometimes produce functionally correct but dramatically slower code. I built a measurement. I was right — not always, but often enough that the benchmark earned its place. Had I been wrong, I would have discarded it. The willingness to discard is the part most people skip.
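The benchmark can be sketched the same way. Everything below is illustrative: the two implementations are hypothetical stand-ins for a hand-written version and a model-generated one, and the harness is just the standard library's `timeit`:

```python
import timeit

def sum_squares(n: int) -> int:
    return sum(i * i for i in range(n))

def sum_squares_slow(n: int) -> int:
    # Functionally identical, needlessly slower: the failure mode the
    # benchmark exists to catch.
    total = 0
    for i in range(n):
        total += int(str(i)) ** 2  # pointless round-trip through str
    return total

# Correctness first: a benchmark of wrong code measures nothing.
assert sum_squares(1000) == sum_squares_slow(1000)

# Then measure, on equal terms.
fast = timeit.timeit(lambda: sum_squares(1000), number=200)
slow = timeit.timeit(lambda: sum_squares_slow(1000), number=200)
print(f"slowdown factor: {slow / fast:.1f}x")
```

A harness like this is cheap to build, and that cheapness is what makes the discard step honest: if the slowdown factor had hovered around 1x, the benchmark would have been evidence against its own existence.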

This is where AI-assisted engineering departs from most of what passes for software methodology. The industry is addicted to frameworks. Imported practices, adopted wholesale, applied uniformly. Agile. SAFe. Shape Up. Each one useful in context, each one eventually calcified into dogma by people who adopted the form and never questioned the substance. They are cargo cults in the precise sense: they replicate the visible rituals of success without understanding the mechanism that made them work.

AI-assisted engineering cannot afford this. The field moves too fast for received wisdom. By the time a best practice is documented, shared at a conference, and adopted by your team, the underlying conditions have changed. The model is different. The failure modes are different. The practice no longer fits, but nobody notices because nobody is measuring.

The scientific method is immune to this failure mode. It does not depend on stable conditions. It depends on the willingness to observe what is actually happening, form a theory about why, test that theory, and abandon it when the evidence demands. It is self-correcting by design. Every other methodology is self-perpetuating by default.

I am not suggesting that every developer should run controlled experiments with statistical rigour. I am suggesting something more modest and more radical: treat your practices as hypotheses. When you adopt a new workflow, measure whether it works. When someone tells you that a technique produces better results, verify it in your own context. When a practice stops producing results, stop doing it, regardless of how many people on the internet swear by it.

This is harder than it sounds. We are pattern-matching creatures. We find something that works, and we repeat it. We build identity around our tools and methods. We become Emacs people, or Vim people, or Cursor people, and the identity outlasts the evidence. The scientific method asks you to hold your practices lightly — to value them for what they produce, not for what they represent.

Throughout this series I have described what works for me, today, with the current generation of tools. I have been specific because specificity is useful and vagueness is not. But I want to end with a caveat that is also a liberation: all of it is provisional. Every practice I have described is a hypothesis that has survived testing so far. When the tools change — and they will — I will test again. Some practices will hold. Others will not. The method endures. The details are negotiable.

That is the only honest position in a field this young and this volatile. Anyone who tells you they have found the permanent answer to AI-assisted engineering is selling something. The best you can do — the best anyone can do — is observe carefully, test honestly, and revise without sentiment.

The tools will change. The method will not. Start there.


This is the final post in a five-part series on LLM-assisted engineering practices. The full series: Your screen is the bottleneck, Generate less, Stop typing, Build the guardrails, and this post.

If you are an engineering leader trying to work out how your team should be working with AI tools, I would like to hear from you. Not to sell you a framework — I have just spent a thousand words arguing against those. But a conversation with someone who is doing this work daily and measuring the results might be more useful than another conference talk.

Find me on LinkedIn or at [email protected].

Footnotes

  1. Or in agile terms: collaborate, deliver, reflect, improve.