Stop Estimating. Start Sizing.

Ask a developer to estimate a task and they’ll think about time. How long will this take? A day? Three days? A week if things go sideways? This is natural, understandable, and the root of most problems with software project planning.

The word “estimate” is the problem. It invites temporal reasoning — and temporal reasoning about complex work activates every cognitive bias in the catalogue. Kahneman and Tversky documented the planning fallacy in 1979: people systematically underestimate completion time even when they know their previous estimates were wrong. Moløkken-Østvold and Jørgensen reviewed decades of estimation surveys and found that 60–80% of projects overrun, with average overruns of 30–40%. Formal estimation models (COCOMO, function points, the lot) don’t produce more accurate results than expert judgement.

The entire discipline of software estimation has failed to improve on asking someone who knows what they’re doing to guess. Perhaps we’re asking the wrong question.

Sizing is a different operation

I don’t ask teams to estimate work. I ask them to size it.

The distinction matters. We’re not trying to work out how long it’ll take to chew the bite — we’re working out how big a bite we’re taking. That’s a fundamentally different cognitive task, and one that humans are measurably good at.

You can look at two objects and tell me which is heavier. You can say with reasonable confidence that one weighs roughly twice as much as the other. What you can’t do — what nobody can do reliably — is tell me how long it’ll take you to carry the heavier one up four flights of stairs. That depends on your fitness, the staircase, whether you slept well, and a dozen other variables that aren’t properties of the object.

Weber’s Law, established in the 1830s and applied to software by Eduardo Miranda at Carnegie Mellon, formalises this: humans are significantly better at relative comparison than absolute magnitude judgement. Story points exploit this. When a developer says “this is a 5,” they’re not predicting the future. They’re making a relative complexity judgement against a known reference. That’s sizing, not estimation.

Complexity is what we’re actually measuring

When I ask a team to size something, I’m asking them to assess complexity. How many layers of the system does this touch? How much testing does it require? How well do we understand the problem space?

These are properties of the work itself — not predictions about how quickly someone will complete it. A task that touches three layers of the system with comprehensive tests is an 8, whether it’s picked up by someone who’s been in the codebase for a year or someone who started last week. The time will differ. The complexity won’t.

Points | Complexity                                     | Layers | Tests
1      | Trivial — single-file fix                      | 1      | Minimal
2      | Small — well-understood change                 | 1      | Few
3      | Moderate — one feature, one layer              | 1      | Some
5      | Significant — multi-file, multiple concerns    | 1–2    | Good coverage
8      | Large — cross-cutting, domain knowledge needed | 3+     | Comprehensive
13+    | You don’t understand this yet. Decompose.      | n/a    | n/a

The scale is Fibonacci (1, 2, 3, 5, 8), capped at 8. Not because Fibonacci is fashionable — most teams have cargo-culted it without understanding why it works — but because the non-linear gaps force meaningful distinctions. You can’t split the difference between 3 and 5. You have to decide: is this thing genuinely more complex, or am I just uncertain about it?

The ceiling is a decomposition trigger

Anything above 8 isn’t a size — it’s an admission that the work isn’t understood well enough to size. A 13 doesn’t mean “plan for two weeks.” It means “stop, break this apart, and come back when the pieces are each 8 or smaller.”

This is where the real value lives. The ceiling converts sizing into a forcing function for understanding. You’re not making a big task smaller for accounting purposes; you’re identifying that you don’t yet know what the task actually consists of.
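The ceiling rule is mechanical enough to sketch as a validation step. A minimal illustration, assuming a backlog represented as (task, points) pairs; the names and data here are hypothetical, not from a real project:

```python
# Sizing ceiling as a validation step: anything outside the allowed
# scale is not a size, it's a decomposition trigger.

VALID_SIZES = {1, 2, 3, 5, 8}  # Fibonacci, capped at 8

def check_backlog(backlog):
    """Return the tasks that must be broken apart before forecasting."""
    return [task for task, points in backlog if points not in VALID_SIZES]

backlog = [("rate limiting", 8), ("billing rewrite", 13), ("typo fix", 1)]
print(check_backlog(backlog))  # prints ['billing rewrite']
```

The point of the check is not bookkeeping: every task it flags is one the team has admitted it doesn't yet understand.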

From sizing to forecasting

Sizing tells you how much work you have. To say anything about when it’ll be done, you need a second ingredient: observed throughput.

Track what your team actually delivers week over week. Don’t declare velocity in advance (“a 3 is about two days” is the temporal reasoning problem smuggled back in through a side door). Measure it. After three to four weeks of data, a throughput number emerges that’s specific to your team, your codebase, and your current context.
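Measuring rather than declaring velocity can be sketched in a few lines. The weekly figures below are invented for illustration; the important part is reporting the observation with its uncertainty:

```python
# Observed throughput with its uncertainty, assuming a few weeks of data.
import statistics

weekly_points = [10, 14, 11, 13]  # hypothetical points delivered per week

velocity = statistics.mean(weekly_points)
# Standard error of the mean: shrinks as more weeks are observed.
uncertainty = statistics.stdev(weekly_points) / len(weekly_points) ** 0.5

print(f"velocity = {velocity:.1f} ± {uncertainty:.1f} points/week")
# prints: velocity = 12.0 ± 0.9 points/week
```

Four data points give a wide standard error; that width is honest information about how little you know so far, and it narrows on its own as the weeks accumulate.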

On a recent engagement, our reference task was a security hardening piece: rate limiting for authentication endpoints. Twenty-one files changed, 1,325 lines across database, middleware, configuration, tests, and CI. One developer, one working day. We called it an 8.

That gave us an initial velocity observation. Combined with the sized backlog, we could produce a forecast. But — and this is where my physics background makes me difficult to work with — a forecast without uncertainty is just a guess with more steps.

Uncertainty propagation

I was trained as a physicist. When you conduct an experiment, you don’t report a measurement without its uncertainty. Saying “the beam is 3.2 metres long” is incomplete. Saying “the beam is 3.2 ± 0.1 metres” is a measurement. The uncertainty is not an admission of failure — it’s information about the quality of your measurement.

The same discipline applies here. There’s uncertainty in the sizing (is this really a 5, or could it be an 8?). There’s uncertainty in the velocity (we’ve observed 12 points/week, but from limited data). These uncertainties propagate through to the forecast. You don’t get to pretend they don’t exist just because your stakeholder wants a date.

So instead of “we’ll be done by the end of April,” the forecast becomes: “At current velocity, remaining work places delivery in the range of 12–18 weeks. The range is wide because we have limited velocity data and 40% of the backlog isn’t specified yet. Here’s what actions collapse that range — more velocity observations, specifying the remaining work, or reducing scope.”

That’s less impressive on a slide. It’s also the only version that isn’t lying.
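One way to propagate both uncertainties honestly is a simple Monte Carlo: sample the backlog size and the velocity from their distributions, divide, and read off an interval. This is a sketch with invented numbers, not the article's actual model:

```python
# Propagating sizing and velocity uncertainty into a forecast range
# via Monte Carlo. All numbers are illustrative assumptions.
import random

random.seed(0)

remaining_points = 180         # sized backlog
sizing_sigma = 0.15            # relative: could each 5 really be an 8?
velocity, velocity_sigma = 12.0, 2.5  # observed, from limited data

samples = []
for _ in range(10_000):
    points = remaining_points * random.gauss(1.0, sizing_sigma)
    v = max(random.gauss(velocity, velocity_sigma), 1.0)  # guard the tail
    samples.append(points / v)

samples.sort()
low = samples[len(samples) // 10]       # 10th percentile
high = samples[9 * len(samples) // 10]  # 90th percentile
print(f"delivery in roughly {low:.0f}-{high:.0f} weeks (80% interval)")
```

The width of the resulting interval is the forecast's honesty made visible: shrink either input uncertainty (more velocity observations, better-specified work) and the range collapses accordingly.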

Sub-linear scaling

A common response to a wide forecast range is “let’s add people.” Two developers don’t produce twice the output of one. Coordination overhead, context-switching, and code review cycles compress the multiplier to roughly 1.7×. Three developers, about 2.2×. These are working assumptions that observed data will replace, but they’re better than the linear model stakeholders instinctively reach for.
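Those multipliers make a trivial lookup. A sketch using the working assumptions above, which observed data should eventually replace:

```python
# Sub-linear team scaling: working-assumption multipliers, not measurements.
SCALING = {1: 1.0, 2: 1.7, 3: 2.2}

def team_velocity(solo_velocity: float, n_devs: int) -> float:
    """Adding a developer adds less than one developer's worth of output."""
    return solo_velocity * SCALING[n_devs]

solo = 12.0  # hypothetical observed points/week for one developer
print(f"{team_velocity(solo, 2):.1f}")  # 20.4, not the linear 24.0
```

When a stakeholder asks "what if we double the team?", this is the arithmetic to put in front of them instead of the linear model.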

What I don’t do

No sprints. In a pull-based system, time-boxed iterations are ceremony. WIP limits and cycle time are the relevant controls.

No hours. Ever. Hours are temporal estimates wearing a thin disguise. They activate exactly the biases that sizing is designed to avoid.

No treating sizes as commitments. A size is a property of the work. A commitment is a promise about the future. Conflating the two produces sandbagging, corner-cutting, and dishonesty — none of which ship better software.

Three distinct things

The industry lumps sizing, velocity measurement, and forecasting together under “estimation” and then wonders why it’s so consistently wrong. They’re three distinct operations:

Sizing is a relative complexity judgement. Humans are good at this. Fibonacci points, anchored against a real reference task, with a hard ceiling that triggers decomposition.

Velocity is an empirical measurement. You observe it; you don’t declare it. It has an uncertainty that shrinks as you collect more data.

Forecasting is the combination of the two, with uncertainties propagated honestly. It produces a range, not a date — and the width of that range is itself useful information about how well you understand the work ahead.

I’ll be writing more about how I’m using these ideas — along with some language model tooling — to improve my product management processes. But the foundation is this: stop asking people to predict the future. Ask them to tell you how big the thing is. They’re good at that.