I’ve been struggling for a metaphor to help explain the difference between statistical forecasting and estimation, and this one came to me, so let’s give it a whirl.

Let me express the scenario in some human-readable BDD (Gherkin) language:

**Given I am an orange juice shop owner**

and currently have no oranges

when I buy 25 boxes of oranges of varying sizes and varieties

then I need to forecast how many glasses of juice I can produce to sell

Let me now describe two different approaches, the first like normal agile estimation processes, the second like a scientist.

I close my shop to customers and gather my team together. I open the first box of oranges and hold up the first orange, showing them how plump it is and telling them it looks like it is probably a Belladonna variety orange (http://en.wikipedia.org/wiki/Orange_(fruit)#Common_oranges).

I ask them all to estimate how much juice the orange will yield. They each choose a planning poker card and I count them down, and they reveal their numbers. I ask them to discuss any outlying responses and then re-estimate until we get consensus. I note the number and move on to the next orange.

The second orange is a Berna orange and is quite small. We estimate that one. And so the long day goes on, with us repeating the process and selling no orange juice.

Eventually we finish estimating the first box, and we have a number which is our estimate of how much juice box 1 will yield. We have to make a decision: do we stop estimating and open the shop to sell some juice, or do we go on and repeat the same process for boxes 2 through 25?

Some juice shops stop at 1 box and use the estimated yield for that box as the "magic number" of yield per box. Other shops keep on going until all 25 boxes have been estimated one orange at a time, because they need a “more accurate estimate”. The downside of the accuracy is that we had to buy it by keeping the shop closed for 25 times as long.

What do I get at the end? Somewhere between 200 and 300 Story Points of oranges.

I open box 1, and count how many oranges are inside. I also open boxes 2, 3, 4, and 5 and count the oranges in each of them. 10 minutes later I open the shop for business and start my staff selling juice.

When a customer orders juice we measure how much juice the first 11 oranges actually yield. We don’t estimate, we measure.

**I now have enough data to make a pretty accurate forecast** of how much juice the boxes will yield. It’s only taken me 10 minutes.

Now for the sciencey bit. The German Tank Problem (http://en.wikipedia.org/wiki/German_tank_problem) is a famous bit of Bletchley Park Boffinery from the Second World War. To save you a bit of reading, the Allies wanted to know how many of a particular tank they were likely to come up against in France when they invaded. There were two ways of getting the forecast: Military Intelligence estimates, or statistics and probability. Lives depended on this, so it had to be correct.

Here is a comparison of the 2 methods used over time, and on the right, the actual numbers found out at the end of the war.

| Month | Statistical estimate | Intelligence estimate | German records |
|---|---|---|---|
| June 1940 | 169 | 1,000 | 122 |
| June 1941 | 244 | 1,550 | 271 |
| August 1942 | 327 | 1,550 | 342 |
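For the curious, here is a minimal sketch of how a statistical estimate of this kind can be produced. It uses the commonly quoted frequentist estimator for the German Tank Problem, N ≈ m + m/k - 1, where m is the highest serial number seen and k is the number of serial numbers observed. The function name `estimate_total` and the serial numbers are made up purely for illustration; they are not the wartime data.

```python
def estimate_total(serial_numbers):
    """Estimate the total number produced from observed serial numbers.

    Uses the textbook estimator m + m/k - 1, where m is the largest
    serial number seen and k is how many were observed.
    """
    k = len(serial_numbers)
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serial_numbers)
    return m + m / k - 1


# Hypothetical serial numbers read off four captured tanks.
observed = [19, 40, 42, 60]
print(estimate_total(observed))  # 74.0
```

Four observations are enough to produce a usable figure, which is the whole appeal of the approach.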

So the provenance for using this method is pretty good. Let’s apply it to the oranges.

The key to it is understanding that the maths lets us make very accurate predictions using a very small sample size. Indeed, the formula below shows how likely it is that the next item measured falls within our existing range of highest to lowest values, where k is the size of our sample:

% Likelihood = (1 - (1 / (k - 1))) * 100

So if we have 5 boxes of oranges with 17, 23, 16, 30 and 25 oranges in each, the likelihood of the 6th box having between 16 and 30 oranges is (1 - (1 / (5 - 1))) * 100, which is 75%. That is a 75% likelihood of the next box falling inside the currently known range, from a sample of only 5 boxes.

A sample of 11 gives us 90% likelihood of the next being within our known range. Credit goes to Troy Magennis for explaining this to me over a couple of *Weissbiers* at the Kanban Leadership Retreat in 2013.
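If you would rather let a computer do the sums, here is a small sketch of that rule of thumb exactly as stated in this post, (1 - 1 / (k - 1)) * 100. The function name `likelihood_within_range` is my own invention for illustration, and the guard against tiny samples is an assumption I have added because the formula divides by zero when k is 1.

```python
def likelihood_within_range(k):
    """Percent likelihood that the next sample falls inside the range
    seen so far, using the post's rule of thumb (1 - 1/(k - 1)) * 100.
    """
    if k < 2:
        # Assumption: the rule is undefined for fewer than two samples.
        raise ValueError("need at least two samples")
    return (1 - 1 / (k - 1)) * 100


for k in (5, 11):
    print(f"sample of {k}: {likelihood_within_range(k):.0f}%")
```

Running it for samples of 5 and 11 reproduces the 75% and 90% figures quoted above.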

So that’s a likelihood of 75% that we have between 16 and 30 oranges per box. That gives us a median of 23 oranges per box.

So, how much juice will we get per orange? For the 11 oranges I measured I got 79.1, 78.5, 71.2, 72.1, 65.2, 79.3, 73.2, 67.2, 65.0, 75.3, and 69.1 ml.

So that’s a 90% likelihood that the next orange yields between 65.0 and 79.3 ml. That gives us a median of 72.1 ml of juice per orange.

So I have 25 boxes, each with 23 oranges, each giving 72.1 ml of juice. I have a total yield of 25 * 23 * 72.1 ml = 41,457.5 ml, or roughly 41.5 litres of juice. I’m going to have to buy a lot more boxes to keep my shop in stock for the day. I’m glad I found that out early enough to get back to the wholesalers in time.
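Here is a rough sketch of those simple sums in Python, using the box counts and juice measurements from this post. The function name `forecast_litres` is hypothetical. Note that the true median of the eleven juice measurements is 72.1 ml (the midpoint of the range is 72.15 ml), so the total comes out at about 41.5 litres.

```python
from statistics import median

# Data taken from the post: oranges counted in 5 boxes, juice from 11 oranges.
oranges_per_box = [17, 23, 16, 30, 25]
juice_ml = [79.1, 78.5, 71.2, 72.1, 65.2, 79.3, 73.2, 67.2, 65.0, 75.3, 69.1]
BOXES = 25


def forecast_litres(boxes, per_box_sample, ml_sample):
    """Boxes x median oranges per box x median ml per orange, in litres."""
    return boxes * median(per_box_sample) * median(ml_sample) / 1000


print(median(oranges_per_box))  # 23
print(median(juice_ml))  # 72.1
print(round(forecast_litres(BOXES, oranges_per_box, juice_ml), 1))  # 41.5
```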

In the knowledge work world, we have to solve the same problem when forecasting work completion: we measure days per work item, and work items per “epic” or “MMF” or “MVP” or Project (whatever you call your orange boxes in your context). If you want real accuracy, instead of working out medians and using those you would plug the very same numbers into a Monte Carlo simulation of your system of work, and work out how many of the simulated project runs finish before each date. That is much more complicated to do, and requires a good old dose of processor power, but is far more accurate than my simple sums on the median values. However, even my simple sums are much more accurate than estimation, and generating them takes much less time away from doing valuable work.
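To give a flavour of the Monte Carlo idea, here is a minimal sketch applied to the oranges rather than to a real system of work. It is an illustration under my own assumptions, not a full forecasting tool: each simulated run picks an orange count for every box, and a yield for every orange, at random from the samples already measured. The names `simulate_total_ml` and `forecast`, the number of runs and the seed are all hypothetical choices.

```python
import random

# Samples from the post.
oranges_per_box = [17, 23, 16, 30, 25]
juice_ml = [79.1, 78.5, 71.2, 72.1, 65.2, 79.3, 73.2, 67.2, 65.0, 75.3, 69.1]


def simulate_total_ml(rng, boxes=25):
    """One simulated run: resample a count per box and a yield per orange."""
    total = 0.0
    for _ in range(boxes):
        count = rng.choice(oranges_per_box)
        total += sum(rng.choice(juice_ml) for _ in range(count))
    return total


def forecast(runs=1000, seed=1):
    """Return the 5th, 50th and 95th percentile totals (ml) over many runs."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    totals = sorted(simulate_total_ml(rng) for _ in range(runs))

    def percentile(p):
        return totals[int(p * (runs - 1))]

    return percentile(0.05), percentile(0.50), percentile(0.95)


low, mid, high = forecast()
print(f"5%: {low / 1000:.1f} l, 50%: {mid / 1000:.1f} l, 95%: {high / 1000:.1f} l")
```

Instead of a single number you get a spread, which lets you say things like "95% of the simulated runs produced at least this much juice". For project work the same shape applies, with work items per epic and days per work item in place of oranges and millilitres.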

If you’re running an IT software project with 25 epics, you’d break down the first 5 epics into stories (the epics you’re going to work on first anyway) to work out the stories-per-epic number. When you start work you can see that each story takes between, say, 2 and 9 days based on a sample of 11 stories… We have all the data we need to make a good forecast, and not one piece of estimation has occurred. Most of the data is derived from doing the actual work we need to do to finish the project. Which is ideal, as it means we are focusing on doing the thing that will get us finished, and the forecast is a secondary outcome, not a distraction from doing the work like with estimation.

Whenever I’m in a meeting talking about metrics, someone always brings up Velocity. If you’re familiar with Scrum you’ll know that in this context, velocity means the number of story points completed per sprint. If a team completes 4 stories each of which scores 5 story points in one sprint, the velocity of that team is said to be 20. If that doesn’t make sense, you should probably do some googling about now and come back when it does… don’t worry, I’m happy to wait for you. Look up planning poker while you’re on. ;-)

Ok ready to continue?

No, Velocity can never be a real metric, and to use it as such is to play a dangerous game. But let me explain why this is so, and if I’m taking it away, what you can replace it with (or use alongside) as a real metric.

It is my belief - although as yet I do not have any scientific data to back myself up - that numbers hold a special place in our minds. Numbers and mathematics are after all an abstract construct developed by humans as a language to explain science. It is my premise and belief that we see things that are expressed as numbers as things that we can control. If my mass is 125 kilograms, I can target losing 5 kilograms, then go on a diet and exercise regime, and track my progress against my measured mass on any day.

What madness is this? Well, I’m trying to show the difference between a real mathematical number like, say, 3.142 and a string of digits like 05555 555 555. You probably recognise both of those: one is a shortened form of Pi, which we use to work out areas and circumferences of circles, and the other is a fake UK phone number. However, doing maths makes sense with Pi; it doesn’t with a phone number. Imagine adding 44 to them.

x= 44+ 3.142, so x= 47.142

y= 44 + 05555555555 ….

While you can work out that y = 5555555599, it really makes no sense, what we really want to do is manipulate the string of digits to be +445555 555 555 - the international form of the phone number.
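The same point can be made in code. This is just an illustrative sketch: the helper `to_international` is a hypothetical function of my own, and it assumes a UK-style number written with a leading 0 and optional spaces.

```python
def to_international(uk_number, country_code="44"):
    """Turn a UK-style number with a leading 0 into international form.

    This is string manipulation, not arithmetic: drop the spaces,
    drop the leading 0, and put +<country code> on the front.
    """
    digits = uk_number.replace(" ", "")
    if not digits.startswith("0"):
        raise ValueError("expected a number starting with 0")
    return "+" + country_code + digits[1:]


phone = "05555 555 555"

# Treating the phone number as a real number: legal, but meaningless.
print(int(phone.replace(" ", "")) + 44)  # 5555555599

# Treating it as a string of digits: what we actually wanted.
print(to_international(phone))  # +445555555555
```

Both operations "add 44", but only the second one gives you something you can dial.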

So some numbers are real mathematical numbers and others, like phone numbers, are just strings of digits. We need to be very careful whenever we use something that looks like a number but is really a string of digits, because without an explanation people will most likely not understand the difference, and will try to use it as a real mathematical number and a control point.

Velocity is a tricky one as it looks like it should be a real number; after all, we can “measure” it for a team. But appearances can be deceptive. My old Physics teacher at secondary school once told me that “if the maths is ever confusing, look at the units and they will help you understand what is going on.” Wise words for maths, physics and velocity. Velocity is measured in “story points completed per sprint”, which might as well be “tulips completed per sprint”. The problem with that is that it is a malleable number.

What do I mean by a malleable number? Malleable means “easily influenced; pliable”. So let’s pretend I’m a Command and Control manager who is up against it for a project. I see Team A with a velocity of 22, and Team B with a velocity of 35. If I think these are real numbers, then in my mind they are control points, so I would be likely to ask the question, “How come Team B is so much **better** than Team A, and why can’t Team A do 35 points per sprint too?”

We, as the kind of people who read blogs like this, probably know innately that that question is a danger sign, but let’s follow it through.

Team A get pushed into working towards a target velocity of 35 points. So what are they likely to do in the next story grooming (refinement) meeting when they come to estimate their stories? I suggest that a story that would have been a 5 last time round will now be an 8 - or even a 13 point story. Why am I so confident? I’ve seen it happen in my own teams. But that is the least of the problems.

As a free gift with the instruction from manager to team there is a side order of “understanding that Management don’t seem to have a clue” which undermines the organisation, and pushes the team towards being more insular and deceitful.

It has a negative effect for everyone. Even if the team doesn’t “cheat” on estimation, they may well start “sand bagging” the sprint by working longer or extra hours. And of course that will artificially uplift velocity at first, but as people get tired, the velocity will actually drop (again I’ve seen that in action) as people stop working in a sustainable manner.

As soon as the thing we are trying to control is malleable, as human animals we can’t resist playing with it. A bit like a blob of blu-tac on your desk, it’s malleable and you can’t help but play with it. How do we solve that problem? We put the blu-tac away in a packet, drawer or cupboard. We need to keep the malleable stuff away from the people who want to play with it - so it is with velocity. Just like blu-tac, velocity is a useful tool when used correctly (like for deciding when to stop grooming stories at the story grooming meeting), but should be put out of reach the rest of the time. Use something which isn’t malleable as your metric - like how many days stories take to deliver. In the Kanban world we call that the Lead Time, and it is measured in days (or sometimes hours), a fixed unit of time. Unless we solve the problem of light speed travel, fixed units of time are Immutable Numbers.
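As a closing illustration, here is a small sketch of why lead time resists the games that velocity invites. All of the data is hypothetical: four made-up stories, each with the points the team would have given before being pushed towards a target, the points they give afterwards, and the dates the work actually started and finished.

```python
from datetime import date
from statistics import median

# Hypothetical stories: (points before pressure, points after pressure,
#                        date started, date finished)
stories = [
    (5, 8, date(2024, 3, 4), date(2024, 3, 8)),
    (5, 8, date(2024, 3, 5), date(2024, 3, 12)),
    (3, 5, date(2024, 3, 6), date(2024, 3, 8)),
    (8, 13, date(2024, 3, 7), date(2024, 3, 15)),
]

# Velocity depends on which set of estimates you believe.
velocity_before = sum(before for before, _, _, _ in stories)
velocity_after = sum(after for _, after, _, _ in stories)

# Lead time depends only on the calendar.
lead_times = [(finished - started).days for _, _, started, finished in stories]

print(velocity_before, velocity_after)  # 21 34
print(lead_times)  # [4, 7, 2, 8]
print(median(lead_times))  # 5.5
```

The same work "improves" from 21 to 34 points just by re-estimating it, while the lead times in days do not move at all.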