Monday, August 22, 2016

Probabilistic Forecasting - Effects Of Uncertainty On Predictions (aka Controlling Variability)

The variability in Monte Carlo predictions, or the range of predictions, is a direct result of the variability in the team's daily throughput. A team with a very consistent throughput will produce a "tighter" Monte Carlo result set. The results of simulations for teams that have a regular daily throughput will not change by much at different confidence levels. The results of simulations for teams that have fluctuating daily throughput will have more pronounced changes as we change confidence levels. Let's take the following two hypothetical teams as examples.
Both teams in this case have closed 30 stories over the course of 30 days.
Team A finishes one story almost every day, with a few days where they finish 2 stories. Their historical throughput graph looks like this - 

Here the horizontal axis is a timeline and the vertical axis is the number of stories done on that day. When we run Monte Carlo simulations (10,000 simulations) for this team the following results appear -

The percentage lines on the above graph are levels of confidence that help us interpret the graph. Based on the above results, Team A has a 95% chance of getting at least 24 stories done, 70% chance of 28 or more, and a 50% chance of getting 30 or more stories done over the next 30 days.

Now let us consider Team B, which has a more variable daily closure rate. The team tends to have some days when they close a bunch of stories and other days when they do not close any at all. Their throughput graph looks like this - 

Just like Team A, Team B also completed 30 items over the same time period.

Examining the results from Monte Carlo, we can see that there are many more possibilities and the numbers on the conservative end of the spectrum are much lower. Team B has a 95% chance of getting at least 16 stories done, 70% chance of 21 or more, and a 50% chance of getting 30 or more stories done over the next 30 days.
We can see that in both data sets the middle value is 30 stories, but the values at the higher confidence levels are much lower for the team with higher variability in throughput. We can conclude from this that in order to make predictions with high confidence that still help us deliver the best results from our teams, we need to control the variability in our throughput. The question though is - How can teams create systems that have lower variability so that we can make predictions with higher confidence?
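To make the comparison concrete, here is a minimal Python sketch of the same experiment. The two 30-day histories are invented for illustration (they are not the data behind the graphs above), and the helper names `simulate_totals` and `stories_at_confidence` are hypothetical. Both histories add up to 30 stories, but Team B's days swing between zero and four.

```python
import random

def simulate_totals(history, days, runs, rng):
    """Run `runs` simulations; each sums `days` throughputs drawn at random from history."""
    return sorted(sum(rng.choice(history) for _ in range(days)) for _ in range(runs))

def stories_at_confidence(totals, confidence_percent):
    """Story count that at least `confidence_percent` of the sorted totals reach or exceed."""
    index = len(totals) * (100 - confidence_percent) // 100
    return totals[index]

# Hypothetical histories, 30 days and 30 stories each (assumed data, not the post's).
team_a = [1] * 24 + [2] * 3 + [0] * 3                # steady: about one story a day
team_b = [0] * 18 + [1] * 3 + [2] * 3 + [3] * 3 + [4] * 3  # bursty: many empty days

rng = random.Random(2016)  # fixed seed so the run is repeatable
totals_a = simulate_totals(team_a, days=30, runs=10_000, rng=rng)
totals_b = simulate_totals(team_b, days=30, runs=10_000, rng=rng)

for percent in (95, 70, 50):
    print(f"{percent}% confidence: Team A {stories_at_confidence(totals_a, percent)}+ stories, "
          f"Team B {stories_at_confidence(totals_b, percent)}+ stories")
```

With these made-up histories the 50% values should land close together, while the 95% value for Team B should come out well below Team A's, which is the pattern described above.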

Probabilistic Forecasting - Forecasting In The Face Of Uncertainty - Multiple Items (Monte Carlo)

Multiple Item Forecasts

The rate at which stories get done can help us figure out what the capacity of a team is for a given period of time. We need to make sure that we do not fall into the Flaw of Averages trap though. We have to model the inherent uncertainty in our processes in order to make sensible predictions for a team. Apart from how long it takes for things to get done, uncertainty also presents itself in the form of how many things get done on a given day. Using the historical trends of how many things are getting done on a daily basis, we can model the future, assuming that the team will behave the way it has behaved in the past. This, in essence, is the Monte Carlo method. Monte Carlo uses data from the past to give us a probabilistic capacity for a team. We assume that if, say, in the past 30 days there have been 3 days when the team closed 2 stories, then there is a 10% (3 divided by 30) likelihood that on any random day in the future the team will close 2 stories.
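The "3 divided by 30" arithmetic generalizes to every daily story count in the history. The short Python sketch below shows one way to tabulate it; the function name `throughput_likelihoods` and the 30-day history are assumptions made up for the example.

```python
from collections import Counter

def throughput_likelihoods(daily_throughput):
    """Map each observed daily story count to the share of past days on which it occurred."""
    counts = Counter(daily_throughput)
    total_days = len(daily_throughput)
    return {stories: days / total_days for stories, days in sorted(counts.items())}

# Hypothetical 30-day history: 3 days with 2 stories, 20 days with 1 story, 7 days with none.
history = [2] * 3 + [1] * 20 + [0] * 7
likelihoods = throughput_likelihoods(history)

for stories, share in likelihoods.items():
    print(f"{stories} stories closed: {share:.0%} of days")
```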
Monte Carlo Method
The Monte Carlo method (one variation of it), as we use it, runs through the following steps -
  1. We determine a past range to use and a future range to predict.
    1. The range we select is usually in the order of a few weeks.
    2. We use the latest few weeks as we believe that the latest data is the best representation of future performance.
  2. For the first day in the future range that we are trying to predict, we randomly select a day from the past range.
    1. The throughput from the randomly selected day in the past is assigned to the day in the future range we are trying to predict.
  3. Step 2 is repeated for all days in the future that we are trying to predict.
  4. When all the days have been predicted using the past range, the total of all the throughputs assigned to the future days gives us one answer for how many stories the team can get done.
    1. We record this throughput as one possible result, which can answer how much capacity the team has for a given time period.
  5. We repeat steps 2 through 4 a few thousand times and gather the results of each of those simulations.
The results of these simulations can be spread over a wide range, depending on the variability in the team's processes (i.e., fluctuations in the number of stories closed on a daily basis) and the length of time we are trying to predict over. The numerous predictions, though, all represent possibilities that could come true. Based on the distribution of these possibilities, we can start stating the probability that we can get at least x number of stories done. For example, if 85 percent of our simulations told us that over the next 60 days a team can do 30 stories or more, we can say that we have 85% confidence that the team can do 30 or more stories. The same set of simulations might have 50 percent of our results be 40 stories or more. This would give us 50% confidence that the team can do 40 stories or more over the next 60 days. We can now plan according to the amount of risk we are willing to take. If we plan for more stories, at least we are aware of the amount of risk we are taking.
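The steps above can be sketched in a few lines of Python. This is only an illustration of the variation described here, not the tooling we actually run; the function names, the sample history and the fixed seed are all assumptions.

```python
import random

def simulate_once(past_throughput, future_days, rng):
    """Steps 2-4: give each future day the throughput of a random past day, then total them."""
    return sum(rng.choice(past_throughput) for _ in range(future_days))

def run_simulations(past_throughput, future_days, simulations=10_000, seed=None):
    """Step 5: repeat the simulation many times and keep every total."""
    rng = random.Random(seed)
    return [simulate_once(past_throughput, future_days, rng) for _ in range(simulations)]

def confidence_of_at_least(results, stories):
    """Share of simulations in which the team finished `stories` or more."""
    return sum(1 for total in results if total >= stories) / len(results)

# Step 1: a hypothetical past range of three weeks of daily throughput (assumed data).
past = [0, 1, 0, 2, 1, 0, 0, 3, 1, 0, 1, 0, 0, 2, 1, 0, 1, 0, 0, 1, 2]
results = run_simulations(past, future_days=60, seed=7)

for target in (30, 40, 50):
    print(f"{target} or more stories in 60 days: "
          f"{confidence_of_at_least(results, target):.0%} confidence")
```

Sampling whole days from the history, rather than fitting a curve to it, keeps the zero-throughput days and the occasional burst in exactly the proportions the team has actually shown.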

Next : Uncertainty And Predictions

Probabilistic Forecasting - Forecasting In The Face Of Uncertainty - Single Items

Single Item Forecasts

Our ability to estimate how long a single item is going to take to get done is largely overestimated, because we are bad at estimating :). One of the best ways to get a handle on the uncertainty, and to get a better idea of how long something might take to get done, is to actually start measuring how long items currently take. This is where the metric of Cycle Time comes in. The most common reason to lower cycle time is to increase throughput. If each item gets done faster, more items will get done over a period of time - sounds obvious. But another great reason for lowering cycle time is to lower variability. Going back to the commute example, if we timed our commute every day, we would be able to get an idea of what the distribution of our commute times is. Similarly, if we timed our stories and recorded how long it takes for them to complete, we could start making some sense of the variability. The graph below is the cycle time scatter plot for a team. It shows the number of days taken by stories that closed in the first quarter of 2015. Along the vertical axis is the number of days a story took and the horizontal axis is the timeline. Each dot here represents at least one story.

The horizontal lines running through the graph show what percentage of these stories took 12 days or less, 9 days or less and so on. Based on the above graph we can say that any story this team starts has an 85% chance of getting done in 12 days or less. Half the stories (50%) get done in 6 days or less. Now, when this team is handed a new story to work on, they can use this data to tell the story's customer how many days it will take, and with what level of confidence. This is a much better way of forecasting and "estimating" the amount of time a single item would take than taking averages or estimating points.
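The percentile lines on a scatter plot like this can be computed directly from recorded cycle times. Below is a minimal Python sketch using the nearest-rank method; the function name `days_or_less` and the list of cycle times are made up for illustration and are not the team's data.

```python
def days_or_less(cycle_times, percent):
    """Smallest cycle time such that at least `percent` of stories finished within it."""
    ordered = sorted(cycle_times)
    rank = -(-percent * len(ordered) // 100)  # ceiling of percent * n / 100
    return ordered[max(rank, 1) - 1]

# Hypothetical cycle times, in days, for stories closed in one quarter (assumed data).
cycle_times = [2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 8, 8, 9, 10, 11, 12, 14, 17, 23]

for percent in (50, 70, 85, 95):
    print(f"{percent}% of stories finished in {days_or_less(cycle_times, percent)} days or less")
```

The same function answers the customer's question for a new story: pick the confidence level first, then read off the number of days.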

Probabilistic Forecasting - Understanding Uncertainty

The subtitle of the book The Flaw of Averages is - "Why We Underestimate Risk In The Face Of Uncertainty". Let us talk a bit about uncertainty. Uncertainty comes in two flavours - single item and multiple item. Single item uncertainty refers to the variation in how long it takes for one thing to get done. An example of this would be the cycle time of a single story or the cycle time of a single feature. Multiple item uncertainty refers to how long it takes a set of single items to get done. This would be equivalent to figuring out the capacity of a team for the length of a release. In other words - How many stories can we get done in a given time frame?

Single Item Uncertainty

For those who commute to work - How long does it take for you to get to work? Let us say your answer is 15 minutes. Would you bet your paycheck on getting to work exactly 15 minutes after you leave home? If we started noting down the amount of time it takes every day to get to work, we would probably see a very random distribution of times (with a high likelihood that none of them is exactly 15 minutes). The reason for the random distribution of times is that there are inherent uncertainties in the process of commuting from one place to another. Traffic lights, errant drivers, accidents, heavy traffic and many other variables come into play. All these variables, which compound on each other, lead to the uncertainty that prevents us from knowing the exact time we will be at work. There are numerous parallels we can draw between commuting and getting a work item (story or feature) through our process and calling it finished. After years of having tried to do this, we know that asking a developer to estimate how long a story is going to take is a futile exercise. This is due to the variability built into knowledge work. We do not know what we will find when we try to implement something, much less how long testing it will take. The best we can do is give a range that we are comfortable with. Even beyond that, there is no direct predictability on when a build with the change will become available, or how long a change will sit waiting for someone to pick it up to test. The greatest sources of uncertainty, though, are external dependencies. Whether at story or feature level, the completion of a work item that cannot be done by one team is very difficult to predict.
The one thing we can say for sure is that the larger a work item is, the more unpredictable it will be. It is again analogous to a trip in the car. A "15 minute" trip can be variable by 10 minutes, most of the time, but a "4 hour" trip can easily be variable by 1-2 hours. When something looks like it will take a long time, it is more susceptible to variability and the completion time for it is increasingly uncertain.

Multiple Item Uncertainty

Imagine we were holding a chess competition to figure out the best chess player among a group of 10 players. The structure of the tournament means that we need to have 20 games in order to crown a champion. Each game can take anywhere from 15 minutes to 3 hours. We have 2 chess boards. How many hours do we need to schedule in order to make sure we have enough time to finish 20 games? This is the multiple item problem that we try to resolve every day when we try to figure out how long a particular functionality or feature is going to take to develop. Looking at this another way - How many games can we play in a 24 hour schedule? That is the same problem as trying to figure out how many stories we can get done in a release. As you can see, the fact that each game itself can be of a variable length makes it very hard to predict the completion time of all 20 games. We can easily be drawn into Flaw of Averages thinking here. There are also dependencies between games. The earlier round games have to be played before games in a later round. Playoff games have to be played in a particular sequence. This would make our job harder. If we do not have players available at certain times when they are scheduled to play, it would add more delay to the tournament as well.
What makes this a hard problem to solve is that there are multiple answers. The total time for the tournament (just like the car trip) is not just one number, but multiple possible numbers. The trick is to figure out the probability of each of those numbers. The easiest way to do this would be to play the tournament thousands of times and record how long it took every time, so that we can use that data to determine probabilities for different lengths of time.
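We cannot play a real tournament thousands of times, but a computer can. The Python sketch below is a deliberately simplified model and rests on assumptions the example above does not state: game lengths are drawn uniformly between 15 and 180 minutes, games are independent of each other, the round and playoff dependencies are ignored, and the next game starts the moment a board frees up. The function names are hypothetical.

```python
import random

def tournament_minutes(games, boards, rng, shortest=15, longest=180):
    """Play one simulated tournament and return the minutes until the last game ends."""
    free_at = [0.0] * boards  # minute at which each board next becomes free
    for _ in range(games):
        board = free_at.index(min(free_at))  # next game goes on the first board to free up
        free_at[board] += rng.uniform(shortest, longest)
    return max(free_at)

def hours_needed(confidence_percent, runs=10_000, seed=None):
    """Hours to schedule so that `confidence_percent` of simulated tournaments finish in time."""
    rng = random.Random(seed)
    totals = sorted(tournament_minutes(20, 2, rng) for _ in range(runs))
    rank = -(-confidence_percent * runs // 100)  # nearest-rank percentile
    return totals[max(rank, 1) - 1] / 60

for percent in (50, 85, 95):
    print(f"{percent}% confidence: schedule {hours_needed(percent, seed=1):.1f} hours")
```

Adding the dependencies between rounds or player availability would push the numbers up and widen the spread, which is exactly why a single "average" schedule is risky.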

Next : Forecasting Single Items

Probabilistic Forecasting - The Flaw Of Averages

The Big Question

Whenever we are planning out a release, the first question that we try to answer is - What is the team's capacity? Similarly, when we start work on a feature or story, the first question is - How long will this take? We have traditionally tried to answer these questions in multiple ways, with very little attention to the fact that there are multiple possible answers. Each of these answers has its own probability of being correct. We need tools and systems that help us figure out what these answers are and what their probabilities of being correct are.

The Flaw Of Averages

Our traditional approaches to projecting capacity and tracking progress have been fraught with errors. We have in the past used techniques like average number of stories, average number of points, or even stories done in the last release to determine what we believe is the capacity of the team for a given release. These approaches suffer from a few major flaws. The greatest of these flaws is that an average, at best, tells you that there is a 50% likelihood of doing the same amount of work or more and a 50% likelihood of doing less. For a greater level of detail on why "Plans based on average fail on average", please read the fascinating book "The Flaw of Averages".
This calls for a better way to predict and track capacity. A method that does not just give us prediction in the form of a date or a number, but also the likelihood of that prediction being correct. We need something that is better than average, something that lets us know what is the level of the risk we are taking when we say that a team will be able to get X number of work items completed. Every prediction has the possibility of being wrong, hence, when we make a prediction, we should acknowledge what is the probability of that prediction being correct and conversely the probability of the prediction being wrong.
The Flaw of Averages does not preclude us from using the past as a predictor of the future. Past performance is the best baseline we have to project future performance. There are better ways to do this than traditional methods like gut feel estimates and averages. A detailed look at our past performance can reveal the level of uncertainty that exists in our system. An understanding of this uncertainty can give us the ability to make better predictions. Modes of forecasting that boil predictions down to a single number are flawed because they don't take the random nature of software development into account. Before we get into modes of prediction, let us understand the nature of uncertainty in software development.

Next : Understanding Uncertainty

Thursday, June 30, 2016

The Audacity of the Strategy of Hope

Typing the words "Hope Meaning" into Google (since Google knows everything), gives you the following definition of hope -

Whether hope is being used as a noun or a verb, the implication for projects is the same. There is always a degree of hope involved in projects that are either being planned or are already active. Projects start with a certain level of confidence of hitting the desired end date. The point at which our confidence in making the date falls below our initial confidence is when we start employing the "Strategy of Hope".

A soccer team that is up 2-0 at half time is operating in the second half with confidence because they have a very high likelihood of winning the game. The percentage of games won from this position is very high. On the other hand, the losing side at the same point is operating (amongst other things) under the strategy of hope. According to, "a mere 1.8% of 2-0 leads have been squandered into defeats, whilst 93% have been converted into wins". Despite knowing these percentages, we still see the hope bias, both in sports and software development. We always believe that we can be the 1.8% and do not realize that only 1.8% can be the 1.8%.

The language used by development teams and managers starts to change the more they start to rely on hope. Words that represent a high degree of certainty, like "definitely" and "will", are replaced by words that represent more hope than certitude. Phrases containing words similar to "if", "may" and "hopefully" become the norm. This usually lasts till the last day of the project, when the certainty comes back. This time though, it is "definitely not" and "will not" that are the most commonly used phrases. Hope turning to despair in the last throes of a software project is all too common an occurrence.

I believe that traditional application of hope in projects follows the graph as shown below. The reliance on hope comes in multiple forms. The most common form, which I have been guilty of as well, sounds a bit like this - "We have been going through a rough patch due to [insert reason here], we expect to get over the hump soon and make up ground". It is usually not till the release date or the day before the release that we finally admit that there was too much ground to make up.

Saying that it looks bleak, but we will definitely get it done, is usually just delaying the point of despair. There are two things that commonly happen as we approach the point of despair. First, the team starts working overtime to get the impossible done. Second, the team takes shortcuts to get the work done. Both of these avenues are likely to result in quality issues and creation of tech debt. In circumstances where the date or scope is negotiable, the information regarding a delayed release or delayed features arrives too late, causing loss of credibility and possibly revenue.

How do we escape the hopeless Strategy of Hope (alright, it is not completely hopeless, every once in a while it does work)? Whenever we see a project that is starting to rely more on hope and less on what the team's actual current performance tells us, it is time to self correct. Avoiding reliance on hope entirely is not possible; there is always a degree of hope in every endeavour. If you have planned the project start and end dates knowing the confidence level you have of completing that project on time, and the confidence starts getting replaced by hope - that is where you readjust scope or time. Set expectations early and often.

We live in a probabilistic world. Every software professional will tell you they do not really know exactly how long it will take them to fulfill a single request and get it into production. I am not sure why we have the confidence to know exactly how long it will take to get numerous requests into production. Setting end dates using probabilistic forecasting measures and not being deterministic about a release date and release contents is accepting reality. Acknowledging reality and shedding the reliance on hope is not taking the easy way out, it is simply doing business properly. Acknowledge your reality early and acknowledge it repeatedly as often as you can.

We use Monte Carlo simulations (more on these in another post) to understand our reality when it comes to releases. We run these simulations for our releases every hour, for each team, to find out where the teams stand and what kind of progress they are making. The simulations give us a percentage probability of the team being able to finish all the items in the release by the release date, based on their recent performance. This means that we find out very soon whether our confidence in the completion of the release has dropped below an acceptable level. Invariably, when the confidence in the release goes below a desirable point, the Strategy of Hope starts to make its appearance. This is the point where we try to look for alternatives to hope. We look to realign expectations as early as possible rather than waiting for the last part of the release. Teams should aim to avoid the disappointment brought about by despair at the end of a release.
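As a rough illustration of that kind of check (not our actual tooling), the Python sketch below estimates the chance of finishing the remaining items by the release date from recent daily throughput. The function name, the sample throughput, the item counts and the 80% threshold are all assumptions invented for the example.

```python
import random

def chance_of_finishing(recent_throughput, items_left, days_left, runs=10_000, seed=None):
    """Share of simulated futures in which the team closes at least `items_left` items in time."""
    rng = random.Random(seed)
    finished = 0
    for _ in range(runs):
        done = sum(rng.choice(recent_throughput) for _ in range(days_left))
        if done >= items_left:
            finished += 1
    return finished / runs

# Hypothetical inputs: three weeks of daily throughput, 25 items left, 20 working days to go.
recent = [1, 0, 2, 1, 0, 1, 3, 0, 1, 1, 0, 2, 1, 0, 1]
confidence = chance_of_finishing(recent, items_left=25, days_left=20, seed=11)

ACCEPTABLE = 0.80  # assumed threshold; each team would choose its own
print(f"Chance of finishing the release on time: {confidence:.0%}")
if confidence < ACCEPTABLE:
    print("Confidence is below the acceptable level - time to realign scope or date.")
```

Rerunning this as new daily throughput arrives is what surfaces a drop in confidence early, instead of on the day before the release.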

Monte Carlo is not the only way to do this of course. Whatever your metric/simulation/gut feel is that shows that hope has become the operative method, use it to detect the point where you readjust expectations. Lower your tipping points for despair, so that you do not have to wait too long to realize that your expectations are not aligned with reality. Whenever your reliance on hope exceeds a certain point, realign expectations if possible. If realigning is not possible, be prepared to deploy the emergency fixes immediately after the release and carry the tech debt forward for eternity.

Strategy of hope is probably best left to soccer teams that are 2-0 down at the half. They are waiting for the miracle in the second half. They are carrying the "feeling of expectation and desire for" that miracle. It happens once in a while, that a team that has not scored a goal in a half, will score three in the second half, but do you really want to bet on that miracle?

Friday, May 20, 2016

Mathematical Boondoggle

I spent the last week in San Diego at the Lean Kanban North America conference, followed closely by the Kanban Leadership Retreat. All the big names and thought leaders in the Kanban community were there, or at least all the ones that have not been banned from the LeanKanban Inc. events. I had no idea conferences had banned lists. After finding out about this, I firmly believe that any speaker/writer/thought leader on the conference circuit has not done enough innovative work unless they have gotten themselves banned from a conference. I looked forward to a week of exciting and invigorating conversations along with some quality team building time with the development managers at Ultimate Software. The trip lived up to the latter for sure; the antics of the Ulti-Managers will have to be a blog post on a different blog. The former, on the other hand, was a bit of a disappointment. There were definitely some bright spots. Some sessions and conversations were definitely enlightening, but a lot of it was filled in by what I can only call Mathematical Boondoggle.

Look At All The Maths I Have

Since most Kanban implementations are metrics driven, the community seems to have a lot of conversations around them. In fact, there seems to be a mathematical arms race amongst the leading voices in the community. There are multiple curve-fitting conversations and detailed analyses of Cost of Delay curves that are being explored and presented. There is nothing wrong with the concepts or with theorizing about them, but what is missing is the practicality of their application. Over the past few days I have been scratching my head to figure out the dots that might not be getting connected in my head. Maybe my education in the Kanban practice has been so close to practical application and use of real data that making the data fit a Weibull or a Log-Normal curve in order to do forecasting and sampling makes no sense to me.

Both Troy Magennis and Frank Vega presented forecasting techniques (Monte Carlo) which used real data to generate forecasts and probabilities. That was very heartening and reaffirming to see. They did not try to fit the data into any named statistical curve. They took the data as it was and used the knowledge that every individual data point is just as likely to occur again to simulate what the results would be. They did not make any attempts to figure out the tail of the distribution or what the shape factor of the Weibull distribution was. I know Troy Magennis has in the past (and probably still does) espoused the use of the Weibull distribution, but his session included no mention of it.

There was a good amount of SAFe bashing (not that I am a fan of SAFe), especially around the way SAFe uses estimated, and hence not real, data to calculate Weighted Shortest Job First. This was in the middle of the Cost of Delay discussion. The irony was strong in this discussion and it appeared to be lost on most of the folks in the room. Folks were ridiculing a formula that uses estimates and makes multiple assumptions in the middle of a discussion where they were espousing a technique (Cost of Delay calculation) that starts with multiple estimated values and multiple assumptions regarding value curves.

A group of us had the same conversation at the dinner table with a few folks, and at least that group seemed to agree that Cost of Delay is almost impossible to calculate because of the inherent problems with estimating the value and the decay of that value over time. The only practical voice in the Cost of Delay conversation came from Klaus Leopold, who said that the value conversations should only happen at the feature level, not above or below that. I will have to have an in depth conversation with Klaus at some point to understand how he is calculating the value though. It just does not seem like something that can be easily determined, or that determining it does not itself add to the cost.

I See Your Theoretical Math And Raise You Practicality

I left the Leadership Retreat disappointed in one major fact. The community that has valued and espoused using real measurements to make decisions, and collecting actual data for analysis, has moved very far in the direction of adding layers on top of the data in order to analyze it. The same community is talking in terms of estimates when we have espoused real data since the beginning. It increasingly seems to be mathematical boondoggle coming through from the leaders of the Kanban community (at least the ones not banned from the conferences).

Apart from Troy and Frank's presentations and the dinner conversation, I had at least one more reaffirmation from another attendee of the conference and the retreat. We happened to run into each other at the San Diego airport, and did a quick sharing of notes. He was frustrated with the theoretical nature of the discussions as well. In his words (and gestures) - "The discussions were up here (raises his hand up above his head) and the reality is down here (lowers hand to his knee)".

There is a lot of good work left to be done on the practical reality of software delivery before we go so far into the land of theoretical mathematical boondoggle. Hopefully in the future these gatherings are more about actually learning the applied concepts that work and figuring out which ones do not work. It would be great to be able to share experiences and learn from others about what is working and what is not, as opposed to the math devoid of practical applications or examples (hence boondoggle). Also, boondoggle has once again become one of my favourite words.