Software Estimations – why is it so hard?

by Ger Cloudt, author of “What is Software Quality?”

Image by OpenClipart-Vectors from Pixabay

Organizations like predictability in their development projects. High predictability enables, for example, Sales to sell what can actually be delivered. It enables the organization to negotiate contracts that can be fulfilled; if obligations are met, no sales are lost.
However, the predictability of software development seems to be low. Agile principles, like developing in small increments with a Potentially Shippable Product at the end of each sprint, are one way to address this predictability problem. Small increments enable frequent, on-time deliveries, though perhaps not with the wanted scope. This is less of a problem, because a next release, available soon, will contain the missing features.
Still, this incremental approach is not applicable to every kind of software development, and thus we are back at the predictability of software development.

In the development of multi-disciplinary embedded products, predictability is still important. Since the software is not released and updated as frequently as in a true Agile environment, the functionality available in a release becomes more important: if a feature is not included, the customer has to wait for the next release, which might take a considerable amount of time, or, even worse, the product might not be remotely upgradable at all.

Cone of uncertainty

In 1995 Boehm et al. presented the estimate-convergence graph, also known as the cone of uncertainty, stating that for any given feature set, estimation precision can improve only as the software itself becomes more refined.

Investing more time and effort in developing, understanding, and analyzing requirements, and even in building the software itself, will result in more accurate estimates, as the cone of uncertainty shows. But even when everything seems clear, complete, and understood, we will still face uncertainty that causes deviations from the plan. Uncertainty cannot be banished, and therefore predictability will never reach 100%.

Puzzle analogy

Then, how to explain that software estimations are difficult and imprecise? For this I would like to use the analogy of solving puzzles such as crosswords, cryptograms, or Sudokus.

If I wanted to explain the difficulties we experience in software estimation to somebody, I would ask that person to estimate how long it would take to solve a booklet of puzzles. Most likely you will get questions in return: What is the difficulty level of the puzzles? What kind of puzzles are you referring to? How many puzzles need to be solved? How big are these puzzles? Am I allowed to use a dictionary? Well… I do not know, but please provide me an estimate anyway. You can imagine the accuracy of such an estimate.
This question of estimating the effort needed to solve an unknown bunch of puzzles can be compared to the estimates typically asked for road-mapping purposes: the development team is given some one-liners describing the requirements and is asked for an estimate, so that a roadmap can be plotted.

Then we can take a next step and provide the actual booklet of puzzles. A look inside the booklet provides better insight, and most likely the initial estimate will be adjusted to these new insights.
The more time you spend looking at the puzzles and understanding their difficulty, the better your estimate to complete them becomes. In real software development this is comparable to collecting, understanding, and analyzing requirements, and perhaps doing some pre-development or prototyping here and there for high-risk areas.

To achieve an even better estimate, you could not only look at the booklet of puzzles but actually solve some of them: measure how long solving takes and count the remaining, not yet solved, puzzles. This is what we call measuring velocity and applying it to the remaining work to predict the delivery date.
You would expect a pretty accurate estimate, right?
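To make this velocity-based forecast concrete, here is a minimal sketch in Java; the numbers are invented for illustration and are not taken from the article.

// Naive velocity-based forecast: extrapolate the measured solving
// rate over the remaining work (illustrative numbers only).
public class VelocityForecast {
    public static void main(String[] args) {
        int solvedPuzzles = 12;       // puzzles completed so far
        double daysSpent = 6.0;       // time spent on them
        int remainingPuzzles = 38;    // puzzles still to be solved

        double velocity = solvedPuzzles / daysSpent;       // puzzles per day
        double forecastDays = remainingPuzzles / velocity; // naive extrapolation

        System.out.printf("Velocity: %.1f puzzles/day%n", velocity);
        System.out.printf("Forecast: %.1f more days%n", forecastDays);
        // 38 / 2.0 = 19 more days, accurate only as long as the remaining
        // puzzles resemble the ones already solved, which is exactly the
        // assumption the rest of this article undermines.
    }
}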

However, once several puzzles have been solved and velocity seems stable, I will come in, tear out a number of puzzles, and add others to the work to be done. This can be compared to requirements being changed and added during the project, which happens throughout development.

Another problem you will encounter during your puzzle solving is that you will suddenly hit puzzles of unexpectedly high complexity. If the puzzles solved so far lie in the complexity range of 3 to 5 stars, you will encounter some puzzles of much higher complexity, and your velocity will drop tremendously. Encountering these high-complexity puzzles can be compared to encountering difficult problems during development: hard or nearly impossible to reproduce and even harder to solve. You will also encounter puzzles that relate to previously solved puzzles; to solve such a new puzzle, you have to redo the previously solved ones.

And then we have not yet talked about external influences on your puzzle solving. What if you run out of pencils? Or I come into the room and replace all pencils with a cheaper type? Or your dictionary suddenly goes missing? Compare this to Corporate IT performing a security update on the network. If you are lucky the update is done over the weekend and does not affect your project, but there is a risk you will have problems on Monday.

If estimating puzzle solving is already hard… how about estimating software?

As you can see, estimating the solving of puzzles, an activity everybody can perform, is already hard and inaccurate. Now imagine the development of a big software system, which can only be done by highly educated engineers. How can we expect accurate estimates?

Don’t let Test Automation be the final nail in the coffin!

by Ger Cloudt, author of “What is Software Quality?”

Image by Michael Schwarzenberger from Pixabay

“We need an extra Test Engineer,” the Agile Master informed me on a sunny Monday morning. I had already noticed that the team needed more and more effort to keep the dashboard of the nightly build and tests green. More often, “reds” were reported in the morning, indicating that the build or a test had failed. Good… our short feedback loop seems to work: defects are detected early and thus can be addressed immediately.
However, before granting the Agile Master’s request, I wanted to better understand the root cause of the increase in reds.

Automate everything.

Clearly there is a lot of pressure on development teams to become faster, more efficient, and more predictable. Over time the software industry has addressed this demand, among other things, by changing processes and becoming Agile. One of the key principles for mitigating waste is to create short feedback loops. If something is done wrong and detected very fast, the waste is limited, simply because the mistake is relatively easy to fix. That is why we try to test as early as possible to catch defects as early as possible: we all know the relation between cost and the time gap between defect insertion and defect resolution. The earlier a defect is detected and solved, the cheaper it is.
Testing as early as possible implies a lot of testing, over and over again. That is why we started to automate testing: activities that are repeated over and over again should be automated, with reporting clear enough that failures are noticed immediately and without too much additional effort.

How about all these reds?

Before deciding on adding an extra Test Engineer, we needed a closer look at the root cause of the increasing number of reds in the nightly test run. The first question to ask was whether the reds were indeed caused by regressions in the Software Under Test. Analysis showed that this was not the case: failures in test cases, instability in the test framework and test infrastructure, and test cases that had not been updated clearly caused the increase. The next interesting question was what percentage of reds was caused by actual regressions in the product code. As it turned out, the majority of reds was caused not by regressions in the product software but by these other reasons.

Over the years the automated test bench grew: configurations were added, tests were added, the infrastructure changed, and apparently more and more effort was needed to maintain all these automated tests and everything related to them. To be honest, this should not be a surprise, because test automation is software development. And software is subject to Technical Debt (see “Help…, my software rots!”), so your test cases and test environment will be as well.

Gherkin scenarios

Let’s have a closer look at Gherkin scenarios, supported by tooling such as SpecFlow and Cucumber.
Gherkin is a business-readable language for describing behavior, which can be used to define executable test cases. A scenario consists of steps, using keywords like “Given”, “When”, and “Then” to describe a precondition, an action, and a result; each step is associated with a keyword. The steps of a Gherkin scenario require “glue-code” to address the Software Under Test. Tooling like SpecFlow generates a signature for each step: a method interface in a specific programming language such as C# or Java. The actual glue-code implementing the step, addressing the Software Under Test in the correct way, needs to be programmed by the Test Automation Engineer.
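As an illustration, here is a minimal sketch of such glue-code using Cucumber in Java; the scenario and all names are invented for this example, and SpecFlow in C# works analogously.

// Gherkin scenario (in a .feature file):
//   Scenario: Successful login
//     Given the application is started
//     When the user logs in as "alice"
//     Then the dashboard is shown

import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;

public class LoginSteps {

    @Given("the application is started")
    public void theApplicationIsStarted() {
        // glue-code: start or connect to the Software Under Test
    }

    @When("the user logs in as {string}")
    public void theUserLogsInAs(String userName) {
        // glue-code: drive the Software Under Test through its UI or API
    }

    @Then("the dashboard is shown")
    public void theDashboardIsShown() {
        // glue-code: assert on the observable result
    }
}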

Even though Gherkin is a simple language, you can still build a mess… and if you do not pay attention to good programming practices, you will build a mess even in Gherkin.

Therefore, even in Gherkin scenarios, you should think about defining generic steps, re-usable across multiple test cases, next to specific steps. Even in Gherkin scenarios, you should apply Clean Code principles such as meaningful names, naming conventions, and keeping steps small. Once I saw a step in a Gherkin scenario that resulted in a method with more than 15 arguments! Ouch… what does Clean Code by Robert C. Martin say about the number of arguments? What if you need to adapt this Gherkin scenario due to a new requirement? Imagine having many scenarios of this kind…
As you can build a mess in both your Gherkin scenarios and your glue-code, test cases contain Technical Debt, which might slow you down significantly and result in increasing numbers of reds in your automated test execution.
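One way to avoid such a 15-argument step, sketched here as a suggestion and not as the fix applied in the project described above, is to pass the values as a single Gherkin data table.

// Gherkin step with a data table instead of 15 inline arguments:
//   Given the following device configuration:
//     | baudrate | 9600 |
//     | parity   | none |

import io.cucumber.datatable.DataTable;
import io.cucumber.java.en.Given;
import java.util.Map;

public class ConfigurationSteps {

    @Given("the following device configuration:")
    public void theFollowingDeviceConfiguration(DataTable table) {
        // one argument instead of a 15-parameter method signature
        Map<String, String> config = table.asMap(String.class, String.class);
        // glue-code: apply the configuration to the Software Under Test
    }
}

Adding a configuration item now means adding a row, not changing a method signature.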

Even more code.

For testing we use Gherkin scenarios, glue-code, and unit tests. All of these are code. But there is even more code: build scripts, configuration code, and test framework code, for example. A lot of code exists outside the actual software delivered as a product or service, and this code needs to be maintained and changed as well as your product evolves. New components are developed, meaning build scripts have to be adapted and new test cases created. Existing test cases need to be changed, so there is always work to be done on code. For this reason, to accomplish a sustainable pace of development, this non-product code needs to be handled in the correct way, the same way as our product code: mitigating Technical Debt as much as possible.
Is your test framework actually designed, or did it grow without any design or structure? Is your test code under version control? Do you apply Clean Code practices to your test code and Gherkin scenarios? Do you apply static code analysis to your test code? Is your test code reviewed? Is your test infrastructure maintained? Do you track defects in test code? Are your Test Automation Engineers actual Software Engineers?

To summarize: non-product code is code as well, and Technical Debt applies not only to your product code but to all other code too. To keep your automated testing up and running without too much effort, one needs to apply proper software craftsmanship to all code, not only the product code. If not, more and more effort will have to be spent on analyzing “false-positive” reds, slowing down your regular development until it becomes the final nail in the coffin.

Speed versus Pace

by Ger Cloudt, author of “What is Software Quality?”

Photo by Patrick Robert Doyle on Unsplash

Last Sunday, June 20th, 2021, we witnessed maybe one of the most exciting Formula-1 races: a battle between the teams of Mercedes and Red Bull. The race clearly demonstrated speed versus pace, which can serve as a metaphor in software development, where management regularly asks for speed. The question, however, is whether management should ask for speed or for pace.

Qualification versus race

In a Formula-1 weekend, qualification is driven on Saturday to determine the start order of the drivers. The driver who drives the fastest lap in Q3 of the qualification gets the first and best start position: pole position. Clearly, qualification is about speed, being the fastest over only one lap.

However, during the race it is about pace; in Formula-1 they even call it race-pace. The race last Sunday was run over 53 laps, a total distance of approximately 310 kilometers, taking 1 hour and 27 minutes for the winner, Max Verstappen.

The fastest lap during the qualification was driven in 1 minute and 29.99 seconds; the fastest lap during the race in 1 minute and 36.4 seconds, clearly demonstrating the difference between speed and pace.

One of the reasons for the difference in velocity between qualification and the race is tire wear. The condition of the tires largely determines the grip of the car and therefore strongly influences velocity. At higher velocities the tires wear out faster, resulting in less grip and a lower velocity. That is why it is important for the drivers to manage their tires carefully, to avoid a steep decline in grip. An additional choice drivers have is to make a pit-stop and change tires. The associated time penalty is, depending on the circuit, approximately 25 seconds per pit-stop.

In last Sunday’s race the winner, Max Verstappen, had chosen a 2-stopper (two pit-stops) against the 1-stopper of Lewis Hamilton, who finished second. While leading the race, closely chased by Hamilton, Verstappen made his second pit-stop to change to a fresh set of tires, giving away the lead and taking a penalty of approximately 25 seconds. Hamilton continued without an additional pit-stop, resulting in a slower pace. Verstappen overtook Hamilton one lap before the finish.

Software Development

Let’s see whether we can discover a similarity between Formula-1 and software development. Like the tires in a Formula-1 race, software wears out over time. Imperfections, called Technical Debt, creep into your software, and through them your software wears out. As a consequence, the complexity of your software increases and your development velocity declines. Internal software quality is like the tires of a Formula-1 race car: both wear out, causing a decline in velocity.

In software development there are two ways of mitigating Technical Debt. The first is “doing things as they should be done”: applying good engineering practices such as proper and thorough requirements engineering, solid design practices, producing Clean Code, and testing efficiently and sufficiently to detect defects as early as possible. This way of mitigating Technical Debt can be compared with “careful driving” in Formula-1 to limit tire wear as much as possible. However, “careful driving”, like “doing things as they should be done”, yields a lower velocity than speeding as fast as you can. Initially you will be faster when speeding or taking shortcuts, but over time your tires or your software will be so worn out that you become slower and slower. The Formula-1 car becomes slower simply because the grip of the tires declines, possibly even ending in a blowout. Software development becomes slower simply because the complexity of the software increases, possibly even ending in a situation in which your engineers no longer dare to touch the code because they no longer understand it.

The second way of mitigating Technical Debt is “correcting imperfections” in the software: restructuring and/or refactoring your code, solving open defects, or adding missing test cases. This can be compared with the pit-stop to change tires in Formula-1. A penalty is taken: in software development, all the effort needed to perform the refactoring; in Formula-1, the time needed for the pit-stop. The result, however, should be less complexity in your software, such that development velocity increases, just as the pit-stop results in a fresh set of tires with high grip and a higher velocity.
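As a tiny, invented Java example of this second way: behavior stays exactly the same, while an imperfection is corrected so that future changes become cheaper.

public class Invoice {
    // Before the refactoring: an unclear name and a magic number.
    //   double calc(double a) { return a + a * 0.21; }

    private static final double VAT_RATE = 0.21; // assumed 21% VAT

    // After: a meaningful name, and the magic number given a meaning.
    public static double priceIncludingVat(double netPrice) {
        return netPrice * (1 + VAT_RATE);
    }

    public static void main(String[] args) {
        System.out.println(priceIncludingVat(100.0)); // prints 121.0
    }
}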

Pace versus speed, a balancing act.

In Formula-1, a race is a balancing act between speed and pace. Should the team go for a 2-stopper or a 1-stopper? How “carefully” should the driver drive to mitigate tire wear? The same questions can be asked when mitigating Technical Debt in software development. How much effort should be invested in re-designs or refactoring of the code? How much effort should be invested in analyzing the requirements, creating the perfect design, producing perfectly Clean Code, and testing to the maximum? It depends; it depends on how long your race will take, on how long you will need to develop and maintain your software. What is clear is that software development in general is a long-lasting activity that might take years or even decades. That is why we should focus on achieving a sustainable pace instead of a high speed.

Debugging – scope matters.

by Ger Cloudt, author of “What is Software Quality?”

Image by mohamed Hassan from Pixabay

When considering the core activities of a software development team, typically one would come up with the following: requirements engineering, design & modeling, coding & unit testing, functional testing, system testing, acceptance testing, and, not to forget, configuration management and tool management & support. However, it seems that on average a developer creates ~70 bugs per 1,000 lines of code, of which ~15 will find their way to the user of the software [1]. This implies there is another important and time-consuming activity, not yet mentioned, to be performed by a software development team: debugging.

Simply stated, debugging comprises the activities performed to root-cause and resolve a malfunction of the software; in short, solving that bug. Debugging might be one of the most underestimated activities of a software development team. According to Coralogix [1], solving a bug takes 30 times longer than writing a line of code. Reason enough to have a closer look at debugging, I would argue.

Execution Paths

Source code is executed sequentially by the CPU; the order in which the different statements are executed is referred to as the execution path. The figure below depicts the elementary execution paths for the possible control flows in source code.

Each circle is called a node and represents a statement. The arrows between these nodes are called edges and signify possible control flows. As an example, whenever an if-statement evaluates a condition, execution continues in one of two directions (one of the two edges), depending on the outcome of the evaluation. Thus, an if-then or an if-then-else statement has two possible execution paths.

The following figure shows a small program implementing an algorithm that searches for the smallest number occurring in both sorted arrays, together with a visualization of the possible execution paths through this program’s code.
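Since the original figure is not reproduced here, the sketch below shows, in Java, one way such a program could look. It is an approximation: the statement labels A through M belong to the figure and will not map one-to-one onto these lines, but the program produces the same outcomes as the traces discussed below (“2” for the correct version, “1” once the error is inserted).

public class SmallestCommon {

    // Find the smallest number that occurs in both sorted arrays;
    // returns -1 when the arrays have no number in common.
    static int smallestCommon(int[] array_A, int[] array_B) {
        int i = 0;
        int j = 0;
        while (i < array_A.length && j < array_B.length) {
            if (array_A[i] == array_B[j]) { // the statement called "E" below
                return array_A[i];          // smallest common number found
            }
            if (array_A[i] < array_B[j]) {
                i++;                        // advance in array_A
            } else {
                j++;                        // advance in array_B
            }
        }
        return -1;                          // no common number
    }

    public static void main(String[] args) {
        int[] array_A = {1, 2, 4, 6, 8};
        int[] array_B = {2, 3, 5, 6, 7, 8, 10, 11};
        System.out.println(smallestCommon(array_A, array_B)); // prints 2
    }
}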

If we execute this program with the following input data:

array_A = {1,2,4,6,8}
array_B = {2,3,5,6,7,8,10,11}

the following execution path would be executed, resulting in the correct outcome “2”:
A,B,C,D,E,G,H,J,K,L,D,E,F,K,L,D,M

Let’s insert an error

When there is a bug in the software, in most cases an unexpected execution path is executed. Suppose we introduce an error into the small program above by replacing statement E with

“if (array_A[i] != array_B[j])”

The execution path for the same input data would become
A,B,C,D,E,F,K,L,D,M
resulting in the wrong outcome “1”.

Debugging therefore focuses on understanding which execution path is executed in the error situation and where it differs from the expected execution path. To do so, one needs insight into the values of the different variables in the code at different stages of the execution, to understand why certain decisions are made during execution. Debug tooling may support this process with the ability to set breakpoints, step statement-by-statement through the code, and examine the values of variables at any moment. Once the actual, unexpected execution path is identified, the engineer needs to understand why it was executed before the bug can be fixed.
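When a debugger is not at hand, the same insight can be gained with simple trace output. The snippet below is a low-tech sketch, not a method prescribed by this article: the comparison step of the example program is instrumented so that each evaluation reports which edge of the control flow is followed.

public class TraceDemo {

    // A traced variant of the comparison step: every evaluation prints
    // the variable values and the branch taken, exposing the execution path.
    static boolean isCommon(int[] a, int[] b, int i, int j) {
        boolean equal = (a[i] == b[j]);
        System.out.println("compare a[" + i + "]=" + a[i]
                + " with b[" + j + "]=" + b[j]
                + " -> " + (equal ? "equal" : "unequal") + " branch");
        return equal;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 4, 6, 8};
        int[] b = {2, 3, 5, 6, 7, 8, 10, 11};
        isCommon(a, b, 0, 0); // prints the "unequal" branch
        isCommon(a, b, 1, 0); // prints the "equal" branch
    }
}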

Scope matters

Debugging the bug in this article’s example would be relatively simple due to the scope of the problem: the program has only one function, consisting of 13 lines of code, and would take less than half a page of paper if printed. A minimal scope, which decreases the complexity of debugging.

But how about a medium scope consisting of multiple functions, as in the following figure? Here we have a visualization of the execution paths of 11 functions consisting of 99 lines of code, which would take ~3 pages of paper if printed.

Somewhere in this code there is a bug… As you can imagine, it is more complex to debug and find out where the bug is located; it could be anywhere.

And how about a bug detected during system testing, or by a customer running the complete software program? A large scope consisting of, say, 100,000 lines of code? Thousands of functions, which when printed would be the equivalent of a pile of books like the one in the picture below.

Somewhere in this code there is a bug… It can be anywhere: in any book, in any chapter, on any page. You can imagine that debugging in a large scope is much more complex than in a medium scope or, even better, a minimal scope. As modern software programs are much bigger still than the pile of books in the picture, it becomes evident that we should focus on finding bugs in the smallest scope possible. That is why engineers should focus on unit testing, in which the scope is smallest, and try to catch the majority of bugs there. Only a few bugs should find their way into component or module testing, and even fewer into system and acceptance testing, simply because debugging becomes more complex as scope increases.


[1] https://coralogix.com/log-analytics-blog/this-is-what-your-developers-are-doing-75-of-the-time-and-this-is-the-cost-you-pay/

Follow the green-line when tempted to follow the red-line.

by Ger Cloudt, author of “What is Software Quality?”

Organizations strive for high productivity, which, in Agile software development, can be expressed as the velocity of a team. Teams with a high velocity are considered efficient. But is this true? Is it possible that a team with an apparently high velocity ultimately has a lower one? Yes, it is.

The cumulative velocity of a team results in what we call earned value: a cumulative representation of all story points earned through done User Stories. An earned value chart is often used in projects to show progress in completing User Stories. The slope of this chart is determined by the velocity of the team, and guess what… we would like to see a steep slope, indicating a high velocity, reaching the targeted delivery scope as soon as possible. Here lies a risk: the team may be tempted to show a high velocity by, for example, postponing the resolution of bugs. Because the later a bug is detected and solved, the more effort its resolution takes, postponing bug resolution to later stages of the project results in a long period of “maturing the system”, in which the unresolved bugs have to be addressed. This is what we call following the “red-line”: an apparently high velocity in finishing User Stories, followed by a relatively long maturing period.

We as software professionals, however, should not be tempted to follow the “red-line”. We should make sure that testing is sufficiently addressed and that found bugs are solved fast, resulting in a flatter slope (and thus an apparently lower velocity) such that the period of “maturing the system” is limited, which results in an earlier delivery date! Mitigating the defect-resolution gap as much as possible is what we call following the “green-line”.
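A back-of-the-envelope sketch in Java, with invented numbers, shows why the flatter slope can still win: the red-line team finishes User Stories faster but pays with a long maturing period, while the green-line team delivers earlier overall.

public class GreenLine {

    // Sprints needed to earn the full scope, as ceiling division.
    static int sprintsFor(int scopePoints, int velocity) {
        return (scopePoints + velocity - 1) / velocity;
    }

    public static void main(String[] args) {
        int scope = 120; // story points to deliver

        int redTotal = sprintsFor(scope, 20) + 5;   // steep slope, 5 maturing sprints
        int greenTotal = sprintsFor(scope, 15) + 1; // flatter slope, 1 maturing sprint

        System.out.println("Red-line delivery:   " + redTotal + " sprints");   // 11
        System.out.println("Green-line delivery: " + greenTotal + " sprints"); // 9
    }
}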

Another reason to get on, and stick to, the green-line is related to risk and uncertainty. Insufficient testing, for example, imposes risk that is difficult to estimate. Which tests have not yet been executed? How many bugs will result from these tests, and how much effort will it take to solve them? Good reasons to test as early as possible and solve defects as early as possible, resulting in a flatter slope of your earned-value line but a significantly shorter maturing period, and thus earlier delivery of your product. Therefore, as a manager, you should not ask your team about their velocity; you should ask whether the team is on the “green-line”!