The biggest IT-fail ever.

“The biggest IT-fail ever.” That is what Elon Musk called it on X. He was referring to the worldwide problems on July 19th, 2024 caused by an update of CrowdStrike software on Windows systems, bringing down airports, financial services, train stations, stores, hospitals and many more. Around 8.5 million Windows devices were hit[1] by this software update, resulting in crashing computers displaying the infamous Blue Screen of Death (BSOD).

CrowdStrike had released an update impacting Windows-operated systems worldwide. Soon a workaround was issued, which required manual intervention on each individual computer: reboot the crashed computer in safe mode and subsequently delete (or rename) the “C-00000291*.sys” file with timestamp ‘0409 UTC’ in the folder C:\Windows\System32\drivers\CrowdStrike[2].

CrowdStrike

CrowdStrike is an independent cybersecurity company providing security-related services. One of its products is the so-called “Falcon Sensor”, which runs in the Windows kernel and analyses connections to and from the internet to detect possibly malicious behaviour.

From the workaround as published by CrowdStrike, one could already conclude that the update of this specific .sys file caused the failure.

Preliminary Post Incident Review

Approximately one week after the failure, CrowdStrike published a preliminary Post Incident Review[3] (pPIR), from which we learn that the Falcon Sensor software uses two types of content: ‘Sensor Content’ and ‘Rapid Response Content’. As explained in the pPIR, Sensor Content is released and deployed in combination with an update of the Falcon Sensor itself. Rapid Response Content contains configuration data that may be updated from the cloud independently of the Falcon Sensor.

Additionally, the pPIR explains how the software and data are tested. The release process of the Falcon Sensor, including Sensor Content, starts with automated testing such as unit testing, integration testing, performance testing and stress testing, followed by a staged rollout that includes dogfooding at CrowdStrike itself. Before being made generally available to customers, deployment is done toward early adopters.

No complex scenario needed

Considering the number of computers that crashed on July 19th, 2024, one would expect the fault to have been detected during the first stage of deployment, i.e. at CrowdStrike itself during dogfooding. No specific, complex scenario had to be executed to crash the system; one can conclude that whenever the Rapid Response Content file was updated, the crash was inevitable. So, apparently, the test process for Rapid Response Content is different and excludes any staged deployment.

Rapid Response Content

To understand the test process of a Rapid Response Content update, one needs to understand that Rapid Response Content is generated by so-called Content Configuration System software running on the Falcon platform in the cloud. Part of this Content Configuration System is the Content Validator, a piece of software which validates the generated Rapid Response Content.

Blame the test

On July 19th, 2024, the Rapid Response Content files were deployed into production after successfully passing the Content Validator. According to CrowdStrike’s pPIR, problematic content in the Rapid Response Content, which was not detected as such by the Content Validator, resulted in an out-of-bounds memory read triggering an exception.

Apparently the Content Validator was the only test performed on Rapid Response Content before deployment into production, and in this particular case it did not detect the error in the Rapid Response Content to be published. No higher-level test, exercising the combination of the Falcon Sensor and the Rapid Response Content, seems to have been performed.

Reading the pPIR it feels like “blame the test” for not finding the bug: the Content Validator failed to detect the bug and no higher levels of testing were performed.

Let’s try to visualize the Falcon Sensor, its Content files and the Content Configuration System in the figure below.

Preventive Actions

In the pPIR, CrowdStrike presents preventive actions to prevent this from happening again. In addition to more extensive testing of Rapid Response Content, improvements to the Content Validator and a staged deployment of Rapid Response Content will be introduced. This sounds reasonable, but one could ask why these measures were not already in place. In a proper root cause analysis, like an 8D, this question will be asked.

One preventive action, however, caught my attention: “Enhance existing error handling in the Content Interpreter.” This is because earlier in the pPIR it was mentioned that the unexpected out-of-bounds exception could not be gracefully handled. I have the feeling that a formulation like “the unexpected out-of-bounds exception was not gracefully handled” would be a better one, and that enhancing the existing error handling will encompass handling this unexpected exception. However, without knowing the details of the code, this is speculation.

Problem Areas

Thinking about this problem and what we can learn from it with the information available so far, I come to three problem areas:

  1. Testing
  2. Coding
  3. Configuration Management

Considering testing, I do agree with CrowdStrike’s analysis in the pPIR. They explain the test process and propose preventive actions to improve it. However, as already mentioned, I think they need to understand why these rather obvious preventive actions for Rapid Response Content were not in place already.

About the coding part, I would be very curious to see the actual code where the out-of-bounds exception happens. Is the memory location accessed immediately, or is there any form of validity check on the memory address before it is actually accessed? Is there any mechanism, like a catch (exception handler), in place to catch the thrown exception? What will the announced enhancement of the existing error handling look like? What can the engineers learn regarding unexpected exceptions in other places of the code?
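Purely to illustrate the kind of defensive coding these questions hint at, here is a minimal sketch of my own; it is not CrowdStrike’s code, and names like ContentEntry and read_field_checked are made up:

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical content record: a list of field values read from a data file.
struct ContentEntry {
    std::vector<std::uint32_t> fields;
};

// Unchecked variant: if 'index' comes from external content and is too large,
// this is an out-of-bounds read and, in kernel code, a crash.
std::uint32_t read_field_unchecked(const ContentEntry& e, std::size_t index) {
    return e.fields[index];
}

// Defensive variant: validate the index before touching memory and let the
// caller decide how to degrade gracefully (skip the entry, log, continue).
std::optional<std::uint32_t> read_field_checked(const ContentEntry& e, std::size_t index) {
    if (index >= e.fields.size()) {
        return std::nullopt;   // invalid content: report instead of crashing
    }
    return e.fields[index];
}

The point of the sketch is only that a validity check before the memory access is what graceful handling would look like in practice, whatever the real Content Interpreter does.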

Maybe you are surprised that I bring up configuration management as a problem area. Looking at the visualization of the Falcon system (including the Content Configuration System), I recognise different parts with different update cycles: the Falcon Sensor in combination with the Sensor Content on the one hand, and the Content Configuration System with the generated Rapid Response Content on the other. As explained in the pPIR, the Rapid Response Content contains so-called Template-Instances, whilst the Sensor Content contains Template-Types. I can imagine that the Template-Instances need to be compatible with the Template-Types. This is a configuration management problem. I did not read anything about configuration management in the pPIR, so I hope CrowdStrike is in control of this. Are we sure the right version of the Template-Types was used when the Content Configuration System generated the faulty Rapid Response Content file?
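To illustrate what such a compatibility check could look like, here is a sketch of my own with made-up names (TemplateType, TemplateInstance, is_compatible); nothing in it is taken from the pPIR. The idea is simply that the generation step would refuse to release Rapid Response Content whose Template-Instances target a Template-Type version the deployed sensor does not provide:

#include <string>
#include <unordered_map>

// Hypothetical version bookkeeping for the two independently updated parts.
struct TemplateType     { std::string name; int version; };               // shipped with Sensor Content
struct TemplateInstance { std::string type_name; int required_type_version; };  // part of Rapid Response Content

// Returns true only if the instance targets a Template-Type version that the
// given sensor release actually contains.
bool is_compatible(const TemplateInstance& inst,
                   const std::unordered_map<std::string, TemplateType>& sensor_types) {
    const auto it = sensor_types.find(inst.type_name);
    return it != sensor_types.end()
        && it->second.version == inst.required_type_version;
}

Whether compatibility means an exact version match or a minimum version is a policy choice; the essential point is that the check is explicit and performed before release.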

Looking forward to the full Root Cause Analysis by CrowdStrike.


[1] https://blogs.microsoft.com/blog/2024/07/20/helping-our-customers-through-the-crowdstrike-outage/

[2] https://www.ncsc.nl/actueel/nieuws/2024/juli/19/wereldwijde-storing

[3] Falcon Content Update Remediation and Guidance Hub | CrowdStrike

Help…, a squashed bug on my windshield is blocking my view!

“And I’m hovering like a fly, waiting for the windshield on the freeway.” A well-known line from the lyrics of the song “Fly on a Windshield” by Genesis. I love listening to this song, especially the guitar part right after this line is sung, emphasizing the windshield hitting the fly. Everybody driving a car on a hot summer day is familiar with the experience: squashed bugs on your windshield. Whether you like it or not, it is inevitable and it becomes annoying.

These squashed bugs on your windshield can be compared with technical debt in software development. Both ‘creep’ in, and both are inevitable. Bug by bug gets squashed; piece by piece a little bit of code becomes unnecessarily complex. Unnecessary dependency after unnecessary dependency is inserted into your software. Compiler warning after compiler warning is left unsolved. Each of these represents a little bit of technical debt. If you look at these squashed bugs individually, they are not a problem, unless it is a large bug squashed right in your field of view. However, if no corrective action is taken, like using your wipers to clean the windshield, the amount of squashed bugs becomes annoying. Many squashed bugs, even small ones, become annoying due to their sheer quantity. The same holds for technical debt in your software: a few imperfections in the code will not hamper you, but many of them will.

Many squashed bugs, even small ones, become annoying due to their sheer quantity.

It is not only the squashed bug itself; the location on the windshield where the bug is squashed matters as well. If you have a squashed bug somewhere at the edge of your windshield, who cares? However, sometimes I find a squashed bug right in front of me, in the middle of my sight. Very annoying. The same applies to technical debt. Take a function with a too high cyclomatic complexity as an example. If this is a function which, so far, has not shown any problems, has passed all testing successfully and is not about to be changed, who cares? Just leave it as it is, like the squashed bug at the edge of your windshield. However, if we need to apply a change to this overly complex function, for whatever reason, we will suffer from its complexity like we suffer from the squashed bug in our sight.

And then there are small and big bugs. Small squashed bugs cause only a small spot on the windshield, whilst big bugs cause big spots. Similarly, in software development, small imperfections in the code can be handled by the ‘boy-scout rule’, which states that you should leave the code cleaner than you found it. A small effort to remove a small imperfection, like using your windshield wiper to wipe away the small squashed bugs. Bigger imperfections, however, may need specific restructuring which has to be prioritized and planned, like the large squashed bugs which cannot be removed by your wiper. Extensive cleaning beyond wiper usage might be needed.

In 1992, Ward Cunningham made the comparison between technical imperfections and financial debt. The accumulation of technical imperfections in your software is like debt in finance. You pay interest in the form of the extra effort needed to work with software that has become more complex due to the accumulated imperfections, and you can make repayments to get rid of some of the debt, for example by restructuring or refactoring your code. So when you are driving on the highway this summer and a bug gets squashed against your windshield, think about the technical debt in your software.

A misleading Maintainability-rating.

Recently I was approached by one of our engineers, asking me the question: “Here I have a piece of code with a ‘maintainability-rating’ of A in SonarQube. But when I look at the code, I think it is complex code which is not easy to maintain at all! How is it possible that SonarQube provides an A-score for ‘maintainability’?”.

ISO-25010

To answer the question, we first need to understand what is meant by maintainability. Maintainability is one of the quality characteristics of ISO-25010[1], the ISO Software Product Quality standard. In ISO-25010, maintainability is defined as:

“This characteristic represents the degree of effectiveness and efficiency with which a product or system can be modified to improve it, correct it or adapt it to changes in environment, and in requirements.”

Additionally, it goes into more detail with five sub-characteristics, on which I will not elaborate in this article.

SonarQube

Secondly, we need to understand how SonarQube determines the maintainability-rating of a piece of software. Luckily, SonarQube publishes how it determines its metrics[2]. Its documentation defines the maintainability-rating as follows:

“Maintainability rating: The rating given to your project related to the value of your Technical debt ratio.”

This definition is followed by ranges of percentages for the scores A (high) to E (low). Apparently, we need to understand the Technical debt ratio as well:

“Technical debt ratio: The ratio between the cost to develop the software and the cost to fix it. The Technical Debt Ratio formula is: Remediation cost / Development Cost.
Which can be restated as: Remediation cost / (Cost to develop 1 line of code * Number of lines of code).
The value of the cost to develop a line of code is 0.06 days.”

The next question in our quest is: how is the remediation cost defined? Unfortunately, it is not defined as such on SonarQube’s website. However, the reliability remediation effort and the security remediation effort are defined, respectively, as the effort to fix all so-called bug-issues and all vulnerability-issues. As far as I could find, these two are the only items I can assume to be part of the remediation cost. Both the bugs and the vulnerabilities detected by SonarQube are warnings produced by static code analysis.

To summarize, the maintainability-rating in SonarQube is based on the estimated effort needed to solve warnings produced by static code analysis. Static code analysis refers to the analysis of the source code without actually executing it[3], resulting in warnings to be considered and solved. Warnings which may lead to actual bugs or actual vulnerabilities in the code are classified as bugs and vulnerabilities in SonarQube and are in scope for the maintainability-rating.
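To make the arithmetic concrete, here is a small sketch with illustrative numbers of my own; the 0.06 days per line comes from SonarQube’s definition quoted above, while the A–E thresholds used here (roughly 5%, 10%, 20% and 50%) are assumed defaults and may differ per installation:

#include <iostream>

int main() {
    // Illustrative project: 20,000 lines of code and 30 days of estimated remediation effort.
    const double lines_of_code    = 20000.0;
    const double remediation_days = 30.0;
    const double cost_per_line    = 0.06;   // days per line, per SonarQube's metric definition

    const double development_cost = lines_of_code * cost_per_line;        // 1200 days
    const double debt_ratio       = remediation_days / development_cost;  // 0.025, i.e. 2.5%

    // Map the ratio to a letter using the assumed default rating grid (5% / 10% / 20% / 50%).
    const char rating = debt_ratio <= 0.05 ? 'A'
                      : debt_ratio <= 0.10 ? 'B'
                      : debt_ratio <= 0.20 ? 'C'
                      : debt_ratio <= 0.50 ? 'D' : 'E';

    std::cout << "Technical debt ratio: " << debt_ratio * 100
              << "%, maintainability rating " << rating << "\n";
    return 0;
}

Note that nothing in this calculation looks at the structure of the code; it only counts the estimated effort to fix reported issues, which is exactly the limitation discussed below.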

Complexity

Referring back to ISO-25010: maintainability is about how easily a product or piece of code can be modified, for whatever reason, and that is heavily determined by the complexity of the code.

Two important aspects of complexity are dependencies and obscurity.

Dependencies

It is obvious that when we have many dependencies between different software entities on different abstraction levels, complexity rises. Therefore, one should focus on reducing dependencies in the software as much as possible. It is no coincidence that design paradigms like ‘low-coupling & high-cohesion’ are the basis for the SOLID design principles, which have the goal to reduce dependencies in the software such that engineers can change entities of the software without having to change others. Applying these design principles in a proper way does mitigate complexity of this software.

The question is: do the bugs and vulnerabilities registered by SonarQube reflect the dependencies in your code? No, they do not.

Obscurity

Not understanding the intention of the software, or more specifically of the code, increases complexity as well. This is exactly what should be covered by creating so-called ‘Clean Code’. Clean Code is code that works and is understandable by other human beings. Code which is hard or nearly impossible to understand by other human beings is called Bad Code. In many cases, Bad Code is associated with big, complex functions containing deeply nested constructions and a high cyclomatic complexity. However, one should take into account that seemingly simple, small pieces of code can be obscure as well. Examples are the use of misleading variable names and non-obvious constructions for simple operations.
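As a made-up two-line illustration of such a small but obscure construction:

int length = 7;

// Obscure: a terse name and a bit-trick hide a simple intent.
int res = length & ~1;

// Clear: the same result, but both the name and the construction reveal the intent.
int even_length = length - (length % 2);   // round length down to the nearest even number

Both computations produce the same value; only the second one tells the reader what is intended.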

The question is: do the bugs and vulnerabilities registered by SonarQube reflect the obscurity of your code? Partly, I would say. Some warnings produced by static code analysis are certainly about the understanding and interpretation of language constructs, and addressing those will decrease obscurity. Interestingly, SonarQube apparently does not include the measured cyclomatic complexity in the maintainability-rating, even though cyclomatic complexity is clearly related to maintainability. Additionally, there are other aspects that contribute to obscurity but cannot be measured by static code analysis as performed by SonarQube, the use of meaningful names being one obvious example.

Is the Maintainability-rating misleading?

To summarize: is a piece of code with a maintainability-rating of ‘A’, as provided by SonarQube, maintainable? You cannot tell, simply because a high maintainability-rating in SonarQube only tells you that the majority of the reported static code analysis warnings classified as bugs and vulnerabilities have been solved. It does not provide sufficient information about the dependencies and obscurity of the code. As such, I think the maintainability-rating of SonarQube is misleading, because its name does not reflect what it actually measures.


[1] https://iso25000.com/index.php/en/iso-25000-standards/iso-25010

[2] https://docs.sonarsource.com/sonarqube/latest/user-guide/metric-definitions/

[3] “What is Software Quality?” – page 159

You Create the Complexity of Tomorrow!

In his book “Facts and Fallacies of Software Engineering”, Robert L. Glass states that an increase of 25% in problem complexity results in a 100% increase in the complexity of the software solution. Reason enough, I would say, to focus on mitigating complexity.

In software development, the primary source of complexity comes from requirements and constraints. Requirements determine what to build and are therefore a main contributor to the complexity the development team is facing. The same goes for constraints, such as memory and/or CPU-cycle limitations in embedded systems, which can add considerable complexity. In many cases, the influence of the development team on this ‘externally’ imposed complexity is limited. Still, by providing feedback to stakeholders and discussing alternatives with them, the complexity imposed by requirements and constraints might be mitigated.

However, this externally imposed complexity from requirements and constraints is not the only complexity the development team faces. A secondary complexity imposed on the team is the complexity of the existing software into which the new requirements need to be integrated.

Code produced today is the legacy of tomorrow

Today’s software development is incremental and lasts for many years or maybe even decades. This implies that decisions taken on the design and implementation will have a big influence on future development. The complexity induced by these decisions is the secondary source of complexity the team is facing. In other words, the complexity created by the team today will be faced by the team in the future. The good news about this complexity is that the developers are in full control of it.

As a software developer you create the complexity of tomorrow!

“Complexity is anything related to the structure of a software system that makes it hard to understand and modify it”, says John Ousterhout in his book “A Philosophy of Software Design”. Two important aspects of complexity of a software system are dependencies and obscurity.

Reduce dependencies

Because it is known that dependencies are an important aspect of complexity, one should focus on reducing dependencies in the software as much as possible. It is no coincidence that design paradigms like ‘low-coupling & high-cohesion’ are the basis for the SOLID design principles, which have the goal to reduce dependencies in the software such that engineers can change entities of the software without having to change others. Applying these design principles in a proper way does mitigate complexity of this software.
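As a small, generic illustration of what reducing dependencies looks like in code (an example of my own, not tied to any specific project): let a high-level entity depend on a narrow abstraction instead of on a concrete implementation, so the implementation can change without touching the entity that uses it.

#include <cstddef>
#include <string>
#include <vector>

// Abstraction: the report only knows this narrow interface.
class OrderSource {
public:
    virtual ~OrderSource() = default;
    virtual std::vector<std::string> open_orders() const = 0;
};

// High-level entity: depends on the abstraction, not on any concrete storage.
class OrderReport {
public:
    explicit OrderReport(const OrderSource& source) : source_(source) {}
    std::size_t open_order_count() const { return source_.open_orders().size(); }
private:
    const OrderSource& source_;
};

// Low-level detail: can be replaced (database, file, test stub) without
// changing OrderReport; this is the low coupling between the two entities.
class InMemoryOrders : public OrderSource {
public:
    std::vector<std::string> open_orders() const override { return {"A-1", "A-2"}; }
};

Here OrderReport can be compiled and tested without knowing which concrete OrderSource is behind it; swapping the storage does not ripple through the rest of the software, which is exactly the low coupling the design principles aim for.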

Reduce obscurity

Not understanding the intention of the software, or more specifically of the code, increases complexity as well. This is exactly what should be covered by creating so-called ‘Clean Code’. Clean Code is code that works and is understandable by other human beings. Code which is hard or nearly impossible to understand by other human beings is called Bad Code. In many cases, Bad Code is associated with big, complex functions containing deeply nested constructions and a high cyclomatic complexity. However, one should take into account that seemingly simple, small pieces of code can be obscure as well. Examples are the use of misleading variable names and non-obvious constructions for simple operations.

Once, I saw a little piece of code:

for (i = 1; i <= m; i++)
    number++;

Thinking about what was happening here, and knowing that m was an unsigned integer,

number = number + m;

would do the trick as well.

It is a seemingly simple little piece of code, easy to understand. Still, I would call it obscure, simply because you start to wonder… why? Why is it programmed this way? Why is a simple addition programmed as a loop? This seemingly simple piece of code raises questions and therefore creates complexity.

Spaghetti code

Software with a high number of unnecessary dependencies is often referred to as “spaghetti code”, and it is easy to see where the term comes from: all the different spaghetti strands, tangled up with each other, visualize the dependencies between the different software entities. You can imagine that, due to this obscure mess of dependencies, complexity ramps up.

It is for a reason that the first value statement in the “Manifesto for Software Craftsmanship” mentions: “Not only working software, but also well-crafted software.”

As a software engineer, take your responsibility and develop well-crafted software to mitigate complexity!

The gaps between the intended-, implemented- and understood design

Designing software is the process of creating and defining the structure of that software to accomplish a certain set of requirements. Typically, the design consists of different decomposition levels in which the software is decomposed into different entities that interact with each other. As such, one could conclude that we have one design for the software, comprising different decomposition levels. What many people do not realize is that we actually have different types of design: the intended design and the implemented design.

The intended- and implemented design.

The intended design is the design as it is meant to be implemented: the intention is that it will be realized as such in the actual code. Typically, the intended design is the design as documented in a tool like Enterprise Architect.

The implemented design is the design as it is actually implemented in the code. In an ideal situation, the implemented design is equal to the intended design. In practice, however, this never seems to be the case. Differences will always exist between the intended design and how it is actually implemented in the code: the implemented design.

Figure 1 illustrates two components (the boxes), each containing a number of functions (the black dots). The lines are interfaces between different functions.

Figure 1

Let’s suppose that this intended design is implemented exactly in this way: the implemented design equals the intended design.

Whenever a change is required and a certain function needs data from a function in the other component, the needed communication should be implemented by means of the well-defined interface between the two components. However, it might be decided to call that function in the other component directly, without using the interface. If this happens, an unintended interface between the components is realized (visualized by an additional call between the components). If this happens multiple times, we end up in a situation as reflected in Figure 2, illustrating the gap between the intended design and the implemented design.

Figure 2
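A minimal code sketch of what such an unintended interface looks like (component and function names are made up for illustration):

// Component B: its well-defined public interface and an internal helper.
namespace component_b {
    namespace detail {
        // Internal function, not meant to be called from other components.
        int read_speed_sensor() { return 42; }
    }
    // Intended entry point for other components.
    int current_speed() { return detail::read_speed_sensor(); }
}

// Component A, as intended: communicates only via the defined interface.
int speed_via_interface() {
    return component_b::current_speed();
}

// Component A, as sometimes implemented: reaches into B's internals, creating
// an interface that exists in the code but not in the intended design.
int speed_via_shortcut() {
    return component_b::detail::read_speed_sensor();
}

Both callers compile and work, but only the first one respects the intended design; the second one creates a dependency that exists in the implemented design only.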

It is obvious that one would need to mitigate the gap between the intended- and implemented design as much as possible.

One can think of several reasons causing the gap between the intended- and implemented design, like an inexperienced engineer not being aware of the intended design, an engineer implementing a hack due to time pressure, or the documentation of the intended design not being updated after a necessary change was applied in the code.

The understood design

In “Who needs an Architect?”[1] Martin Fowler stated:

“The expert developers on the software will have some common understanding of how the thing works.
And it is that common understanding which is effectively the architecture.”

Taking into account that architecture is your highest level of design, this puts a new perspective on the design of a piece of software. Aside from the intended design and the implemented design, there apparently is something like an understood design: the common understanding by the experts of how the software works.

Experiences

During my career, I’ve seen many situations in which all three design types were inconsistent with each other. Depending on the size of the gap, specifically between the understood design and the implemented design, this resulted in unexpected side effects, rework and even unstable software. In some cases, where the gap between the implemented design and the other design types was huge, the software was no longer maintainable. Engineers did not dare to ‘touch’ the code anymore, afraid they would break it.

Therefore, it is important to mitigate the gaps between the intended-, implemented- and understood design as much as possible: by documenting and maintaining the intended design, by sticking to the defined architectural and design rules when implementing, and by running static design analysis with reverse-engineering tools like Lattix to get insight into the implemented design.


[1] https://www.youtube.com/watch?v=DngAZyWMGR0