
Assume the Worst — and Verify Your Success

September 25, 2012

There are very few “rants” I go off on when it comes to programming. I generally believe that most programmers want to do the right thing, and there are innumerable ways to approach any problem, so who am I to say one approach is better than another?

That said, there are a few hot-button issues for me when it comes to programming patterns, and I encountered one such situation again recently. To frame the topic, here’s what I encountered:

Some machines within our domains at work are centrally managed using Microsoft System Center products (more specifically, System Center Configuration Manager), and overall the system works pretty well. The basic premise is to ensure (enforce, really) a certain level of software version/patch compliance, achieved by having SCCM push down a package which verifies the local software and automatically launches patches and product installations. The problem comes when one of these installations goes wrong: because of poor (perhaps “junior developer”) implementation patterns, a process which is meant to make things better can be detrimental instead, to the point of acting like a denial-of-service attack on our own managed resources!

So what are the problematic patterns here that make me pound my fist on the table and rant?

  • Assuming success is the natural outcome
  • Assuming any arbitrary failure can be recovered from

In this particular case, the patching solution was making things worse because it attempted to install a piece of software which was inappropriate for the environment, failed, and then retried ad infinitum. This infinite loop not only meant that the machines were not patched as intended (and in fact fell further and further behind, because the process never completed successfully), but also ate up most of each machine's processor and I/O in constant attempts to install a package which would never succeed!

This situation was particularly annoying because it clearly should have been caught in QC before it ever impacted my machines, and it demonstrates lazy error-handling practices which assumed success and swallowed errors rather than recovering or bailing out.
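To illustrate, here's a minimal sketch of what a saner patching agent might do; none of these names come from SCCM's actual API, and the attempt limit is purely illustrative. The idea is simply to bound the retries, verify the outcome of each attempt, and report permanent failure instead of looping forever:

using System;

// Hypothetical sketch -- not SCCM's actual agent. Retry a bounded number of
// times, verify the outcome each time, and report permanent failure rather
// than retrying ad infinitum.
class PatchAgent
{
    const int MaxAttempts = 3;

    // Placeholder for the real installation step; must report verified success.
    static bool TryInstall()
    {
        return false; // simulate a package that can never install here
    }

    static void Main()
    {
        for (int attempt = 1; attempt <= MaxAttempts; attempt++)
        {
            if (TryInstall())
            {
                Console.WriteLine("Install verified; compliance achieved.");
                return;
            }
            Console.WriteLine("Install failed (attempt " + attempt + " of " + MaxAttempts + ").");
        }

        // Permanent failure: flag it for a human instead of eating CPU forever.
        Console.WriteLine("Giving up; marking package for manual review.");
        Environment.Exit(1);
    }
}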

The definition of insanity is doing the same thing over and over again and expecting different results. - attributed to Albert Einstein

Blindly Assuming Success

This is a pattern I see quite frequently in software. I pride myself on being an optimist, but there is optimism and then there is an unhealthy disconnection from reality. Mechanical and structural engineers working in the “real world” have to base their designs on known, proven tolerances, planning for failure and understanding the limits of their materials. They know that given the environment their design is supposed to operate in, they can account for the expected stresses plus some additional safety factor. Outside of this tolerance range is unpredictability, “out of warranty”: risk. I often see the exact opposite in software design: there is a blind faith that whatever is attempted will succeed. So I rant:

Assume Failure — and Verify Your Success!

Does this mean you must verify the outcome of every statement? Of course not: we must always balance the risk of a failure against its likelihood of occurring, and incredibly basic operations have a very high likelihood of success. But I'm always amazed at the assumption that resources are infinite and immediately available. “On my dev box I was always able to create file X, so why shouldn't I be able to do so in production?” cries the Dev. You might be right, but let's be certain, shall we? Verify the operation you attempted after performing it, or better yet, verify that it should succeed before even attempting it. What is the worst that can happen by assuming failure? Typically, just a disappointing experience for the user because what they wanted to occur did not. On the other hand, assuming success without verifying can lead (and has led) to catastrophic situations where not only does the program fail to do what the user desired, but much worse damage occurs as well.
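To make that concrete, here's a minimal sketch of the Dev's file-creation example; the method name and error strings are mine, purely illustrative. Check that the operation should succeed, catch only the failure you expect, and verify the result afterward:

using System;
using System.IO;

class Example
{
    // Verify preconditions, attempt the operation, catch only the expected
    // failure class, and confirm the result instead of assuming it.
    static bool TryCreateFile(string path, out string error)
    {
        error = null;

        // Verify that this *should* succeed before attempting it.
        string dir = Path.GetDirectoryName(path);
        if (string.IsNullOrEmpty(dir) || !Directory.Exists(dir))
        {
            error = "Target directory does not exist: " + dir;
            return false;
        }

        try
        {
            using (FileStream stream = File.Create(path))
            {
                // ... write contents here ...
            }
        }
        catch (IOException ex) // an expected, recoverable class of failure
        {
            error = "Could not create " + path + ": " + ex.Message;
            return false;
        }

        // Verify the operation actually succeeded before reporting success.
        return File.Exists(path);
    }
}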

Failure is Always an Option

Contrary to the famous line attributed to Gene Kranz during the Apollo 13 mission, in software failure is always an option. Your own code will never know enough about its environment and context to be truly bulletproof, nor does it need to be. What it does need to explicitly accept, though, is that failures can happen: some we can expect, understand, and perhaps even gracefully recover from; all others we cannot. It's the second class of failures that seems to get downplayed. The point is not that we fail to recognize they can happen, but that we should not even try to handle them in place. Ever seen code like this?

try {
   // Attempt some operation that might fail
}
catch (Exception ex) {
   // "Recover" from ex... but Exception covers every possible failure,
   // so this handler claims to fix problems it cannot even identify
}

This kind of thing drives me nuts: it attempts to recover, but without any real chance of doing so, because catching such a generic exception type leaves you open to ANYTHING going wrong, and I guarantee you cannot recover from every failure type 🙂 This is the most extreme case, but even where more specific exceptions are caught, there is often still not enough context to recover reliably and automatically. And if you can't really recover, why are you catching the exception at all?! So, one final rant:

Catch only the failures you expect and can recover from; otherwise, fail gracefully and quit.
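As a sketch of what that looks like in practice (the config-loading scenario and names here are mine, not from any particular codebase):

using System;
using System.IO;

class ConfigLoader
{
    // Catch only the failure we expect and can recover from; let everything
    // else propagate so the program fails loudly rather than limping along.
    static string LoadConfig(string path)
    {
        try
        {
            return File.ReadAllText(path);
        }
        catch (FileNotFoundException)
        {
            // Expected and recoverable: fall back to an empty configuration.
            return "{}";
        }
        // No catch (Exception): anything else that goes wrong here is not
        // ours to "handle", so we let it bubble up and end the operation.
    }

    static void Main()
    {
        Console.WriteLine(LoadConfig("app.config.json"));
    }
}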

OK — ’nuff said, I’ll get off my soapbox. Over-and-out.
