Tag Archives: Measurement

Estimating Testing Times: Glorified Fortune-Telling?


Hofstadter’s Law:
It always takes longer than you
expect, even when you take
into account Hofstadter’s Law.

Douglas Hofstadter

A good friend of mine is a trainer for CrossFit, and has been for years. For a long time he trained clients out of his house, but his practice started outgrowing the space. His neighbors were complaining about the noise (if you’ve ever been in a CrossFit gym you can easily imagine that they had a point). Parking was becoming a problem, too.

So, in September, 2009, he rented a suite for a gym, in a building with an excellent location and a gutted interior–perfect for setting up the space exactly how he wanted it. It needed new flooring, plumbing, framing, drywall, venting, insulation, dropped ceiling, electricity, and a few other minor things. At the time, he told me they’d be putting the finishing touches on the build-out by mid-December. I remember thinking, “Wow. Three months. That’s a long time.”

As it turned out, construction wasn’t completed until late June, 2010, Seven months later than originally estimated.

Let’s think about that. Here’s a well-defined problem, with detailed plans (with drawings and precise measurements, even!) and a known scope, not prone to “scope creep.” The technology requirements for this kind of project are, arguably, on the low side–and certainly standardized and familiar. The job was implemented by skilled, experienced professionals, using specialized, efficiency-maximizing tools. And yet, it still took more than 3 times longer than estimated.

Contrast that with a software project. Often the requirements are incomplete, but even when they’re not, they’re still written in words, which are inherently ambiguous. What about tools? Sometimes even those have to be built, or existing tools need to be customized. And the analogy breaks down completely when you try to compare writing a line of code (or testing it) with, for example, hanging a sheet of drywall. Programmers are, by definition, attempting something that has never been done before. How do you come up with reasonable estimates in this situation?

This exact question was asked in an online discussion forum recently. A number of self-described “QA experts” chimed in with their answers. These all involved complex models, assumptions, and calculations based on things like “productivity factors,” “data-driven procedures,” “Markov chains,” etc. My eyes glazed over as I read them. If they weren’t all committing the Platonic fallacy then I don’t know what it is.

Firstly, at the start of any software project you are, as Jeffrey Friedman puts it, radically ignorant. You do not know what you do not know. The requirements are ambiguous and the code hasn’t even been written yet. This is still true for updates to existing products. You can’t be certain what effect the new features will have on the existing ones, or how many bugs will be introduced by re-factoring the existing features. How can you possibly know how many test cases you’re going to need to run? Are you sure you’re not committing the Ludic Fallacy when you estimate the “average time” per test case? Even if you’ve found the perfect estimation model (and how would you know this?), your inputs for it are bound to be wrong.

To attempt an estimate in that situation is to claim knowledge that you do not possess. Is that even ethical?

Secondly, your radical ignorance goes well beyond what the model’s inputs should be. What model takes into account events like the following (all of which actually happened, on projects I’ve been a part of)?

  1. The database containing the company’s live customer data–all of it–is inadvertently deleted by a programmer who thought at the time that he was working in the developer sandbox.
  2. The Director of Development, chief architect of the project, with much of the system design and requirements kept only in his head, fails to appear at work one day. Calls to his home go unanswered for two weeks. When someone finally gets in touch with him he says he won’t be coming back to work.
  3. A disgruntled programmer spends most of his time putting derogatory easter eggs in the program instead of actually working. When found by a particularly alert tester (sadly I can’t claim it was me) the programmer is fired.
  4. A version of the product is released containing an egregious bug, forcing the company to completely reassess  its approach to development (and blame the testers for missing the “obvious” bug, which then destroys morale and prompts a tester to quit).
  5. The company’s primary investor is indicted for running a ponzi scheme. The majority of the employees are simply let go, as there is not enough revenue from sales to continue to pay them.

The typical response from the “experts” has been, “Well, that’s where the ‘fudge factor’ comes in, along with the constant need to adjust the estimate while the project is underway.”

To that I ask, “Isn’t that just an implicit admission that estimates are no better than fortune-telling?”

I heard from Lynn McKee recently that Michael Bolton has a ready answer when asked to estimate testing time: “Tell me what all the bugs will be, first, then I can tell you how long it will take to test.”

I can’t wait to use that!


Testing’s Quiet Evidence

Since my post about Nassim Taleb’s The Black Swan I’ve continued to muse about the book and its ideas. In particular, while thinking about the notion of “silent evidence,” I realized there was a connection to testing that I hadn’t noticed before. I won’t flatter myself by thinking I missed the link due to some inherent subtlety in the concept. More likely it’s because I already associated the idea with something else: get-rich-quick schemes. Bear with me for a minute while I explain…

A few years ago I was consumed with debunking network marketing companies and real estate investing scams (If you’re genuinely curious why, I explain it all here). My efforts were focused primarily on a real estate “guru” who lives nearby, in Glendale, AZ. As with most con artists, his promotional materials include dozens of narrative fallac–uh… testimonials from former “students” who claim they achieved “great success” using the investment methods he teaches.

Of course, what the “guru” doesn’t promote are the undoubtedly much higher number of  people who attended his “boot camps” or bought his materials and either, A) did nothing, or B) tried his techniques and lost money (On the rare occasions such people are even mentioned, there’s always a ready explanation for them: The blame lies not with the technique, but with the practitioner. Voilà! The scammer has just removed your ability to falsify his claims!). Taleb calls these people the “silent evidence.” To ignore them when evaluating a population (in this case, the customers of a particular guru) is to engage in survivorship bias and miscalculate what you’re trying to measure. Con artists of all stripes make millions encouraging their customers to do this.

Testers are not con artists (as a rule, I mean), but we do have something that, while perhaps not silent, should be considered at least very quiet. In contrast to the scammers, it’s not to our advantage that it stay quiet. In fact, I’m starting to wonder if keeping it quiet is not at least in part to blame for some of the irrational practices you find in dysfunctional test teams, such as the obsession with test cases.

What am I talking about?

I am referring, dear reader, to all the bugs that were found and fixed prior to release. All those potentially serious issues that were neutralized before they could do any damage. No one thinks about them, because they don’t exist, except as forgotten items in a database no one cares about any more. But they’re there–hundreds, maybe thousands of them–quietly paying tribute to averted disasters, maintained reputations, even saved money (hence, why I call them “quiet” rather than silent: they’re still there if you look for them).

Meanwhile, the released product is out in the world, exposing its inevitable and embarrassing flaws for all to see, prompting CEOs and sales teams to wonder, “What are those testers doing all day? Why aren’t they assuring quality?” Note that this reaction is precisely the survivorship bias I mentioned above. The error causes them to undervalue the test team, in a way exactly analogous to how dupes of the real estate gurus overvalue the guru.

Okay, so what to do about this? I confess that, as yet, I do not know. Right now all I can say is it behooves us as testers to come up with ways of better publicizing the bugs that we find–to turn our quiet evidence into actual evidence. As to how to go about that, well, I’m open for suggestions.


Goodhart’s Law and Test Cases

I’d like to share a story about a glass factory in the Soviet Union. Being Soviet, the factory didn’t have to worry about the pesky things that a typical glass manufacturer has to pay attention to, like profits or appealing to customers. This factory was in the workers’ Paradise, after all! Their only charge was to “make glass”. Thus, the factory’s managers were left free to define exactly what that meant. Going through it in their minds, their solution was to take the factory’s output, and weigh it.

Over time–and, mind you, not a long time–the “product” became larger and heavier, until what was coming off the factory floor were giant cubes of glass. Very heavy, of course, but useful to no one.  The managers were forced to admit that their definition of success was flawed.

Thinking it over, management decided it would be better to measure the area of the glass produced.  They announced this change to the workers. Soon, the giant cubes were gone, replaced by enormous sheets, nearly paper-thin.  Lots of surface area per volume, but again, utterly useless outside the factory gates.

Now, I don’t remember when or where I first heard this story, and it may be apocryphal. However, even as a fable it contains an important lesson about the potential consequences of ignoring what has come to be known as Goodhart’s Law. Stated succinctly, it is this: When a measure becomes a target, it ceases to be a good measure.

What does any of this have to do with software testing, and test cases? I hope the answer is fairly obvious, but I’ll spell it out anyway. I’ve seen too many testing teams who think that it’s a QA “best practice” to focus on the test case as the sole unit of measure of “testing productivity”. The conventional wisdom in the industry appears to be: the more test cases, the better. The scope, quality, or risk of each test case taken individually, if considered at all, is of secondary importance.

I’ve seen situations where this myopia got so bad that all that mattered was that the team completed 130 test cases in a day. If that didn’t happen then the team was seen as not being as productive as they could have been. Never mind how many bugs had been found, or which test cases were actually executed.

I hope you can see how this sort of incentive structure can lead to perverse outcomes. Test cases will be written so as to maximize their number instead of their quality or importance. The testers will scour the test case repository for those items that can be done the fastest, regardless of the risk-level of the product area being tested. They’re likely to put blinders on against any tangential quality issues that may surface in the course of executing their test case. They’ll start seeing bugs as annoyances instead of things to be proud of having found. In other words, the test team’s output will quickly begin to resemble, metaphorically, that of the Soviet glass factory.

Joel Spolsky makes the same point here.