Estimating Testing Times: Glorified Fortune-Telling?


Hofstadter’s Law:
It always takes longer than you
expect, even when you take
into account Hofstadter’s Law.

Douglas Hofstadter

A good friend of mine is a trainer for CrossFit, and has been for years. For a long time he trained clients out of his house, but his practice started outgrowing the space. His neighbors were complaining about the noise (if you’ve ever been in a CrossFit gym you can easily imagine that they had a point). Parking was becoming a problem, too.

So, in September, 2009, he rented a suite for a gym, in a building with an excellent location and a gutted interior–perfect for setting up the space exactly how he wanted it. It needed new flooring, plumbing, framing, drywall, venting, insulation, dropped ceiling, electricity, and a few other minor things. At the time, he told me they’d be putting the finishing touches on the build-out by mid-December. I remember thinking, “Wow. Three months. That’s a long time.”

As it turned out, construction wasn’t completed until late June 2010, seven months later than originally estimated.

Let’s think about that. Here’s a well-defined problem, with detailed plans (with drawings and precise measurements, even!) and a known scope, not prone to “scope creep.” The technology requirements for this kind of project are, arguably, on the low side–and certainly standardized and familiar. The job was implemented by skilled, experienced professionals, using specialized, efficiency-maximizing tools. And yet, it still took more than 3 times longer than estimated.

Contrast that with a software project. Often the requirements are incomplete, but even when they’re not, they’re still written in words, which are inherently ambiguous. What about tools? Sometimes even those have to be built, or existing tools need to be customized. And the analogy breaks down completely when you try to compare writing a line of code (or testing it) with, for example, hanging a sheet of drywall. Programmers are, by definition, attempting something that has never been done before. How do you come up with reasonable estimates in this situation?

This exact question was asked in an online discussion forum recently. A number of self-described “QA experts” chimed in with their answers. These all involved complex models, assumptions, and calculations based on things like “productivity factors,” “data-driven procedures,” “Markov chains,” etc. My eyes glazed over as I read them. If they weren’t all committing the Platonic fallacy then I don’t know what it is.

Firstly, at the start of any software project you are, as Jeffrey Friedman puts it, radically ignorant. You do not know what you do not know. The requirements are ambiguous and the code hasn’t even been written yet. This is still true for updates to existing products. You can’t be certain what effect the new features will have on the existing ones, or how many bugs will be introduced by refactoring the existing features. How can you possibly know how many test cases you’re going to need to run? Are you sure you’re not committing the Ludic Fallacy when you estimate the “average time” per test case? Even if you’ve found the perfect estimation model (and how would you know this?), your inputs for it are bound to be wrong.

To attempt an estimate in that situation is to claim knowledge that you do not possess. Is that even ethical?

Secondly, your radical ignorance goes well beyond what the model’s inputs should be. What model takes into account events like the following (all of which actually happened, on projects I’ve been a part of)?

  1. The database containing the company’s live customer data–all of it–is inadvertently deleted by a programmer who thought at the time that he was working in the developer sandbox.
  2. The Director of Development, chief architect of the project, with much of the system design and requirements kept only in his head, fails to appear at work one day. Calls to his home go unanswered for two weeks. When someone finally gets in touch with him he says he won’t be coming back to work.
  3. A disgruntled programmer spends most of his time putting derogatory easter eggs in the program instead of actually working. When found by a particularly alert tester (sadly I can’t claim it was me) the programmer is fired.
  4. A version of the product is released containing an egregious bug, forcing the company to completely reassess its approach to development (and blame the testers for missing the “obvious” bug, which then destroys morale and prompts a tester to quit).
  5. The company’s primary investor is indicted for running a Ponzi scheme. The majority of the employees are simply let go, as there is not enough revenue from sales to continue to pay them.

The typical response from the “experts” has been, “Well, that’s where the ‘fudge factor’ comes in, along with the constant need to adjust the estimate while the project is underway.”

To that I ask, “Isn’t that just an implicit admission that estimates are no better than fortune-telling?”

I heard from Lynn McKee recently that Michael Bolton has a ready answer when asked to estimate testing time: “Tell me what all the bugs will be, first, then I can tell you how long it will take to test.”

I can’t wait to use that!


The Black Swan

I hate the idea of writing a book review for a post. Somehow it strikes me as cheap and lazy to rely so heavily on the work of others for content, particularly when my blog is so new. Shouldn’t I be concerned with sharing my own thoughts instead of parroting the thoughts of others?

Even worse: choosing Nassim Taleb’s The Black Swan (second edition, just released a few weeks ago) as the review’s subject matter. Taleb has notorious disdain for reviewers, many of whom seem to either miss his message entirely* or distort it in some consequential fashion. Given the book’s Kolmogorov complexity, any attempt at encapsulation is bound to leave out something significant (in contrast to the easily summarizable journalistic “idea book of the week” that excites the MBAs and is the intellectual equivalent of fast food. Anyone remember Who Moved My Cheese??).

I think the book’s message is important, and Taleb, being a champion of the skeptical empiricist, says a great deal that should excite and inspire the software tester. So, I’m willing to risk appearing lazy, but let’s not call this post a review so much as a somewhat desultory sampler. The Black Swan is a philosophical essay that is both dense and broad, and explores many interesting ideas–irreverently, I might add. My aim here will be to stick to those ideas that pertain to testing. I’ll leave the rest for you to discover on your own if you should decide to pick up a copy of the book for yourself.

The Black Swan

“All swans are white.”

Before 1697, you could say this, and every sighting of another swan would add firmness to your conviction of its “truth”. But then Europeans discovered a black swan in Western Australia. A metaphor for the problem of induction was born.

Taleb’s Black Swan (note the capitalization) is distinct from the philosophical issue, however. I’ll let Taleb define it:

First, it is an outlier, as it lies outside the realm of regular expectations, because nothing in the past can convincingly point to its possibility. Second, it carries an extreme impact (unlike the bird). Third, in spite of its outlier status, human nature makes us concoct explanations for its occurrence after the fact, making it explainable and predictable.

I stop and summarize the triplet: rarity, extreme impact, and retrospective (though not prospective) predictability. A small number of Black Swans explain almost everything in our world, from the success of ideas and religions, to the dynamics of historical events, to elements of our own personal lives. [emphasis original]

I’m confident that you can already see where this applies in the world of software. A Black Swan would be any serious bug that made it into a released product and caused some sort of harm–either to customers or the company’s reputation (or both!).

Toyota’s recent brake system problems are a perfect example. Clearly they didn’t see this coming, and it’s cost them an estimated $2 billion. You can bet they’re trying to figure out why they didn’t catch the problem earlier–and why they should have–and how to prevent similar problems in the future.

And there’s the rub! The problem with Black Swans is that they are unpredictable by nature. Reality has “epistemic opacity”, says Taleb, owing to various inherent limitations to our knowledge, coupled with how we often deal erroneously with the information we do have. Toyota might spend billions ensuring that their cars will never have brake problems of any kind ever again, only, perhaps, to find one day that, in certain rare situations, their fuel system catches fire. It happens precisely because it’s not planned for.

So, what can we, as testers, do about the Black Swans we might face? The Black Swan counsels primarily how not to deal with them, and Taleb openly laments the typical reaction to his “negative advice.”

…[R]ecommendations of the style “Do not do” are more robust empirically [see “Negative Empiricism,” below]. How do you live long? By avoiding death. Yet people do not realize that success consists mainly in avoiding losses, not in trying to derive profits.

Positive advice is usually the province of the charlatan [see “Narrative Fallacy,” below]. Bookstores are full of books on how someone became successful [see “Silent Evidence,” below]; there are almost no books with the title What I Learned Going Bust, or Ten Mistakes to Avoid in Life.

Linked to this need for positive advice is the preference we have to do something rather than nothing, even in cases when doing something is harmful. [emphasis original]

I’m reminded of a consulting gig where I explained to the test team’s managers that their method for tracking productivity was invoking Goodhart’s Law and was thus worse than meaningless, since it encouraged counterproductive behavior in the team. The managers agreed with my analysis, but did not change their methodology. After all, they said, they were required to report something to the suits above them. They didn’t seem to have an ethical problem with tracking numbers that they knew were bullshit.


The ancient Greek philosopher Plato had a theory that abstract ideas or “Forms,” such as the idea of the color red, were the highest kind of reality. He believed that Forms were the only means to genuine knowledge. The error of Platonicity, then, as defined by Taleb, is

…our tendency to mistake the map for the territory, to focus on pure and well-defined “forms,” whether objects, like triangles, or social notions, like utopias (societies built according to some blueprint of what “makes sense”), even nationalities. When these ideas and crisp constructs inhabit our minds, we privilege them over other less elegant objects, those with messier and less tractable structures…

Platonicity is what makes us think that we understand more than we actually do. But this does not happen everywhere. I am not saying that Platonic forms don’t exist. Models and constructions, these intellectual maps of reality, are not always wrong; they are wrong only in some specific applications. The difficulty is that a) you do not know beforehand (only after the fact) where the map will be wrong, and b) the mistakes can lead to severe consequences. These models are like potentially helpful medicines that carry random but very severe side effects.

The error of platonification has a lot in common with the error of reification, but there is a subtle difference. Platonification doesn’t require that you believe your model is real (as in, “concrete”), only that it is accurate.

Again I’m sure you’re already thinking of ways this applies in software testing. You build a model of a system you’re testing. Soon you forget that you’re using a model and become blind to scenarios that might occur outside of it. Even worse, you write a few hundred test cases based on your model and convince yourself that, once you’ve gone through them all, you’ve “finished testing.”

Negative Empiricism

I mentioned above that The Black Swan is almost entirely advice about what not to do. However, in the chapter he devotes to confirmation bias and its brethren, Taleb introduces the heuristic of “falsification.” I hope you’ll forgive my quoting rather liberally from the section, here. He seems, for a moment, to be speaking directly to software testers:

By a mental mechanism I call naïve empiricism, we have a natural tendency to look for instances that confirm our story and our vision of the world – these instances are always easy to find. Alas, with tools, and fools, anything can be easy to find. You take past instances that corroborate your theories and you treat them as evidence. For instance, a diplomat will show you his “accomplishments,” not what he failed to do. Mathematicians will try to convince you that their science is useful to society by pointing out instances where it proved helpful, not those where it was a waste of time, or, worse, those numerous mathematical applications that inflicted a severe cost on society owing to the highly unempirical nature of elegant mathematical theories.

The good news is that there is a way around this naïve empiricism. I am saying that a series of corroborative facts is not necessarily evidence. Seeing white swans does not confirm the nonexistence of black swans. There is an exception, however: I know what statement is wrong, but not necessarily what statement is correct. If I see a black swan I can certify that all swans are not white!

This asymmetry is immensely practical. It tells us that we do not have to be complete skeptics, just semiskeptics. The subtlety of real life over the books is that, in your decision making, you need to be interested only in one side of the story: if you seek certainty about whether the patient has cancer, not certainty about whether he is healthy, then you might be satisfied with negative inference, since it will supply you the certainty you seek. So we can learn a lot from data – but not as much as we expect. Sometimes a lot of data can be meaningless; at other times one single piece of information can be very meaningful. It is true that a thousand days cannot prove you right, but one day can prove you to be wrong.

The person who is credited with the promotion of this idea of one-sided semiskepticism is Sir Doktor Professor Karl Raimund Popper, who may be the only philosopher of science who is actually read and discussed by actors in the real world (though not as enthusiastically by professional philosophers)… He writes to us, not to other philosophers. “We” are the empirical decision makers who hold that uncertainty is our discipline, and that understanding how to act under conditions of incomplete information is the highest and most urgent human pursuit. [emphasis original]

It always rankles when I hear someone (who is – usually – not a tester) declare something like “We need to prove the program works.” Obviously anyone who says this has a fundamental misconception of what is actually possible. And how many times has a programmer come to you claiming that he tested his code and “the feature works” – but you discover after only a couple tests that his “tests” were within only a narrow range, outside of which the feature breaks immediately?
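
Here is a minimal sketch of that last scenario in Python. Everything in it is made up for illustration: the `apply_discount` feature, the `find_counterexample` helper, and the assumed rule that a discounted price must stay between zero and the original price. It is not taken from any real project.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical feature under test: return the discounted price in cents."""
    return price_cents - price_cents * percent // 100


def find_counterexample(percents, price_cents=1000):
    """Return the first (percent, result) that breaks the assumed rule
    0 <= result <= price_cents, or None if every probe passes."""
    for percent in percents:
        result = apply_discount(price_cents, percent)
        if not 0 <= result <= price_cents:
            return percent, result
    return None


# The programmer's "tests": a narrow band of comfortable values.
confirming = find_counterexample([10, 20, 50])

# A tester's probes: the boundaries, then just outside them.
falsifying = find_counterexample([0, 100, 101, -5])

print(confirming)  # None: every check passed, which proves nothing
print(falsifying)  # a single failing probe settles the question
```

The three passing checks are white swans. No number of them establishes that "the feature works," while the first probe past the boundary is enough to show that it doesn't.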

All The Rest

I’ve only touched on a very small part of the contents of The Black Swan, but hopefully enough to convince you that it’s required reading for software testers. I’ll close the post with short descriptions of a few of the bigger ideas in the book that I skipped:

  • Mediocristan – A metaphorical country where deviations from the median are small and relatively rare, and those deviations can’t meaningfully affect the total. Think heights and weights of people. Black Swans aren’t possible here.
  • Extremistan – A metaphorical country where Black Swans are possible, because single members of a population can affect the aggregate. Think income or book sales.
  • Ludic Fallacy – Roughly speaking, the belief that you’re dealing with a phenomenon from Mediocristan when it’s actually from Extremistan. The Ludic Fallacy is a special case of the Platonic Fallacy.
  • Narrative Fallacy – The tendency to believe or concoct explanations that fit a complicated set of historical facts because they sound plausible. Conspiracy theories are only a small facet of this. These narratives cause us to think that past events were more predictable than they actually were. We become, as Taleb puts it, “Fooled by Randomness.”
  • Silent Evidence – That part of a population that is ignored because it is “silent,” meaning either difficult or impossible to see. We see all the risk-takers who succeeded in business, but not all risk-takers who failed. The result is the logical error called survivorship bias.
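
The difference between the two countries can be sketched with a few lines of Python. The numbers are invented for illustration (a room of 1,000 identical people, one very tall newcomer, one very rich newcomer), and `share_of_total` is a hypothetical helper, not anything from the book.

```python
def share_of_total(population, newcomer):
    """Fraction of the aggregate contributed by one added member."""
    return newcomer / (sum(population) + newcomer)


# Mediocristan: heights in centimeters (illustrative values).
heights_cm = [170.0] * 1000
tall = share_of_total(heights_cm, 272.0)

# Extremistan: net worth in dollars (illustrative values).
net_worth_usd = [50_000.0] * 1000
rich = share_of_total(net_worth_usd, 50_000_000_000.0)

print(f"tallest newcomer's share of total height: {tall:.2%}")
print(f"richest newcomer's share of total wealth: {rich:.2%}")
```

Under these assumptions the tall newcomer accounts for well under one percent of the room's total height, while the rich newcomer accounts for nearly all of its wealth. An "average" computed in the second room describes almost nobody in it.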

*An example of this is found in the quote from GQ magazine that appears, ironically, on the front cover of the book itself: “The most prophetic voice of all.” Taleb’s point is to be wary of anyone who claims he can predict the future. He says of himself, “I know I cannot forecast.”