Monthly Archives: May 2010

(Good) Testers Are Not Robots!

Reading James Bach’s recent blog post this morning, “The Essence of Heuristics” – in particular the list of questions at the end – I was reminded, by way of stark contrast, of the testing culture I found when I started my current consulting gig.

One of the first things I was told was one of their testing “rules” – every test case should be repeated, with different data, 15 times. At first I simply marveled at this, privately. I figured someone must have a good reason for choosing 15 as the magic number. Why not 5? Or, for that matter, 256? Why every test case? Surely my time would be better spent doing a new test case instead of the 15th iteration of the current one, right?

Sooner or later, I thought, the rule’s reasonableness would become apparent. After a couple of weeks I knew the team a little better, but the rule still seemed as absurd to me as when I first heard it, so I broached the topic.

“Why do you run 15 iterations of every test case?”

“Well, sometimes when we run tests, the first 10 or 12 will pass, but then the 11th or 13th, for example, will fail.”

“Okay, well, do you ever then try to discover what exactly the differences were between the passing and failing tests? So that you can be sure in the future you’ll have tests for both scenarios?”

<blank stare>
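The question I was driving at can be made concrete. Once you know which data variation flips a test from pass to fail, a couple of targeted cases pin down the boundary far better than fifteen blind repetitions. A minimal sketch in Python, assuming (purely hypothetically) that the hidden failure was a name field silently truncating at 10 characters:

```python
# Hypothetical sketch: suppose the 15-iteration rule "worked" only because a
# name field silently truncates at 10 characters, so the 11th data variation
# happened to fail. Once the boundary is identified, two deliberate cases
# replace fifteen blind ones.

def normalize_name(name: str) -> str:
    """Toy stand-in for the system under test: truncates at 10 characters."""
    return name[:10]

def test_boundary():
    # At and below the limit: the value should come back unchanged.
    for name in ("A" * 9, "A" * 10):
        assert normalize_name(name) == name
    # Just past the limit: we *expect* a visible difference, and now we know
    # exactly why, instead of rediscovering it on iteration 11.
    assert normalize_name("A" * 11) != "A" * 11

test_boundary()
```

Two cases at the boundary now document exactly why “iteration 11” used to fail, instead of leaving it as team folklore.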

I quickly came to realize that this testing “rule” was symptomatic of a larger issue: an attitude in management that the team couldn’t be trusted to approach the testing problem intelligently. I saw evidence of this attitude in other ways. For example, we were told that all bug descriptions needed to include the date and time the bug occurred, so that the programmers would know where to look in the log files. When I pointed out that not all bugs will involve issues with logged events, I was told that they just didn’t want to confuse the junior team members.

Another example – and a particular pet peeve of mine – is the requirement that every test case include detailed step-by-step instructions to follow, leaving no room for creative thinking, interpretation, or exploration. The reasoning behind the excruciating detail, of course, is so the newest team members can start testing right away. My first objection to this notion is that the fresh eyes of a new user can see problems that veterans have become blind to. As such, putting blinders on the newbies is not a good idea. Also, why bypass the testing of the product’s usability and/or the help documentation and user manual? New users are a great resource for that.

In short, testers are not robots, and treating them like they are will result in lower quality testing efforts.


The Post Hoc Fallacy

Correlation is not causation.

It seems a simple statement when you look at it. Just because night follows day does not mean that day causes night. However, it’s clear that people fall prey to this fallacy all the time. It’s what’s behind, for example, the superstitious rituals of baseball pitchers.

A far less trite example is modern medicine. You have a headache. You take a pill. Your headache goes away. Did it go away because of the pill you took? Maybe it would have gone away on its own.  How do you know?

Teasing out causation from mere correlation in cases like that, with potentially dozens of unknown and uncontrolled variables, is notoriously difficult. The entire industry of complementary and alternative medicine banks on the confusion.

I was thinking about all this the other day when I was testing a tool that takes mailed orders for prescription drugs, digitizes the data, and then adds it all to a central database. I was focusing specifically on the patient address information at the time, so the rest of each order, like the payment information, was fairly simple; all my test orders were expected to get assigned a payment type of “invoice”, and they did. So in the course of my address testing I “passed” the test case for the invoice payment type.

It wasn’t until later that I realized I had committed the fallacy Post hoc ergo propter hoc (“After this, therefore because of this”), just like the person who attributes the disappearance of their headache to the sugar pill they’ve just taken. I discovered that all orders were getting a payment type of “Invoice”, regardless of whether they had checks or credit card information attached.

Inadvertently, I had succumbed to confirmation bias. I forgot, momentarily, that proper testing always involves attempting the falsification of claims, not their verification.
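The difference between the two mindsets is easy to show in code. A verification-only test looks like the first assertion below; falsification attempts look like the last two. A minimal sketch, with the classifier and its field names invented for illustration:

```python
# Hypothetical sketch (function and field names invented): checking that an
# invoice order is classified "invoice" is verification, and it passes even if
# the classifier always returns "invoice". Falsification means also feeding in
# orders that should NOT be invoices and asserting that they aren't.

def classify_payment(order: dict) -> str:
    """Toy stand-in for the payment classifier under test."""
    if order.get("check_number"):
        return "check"
    if order.get("card_number"):
        return "credit_card"
    return "invoice"

# Verification alone: this would pass even if every order got "invoice".
assert classify_payment({}) == "invoice"

# Falsification attempts: these are the ones that catch the bug I missed.
assert classify_payment({"check_number": "1042"}) != "invoice"
assert classify_payment({"card_number": "4111111111111111"}) != "invoice"
```

Had I run the equivalent of those last two assertions against the real tool, the “everything is an invoice” bug would have surfaced immediately.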


Goodhart’s Law and Test Cases

I’d like to share a story about a glass factory in the Soviet Union. Being Soviet, the factory didn’t have to worry about the pesky things a typical glass manufacturer has to pay attention to, like profits or appealing to customers. This factory was in the workers’ Paradise, after all! Its only charge was to “make glass”. Thus, the factory’s managers were left free to define exactly what that meant, and the solution they settled on was to take the factory’s output and weigh it.

Over time–and, mind you, not a long time–the “product” became larger and heavier, until what was coming off the factory floor were giant cubes of glass. Very heavy, of course, but useful to no one.  The managers were forced to admit that their definition of success was flawed.

Thinking it over, management decided it would be better to measure the area of the glass produced.  They announced this change to the workers. Soon, the giant cubes were gone, replaced by enormous sheets, nearly paper-thin.  Lots of surface area per volume, but again, utterly useless outside the factory gates.

Now, I don’t remember when or where I first heard this story, and it may be apocryphal. However, even as a fable it contains an important lesson about the potential consequences of ignoring what has come to be known as Goodhart’s Law. Stated succinctly, it is this: When a measure becomes a target, it ceases to be a good measure.

What does any of this have to do with software testing, and test cases? I hope the answer is fairly obvious, but I’ll spell it out anyway. I’ve seen too many testing teams who think that it’s a QA “best practice” to focus on the test case as the sole unit of measure of “testing productivity”. The conventional wisdom in the industry appears to be: the more test cases, the better. The scope, quality, or risk of each test case taken individually, if considered at all, is of secondary importance.

I’ve seen situations where this myopia got so bad that all that mattered was that the team completed 130 test cases in a day. If that didn’t happen then the team was seen as not being as productive as they could have been. Never mind how many bugs had been found, or which test cases were actually executed.

I hope you can see how this sort of incentive structure can lead to perverse outcomes. Test cases will be written so as to maximize their number instead of their quality or importance. The testers will scour the test case repository for those items that can be done the fastest, regardless of the risk-level of the product area being tested. They’re likely to put blinders on against any tangential quality issues that may surface in the course of executing their test case. They’ll start seeing bugs as annoyances instead of things to be proud of having found. In other words, the test team’s output will quickly begin to resemble, metaphorically, that of the Soviet glass factory.

Joel Spolsky makes the same point here.