The Blackest of Boxes

At a recent gig I was one of several veteran testers brought in to shore up a team that… I don’t want to say “lacked experience.” Probably the best description is that they were “insular.” They were basically re-inventing the wheel.

I had a number of discussions with the team’s manager and he seemed receptive to my suggestions, up to a point. Past that, though, all I heard was “We don’t do that.” When I first heard this my mind was suddenly struck with visions of Nomad, the robot from Star Trek, whose favorite thing to say was “Non sequitur. Your facts are uncoordinated.”

Don’t worry. While I was tempted to, I didn’t actually say it out loud.

Let me back up a little. The job involved testing a system that is sandwiched between two others–a digital scanner and a database. The system is designed to take the scanned information and process it so that the right data can be updated and stored, with a minimum of human intervention. That last part is key. Essentially most of what happens is supposed to be automatic and invisible to the end user. You trust that checks are in place such that any question about data integrity is flagged and presented to a human.

Testers, however, are not end users. This is a complicated system, with a lot of moving parts (metaphorically speaking). When erroneous data ends up in the database it’s not a simple matter to figure out where the problem lies. When I asked for tools to help the team “look under the hood” (for example, at the original OCR data or the API calls sent to the database) and eliminate much of the overhead involved with some of the tests, I was told “But we are User Acceptance Testing”, as if that were somehow an argument.
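
To be clear, the tooling I had in mind was nothing exotic. Here’s a rough sketch of the sort of thing I meant – the endpoint, field names, and records below are all made up for illustration, not taken from the actual system – just enough to let a tester compare what the OCR engine produced with what ultimately landed in the database:

```python
# A rough "look under the hood" diff tool. The endpoint, field names, and
# database record below are hypothetical stand-ins, not the real system.
import json
import urllib.request

OCR_API = "http://ocr-service.example/api/documents/{doc_id}"  # made-up endpoint


def fetch_ocr_record(doc_id: str) -> dict:
    """Pull the raw OCR output for a scanned document."""
    with urllib.request.urlopen(OCR_API.format(doc_id=doc_id)) as resp:
        return json.load(resp)


def diff_fields(ocr_record: dict, db_record: dict) -> dict:
    """Return every field whose value differs between the OCR output and the database."""
    return {
        field: (ocr_record.get(field), db_record.get(field))
        for field in set(ocr_record) | set(db_record)
        if ocr_record.get(field) != db_record.get(field)
    }


if __name__ == "__main__":
    ocr = fetch_ocr_record("12345")
    db = {"name": "J. Smith", "zip": "90210"}  # stand-in for a real database query
    for field, (scanned, stored) in diff_fields(ocr, db).items():
        print(f"{field}: OCR={scanned!r} -> DB={stored!r}")
```

A throwaway script along those lines can turn an afternoon of re-keying test orders into a five-minute diff.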

Worse, the system was not being developed in-house, which meant that our interaction with the programmers was essentially non-existent. We’d toss our bugs over a high wall and they’d toss their fixes back over it. The team was actively discouraged from asking the programmers questions directly (they were in another state, so face-to-face communication was out of the question, but even emails and phone calls were verboten).

One thing I’ve learned over the years is that it’s never a good idea to separate programmers and testers. It fosters an us-versus-them mentality instead of a healthy one where the testers are viewed as helping the programmers. It lengthens the lifespan of bugs since a programmer doesn’t have the option to walk over to a tester and ask to be shown the issue. Testers are deprived of the opportunity to learn important information about how the system is built–stuff that’s not at all apparent from the user interface, like snippets of code used in several unrelated places. A change to one area could break things elsewhere and the testers might otherwise never think to look there.

So, when I hear “We don’t do that” it strikes me as a cop out. It’s an abdication of responsibility. A non sequitur.

Opportunity Cost

I’ve always had an abiding love of economics – which, contrary to what seems to be popular belief, is not about how to balance your checkbook, but about what it means to make choices in the face of scarcity. Once you’re even just a little familiar with the economic way of thinking then you never see the world in quite the same way again.

A fundamental component of economic thought is the notion of “opportunity cost.” The idea was probably most famously and elegantly described by the French economist Frédéric Bastiat in his 1850 essay “What Is Seen and What Is Not Seen.” The basic idea is this: the real cost of a thing is not the time or the money you spend on it, but the next-best alternative you give up to get it. For example, the opportunity cost of the omelet you had for breakfast is the pancakes you didn’t have.

Testers – ever familiar with tight schedules, under-staffing, and a potentially infinite list of tests to perform – probably understand opportunity cost at a more visceral level than your typical academic economist. When testers keep opportunity cost in mind, they’ll constantly be asking themselves, “Is this the most valuable test I could be doing right now?”

This brings us to the perennial question: Should testing be automated?

Setting aside the deeper question of “Can it?”, the answer, obviously, should involve weighing the benefits against the costs – including, most importantly, the opportunity cost. The opportunity cost of automated testing is the manual testing you have to give up, because – let’s not kid ourselves – it’s the rare software company that will hire a whole new team whose sole job is planning, writing, and maintaining the test scripts.

And let’s say the company does hire specialist automation testers. Well, it seems there’s always an irresistible inclination to, in a pinch (and Hofstadter’s Law ensures there’s always a pinch), have the automation team help with the manual testing efforts – “only to get us through this tough time”, of course. Funny how that “tough time” has a tendency to get longer and longer. Meanwhile, the automation scripts are not being written, or else they’re breaking or becoming obsolete. (Note that, of course, when a company puts automation on hold in favor of doing manual testing, it’s an implicit recognition of the opportunity cost of the automation.)

There’s another factor to consider: How valuable are these automated tests going to be?

I’ve already made the argument that good testers are not robots. Do I need to point out what automated testing is? Scripts are the pinnacle of inflexible and narrowly focused testing. A human performing the same test is going to see issues that the machine will miss. This is why James Bach prefers to call manual testing “sapient”, instead. He has a point!

Furthermore, except in particular domains (such as data-driven testing), any single test will lose value each time it is performed, thanks to something else from economic theory: the law of diminishing returns. Sure, sometimes developers break stuff that was working before, but unless the manual – sorry, sapient – testers had no chance of finding the problem in a timely way in the course of their – ahem! – sapient tests, the value of any given automated test will asymptotically approach zero. Meanwhile its cost of maintenance and execution will only increase.
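
To put a toy model behind that intuition: give each run of an automated check a decaying chance of catching something the sapient testers wouldn’t have caught anyway, and a maintenance cost that creeps upward as the product shifts underneath the script. Every number below is invented purely for illustration – only the shape of the curves matters:

```python
# Toy model of a single automated check's value over repeated runs.
# Every number here is invented for illustration; only the trend matters.

VALUE_OF_CATCH = 500.0    # rough value of catching a regression early
P_CATCH_FIRST_RUN = 0.02  # chance the first run catches something humans would miss
DECAY = 0.8               # each repeat of the same narrow check is worth less
MAINT_GROWTH = 1.5        # per-run maintenance cost grows as the product changes


def run_value(n: int) -> float:
    """Expected value of the nth run, shrinking toward zero (diminishing returns)."""
    return VALUE_OF_CATCH * P_CATCH_FIRST_RUN * (DECAY ** n)


def run_cost(n: int) -> float:
    """Cost of the nth run: flat execution cost plus accumulating maintenance."""
    return 1.0 + MAINT_GROWTH * n


cumulative_value = cumulative_cost = 0.0
for n in range(200):
    cumulative_value += run_value(n)
    cumulative_cost += run_cost(n)
    if cumulative_cost > cumulative_value:
        print(f"Cumulative cost overtakes cumulative expected value at run {n}.")
        break
```

In this made-up model the cumulative cost overtakes the cumulative expected value after only a handful of runs. Your numbers will differ, but they are worth estimating before you commit to the scripts.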

I won’t go so far as to say that test automation is never worth it. I’m sure there are situations where it is. However, given the precious opportunity cost involved, I think it’s called for less frequently than is generally believed.

(Good) Testers Are Not Robots!

Reading James Bach’s recent blog post this morning, “The Essence of Heuristics” – in particular the list of questions at the end – I was reminded, by way of stark contrast, of the testing culture I found when I started my current consulting gig.

One of the first things I was told was one of their testing “rules” – every test case should be repeated, with different data, 15 times. At first I simply marveled at this, privately. I figured someone must have a good reason for choosing 15 as the magic number. Why not 5? Or, for that matter, 256? Why every test case? Surely my time would be better spent doing a new test case instead of the 15th iteration of the current one, right?

Sooner or later, I thought, the rule’s reasonableness would become apparent. After a couple of weeks I knew the team a little better, but the rule still seemed as absurd to me as when I first heard it, so I broached the topic.

“Why do you run 15 iterations of every test case?”

“Well, sometimes when we run tests, the first 10 or 12 will pass, but then the 11th or 13th, for example, will fail.”

“Okay, well, do you ever then try to discover what exactly the differences were between the passing and failing tests? So that you can be sure in the future you’ll have tests for both scenarios?”

<blank stare>
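
What I was driving at is that once you know which data made the 11th or 13th iteration fail, you can pin that data down as an explicit, named case instead of hoping a 15th roll of the dice stumbles onto it again. Here’s a minimal sketch of what I mean, using pytest, with the module, field, and boundary values all invented for the example:

```python
# A data-driven sketch in pytest. The module under test, the field, and the
# boundary values are all hypothetical, chosen to show the idea, not the system.
import pytest

from order_pipeline import normalize_address  # hypothetical module under test


@pytest.mark.parametrize(
    "street, expected_ok",
    [
        ("123 Main St", True),   # a typical, should-always-pass case
        ("", False),             # an empty street: the kind of data that
                                 # "failed on the 13th iteration"
        ("1" * 255, True),       # at an assumed field-length limit
        ("1" * 256, False),      # just past it
    ],
)
def test_street_handling(street, expected_ok):
    result = normalize_address({"street": street})
    assert result.ok is expected_ok
```

Four deliberate cases like these tell you more than fifteen arbitrary ones, and when one fails it tells you which scenario broke.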

I quickly came to realize that this testing “rule” was symptomatic of a larger issue: an attitude in management that the team couldn’t be trusted to approach the testing problem intelligently. I saw evidence of this attitude in other ways. For example, we were told that all bug descriptions needed to include the date and time the bug occurred, so that the programmers would know where to look in the log files. When I pointed out that not all bugs will involve issues with logged events, I was told that they just didn’t want to confuse the junior team members.

Another example – and a particular pet peeve of mine – is the requirement that every test case include detailed step-by-step instructions to follow, leaving no room for creative thinking, interpretation, or exploration. The reasoning behind the excruciating detail, of course, is so that the newest team members can start testing right away. My first objection to this notion is that the fresh eyes of a new user can see problems that veterans have become blind to. As such, putting blinders on the newbies is not a good idea. Also, why bypass the testing of the product’s usability and/or the help documentation and user manual? New users are a great resource for that.

In short, testers are not robots, and treating them like they are will result in lower quality testing efforts.

The Post Hoc Fallacy

Correlation is not causation.

It seems a simple statement when you look at it. Just because night follows day does not mean that day causes night. However, it’s clear that people fall prey to this fallacy all the time. It’s what’s behind, for example, the superstitious rituals of baseball pitchers.

A far less trite example is modern medicine. You have a headache. You take a pill. Your headache goes away. Did it go away because of the pill you took? Maybe it would have gone away on its own.  How do you know?

Teasing out causation from mere correlation in cases like that, with potentially dozens of unknown and uncontrolled variables, is notoriously difficult. The entire industry of complementary and alternative medicine banks on the confusion.

I was thinking about all this the other day when I was testing a tool that takes mailed orders for prescription drugs, digitizes the data, and then adds it all to a central database. I was focusing specifically on the patient address information at the time, so the rest of each order, like the payment information, was fairly simple–meaning all my test orders were expected to get assigned a payment type of “invoice”, which they did. So in the course of my address testing I “passed” the test case for the invoice payment type.

It wasn’t until later that I realized I had committed the fallacy Post hoc ergo propter hoc (“After this, therefore because of this”), just like the person who attributes the disappearance of their headache to the sugar pill they’ve just taken. I discovered that all orders were getting a payment type of “Invoice”, regardless of whether they had checks or credit card information attached.

Inadvertently, I had succumbed to confirmation bias. I forgot, momentarily, that proper testing always involves attempting the falsification of claims, not their verification.
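
Had I been thinking in terms of falsification, I’d have written at least one case designed to prove the claim wrong: an order that should not end up invoiced. Something along these lines, with the function and payment-type names standing in for whatever the real system calls them:

```python
# Falsification-oriented checks. The function, argument, and payment-type
# names are hypothetical stand-ins for whatever the real system calls them.
import pytest

from order_intake import determine_payment_type  # hypothetical module under test


def test_order_with_no_payment_info_is_invoiced():
    # The verification case: nothing attached, so "invoice" is expected.
    assert determine_payment_type(attachments=[]) == "invoice"


@pytest.mark.parametrize("attachment, expected", [
    ("check", "check"),
    ("credit_card", "credit_card"),
])
def test_order_with_payment_info_is_not_invoiced(attachment, expected):
    # The falsification cases: if these still come back "invoice", the claim
    # "payment type is assigned correctly" has been refuted, not confirmed.
    assert determine_payment_type(attachments=[attachment]) == expected
```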

Goodhart’s Law and Test Cases

I’d like to share a story about a glass factory in the Soviet Union. Being Soviet, the factory didn’t have to worry about the pesky things that a typical glass manufacturer has to pay attention to, like profits or appealing to customers. This factory was in the workers’ Paradise, after all! Its only charge was to “make glass”, and the factory’s managers were left free to define exactly what that meant. After mulling it over, they settled on a solution: take the factory’s output and weigh it.

Over time–and, mind you, not a long time–the “product” became larger and heavier, until what was coming off the factory floor were giant cubes of glass. Very heavy, of course, but useful to no one.  The managers were forced to admit that their definition of success was flawed.

Thinking it over, management decided it would be better to measure the area of the glass produced.  They announced this change to the workers. Soon, the giant cubes were gone, replaced by enormous sheets, nearly paper-thin.  Lots of surface area per volume, but again, utterly useless outside the factory gates.

Now, I don’t remember when or where I first heard this story, and it may be apocryphal. However, even as a fable it contains an important lesson about the potential consequences of ignoring what has come to be known as Goodhart’s Law. Stated succinctly, it is this: When a measure becomes a target, it ceases to be a good measure.

What does any of this have to do with software testing, and test cases? I hope the answer is fairly obvious, but I’ll spell it out anyway. I’ve seen too many testing teams who think that it’s a QA “best practice” to focus on the test case as the sole unit of measure of “testing productivity”. The conventional wisdom in the industry appears to be: the more test cases, the better. The scope, quality, or risk of each test case taken individually, if considered at all, is of secondary importance.

I’ve seen situations where this myopia got so bad that all that mattered was that the team completed 130 test cases in a day. If that didn’t happen then the team was seen as not being as productive as they could have been. Never mind how many bugs had been found, or which test cases were actually executed.

I hope you can see how this sort of incentive structure can lead to perverse outcomes. Test cases will be written so as to maximize their number instead of their quality or importance. The testers will scour the test case repository for those items that can be done the fastest, regardless of the risk-level of the product area being tested. They’re likely to put blinders on against any tangential quality issues that may surface in the course of executing their test case. They’ll start seeing bugs as annoyances instead of things to be proud of having found. In other words, the test team’s output will quickly begin to resemble, metaphorically, that of the Soviet glass factory.

Joel Spolsky makes the same point here.
