Testers Cannot Be Perfect Proxies for End Users

The following is a snippet of a Skype conversation I was a part of recently. I wanted to share it, since I think it’s a perfect example of a perennial tester conversation. See if you recognize yourself in it. The participants’ names have been removed in the interest of privacy. Other minor changes were made for the sake of clarity.

[11:25:37 AM] Dev Mgr: Programmer– i really need you to look into that “flash wont resize without a reload” fact
[11:25:42 AM] Dev Mgr: its not being believed [by the CEO]
[11:25:50 AM] Dev Mgr: and quite honestly
[11:26:02 AM] Dev Mgr: i went to some other pages that clearly have graphs on them
[11:26:07 AM] Dev Mgr: that dont need to be reloaded
[11:26:29 AM] Dev Mgr: when they get resized
[11:26:29 AM] Programmer: i’ll read fusionchart docs
[11:26:32 AM] Dev Mgr: thats
[11:26:33 AM] Dev Mgr: thanks
[11:26:40 AM] Dev Mgr: also Tester 1/Tester 2
[11:26:49 AM] Dev Mgr: i’m not really sure how that was not part of a QA
[11:27:08 AM] Dev Mgr: when you are looking for UI stuff before a release
[11:27:16 AM] Dev Mgr: not everyone has a big ass monitor
[11:27:25 AM] Dev Mgr: please add qa on your laptop screen to the test
[11:27:49 AM] Dev Mgr: this is not a successful launch
[11:28:18 AM] Tester 1: I agree that it should be added as a regular part of our tests–NOW that we know it’s importantl.
[11:28:34 AM] Tester 1: There are any number of things that one could complain about with these pages.
[11:28:59 AM] Tester 1: I strongly suspect that, had Tester 2 or I complained about this to Programmer prior to the launch we would have been seen as horrible nit pickers.
[11:29:16 AM] Dev Mgr: What??????????????
[11:29:26 AM] Tester 1: “Why are you complaining about something so trivial when we’re trying to get this out?”
[11:29:28 AM] Dev Mgr: how can the UI not be improtant
[11:29:34 AM] Tester 1: I’m not saying it’s not.
[11:29:43 AM] Tester 1: I’m saying that there are MANY things that one could pick to complain about.
[11:29:49 AM] Tester 1: The colors are one.
[11:29:58 AM] Tester 1: The fact that many of the graphs don’t have proper labels is another.
[11:30:22 AM] Tester 1: I could come up with more.
[11:31:19 AM] Tester 1: Unfortunately, prior to today, we weren’t aware that resizing would be the important thing.
[11:32:46 AM] Programmer: i agree with Tester 1, on a scale of 1 to 10, resizing is low on the totem pole since there is an almost obvious solution vs. the colors
[11:33:09 AM] Dev Mgr: i strongly disagree
[11:34:38 AM] Tester 1: Dev Mgr, could you have predicted that CEO would be up-in-arms about resizing?
[11:34:53 AM] Tester 2: This was my responsibility, and I will take the hit for it, that’s fine. That being said, however, considering that there were over a dozen pieces to this [specification], and we, as a team, fixed things that were given to us incorrectly in the 1040 from CEO, I would consider this issue is not the barometer or whether or not this was a successful launch.
[11:34:53 AM] Dev Mgr: this isn’t about CEO
[11:35:02 AM] Dev Mgr: i went to BBW on my laptop
[11:35:06 AM] Dev Mgr: and i had to resize it
[11:35:11 AM] Dev Mgr: and it happened to me
[11:36:19 AM] Tester 1: No doubt. But the fact that it happened is distinct from whether or not it is an important enough issue to have said “Let’s hold off on release until this is resolved.”
[11:36:37 AM] Dev Mgr: i would not have released that
[11:36:51 AM] Dev Mgr: USER EXPERIENCE
[11:36:55 AM] Tester 1: I’m very glad that I, Tester 2, and Programmer now know this.
[11:37:05 AM] Dev Mgr: this makes us look bad
[11:37:07 AM] Dev Mgr: to others
[11:37:10 AM] Dev Mgr: its a window into us
[11:37:20 AM] Dev Mgr: people/ clients see this page
[11:37:21 AM] Dev Mgr: we’re a tech company
[11:37:28 AM] Dev Mgr: and we can’t get our allignment right?
[11:38:49 AM] Dev Mgr: you didn’t see it and decide whether or not to push it
[11:38:55 AM] Dev Mgr: you said you never tested for it
[11:39:07 AM] Dev Mgr: did you test in IE?
[11:39:26 AM] Tester 2: yes, but did not resize window from full laptop view
[11:39:37 AM] Tester 1: What I said was that I believe that, HAD we tested it, Programmer, Tester 2, and I would’ve all concluded that it wasn’t a big enough issue to stop the launch.
[11:39:59 AM] Dev Mgr: then we have a bigger issue to discuss separately
[11:40:14 AM] Dev Mgr: image is everything
[11:40:19 AM] Dev Mgr: we are part of a public company
[11:40:25 AM] Dev Mgr: this is not an internal tool
[11:40:37 AM] Dev Mgr: our visual output
[11:40:49 AM] Dev Mgr: is more important to some than the guts that go into it
[11:41:52 AM] Tester 1: I wonder if perhaps the deeper lesson learned here is that sometimes we *need* to get buy-in from CEO prior to going live with stuff.
[11:42:18 AM] Dev Mgr: this isnt’ about CEO
[11:42:29 AM] Dev Mgr: other than it sucks that he found it first
[11:42:54 AM] Tester 1: But many other people aside from CEO saw it and didn’t say anything about it being important.
[11:42:56 AM] Dev Mgr: if you have a question about a UI that you can decide you can use me as a gauge
[11:43:18 AM] Tester 1: My fundamental issue is that it’s very hard for us to be certain that we are truly aware of all the important things.
[11:43:20 AM] Dev Mgr: Tester 1…you all say this wasn’t tested
[11:43:25 AM] Dev Mgr: so that’s all that matters
[11:43:38 AM] Dev Mgr: not if you had what you would have done
[11:43:45 AM] Tester 1: Had we tested it and decided it wasn’t important we’d be in the same boat.
[11:44:07 AM] Tester 1: The fact that we didn’t test it probably means that if we had we wouldn’t have concluded it needed to be fixed prior to launch.
[11:44:49 AM] Tester 1: I’m trying to get at the deeper point:
[11:44:58 AM] Tester 1: We won’t always know what is important to the end users.
[11:49:26 AM] Dev Mgr: well if you dont think that this opens up some eyes to what needs to be accpetable to users than i will now have to approve everything with a UI component
[11:49:55 AM] Dev Mgr: and that is not the right answer…might i add
[11:53:17 AM] Tester 1: The best answer I can give you at this point: We have learned as a result of this. Our testing in the future will be better. Unfortunately we aren’t omniscient, and don’t have unlimited amounts of time, so I fear we will make other mistakes, which we will also learn from.

My Wish List for a Test Case Tracking Tool

Let’s talk for a bit about tracking test cases. Now, before we get bogged down in semantics and hair-splitting, let me point out that I’ve already made my case against them, as have others. I want to focus here on the tracking, so I’ll just speak broadly about the “test case” – which, for the present discussion, is meant to include everything from test “checks” to testing “scenarios“, or even test “charters“.

Speaking provisionally, it’s probably a good idea to keep track of what your team has tested, when it was tested, by whom, and what the results of the test were. Right? I’ll assume for now that this is an uncontroversial claim in the testing world. I’ll venture further and assert that it’s also probably a good idea to keep a record of the clever (and even no-so-clever) testing ideas that strike you from time-to-time but that you can’t do at the moment, for whatever reason. Even more riskily, I’ll assert that, in general, less documentation is preferable to more.

Assuming you agree with the previous paragraph, how are you tracking your sapient testing?

I’ve used a number of methods. They’ve each had their good and bad aspects. All have suffered from annoying problems. I’ll detail those here, then talk about my imagined “ideal” tool for the job, in the hopes that someone can tell me either a) “I’ll build that for you” (ha ha! I know I’m being wacky), or b) “It already exists and its name is <awesome tracking tool x>.”

MS Word (or Word Perfect or Open Office)

The good: Familiar to everyone. Flexible. I’ve used these only when required by (apparently horribly delusional) managers to do so.

The bad: Flat files. Organizational nightmare. I’ve never seen a page layout for a test case template that didn’t make me depressed or annoyed (perhaps this can be chalked up to a personal problem unique to me, though). Updates to “fields” require typing everything out manually, which is time consuming and error-prone.


The good: Familiar to everyone. Flexible. The matrix format lends itself to keeping things relatively organized and sortable. Easy to add new test cases right where they’re most logical by adding a new row where you want it.

The bad: It’s still basically a flat file with no easy way to track history or generate reports.  Long test descriptions look awful in the cells (though in some ways this can be seen as a virtue). Large matrices become unwieldy, encouraging the creation of multiple spreadsheets, which leads to organizational headaches.


The good: Flexible. Generally the wiki tool automatically stores document revision histories. Everyone is always on the same page about what and where the latest version is. Wikis are now sophisticated enough to link to definable bug lists (see Confluence and Jira, for example).

The bad: Still essentially flat files. Barely better than MS Word, really, except for the history aspect.

FileMaker Pro

The good: It’s an actual database! You can customize fields and page layout exactly how you want them without needing to be a DB and/or Crystal Reports expert. I was in love with FileMaker Pro when I used it, actually.

The bad: It’s been a long time since I’ve used it. I stopped when we discovered that it was prone to erasing records if you weren’t careful. I’m sure that bug has been fixed, but I haven’t had a mind to check back. It’s hard to do some things with it that I started seeing as necessary for true usability (I’ll get to those in my wish list below).


This is a proprietary, web-based database tool in use at “Mega-Corp.”

The good: It’s a database. It tracks both bugs and test cases, and links the two, as appropriate. Can store files related to the test case, if needed. Stores histories of the test cases, and you can attach comments to the test case, if needed.

The bad: Slow. Horrible UI. Test team relies on convoluted processes to get around the tool limitations.

Test Director

The good: A database. A tool designed specifically for testers, so it tracks everything we care about, including requirements, bugs, and test cases. It takes screen shots and automatically stores them, making it easy to “prove” that your test case has passed (or failed). Plus, it helps create your entry into the bug database when you fail a test case.

This tool really has come the closest to my ideal tool as anything I’ve used so far.

The bad: The UI for test set organization leaves a lot to be desired. It forces a particular framework that I don’t particularly agree with, though I can see why they made the choices they did. I also think it doesn’t need to be as complicated as they made it. It would be nice to be given the flexibility to strip out the stuff in the UI that I didn’t care about. Lastly, this tool is exorbitantly expensive! Yikes! For it to be useful at all you need to buy enough seats to cover the entire test team plus at least two, for the business analysts and programmers to have access.

My Imaginary Ideal Tool

What I want most…

I want a tool that organizes my tests for me! I want to be able to quickly add a new test to it at any time without worrying about “where to put it.” This is perhaps the biggest failing of flat files. Some tests just defy quick categorization, so they don’t easily “fit” anywhere in your list.

The database format takes care of this problem, to a large degree, to be sure. The tool will have fields that, among other things, specify the type of test (function, data, UI, integration, etc…), the location of the test, both in terms of the layout of the program from the user standpoint, or of what parts of the code it exercises, et cetera.

All that is great, but I’d like the system to go a step beyond that. I want it to have an algorithm that uses things like…

  • the date the test was last executed
  • the date the related source code was updated (note that this implies the tool should be linked to the programmers’ source control tool)
  • the perceived importance and/or risk level of the test and/or the function being tested
  • other esoteric stuff that takes too many words to explain here

I want it to use that algorithm to determine which test in the database is, at this moment, the most important test (or set of tests, if you choose) I could be running. I then want it to serve it up to me. When I accept it by putting it into a “testing” status, the system will know to serve up the next most important test to whatever tester comes along later. Same goes for when I pass or fail the given test. It “goes away” and is replaced by whatever the system has determined is now “most important” according to the heuristic.

The way I see it, what this does for me is free me from the hassle of document maintenance and worrying about test coverage.  The tests become like a set of 3×5 cards all organized according to importance. You can add more “cards” to the stack, as you think of them, and they’re organized for you. You may not have time to get through the whole “pile” before you run out of time, but at least you can be reasonably confident that the tests you did run were the “right” ones.

The other stuff…

Aside from “what I want most,” this list is in no particular order. It’s not exhaustive either, though I tried my best to cover the essentials. Obviously the tool should include all of the “good” items I’ve already listed above.

  • It should have a “small footprint mode” (in terms of both UI and system resources) so it can run while you’re testing (necessary so you can refer to test criteria, or take and store screen shots) but have a minimal impact on the actual test process.
  • As I said above it should link to the programmers’ source control tool, so that when the programmers check in updates to code it will flag all related test cases so you can run them again.
  • It should link to your bug tracking tool (this will probably require that the tool be your bug tracking database, too. Not ideal, but perhaps unavoidable).
  • It should make bug creation easy when a test has failed (by, e.g., filling out all the relevant bug fields with the necessary details automatically). Conversely, it should make test case creation easy when you’ve found a bug that’s not covered by existing test cases, yet.
  • It should be possible to create “umbrella” test scenarios that supersede other test cases, because those tests are included implicitly. In other words, if you pass one of these “über-cases,” the other test cases must be considered “passed” as well, because they’re inherent in the nature of the über-case. The basic idea here is that the tool should help you prevent avoidable redundancies in your testing efforts.
  • Conversely, the failing of a test case linked to one or more über-cases should automatically mark those über-cases as not testable.

I’d love comments and criticisms on all this. Please feel free to suggest things that you’d like to see in your own ideal tool. Maybe someone will actually be inspired to build it for us!


Likely Posting Rates for the Near Future

In May when I started this blog I was working the final weeks as a contractor at a soul-killing corporation on a testing project that was as mind-numbing as it was dysfunctional. I started the blog as a creative outlet for me; a means of venting my frustrations constructively, since I felt like nothing I said at “Mega-Corp” made any difference.

In addition, I thought, I’d soon be back on the job market. The blog might become a good extension to the tired and typical job-seeker’s resume and cover letter. I saw it as a potential means of showcasing my philosophy and thought processes, as well as my writing style and personality, outside the tight confines of a job interview.

I had no expectations beyond that. I figured blog traffic would max out at around a visit a week. Probably those would be my polite and supportive friends, whom I’d pester to check out my latest ramblings (even though they had no interest in testing, software, or epistemology).

Then something funny happened. As near as I can figure it, a friend tweeted about one of the posts. This tweet was apparently seen by Michael Bolton, who presumably read it, liked it, and also tweeted about it. Suddenly there were intelligent comments from strangers (and respected industry celebrities) who were located all over the world. Suddenly posts were being mentioned elsewhere and included in blog carnivals. Suddenly people other than me were tweeting my posts. Wow!!! Who knew there was a large and vibrant testing community out there? Who knew I actually said anything interesting? Suddenly I felt pressure to maintain a consistent output of new, interesting material.

My contract with Mega-Corp ended. Based on the sparse job prospects over the previous six months, I fully expected to be facing a long stretch of unemployment. I have significant savings, so the idea didn’t scare me. In fact, I was genuinely looking forward to it. Aside from now having ample time to write blog posts, I could engage with this newly discovered testing community via Twitter, their own blogs, LinkedIn, the Software Testing Club, and elsewhere. I could spend a few hours a day learning Ruby–something I’d wanted to do for a while but seemed never to have time for.

Although I went to one interview during the first week of unemployment at the behest of the staffing firm I’d been contracting with, I wasn’t particularly interested in looking for work. I jokingly referred to my unemployment as an “involuntary sabbatical.” What little effort I put toward a job search was haphazard and frivolous. The few job listings that turned up were basically of the sort that had been appearing for the previous several months. They fit into one of three categories:

  1. Positions for which I was overqualified
  2. Positions which I knew I could do but I’d never get the interview for, since they listed specific technical requirements I couldn’t in good conscience put on my resume or in my cover letter
  3. Positions that were the software development equivalent of Gitmo prisoner stress positions

Then something funny happened. On day 11 of the sabbatical I got an email from a headhunter asking if I were looking for work. I wrote back and said that I was. She called. We talked for about 20 minutes. I think most of that time was me saying that my technical skills didn’t match what they had on the list of requirements. She said “Let’s submit anyway.” I said, “Sure. What the hell?” I was convinced it would go nowhere and went back to the exercises in my Ruby book. Less than an hour later the headhunter called back and said that the company wanted to interview me the next day at 9 a.m. I said, “Sure. What the hell?”

Armed with the company’s name and address, I started the requisite Googling. I found out that the company’s culture included things like letting people bring their dogs to work, giving everyone a Nerf gun and, most importantly to me, no dress code (based on photos on the company’s blog, shorts and flip-flops were standard fare, so my Vibrams would fit right in). So far, so good. Even better, the company was apparently wildly profitable and newly purchased by a larger firm, also profitable. No worries about job evaporation due to investor indictment!

I’ve been on a lot of job interviews this year. In all of them I felt a lack of control, like I was being forced to justify myself or excuse myself. For this one, though, I decided to take a different tack, since I truly didn’t care if I got the job or not. I took a copy of the advertised job requirements with me and went through them line-by-line with the interviewer, saying “What do you mean by X? My current experience with it is limited. I have no doubt I can learn it, but if it’s really important to you, then I’m probably not your guy.” I must’ve said some variant of that a half dozen times. I felt like I was trying to talk them out of picking me.

Somehow the interview lasted three hours. They told me they were going to talk with two more people, but that they wanted to move fast on a decision, so I would know either way by the next day. I could tell they liked me. For my part, the company struck me as a happy place, and what they wanted to hire me for seems to have become my own career specialty: Use your skills and expertise to do whatever is necessary to create a test department where there is none. As I was driving home I was thinking, “Dammit! I may have to cut my sabbatical short.”

I got a call from the headhunter less than two hours later. They were offering me the job. They wanted me to start tomorrow, if I was willing. I agonized over the decision for most of the afternoon. Three to six months of taking it easy, blogging, and learning Ruby, while looking for the perfect job–I had a really hard time giving up this romantic notion, but it seemed like the perfect job had already arrived, just way ahead of schedule. What if I turned it down and the next one didn’t come along for another year, well after my savings had evaporated?

I took the job.

This post has turned into something much more long-winded and shamelessly self-indulgent than I imagined it would be. Thanks for putting up with it. My only point has been to explain that my new job responsibilities over the coming weeks will probably sap my time and my creative energies. The testing problem I’ve been given is very interesting, and I need to focus on how to solve it. So, for the next few weeks, at least, there’s little chance I’ll be writing a post per week. I can’t imagine, though, that it will be too long before I feel a strong urge to vent again.


Requirements: Placebo or Panacea?

Perhaps the definitive commentary on "requirements"I vividly remember the days when I matured as a tester. I was the fledgling manager of a small department of folks who were equal parts tester and customer support. The company itself was staffed by primarily young, enthusiastic but inexperienced people, perhaps all given a touch of arrogance by the company’s enormous profitability.

We had just released a major update to our software–my first release ever as a test manager–and we all felt pretty good about it. For a couple days. Then the complaints started rolling in.

“Why didn’t QA find this bug?” was a common refrain. I hated not having an answer.

“Well… uh… No one told me the software was supposed to be able to do that. So how could we know to test for it? We need more detailed requirements!” (I was mindful of the tree cartoon, which had recently been shared around the office, to everyone’s knowing amusement.)

The programmers didn’t escape the inquisition unscathed, either. Their solution–and I concurred–was, “We need a dedicated Project Manager!”

Soon we had one. In no time, the walls were papered with PERT charts. “Critical path” was the new buzzword–and, boy, did you want to stay off that thing!

You couldn’t help but notice that the PERT charts got frequent revisions. They were pretty, and they gave the impression that things were well-in-hand; our path forward was clear. But they were pretty much obsolete the day after they got taped to the wall. “Scope creep” and “feature creep” were new buzzwords heard muttered around the office–usually after a meeting with the PM. I also found it odd that the contents of the chart would change, but somehow the target release date didn’t move.

As for requirements, soon we had technical specs, design specs, functional specs, specs-this, specs-that… Convinced that everything was going to be roses, I was off and running, creating test plans, test cases, test scripts, bla bla…

The original target release date came and went, and was six months gone before the next update was finally shipped. Two days later? You guessed it! Customers calling and complaining about bugs with things we’d never thought to test.

Aside from concluding that target release dates and PERT charts are fantasies, the result of all this painful experience was that I came to really appreciate a couple things. First, it’s impossible for requirements documents to be anything close to “complete” (yes, a heavily loaded word if I’ve ever seen one. Let’s provisionally define it as: “Nothing of any significance to anyone important has been left out”). Second, having document completeness as a goal means spending time away from other things that are ultimately more important.

Requirements–as well as the team’s understanding of those requirements–grow and evolve throughout the project. This is unavoidable, and most importantly it’s okay.

Apparently, though, this is not the only possible conclusion one can reach after having such experiences.

Robin F. Goldsmith, JD, is a guy who has been a software consultant since 1982, so I hope he’s seen his fair share of software releases. Interestingly, he asserts here that “[i]nadequate requirements overwhelmingly cause most project difficulties, including ineffective ROI and many of the other factors commonly blamed for project problems.” Here, he claims “The main reason traditional QA testing overlooks risks is because those risks aren’t addressed in the system design… The most common reason something is missing in the design is that it’s missing in the requirements too.”

My reaction to these claims is: “Oh, really?”

How do you define “inadequate” in a way that doesn’t involve question begging?
How do you know it’s the “main” reason?
What do you mean by “traditional QA testing”?

Goldsmith addresses that last question with a bit of a swipe at the context-driven school:

Many testers just start running spontaneous tests of whatever occurs to them. Exploratory testing is a somewhat more structured form of such ad hoc test execution, which still avoids writing things down but does encourage using more conscious ways of thinking about test design to enhance identification of tests during the course of test execution. Ad hoc testing frequently combines experimentation to find out how the system works along with trying things that experience has shown are likely to prompt common types of errors.

Spontaneous tests often reveal defects, partly because testers tend to gravitate toward tests that surface commonly occurring errors and partly because developers generally make so many errors that one can’t help but find some of them. Even ad hoc testing advocates sometimes acknowledge the inherent risks of relying on memory rather than on writing, but they tend not to realize the approach’s other critical limitations.

By definition, ad hoc testing doesn’t begin until after the code has been written, so it can only catch — but not help prevent — defects. Also, ad hoc testing mainly identifies low-level design and coding errors. Despite often being referred to as “contextual” testing, ad hoc methods seldom have suitable context to identify code that is “working” but in the service of erroneous designs, and they have even less context to detect what’s been omitted due to incorrect or missing requirements.

I’m not sure where Goldsmith got his information about the Context-driven approach to testing, but what he’s describing ain’t it! See here and here for much better descriptions.

Goldsmith contrasts “traditional QA testing” with something he calls “proactive testing.” Aside from “starting early by identifying and analyzing the biggest risks,” the proactive tester

…enlists special risk identification techniques to reveal many large risks that are ordinarily overlooked, as well as the ones that aren’t. These test design techniques are so powerful because they don’t merely react to what’s been stated in the design. Instead, these methods come at the situation from a variety of testing orientations. A testing orientation generally spots issues that a typical development orientation misses; the more orientations we use, the more we tend to spot. [my emphasis]

What are these “special risk identification techniques”? Goldsmith doesn’t say. To me, this is an enormous red flag that we’re probably dealing with a charlatan, here. Is he hoping that desperate people will pay him to learn what these apparently amazing techniques are?

His advice for ensuring a project’s requirements are “adequate” is similarly unhelpful. As near as I can figure it, reading his article, his solution amounts to making sure that you know what the “REAL” [emphasis Goldsmith] requirements are at the start of the project, so you don’t waste time on the requirements that aren’t “REAL”.

Is “REAL” an acronym for something illuminating? No. Goldsmith says he’s capitalizing to avoid the possibility of page formatting errors. He defines it as “real and right” and “needed” and “the requirements we end up with” and “business requirements.” Apparently, then, “most” projects that have difficulties are focusing on “product requirements” instead of “business requirements.”

Let’s say that again: To ensure your requirements are adequate you must ensure you’ve defined the right requirements.

I see stuff like this and start to wonder if perhaps my reading comprehension has taken a sudden nose dive.


Interview with a CEO

“So, tell me: How are you going to guarantee the accuracy and integrity of the data?” he asked.

I glanced at the clock on the wall: 2:25 p.m. The CEO and I had been talking since 2:00, and he had to be at his next meeting in 5 minutes.

I felt frozen, like a tilted pinball machine. For a moment I wasn’t even sure I’d heard the question right. He couldn’t seriously be asking a tester for… whaaa??? I could feel my adrenal glands dumping their contents into my blood stream.

“This is the moment,” I thought. “The point when this interview goes South.”

Part of me wanted to simply stand up, shake the CEO’s hand, thank him for the opportunity, and walk out. I could still salvage a nice afternoon before I had to be back at the airport.

Time seemed to slow to an agonizing crawl. Involuntarily, I pondered the previous 12 hours…

2:45AM Wake up, shower… 3AM Dress up in suit and tie (20 minutes devoted to fighting with tie)… 3:45AM Drive to airport… 5AM Sit in terminal… 6AM Board flight to San Francisco… 9AM Arrive SFO… 9:15AM Sit in (completely stationary) BART train… 10:00AM Miss Caltrain connection… 10:30AM Arrive at office, thanks to a ride from their helpful administrative assistant… 10:45AM Interview with the head of products… 11:30AM Interview with the head of development… 12:15PM Lunch… 2:00PM Interview with CEO…

As the epinephrine circulated through my body, creating a sensation akin to somersaulting backwards, I began to feel resentful. I’d flown there on my own dime, after having already talked with these guys by phone for several hours. I was under the impression that the trip would be more of a “meet & greet the team” social hour. Not a repeat of the entire interview process, from square one. The Head of Products had given me several assurances that I was his top choice and that they’d only be asking me to fly out if the position were essentially mine to refuse.

So, there I was. The CEO sat across the table from me, expecting an answer.

What I wanted to say was that I was in no position to guarantee anything of the sort, given my radical ignorance of the data domain, the data’s source(s), the sources’ track record(s) for accuracy, or how the data get manipulated by the in-house systems.

What I wanted to say was that his question was prima facie absurd. That I, as a tester, couldn’t “guarantee” anything other than that I would use my skills and experience to find as many of the highest risk issues as quickly as possible in the given time frame. However, when you’re dealing with any black box, you can’t guarantee that you’ve found all the problems. Certainty is not in the cards.

What I wanted to say was that anyone who sat in front of the CEO claiming that they could guarantee the data’s accuracy and integrity was clearly a liar and should be drummed out of the profession of software testing.

I wanted to say all that and more, but I didn’t. Given the day’s exhausting schedule, all these thoughts were little more than fleeting, inchoate, nebulous impressions. Plus, it seemed highly unlikely that the CEO, who struck me as an impatient man (your typical “Type A” personality), would be interested in spending the remaining 4 or 5 minutes discussing epistemology with me. Honestly, I’m not sure what I said, exactly. The question, and the CEO’s demeanor while asking it, had drained away any enthusiasm I had for the position. In all likelihood, my response was along the lines of “I have no idea how to answer that question.”

Whatever I said, it was obviously not how to impress an MBA from Wharton. I didn’t get offered the job.


Irreverence Versus Arrogance

Everything sacred is a tie, a fetter.
— Max Stirner

I am an irreverent guy. I’m a fan of South Park and QA Hates You, for example. Furthermore, I think it’s important–nay, essential–for software testers to cultivate a healthy irreverence. Nothing should be beyond question or scrutiny. “Respecting” something as “off limits” (also known as dogmatism) is bound to lead to unexamined assumptions, which in turn can lead to missed bugs and lower quality software. If anything, I think testers should consider themselves akin to the licensed fools of the royal court: Able–and encouraged–to call things as they see them and, especially, to question authority.

Contrast that with arrogance–an attitude often confused with irreverence. The distinction between them may be subtle, but it is key. Irreverence and humility are not mutually exclusive, whereas arrogance involves a feeling of smug superiority; a sense that one is “right.” Arrogance thus contains a healthy dose of dogmatism. The irreverent, on the other hand, are comfortable with the possibility that they’re wrong. They question all beliefs, including their own. The arrogant only question the beliefs of others.

I pride myself (yes, I am being intentionally ironic, here) on knowing this difference. So, it pains me to share the following email with you. It’s an embarrassing example of a moment when I completely failed to keep the distinction in mind. Worse, I had to re-read it several times before I could finally see that my tone was indeed arrogant, not irreverent, as I intended it. I’ll spare you my explanations and rationalizations about how and why this happened (though I have a bunch, believe me!).

The email–reproduced here unmodified except for some re-arranging, to improve clarity–was meant only for the QA team, not the 3rd-party developer of the system. In a comedy of errors and laziness it ended up being sent to them anyway. Sadly, I think its tone ensured that none of the ideas for improvements were implemented.

After you’ve read the email, I invite you to share any thoughts you have about why it crosses the line from irreverence into arrogance. Naked taunts are probably appropriate, too. On the other hand, maybe you’ll want to tell me I’m wrong. It really isn’t arrogant! I won’t hold my breath.

Do you have any stories of your own where you crossed the line and regretted it later?

The user interface for OEP has lots of room for improvement (I’m trying to be kind).

Below are some of my immediate thoughts while looking at the OEP UI for the front page. (I’ll save thoughts on the other pages for later)

1. Why does the Order Reference Number field not allow wildcards? I think it should, especially since ORNs are such long numbers.

2. Why can you not simply click a date and see the orders created on that date? The search requires additional parameters. Why? (Especially if the ORN field doesn’t allow wildcards!)

3. Why, when I click a date in the calendar, does the entire screen refresh, but a search doesn’t actually happen? I have to click the Search button. This is inconsistent with the way the Process Queue drop down works. There, when I select a new queue, it shows me that instantly. I don’t have to click the “Get Orders” button.

5. What does “Contact Name” refer to? When is anyone going to search by “Contact Name”? I don’t even know what a Contact Name is! Is it the patient? Is it the OEP user???

Click for full size

Click for full size

4. In fact, I *never* have to click the Get Orders button. Why is it even there on the screen?

6. Why waste screen space with a “Select” column (with the word “Select” repeated over and over again–this is UGLY) when you could eliminate that column and make the Order Reference number clickable? That would conserve screen space.

7. Why does OEP restrict the display list to only 10 items? It would be better if it allowed longer lists, so that there wouldn’t need to be so much searching around.

8. Why are there “View Notes” links for every item, when most items don’t have any notes associated with them? It seems like the View Notes link should only appear for those records that actually have notes.

9. Same question as above, for “Show History Records”.

10. Also, why is it “Show History Records” instead of just “History”, which would be more elegant, given the width of the column?

11. Speaking of that, why not just have “History” and “Notes” as the column headers, and pleasant icons in those rows where History or Notes exist? That would be much more pleasing to the eye.

Click for full size

Click for full size

12. In the History section, you have a “Record Comment” column and an “Action Performed” column. You’ll notice that there is NEVER a situation where the “Action Performed” column shows any useful information beyond what you can read in the “Record Comment” field. Why include something on the screen if it’s not going to provide useful information to the user?

For example:

Record Comment: Order checked out by user -TSIAdmin-
Action Performed: CheckOut

That is redundant information.

In addition to that, in this example the Record Create User ID field says “TSIAdmin”. That’s more redundant information.

There must be some other useful information that can be put on this screen.

13. Why does the History list restrict the display to only 5 items? Why not 20 items? Why not give the user the option to “display all on one page”?

Click for full size

Click for full size

14. In Notes section of the screen, the column widths seem wrong. The Date and User ID columns are very wide, leaving lots of white space on the screen.


Estimating Testing Times: Glorified Fortune-Telling?


Hofstadter’s Law:
It always takes longer than you
expect, even when you take
into account Hofstadter’s Law.

Douglas Hofstadter

A good friend of mine is a trainer for CrossFit, and has been for years. For a long time he trained clients out of his house, but his practice started outgrowing the space. His neighbors were complaining about the noise (if you’ve ever been in a CrossFit gym you can easily imagine that they had a point). Parking was becoming a problem, too.

So, in September, 2009, he rented a suite for a gym, in a building with an excellent location and a gutted interior–perfect for setting up the space exactly how he wanted it. It needed new flooring, plumbing, framing, drywall, venting, insulation, dropped ceiling, electricity, and a few other minor things. At the time, he told me they’d be putting the finishing touches on the build-out by mid-December. I remember thinking, “Wow. Three months. That’s a long time.”

As it turned out, construction wasn’t completed until late June, 2010, Seven months later than originally estimated.

Let’s think about that. Here’s a well-defined problem, with detailed plans (with drawings and precise measurements, even!) and a known scope, not prone to “scope creep.” The technology requirements for this kind of project are, arguably, on the low side–and certainly standardized and familiar. The job was implemented by skilled, experienced professionals, using specialized, efficiency-maximizing tools. And yet, it still took more than 3 times longer than estimated.

Contrast that with a software project. Often the requirements are incomplete, but even when they’re not, they’re still written in words, which are inherently ambiguous. What about tools? Sometimes even those have to be built, or existing tools need to be customized. And the analogy breaks down completely when you try to compare writing a line of code (or testing it) with, for example, hanging a sheet of drywall. Programmers are, by definition, attempting something that has never been done before. How do you come up with reasonable estimates in this situation?

This exact question was asked in an online discussion forum recently. A number of self-described “QA experts” chimed in with their answers. These all involved complex models, assumptions, and calculations based on things like “productivity factors,” “data-driven procedures,” “Markov chains,” etc. My eyes glazed over as I read them. If they weren’t all committing the Platonic fallacy then I don’t know what it is.

Firstly, at the start of any software project you are, as Jeffrey Friedman puts it, radically ignorant. You do not know what you do not know. The requirements are ambiguous and the code hasn’t even been written yet. This is still true for updates to existing products. You can’t be certain what effect the new features will have on the existing ones, or how many bugs will be introduced by re-factoring the existing features. How can you possibly know how many test cases you’re going to need to run? Are you sure you’re not committing the Ludic Fallacy when you estimate the “average time” per test case? Even if you’ve found the perfect estimation model (and how would you know this?), your inputs for it are bound to be wrong.

To attempt an estimate in that situation is to claim knowledge that you do not possess. Is that even ethical?

Secondly, your radical ignorance goes well beyond what the model’s inputs should be. What model takes into account events like the following (all of which actually happened, on projects I’ve been a part of)?

  1. The database containing the company’s live customer data–all of it–is inadvertently deleted by a programmer who thought at the time that he was working in the developer sandbox.
  2. The Director of Development, chief architect of the project, with much of the system design and requirements kept only in his head, fails to appear at work one day. Calls to his home go unanswered for two weeks. When someone finally gets in touch with him he says he won’t be coming back to work.
  3. A disgruntled programmer spends most of his time putting derogatory easter eggs in the program instead of actually working. When found by a particularly alert tester (sadly I can’t claim it was me) the programmer is fired.
  4. A version of the product is released containing an egregious bug, forcing the company to completely reassess  its approach to development (and blame the testers for missing the “obvious” bug, which then destroys morale and prompts a tester to quit).
  5. The company’s primary investor is indicted for running a ponzi scheme. The majority of the employees are simply let go, as there is not enough revenue from sales to continue to pay them.

The typical response from the “experts” has been, “Well, that’s where the ‘fudge factor’ comes in, along with the constant need to adjust the estimate while the project is underway.”

To that I ask, “Isn’t that just an implicit admission that estimates are no better than fortune-telling?”

I heard from Lynn McKee recently that Michael Bolton has a ready answer when asked to estimate testing time: “Tell me what all the bugs will be, first, then I can tell you how long it will take to test.”

I can’t wait to use that!


Testing’s Quiet Evidence

Since my post about Nassim Taleb’s The Black Swan I’ve continued to muse about the book and its ideas. In particular, while thinking about the notion of “silent evidence,” I realized there was a connection to testing that I hadn’t noticed before. I won’t flatter myself by thinking I missed the link due to some inherent subtlety in the concept. More likely it’s because I already associated the idea with something else: get-rich-quick schemes. Bear with me for a minute while I explain…

A few years ago I was consumed with debunking network marketing companies and real estate investing scams (If you’re genuinely curious why, I explain it all here). My efforts were focused primarily on a real estate “guru” who lives nearby, in Glendale, AZ. As with most con artists, his promotional materials include dozens of narrative fallac–uh… testimonials from former “students” who claim they achieved “great success” using the investment methods he teaches.

Of course, what the “guru” doesn’t promote are the undoubtedly much higher number of  people who attended his “boot camps” or bought his materials and either, A) did nothing, or B) tried his techniques and lost money (On the rare occasions such people are even mentioned, there’s always a ready explanation for them: The blame lies not with the technique, but with the practitioner. Voilà! The scammer has just removed your ability to falsify his claims!). Taleb calls these people the “silent evidence.” To ignore them when evaluating a population (in this case, the customers of a particular guru) is to engage in survivorship bias and miscalculate what you’re trying to measure. Con artists of all stripes make millions encouraging their customers to do this.

Testers are not con artists (as a rule, I mean), but we do have something that, while perhaps not silent, should be considered at least very quiet. In contrast to the scammers, it’s not to our advantage that it stay quiet. In fact, I’m starting to wonder if keeping it quiet is not at least in part to blame for some of the irrational practices you find in dysfunctional test teams, such as the obsession with test cases.

What am I talking about?

I am referring, dear reader, to all the bugs that were found and fixed prior to release. All those potentially serious issues that were neutralized before they could do any damage. No one thinks about them, because they don’t exist, except as forgotten items in a database no one cares about any more. But they’re there–hundreds, maybe thousands of them–quietly paying tribute to averted disasters, maintained reputations, even saved money (hence, why I call them “quiet” rather than silent: they’re still there if you look for them).

Meanwhile, the released product is out in the world, exposing its inevitable and embarrassing flaws for all to see, prompting CEOs and sales teams to wonder, “What are those testers doing all day? Why aren’t they assuring quality?” Note that this reaction is precisely the survivorship bias I mentioned above. The error causes them to undervalue the test team, in a way exactly analogous to how dupes of the real estate gurus overvalue the guru.

Okay, so what to do about this? I confess that, as yet, I do not know. Right now all I can say is it behooves us as testers to come up with ways of better publicizing the bugs that we find–to turn our quiet evidence into actual evidence. As to how to go about that, well, I’m open for suggestions.


Assumptions Are Dangerous Things

Sometimes – heh, usually – your product oracles are vague.  There’s little or no user manual, requirements specification, online help, tool tips, etc. In these situations I generally become “Annoying Question Man,” constantly badgering the programmers, project managers, sales team, or anyone else, asking: “How’s this thing supposed to work?” I’ve learned from hard experience that assumptions about expectations can and will come back to bite you and quite often, unfortunately, it’s difficult to know that you’re even making them.

It seems I am doomed to learn this particular lesson over and over again.

Last week I took a trip to the Great Smoky Mountains of Tennessee. My sole purpose for going there was to see their synchronous fireflies. Sadly, I was late to the show by about 2 weeks. They were nowhere to be found. I was crushed!

My experience with fireflies comes from living in Northern Virginia for several years, where the fireflies all appear between roughly June 20th and July 5th. Since (you’ll notice) this page – as well as several others I looked at while I was researching for the trip back in December – does not give dates, I assumed that meant they’d be on the same schedule. All fireflies are the same, right? That must be why no one mentions specific dates, right? Uh, apparently not. I’m often wrong about things, but most of the time it doesn’t sting this bad! “I should have known!” has been my mantra the last few days.

In an effort to learn what I may have done differently to prevent this mistake I went looking everywhere (meaning in nearby Gatlinburg and the Park’s visitor’s center) for information on the fireflies. I could find nothing. This was weird. I can’t be the only one who thinks these things sound super cool to see! I started wondering if I wasn’t suffering from some sort of psychological self-protection mechanism, forcing me to miss all the signs proving that I was a dummy and it was all my fault. Confirmation bias writ large and very pathological.

Regardless of blame, however, I literally missed the bugs because of my hidden assumptions!


The Black Swan

A Black SwanI hate the idea of writing a book review for a post. Somehow it strikes me as cheap and lazy to rely so heavily on the work of others for content, particularly when my blog is so new. Shouldn’t I be concerned with sharing my own thoughts instead of parroting the thoughts of others?

Even worse: choosing Nassim Taleb’s The Black Swan (second edition, just released a few weeks ago) as the review’s subject matter. Taleb has notorious disdain for reviewers, many of whom seem to either miss his message entirely* or distort it in some consequential fashion. Given the book’s Kolmogorov complexity, any attempt at encapsulation is bound to leave out something significant (in contrast to the easily summerizable journalistic “idea book of the week” that excites the MBAs and is the intellectual equivalent of fast food. Anyone remember Who Moved My Cheese??).

I think the book’s message is important, and Taleb, being a champion of the skeptical empiricist, says a great deal that should excite and inspire the software tester. So, I’m willing to risk appearing lazy, but let’s not call this post a review so much as a somewhat desultory sampler. The Black Swan is a philosophical essay that is both dense and broad, and explores many interesting ideas–irreverently, I might add. My aim here will be to stick to those ideas that pertain to testing. I’ll leave the rest for you to discover on your own if you should decide to pick up a copy of the book for yourself.

The Black Swan

“All swans are white.”

Before 1697, you could say this, and every sighting of another swan would add firmness to your conviction of its “truth”. But then Europeans discovered a black swan in Western Australia. A metaphor for the problem of induction was born.

Taleb’s Black Swan (note the capitalization) is distinct from the philosophical issue, however. I’ll let Taleb define it:

First, it is an outlier, as it lies outside the realm of regular expectations, because nothing in the past can convincingly point to its possibility. Second, it carries an extreme impact (unlike the bird). Third, in spite of its outlier status, human nature makes us concoct explanations for its occurrence after the fact, making it explainable and predictable.

I stop and summarize the triplet: rarity, extreme impact, and retrospective (though not prospective) predictability. A small number of Black Swans explain almost everything in our world, from the success of ideas and religions, to the dynamics of historical events, to elements of our own personal lives. [emphasis original]

I’m confident that you can already see where this applies in the world of software. A Black Swan would be any serious bug that made it into a released product and caused some sort of harm–either to customers or the company’s reputation (or both!).

Toyota’s recent brake system problems are a perfect example. Clearly they didn’t see this coming, and it’s cost them an estimated $2 billion.  You can bet they’re trying to figure out why they didn’t catch the problem earlier–and why they should have–and how to prevent similar problems in the future.

And there’s the rub! The problem with Black Swans is that they are unpredictable by nature. Reality has “epistemic opacity”, says Taleb, owing to various inherent limitations to our knowledge, coupled with how we often deal erroneously with the information we do have. Toyota might spend billions ensuring that their cars will never have brake problems of any kind ever again, only, perhaps, to find one day that, in certain rare situations, their fuel system catches fire. It happens precisely because it’s not planned for.

So, what can we, as testers, do about the Black Swans we might face? The Black Swan counsels primarily how not to deal with them, and Taleb openly laments the typical reaction to his “negative advice.”

…[R]ecommendations of the style “Do not do” are more robust empirically [see “Negative Empiricism,” below]. How do you live long? By avoiding death. Yet people do not realize that success consists mainly in avoiding losses, not in trying to derive profits.

Positive advice is usually the province of the charlatan [see “Narrative Fallacy,” below]. Bookstores are full of books on how someone became successful [see “Silent Evidence,” below]; there are almost no books with the title What I Learned Going Bust, or Ten Mistakes to Avoid in Life.

Linked to this need for positive advice is the preference we have to do something rather than nothing, even in cases when doing something is harmful. [emphasis original]

I’m reminded of a consulting gig where I explained to the test team’s managers that their method for tracking productivity was invoking Goodhart’s Law and was thus worse than meaningless, since it encouraged counterproductive behavior in the team. The managers agreed with my analysis, but did not change their methodology. After all, they said, they were required to report something to the suits above them. They didn’t seem to have an ethical problem with tracking numbers that they knew were bullshit.


The ancient Greek philosopher Plato had a theory that abstract ideas or “Forms,” such as the idea of the color red, were the highest kind of reality. He believed that Forms were the only means to genuine knowledge. The error of Platonicity, then, as defined by Taleb, is

…our tendency to mistake the map for the territory, to focus on pure and well-defined “forms,” whether objects, like triangles, or social notions, like utopias (societies built according to some blueprint of what “makes sense”), even nationalities. When these ideas and crisp constructs inhabit our minds, we privilege them over other less elegant objects, those with messier and less tractable structures…

Platonicity is what makes us think that we understand more than we actually do. But this does not happen everywhere. I am not saying that Platonic forms don’t exist. Models and constructions, these intellectual maps of reality, are not always wrong; they are wrong only in some specific applications. The difficulty is that a) you do not know beforehand (only after the fact) where the map will be wrong, and b) the mistakes can lead to severe consequences. These models are like potentially helpful medicines that carry random but very severe side effects.

The error of platonification has a lot in common with the error of reification, but there is a subtle difference. Platonification doesn’t require that you believe your model is real (as in, “concrete”), only that it is accurate.

Again I’m sure you’re already thinking of ways this applies in software testing. You build a model of a system you’re testing. Soon you forget that you’re using a model and become blind to scenarios that might occur outside of it. Even worse, you write a few hundred test cases based on your model and convince yourself that, once you’ve gone through them all, you’ve “finished testing.”

Negative Empiricism

I mentioned above that The Black Swan is almost entirely advice about what not to do. However, in the chapter he devotes to confirmation bias and its brethren, Taleb introduces the heuristic of “falsification.” I hope you’ll forgive my quoting rather liberally from the section, here. He seems, for a moment, to be speaking directly to software testers:

By a mental mechanism I call naïve empiricism, we have a natural tendency to look for instances that confirm our story and our vision of the world – these instances are always easy to find. Alas, with tools, and fools, anything can be easy to find. You take past instances that corroborate your theories and you treat them as evidence. For instance, a diplomat will show you his “accomplishments,” not what he failed to do. Mathematicians will try to convince you that their science is useful to society by pointing out instances where it proved helpful, not those where it was a waste of time, or, worse, those numerous mathematical applications that inflicted a severe cost on society owing to the highly unempirical nature of elegant mathematical theories.

The good news is that there is a way around this naïve empiricism. I am saying that a series of corroborative facts is not necessarily evidence. Seeing white swans does not confirm the nonexistence of black swans. There is an exception, however: I know what statement is wrong, but not necessarily what statement is correct. If I see a black swan I can certify that all swans are not white!

This asymmetry is immensely practical. It tells us that we do not have to be complete skeptics, just semiskeptics. The subtlety of real life over the books is that, in your decision making, you need to be interested only in one side of the story: if you seek certainty about whether the patient has cancer, not certainty about whether he is healthy, then you might be satisfied with negative inference, since it will supply you the certainty you seek. So we can learn a lot from data – but not as much as we expect. Sometimes a lot of data can be meaningless; at other times one single piece of information can be very meaningful. It is true that a thousand days cannot prove you right, but one day can prove you to be wrong.

The person who is credited with the promotion of this idea of one-sided semiskepticism is Sir Doktor Professor Karl Raimund Popper, who may be the only philosopher of science who is actually read and discussed by actors in the real world (though not as enthusiastically by professional philosophers)… He writes to us, not to other philosophers. “We” are the empirical decision makers who hold that uncertainty is our discipline, and that understanding how to act under conditions of incomplete information is the highest and most urgent human pursuit. [emphasis original]

It always rankles when I hear someone (who is – usually – not a tester) declare something like “We need to prove the program works.” Obviously anyone who says this has a fundamental misconception of what is actually possible. And how many times has a programmer come to you claiming that he tested his code and “the feature works” – but you discover after only a couple tests that his “tests” were within only a narrow range, outside of which the feature breaks immediately?

All The Rest

I’ve only touched on a very small part of the contents of The Black Swan, but hopefully enough to convince you that it’s required reading for software testers. I’ll close the post with short descriptions of a few of the bigger ideas in the book that I skipped:

  • Mediocristan – A metaphorical country where deviations from the median are small and relatively rare, and those deviations can’t meaningfully affect the total. Think heights and weights of people. Black Swans aren’t possible here.
  • Extremistan – A metaphorical country where Black Swans are possible, because single members of a population can affect the aggregate. Think income or book sales.
  • Ludic Fallacy – Roughly speaking, the belief that you’re dealing with a phenomenon from Mediocristan when it’s actually from Extremistan. The Ludic Fallacy is a special case of the Platonic Fallacy.
  • Narrative Fallacy – The tendency to believe or concoct explanations that fit a complicated set of historical facts because they sound plausible. Conspiracy theories are only a small facet of this. These narratives cause us to think that past events were more predictable than they actually were. We become, as Taleb puts it, “Fooled by Randomness.”
  • Silent Evidence – That part of a population that is ignored because it is “silent,” meaning either difficult or impossible to see. We see all the risk-takers who succeeded in business, but not all risk-takers who failed. The result is the logical error called survivorship bias.

*An example of this is found in the quote from GQ magazine that appears, ironically, on the front cover of the book itself: “The most prophetic voice of all.” Taleb’s point is to be wary of anyone who claims he can predict the future. He says of himself, “I know I cannot forecast.”