Testing pattern: don’t test too much at once

This has been said before, I know, but it’s worth re-iterating: a test should test one thing, and one thing only.

First, some scope definition. Using Kent Beck’s terminology, I’m talking about developer tests, not acceptance tests. Also, by one thing, I mean that there should be only one thing that breaks the test (which is very different from saying any failure should only break one test…). In addition, the one thing that breaks should provide diagnostic information – a test failure shouldn’t leave you scratching your head to determine the immediate cause.

Problem Definition

The example I’m going to use is a real-world one from a conversation I had recently. The scenario is this: there’s a scheduled task that occurs each day. If it fails, it needs to send an email notification. The email notification, in turn, needs to be generated from a template using parameters from the scheduled task. It’s reasonable to assume (based on the fact that they are widely used and have their own tests) that the scheduler works, that the templating mechanism will work (provided that the template and data are okay), and that the email delivery will work. How to test this?

One way, of course, is to simply run the task manually, force it to fail, wait a little bit, then look in a mail box, get the mail, and examine it. What’s wrong with this approach?

Well, one problem is that an undelivered email provides no diagnostic information. But the real problem is that there are lots of reasons this can break, which makes the test excessively fragile.

What can go wrong?

So, why can this example break? Let’s see what sort of situations could occur…

  • Maybe the task wasn’t kicked off. Low probability – there would be tests around the task’s “happy case”, and the mechanism would be the same. Assume that’s a “can’t happen” situation.
  • Maybe the task didn’t fail. Easy enough to imagine… causing code to break on purpose can be tricky (though it happens by accident easily enough)
  • Maybe the email template doesn’t exist. Depending on the situation, this may range from likely (the template is looked up by a string name, which may or may not match the actual name of the template) to unlikely (the template name is provided by a defined enum, and there are tests that verify that all the values in the enum have corresponding templates). In this situation, this was a likely possibility.
  • Maybe the wrong template was picked; this would result in the wrong email being sent.
  • Maybe the template was badly formed. This is actually very likely if you don’t have a verification mechanism for the template, and if it’s being developed as part of the work-at-hand.
  • Maybe the parameters passed into the templating process were wrong. This could result in a failure to process the template, or maybe it would just result in an email that doesn’t match the expected message.
  • Maybe the generated email wasn’t handed over to the delivery service.
  • Maybe the email was sent to the wrong address.
  • Or it could be that the email was made fine, sent fine, but some environmental issue prevented it from working (like the SMTP server crashed, or the mail spool overflowed, and so forth).

So, that’s nine reasons off the top of my head that could cause the test to fail – eight if you discount the first one, which admittedly I intend to. In addition, we should probably discount the environmental issues; instead, our tests should be reliable without a dependency on a potentially broken environment. So there are seven issues that we need to be able to test for.

Just to re-iterate, we are assuming that the following things actually do work:

  • The task can be kicked off okay.
  • The scheduler works just fine, and is configured correctly.
  • Templates in general can be processed.
  • Emails can be delivered.

This list, by the way, is an example of the XP principle of “Test Everything That Can Possibly Break”. We’ve determined what can possibly break, and what can’t (because we’ve got good tests around those areas already). This means we need to have at least one test for all of the “possible break” situations.

So how should this be tested?

Now, I don’t have a brain the size of a planet, so I’m not sure how I can write one good test that can fail for seven different reasons. In fact, I’m not sure how I can do it in fewer than seven tests. I might even need a few more, depending on what sort of variations I want to try (for example, I might want to look into different parameter sets). So, what would the tests look like? Well, let’s tackle them in order.

  1. You need a test to ensure that you can know that the task failed. How you do this depends on your design, but I would recommend the Observer pattern here. You can then drop your test in as an observer to the task and see the failure occur. Of course, for this to work, the email notification needs to be set up as an observer as well… it’s a bit pointless to use one mechanism in a test to notice the task failed, and use a different one in the actual code (not to mention a failure of the DRY principle).
  2. Making sure the template exists is a trickier one. The approach I would prefer is the one I hinted at earlier: define the template name in an enum, then have a test that verifies all the templates in the enum exist. As an alternative, however, you could use a constant for the template name, and verify just that template; the problem here is that, as the number of templates increases, you end up with lots of tests essentially doing the same thing. Please note that I’m saying “verify the template exists”, not saying “fetch the template”.
  3. Picking the right template is a little trickier… To be honest, I’m not sure I’d bother testing this one. If the template choice is hard coded, then I’d be able to verify this one by reusing constants between the production code and the upcoming test to verify that the template actually works. If it’s dynamic, then I don’t particularly care if it’s the wrong one – that’s a configuration error, not a coding error. Configurations can be tested by an inspection process (more on this later)
  4. Verifying the template is well-formed requires the actual template. It probably also takes a bit of time, because there probably isn’t any way to do this besides running it through the template processor. However, again, this can be largely automated if there is some sort of template repository in the system; one test (a long running one, admittedly) can simply obtain _all_ the templates and verify them. Alternatively, if the templates are user-entered, then you can verify them on the point of entry into the repository; this then makes verification a runtime aspect of the program, rather than a static aspect to be inspected via a test.
  5. Verifying the parameters may be easy or hard. It will be hard if you have to pass them into the template engine with the template, particularly if you need to do various combinations. However, you should be able to generate various sets of parameters and inspect them to see if they would work.
  6. You can ensure that the message gets sent to the email service by using a mock object for the email service. You can then assert that the delivery message was called.
  7. You can verify the email addresses are correct by examining the email delivered to the mock email service.
  8. Intermittent environmental problems aren’t something to test for. Furthermore, in this scenario, we can assume the bits affected (mainly the email delivery service) have their own tests to ensure they can deal with flaky environments.
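
To make item 1 concrete, here is a minimal sketch of an observer-based failure test. The names (TaskObserver, ScheduledTask, RecordingObserver) are hypothetical and not from any real scheduler; the point is only that the test hooks into the same notification mechanism the email notifier would use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical observer contract: told when a task run fails.
interface TaskObserver {
    void taskFailed(String taskName, Exception cause);
}

// Hypothetical scheduled task that reports failures to its observers.
class ScheduledTask {
    private final String name;
    private final Runnable work;
    private final List<TaskObserver> observers = new ArrayList<>();

    ScheduledTask(String name, Runnable work) {
        this.name = name;
        this.work = work;
    }

    void addObserver(TaskObserver observer) {
        observers.add(observer);
    }

    // Runs the work; a runtime failure is reported to every observer.
    void run() {
        try {
            work.run();
        } catch (RuntimeException e) {
            for (TaskObserver observer : observers) {
                observer.taskFailed(name, e);
            }
        }
    }
}

// Test double: records the names of the tasks it saw fail.
class RecordingObserver implements TaskObserver {
    final List<String> failedTasks = new ArrayList<>();

    @Override
    public void taskFailed(String taskName, Exception cause) {
        failedTasks.add(taskName);
    }
}

class TaskFailureDemo {
    public static void main(String[] args) {
        ScheduledTask task = new ScheduledTask("nightly-report", () -> {
            throw new IllegalStateException("forced failure");
        });
        RecordingObserver observer = new RecordingObserver();
        task.addObserver(observer);
        task.run();
        System.out.println(observer.failedTasks); // prints [nightly-report]
    }
}
```

This test can only break for one reason: the task did not tell its observers about the failure. Nothing about templates or email is involved.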

In general, these tests can be categorised as either interaction tests or state tests. State tests take production code, call methods on it, and inspect the return values and the state changes in the production code. Interaction tests take small modules of production code in isolation, replace the various dependencies in the code with mock objects, and verify that the interactions were correct.

When should you choose state vs interaction tests? Well, let’s look at the seven tests. Broadly speaking, I’d say that 1, 3, 5, and 6 are interaction tests, while 2, 4, and 7 are state tests. Essentially, the interaction tests mark the boundaries between object responsibilities, whilst the state tests ensure that the responsibilities are actually honoured.
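
As an illustration of a state test, item 2 (every template named in the enum exists) might look something like the sketch below. EmailTemplate and TemplateExistenceCheck are made-up names, and the set of available names stands in for whatever your real template repository can list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical enum: the only way production code may name a template.
enum EmailTemplate {
    TASK_FAILED("task-failed"),
    TASK_RECOVERED("task-recovered");

    final String templateName;

    EmailTemplate(String templateName) {
        this.templateName = templateName;
    }
}

class TemplateExistenceCheck {
    // Returns every enum value whose template name is not in the store.
    // availableNames is an assumption: a listing from the real repository.
    static List<EmailTemplate> missingTemplates(Set<String> availableNames) {
        List<EmailTemplate> missing = new ArrayList<>();
        for (EmailTemplate template : EmailTemplate.values()) {
            if (!availableNames.contains(template.templateName)) {
                missing.add(template);
            }
        }
        return missing;
    }
}
```

A single test asserting that missingTemplates returns an empty list covers every template, and the failure message names exactly which ones are absent. Note that it only checks existence; it never fetches or processes a template.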

Trust but verify

A running theme here is that you need to trust that other parts of the system actually work! So, for example, when you want to use an email delivery service, you don’t need to verify that it did its job; you can trust that it will work if you call it right. That’s the essence of a good interaction test.
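
Here is a rough sketch of what such an interaction test could look like for items 6 and 7, using a hand-rolled mock rather than a mocking library. EmailService, FailureNotifier and MockEmailService are all hypothetical names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical boundary to the (trusted) email delivery service.
interface EmailService {
    void deliver(String toAddress, String body);
}

// Hypothetical production class under test.
class FailureNotifier {
    private final EmailService emailService;
    private final String recipient;

    FailureNotifier(EmailService emailService, String recipient) {
        this.emailService = emailService;
        this.recipient = recipient;
    }

    void notifyFailure(String taskName) {
        emailService.deliver(recipient, "Task " + taskName + " failed");
    }
}

// Hand-rolled mock: records each call instead of sending anything.
class MockEmailService implements EmailService {
    final List<String> recipients = new ArrayList<>();
    final List<String> bodies = new ArrayList<>();

    @Override
    public void deliver(String toAddress, String body) {
        recipients.add(toAddress);
        bodies.add(body);
    }
}

class FailureNotifierDemo {
    public static void main(String[] args) {
        MockEmailService mock = new MockEmailService();
        new FailureNotifier(mock, "ops@example.com").notifyFailure("nightly-report");
        System.out.println(mock.recipients); // prints [ops@example.com]
    }
}
```

The test asserts only that deliver was called once with the right address. Whether the mail actually arrives is the delivery service’s problem, and its own tests cover that.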

This approach lets you focus emphasis on the essentials. So, for example, in the case of a template engine, you can verify that all the templates are well-formed by processing them with static data. If your templating system could generate multiple output formats, you can test a template against only one output format (presumably the fastest one); you don’t need to test all your templates against all the output formats – as long as you have tests around the template engine itself, to verify that the other formats would work as well.
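
The same idea applies to checking parameters against a template without running the engine at all. The sketch below assumes a ${name} placeholder syntax purely for illustration (your template engine’s syntax will differ) and reports any placeholder that a given parameter set would leave unresolved.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateParameterCheck {
    // Assumed placeholder syntax: ${name}. Real engines differ.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Returns the placeholder names used in the template text that are
    // absent from the supplied parameter names, in sorted order.
    static Set<String> unresolved(String template, Set<String> parameterNames) {
        Set<String> missing = new TreeSet<>();
        Matcher matcher = PLACEHOLDER.matcher(template);
        while (matcher.find()) {
            String name = matcher.group(1);
            if (!parameterNames.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

Because this is a plain string inspection, you can run it over many parameter sets cheaply and keep the slow, full template rendering for a single well-formedness test.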

Here’s another example: when using Hibernate, there is a potential for mismatch between your Java objects, your descriptor file (or annotations), and your database. Hibernate will verify the objects when it loads up at runtime, but the database may not match up – maybe you’ve got a bad column name, or you have the wrong type, and so forth. How do you solve this?

One fairly traditional way is that you can create tests that create various instances of your classes, and attempt to save them. This is time-consuming to write, and often fairly slow to run. What’s the alternative?

Well, how about a verification tool? It is certainly feasible to write a tool that inspects your Hibernate configuration file and obtains database metadata to ensure that it all can work – all fields are the right type and length, NULL constraints are in place, indexes exist on the right columns, and so forth. The nice thing here is that you do it once, get it right (with effort), and then you don’t have to worry about it anymore. For bonus points, you could use an exported schema, along with another tool to verify that a given database instance conforms to the schema. This would let you test without a database, whilst also allowing you to ensure that various deployment regions will work (possibly as an installation-time check)
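
The core of such a tool is just a comparison of two descriptions of the same table. The sketch below is deliberately simplified: MappingVerifier is a made-up name, both inputs are plain maps of column name to SQL type name, and only existence and type are checked. In a real tool the first map would be read from the Hibernate mapping and the second could be populated from JDBC’s java.sql.DatabaseMetaData.getColumns (or from an exported schema).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MappingVerifier {
    // mapped: column name -> type the mapping expects.
    // actual: column name -> type the database (or schema export) reports.
    // Returns a human-readable list of mismatches, sorted by column name.
    static List<String> verify(Map<String, String> mapped, Map<String, String> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(mapped).entrySet()) {
            String column = entry.getKey();
            String expectedType = entry.getValue();
            String actualType = actual.get(column);
            if (actualType == null) {
                problems.add("missing column: " + column);
            } else if (!actualType.equalsIgnoreCase(expectedType)) {
                problems.add("type mismatch on " + column + ": mapped "
                        + expectedType + ", database " + actualType);
            }
        }
        return problems;
    }
}
```

Lengths, NULL constraints and indexes would be further fields on each column description, compared in the same way.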

Let’s look at another example, at the opposite end of the application tier. JSP pages are notoriously hard to test. You need to test them in-container, after all, and testing page flows usually means jumping through hoops. Or does it?

The Model2 architecture is more-or-less the standard way of doing web-apps these days. In this model, you have a controller servlet which accepts web requests, sets up data in either the session or request scope, then forwards on to the JSP. This means that the JSP itself has a fairly well-defined contract: there will be data in particular places, and it is expected to submit to a particular place. This can be leveraged.

What you can do here is package the JSPs differently. Build a test WAR file to deploy the JSPs as normal, but with a completely different controller. This controller can map incoming requests to hard-coded text files. These text files can be used to load object graphs which are dropped into the right scopes as you’d expect, then forward on to the right JSP. This means that you can very easily and quickly develop your JSPs. Furthermore, you can re-use some of this test infrastructure – the list of known requests can be used in unit tests to verify your controller behaviour.

You can do this in lots of other places as well. Maybe you use EJBs, and you want to ensure that your transaction handling is done correctly. Rather than call methods and verify the transactions, inspect the deployment descriptors. Or you’re using Spring and you want to make sure your configuration wiring is correct. Simply inspect the configuration file.
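
For the Spring case, a minimal sketch of “inspect the configuration file” might be the following. It parses the XML with the JDK’s own DOM parser and reports any ref attribute on a property element that doesn’t match a declared bean id. WiringInspector is a hypothetical name, and the check is intentionally narrow: it ignores constructor arguments, nested ref elements, aliases and imports.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class WiringInspector {
    // Returns every <property ref="..."> value that names no <bean id="...">.
    static List<String> danglingRefs(String springXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(springXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("configuration is not parseable XML", e);
        }

        // Collect the ids of all declared beans.
        Set<String> beanIds = new HashSet<>();
        NodeList beans = doc.getElementsByTagName("bean");
        for (int i = 0; i < beans.getLength(); i++) {
            beanIds.add(((Element) beans.item(i)).getAttribute("id"));
        }

        // Check each property reference against the declared ids.
        List<String> dangling = new ArrayList<>();
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            String ref = ((Element) properties.item(i)).getAttribute("ref");
            if (!ref.isEmpty() && !beanIds.contains(ref)) {
                dangling.add(ref);
            }
        }
        return dangling;
    }
}
```

A unit test that feeds the real configuration file through this and expects an empty list runs in milliseconds, with no container and no application context start-up.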

So what does all this buy you, anyway?

Mostly what it buys is faster builds and increased developer productivity. You get faster builds because you vastly increase the amount of testing you can do as fast unit tests, and decrease the scope of slower tests that require more infrastructure (such as application servers, databases, messaging systems, etc). You also get increased developer productivity over time because developers spend less time trying to diagnose test failures and bugs, and because you tend to get more re-usable infrastructure.

With faster tests, particularly with data-driven tests, you also end up with better test coverage. With data-driven tests, it’s very easy to add a new test case to cover a new edge scenario – copy a similar file, edit it as needed, and run your tests again. So you get better quality systems.

You also tend to get better designed and more flexible systems. In particular, a heavy focus on interaction tests seems to promote smaller objects that conform better to the Single Responsibility Principle. This means that restructuring your system (say, to replace EJBs with Spring) should be fairly straightforward. Adding new functionality should also become easier, as there are fewer infrastructure issues to deal with.

What does this cost you?

Time and effort, mostly. These techniques require investment. You have to be willing not to take a short cut to get this immediate task done now, but to take the time to do it right. Unfortunately, the cost of doing it right is usually obvious and immediate, whilst the cost of the shortcut is usually a thousand niggling cuts that build up over time.

For example: if it takes me an hour to save 1 second in my build, that sounds like a bad choice. It will take 3600 builds to get that time back, right? But how often do you run your builds? When I’m on a project that has fast builds (under 10 minutes), I run them a lot – up to 20 times a day. This means that I would reach break-even in about 180 working days, or about 9 months. If I’m on a team of 9 developers, I reach break-even in 1 month, as they save a second each time they do a build as well. In two months’ time, I’ll be up an hour. Over a nine-month project, spending that hour saves the team an entire developer day. And that’s from a single investment out of many.
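
If you want to plug in your own numbers, the arithmetic is small enough to write down. The class and method names here are made up, and the figures in main are simply the ones from the paragraph above (counting in working days).

```java
class BuildBreakEven {
    // Working days until the time saved per build repays the time invested.
    static double breakEvenDays(double investedSeconds,
                                double savedSecondsPerBuild,
                                int buildsPerDayPerDeveloper,
                                int developers) {
        double savedPerDay = savedSecondsPerBuild * buildsPerDayPerDeveloper * developers;
        return investedSeconds / savedPerDay;
    }

    public static void main(String[] args) {
        // One hour invested, one second saved per build, 20 builds a day.
        System.out.println(breakEvenDays(3600, 1, 20, 1)); // prints 180.0
        System.out.println(breakEvenDays(3600, 1, 20, 9)); // prints 20.0
    }
}
```

Twenty working days is roughly one calendar month, which is where the one-month figure for a team of nine comes from.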

Conversely, when I’m on a project that has slow builds, I don’t run the builds as often. When I do, I’m often wasting the time that the build takes doing low-priority work, because I don’t want to lose the context I’m in. I become willing to commit unverified code because I know the build system will catch any errors. All in all, I become less productive. If I spend as little as 12 minutes a day staring at the screen waiting for builds, that’s an hour a week. Would it be good if I was spending that hour saving 1 second at a time? You bet.

Summing up

Big monolithic tests are easy to write, but they are easy to break, and usually slow. Spending the time to write the tests better so that any given test only breaks for one reason will result in faster running tests and less time wasted trying to diagnose errors. This, in turn, will dramatically improve, over time, developer productivity. However, it does require investment. This is why software really is too expensive to build cheaply.

Author: Robert Watkins

My name is Robert Watkins. I am a software developer and have been for over 20 years now. I currently work for people, but my opinions here are in no way endorsed by them (which is cool; their opinions aren’t endorsed by me either). My main professional interests are in Java development, using Agile methods, with a historical focus on building web based applications. I’m also a Mac-fan and love my iPhone, which I’m currently learning how to code for. I live and work in Brisbane, Australia, but I grew up in the Northern Territory, and still find Brisbane too cold (after 22 years here). I’m married, with two children and one cat. My politics are socialist in tendency, my religious affiliation is atheist (aka “none of the above”), my attitude is condescending and my moral standing is lying down.

2 thoughts on “Testing pattern: don’t test too much at once”

  1. Be careful with your Hibernate assumptions. I found multiple instances where Hibernate would persist an object for a given mapping – and then spectacularly fail to unpersist it. Mostly this was to do with dates and times (notoriously tricky in SQL). I could even verify that it was saved in the database. Very annoying.

    Database tests can be tricky beasts. You can exercise SPUD in the order of persist, update, delete (with a copious sprinkling of selects to make sure they all work) – but because of things like auto-incrementing indexes the tests are not exactly reproducible. I like your solution – verify it till you are confident the mapping works, and then have a test which ensures that mapping hasn’t been fiddled with.

    I’d have liked to have seen more detail on your JSP testing idea, as this is also (as you say) a problematic area with testing.

    In general I like to do a lot of testing, but over time I tend to move the tests up to higher and higher levels of abstraction. In your example I might test to see if the email part is working (even if someone else wrote it). But then when it’s reached a fairly stable point I might drop those tests and instead have a test which tests a swathe of (what should be) stable stuff all at once.

    In terms of unit testing, my preferred way of working is to test as I go. I might test part of a method (as I’m partway through writing it), then once I’m confident that part works I might write and test the rest of the method, then once I’m confident the second half works I might test the method as a whole, then once I’ve built confidence in the methods I might move my testing up to the class or module level, and eventually even ‘package’ level testing (ie testing a bunch of classes which work together as though they were a black box).

    If I was still running all the other tests down to the tiny little partial method tests then it would take absolutely ages to do a build. So I think there’s a natural tension between wanting to have all the tests so as to be able to do refactoring (ala Martin Fowler et al), and wanting to just get on with building the next piece of the puzzle. The practical programming book touches on this. Perhaps the testing frameworks need to support the concept of different testing modes better – eg if I’m refactoring I need a different kind of tests to be run (eg the low level regression tests), whereas if I’m building something new I don’t need to do exhaustive testing of every other part of the application at the same time. (Until of course I decide to declare the code ‘done’ and try to integrate it with the rest of the application)

    Joel Spolsky has some interesting things to say about being ‘in the flow’. His example is of why it is good to give programmers their own offices. If A needs to know something which will take 30 seconds to look up, but only 15 seconds to ask B (because they are next to each other in the cube farm), then he will ask B. Unfortunately this knocks B out of ‘the zone’, and it may take them 15 minutes or more to get back into ‘the zone’. Whereas if A and B had their own offices, then it would take 45 seconds for A to get up out of the chair and go interrupt B, so A would just look up the information he needed himself.

    The relevance to the current topic is that the same thing can happen with builds – if the build takes too long, then it can knock you out of the zone – hence your natural aversion to doing frequent builds where builds are expensive.

    There are many other areas in which development can be a lot more heavyweight. (Where heavier = more work done when just one class is changed.) I was pleasantly surprised on my latest project to discover I could test my Hibernate mapping files very quickly – because instead of doing the whole nine yards with a full build cycle (approx 20-30 minutes – how soul destroying is that?) I could just change them in situ and then reboot the server(s) (it’s a multi-tier thing, so I restart the web server as well just to be sure).

    In general I think the tools need to be better to support lightweight and faster development. Eg if I’m working on something which needs to be warred or earred up and deployed, do I really need to do a clean and full build of everything else all the time? (In many places sadly the answer is yes, whereas of course it should be no) If I change just one class, I should be able to compile it, and if it compiles the tool should slap a copy of it into the appropriate place. (Of course I’m not suggesting that for prod, but only for dev).

    The J2EE development cycle desperately needs to be sped up to be competitive with other platforms. We have lots of great frameworks, but it seems like every new one just slows the process down even more.

  2. Wow, that’s a heck of a long comment, Rick. 🙂

    You are right that in order to do this sort of thing, you need to have confidence in your infrastructure. If you don’t have confidence, then you need better tests in place. Infrastructure, in this sense, is _everything besides the class you want to use_! Under the SRP, if you can move the responsibility to another class, then that other class becomes infrastructure!

    Building and testing as you go works well also. Certainly that’s how I evolve a codebase (as I’m sure you remember 😉). The two things to watch out for, in my experience, are that you do perform the necessary refactorings to extract infrastructure out, and that you extract out the tests at the same time.

    Build times are definitely a huge “zone” problem; I’m working between two related projects at the moment. One has a build time measured in the tens of seconds, while the other has a build time measured in the tens of minutes. Guess which one we make more progress in? 🙂

    J2EE development… well, I’m of the opinion that you should make as much as you can be fully testable out-of-container. I view the J2EE parts (and the database parts) as being a service provider layer that can be abstracted away. This means that most of my code tends to be unit testable very easily. It also means that sometimes I get some nasty bugs that more in-container testing would have shown; I’m still learning the balance here… I’m settling into a preferred pattern of extensive unit testing, with occasional periods of exploratory (manual) integrated testing, coupled with aggressive performance tests (which, of course, are integrated); the manual tests let you periodically re-verify your assumptions, and, of course, you apply the “find a bug, write a test” rule.

    Testing things out of container is particularly important. You mention testing your hibernate configuration file by editing it in situ. Well, I do the same thing, but I don’t need to deploy it; an out-of-container JNDI context lets me resolve the datasource name, and away I go. Simply edit the config file, and run the test. Use a generated config file? Then edit the generated one until the tests pass, then apply the changes back into the code base for subsequent regeneration. Simple. 🙂

    Do you need to package an EAR every time? Um, no… app servers these days do normally support an “exploded EAR” format. My problem, at least under WebLogic 8 (the only server I tried it with) is that it wasn’t very good at spotting class-file changes; you needed to trigger a redeploy of the component (at the least). I’m pretty sure that this was intentional… too much mucking around in a class can invalidate the run-time (for example, by screwing up the serialisation ID, or by changing static constants). Still, redeploying an exploded EAR is a bit faster than doing an unexploded EAR.

    Can J2EE development be slow? Yes, particularly if done naively or with poor tools (why _does_ WebLogic Workshop spring to mind?). However, it doesn’t have to be; with a deliberate focus on avoiding in-container tests, you can make J2EE development quite rapid (I’ll talk more about the JSP idea in another blog post).

    On Joel’s point: I personally think that certain dynamics change the underlying assumptions. In particular, pairing makes it a null issue – you _can_ get “into the zone” as a pair (sometimes even easier than as an individual); as a pair, the “navigator” can look up information on demand (which is why pairs should have two computers) without getting out of the zone at all; and if one partner gets interrupted, he can just jump right back into the flow. Mind you, Joel never bought into the pair programming concept, anyway. 🙂
