This has been said before, I know, but it’s worth reiterating: a test should test one thing, and one thing only.
First, some scope definition. Using Kent Beck’s terminology, I’m talking about developer tests, not acceptance tests. Also, by one thing, I mean that there should be only one thing that breaks the test (which is very different from saying any failure should only break one test…). In addition, the one thing that breaks should provide diagnostic information – a test failure shouldn’t leave you scratching your head to determine the immediate cause.
The example I’m going to use is a real-world one from a conversation I had recently. The scenario is this: there’s a scheduled task that occurs each day. If it fails, it needs to send an email notification. The email notification, in turn, needs to be generated from a template using parameters from the scheduled task. It’s reasonable to assume (based on the fact that they are widely used and have their own tests) that the scheduler works, that the templating mechanism will work (provided that the template and data are okay), and that the email delivery will work. How to test this?
One way, of course, is to simply run the task manually, force it to fail, wait a little bit, then look in a mail box, get the mail, and examine it. What’s wrong with this approach?
Well, one thing that’s wrong is that the absence of a delivered email provides no diagnostic information. But the real problem is that there are lots of reasons that this can break, which means that the test is excessively fragile.
What can go wrong?
So, why can this example break? Let’s see what sort of situations could occur…
- Maybe the task wasn’t kicked off. Low probability – there would be tests around the task’s “happy case”, and the mechanism would be the same. Assume that’s a “can’t happen” situation.
- Maybe the task didn’t fail. Easy enough to imagine… causing code to break on purpose can be tricky (though it happens by accident easily enough)
- Maybe the email template doesn’t exist. Depending on the situation, this may range from likely (the template is looked up by a string name, which may or may not match the actual name of the template) to unlikely (the template name is provided by a defined enum, and there are tests that verify that all the values in the enum have corresponding templates). In this situation, this was a likely possibility.
- Maybe the wrong template was picked; this would result in the wrong email being sent.
- Maybe the template was badly formed. This is actually very likely if you don’t have a verification mechanism for the template, and if it’s being developed as part of the work-at-hand.
- Maybe the parameters passed into the templating process were wrong. This could result in a failure to process the template, or maybe it would just result in an email that doesn’t match the expected message.
- Maybe the generated email wasn’t handed over to the delivery service.
- Maybe the email was sent to the wrong address.
- Or it could be that the email was made fine, sent fine, but some environmental issue prevented it from working (like the SMTP server crashed, or the mail spool overflowed, and so forth).
So, that’s nine reasons off the top of my head that could cause the test to fail – eight if you discount the first one, which admittedly I intend to. In addition, we should probably discount the environmental issues; instead, our tests should be reliable without a dependency on a potentially broken environment. So there are seven issues that we need to be able to test for.
Just to reiterate, we are assuming that the following things actually do work:
- The task can be kicked off okay.
- The scheduler works just fine, and is configured correctly.
- Templates in general can be processed.
- Emails can be delivered.
This list, by the way, is an example of the XP principle of “Test Everything That Can Possibly Break”. We’ve determined what can possibly break, and what can’t (because we’ve got good tests around those areas already). This means we need to have at least one test for all of the “possible break” situations.
So how should this be tested?
Now, I don’t have a brain the size of a planet, so I’m not sure how I can write one good test that can fail for seven different reasons. In fact, I’m not sure how I can do it in less than seven tests. I might even need a few more, depending on what sort of variations I want to try (for example, I might want to look into different parameter sets). So, what would the tests look like? Well, let’s tackle them in order.
- You need a test to ensure that you can know that the task failed. How you do this depends on your design, but I would recommend the Observer pattern here. You can then drop your test in as an observer to the task and see the failure occur. Of course, for this to work, the email notification needs to be set up as an observer as well… it’s a bit pointless to use one mechanism in a test to notice the task failed, and use a different one in the actual code (not to mention a failure of the DRY principle).
- Making sure the template exists is a trickier one. The one I would prefer I hinted at earlier; define the template name in an enum, then have a test that verifies all the templates in the enum exist. As an alternative, however, you could use a constant for the template name, and verify just that template; the problem here is that, as the number of templates increase, you end up with lots of tests essentially doing the same thing. Please note that I’m saying “verify the template exists”, not saying “fetch the template”.
- Picking the right template is a little trickier… To be honest, I’m not sure I’d bother testing this one. If the template choice is hard coded, then I’d be able to verify this one by reusing constants between the production code and the upcoming test to verify that the template actually works. If it’s dynamic, then I don’t particularly care if it’s the wrong one – that’s a configuration error, not a coding error. Configurations can be tested by an inspection process (more on this later).
- Verifying the template is well-formed requires the actual template. It probably also takes a bit of time, because there probably isn’t any way to do this besides running it through the template processor. However, again, this can be largely automated if there is some sort of template repository in the system; one test (a long running one, admittedly) can simply obtain _all_ the templates and verify them. Alternatively, if the templates are user-entered, then you can verify them on the point of entry into the repository; this then makes verification a runtime aspect of the program, rather than a static aspect to be inspected via a test.
- Verifying the parameters may be easier or harder. It will be hard if you have to pass them into the template engine with the template, particularly if you need to do various combinations. However, you should be able to generate various sets of parameters and inspect them to see if they would work.
- You can ensure that the message gets sent to the email service by using a mock object for the email service. You can then assert that the delivery message was called.
- You can verify the email addresses are correct by examining the email delivered to the mock email service.
- Intermittent environmental problems aren’t something to test for. Furthermore, in this scenario, we can assume the bits affected (mainly the email delivery service) have their own tests to ensure they can deal with flaky environments.
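To make the first and sixth of these tests concrete, here is a minimal sketch. The `TaskObserver` and `EmailService` interfaces, and all the class and method names, are invented for illustration – they aren’t from any particular framework. The production notifier observes the task and hands a message to the delivery service; the test replaces the delivery service with a hand-rolled mock and asserts on the interaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interfaces -- names are illustrative, not from a real framework.
interface TaskObserver {
    void taskFailed(String taskName, Exception cause);
}

interface EmailService {
    void deliver(String toAddress, String body);
}

// The production notifier: observes the task, hands an email to the delivery service.
class FailureNotifier implements TaskObserver {
    private final EmailService emailService;
    private final String adminAddress;

    FailureNotifier(EmailService emailService, String adminAddress) {
        this.emailService = emailService;
        this.adminAddress = adminAddress;
    }

    public void taskFailed(String taskName, Exception cause) {
        emailService.deliver(adminAddress, "Task " + taskName + " failed: " + cause.getMessage());
    }
}

// A hand-rolled mock that simply records calls for later assertions.
class MockEmailService implements EmailService {
    final List<String> addresses = new ArrayList<>();
    final List<String> bodies = new ArrayList<>();

    public void deliver(String toAddress, String body) {
        addresses.add(toAddress);
        bodies.add(body);
    }
}

public class FailureNotificationTest {
    public static void main(String[] args) {
        MockEmailService mock = new MockEmailService();
        FailureNotifier notifier = new FailureNotifier(mock, "admin@example.com");

        // Simulate the scheduled task failing and notifying its observers.
        notifier.taskFailed("nightly-import", new RuntimeException("disk full"));

        // Interaction assertions: delivery was invoked once, to the right address,
        // with a body that mentions the failed task.
        if (mock.addresses.size() != 1) throw new AssertionError("expected one delivery");
        if (!mock.addresses.get(0).equals("admin@example.com")) throw new AssertionError("wrong address");
        if (!mock.bodies.get(0).contains("nightly-import")) throw new AssertionError("wrong body");
        System.out.println("ok");
    }
}
```

Note that the test never sends a real email; it only verifies that the notifier asked the delivery service to do so, with the right arguments.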
In general, these tests can be categorised as either interaction tests or state tests. State tests take production code, call methods on it, and inspect the return values and the state changes in the production code. Interaction tests take small modules of production code in isolation, replace the various dependencies in the code with mock objects, and verify that the interactions were correct.
When should you choose state vs interaction tests? Well, let’s look at the seven tests. Broadly speaking, I’d say that 1, 3, 5, and 6 are interaction tests, while 2, 4, and 7 are state tests. Essentially, the interaction tests mark the boundaries between object responsibilities, whilst the state tests ensure that the responsibilities are actually honoured.
Trust but verify
A running theme here is that you need to trust that other parts of the system actually work! So, for example, when you want to use an email delivery service, you don’t need to verify that it did its job; you can trust that it will work if you call it right. That’s the essence of a good interaction test.
This approach lets you focus emphasis on the essentials. So, for example, in the case of a template engine, you can verify that all the templates are well-formed by processing them with static data. If your templating system could generate multiple output formats, you can test a template against only one output format (presumably the fastest one); you don’t need to test all your templates against all the output formats – as long as you have tests around the template engine itself, to verify that the other formats would work as well.
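As a concrete sketch of “verify all the templates with static data”, here is the enum-driven approach suggested earlier, assuming the templates are `java.text.MessageFormat` patterns. The enum values, patterns, and repository are invented for illustration; in a real system the patterns would come from wherever your templates actually live.

```java
import java.text.MessageFormat;
import java.util.EnumMap;
import java.util.Map;

public class TemplateVerificationTest {
    // Hypothetical enum of notification templates; in a real system the
    // patterns would be loaded from a template repository, not inlined.
    enum Template {
        TASK_FAILED,
        TASK_SUCCEEDED
    }

    static final Map<Template, String> REPOSITORY = new EnumMap<>(Template.class);
    static {
        REPOSITORY.put(Template.TASK_FAILED, "Task {0} failed at {1}.");
        REPOSITORY.put(Template.TASK_SUCCEEDED, "Task {0} completed at {1}.");
    }

    public static void main(String[] args) {
        Object[] staticData = { "nightly-import", "03:00" };
        for (Template t : Template.values()) {
            String pattern = REPOSITORY.get(t);
            // Existence check: every enum value must have a template.
            if (pattern == null) throw new AssertionError("no template for " + t);
            // Well-formedness check: MessageFormat throws IllegalArgumentException
            // on a malformed pattern, so rendering with static data verifies it.
            String rendered = MessageFormat.format(pattern, staticData);
            if (rendered.contains("{")) throw new AssertionError("unfilled parameter in " + t);
        }
        System.out.println("all templates verified");
    }
}
```

One loop covers both the existence test and the well-formedness test for every template, so adding a new template to the enum automatically brings it under test.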
Here’s another example: when using Hibernate, there is a potential for mismatch between your Java objects, your descriptor file (or annotations), and your database. Hibernate will verify the objects when it loads up at runtime, but the database may not match up – maybe you’ve got a bad column name, or you have the wrong type, and so forth. How do you solve this?
One fairly traditional way is that you can create tests that create various instances of your classes, and attempt to save them. This is time-consuming to write, and often fairly slow to run. What’s the alternative?
Well, how about a verification tool? It is certainly feasible to write a tool that inspects your Hibernate configuration file and obtains database metadata to ensure that it all can work – all fields are the right type and length, NULL constraints are in place, indexes exist on the right columns, and so forth. The nice thing here is that you do it once, get it right (with effort), and then you don’t have to worry about it anymore. For bonus points, you could use an exported schema, along with another tool to verify that a given database instance conforms to the schema. This would let you test without a database, whilst also allowing you to ensure that various deployment regions will work (possibly as an installation-time check).
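Here is a sketch of the comparison step of such a tool. Both schemas are represented as plain maps of column name to declared type, so the example stays self-contained; in a real tool the expected side would be derived from the Hibernate mapping and the actual side populated from JDBC `DatabaseMetaData.getColumns`. The table and column names are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaVerifier {
    // Compare the columns the mapping expects against what the database reports.
    // Returns human-readable mismatches; an empty list means the schemas agree.
    static List<String> verify(Map<String, String> expected, Map<String, String> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> e : expected.entrySet()) {
            String actualType = actual.get(e.getKey());
            if (actualType == null) {
                problems.add("missing column: " + e.getKey());
            } else if (!actualType.equals(e.getValue())) {
                problems.add("type mismatch on " + e.getKey()
                        + ": expected " + e.getValue() + ", found " + actualType);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // What the Hibernate mapping declares...
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("id", "BIGINT");
        mapping.put("name", "VARCHAR(255)");

        // ...versus what the database actually has (deliberately wrong length).
        Map<String, String> database = new LinkedHashMap<>();
        database.put("id", "BIGINT");
        database.put("name", "VARCHAR(100)");

        for (String problem : verify(mapping, database)) {
            System.out.println(problem);
        }
    }
}
```

The point is that the check runs in milliseconds against metadata, rather than minutes of round-tripping real objects through the persistence layer.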
Let’s look at another example, at the opposite end of the application tier. JSP pages are notoriously hard to test. You need to test them in-container, after all, and testing page flows usually means jumping through hoops. Or does it?
The Model2 architecture is more-or-less the standard way of doing web-apps these days. In this model, you have a controller servlet which accepts web requests, sets up data in either the session or request scope, then forwards on to the JSP. This means that the JSP itself has a fairly well-defined contract: there will be data in particular places, and it is expected to submit to a particular place. This can be leveraged.
What you can do here is package the JSPs differently. Build a test WAR file to deploy the JSPs as normal, but with a completely different controller. This controller can map incoming requests to hard-coded text files. These text files can be used to load object graphs which are dropped into the right scopes as you’d expect, then forward on to the right JSP. This means that you can very easily and quickly develop your JSPs. Furthermore, you can re-use some of this test infrastructure – the list of known requests can be used in unit tests to verify your controller behaviour.
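The core of that test controller can be sketched as follows. The request paths, fixture format, and class name are all invented for illustration, and the fixtures are inlined as strings rather than loaded from text files, to keep the example self-contained; the idea – map a request to canned data, build the model, pick the JSP – is the same.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// A sketch of the test controller's core: map an incoming request path to a
// fixture, load the fixture into a model, and pick the JSP to forward to.
public class FixtureController {
    private static final Map<String, String> FIXTURES = new HashMap<>();
    static {
        // In a real test WAR this would be a text file on the classpath.
        FIXTURES.put("/orders/list",
                "view=orderList.jsp\norder.count=2\norder.total=59.90");
    }

    // Returns the model the JSP would see, including which JSP to render.
    static Map<String, String> handle(String requestPath) throws IOException {
        String fixture = FIXTURES.get(requestPath);
        if (fixture == null) throw new IllegalArgumentException("no fixture for " + requestPath);
        Properties props = new Properties();
        props.load(new StringReader(fixture));
        Map<String, String> model = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            model.put(key, props.getProperty(key));
        }
        return model;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> model = handle("/orders/list");
        System.out.println("forwarding to " + model.get("view")
                + " with " + model.get("order.count") + " orders");
    }
}
```

A real version would wrap this in a servlet and forward to the JSP via a `RequestDispatcher`, but the mapping-to-fixtures logic is the part worth reusing in your controller unit tests.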
You can do this in lots of other places as well. Maybe you use EJBs, and you want to ensure that your transaction handling is done correctly. Rather than call methods and verify the transactions, inspect the deployment descriptors. Or you’re using Spring and you want to make sure your configuration wiring is correct. Simply inspect the configuration file.
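As a sketch of “simply inspect the configuration file”: parse a Spring-style XML document and check that every `ref` attribute points at a bean that is actually defined. The configuration here is simplified and inlined (and the bean names invented); a real test would load the project’s actual file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class WiringInspectionTest {
    // A simplified, inlined Spring-style configuration for illustration.
    static final String CONFIG =
            "<beans>"
          + "  <bean id='emailService' class='com.example.SmtpEmailService'/>"
          + "  <bean id='failureNotifier' class='com.example.FailureNotifier'>"
          + "    <property name='emailService' ref='emailService'/>"
          + "  </bean>"
          + "</beans>";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.UTF_8)));

        // Collect all defined bean ids.
        Set<String> beanIds = new HashSet<>();
        NodeList beans = doc.getElementsByTagName("bean");
        for (int i = 0; i < beans.getLength(); i++) {
            beanIds.add(((Element) beans.item(i)).getAttribute("id"));
        }

        // Every ref must point at a defined bean.
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            String ref = ((Element) properties.item(i)).getAttribute("ref");
            if (!ref.isEmpty() && !beanIds.contains(ref)) {
                throw new AssertionError("dangling ref: " + ref);
            }
        }
        System.out.println("wiring ok");
    }
}
```

The same pattern works on EJB deployment descriptors: parse once, assert on the structure, and never start a container.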
So what does all this buy you, anyway?
Mostly what it buys is faster builds and increased developer productivity. You get faster builds because you vastly increase the amount of testing you can do as fast unit tests, and decrease the scope of slower tests that require more infrastructure (such as application servers, databases, messaging systems, etc). You also get increased developer productivity over time because developers spend less time trying to diagnose test failures and bugs, and because you tend to get more re-usable infrastructure.
With faster tests, particularly with data-driven tests, you also end up with better test coverage. With data-driven tests, it’s very easy to add a new test case to cover a new edge scenario – copy a similar file, edit it as needed, and run your tests again. So you get better quality systems.
You also tend to get better designed and more flexible systems. In particular, a heavy focus on interaction tests seems to promote smaller objects that conform better to the Single Responsibility Principle. This means that restructuring your system (say, to replace EJBs with Spring) should be fairly straightforward. Adding new functionality should also become easier, as there are fewer infrastructure issues to deal with.
What does this cost you?
Time and effort, mostly. These techniques require investment. You have to be willing not to take a shortcut to get this immediate task done now, but to take the time to do it right. Unfortunately, the cost of doing it right is usually obvious and immediate, whilst the cost of the shortcut is usually a thousand niggling cuts that build up over time.
For example: if it takes me an hour to save 1 second in my build, that sounds like a bad choice. It will take 3600 builds to get that time back, right? But how often do you run your builds? When I’m on a project that has fast builds (under 10 minutes), I run them a lot – up to 20 times a day. This means that I would reach break-even in about 180 days, or about 9 months. If I’m on a team of 9 developers, I reach break-even in 1 month, as they save a second each time they do a build as well. In two months’ time, I’ll be up an hour. Over a nine month project, spending that hour saves the team an entire developer day. And that’s from a single investment out of many.
Conversely, when I’m on a project that has slow builds, I don’t run the builds as often. When I do, I’m often wasting the time that the build takes doing low-priority work, because I don’t want to lose the context I’m in. I become willing to commit unverified code because I know the build system will catch any errors. All in all, I become less productive. If I spend as little as 12 minutes a day staring at the screen waiting for builds, that’s an hour a week. Would it be good if I was spending that hour saving 1 second at a time? You bet.
Big monolithic tests are easy to write, but they are easy to break, and usually slow. Spending the time to write the tests better so that any given test only breaks for one reason will result in faster running tests and less time wasted trying to diagnose errors. This, in turn, will dramatically improve, over time, developer productivity. However, it does require investment. This is why software really is too expensive to build cheaply.