Andy Marks recently posted a dissection of various categories of build failures. In general, I agree that there are definitely different severities of build failures. The question is: is there a time when a build failure is not important?
The point of a continuous build isn’t to make sure that the build is working. The point is to let you know when it is broken. No matter how seriously you take a broken build, it’s a lot more serious if you don’t know when it’s broken.
I mean, we’ve all been there: spending ages trying to work out why we can’t get the tests passing (or worse!), only to find it’s because of somebody else’s changes. The continuous build server helps you avoid this scenario: you should know that the build is in a good state before you fetch the latest changes.
The important thing, to me, is that when I go to integrate my changes (starting by updating my local codebase), I know the build is clean. It doesn’t matter if the build was broken 10 minutes before: it’s clean now. As long as broken builds get fixed fast, it doesn’t matter that much that they occur.
A continuous build server is a tool. Teams should adapt the tool to suit themselves, not the other way around. If a team is getting too worked up about build failures, they should relax and look at what the build failures really mean. A build failure means that someone didn’t run the build locally first. A pattern of build failures means there’s a pattern of not running local builds first. Is this important? Well, that depends.
If you have a lot of people on the team, and people are checking in and out of the repository a lot, then keeping the build server pristine becomes more important. The reason is simple: the window in which to fix a broken build is smaller, so a broken build makes it more likely that the team will get delayed. With a smaller team, there’s a bigger window, so there may not be a delay; in fact, the occasional broken build may be making you go faster. Bear with me here… 🙂
One of the practices Beck lists in the new edition of “XP Explained” is the 10-minute build: if a build takes more than 10 minutes, it’s too long. But even 10 minutes can be too long if you check in a lot (e.g. more than once every few hours). In those circumstances, running a smaller subset of the build, aiming for, say, a one-minute build cycle, makes more sense. This often means pushing things like integration tests, code coverage, etc. into the longer build cycle that the build server runs. However, doing this will mean the occasional build failure. In general, if a safety measure (such as a continuous build) isn’t being triggered every so often, it means it’s redundant. So don’t stress about failing builds too much; save the stress for continually breaking builds.
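To make that split concrete, here’s a minimal sketch of one way it could look. The post doesn’t name a language or build tool, so this assumes a Python project with hypothetical tests/unit and tests/integration directories: the quick suite is what you run before every check-in, and the full suite (integration tests and friends) runs on the build server’s longer cycle.

```python
# Sketch only: split a fast "quick" suite (run locally before every check-in)
# from the "full" suite the build server runs on its longer cycle.
# The tests/unit and tests/integration directories are an assumed layout.
import sys
import unittest


def quick_suite() -> unittest.TestSuite:
    """Fast unit tests only -- the ~1 minute, pre-check-in build."""
    return unittest.defaultTestLoader.discover("tests/unit")


def full_suite() -> unittest.TestSuite:
    """Everything, including the slow integration tests the server runs."""
    suite = unittest.TestSuite()
    suite.addTests(unittest.defaultTestLoader.discover("tests/unit"))
    suite.addTests(unittest.defaultTestLoader.discover("tests/integration"))
    return suite


if __name__ == "__main__":
    # `python run_tests.py quick` locally; `python run_tests.py full` on the server.
    which = sys.argv[1] if len(sys.argv) > 1 else "quick"
    result = unittest.TextTestRunner(verbosity=1).run(
        full_suite() if which == "full" else quick_suite()
    )
    sys.exit(0 if result.wasSuccessful() else 1)
```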
The catch here is the social side. Remember: build failures mean that people aren’t running their builds locally. If a team has committed to running builds locally, then build failures are serious. The social construct carries a lot more weight than the technical one.
I like referring to the “live build” concept from the SCM Patterns book. To guarantee that a build is never broken, all you have to do is never check in…
With live builds, I’d expect some degree of breakage, and I wouldn’t make a big deal of it unless it indicated some “opportunity for improvement”.