As services improve, merge and become virtual, they disappear into the cloud. Which is fine when it’s fine, but sometimes the cloud bursts. We are, we hope, creating better ways of getting things done, but we are also unintentionally and perhaps unavoidably creating new ways for systems and services to fail. Those potential failures need as much thought as any other aspect of the service, but all too often don’t get it.
To make the problem trickier, some of those new points of failure may not be part of what is conventionally thought of as the service at all. The service can fail even if the service provider performs faultlessly. One example comes from Emma Mulqueeny, who blogged earlier today about having to risk a parking ticket because she had forgotten the phone which was her only means of paying. Another comes from my last-minute attempt to complete my tax return before today’s deadline. I was uncharacteristically ahead of schedule this year, all ready to go as long ago as Saturday. But then I hit a snag. There was a piece of information I wanted to check, and the easiest way to do that would have been to log on quickly to the office payroll system. But it turned out that somebody had decided that this would be a good weekend to take the system down for maintenance. Suddenly the promise of access to my data vanished just at the point when I needed it.
Those two examples suggest that there is a need for buffers and for leeway.
A buffer can take many forms. For my tax return, it is the paper records which had appeared to be redundant. For Emma’s parking, it could be a phone in the car park directly linked to the call centre. For my personal data, it is a first layer of backup, which mitigates some risks while keeping everything easily accessible, and a second layer of offsite backup, which mitigates much more catastrophic risks at the price of less immediately convenient access to the data.
Leeway can also be found in many forms. It’s a term I first came across not far short of ten years ago in David Weinberger’s proto-blog, and I still think it’s enormously important and relevant. Human-operated systems always have slack in them. That slack can very easily be seen as waste, but it isn’t: it’s lubrication (except, of course, that some of it is waste, so it’s essential to be able to tell the difference). Computer-operated systems usually have the slack designed out of them, partly because that’s the easier way of developing them and partly because of the perceived benefits of consistency and conformity. The result is a rigid, and therefore brittle, system.
The London congestion charge is a great example of a system which started without leeway and has gradually had it added, with the result that the pain of compliance has been dramatically reduced. Originally, the system was completely unforgiving of any lapse of memory: failure to pay on the required day triggered a penalty charge. Some time later that was changed to allow payment to be made on the following day as well, at a slightly higher cost. From this month, it has been possible to sign up – again at a higher price – and have payments triggered by the act of driving into the charging zone.
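To make that progression concrete, here is a minimal sketch of the three stages of leeway. The function, the category names and the tariff logic are my own illustration, not anything published by Transport for London; the point is simply how each stage removes one more way for a lapse of memory to end in a penalty.

```python
from enum import Enum

class Outcome(Enum):
    STANDARD_CHARGE = "standard charge, paid on the day"
    LATE_CHARGE = "next-day payment at a slightly higher cost"
    AUTO_CHARGE = "automatic charge triggered by entering the zone"
    PENALTY = "penalty charge"

def congestion_outcome(days_after_travel: int, auto_pay: bool) -> Outcome:
    """Illustrative rule only: the three stages of leeway described above."""
    if auto_pay:
        # Third stage: driving into the zone itself triggers payment,
        # so memory no longer comes into it at all.
        return Outcome.AUTO_CHARGE
    if days_after_travel == 0:
        # Original rule: payment on the day of travel.
        return Outcome.STANDARD_CHARGE
    if days_after_travel == 1:
        # Second stage: a forgotten payment can be made good the next day.
        return Outcome.LATE_CHARGE
    # No leeway left beyond that: a penalty charge is triggered.
    return Outcome.PENALTY

# A driver who forgets on the day but remembers the next morning:
print(congestion_outcome(days_after_travel=1, auto_pay=False).value)
```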
The need for buffers and leeway will not go away. Sometimes – perhaps often – the best place for them is outside what is perceived as and designed as the core system. Service designers need to reflect on that, and to do so in the context of the system experienced by the user, not the more narrowly conceived system offered by the provider. As Emma observed,
Whilst it is true that savings can be made and that consumers are becoming used to expecting there to be a digital option for pretty much everything – it is a mistake to cut out humanity completely. It is the kind of counter-productive behaviour that makes people very cross and frustrated, normally in times of deep stress or just general state of worry such as we find ourselves in today.
This reminds me of the time when my bank introduced a new authentication system for their telephone banking on the assumption that everyone had the required credentials.
I didn’t, but I couldn’t get through to a human on the telephone banking line to tell them that without having the credentials to log in.
So an automated service always needs at least one fallback:
– if possible, let the customer achieve their goal without authenticating/paying/meeting the formal requirements up front so long as no significant harm could be caused,
or
– provide access to a human operator within a reasonable timescale (i.e. soon enough that the customer can still achieve their goal in a timely way)
What we’re trying to avoid is situations where “computer says no” and there is no other way.
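As a minimal sketch of that fallback rule – with hypothetical names and record structure, not any real bank’s API – the routing logic might look like this:

```python
from dataclasses import dataclass

@dataclass
class Request:
    task: str
    authenticated: bool
    harm_if_unverified: bool  # would skipping the checks cause significant harm?

def handle(request: Request) -> str:
    """Hypothetical routing rule: the customer must always have a way forward."""
    if request.authenticated:
        return f"completed: {request.task}"
    if not request.harm_if_unverified:
        # Fallback 1: let the customer achieve their goal now and
        # verify afterwards, since no significant harm can result.
        return f"completed without authentication, flagged for review: {request.task}"
    # Fallback 2: route to a human operator within a reasonable
    # timescale, so "computer says no" is never the end of the road.
    return f"queued for human operator: {request.task}"

# The telephone banking case above: the caller lacks credentials,
# and the thing they need to report is precisely that.
print(handle(Request("report missing credentials",
                     authenticated=False, harm_if_unverified=True)))
```

Note that the first fallback is not about skipping checks altogether: it is about deferring them to a point where they no longer stand between the customer and their goal.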
It amazes me that system designers these days don’t seem to be able to get their heads around the idea of contingency. All systems have dependencies and assumptions: that the equipment won’t fail; that the software is bug-free; that users will follow the rules; and so on. You have to ask the question “what will you do when (not if) each assumption fails to hold?” and build in contingency. On the same basis, you have to prove that what should have happened actually did (this is called “reconciliation”) in case something failed but appeared to have worked.
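The essence of reconciliation is easy to sketch. Assuming a hypothetical ledger format – transaction references mapped to amounts – it boils down to comparing what was submitted against what was actually processed:

```python
def reconcile(expected: dict[str, float], actual: dict[str, float]) -> list[str]:
    """Report discrepancies between two ledgers keyed by transaction reference."""
    problems = []
    for ref, amount in expected.items():
        if ref not in actual:
            problems.append(f"{ref}: submitted but never processed")
        elif actual[ref] != amount:
            problems.append(f"{ref}: expected {amount}, processed {actual[ref]}")
    for ref in actual.keys() - expected.keys():
        problems.append(f"{ref}: processed but never submitted")
    return problems

# One payment altered in transit, one silently lost:
expected = {"tx1": 10.0, "tx2": 25.0, "tx3": 5.0}
actual = {"tx1": 10.0, "tx2": 20.0}
for problem in reconcile(expected, actual):
    print(problem)
```

Both failures here would be invisible to a system that only checked whether its own submission step succeeded – which is exactly why reconciliation has to compare against what actually happened downstream.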