I used to think that we were good at designing systems. We would phase test them, user test them, stress test them and more, and eventually we would roll out the system and it would work. And life was good.
Then we started getting into a new world of developments where systems relied on networks, networks relied on servers, servers relied on mirroring, mirroring relied on programs and programs relied on programmers.
The interlinkages and interdependencies became more and more complex and clouded, rather than simpler and easier, and the systems started to fail. And life was bad.
This is the state of today’s nation, where we see increasing numbers of systems failures across the markets.
It used to be the odd outage, but most banks and exchanges reported 99.99999999999% uptime and, if they had that 0.00000000001% downtime, it was so minimal as not to be noticed.
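To put that sort of figure in perspective, here is a rough sketch of how an uptime percentage converts into downtime per year. The function name and the non-leap-year assumption are mine, purely for illustration, and `Decimal` is used only because ordinary floats cannot represent that many nines exactly.

```python
from decimal import Decimal

# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = Decimal(365 * 24 * 60 * 60)


def downtime_seconds_per_year(uptime_percent: str) -> Decimal:
    """Seconds of downtime per year implied by an uptime percentage.

    The percentage is passed as a string so that Decimal keeps every digit.
    """
    downtime_fraction = (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)
    return downtime_fraction * SECONDS_PER_YEAR


# "Four nines" allows a little under an hour of downtime a year.
print(downtime_seconds_per_year("99.99"))
# The figure quoted above works out to a few microseconds a year.
print(downtime_seconds_per_year("99.99999999999"))
```

On those assumptions, the quoted figure leaves room for only a few millionths of a second of downtime a year, which is why nobody would notice it.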
This has changed, as I seem to blog more and more about systems outages.
The last time was after the RBS glitch of summer 2012 – maybe this is a summertime thing? – when I noted a lot of other outages around the same time:
- The RBS glitch
- The Flash Crash of 2010
- The issues faced by Aussie banks
- The London Stock Exchange outages
- Santander’s systems consolidation issues
- NASDAQ’s failures during the Facebook IPO
- The $440 million Knight Capital glitch
- BATS going batty
- Madrid and Tokyo stock exchanges outages
- The Nationwide glitch
Since then, there have been several others noted:
- NatWest hit by system failure less than a year after last outage
- Lloyds' banking systems failure hits 22m retail customers
- Lloyds Banking Group suffers (another) system failure ahead of NYE celebrations
And then the real biggie is Nasdaq’s outage, again. But is this down to Nasdaq or NYSE or something in between?
From Reuters today:
Five days after a glitch that paralyzed Nasdaq-listed stocks for three hours on all U.S. markets, Nasdaq and NYSE have a different understanding of what happened in the period preceding and during the blackout, with each side blaming the other for the outage, according to the sources.
At the center of the disagreement is the role of Arca, NYSE's fully electronic stock market. The blackout, which saw trading in about 3,200 Nasdaq-listed stocks such as Apple, Google and Facebook grind to a halt, was preceded by connectivity problems between Arca and the Nasdaq-operated Securities Information Processor (SIP). The SIP consolidates stock prices and distributes them to the market.
What's not clear is whether the problem at the SIP was caused by issues at Arca or technical flaws at the processor.
The likelihood is that these things will get worse.
For example, I remember a recent discussion with a bank about cloud computing, where I asked what would happen if the systems were down.
We would blame the service provider, said the bank.
The service provider would blame their service provider, says I.
The bank says that’s their problem.
Not if it’s your downtime, says I.
In other words, you can have all the Service Level Agreements (SLAs) and penalty clauses in the world, but the world does not work the way you think anymore.
If the systems are down, is it the network, the cloud storage system, the SaaS, the interconnectivity, the latency, the … the … the … the …?
The world of today is incredibly complex, operating on systems that are interdependent and highly reliant upon each other.
Even when you run your own systems, you may find the issue is not your own but your partner’s, or your partner’s partner’s, or your partner’s partner’s partner’s, or …
Well, it’s obvious, isn’t it?
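The partner-of-a-partner point can be put into rough numbers. Here is a minimal sketch, assuming every link in the chain must be up at the same time and that the links fail independently of one another (a simplification, since real outages often cascade). The link names and the 99.9% figures are made up for illustration, not taken from any real SLA.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a service that needs every link up at once.

    Assumes failures are independent, so the availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical chain: each link promises 99.9% availability on its own.
links = {
    "network": 0.999,
    "cloud storage": 0.999,
    "SaaS provider": 0.999,
    "provider's provider": 0.999,
}

combined = chain_availability(links.values())
hours_down = (1 - combined) * 365 * 24

print(f"Combined availability: {combined:.4%}")
print(f"Expected downtime: {hours_down:.1f} hours a year")
```

On those made-up figures, four links at 99.9% each give roughly 99.6% overall, or about 35 hours of downtime a year against under 9 hours for any single link. Each provider can meet its own SLA while the end-to-end service falls short of all of them.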