Is the cloud getting less reliable? Cloud outages keep making the news, and it seems they are happening more often.
The answer, as always, is not a simple “yes” or “no”. It’s a bit more complicated. But it’s true that companies should take this factor into account and prepare for possible cloud outages so that they can handle them better.
More outages, but it’s to be expected
According to Uptime Institute’s 2022 report, public cloud outages over the past three years have occurred at roughly the same rate as in previous years. The survey notes that 80% of data center managers and operators say they have experienced at least one outage in the last three years.
So, that would mean a lot of outages, even short ones. But there’s more to the picture. The rate of new workloads being added to the cloud is exceeding the rate at which outages are growing. So, technically, there are more outages, but fewer than one would expect, considering how many new workloads are being added to the cloud all the time.
If this trend continues, the outage numbers should eventually flatten out and even start going down, Uptime forecasts. It also notes that many data center operators are investing considerable money and effort to improve their services. As time goes by, data center operators become better at managing complex, at-scale architectures, the report adds.
Common reasons for outages
So, what are the most common reasons for cloud outages? They aren’t changing much, according to Andy Lawrence, founding member and executive director of Uptime Institute Intelligence. The leading cause continues to be human error, he tells TechBeacon. It turns out there are various types of human error, and they aren’t always easy to measure and quantify.
Still, Uptime has found a way to track the failures those mistakes cause. The report says about 40% of organizations have experienced a major cloud outage because of human error. Of those outages, 85% were caused by employees who didn’t follow established processes or by flaws in the procedures themselves.
Interestingly enough, loss of power has historically been the most common cause of significant outages. And even then, most incidents again trace back to human error. Looking further back, the report finds that over the last 25 years, electrical failures have accounted for 80% of all IT load losses in data centers.
The next most common issue is networking and connectivity problems. Both architectures and topologies are becoming more and more complex, especially as organizations embrace hybrid cloud setups. “On the whole, cloud architectures provide high levels of service availability at scale,” reads the report. “Despite this, no architecture is fail-safe, and many failures have now been recorded that can be attributed to the difficulties of managing such complex, at-scale software, data, and networks.”
So, outages are to be expected, and organizations have accepted this fact: only 13% of the survey respondents say the cloud is resilient enough to carry all of their workloads. The report also notes it’s actually surprising there aren’t more cloud outages, considering the sheer scale and quantity of the workloads out there.
Recovery is complex and can be expensive
Cloud outages happen, and then someone has to fix them. Depending on the scale of the outage, the recovery can be quite costly and time-consuming. Human error is a factor here, too, in more ways than one.
Often, communication between different teams isn’t good. Sometimes it may even be non-existent, with everyone simply scrambling to fix their side of the issue. But some fixes may depend on others’ work, so the lack of communication may “break” more things or leave them unfixed for longer. The actual fix might also be flawed and require another fix. Or the scrambling may have caused other misconfigurations that have to be corrected either immediately or later, when they are rediscovered.
As you can imagine, the costs also pile up. In 2019, 60% of respondents said their average cost of an outage was less than $100,000. In 2021, just 39% said the same. And the share of outages costing between $100,000 and $1 million went from 28% in 2019 to 47% two years later.
“Things used to be a little simpler when it was just a VM because you’d just restart a VM. Now, you’ve got containers, you’ve got Kubernetes, you’ve got everything out there in the environment. It’s, in some ways, a more fragile mix,” Neil Miles, senior product marketing manager for Micro Focus’s ITOM Portfolio, tells TechBeacon.
And that’s just digital workloads. Sometimes, an outage can have real-world implications. For example, at the beginning of March 2023, the trains in Sydney, Australia, experienced an hour-long outage. This was the first outage for the new system, which went into operation in 2016. What’s worse, the system failed to switch to a backup network and data center, so all trains were halted. The switch had to be done manually, and an investigation is underway to find out why it didn’t happen automatically as programmed.
The outage caused severe transport disruptions, with tens of thousands of stranded passengers. Most of them rushed to alternative services, which meant more costs not only for the train operator but for its customers, too.
How to handle a cloud outage
As with many other IT tasks, preparation is key. If you plan to tackle a cloud outage only when it happens and decide what to do on the fly, you will be in for a world of pain. Companies need a strategy in place so they are better equipped to handle the situation.
That strategy will depend on a lot of factors. It should cover not only trivial outages but worst-case scenarios as well, such as a data center fire that destroys servers and backups. A lot of risks surface when a company starts to evaluate cloud outages. This is why it’s important to identify them and prepare, even if the majority have a very low chance of happening.
Outage response will vary depending on the type of cloud setup. For example, companies using a public cloud will have to take into account the types of services they use and how they use them. Are they relying on Infrastructure-as-a-Service and the virtual machines running on it? What recovery features does the public cloud offer? Can you add more of them? Are they included in the plan, or do you have to pay extra?
Also, each public cloud will have different rules and setups. Most also have multiple zones and regions, each with its own characteristics. Preparing for public cloud outages can mean balancing costs, as outage preparations often require additional investments up front, especially for bigger deployments with multiple VMs and workloads.
Private clouds give you more agility, as you decide what to do, but also greater responsibility. If something goes wrong, it’s primarily the company’s responsibility to handle it, even if it’s using a service provider for the infrastructure. A hybrid cloud could be the most complex, depending on how the various components are interconnected and how an outage would affect each of them.
Resiliency is a critical factor in surviving cloud outages, DataCenterKnowledge reports. If building resiliency into the service itself isn’t possible, then duplicating the service and data in a different zone or region of a public cloud is a must. Using two or more data centers, though, is feasible only on a public cloud or for the richest companies running private clouds.
Even when you do use a public cloud and can rely on two or more data centers, rerouting requests and data still takes additional effort. The solution here is load balancers, which can do this rerouting, provided they are configured properly, and that brings us back to the risk of human error.
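The rerouting idea itself is simple: probe each region in order of preference and send traffic to the first one that answers. Here is a minimal, provider-agnostic sketch of that logic in Python; the endpoint names and the health-check callable are illustrative assumptions, not any specific load balancer’s API.

```python
# Minimal sketch of health-check-based failover across regions.
# The endpoints and the health check are hypothetical stand-ins for
# whatever probes a real load balancer would run.

def pick_healthy_endpoint(endpoints, is_healthy):
    """Return the first endpoint that passes its health check.

    endpoints  -- ordered list of region endpoints, primary first
    is_healthy -- callable(endpoint) -> bool, e.g. an HTTP probe
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


regions = ["eu-west.example.com", "us-east.example.com"]

# Simulate an outage in the primary region.
down = {"eu-west.example.com"}
active = pick_healthy_endpoint(regions, lambda e: e not in down)
# Traffic now goes to the standby region.
```

A real load balancer runs such probes continuously and drains connections gracefully, but the failure mode is the same: if the health check or the endpoint list is misconfigured, traffic goes to the wrong place.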
Also, plan ahead. Often the most common way to handle a cloud outage is indeed simply to switch to another region. The thing is, the outage will affect thousands of other customers, DataCenterKnowledge notes. So, a lot of them could have the same idea, and they will all jump to another region that is already busy with its own clients. Chances are there won’t be enough capacity and VMs to go around for everyone.
Of course, few companies can afford to pay for such redundancy on a continuous basis just in case there’s an outage. So, often the solution is simply to be very fast when an outage hits and quickly switch to another zone. This is where having an outage strategy in place beforehand can be vital for weighing the available options and making fast decisions.
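One way to make those fast decisions repeatable is to encode the agreed strategy as an ordered runbook that the on-call team executes instead of improvising. Below is a hypothetical sketch in Python; the step names and the `execute` callable are illustrative assumptions, not a real incident-management tool.

```python
# A pre-agreed outage runbook encoded as ordered steps. The step names
# are illustrative; real steps would call provider APIs or paging tools.

RUNBOOK = [
    "confirm the outage scope with the provider's status page",
    "notify stakeholders on the incident channel",
    "fail over traffic to the standby region",
    "verify security controls after the failover",
]

def run_incident(runbook, execute):
    """Execute each step in order; return the steps that completed.

    execute -- callable(step) -> bool; False aborts the runbook so the
    team can escalate instead of continuing on a broken assumption.
    """
    done = []
    for step in runbook:
        if not execute(step):
            break
        done.append(step)
    return done
```

The point of the abort-on-failure behavior is that a half-executed failover is often worse than none: stopping and escalating keeps the team working from a known state.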
Also, don’t neglect security. Cloud outages are emergency situations and are treated differently, and people tend to neglect other responsibilities during an emergency. The same can happen with cybersecurity. In the rush to keep services running or restore them, access controls might get changed, encryption may be turned off, ports might get opened or misconfigured, and so on.
But hackers won’t simply be sitting around waiting for the outage to be fixed. They may be actively probing for any weaknesses to take advantage of. So, your outage strategy should also include steps to ensure the security controls remain in place and there are no blind spots. Otherwise, the company risks having to tackle two very different and very business-critical emergencies at the same time. And that’s not going to be fun.
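One simple safeguard is to snapshot the security-relevant settings before (or at the start of) an incident and diff them against the current state once the dust settles. The sketch below is a generic illustration in Python; the settings in the `baseline` dict are hypothetical and not tied to any provider.

```python
# Illustrative sketch: compare security settings captured before an
# outage against the current state, so emergency changes don't become
# permanent blind spots. The specific settings are hypothetical.

baseline = {
    "open_ports": {443},
    "encryption_at_rest": True,
    "admin_access": {"ops-team"},
}

def security_drift(baseline, current):
    """Return {setting: (expected, actual)} for settings that diverged."""
    return {
        key: (baseline[key], current.get(key))
        for key in baseline
        if current.get(key) != baseline[key]
    }

# During the scramble, port 22 was opened and encryption was disabled.
current = {
    "open_ports": {443, 22},
    "encryption_at_rest": False,
    "admin_access": {"ops-team"},
}

drift = security_drift(baseline, current)
# 'open_ports' and 'encryption_at_rest' show up as drifted and can be
# rolled back once the outage is over.
```

The same diff can run as a scheduled job during the incident, flagging emergency changes as they happen rather than weeks later.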