The essential steps for data center recovery

14.08.2024

Data centers are a critical part of the IT infrastructure. They are also among the most durable, well-constructed, and secure buildings. Despite that, they can still malfunction in a variety of ways. Natural disasters could also strike and damage them.

Whatever the case, data center recovery is a big and complex process, and it’s incredibly important that it’s done right and done quickly. Each minute of downtime can cause major disruptions that are felt across entire regions and even globally. It’s no wonder that data center recovery has become a hot topic, especially now that global reliance on these facilities is growing so quickly.

According to a 2023 Uptime Institute survey, 55% of data center operators have had at least one outage in the past three years. Only 10% of the outages were considered serious or severe in 2023. Overall, the number of outages is going down. That’s good news, at least until we look at more figures. It turns out 70% of data center incidents cause additional costs of $100,000 or more. An additional survey by Vertiv shows that, on average, there are 2.4 total facility shutdowns per year and the average duration is 138 minutes. Needless to say, those 138 minutes (about 2.3 hours) are quite stressful.
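To put the shutdown figures in perspective, here is a minimal sketch that turns them into yearly downtime and an availability percentage. The function names are illustrative, and the calculation assumes every shutdown lasts the average duration, which real outages will not.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(shutdowns_per_year: float, avg_minutes: float) -> float:
    """Expected total downtime per year, in minutes (assumes average-length outages)."""
    return shutdowns_per_year * avg_minutes


def availability_percent(downtime_minutes: float) -> float:
    """Share of the year the facility is up, as a percentage."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


# Figures quoted from the survey: 2.4 shutdowns per year, 138 minutes each.
total = annual_downtime_minutes(shutdowns_per_year=2.4, avg_minutes=138)

print(f"Single outage: {138 / 60:.1f} hours")
print(f"Per year:      {total:.1f} minutes ({total / 60:.1f} hours)")
print(f"Availability:  {availability_percent(total):.3f}%")
```

With these inputs, a single average shutdown lasts about 2.3 hours, the yearly total comes to roughly 5.5 hours, and availability works out to about 99.94%.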

The importance of a data center recovery strategy

Even though data center outages might be rare, when one happens, it’s painful and expensive. It’s best to have a proper data center recovery strategy in place, one that is well thought out, practiced, and general enough to serve as a basis for responding to any disaster. The key role of any such strategy is to provide the foundation for disaster response and recovery and to be a guide during stressful moments.

It’s impossible to prevent or predict every disaster. That’s why the strategy shouldn’t even try to do so as it will simply lead to more issues. The strategy should segment the main types of data center disasters and then build on top of that for each. Of course, some disasters might combine two or more types together, but having a plan for each will help guide people through the process.

The main types of data center disasters

As mentioned, one disaster could lead to another, or they can layer up. Despite that, there are different categories to make it easier to know what to do for each one.

One of the most common issues is a power outage. Power outages often cause major downtime and additional system failures. Powering each server back up usually takes time, and each one may need attention to its configuration, error handling, and so on.

The next one is human error, says DataCenterKnowledge. According to Uptime’s 2022 report, nearly two-thirds of data center outages are caused by human error. In most cases, it’s not even an “honest error”, meaning something done unintentionally because the employee wasn’t informed or lacked the skills or experience. In fact, in 85% of human error outages, the cause is that employees didn’t follow procedures or that the processes themselves were flawed. Among the most common human errors are accidental disconnection of power sources, overloaded circuits, and misconfigurations.

Cyber-attacks are the next most common data center disaster. According to the 2023 State of the Data Center report by AFCOM, two-thirds of organizations around the world suffered at least one cyber-attack in 2022. The average disruption time was five days.

Finally, we have natural disasters. While rare, they can still happen. Depending on the disaster, there can be different types of damage, including physical and structural. Fires, floods, earthquakes, landslides, tornadoes – it all depends on the location of the data center and each facility is exposed to different risks.

The basics of data center recovery

Now that we have pinpointed some of the main disasters, it’s time to plan for them. The plan should feature a few basic steps as a foundation. It all starts with considering the location of the data center, Rahi notes.

When the data center was built at the given location, the most likely natural disasters for that area were already well known. Ideally, the data center was built with those disasters in mind and can withstand them up to a certain point. That covers both the strength of the building and the placement of the servers. For example, servers should be kept higher up if the main risk in the area is flooding, or as low as possible, even underground, if the biggest danger is a tornado.

The second, and most important, step is backup power. Of course, every data center has backup power generators, UPS units, etc. Many operators, though, are surprised to find that these systems fail to kick in properly when needed, aren’t sufficient, or simply malfunction. This lengthens the downtime and can make the outage worse. As such, many big data centers are starting to rely on more than one energy source for their main supply and are setting up more than one backup option. This diversification can be a bit expensive to set up, but it will pay off in the long run. It can also justify higher prices, since customers can be shown guaranteed multiple power sources and backups.

The protection of the data center continues with the interior. One of the most important measures is fire suppression. No matter where the facility is located, internal fires are always a risk. Rahi recommends that data centers use a dry “pre-action” system, which can extinguish most fires before the conventional sprinkler system is activated. These dry systems often use inert gases, which displace oxygen and suppress the fire. Regular testing of the systems and alarms is also a must, to be sure everything will work as intended when needed.

For data centers in flood-prone areas, a pumping system is also a must. It should be automated and have guaranteed power, so that it keeps working when the electric grid is damaged. For earthquake-prone areas, you will have to choose racks and cabinets that are rated for seismic activity. They can have special mounting brackets, giving extra support and security for the servers.

If possible, set up a redundant data center, one that can step in when there’s a disaster at the primary facility. This is among the best solutions, but it’s also the most expensive, as it requires doubling everything. It should be far enough away to be safe from any regional disaster, but not so far that it ends up in a different region and too far from the clients. Usually, 150-200 km is considered a good distance.
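The distance guideline can be checked with a few lines of code. The sketch below uses the haversine formula to estimate the straight-line distance between two sites. The function names and the sample coordinates are hypothetical, and the 150-200 km window is simply the rule of thumb quoted above, not a formal standard.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_recommended_distance(primary, secondary,
                                min_km: float = 150.0,
                                max_km: float = 200.0) -> bool:
    """True if the secondary site falls inside the rule-of-thumb window."""
    return min_km <= haversine_km(*primary, *secondary) <= max_km


# Hypothetical (latitude, longitude) pairs, for illustration only.
primary_site = (50.0, 10.0)
secondary_site = (51.5, 10.5)

distance = haversine_km(*primary_site, *secondary_site)
print(f"Distance between sites: {distance:.0f} km")
print("Within 150-200 km:", within_recommended_distance(primary_site, secondary_site))
```

Straight-line distance is only a first filter. Shared flood plains, fault lines, or power grids matter more than the number of kilometers on its own.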

The next layer of data center recovery

The data center recovery plan must then move to the next frontier. Here are some of the best practices and actions to take first when disaster strikes.

The priority should be the employees. Their safety is paramount and should be ensured before any next step is taken. Ensuring said safety includes a lot of preemptive work. This can include effective training, good communication, and creating and implementing said recovery strategy.

Then the data center operator should move on to actively testing the strategy. This will help pinpoint additional weak points that must be addressed and show whether the chosen measures and actions are suitable. The test exercises should start small and be expanded as results come in.

Another best practice is to establish a list of vendors to contact in an emergency. They might be the ones to provide new equipment, tools, or other supplies that can be vital for a timely recovery.

Finally, be flexible. Data center recovery will require flexibility and timely reactions from everyone involved. The recovery strategy should also take this into account and include flexibility in the planning. The processes should be tested and updated often and should be developed in a way that allows the employees to adapt to the given situation and motivates them to make decisions as needed to ensure the best possible recovery results.

Having all this in place is important for building and retaining your customers’ trust. They will be far more interested in joining your data center if they see that you have thought through the details as much as possible and have set up competent procedures and measures that keep the data center, and thus their data and/or colocated servers (and reputation), safe.
