Data centers are modern facilities, and as such they house many of the latest and greatest technologies from the world of IT. Despite that, they still rely on humans. A lot. And humans, it turns out, tend to be very important to the overall operation of data centers.
Despite this reliance, data center operators have struggled for years to find and retain enough qualified staff. The issue was further exacerbated at the start of the pandemic in 2020, when 70% of data center managers reduced on-site staffing and many facilities operated with a fraction of their usual employees, according to data from FacilitiesNet.
A separate survey by Uptime Institute from 2018 showed that more than 50% of data center operators were already struggling to hire and retain qualified staff.
“As we have seen in 2020, it’s often the unplanned events that most challenge our preparation for change. Technology has helped businesses adapt. However, that doesn’t mean that IT professionals haven’t had to overcome their own challenges. 2020 may go down as the year in which the good and bad of IT have both been amplified. For example, IT organisations using cloud services before the pandemic were able to lean on their provider to support their changing business environment. However, organisations that managed their own infrastructure, and were burdened with a talent shortage prior to the lockdowns, likely saw that risk become more pronounced,” Justin Augat, Vice President of Marketing at iland, told DataCentreMagazine.
Fast forward nearly four years and the situation is basically the same. Data center operators are still struggling to fill all the positions in their facilities, and this is starting to affect the quality of service.
More interruptions, higher expectations
In 2023 there was an increase in data center outages that could have been avoided, or that lasted longer than necessary, because there were not enough staff at the facilities, DataCenterDynamics notes. The publication gives a specific example: the outage Microsoft suffered in Australia on 30 August 2023.
That outage happened in Sydney, in one of the key data centers for the company’s Australia East region. Customers had issues accessing and using Azure, Microsoft 365, and Power Platform. The problems lasted for 46 hours, which is regarded as a very long time in the industry. This is especially true in today’s environment, where customers expect 100% availability and even the slightest disruption irritates them. Now imagine 46 hours of issues.
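To put that figure in perspective, a quick back-of-the-envelope calculation shows what 46 hours of downtime means for availability over a year. The SLA targets in the snippet are common industry figures used purely for comparison, not values taken from Microsoft:

```python
# Rough availability arithmetic: what 46 hours of downtime means over a year.
# The 46-hour figure comes from the incident; the SLA comparisons are common
# industry targets used here for illustration only.
HOURS_PER_YEAR = 365 * 24          # 8,760 hours
downtime_hours = 46

availability = (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR
print(f"Availability over the year: {availability:.4%}")  # ~99.4749%

# Downtime allowed per year by common SLA targets (illustrative):
for label, target in [("99.9% ('three nines')", 0.999),
                      ("99.99% ('four nines')", 0.9999)]:
    allowed_hours = HOURS_PER_YEAR * (1 - target)
    print(f"{label}: about {allowed_hours:.2f} hours of downtime per year")
```

In other words, a single 46-hour incident on its own pushes a service well below even a modest 99.9% annual availability target.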
According to Microsoft, it all started with a power sag. “This power sag tripped a subset of the cooling system chiller units offline and, while working to restore cooling, temperatures in the data center increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware,” says the company.
Microsoft says the power sag itself was caused by a lightning strike on electrical infrastructure located 29 km (18 miles) from the data centers. The company explains in detail what happened:
“The voltage sag caused cooling system chillers for multiple data centers to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the data center rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one data center to the next. By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.”
As a result, some of the IT equipment started shutting down automatically as temperatures rose. The onsite team also had to progressively shut down additional networking, compute, and storage infrastructure to protect the data and equipment, which further worsened the situation. And as we know, powering equipment back up and restoring services also takes time and manpower. Often, some services then “act up” and require additional effort to bring back to the required level.
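Microsoft’s description boils down to two automated protections: a chiller refuses to restart once the water in its chilled water loop is too hot, and IT equipment is powered down as data hall temperatures climb. The sketch below illustrates that kind of logic in simplified form; every threshold, name, and value is an assumption for illustration, not Microsoft’s actual control code:

```python
# Simplified sketch of the two protection behaviours described in the incident
# report. All thresholds, names and values are illustrative assumptions, not
# Microsoft's actual control logic.

CHILLED_WATER_MAX_C = 25.0   # hypothetical: above this, a chiller restart is inhibited
DATA_HALL_MAX_C = 35.0       # hypothetical: above this, IT equipment gets powered down


def can_restart_chiller(chilled_water_temp_c: float) -> bool:
    """Self-protection: restart is blocked if the water in the chilled water
    loop is already too hot for the chiller to process safely."""
    return chilled_water_temp_c <= CHILLED_WATER_MAX_C


def shed_load(data_hall_temp_c: float, racks: list[str]) -> list[str]:
    """Progressively power down racks while the hall is above threshold.
    Returns the racks shut down to protect hardware and data."""
    shut_down = []
    for rack in racks:
        if data_hall_temp_c <= DATA_HALL_MAX_C:
            break
        shut_down.append(rack)
        data_hall_temp_c -= 0.5   # assume each rack powered down removes some heat load
    return shut_down


# By the time technicians reached the last chillers, the loop was too hot:
print(can_restart_chiller(chilled_water_temp_c=31.0))   # False -> restart inhibited
print(shed_load(37.0, ["storage-1", "compute-1", "compute-2", "net-1", "net-2"]))
# ['storage-1', 'compute-1', 'compute-2', 'net-1']
```

The key point is that both protections are time-sensitive: the longer restarts take, the hotter the loop gets, and the more equipment has to be sacrificed to keep temperatures in check.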
Insufficient staffing to blame
Microsoft’s post-incident review highlighted staffing as one of the causes. The staff present were qualified and had the skills and experience required to handle the incident; there simply weren’t enough of them to do all the work quickly enough. Instead, they had to rush from issue to issue and were physically unable to do more. Microsoft also says the team made some mistakes, not because of a lack of skills but because of missing sequence procedures, which have since been implemented.
Microsoft has also increased staffing, including raising the night shift from three to seven technicians. “Data center staffing levels published in the Preliminary PIR only accounted for “critical environment” staff onsite. This did not characterize our total data center staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page,” the company says.
Further analysis of the incident, though, uncovered more inconsistencies: there were more people at the facility, but not all of them were in the operations center, and some of the procedures could also have been handled by remote workers.
On the whole, the company said the staff did everything possible, and there was a lot of manual work to be done. For example, 20 chillers were in an error state, with 13 requiring manual restart, which means “You’ve got to run out onto the roof of the building to go and manually reset the chiller, and you’re on the clock”.
A risk for all
Uptime Institute notes that these types of issues are a risk for all data center operators. “This happens. And it can potentially happen to any organization. Data center operations are critical. From a facilities standpoint, uptime and availability is a primary mission for data centers, to keep them up and running,” says Ron Davis, vice president of digital infrastructure operations at Uptime.
The good thing is that technologies are constantly developing and improving, so equipment, systems, and skills are all better than they were a year ago, let alone further back. Remote monitoring and data center automation are also improving and giving operators more options. Despite that, there will always be a need for people physically present at the data center. And as data centers become bigger and more complex, the on-site staff must be better prepared.
This leads us to the question: what are the optimal staffing levels? As you might expect, there’s no universal answer. Each data center is specific and has its own needs, unique design, requirements, levels of automation, and so on. Thus, data center operators must determine the optimal staffing individually for each facility, says John Booth, chair of the Energy Efficiency Group of the Data Centre Alliance. In a comment for DataCenterDynamics, Booth says there are multiple factors operators must consider.
For example, consider whether outsourced personnel are available to respond to specific maintenance and emergency tasks within a set timeframe, e.g. less than four hours. Operators should also analyze their internal procedures to determine whether they are efficient enough and how many members of staff are needed to carry them out properly within a reasonable time, as sketched below. It turns out many operators are simply guessing whether a certain number of staff is enough or not.
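As a rough illustration of the kind of analysis Booth describes, the sketch below estimates the headcount needed to complete a set of emergency procedures within a response window. All of the tasks, durations, and targets are hypothetical figures invented for the example, not numbers from the article:

```python
import math

# Hypothetical emergency procedures and the hands-on minutes each one takes.
# None of these figures come from the article; they are invented for the example.
TASKS = {
    "manually restart a chiller": 20,
    "verify pump and valve states": 10,
    "inspect UPS and switchgear": 30,
    "walk the data hall and log temperatures": 15,
}

RESPONSE_WINDOW_MIN = 60   # assumed target: all work finished within one hour
CHILLERS_TO_RESTART = 13   # the number that needed manual restarts in the Sydney incident


def required_staff(tasks: dict, window_min: int, chillers_down: int) -> int:
    """Naive estimate: total hands-on minutes divided by the response window.
    Assumes work can be parallelised freely, which is optimistic."""
    chiller_minutes = tasks["manually restart a chiller"] * chillers_down
    other_minutes = sum(m for name, m in tasks.items()
                        if name != "manually restart a chiller")
    return math.ceil((chiller_minutes + other_minutes) / window_min)


print(required_staff(TASKS, RESPONSE_WINDOW_MIN, CHILLERS_TO_RESTART))
# 13 restarts * 20 min + 55 min of other work = 315 min -> 6 technicians for a 1-hour window
```

Even a crude model like this beats guessing: it forces an operator to write down what the procedures actually are, how long they take, and how quickly they must be completed.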
Then comes the question of finding and retaining staff. That can be a problem, especially for data centers in remote areas. One obvious solution is for operators to simply open the checkbook. Investing in training staff, providing them with good pay, plenty of benefits, and good housing or transport would surely help retain them, right? The problem is that operators are wary of this approach: they don’t want to spend that much money on staff if they can’t be sure those people will stay, or that they will even be needed in a few years’ time as technologies develop further and automation takes over more and more of data center operations.
This is why Taj El-Khayat, regional director MENA at Citrix, recommends strategic recruitment, DataCentreMagazine notes. “The digital transformation’s acceleration across organisations due to the pandemic has exposed skill shortages. This has especially been the case for data center operators, requested to provide the best and most stable service while facing a drastic and sudden increase in load,” he says.
This is why operators should work with recruitment teams, building long-term talent pipelines and investing in university and college programs. “University programs are essential for building future capacity and specially to create incentives for a more diverse workforce – for example encouraging more female students to join the IT sector,” El-Khayat says. The only way for the IT industry, data center operators included, to solve the staffing shortage is long-term planning, even if that means investing despite uncertainty about how many staff will be needed, where, and when.