The data center industry's focus has generally followed a pattern of technological innovation: mechanical cooling (e.g. containment), electrical efficiency, new energy sources, and greater integration and management (e.g. the rack as a computer – Open Compute, DCIM, etc.).
One factor that does not stand out but affects all of the above is the human factor, whether in planning, selection and procurement, design, testing and commissioning, or operations.
A poorly designed data center can hum along fine through the hard work and effort of the operations team, while a well-designed and well-implemented data center can suffer multiple unplanned downtime incidents because of one inexperienced or careless member of the operations team.
Separate studies by the Uptime Institute and Ponemon pegged the share of data center outages attributable to human error at between 22% and 70%. Even the lower figure of 22% is a significant percentage.
I have personally experienced the consequences of several outages caused by human error, and in every case the post-incident review showed that the outage could have been avoided. Even problems caused by design lapses can be mitigated through third-party review and the execution of a mitigation plan. Design limits may be exceeded, or equipment may deteriorate, through prolonged maintenance windows or environmental factors (Singapore's high humidity is far tougher on equipment), and so forth.
A well-run facility, after several years of stringent adherence to operational discipline, is like a well-run military camp: everything is labelled and no wire is left dangling. Every member of the operations staff knows their job when asked, such as which operations manual is the correct one for performing maintenance on a diesel generator, and where it is kept. You know you will have a problem when a staff member needs to ask someone else; if this happens at night while he is on duty, guess how long it takes him to call a colleague or his superior to react to an issue?
When a serious data center outage is traced to human error, the error may have taken years to manifest, and the person blamed is the current manager in charge. The hidden problem may have been made worse by years of neglect.
There was a job opening at a Singapore data center complex that remained vacant for more than two years, and hardly anyone applied for it, because the site was known to suffer power outages due to insufficient capacity. No one wanted to get into the hot seat. Eventually the entire building had to be upgraded, with all the existing tenants moving out to facilitate the upgrading works.
When I conduct a data center operations audit, one area I pay particular attention to is the organization chart and the authority vested in the data center manager. In one instance, there was no well-defined organization chart, and a crucial post was vacant and merely "covered" by a subordinate; I highlighted in the report that this was a critical gap that needed to be addressed right away.
While it costs a great deal of investment to build a data center, and it is technically and financially challenging to upgrade any of its components, it is worthwhile to upgrade and enhance the people who run and manage it. An annual training plan should be drafted and regularly reviewed. Tabletop exercises should be planned, executed, and reviewed. Regular data center operations meetings and sharing sessions should be held to surface potential problems and solutions.

One thing that marks a well-run data center is how it deals with near-misses: the operations team reviews the reasons for each near-miss (regardless of whether it relates to safety or to operations) and creates measures to mitigate and reduce the risks that led to it. In one incident, a data center power outage was caused by a general cleaner wiping the anti-static mat in the low-voltage switchboard room with a wet mop; a facility engineer doing his inspection rounds in worn-out shoes slipped, grabbed hold of a power switch, and turned it off. The incident could have been avoided in many possible ways, which I will leave the reader to work out.
One area where Asia generally lags is the sharing of incident information to prevent future occurrences. I agree with Ed Ansett's comment (see reference link 4) that problems can be avoided if lessons from repeated failures elsewhere are shared. I think we can no longer afford not to, given that outages have become more expensive and more impactful.
Training and development in data center fundamentals and operations is one area where I see greater investment, and I expect the trend to continue upward, given that the data center has become more mission-critical than ever for enterprises large and small.
This is an area that warrants greater thought and much more work to enhance availability, prevent outages, and improve recovery time. The people on the data center operations team will appreciate being recognized for their hard work, and they will perform all the better for it.