HUMAN ERROR: The Biggest Challenge to Data Center Availability and how we can mitigate it – Part 2


The previous article on this topic can be found via the last link in the references below.

The layered approach to maintaining data center infrastructure availability should not behave like the Swiss cheese model, where the holes in every layer line up: a hazard or trigger should be stopped by one of the layers, preferably as early as possible.
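To see why stacking independent layers pays off, here is a minimal back-of-the-envelope sketch in Python; the per-layer probabilities are illustrative assumptions, not measured data.

```python
# Illustrative only: assumed per-layer "miss" probabilities, not measured data.
# Each value is the chance that a given layer fails to stop a hazard.
layer_miss_probability = {
    "design": 0.05,          # design review misses the flaw
    "commissioning": 0.10,   # testing/commissioning does not surface it
    "operations": 0.10,      # MOS/RA and procedures do not catch it
    "incident_mgmt": 0.20,   # escalation fails to contain it in time
}

# A hazard causes an outage only if every layer misses it (the holes in the
# Swiss cheese line up), so the combined probability is the product.
p_outage = 1.0
for layer, p_miss in layer_miss_probability.items():
    p_outage *= p_miss

print(f"Chance a hazard passes all layers: {p_outage:.4%}")  # 0.0100%
```

The point of the model is that weakening any single layer, or letting layers become correlated (for example, the same person writing and reviewing a procedure), raises that product dramatically.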

[Figure: Swiss cheese model]

The layers should include the following:

  • Design (in accordance with the owner's design intent), with either a concurrent maintainability or a fault tolerance objective
  • Implementation (in accordance with the design brief), fully tested via a comprehensive testing and commissioning phase before handover, with fully documented SOPs/MOPs/EOPs
  • Maintenance and operations management: work by equipment service providers, or any other work on site, governed by a Method of Statement and Risk Assessment matrix prepared by suitably qualified persons
  • Incident and problem management, escalation management and mitigation processes

and so forth

Inadequacies in each of these layers can result in problems such as:

  • Inherent Design / Setting flaw
    • Outdated / Swiss cheese situation
    • Requires analysis and manual intervention
    • Error Producing Conditions (EPC)
  • Weakness in manual processes
    • Inadequate automation
    • Inadequate training / familiarity
    • Inadequate operations procedures
  • Insufficient Information / knowledge
    • Capacity limit reached earlier than design intent
    • Inadequate training / knowledge
    • Inadequate documentation
  • Insufficient Risk Assessment
    • Inadequate MOS / RA and risk matrix
    • Vendor experience

 

Learn from other industries

Our data center industry is relatively young; other industries with mission-critical infrastructure have undergone extensive research and iterative enhancement, and we can learn from and adopt their practices.

  • Airlines’ Crew Resource Management
    • Checklists and double-checking by the pilot and co-pilot of the airplane’s airworthiness
    • Structured communication within the cockpit, and between cabin staff and the cockpit, to ensure timely and prioritized responses
  • US Nuclear Regulatory Commission
    • The Standardized Plant Analysis Risk – Human Reliability Analysis (SPAR-H) method, which takes account of the potential for human error (a simplified numerical sketch follows this list)
  • OECD’s Nuclear Energy Agency
    • Ways to avoid human error, e.g.:
      • designing systems to limit the need for human intervention;
      • distinctive and consistent labelling of equipment, control panels and documents;
      • displaying information about the state of the plant so that operators do not need to guess and risk a faulty diagnosis;
      • designing systems to give unambiguous responses to operator actions, so that incorrect actions can be easily identified; and
      • training operators better for plant emergencies, including through the use of simulators.
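The SPAR-H method itself is considerably more involved, but a heavily simplified sketch of its core idea (a nominal human error probability scaled up or down by performance shaping factors) might look like this in Python; the nominal value and the multipliers are illustrative assumptions, not the published SPAR-H tables.

```python
# Heavily simplified illustration of the SPAR-H idea: a nominal human error
# probability (HEP) is scaled by performance shaping factor (PSF) multipliers.
# All numbers here are illustrative assumptions, not the published SPAR-H tables.

NOMINAL_HEP = 0.001  # assumed nominal error probability for a routine action

# Assumed PSF multipliers for one specific task (>1 worsens, <1 improves).
psf_multipliers = {
    "available_time": 10.0,    # severe time pressure
    "experience": 3.0,         # technician unfamiliar with this equipment
    "procedures": 1.0,         # written MOP is adequate
    "fitness_for_duty": 5.0,   # end of a long shift
}

composite = 1.0
for factor, multiplier in psf_multipliers.items():
    composite *= multiplier

# Cap at 1.0 so the result stays a valid probability.
hep = min(1.0, NOMINAL_HEP * composite)
print(f"Estimated HEP for this task: {hep:.3f}")  # 0.150
```

Even this crude arithmetic makes the point: stacking several adverse conditions onto one task can raise the error probability by two orders of magnitude.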


Error Reduction Strategies and Error Precursors

In addition, error-reducing strategies can be applied across data center maintenance and operations management to reduce the probability of human error. Whether designing the data center power and cooling infrastructure or assessing the risk of a particular maintenance operation (e.g. a power switch-over exercise to perform UPS or backup generator maintenance), the strategies below should be applied.

Take, for example, the AWS US-East-1 outage (http://mashable.com/2017/03/02/what-caused-amazon-aws-s3-outage/): the command set was powerful, and a typo could take a large number of servers offline very quickly. In its post-incident summary (https://aws.amazon.com/message/41926/), AWS said it would limit the speed of the tool's effect, i.e. put in safety checks, which is essentially an application of the constraint strategy.
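A minimal sketch of what such a constraint could look like in an internal operations tool is shown below; the function, thresholds and numbers are hypothetical illustrations, not AWS's actual tooling.

```python
# Hypothetical sketch of the "constraint" strategy: the tool refuses to remove
# more than a small fraction of serving capacity in one command, and never
# drops below a minimum safe fleet size. Not AWS's actual tooling.

MAX_REMOVAL_FRACTION = 0.05   # assumed policy: at most 5% of capacity per command
MIN_ACTIVE_SERVERS = 100      # assumed safety floor for the subsystem

def validate_capacity_removal(active_servers: int, requested_removal: int) -> int:
    """Return the number of servers that may actually be removed."""
    if requested_removal <= 0:
        return 0
    allowed_by_fraction = int(active_servers * MAX_REMOVAL_FRACTION)
    allowed_by_floor = max(0, active_servers - MIN_ACTIVE_SERVERS)
    allowed = min(requested_removal, allowed_by_fraction, allowed_by_floor)
    if allowed < requested_removal:
        print(f"Safety check: capping removal from {requested_removal} to {allowed}.")
    return allowed

# Example: a typo requests removal of 5000 servers instead of 50.
print(validate_capacity_removal(active_servers=2000, requested_removal=5000))  # 100
```

The design choice is to let the tool, rather than the operator, enforce the blast radius, so a typo degrades capacity gradually instead of all at once.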

[Figure: Error reduction strategies]

When a service or repair task is assigned to operations staff, or to qualified technicians from an equipment service provider, evaluating whether error precursors exist and eliminating them will reduce the likelihood of human error. For example, the combination of time pressure, an inexperienced staff member already at the end of a long shift, and an ambiguous task objective all raise the risk of the assigned task. Eliminating or reducing these precursors, for instance by re-assigning the task to an experienced staff member at the start of a shift with a clear task objective, lowers that risk.
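A minimal sketch of how such a precursor check could become a routine gate before a task is released follows; the checklist entries and the threshold are hypothetical, not taken from a published taxonomy.

```python
# Hypothetical pre-task check: count the error precursors present before
# releasing a maintenance task, and hold the task if too many line up.
# The precursor names and the threshold are illustrative assumptions.

PRECURSORS = [
    "time_pressure",
    "inexperienced_staff",
    "end_of_long_shift",
    "ambiguous_task_objective",
    "first_time_task",
    "distracting_environment",
]

HOLD_THRESHOLD = 2  # assumed policy: two or more precursors => re-plan the task

def review_task(present: set) -> str:
    """Return 'release' or 'hold' based on how many precursors are present."""
    count = sum(1 for p in PRECURSORS if p in present)
    return "hold" if count >= HOLD_THRESHOLD else "release"

# Example from the text: time pressure + inexperienced staff + end of shift
# + ambiguous objective => hold and re-plan the task.
print(review_task({"time_pressure", "inexperienced_staff",
                   "end_of_long_shift", "ambiguous_task_objective"}))  # hold
```

Used this way, the checklist turns a gut feel ("does this task feel risky?") into an explicit, reviewable decision recorded alongside the MOS / RA.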

[Figure: Error precursors]

Risk Mitigation is a Continuous Process

A multi-pronged, multi-layered approach with attention to detail is required to mitigate the risk of human error causing an outage in a data center facility.

[Figure: Risk mitigation process flow]

 

A data center should be designed and implemented to a clear and tested design intent (e.g. the objective of being concurrently maintainable). Day in and day out, operations staff, vendors and client personnel interact with the systems within the data center, so there needs to be a well-oiled system in place, not just documentation, that works 24×7 for as long as the data center exists.

An iterative risk mitigation system, backed by consistent management support and attention and fed by knowledge learned from near misses and incidents, is a key attribute of an environment that is resilient on the human front.

We humans can reduce human error, but it takes effort

We should look at the data center organization, especially the operations team: its resources and tools, its capability, and so forth. A no-blame culture, active participation by all staff in addressing potential weaknesses and error precursors, and follow-up on near misses (which signal error-inducing conditions) are all important in mitigating the effects of human error. We should move away from pointing fingers and instead learn from past problems, as AWS did with its incidents. Our data center industry can also do more to share and learn from one another, to prevent recurrence of issues that have already been faced and dealt with elsewhere.

This built-up knowledge of good practices should be documented and disseminated, with management support. The weakest link is an inexperienced staff member hesitating or, worse, making a wrong decision, so training everyone on the operations team is critical to maintaining data center availability.

A periodic (for example, annual) no-nonsense third-party review of data center operations and management, coupled with improvement plans to strengthen the weakest links, gives insight and assurance to C-level executives, data center operations managers and clients. Most operations managers are too busy to review their own operations; they are also in the awkward position of having to find their own faults, and their experience may be limited if they have not worked at more than one or two data center sites. A third-party operations and management review is therefore the next best way to enhance resilience against human error, provided it has full cooperation from the top to the bottom of the data center staff.

Furthermore, once a data center service provider has grown beyond two or three data centers, it becomes difficult to manage operations consistently across them, especially if each site is managed independently. A third-party review applied to all of them will help rein in inconsistent operations processes, provided, of course, that the service provider has a central data center operations programme function.

Ultimately, a data center facility depends on well-trained, knowledgeable staff who know their facility's details (or where to quickly find the documentation that contains them) and who properly carry out the risk assessment work of evaluating equipment service vendors and upgrade works.

In summary,

  • It is worthwhile to commit resources to reduce errors
  • We can improve our resiliency, and thereby our uptime, through available methods and tools
  • There are proven methods and tools we can borrow from other mission critical environments
  • A third-party data center operations and management review, coupled with an improvement plan, should be considered for large data center operations, especially those with multiple sites

 

References:

  1. https://en.wikipedia.org/wiki/Human_error_assessment_and_reduction_technique
  2. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  3. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
  4. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  5. https://www.oecd-nea.org/brief/brief-02.html
  6. http://www2.lbl.gov/ehs/training/assets/docs/Error-Precursors.pdf
  7. https://www.linkedin.com/pulse/data-center-human-factor-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  8. https://www.linkedin.com/pulse/human-errors-biggest-challenge-data-center-how-we-can-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F (the previous article in this series)

Data Center: The human factor

The focus of the data center industry has generally followed a pattern of technological innovation: mechanical cooling (e.g. containment), electrical efficiency, new energy sources, and greater integration and management (e.g. the rack as a computer, Open Compute, DCIM, etc.).

One factor that does not stand out, yet affects all of the above, is the human factor, whether in planning, selection and procurement, design, testing and commissioning, or operations.

A poorly designed data center can hum along fine through the hard work and effort of its operations team, while a well-designed and well-implemented data center can suffer multiple unplanned downtime incidents because of one inexperienced or careless member of the operations team.

Separate studies by the Uptime Institute and the Ponemon Institute pegged the share of data center outages attributable to human error at between 22% and 70%. Even the lower figure of 22% is significant.

I have personally experienced the consequences of several outages caused by human error, and the post-incident reviews showed that every one of them could have been avoided. Even problems caused by design lapses can be mitigated through third-party review and execution of a mitigation plan. Design limits may also be exceeded, or equipment may deteriorate, through prolonged maintenance windows or environmental factors (Singapore's high humidity is much tougher on equipment), and so forth.

A well-run facility, after several years of stringent adherence to operational discipline, is like a well-run military camp: everything is labelled and no wire is left dangling. Every member of the operations staff knows their job when asked, such as which operations manual is the correct one, and where it is kept, for performing maintenance on a diesel generator. You know you will have a problem when a staff member needs to ask someone else; if that happens at night while he is on duty, guess how long it takes him to call a colleague or a superior to react to an issue.

When a serious data center incident is traced to human error, the error may have taken years to manifest, yet the person blamed is the current manager in charge. The hidden problem may have been made worse through years of being ignored.

There was a job opening at a Singapore data center complex that stayed vacant for more than two years with hardly any applicants, because the site was known to suffer power outages due to insufficient capacity. No one wanted to get into the hot seat. Eventually the entire building will be upgraded, with all existing tenants moving out to facilitate the upgrading works.

When I do a data center operations audit, one area I pay particular attention to is the organization chart and the authority given to the data center manager. In one instance, there was no well-defined organization chart, and a crucial post was vacant and merely "covered" by a subordinate; I highlighted in the report that this was a critical gap that needed to be addressed right away.

While it costs a great deal to build a data center, and it is technically and financially challenging to upgrade any of its components, it is worthwhile to invest in the people running and managing it. An annual training plan should be drafted and regularly reviewed. Tabletop exercises should be planned, executed and reviewed. Regular data center operations meetings and sharing sessions should be held to surface potential problems and solutions. One mark of a well-run data center is how it deals with near misses: the operations team reviews the reasons for the near miss (regardless of whether it relates to safety or to operational impact) and creates measures to mitigate and reduce the risks that led to it. In one incident, a data center power outage was caused by a general cleaner wiping the anti-static mat in the low-voltage switchboard room with a wet mop; the facility engineer, doing his inspection rounds in worn-out shoes, slipped, grabbed onto a power switch and turned it off. The incident could have been avoided in many ways, which I will leave to the reader to work out.

One thing that would help, but is generally not happening in Asia, is the sharing of incidents so that others can avoid repeating them. I agree with Ed Ansett's comment (see reference 4) that problems can be avoided if problems already encountered elsewhere are shared rather than kept secret. We cannot afford not to, given that outages have become more expensive and more impactful.

Training and development in data center fundamentals and operations is one area where I see greater investment, and I expect the trend to continue upward, given that data centers are becoming more mission-critical than ever for enterprises large and small.

This area warrants greater thought and much more work to enhance availability, prevent outages and improve recovery time. The people on data center operations teams will appreciate being recognized for their hard work, and they will do their jobs better for it.

References:

  1. http://searchdatacenter.techtarget.com/feature/The-causes-and-costs-of-data-center-system-downtime-Advisory-Board-QA
  2. https://gcn.com/Articles/2016/02/09/data-center-outages.aspx
  3. http://money.cnn.com/2016/08/08/technology/delta-airline-computer-failure/
  4. http://www.datacenterdynamics.com/content-tracks/security-risk/dcd-se-asia-many-data-center-failures-are-due-to-secrecy/94803.fullarticle
  5. http://www.enterpriseinnovation.net/article/lack-skills-infrastructure-hamper-govts-use-technology-southeast-asia-872772562