HUMAN ERROR: The Biggest Challenge to Data Center Availability and how we can mitigate it – Part 2

[Image: IT engineer with server]

The previous article on this topic can be found via the link in the references below (reference 8).

The layered approach to maintaining data center infrastructure availability should not resemble Swiss cheese: a hazard or trigger should eventually be stopped by one of the layers, and preferably as early as possible.

[Image: Swiss cheese model]

The layers should include the following:

  • Design (in accordance with the owner's design intent), with either a concurrent maintainability or fault tolerance objective
  • Implementation (in accordance with the design brief), fully tested through a comprehensive testing and commissioning phase before handover, with fully documented SOPs/MOPs/EOPs
  • Maintenance and operations management, with work by equipment service providers or any other on-site work governed by a Method of Statement (MOS) and Risk Assessment (RA) matrix prepared by suitably qualified persons
  • Incident and problem management, escalation management and mitigation processes

and so forth.

Inadequacy in each of these layers can result in problems such as:

  • Inherent design / setting flaw
    • Outdated / Swiss cheese situation
    • Requires analysis and manual intervention
    • Error Producing Conditions (EPC)
  • Weakness in manual processes
    • Inadequate automation
    • Inadequate training / familiarity
    • Inadequate operations procedures
  • Insufficient information / knowledge
    • Capacity limit reached earlier than the design intent
    • Inadequate training / knowledge
    • Inadequate documentation
  • Insufficient risk assessment
    • MOS / RA, risk matrix
    • Vendor experience

 

Learn from other industries

Our data center industry is relatively young. Other industries with mission-critical infrastructure have undergone extensive research and iterative improvement, and we can learn from and adopt their practices.

  • Airline Crew Resource Management
    • Checklists and cross-checking by pilot and co-pilot of the aircraft's airworthiness
    • Structured communication within the cockpit, and between cabin crew and the cockpit, to ensure timely and prioritized responses
  • US Nuclear Regulatory Commission
    • The Standardized Plant Analysis Risk – Human Reliability Analysis (SPAR-H) method to account for the potential for human error
  • OECD’s Nuclear Energy Agency
    • Ways to avoid human error, e.g.:
      • designing systems to limit the need for human intervention;
      • distinctive and consistent labelling of equipment, control panels and documents;
      • displaying information on the state of the plant so that operators do not need to guess and risk a faulty diagnosis;
      • designing systems to give unambiguous responses to operator actions so that incorrect actions can be easily identified; and
      • training operators better for plant emergencies, including the use of simulators.

 

 

Error Reduction Strategies and Error Precursors

In addition, error-reducing strategies can be applied across all areas of data center maintenance and operations management to reduce the probability of human error. Whether designing the data center power and cooling infrastructure or assessing the risk of a particular maintenance activity (e.g. a power switch-over exercise to perform UPS or back-up generator maintenance), the strategies below should be applied.

Take, for example, the AWS US-East-1 S3 outage (http://mashable.com/2017/03/02/what-caused-amazon-aws-s3-outage/): the command set was powerful, and a typo took down far more servers, far faster, than intended. In their post-incident summary (https://aws.amazon.com/message/41926/), AWS said they would limit the speed and scope of the tool's effect, i.e. put in safety checks, which is essentially an application of the constraint strategy.
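As a minimal, hypothetical sketch of the constraint strategy (this is not AWS's actual tooling; the function, thresholds and numbers are invented for illustration), a capacity-removal command can refuse to act on more than a small fraction of the fleet in one step:

```python
# Hypothetical sketch of the "constraint" strategy: cap the blast radius of a
# capacity-removal command and always leave a minimum safe floor of servers.

MAX_FRACTION_PER_COMMAND = 0.05   # never act on more than 5% of the fleet at once
MIN_REMAINING_SERVERS = 3         # always leave a minimum number of servers running

def remove_capacity(fleet_size: int, requested: int) -> int:
    """Return the number of servers the tool is actually allowed to remove."""
    if requested <= 0:
        return 0
    hard_cap = int(fleet_size * MAX_FRACTION_PER_COMMAND)
    floor_cap = max(fleet_size - MIN_REMAINING_SERVERS, 0)
    allowed = min(requested, hard_cap, floor_cap)
    if requested > allowed:
        # The typo scenario: the operator asked for far more than intended.
        raise ValueError(
            f"Requested {requested} servers but the safety check caps this "
            f"command at {allowed}; split the change into smaller, slower steps."
        )
    return allowed

print(remove_capacity(fleet_size=1000, requested=50))    # 50 -> within the cap
# remove_capacity(fleet_size=1000, requested=500)        # typo -> raises ValueError
```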

[Image: error reduction strategies]

When a service or repair task is assigned to operations staff, or to qualified technicians from an equipment service provider, evaluating whether error precursors exist and eliminating those precursors will reduce the likelihood of human error. For example, the combination of time pressure, an inexperienced staff member already at the end of a long work shift, and an ambiguous task objective all contribute to a higher risk for the assigned task. Eliminating or reducing those precursors, and re-directing the task to an experienced staff member at the start of a work shift with a clear task objective, will reduce the risk of the assigned task.
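One way to make that evaluation repeatable is a simple pre-task scoring checklist. The sketch below is illustrative only; the precursor list, weights and threshold are assumptions, not an industry standard:

```python
# Illustrative pre-task check: score known error precursors and flag the task
# for mitigation if the combined score is too high. Weights are assumptions.

PRECURSOR_WEIGHTS = {
    "time_pressure": 3,
    "inexperienced_staff": 3,
    "end_of_long_shift": 2,
    "ambiguous_task_objective": 3,
    "first_time_task": 2,
    "distracting_environment": 1,
}

RISK_THRESHOLD = 5  # above this, mitigate precursors before releasing the task

def assess_task(precursors_present: set[str]) -> tuple[int, str]:
    score = sum(PRECURSOR_WEIGHTS.get(p, 0) for p in precursors_present)
    if score > RISK_THRESHOLD:
        return score, "HOLD: eliminate or reduce precursors before proceeding"
    return score, "PROCEED with standard supervision"

# The high-risk combination described above:
print(assess_task({"time_pressure", "inexperienced_staff",
                   "end_of_long_shift", "ambiguous_task_objective"}))
# After re-directing to an experienced person at start of shift with a clear brief:
print(assess_task({"time_pressure"}))
```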

[Image: error precursors]

Risk Mitigation is a Continuous Process

A multi-pronged, multi-layered approach with attention to detail is required to mitigate the risk of human error causing an outage in a data center facility.

[Image: risk mitigation process flow]

 

A data center should be designed and implemented to a clear and tested design intent (e.g. the objective of being concurrently maintainable). Day in and day out, operations staff, vendors and client personnel interact with the systems within the data center, so there needs to be a well-oiled system in place, not just documentation, that works 24×7 for as long as the data center exists.

An iterative risk mitigation system, consistent management support and attention, and knowledge learned from near misses and incidents are key attributes of an environment that is resilient with respect to the human factor.

We humans can reduce human error, but effort is required

We should look at the data center organization, especially the operations team: its resources and tools, its capabilities, and so forth. A no-blame culture that encourages active participation by all staff in addressing potential weaknesses and error precursors, and that treats near misses as a sign of error-inducing conditions, is important to mitigating the effects of human error. We should get away from finger-pointing and instead learn from past problems, as AWS did with their incidents. Our data center industry can also do more to share and learn from one another, to prevent recurrence of issues that have already been faced and dealt with elsewhere.

This built-up knowledge of good practices should be documented and disseminated, with management support. The weakest link is an inexperienced staff member hesitating or, worse, making a wrong decision, so training everyone on the operations team is critical to maintaining data center availability.

A periodic (for example, annual) no-nonsense third party review of data center operations and management, coupled with improvement plans to strengthen the weakest links, will give insight and assurance to C-level executives, data center operations managers and clients. Most operations managers are too busy to review their own operations, are in the difficult position of having to find their own faults, and may have limited experience if their staff have not worked in more than one or two data center sites. A third party operations and management review is therefore the next best way to enhance resilience against human error, provided it has full cooperation from the top to the bottom of the data center staff.

Furthermore, once a data center service provider has grown beyond two or three data centers, it becomes difficult to manage operations consistently across them, especially if they are managed independently. A third party review applied to all of them will help to rein in inconsistent operations processes, provided of course that the service provider has a central data center operations programme function.

Ultimately, therefore, a data center facility depends on well trained and knowledgeable staff who are clear about their facility's information, or know where to quickly find the documentation that contains the details, and who properly carry out the risk assessment work of evaluating equipment service vendors and upgrade works.

In summary,

  • It is worthwhile to commit resources to reduce errors
  • We can improve our resiliency, and thereby uptime, through available methods and tools
  • There are proven methods and tools we can borrow from other mission critical environments
  • A third party data center operations and management review, coupled with an improvement plan, should be considered for large data center operations, especially those with multiple sites

 

References:

  1. https://en.wikipedia.org/wiki/Human_error_assessment_and_reduction_technique
  2. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  3. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
  4. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  5. https://www.oecd-nea.org/brief/brief-02.html
  6. http://www2.lbl.gov/ehs/training/assets/docs/Error-Precursors.pdf
  7. https://www.linkedin.com/pulse/data-center-human-factor-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  8. https://www.linkedin.com/pulse/human-errors-biggest-challenge-data-center-how-we-can-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

2+2000+3000 = 1 big challenge

[Image: thousands of applications in a bank]

You read it right.

2 DCs + 2,000 servers with 3,000 applications are going into a new data center in three years' time, and the man or woman to do the job has yet to be found.

I have had the unfortunate experience of a network interruption that caused slow and unacceptable access to hundreds (my estimate at the time) of online applications in a large enterprise data center. Then again, we were an internal shared co-location data center, so we only counted projects and departments, never the number of application systems.

 

The picture above shows an advertisement for a data center migration project manager. It has been put up repeatedly for more than six months, and I have seen the indicated salary range go up (from 10k SGD per month to now 15k SGD). The scale and complexity are now fully spelled out: the earlier advert did not indicate that two data centers are moving into a new one, with 2,000 servers and 3,000 application systems, all to be moved in (I presume in phases; it would be nearly mission impossible all at once) by 2020. Don't forget that system changes and new systems and services will not stay frozen during this period, and new systems may still be added to the current data centers given the inter-dependencies of systems and data required to introduce them. Hopefully no IP address change is required for any of the systems, and that is only one of many things to consider for such a move.

I did one data center move in the mid-1990s, when the enterprise had a single mini-computer system as its one and only mission-critical system. We exceeded the planned downtime window of 24 hours by an additional 30 hours because the new site's telecommunication lines were digital while the old site's were analog; our analog modems could not work, and we had to bring in new ones while the migration took place on a Sunday, when the vendor's warehouse was closed.

On occasion, when talking to data center facility owners, salespeople and fellow consultants about the mission-critical nature of IT for most of today's enterprises, I have mentioned that hundreds of applications are in use on average by medium to large enterprises.


3,000 applications is a big number. I hope only 10% of them are mission critical, and that the entire application portfolio has been prioritized with inter-dependencies already mapped out.
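To illustrate what "inter-dependencies mapped out" can look like in practice, here is a minimal, hypothetical sketch (the application names and dependencies are invented) that groups applications into migration waves so nothing moves before the systems it depends on:

```python
# Hypothetical sketch: group applications into migration waves so that an
# application only moves after everything it depends on has moved.
from graphlib import TopologicalSorter

# app -> set of apps it depends on (invented example data)
dependencies = {
    "internet-banking": {"core-banking", "auth-service"},
    "core-banking": {"customer-db"},
    "auth-service": {"customer-db"},
    "customer-db": set(),
    "reporting": {"core-banking"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

wave = 1
while sorter.is_active():
    ready = sorter.get_ready()   # apps whose dependencies have all been migrated
    print(f"Wave {wave}: {sorted(ready)}")
    sorter.done(*ready)
    wave += 1
# Wave 1: ['customer-db']
# Wave 2: ['auth-service', 'core-banking']
# Wave 3: ['internet-banking', 'reporting']
```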

When a bank's data center runs into a problem (see the reference section below), what we see externally is ATMs being down, counter staff switching over to backup systems, and service becoming slower. What really happens behind the scenes involves a lot more effort to bring the critical applications back into service.

What makes me wonder, though: shouldn't the bank have identified such a role and brought the person in earlier in the decision-making process for the new data center? Having this person, or better yet a team of data center migration experts, on board early would be better in all sorts of ways than bringing in someone from outside late, not least for retaining knowledge and thereby mitigating migration risks. I am pretty sure the local financial regulator will dive in to audit and assess the bank's migration plan.

Anyway, I have learnt that such numbers (thousands of servers and applications) are probably typical for a Singapore bank.

Best of luck to their data center migration.

Reference:

  1. http://www.datacenterknowledge.com/archives/2013/12/16/year-downtime-top-10-outages-2013/
  2. https://www.theregister.co.uk/2017/01/13/lloyds_bank_in_talks_to_outsource_bit_barns_to_ibm/
  3. https://www.linkedin.com/pulse/20140616192008-655694-best-practices-for-data-center-migration

Can you help us build a tier 5 data center?

[Image: data center build photo]

A data center consultant, K, told me this story. Around 2005 or 2006 he gave a talk at a data center conference in a famous financial and resort city somewhere in Asia. A gentleman, J, walked up to K afterwards and introduced himself as a property developer looking into building a new data center. K's talk had covered data center standards, including the Uptime Institute Tiers and TIA-942, and J said he wanted to build a Tier 5 data center.

As an aside, I will defer to other posts and websites on the design standards and the Tier/Rated/Facility Class levels (see references 1 and 2). Generally speaking, most define data center design in terms of required resiliency, up to four levels.

K was taken aback and asked whether J was aware that the Tier levels top out at IV / 4. J said he knew, and that he wanted to go one better than Tier IV / 4. J shared that since the city where he planned to build had no standalone data center facility yet, he wanted to stand out; the city is well known for its extravagant hotels, malls and the like.

The idea that "build it and they will come"

K was kind enough to ask J whether he had done a market study and knew whether potential clients demanded a highly resilient, fault tolerant data center. J replied that he had not, but he believed demand would rush in once it was announced that such a data center would be built. Well, maybe, if you have done your study and know where the competition stands, for starters. But if you have not studied market demand and competition at all, then what you build may be over-built, or so far ahead of demand that it takes far longer than your optimistic timeframe to sell.

I have on multiple occasions met potential data center owners considering building their first data center in non-first-tier data center markets in Asia. Surprisingly, a common central theme of their plans hinges on the "build it and they will come" mindset. Today, several Asian cities are in over-supply, not only in the residential and industrial sectors but also in the data center sub-sector, and over-confidence that demand will come once supply is there is one contributor to the situation. A data center facility is a huge investment. One Chinese data center company I know had a well sought-after facility in Beijing, but expanded into other cities it was less familiar with and suffered losses for years; this dragged down its overall finances, and it was forced to sell its crown jewel under less-than-preferred circumstances and numbers.

Client needs and supply / demand

I have two points to make. Firstly, know your market, your competition and your financial strength. If all your competitors in the market are building for shared-hosting-type clients who only demand a UPS-backed electrical supply to their IT servers, then building to a higher level of resiliency makes your data center space pricier and it will take longer to fill up, if it ever does. There were a few such cases in Singapore: some folded after building a data center, and some spent millions of dollars on projects that could not take off and are now in limbo. Many such cases also exist in China. One case in Singapore prevailed: they built their data center during the dot-com boom and were caught in the dot-com bust, which claimed several casualties, but this data center managed to survive by building up its facility floor by floor, unlike the other two, thus placing less strain on its finances during that period.

More prudent to match cost outlay to take-up

Secondly, the main technical infrastructure design parameter, whether to build to concurrent maintainability (roughly equivalent to Tier III / Rated 3 / Facility Class 3) or to fault tolerance (Tier IV / Rated 4 / Facility Class 4), depends on client demand. If the target clientele are financial institutions, or organizations that for various reasons rely on IT systems that can only run on a single host or in an active-passive set-up (airline ticket reservation systems seem to be like that), then it makes sense. Another way is to plan for multiple levels of resiliency, i.e. share the same fault tolerant electrical infrastructure but stay flexible enough to accommodate either concurrently maintainable or fault tolerant client demand (although this is generally slightly more costly than a design implemented purely for concurrent maintainability).

Fortunately, these days there is so much information in the market that new owners-to-be are better informed. My other gripe is with those who know a little about one particular data center topic and yet are so convinced of it that meaningful exchange becomes impossible, but that is another story for a future post.

Reference:

  1. http://www.datacenterknowledge.com/archives/2016/01/06/data-center-design-which-standards-to-follow/
  2. https://uptimeinstitute.com/tiers
  3. https://www.linkedin.com/pulse/data-center-tiers-tears-plus-minus-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  4. https://www.linkedin.com/pulse/making-sense-data-center-standards-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

Hyperscale, 3rd party colocation service providers and the enterprise data center


Published 22 January 2017

Last November, I attended the DataCenterDynamics Zettastructure conference in London. There were a number of workshops on the Open Compute Project (OCP), and one particular topic stood out: how OCP will impact third party colocation players in Europe. To me, by extension, the same issue is faced by data centers in Asia when considering OCP-type racks.

The OCP website says “The Open Compute Project is … more efficient, flexible and scalable”. The question is: to whom? At the moment, OCP designs are meant for hyperscale data centers, i.e. those used by Facebook, Yahoo!, Microsoft and the like.

One benefit cited by OCP vendors is the speed of implementing compute/storage capacity: the capacity arrives on site ready to plug in. There should not be any rack-on/rack-off work needed other than plugging in the power.

In the United States, Facebook, Yahoo! and Microsoft have large facilities (whether first party or third party custom-built sites) designed and built for hyperscale deployment, and these sites accommodate OCP racks without major issues.

The thing is, most sites in the rest of the world are not planned, designed or implemented to accommodate thousands of OCP racks. In the workshop I participated in, colocation service providers asked the OCP data center project members about the average power draw of a typical OCP rack, so that their private suites or colocation halls could accommodate some limited quantity of OCP racks.

When I talked to data center engineers from the Baidu-Alibaba-Tencent trio, they said their Project Scorpio (now the Open Data Center Committee, ODCC) racks are designed to fit into the top few data center facilities in first and second tier Chinese cities, with on average a 7kW cap on per-rack power capacity when going into third party colocation facilities. This philosophy means their asset-light deployment of Scorpio racks with hot/cold aisle containment can proceed as planned in nearly every city where they want to deploy compute/storage capacity.

The other issue with OCP / ODCC racks is that they are mainly designed for hyperscale use, so the largest users of IT hardware, the enterprises, are so to speak “missing out” on the benefits of quick deployment of IT capacity. Data centers in Asia, whether colocation space or enterprise data centers and computer rooms, are mostly provisioned at around 5 to 6kW per rack (references 4, 5 and 6).

Whether it is Baidu-Alibaba-Tencent or Facebook, Yahoo! and Microsoft, these OCP / ODCC racks will not benefit the enterprises unless they accommodate the demands of the enterprise data center. Currently, the enterprise IT side does not see much benefit in OCP / ODCC, because they do not look at their compute/storage needs on the scale of the current OCP / ODCC clients. However, I believe this will change. Enterprise IT talks about software and application deployment first and compute/storage last, and this creates pressure on the data center folks to quickly get space and racks ready while the IT capacity folks procure servers, storage and network gear to add to the current pool. Until the OCP / ODCC vendors think in terms of the way enterprise IT works, which I predict they will, the enterprise data center market will not warm up to the OCP / ODCC vendors.

This is where I think the OCP vendors will not limit their offerings to the Internet giants. They will need to design their hardware with the enterprise market in mind, because it is much larger than the Internet giants: for example, designing their racks (including compute/storage/network gear) in stepped loads of, say, 6, 8 or 10 kW, and thinking in terms of how enterprise IT will use them, i.e. on a per-rack, per-project or per-enterprise-private-cloud basis. A new OCP vendor I spoke to in London said that given the competition and the limited customer pool of hyperscale data centers, they want to sell to the enterprises. Sooner or later, we will see OCP / ODCC racks designed for deployment by enterprises into enterprise data centers as well as third party colocation data centers.

 

Reference:

  1. http://www.opencompute.org/about/
  2. http://www.opendatacenter.cn/
  3. http://searchdatacenter.techtarget.com/feature/Hyperscale-data-center-means-different-hardware-needs-roles-for-IT
  4. http://www.datacenterdynamics.com/content-tracks/power-cooling/watts-up/94463.fullarticle
  5. http://ww2.frost.com/news/press-releases/australian-data-centre-market-offers-sizeable-growth-opportunities-says-frost-sullivan/
  6. http://asia.colt.net/services/data-centre/about-colt-data-centres/tdc2/

 

 


Data Center Tiers, No Tears, No Plus or Minus

[Image: data center tiers]

Background

Press releases, promotional material and websites of some data center service providers often carry terms like Tier 3+, Tier 4- or Tier 3.5. This is intended to give the reader the impression that the facility has a higher level of resiliency in terms of design or implementation.

What’s in a Tier/Rated/Facility-Class

The Tier Classification System is trademarked by the Uptime Institute (UTI). In a nutshell, UTI will assess and award the appropriate Tier level when a data center facility owner or private data center client engages it to perform such an evaluation. UTI issues Tier levels in Roman numerals: I/II/III/IV. https://journal.uptimeinstitute.com/explaining-uptime-institutes-tier-classification-system/

The Telecommunications Industry Association (TIA), an American organization that issues telecommunications cabling and facility standards, published ANSI/TIA-942-A, titled “Telecommunications Infrastructure Standard for Data Centers”. The latest 2014 edition contains three informative annexes (D, E, F) on data center space considerations, site selection and building design considerations, and data center infrastructure rating. Using the informative annexes of TIA-942-A, a data center facility can be rated across four categories (Telecommunications, Architectural and Structural, Electrical, and Mechanical) as Rated 1 – Basic, 2 – Redundant Component, 3 – Concurrently Maintainable, or 4 – Fault Tolerant.

The EN 50600 standard classifies a data center in a similar manner to TIA-942-A, but adds a Facility Class 0 (FC-0), while FC-1 through FC-4 are essentially the same as TIA-942-A's Rated 1 through 4. FC-0 is basically a computer room with servers connected directly to utility power, without backup power.

Plus? Minus? 3.9?

None of the abovementioned standards mentions a +/- modifier to any rating or classification. None of them gives room for partial, fractional, or +/- rating modifiers, and neither does UTI for its Tier awards. A data center can therefore only be awarded a certification stating Tier III, Rated 3, or Facility Class 3, not 3.5, 3+ or 4-.

Dig Deeper Below that Claimed Rating

If a particular data center facility announces that it has a Tier 3+ facility, check whether the rating was issued by any competent third party or technical audit firm. No competent third party or technical audit firm should issue such a non-standard rating.

Such Tier 3+ or Tier 4- labels are self-proclaimed ratings, an effort by the data center facility to signal that it has features better than Tier 3 or just a tad below Tier 4, without a competent third party having evaluated whether the facility meets, say, Rated 3 in the Electrical and Mechanical categories in the first place.

If a particular data center facility is evaluated by a third party to be Tier 4 in the Electrical category and Tier 3 in the Mechanical category, then it is given the lowest common rating, i.e. Tier 3.
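As a trivial illustration of this lowest-common-rating rule (the category values below are invented):

```python
# The overall rating is the minimum across the categories (illustrative values).
category_ratings = {
    "Telecommunications": 4,
    "Architectural and Structural": 4,
    "Electrical": 4,
    "Mechanical": 3,
}
overall = min(category_ratings.values())
print(f"Overall rating: Rated {overall}")   # Rated 3, despite three Rated-4 categories
```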

What should a potential data center client do?

If the Tier level is self-proclaimed, without the words “certified by”, or phrased along the lines of “our latest data center is designed to Tier 3+ resiliency”, then it is most likely not certified by any third party, and the potential data center client should insist on a competent technical third party evaluating the technical attributes of the data center before considering colocating their IT equipment there.

  • Ask the site to substantiate the self-proclaimed rating through a third party

We should simply disregard the +, and mentally de-rate the – or decimal claims: if we see a Tier 4- or a Tier 3.5, we should consider such a facility to be designed to Tier 3. If we decide to consider such a facility, we should engage a competent technical third party, or better yet insist that the facility owner engages a third party and bears the cost.

The data center facility may dangle the Tier 3+ as an indirect indication that its site is of high quality, implying that a higher premium is justified. However, the potential client should have a site selection process with clear requirements for a data center facility, and should not attach any score to the site unless it is justifiable through third party certification. Having a certification should be viewed as a hygiene factor. The evaluation criteria should request data on technical, business/financial and operations attributes, allowing normalization and comparison across the shortlisted sites.

  • Tier level and suitability to the client's business IT needs

A data center's main function is to house IT equipment. Some of that equipment requires fault tolerant power and cooling support, while a test environment can take a rung or two lower in power and cooling resiliency. A facility that lets you house critical production IT equipment in a private suite with a Tier 4 set-up, and a small suite or even a cage in a shared Tier 3 colocation hall, is therefore more suitable, giving rise to a combined set-up that meets the business need at the best bang for the buck. This is also called a multi-tier or flexible-tier set-up. Not every data center facility can meet this need, or the cost is higher, because the base set-up of that facility would require heavy re-work compared to one that is ready from day one to be flexible in this respect.

  • Do not over-rely on the Tier level rating

A Tier 4 data center facility does not mean no downtime. It is fault tolerant, but trouble rarely comes just once; it may come twice or thrice. And it does not take a power or cooling issue to bring down a critical IT system within a data center. Humans can cause problems. In the July 2016 incident in which the Singapore Stock Exchange's trading system was unavailable for more than five hours, it was a hard disk failure that dragged down the entire trading system. A distributed denial of service attack or a telecommunications problem can also bring down IT.

  • Evaluate using a comprehensive set of evaluation criteria

The potential data center client should look beyond the rating level: whether a facility is designed, implemented and certified to a given rating is just one facet of its suitability for the client's IT needs. A multitude of other factors, including telecommunications facilities, the facility operations system and the competencies of the facility's people, among others, count towards resilient IT operations in a data center.

  • 24×7 on-the-ball operations, and watch that capacity

Sometimes the effective Tier rating drops as the designed capacity is breached: N+1 suddenly becomes N and the site loses its redundancy. A concurrently maintainable or fault tolerant electrical design does mean that when 1N of the 2N UPS is taken offline for servicing, the planning and execution of that maintenance must follow proper procedures (SOP, MOP, MOS) and backup or roll-back plans (RA). You want to minimize the risk, and the risk window, of a UPS problem while the other set of UPS is offline for maintenance. You should also not allow UPS maintenance and backup generator maintenance to take place at the same time, because this compounds the risk: if the remaining 1N UPS fails while the generators are in manual mode, you will be forced to rely on utility supply alone. The maintenance should take place during non-operational hours. All these things come into play, and vendor experience is very important.
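As a minimal sketch of that scheduling rule, here is an illustrative check (the window model, system names and rule are assumptions for illustration, not a standard or a real tool) that rejects plans where UPS and generator maintenance windows overlap:

```python
# Illustrative check: reject maintenance plans that take a UPS set and the
# backup generators out of service in overlapping windows.
from dataclasses import dataclass

@dataclass
class Window:
    system: str   # e.g. "UPS-A", "UPS-B", "GEN"
    start: int    # hour of day, 0-23 (simplified)
    end: int

def overlaps(a: Window, b: Window) -> bool:
    return a.start < b.end and b.start < a.end

def validate_plan(windows: list[Window]) -> list[str]:
    issues = []
    for i, w1 in enumerate(windows):
        for w2 in windows[i + 1:]:
            involves_ups = {w1.system, w2.system} & {"UPS-A", "UPS-B"}
            involves_gen = "GEN" in {w1.system, w2.system}
            if overlaps(w1, w2) and involves_ups and involves_gen:
                issues.append(f"{w1.system} and {w2.system} overlap: "
                              "UPS and generator maintenance must not coincide")
    return issues

plan = [Window("UPS-A", 9, 13), Window("GEN", 11, 15)]
print(validate_plan(plan))   # flags the overlap; reschedule GEN after 13:00
```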

Reference:

  1. https://journal.uptimeinstitute.com/explaining-uptime-institutes-tier-classification-system/
  2. http://www.tia-942.org/content/162/289/About_Data_Centers
  3. http://www.computerweekly.com/tip/Four-data-center-tier-classification-misconceptions-demystified
  4. http://searchdatacenter.techtarget.com/feature/What-colocation-customers-should-know-about-data-center-tiers
  5. https://www.linkedin.com/pulse/sharing-data-center-site-selection-evaluation-james-soh?trk=mp-author-card