HUMAN ERROR: The Biggest Challenge to Data Center Availability and how we can mitigate it – Part 2

[Image: IT engineer with server]

The previous article on this topic can be found via this link.

The layered approach to maintaining data center infrastructure availability should not look like the Swiss cheese model, where the holes line up; a hazard or trigger should eventually be stopped, and preferably as early as possible.

[Image: Swiss cheese model]

The layers should include the following:

  • Design (in accordance with design intent of owner) with either concurrent maintainability objective or fault tolerance
  • Implementation (in accordance with design brief) and fully tested via comprehensive testing and commissioning phase before handover with fully documented SOPs/MOPs/EOPs
  • Maintenance and operations management, with work by equipment service providers or any other on-site work governed by a Method of Statement and Risk Assessment (MOS-RA) matrix prepared by suitably qualified persons
  • Incident and Problem management process, escalation management and mitigation process

and so forth

Possible problems arising from inadequacy in each of these layers include:

  • Inherent Design / Setting flaw
    • Outdated / swiss cheese situation
    • Requires analysis and manual intervention
    • Error Producing Conditions (EPC)
  • Weakness in manual processes
    • Inadequate automation
    • Inadequate training / familiarity
    • Inadequate operations procedures
  • Insufficient Information / knowledge
    • Capacity limit reached earlier than design intent
    • Inadequate training / knowledge
    • Inadequate documentation
  • Insufficient Risk Assessment
    • MOS / RA, risk matrix
    • Vendor experience

 

Learn from other industries

The data center industry is relatively young; other industries with mission-critical infrastructure have undergone extensive research and iterative enhancement from which we can learn and adopt practices.

  • Airline’s Crew Resource Management
    • Checklists and cross-checking by the pilot and co-pilot of the airplane's airworthiness
    • Communication within the cockpit, and between the cabin crew and the cockpit, to ensure timely and prioritized responses
  • US Nuclear Regulatory Commission
    • Standardized Plant Analysis Risk – Human Reliability Analysis (SPAR-H) method to take account of the potential for human error
  • OECD’s Nuclear Energy Agency
    • Ways to avoid human error, e.g.,
      • systems should be designed to limit the need for human intervention;
      • distinctive and consistent labelling of equipment, control panels and documents;
      • displaying information concerning the state of the plant so that operators do not need to guess and make a faulty diagnosis;
      • designing systems to give unambiguous responses to operator actions so that incorrect actions can be easily identified; and
      • training operators better for plant emergencies, including the use of simulators

 

 

Error Reduction Strategies and Error Precursors

In addition, error-reducing strategies can be applied across all areas of data center maintenance and operations management to reduce the probability of human error. Whether in the design of the data center power and cooling infrastructure, or in assessing the risk of a particular maintenance operation (e.g. a power switch-over exercise to perform UPS or backup generator maintenance), the strategies below should be applied.

Take, for example, the AWS US-East-1 S3 outage (http://mashable.com/2017/03/02/what-caused-amazon-aws-s3-outage/): the command set was powerful, and a typo could take down a lot of servers very quickly. AWS said in their post-incident summary (https://aws.amazon.com/message/41926/) that they would limit the speed and scope of the tool's effect, i.e. put in safety checks, which is essentially an application of the constraint strategy.
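
As a concrete illustration of the constraint strategy (this is a minimal sketch, not AWS's actual tooling; the fleet model, threshold, and function names are hypothetical), a capacity-removal command can be capped so that a typo cannot take out a large share of servers at once:

```python
# Hypothetical illustration of the "constraint" error-reduction strategy:
# cap how much capacity a single operator command may remove, so that a
# typo cannot take down a large set of servers in one go.

MAX_REMOVAL_FRACTION = 0.05   # never remove more than 5% of the fleet at once
MIN_ACTIVE_SERVERS = 10       # never drop below the minimum needed to serve load


def remove_capacity(active_servers, requested_removal):
    """Return the number of servers actually allowed to be removed."""
    if requested_removal <= 0:
        return 0
    allowed = min(requested_removal,
                  int(active_servers * MAX_REMOVAL_FRACTION),
                  max(active_servers - MIN_ACTIVE_SERVERS, 0))
    if allowed < requested_removal:
        print(f"Safety check: removal of {requested_removal} servers capped at "
              f"{allowed}; anything more requires explicit change approval.")
    return allowed


if __name__ == "__main__":
    # A fat-fingered "500" instead of "5" is constrained to a safe amount.
    print(remove_capacity(active_servers=1000, requested_removal=500))  # -> 50
```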

[Image: Error reduction strategies]

When a service or repair task is assigned to operations staff, or to a qualified technician from an equipment service provider, evaluating whether error precursors exist and eliminating them will reduce the likelihood of human error. For example, the combination of time pressure, an inexperienced staff member already at the end of a long work shift, and an ambiguous task objective all contribute to a higher risk for the assigned task. Eliminating or reducing these precursors, and re-directing the task to an experienced staff member at the start of a work shift with a clear task objective, will reduce the risk.
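
As a rough sketch of such an evaluation (the precursor list, weights, and threshold below are illustrative assumptions, not a validated model), a task assignment could be screened like this:

```python
# Hypothetical error-precursor screen for an assigned task. The precursor
# list, weights, and threshold are illustrative only; a real programme would
# use a validated list such as the error-precursor reference cited at the
# end of this article.

PRECURSOR_WEIGHTS = {
    "time_pressure": 3,
    "inexperienced_staff": 3,
    "end_of_long_shift": 2,
    "ambiguous_task_objective": 3,
    "first_time_task": 2,
    "distractions_on_site": 1,
}

REVIEW_THRESHOLD = 5  # above this score, re-plan before starting the task


def assess_task(precursors_present):
    """Return (score, needs_replanning) for a set of observed precursors."""
    score = sum(PRECURSOR_WEIGHTS.get(p, 0) for p in precursors_present)
    return score, score > REVIEW_THRESHOLD


if __name__ == "__main__":
    score, replan = assess_task({"time_pressure", "inexperienced_staff",
                                 "end_of_long_shift", "ambiguous_task_objective"})
    print(score, replan)  # 11 True: reassign, clarify the objective, reschedule
```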

[Image: Error precursors]

Risk Mitigation is a Continuous Process

A multi-pronged, multi-layered approach with attention to detail is required to mitigate the risk of human error causing an outage in a data center facility.

[Image: Risk mitigation process flow]

 

A data center should be designed and implemented to a set of clear and tested design intents (e.g. the objective of the data center being concurrently maintainable). Day in and day out, operations staff, vendors, and client personnel interact with the systems within the data center, so there needs to be a well-oiled system in place, not just documentation, that works 24×7 for as long as the data center is in existence.

An iterative risk mitigation system, relying on consistent management support and attention, with knowledge learned from near misses and incidents, is a key attribute of an environment that is resilient in terms of the human aspect.

We Humans Can Reduce Human Error, but Effort Is Required

We should look at the data center organization, especially the operations team, its resources and tools, its capabilities, and so forth. A no-blame culture that encourages active participation by all staff in addressing potential weaknesses or error precursors, and that treats near misses as a sign of error-inducing conditions, is important in mitigating the effects of human error. We should get away from pointing fingers and instead learn from past problems, as AWS did with their incidents. Our data center industry can also do more to share and learn from one another, to prevent the recurrence of issues that were faced and dealt with elsewhere.

This built-up knowledge of good practices should be documented and disseminated, with management support. The weakest link is an inexperienced staff member hesitating or, worse, making a wrong decision, so training everyone on the operations team is critical to maintaining data center availability.

A periodic (for example, annual) no-nonsense third-party review of data center operations and management, coupled with improvement plans to strengthen the weakest links, will give insight and assurance to data center C-level executives, data center operations managers, and clients. Most operations managers are too busy to review their own data center operations; add the difficulty of finding one's own faults and the limited experience of staff who have worked in only one or two data center sites, and a third-party operations and management review becomes the next best way to enhance resilience against human error, provided it has full co-operation from the top to the bottom of the data center staff.

Furthermore, once a data center service provider has grown beyond two or three data centers, it becomes difficult to manage operations consistently across them, especially if they are managed independently. A third-party review applied to all of them will help rein in inconsistent operations processes, subject, of course, to having a central data center operations programme function within the data center service provider.

Ultimately, a data center facility depends on well-trained and knowledgeable staff who are clear about their facility's information, or who know where to quickly find the documentation containing the detailed information, and who properly carry out the risk assessment work of evaluating equipment service vendors and upgrade works.

In summary,

  • It is worthwhile to commit resources to reduce errors
  • We can improve our resiliency, and thereby uptime, through available methods and tools
  • There are proven methods and tools we can borrow from other mission-critical environments
  • A third-party data center operations and management review, coupled with an improvement plan, should be considered for large data center operations, especially those with multiple sites

 

References:

  1. https://en.wikipedia.org/wiki/Human_error_assessment_and_reduction_technique
  2. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  3. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
  4. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  5. https://www.oecd-nea.org/brief/brief-02.html
  6. http://www2.lbl.gov/ehs/training/assets/docs/Error-Precursors.pdf
  7. https://www.linkedin.com/pulse/data-center-human-factor-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  8. https://www.linkedin.com/pulse/human-errors-biggest-challenge-data-center-how-we-can-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

The problem with the (use of) PUE in the Data Center industry

[Image: Montage of data centers]

I mentioned the reporting and use of PUE, including the terms iPUE, PUE3, dPUE, etc., in a previous post (https://www.linkedin.com/pulse/data-center-resource-efficiency-pue-isoiec-30134-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F).

It is like how Tier / Facility Class / Rated levels are mentioned fuzzily in the industry, without making clear which standard a facility is designed to, or whether it is certified at all; the confusion helps neither potential clients nor the industry as a whole. To clarify, I take no stand against any data center saying its facility is designed in accordance with a particular standard, given that any potential client should and will make a detailed review and audit of the facility before committing to a co-location deal.

The issue I would like to highlight in this post is the use of designed PUE (dPUE), rather than PUE, to market a facility or even to set policy. dPUE is itself an estimate (as per the example case in ISO/IEC 30134-2) and imprecise. The actual PUE3 versus the dPUE can show a huge gap, given that the IT load of any new data center facility will normally not ramp up to near 100% for some time.
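
A simplified, hypothetical calculation illustrates the gap: a sizeable share of facility overhead (standing losses in transformers, UPS and chillers, lighting, controls) is roughly fixed, so at low IT utilization the measured PUE sits well above the dPUE quoted at full design load. The figures below are made up for illustration.

```python
# Illustrative only: why the measured PUE exceeds dPUE at partial IT load.
# Facility overhead is crudely split into a fixed part (standing losses,
# lighting, controls) and a part proportional to the IT load.

DESIGN_IT_LOAD_KW = 1000.0
FIXED_OVERHEAD_KW = 150.0          # roughly constant regardless of IT load
PROPORTIONAL_OVERHEAD = 0.25       # overhead kW per kW of IT load


def pue_at(it_load_kw):
    overhead = FIXED_OVERHEAD_KW + PROPORTIONAL_OVERHEAD * it_load_kw
    return (it_load_kw + overhead) / it_load_kw


if __name__ == "__main__":
    for utilization in (1.0, 0.5, 0.2):
        load = DESIGN_IT_LOAD_KW * utilization
        print(f"{utilization:4.0%} of design IT load -> PUE {pue_at(load):.2f}")
    # 100% -> 1.40 (the quoted dPUE), 50% -> 1.55, 20% -> 2.00
```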

This encourages the owner of a yet-to-be-built data center to claim a low dPUE. It is an estimate, after all: who is to say the figure of 1.13 is wrong? You want to check my calculations? Talk to my design consultants, who are the ones who worked out that number (at my insistence that they assume the best-case situation to arrive at a low dPUE).

Beijing's announced ban on new data centers with a PUE of 1.5 or above really means designed PUE. Given that it is a designed figure, a lot can go into estimating a low dPUE. Who is going to shut off the power after the facility has been designed, the equipment selected, and the site built and operating at well below full capacity, thus yielding a poor actual interim PUE? There are many ways to make the dPUE figure work to your advantage; see reference 1.

You may ignore ancillary power usage, assume a very low predicted mechanical load, or cite the most power-efficient chiller in the design but choose a less efficient chiller when you purchase the actual equipment. Or you may base your dPUE on the PUE1 or PUE2 way of calculating, which makes it look slightly better. It all adds up (or rather, subtracts).

[Image: PUE at design load chart]

Credit: CCG Facilities. http://www.ccgfacilities.com/insight/detail.aspx?ID=18

From my experience of operating and auditing more than a dozen data centers, I have seen very crude designed PUE estimations and some better ones.

The thing is that the designed PUE always looks too good, and this stems from the following (a rough sketch of how these add up follows the list):

  • Not including some of the data center infrastructure losses
  • Not including electricity losses in the cables (around 3%)
  • Assuming installed equipment performs exactly to factory specifications, with no tolerance for deviation
  • Estimating for the PUE1 situation, i.e. measuring at the UPS output, whereas PUE2 or PUE3 is the recommended approach
  • Environmental conditions over 12 months in a real data center will vary and be sub-optimal
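
As a sketch of how these items compound (all figures below are hypothetical), adding back a few commonly excluded losses quickly moves a headline dPUE upward:

```python
# Hypothetical illustration: adding back commonly omitted losses to a
# headline designed PUE. All figures are made up for illustration.

IT_LOAD_KW = 1000.0
CLAIMED_OVERHEAD_KW = 300.0        # headline dPUE = 1.30

omissions_kw = {
    "cable and distribution losses (~3% of IT load)": 0.03 * IT_LOAD_KW,
    "ancillary loads (offices, security, NOC)": 40.0,
    "installed equipment below quoted efficiency": 30.0,
    "sub-optimal weather versus the ideal design point": 50.0,
}

overhead = CLAIMED_OVERHEAD_KW
print(f"claimed dPUE: {(IT_LOAD_KW + overhead) / IT_LOAD_KW:.2f}")
for item, kw in omissions_kw.items():
    overhead += kw
    print(f"+ {item}: dPUE -> {(IT_LOAD_KW + overhead) / IT_LOAD_KW:.2f}")
# the same design ends up around 1.45 before even considering partial IT load
```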

A friend of mine who works at a data center co-location service provider laments that their honesty earned them a lower category in a green data center award than others in the same city that claimed lower dPUE figures and received higher awards. It may not be entirely due to the lower dPUE figures, but they play a part.

Clients are not fools, and a data center colocation service provider that claims such a low dPUE will find it tougher to negotiate co-location service contracts, as power bill recovery in some countries is tied to the actual PUE but benchmarked against the dPUE as the site approaches full utilization. This will eat into the provider's profits.

Ultimately, what matters is the real PUE3, measured over a period of 365 days at the current client IT power load, and a 100% leased-out co-location data center, which means full endorsement by the clients. Nothing speaks better than the ka-ching at the cash register; no billboard outside will take money out of the wallets of potential clients. It is the design, equipment selection, measurement and reporting, tight operations, continuous monitoring and enhancement, and people that all combine into a well-run and well-respected data center facility with a happy clientele that grows the co-location business. Playing with dPUE gets some attention, but delivering the service consistently and having clients take up more of your data center space is the indicator of a healthy data center business.

It is my hope that energy-efficiency awards for data centers will be based on actual PUE rather than designed PUE.

Reference:

  1. http://www.ccgfacilities.com/insight/detail.aspx?ID=18
  2. https://www.greenbiz.com/article/new-efficiency-standard-challenges-data-center-status-quo
  3. http://www.datacenterknowledge.com/archives/2009/07/13/pue-and-marketing-mischief/
  4. ISO/IEC 30134-2    Part 2, Power Usage Effectiveness (“PUE”) – http://www.iso.org/iso/home/store/catalogue_tc/catalogue_tc_browse.htm?commid=654019

Human Error: The biggest challenge to data center availability and how we can mitigate – Part 1

[Image: IT engineer with server]

The 2016 Ponemon Institute research report on the cost of downtime (reference 1) contains a chart showing the causes of data center downtime. It puts accidental human error at 22%, and the top six contributors to downtime are UPS system failure (25%), cyber crime (22%), accidental human error (22%), water/heat/CRAC failure (11%), weather-related events (10%), and generator failure (6%). However, the accidental human error category does not account for latent human error that could have contributed to those UPS/CRAC/generator failures.

[Image: Ponemon 2016 chart of root causes of data center downtime]

The Uptime Institute has cited that 70% of data center outages can be attributed to human error.

The definition of human error is broader and can generally be classified into active error (where a deliberate action causes a deviation from the expected outcome) and latent error (where a non-deliberate action causes a deviation from the expected outcome). An example of a latent error is a design decision on the power protection circuit for a data center room that was not fully coordinated to isolate a power issue and prevent it from cascading upstream to higher-level circuit breakers.

There have been many major outages in the past few years attributed to human error. The 2016 Delta Air Lines data center outage is reported to have cost them USD 150 million. Part of the long delay (3 days) in resuming service was that a significant part of their IT infrastructure was not connected to a backup power source, which begs the question of why it happened that way. It was likely due to latent error, where the IT equipment or the in-rack PDUs were not fed from two separate UPS sources or supported by an in-rack ATS.

During my presentation on this subject, I was asked whether a higher Tier level, i.e. a data center designed and implemented with higher resiliency, can minimize the issue of human error. My answer: you can design and implement 2N power and cooling infrastructure, but when one N is taken down for maintenance, any mistake or weakness (inexperienced operations staff or vendor personnel, a procedure gap that someone overlooked and guessed wrongly about, etc.) can take down the IT load. This has happened to many data centers (search for human error and data center outage incidents).

[Image: Swiss cheese model of accident causation]

There are multiple ways for human error to manifest in a data center outage: a simple external trigger that passes through the holes in the layers, like the Swiss cheese model above; a cascade (a combination of factors); or direct active human error.

As an example of a cascade, a lightning strike that causes a momentary power dip (see references) should not cause an outage in a data center. However, if the selection of circuit protection devices or the design did not cater for how the DRUPS would respond in such a situation, and the automated controls were not configured to deal with it, then no amount of SOP/MOP/EOP or Method of Statement and Risk Assessment (MOS-RA) documentation may protect the facility against that particular external trigger. In one case, at a data center in Sydney, circuit breakers that were not designed and selected for such a scenario caused the UPS to supply power to the grid instead of to the load.

As for direct human error, I also know of a case where a UPS manufacturer-trained and authorized service engineer caused an outage: the engineer did not follow the documented service manual and caused the entire UPS set to trip, and because the circuit protection devices were not able to isolate the fault downstream, the upstream incoming breaker tripped as well. This is part of the reason why data center staff should accompany and question the service engineer at critical checkpoints during the servicing of critical infrastructure.

An outage can also be a failure of the resilient design/implementation due to under-capacity. This can be traced to latent error (no tracking of actual power capacity versus designed capacity) or active error (no checking of UPS capacity before maintenance). For example, the actual power load on an N+1 UPS set has grown until it is effectively N, and when one of the UPS modules goes down, the entire UPS set shuts down.
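
A minimal pre-maintenance check along these lines is sketched below (the module rating, derating factor, and load figures are hypothetical): before taking a UPS module out of service, verify that the remaining modules can still carry the measured load.

```python
# Hypothetical pre-maintenance check: can the remaining UPS modules carry the
# measured load once one module is taken offline? Ratings and loads are made up.

MODULE_RATING_KW = 500.0
DERATING = 0.9   # do not plan to run the remaining modules above 90% of rating


def safe_to_take_one_module_down(total_modules, measured_load_kw):
    remaining_capacity_kw = (total_modules - 1) * MODULE_RATING_KW * DERATING
    return measured_load_kw <= remaining_capacity_kw


if __name__ == "__main__":
    # An "N+1" set of 3 x 500 kW modules: fine at 800 kW, but once the load has
    # crept up to 950 kW the redundancy has already been consumed.
    print(safe_to_take_one_module_down(3, 800.0))   # True  (900 kW remaining)
    print(safe_to_take_one_module_down(3, 950.0))   # False (load exceeds headroom)
```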

In the next post, measures to mitigate the risk of human error will be discussed.

References:

  1. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  2. https://aws.amazon.com/message/4372T8/
  3. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  4. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/

Trust but verify

Incident #1

In the middle of last year, while I was in Shanghai conducting a data center training course, a friend and I arranged to meet up in a café and ordered lunch. Something was wrong when the waiter brought three main courses; we told him we had only ordered two. He said he had also found it strange that we would order three mains. He had penciled down our orders on his order sheet, which he should have used to repeat our orders back to us. His supervisor came over to apologize for the confusion and cancelled the extra order.

Incident #2

Back when I was working at an IT outsourcing company, I was called out to assist another site's investigation of a problem.

A VIP user in the company had accidentally deleted an important email, and a service request was raised to restore his mailbox from the previous night's backup copy. Zilch. Nothing. So the server admin used the tape from the day before: nothing.

They went all the way back to a week earlier. Still nothing. When they checked the tapes, not only was there no backup of that VIP user's mailbox, none of the Exchange server mailboxes had been backed up. They then checked the backups of all the enterprise servers, which were handled by a dedicated server that ran the backups for six hours starting from midnight. Nothing. It had been backing up nothing since a year earlier, when they migrated to a centralized backup server using NetBackup software.

The CIO demanded an investigation by the IT outsourcing vendor and I was called onsite.

I asked my senior support engineer, a backup expert, to come along. With a quick check he noticed that a checkbox in the NetBackup software on the centralized backup server was not ticked. If ticked, that checkbox makes the job back up the system and data as an incremental backup when a previous full backup exists, and as a full backup when none exists. In this situation there was no full backup anywhere in the trail of backups, and because the checkbox was not ticked, the tape drives had been backing up nothing at all.

A simple daily backup checklist, checking the backup log and performing a test restore, would have prevented the problem.
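
A minimal sketch of such a daily check is shown below. It is generic and not tied to NetBackup or any other product; the paths, threshold, and byte counts are hypothetical. It confirms that the last job wrote a non-trivial amount of data and that a sample file restores intact.

```python
# Hypothetical daily backup verification: (1) the last backup job wrote more
# than a trivial amount of data, and (2) a known sample file restores and
# matches its original checksum. Generic sketch, not tied to any backup product.

import hashlib
import pathlib

MIN_BACKUP_BYTES = 1_000_000  # a nightly mailbox backup writing ~0 bytes is a red flag


def sha256(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_backup(bytes_written, original_file, restored_file):
    if bytes_written < MIN_BACKUP_BYTES:
        print(f"FAIL: last backup job wrote only {bytes_written} bytes")
        return False
    if sha256(original_file) != sha256(restored_file):
        print("FAIL: restored sample does not match the original")
        return False
    print("OK: backup wrote data and the sample restore verified")
    return True


if __name__ == "__main__":
    # bytes_written would be parsed from the backup job log; the two paths are a
    # known sample file and its copy restored from last night's backup (hypothetical).
    original = pathlib.Path("sample_mailbox_export.pst")
    restored = pathlib.Path("restore_test/sample_mailbox_export.pst")
    if original.exists() and restored.exists():
        verify_backup(5_000_000, original, restored)
```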

Incident #3

A couple of years ago, a brownfield data center project, i.e. the retrofit of an industrial building into a dedicated data center co-location facility, was underway in Beijing; it took more than two years (why the retrofit took so long is another story). The main parties involved in the project were the owner's data center project team (let's call them A), the main contractor for the project (B), the third-party project supervision company appointed by A (C), the data center design firm (D), and various others.

Three outdoor chilled water storage tanks were delivered to site, and arrangements were being made to install them. The tanks are cylindrical, 8 meters tall with a circumference of 1.5 meters. However, project superintendent C found that the specifications were wrong: according to the last drawings approved by D, the tanks should have been 12 meters by 1 meter. The sub-contractor for the chilled water storage tanks (let's call them E) and main contractor B were adamant that there had been an agreement, in response to a request by owner A, to reduce the height of the tanks (so that they would be shorter than the building), thereby changing the specifications. D said there was no record of such a request, nor had they changed the design. A said there should be one.

The tanks sat on site uninstalled for four full days. Finally, at the insistence of A's project manager, a face-to-face meeting of all five involved parties (A, B, C, D, E) was held, and the request to change the design of the chilled water storage tanks was found in email communications between D and E. The problem lay in the absence of documented project minutes and in the parties not keeping proper records, which caused delays in sorting things out. Time was wasted on this and many other things, which is one of the reasons why the project was late and took so long.

So, quoting a famous sentence attributed to Ronald Reagan, "Trust, but verify", and follow up immediately with documentation.

Reference:

  1. https://en.wikipedia.org/wiki/Trust,_but_verify

Data Center Tiers, No Tears, No Plus, and No Minus

Published 13 August 2016

This post is also available via https://newwitblog.wordpress.com/

  1. Background

In press releases, promotional material, and the websites of some data center service providers, we often come across terms like Tier 3+, Tier 4-, or Tier 3.5. This gives the impression that the facility has a higher level of resiliency in terms of design or implementation.

 

  2. What's in a Tier/Rated/Facility-Class

The Tier Classification System is trademarked by the Uptime Institute (UTI). UTI will assess and award the appropriate Tier level if a data center facility owner or private data center client engages UTI to perform such an evaluation. UTI issues Tier levels in Roman numerals: I/II/III/IV. https://journal.uptimeinstitute.com/explaining-uptime-institutes-tier-classification-system/

The Telecommunications Industry Association, an American organization that issues telecommunications cabling and facility standards, published ANSI/TIA-942-A, titled "Telecommunications Infrastructure Standard for Data Centers". The latest 2014 version contains three informative annexes (D, E, F) on data center space considerations, site selection and building design considerations, and data center infrastructure rating. Using the informative annexes of TIA-942-A, a data center facility can be rated across four categories (Telecommunications, Architectural and Structural, Electrical, and Mechanical) as Rated 1 – Basic, 2 – Redundant Component, 3 – Concurrently Maintainable, or 4 – Fault Tolerant.

The EN 50600 standard classifies a data center in a similar manner to TIA-942-A, but adds a Facility Class 0 (FC-0), while FC-1 through FC-4 are essentially the same as TIA-942-A's Rated 1 through 4. FC-0 is basically a computer room with servers connected directly to utility power, without backup power.

 

  3. Plus? Minus?

None of the abovementioned standards mentions a +/- modifier for any rating or classification, and none gives room for a partial or fractional rating; neither does UTI for its Tier awards. So a data center can only be awarded a certification that states Tier III, Rated 3, or Facility Class 3, not 3.5, 3+, or 4-.

 

  4. Dig Deeper Behind that Claimed Rating

If a particular data center facility announces that it has a Tier 3+ facility, check whether the rating was issued by any competent third party or technical audit firm. No competent third party or technical audit firm should or will issue such a non-standard rating.

Such Tier 3+ or Tier 4- labels are self-proclaimed ratings, an effort by the data center facility to signal that it has features better than Tier 3 or just a tad below Tier 4, yet without a competent third party having evaluated whether the facility even meets, say, Rated 3 in the Electrical and Mechanical categories in the first place.

If that particular data center facility were evaluated by a third party to be Tier 4 in the Electrical category and Tier 3 in the Mechanical category, it would be given the lowest common rating, i.e. a Tier 3 rating.

 

  5. What Should a Potential Data Center Client Do

If the Tier level is self-proclaimed without the words "certified by", or is phrased along the lines of "our latest data center is designed to Tier 3+ resiliency", then it is most likely not certified by any third party, and the potential data center client should insist on a competent technical third party evaluating the technical attributes of the data center before considering colocating their IT equipment there.

  • Ask the site to substantiate the self-proclaimed rating through a third party

We should simply disregard the +, and mentally de-rate a - or a decimal: if we see a Tier 4- or a Tier 3.5, we should treat the facility as designed to Tier 3, and if we decide to consider it, engage a competent technical third party, or better yet insist that the facility owner engages a third party and bears the cost.

The data center facility may dangle the Tier 3+ label as an indirect indication that its site is of high quality, implying that a higher premium is justified. However, the potential client should have a site selection process with clear requirements for a data center facility and should not attach any score to the site unless it is justified through third-party certification. Having a certification should be viewed as a hygiene factor. The evaluation criteria should request data on technical, business/financial, and operational attributes, which allows normalization and comparison across the shortlisted sites.

  • Do not over-rely on the Tier level rating

A Tier 4 data center facility doesn't mean no downtime. It is fault tolerant, but trouble rarely comes once; it may come twice or thrice. And it doesn't take a power or cooling issue to bring down a critical IT system within a data center. Humans can cause problems. In the July 2016 incident at the Singapore Stock Exchange, where its trading system was unavailable for more than 5 hours, a hard disk failure dragged down the entire trading system. A distributed denial of service attack or a telecommunications problem can also bring down IT.

  • Tier level and suitability to the client's business IT needs

A data center's main function is to house IT equipment. Some IT equipment requires fault-tolerant power and cooling support, while a test environment can sit a rung or two lower in power and cooling resiliency. A facility that lets you have a private suite housing critical production IT equipment in a Tier 4 set-up, plus a small suite or even a cage in a shared Tier 3 co-location hall, is then more suitable, giving a combined set-up that meets both the business need and the best bang for the buck. This is also called a multi-tier or flexible-tier set-up. Not every data center facility can meet this need, or the cost is higher, because the base set-up of that facility would involve heavy re-work compared with one that is ready from day one to be flexible in this respect.

  • Evaluate using a comprehensive set of evaluation criteria

What a potential data center client should do is look beyond the rating level, as whether a facility is designed, implemented, and certified to a particular rating is just one facet of its suitability for the client's IT needs. There is a multitude of other factors, including telecommunications facilities, the facility operations system, and the competencies of the facility people, among others, that count towards resilient IT operations in a data center.

  • 24×7 on-the-ball operations, and watch that capacity

Sometimes the Tier rating level effectively drops because the designed capacity is breached: N+1 suddenly becomes N and the site loses its redundancy. A concurrently maintainable or fault-tolerant electrical design does mean that when one N of the 2N UPS is taken offline for servicing, the planning and execution of such maintenance should follow proper procedures (SOP, MOP, MOS) and backup or roll-back plans (RA). You want to minimize the risk, and the risk window, of a UPS problem while the other set of UPS is offline for maintenance. You should also not allow UPS maintenance and backup generator maintenance to take place at the same time, because this doubles the risk: if the remaining N of UPS fails while the generators are in manual mode, you will be forced to rely solely on the utility supply. The maintenance should take place during non-operational hours. All these things come into play, and vendor experience is very important.
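
As a small illustration of this scheduling discipline (activity names and windows are hypothetical), a planning check can reject any plan in which two redundancy-reducing activities overlap:

```python
# Hypothetical scheduling guard: flag overlapping maintenance windows for
# activities that each reduce redundancy (e.g. UPS service and generator service).

from datetime import datetime

planned = [  # (activity, start, end) - illustrative windows only
    ("UPS set A service",   datetime(2017, 5, 6, 1, 0), datetime(2017, 5, 6, 5, 0)),
    ("Generator 1 service", datetime(2017, 5, 6, 3, 0), datetime(2017, 5, 6, 7, 0)),
    ("Chiller 2 service",   datetime(2017, 5, 7, 1, 0), datetime(2017, 5, 7, 4, 0)),
]


def overlapping(windows):
    """Return pairs of redundancy-reducing activities whose windows overlap."""
    clashes = []
    for i, (name_a, start_a, end_a) in enumerate(windows):
        for name_b, start_b, end_b in windows[i + 1:]:
            if start_a < end_b and start_b < end_a:
                clashes.append((name_a, name_b))
    return clashes


if __name__ == "__main__":
    for a, b in overlapping(planned):
        print(f"Reschedule: '{a}' overlaps with '{b}'")
    # -> Reschedule: 'UPS set A service' overlaps with 'Generator 1 service'
```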

 

Reference:

  1. https://journal.uptimeinstitute.com/explaining-uptime-institutes-tier-classification-system/
  2. http://www.tia-942.org/content/162/289/About_Data_Centers
  3. http://www.computerweekly.com/tip/Four-data-center-tier-classification-misconceptions-demystified
  4. http://searchdatacenter.techtarget.com/feature/What-colocation-customers-should-know-about-data-center-tiers
  5. https://www.linkedin.com/pulse/sharing-data-center-site-selection-evaluation-james-soh?trk=mp-author-card

External Event that Affected Data Centers and Lessons Learnt

[Image: Tianjin explosion]

Background

Exactly one year before this article was posted, i.e. on 12 August 2015 at about 22:51 hours, a fire broke out at a dangerous goods warehouse in Tianjin's Binhai district. After the fire services arrived on site, an explosion rocked the site at 23:30 hours, followed by another explosion at 23:34 hours.

There are at least two data center facilities within a 2 km radius of the site and another three facilities within a 3.8 km radius. The data centers were already built and running before the dangerous goods warehouse (constructed in 2013) came into existence.

As noted in one of my earlier posts on data center site selection, it is one thing to assess environmental factors and the distance from dangerous goods storage facilities before selecting a data center site, but such an exercise should also be repeated on an annual basis to evaluate whether the environmental risk has changed.

[Image: Tianjin incident]

Data Centers in the Blast Radius

The data centers within this 2 km circle include Tencent, the Standard Chartered Bank Tianjin back-end processing center (which contains a data center), the National Supercomputer Center, the China Hewlett-Packard cloud solution center, and the Liepin (a job search company) data center. Data centers that are further away but still within Tianjin include those of Sohu, 58TongCheng, 21Vianet, China Telecom, and China Unicom.

Data centers known to have stopped operations or whose IT services were impaired

  1. National Supercomputer Center
  2. Tencent Tianjin Data Center

The Tencent Tianjin Data Center had to stop operations due to an evacuation order by the authorities, as the chemical fumes from the incident site were harmful to humans. Damage to existing data center equipment and facilities was not too serious, as the wall of the Tencent facility closest to the explosion site was undergoing fitting-out work as part of its phase 2 project. However, the force of the explosions was so great that some air-handling units awaiting installation for phase 2 were moved a meter or so.

Tencent transferred their Tianjin data center workload over to their main data center in Shenzhen and had their people evacuated. Their vendors were informed and ready to repair damaged equipment once the evacuation order was lifted.

From Tencent's response and preparedness, we can deduce a few key factors that made them ready for a major power or site accessibility incident:

 

  1. The People Factor – the people at both their Tianjin and Shenzhen facilities were trained and ready for such a major event.
  2. The System Factor – Tencent has several public-facing systems such as QQ, WeChat, WeChat Pay, Tencent eMall, gaming, and video streaming; the backend systems are kept up to date, and the frontend apps are engineered in such a way that they can be served from any of their data centers. Tencent probably had to do something on the backend to ensure the data integrity of their transactional systems (e.g. online gaming and Tencent eMall) despite the unavailability of the Tianjin facility.

 

However, from what some Tencent staff shared on social media, they were unprepared in terms of:

  1. Food and Drinking Water Supply
  2. Accommodation
  3. Up-to-date news, though this was probably due to the lockdown or disparate information from the authorities during the first 24-48 hours of the incident.

[Image: Tianjin news photo]

[Image: Inside an office]

[Image: Inside an office building]

Other similar incidents

On 21 August 2015, there was an explosion, attributed to a generator in a building basement, in downtown Los Angeles (http://www.datacenterknowledge.com/archives/2015/08/21/explosion-downtown-los-angeles-disrupts-data-center-operations/).

Quote “The blast at 811 West Wilshire Blvd. took out an on-site power station, leaving 12 buildings in the area without electricity, according to the local utility.

The explosion interrupted connectivity on network infrastructure operated by Level 3 Communications, which serves a lot of data center users in the area, Craig VerColen, a spokesman for LogMeIn, a company whose data center went dark as a result of the incident, said via email. Level 3 issued a statement saying its technicians were working to restore services.”

 

List of DOs for data center service providers

  • Regular review of site profile and surroundings, assess risk and devise mitigating measures
  • Business Continuity and Disaster Recovery plan and exercise
  • Evacuation Plan and exercise
  • Disaster Response Checklist (update those that are not up to date, check your fuel supply vendor contacts and their ability to supply fuel)
  • Updated Vendor Contact List
  • Include a damage assessment checklist that lists all the major equipment and systems, including those inside the building, the data halls, plant rooms, fuel tanks, and the surroundings.
  • Stay in touch with every staff, on and off shift
  • Update staff on safe transport route
  • Procure food supplies and place in office
  • Inform your legal and finance people (document your loss, and ask your legal and finance teams to prepare for insurance claim)
  • Get ready additional resources, including staff and transport
  • TIA-942-A (2012) states that a data center site should be more than 0.4 km away from a chemical plant. It is better to be safer, for example more than 3.2 km away (a distance-check sketch follows this list).
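
As an illustration of the first review item above (the coordinates below are made up, and the haversine formula assumes a spherical Earth), a periodic site-risk review can include a simple great-circle distance check between the site and known hazards:

```python
# Hypothetical periodic site-risk check: great-circle distance from the data
# center to known hazardous facilities, flagged against a chosen threshold.

import math

SAFE_DISTANCE_KM = 3.2  # stricter than the 0.4 km minimum cited from TIA-942


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


if __name__ == "__main__":
    site = (39.00, 117.70)  # made-up data center coordinates
    hazards = {"dangerous goods warehouse": (39.02, 117.72),
               "fuel depot": (39.10, 117.90)}
    for name, (lat, lon) in hazards.items():
        d = haversine_km(site[0], site[1], lat, lon)
        status = "OK" if d > SAFE_DISTANCE_KM else "TOO CLOSE - review required"
        print(f"{name}: {d:.1f} km ({status})")
```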

 

Reference:

  1. https://zh.wikipedia.org/wiki/2015%E5%B9%B4%E5%A4%A9%E6%B4%A5%E6%B8%AF%E5%8D%B1%E5%8C%96%E5%93%81%E5%80%89%E5%BA%AB%E7%88%86%E7%82%B8%E4%BA%8B%E6%95%85
  2. http://www.missionmode.com/blog/business-continuity-lessons-tianjin-port-explosion/
  3. http://www.datacenterknowledge.com/archives/2015/08/21/explosion-downtown-los-angeles-disrupts-data-center-operations/

Focus on Data Center Energy Efficiency: ISO/IEC 30134 and the Correct Use of PUE

Introduction

For the English version, please see: https://newwitblog.wordpress.com/2016/06/10/data-center-resource-efficiency-pue-isoiec-30134/

Since The Green Grid published Power Usage Effectiveness ("PUE") in 2006, it has been used within the data center industry and beyond as the measure of data center energy efficiency. It is the ratio of a data center's total energy consumption over a cumulative twelve months to the energy used by the IT load. The Green Grid's original intent in promoting PUE was for data center owners to first measure their facility and then formulate energy-saving plans to lower the PUE.

Although PUE is simple and easy to understand, incorrect use creates confusion and can even mislead. Not using a full 12-month measurement period, or reporting a PUE below 1.5 based on the PUE1 measurement point, are examples of incorrect practice.

Taking a low PUE value, which is simply a tool for managing a data center's own energy efficiency, and using it to compare one data center against another is also a misuse of PUE. Fuzzily reported PUE values actually lower the credibility of PUE figures.

After observing this misuse of the PUE tool, The Green Grid published a series of reports giving guidance on the correct measurement points and usage of PUE and refining the related PUE terminology.

The International Organization for Standardization (ISO) set up the ISO/IEC JTC 1/SC 39 committee in 2014 specifically to develop key performance indicator standards for data center energy use; The Green Grid actively participated as an observer and contributed PUE and related material.
In April 2016, ISO published three parts of the data center energy-use key performance indicator standard (ISO/IEC 30134):

  • ISO/IEC 30134-1    Part 1, Overview of ISO/IEC 30134
  • ISO/IEC 30134-2    Part 2, Power Usage Effectiveness ("PUE")
  • ISO/IEC 30134-3    Part 3, Renewable Energy Factor ("REF")

The official standard documents above can be purchased from the ISO online store.

The ISO/IEC JTC 1/SC 39 committee is discussing and planning two further parts, which will cover IT Equipment Energy Efficiency for Servers (ITEE) and IT Equipment Utilization of Servers (ITEU_SV).

 

Key Considerations

Now that PUE is an ISO standard, all collection and calculation of PUE values must comply with the standard (ISO/IEC 30134-2); for example, the IT load should be measured at the in-rack power distribution outlets closest to the IT equipment.

Almost all PUE figures quoted in data center marketing material do not state the PUE category, or whether the figure is only a designed PUE (dPUE). ISO/IEC 30134-2 specifies that PUE2 takes the IT equipment power consumption from the row-end cabinets or room-level power distribution units. The most accurate approach is to take the IT equipment power consumption from the in-rack power distribution outlets, which is the PUE3 calculation method. ISO/IEC 30134-2 also advises that a PUE between 1.2 and 1.5 may be collected using the PUE2 method, while a PUE of 1.2 or below must be collected using the PUE3 method.
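
To illustrate the difference between the categories (the meter readings below are hypothetical, and ISO/IEC 30134-2 requires twelve months of energy data at the stated measurement points rather than spot power readings), note that the IT denominator shrinks as the measurement point moves closer to the IT equipment, so the same facility reports a higher PUE under PUE3 than under PUE1:

```python
# Hypothetical annual energy readings (kWh) illustrating how the reported PUE
# changes with the IT measurement point defined for each category in
# ISO/IEC 30134-2. Real reporting uses 12 months of energy data.

TOTAL_FACILITY_KWH = 14_000_000
IT_AT_UPS_OUTPUT_KWH = 10_000_000   # PUE1 measurement point
IT_AT_PDU_OUTPUT_KWH = 9_700_000    # PUE2: row/room PDU, after distribution losses
IT_AT_RACK_OUTLET_KWH = 9_500_000   # PUE3: in-rack outlets, closest to the IT load

for category, it_kwh in (("PUE1", IT_AT_UPS_OUTPUT_KWH),
                         ("PUE2", IT_AT_PDU_OUTPUT_KWH),
                         ("PUE3", IT_AT_RACK_OUTLET_KWH)):
    print(f"{category}: {TOTAL_FACILITY_KWH / it_kwh:.2f}")
# PUE1: 1.40, PUE2: 1.44, PUE3: 1.47 - the same site looks "better" under PUE1
```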

ISO/IEC 30134-2 also includes derived PUE values, including designed PUE (dPUE), partial PUE (pPUE), and interim PUE (iPUE).

A data center's PUE is also not the only thing its clients care about. An often overlooked but important factor is the availability and redundancy of the power and cooling systems, including how adjustments made during design and during operations affect the PUE; this factor is covered in Annex A of ISO/IEC 30134-2. For example, if two data centers A and B are in the same city, with A reporting a PUE2 of 1.4 and B reporting 1.5, it is incorrect to conclude that A uses less energy than B without fully understanding their overall IT utilization, design, and operations. Data center B may run its standby chiller as part of N, i.e. N+1 chillers are always running, while data center A runs only N chillers; B can hardly be faulted for weighing the risk of a cooling capacity drop due to a chiller failure above energy savings. Or data center A's 2N UPS may all be running in economy (bypass to utility) mode while data center B has set its 2N UPS to standard double-conversion mode at a client's request; B's energy consumption is then higher, but if a new client requested economy mode, its PUE would come down.

 

PUE Is Now a Standardized Tool

PUE is one tool; data center energy management, key performance indicators, and other tools all help us manage data center resources (electricity, water, renewable energy, IT equipment utilization, and so on).

Site selection, design, equipment selection, operations, and IT equipment procurement all affect a data center's energy use, so energy consumption must be one of the main considerations at each of these stages.

Using PUE for publicity while being vague about whether the figure is PUE1/PUE2/PUE3/dPUE, or simply using PUE values to compare two data centers without considering all the factors that can affect PUE, does not help the healthy development of the data center industry.

With a unified standard, ISO/IEC 30134, correctly using it to measure and calculate PUE and other data center key indicators focuses attention on formulating long-term energy-efficiency improvement plans for the data center. Paying attention to these indicators, using them correctly, participating in the development of data center energy efficiency and other key performance indicators, and even trialling them and giving the responsible bodies feedback on how to strengthen and improve these tools, will be deeply worthwhile.

Finally, to repeat: PUE is useful when used appropriately, and helping your own data center reduce its energy consumption is the PUE tool's main function. When publicizing a PUE value, please state the PUE category or the full name of the derived PUE value.
