Can you help us build a tier 5 data center?

[Image: data center build photo]

A data center consultant, K, told me this story. Around 2005 or 2006 he gave a talk at a data center conference in a famous financial and resort city somewhere in Asia. A gentleman, J, walked up to K afterwards and introduced himself as a property developer looking into building a new data center. K's talk had covered data center standards, including the Uptime Institute Tiers and TIA-942, and J said he wanted to build a Tier 5 data center.

As an aside, let me defer to other posts and websites on the design standards and the Tier / Rated / Facility Class levels (see references 1 and 2). Generally speaking, most standards define data center design based on the resiliency required, on a scale of up to four levels.

K was taken aback and asked whether J was aware that the Tier levels top out at IV / 4. J said he knew, and that he wanted to go one better than Tier IV / 4. J explained that the city where he planned to build had no standalone data center facility yet, so he wanted to stand out, and that the city is well known for its extravagant hotels, malls and the like.

The idea that if you build it, they will come

K was kind enough to ask J whether he had done a market study and knew whether potential clients demanded a highly resilient, fault-tolerant data center. J replied that he had not, but that he believed demand would rush in once it was announced that such a data center would be built. Well, maybe, if you have done your study and know where the competition stands, for starters. But if you have not studied market demand and competition at all, then what you build may be over-built, or so far ahead of demand that it takes much longer than your optimistic timeframe to sell.

I have on multiple occasions met potential data center owners considering building their first data center in a non-first-tier data center market in Asia. Surprisingly, a common central theme of their plans hinges on the “build it and they will come” mindset. Today, several Asian cities are over-supplied not only in the residential and industrial sectors but also in the data center sub-sector, and over-confidence that demand will materialize once supply is there is one contributor to the situation. A data center facility is a huge investment. A China data center company I know had a well-sought-after facility in Beijing, but expanded into other cities it was less familiar with and suffered losses for years; this dragged down its overall finances and it was forced to sell its crown jewel under less-than-preferred circumstances and numbers.

Client needs and supply / demand

I have two points to make. Firstly, know your market, your competition, and your financial strength. If all your competitors in the market are building for shared-hosting-type clients, who only need a UPS-backed electrical supply to their IT servers, then building to a higher level of resiliency makes your data center space pricier and it will take longer to fill up, if it ever does. There were a few such cases in Singapore: some folded after building a data center, and some spent millions of dollars on projects that could not take off and are now in limbo. Many such cases also exist in China. One case in Singapore did prevail: they built their data center during the dot-com boom and were caught in the dot-com bust, which claimed several casualties, but this data center managed to survive by building up its facility floor by floor, unlike the other two, placing less strain on its financials during that period.

More prudent to match cost outlay to take-up

Secondly, the main technical infrastructure design parameter, whether to build to concurrently maintainable (roughly equivalent to Tier III / Rated 3 / Facility Class 3) or fault tolerant (Tier IV / Rated 4 / Facility Class 4), depends on client demand. If the target clientele are financial institutions, or organizations that for various reasons rely on IT but whose systems can only run on a single host or in an active-passive set-up (airline ticket reservation systems seem to be like that), then it makes sense. Another way is to plan for multiple levels of resiliency, i.e. share the same fault-tolerant electrical infrastructure but stay flexible enough to accommodate either concurrently maintainable or fault-tolerant client requirements (although this will generally be slightly more costly than a design and implementation that is purely concurrently maintainable).

Fortunately, there is so much information in the market these days that owners-to-be are better informed. My other gripe is with those who know a little about one particular topic of data center knowledge and yet are so convinced of it that it precludes meaningful exchange, but that is a story for a future post.

References:

  1. http://www.datacenterknowledge.com/archives/2016/01/06/data-center-design-which-standards-to-follow/
  2. https://uptimeinstitute.com/tiers
  3. https://www.linkedin.com/pulse/data-center-tiers-tears-plus-minus-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  4. https://www.linkedin.com/pulse/making-sense-data-center-standards-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

The problem with the (use of) PUE in the Data Center industry

[Image: montage of data centers]

I mentioned the reporting and use of PUE, including the terms iPUE, PUE3, dPUE and so on, in a previous post (https://www.linkedin.com/pulse/data-center-resource-efficiency-pue-isoiec-30134-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F).

Much like the fuzzy way Tier / Facility Class / Rated levels are mentioned in the industry, without making clear which standard a facility is designed to, or whether it is certified at all, this confusion does not help potential clients or the industry as a whole. Just to clarify, I take no stand against any data center saying that its facility is designed in accordance with a particular standard, given that any potential client should, and will, conduct a detailed review and audit of the facility before committing to a co-location deal.

The issue I would like to highlight in this post is the use of design PUE (dPUE), rather than measured PUE, to market a facility or even to set policy. dPUE is itself an estimate (as per the example case in ISO/IEC 30134-2) and imprecise. The gap between the actual PUE3 and the dPUE can be huge, given that the IT load of any new data center facility will normally not ramp up to anywhere near 100% for some time.
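
To make that gap concrete, here is a minimal sketch with purely hypothetical numbers, using a simple fixed-plus-proportional overhead model of facility power. It is not any particular facility's model; it only illustrates why a dPUE quoted at full design load looks much better than the PUE actually measured while the IT load is still ramping up.

```python
# Minimal sketch: why the actual PUE at partial load exceeds the design PUE.
# All figures are hypothetical; a real facility needs measured data (PUE3).

DESIGN_IT_LOAD_KW = 2000      # IT capacity the dPUE was estimated at
FIXED_OVERHEAD_KW = 150       # lighting, controls, base fan/pump power
PROPORTIONAL_OVERHEAD = 0.25  # cooling/electrical losses that scale with IT load

def pue(it_load_kw: float) -> float:
    """Total facility power divided by IT power at a given IT load."""
    overhead_kw = FIXED_OVERHEAD_KW + PROPORTIONAL_OVERHEAD * it_load_kw
    return (it_load_kw + overhead_kw) / it_load_kw

print(f"dPUE at 100% design load: {pue(DESIGN_IT_LOAD_KW):.2f}")        # ~1.33
print(f"Actual PUE at 20% load:   {pue(0.2 * DESIGN_IT_LOAD_KW):.2f}")  # ~1.63
```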

This encourages the owner of a yet-to-be-built data center to claim a low dPUE. You know, it is an estimate, so who is to say the figure of 1.13 is wrong? You want to check my calculations? Talk to my design consultants; they are the ones who worked out that number (at my insistence that they assume the best-case situation so as to arrive at a low dPUE).

The ban announced by Beijing on new data centers with a PUE of 1.5 or above really means design PUE. Given that it is a design PUE, a lot can go into estimating a low figure. Who is going to shut off the power after the facility has been designed, the equipment selected, and the site built and operating at well below full capacity, thus yielding a poor actual interim PUE? There are many ways to make the dPUE figure work to your advantage; see reference 1.

You may ignore ancillary power usage, assume a very low mechanical load, or cite the most power-efficient chiller in the design and then choose a less efficient one when you actually purchase the equipment. Or you may base your dPUE on the PUE1 or PUE2 way of calculating, which makes it look slightly better. It all adds up (or rather, subtracts).

[Chart: PUE at design load]

Credit: CCG Facilities, http://www.ccgfacilities.com/insight/detail.aspx?ID=18

From my experience operating and auditing more than a dozen data centers, I have seen some very crude design PUE estimations, and some better ones.

The trouble is that the design PUE always looks too good, and that stems from the following (see the sketch after this list):

  • Not including some of the data center infrastructure losses
  • Not including electrical losses in the cables (around 3%)
  • Assuming installed equipment performs exactly to factory specifications, with no tolerance for deviation
  • Estimating on a PUE1 basis, i.e. with the IT load measured at the UPS output, whereas PUE2 or PUE3 is the recommended way
  • Assuming ideal conditions, whereas environmental conditions over 12 months in a real data center will be sub-optimal
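
As a rough illustration of how these omissions add up, here is a sketch with hypothetical figures only. It compares an optimistic dPUE that takes the IT load at the UPS output and ignores ancillary loads against a PUE computed at the IT equipment input with everything included.

```python
# Rough illustration (hypothetical figures) of how common omissions shrink a dPUE.
# PUE = total facility energy / IT equipment energy; where the IT energy is
# measured (UPS output vs. IT equipment input) and which losses are included matter.

IT_EQUIPMENT_KW = 1000  # power actually drawn by the IT equipment
CABLE_PDU_LOSS = 0.03   # ~3% lost between the UPS output and the racks
UPS_LOSS_KW = 50        # UPS conversion losses
COOLING_KW = 280        # mechanical plant at the assumed conditions
ANCILLARY_KW = 40       # lighting, offices, security, NOC

ups_output_kw = IT_EQUIPMENT_KW * (1 + CABLE_PDU_LOSS)
total_kw = ups_output_kw + UPS_LOSS_KW + COOLING_KW + ANCILLARY_KW

# "Optimistic" dPUE: IT load taken at the UPS output, ancillary load ignored.
optimistic_dpue = (total_kw - ANCILLARY_KW) / ups_output_kw
# PUE with the IT load measured at the IT equipment input and all loads included.
realistic_pue = total_kw / IT_EQUIPMENT_KW

print(f"Optimistic dPUE (PUE1-style, ancillary ignored): {optimistic_dpue:.2f}")  # ~1.32
print(f"PUE at IT equipment input, all loads included:   {realistic_pue:.2f}")    # ~1.40
```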

A friend of mine who works at a data center co-location service provider laments that their honesty earned them a lower category in a green data center award than others in the same city that claimed lower dPUE figures and received higher awards. It may not be entirely down to the dPUE figures, but they played a part.

Clients are not fools, and a co-location service provider that claims such a low dPUE will find it tougher to negotiate co-location service contracts, because in some countries power bill recovery is tied to the actual PUE, yet clients will expect it to approach the claimed dPUE as the facility nears full utilization. This will eat into the provider's profits.

Ultimately, what matters is the real PUE3, measured over 365 days at the current client IT load, and a 100% leased-out co-location data center, which amounts to full endorsement by the clients. Nothing speaks better than the ka-ching at the cash register; no billboard outside will take money out of the wallets of potential clients. It is the design, equipment selection, measurement and reporting, tight operations, continuous monitoring and enhancement, and the people, all combined, that make a well-run and well-respected data center facility with a happy clientele that grows the co-location business. Playing with dPUE gets some attention, but delivering the service consistently and having clients take up more of your data center space is the real indicator of a healthy data center business.

It is my hope that awards for energy-efficient data centers will be based on actual PUE rather than design PUE.

References:

  1. http://www.ccgfacilities.com/insight/detail.aspx?ID=18
  2. https://www.greenbiz.com/article/new-efficiency-standard-challenges-data-center-status-quo
  3. http://www.datacenterknowledge.com/archives/2009/07/13/pue-and-marketing-mischief/
  4. ISO/IEC 30134-2, Part 2: Power usage effectiveness (PUE) – http://www.iso.org/iso/home/store/catalogue_tc/catalogue_tc_browse.htm?commid=654019

Human Error: The biggest challenge to data center availability and how we can mitigate – Part 1

[Image: IT engineer with server]

The 2016 Ponemon Institute research report on the cost of data center outages (reference 1) contains a chart showing the causes of downtime. It puts accidental human error at 22%, and the top six contributors to downtime are UPS system failure (25%), cybercrime (22%), accidental human error (22%), water/heat/CRAC failure (11%), weather-related incidents (10%), and generator failure (6%). However, the accidental human error figure does not account for latent human error that could have contributed to those UPS, CRAC and generator failures.

[Chart: Ponemon Institute 2016, root causes of data center downtime]

The Uptime Institute has cited that 70% of data center outages can be attributed to human error.

The definition of human error is broader, and errors can generally be classified into active errors (where a deliberate action causes a deviation from the expected outcome) and latent errors (where a non-deliberate action causes a deviation from the expected outcome). For example, a design decision on the power protection scheme for a data center room becomes a latent error if the protection devices are not fully coordinated to isolate a fault and prevent it from cascading upstream to higher-level circuit breakers.

There have been many major outages in the past few years attributed to human error. The 2016 Delta Air Lines data center outage reportedly cost the company USD 150 million. Part of the reason for the long delay (three days) in resuming service was that a significant part of their IT infrastructure was not connected to a backup power source, which begs the question of why it was set up that way. It was most likely a latent error: the IT equipment or the in-rack PDUs were not fed from two separate UPS sources, nor supported by an in-rack ATS.

During a presentation I gave on this subject, I was asked whether a data center designed and implemented to a higher tier level, i.e. with higher resiliency, can minimize the issue of human error. My answer: you can design and implement 2N power and cooling infrastructure, but when one N is taken down for maintenance, any mistake or weakness (inexperienced operations staff or vendor personnel, a procedure gap that someone overlooked and guessed wrongly about, and so on) can take down the IT load. This has happened to many data centers (a web search on human error and data center outage incidents will turn up plenty).

[Image: Swiss cheese model of accident causation]

There are multiple ways for human error to manifest in a data center outage. It can be a simple external trigger that passes through loopholes, like the holes in the Swiss cheese model above, a cascade (a combination of factors), or a direct active human error.

As an example of a cascade: a lightning strike that causes a momentary power dip (see reference 2) should not cause an outage in a data center. However, if the selection of circuit protection devices or the design did not cater for how the DRUPS would respond in such a situation, and the automated controls were not configured to deal with it, then no amount of SOPs, MOPs, EOPs or Method of Statement-Risk Assessment (MOS-RA) documents may protect the facility against that particular external trigger. In one case, a data center in Sydney whose circuit breakers were not designed and selected for such a scenario ended up with the UPS supplying power to the grid instead of to the load.

As for direct human error, I also know of a case in which a UPS manufacturer-trained and authorized service engineer caused an outage: the engineer did not follow the documented service manual and caused the entire set of UPS units to trip, and because the circuit protection devices could not isolate the fault downstream, the upstream incoming breaker tripped as well. This is part of the reason why data center staff should accompany the service engineer and question them at critical check-points during servicing of critical infrastructure.

An outage can also be a failure of a resilient design or implementation due to under-capacity. This can be traced to latent error (no tracking of actual power capacity versus designed capacity) or active error (no checking of UPS capacity before maintenance). For example, the load on an N+1 UPS set may have crept up until it is effectively running at N; when one of the UPS units went down, the entire UPS set shut down.
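
Below is a minimal sketch, with hypothetical capacities and loads, of the kind of pre-maintenance check that catches an N+1 UPS group that has silently degraded to N through load growth.

```python
# Minimal sketch: verify that a UPS group still has N+1 headroom before taking
# one unit offline for maintenance. Capacities and loads are hypothetical.

UPS_UNIT_CAPACITY_KW = 500
UPS_UNITS_INSTALLED = 4   # designed as N+1 with N = 3

def safe_to_take_one_down(current_it_load_kw: float) -> bool:
    """True if the remaining units can carry the load with one unit offline."""
    remaining_capacity_kw = (UPS_UNITS_INSTALLED - 1) * UPS_UNIT_CAPACITY_KW
    return current_it_load_kw <= remaining_capacity_kw

# Designed for 1500 kW of IT load (3 x 500 kW), but the load creeps up over time.
print(safe_to_take_one_down(1400))  # True  -- still genuinely N+1
print(safe_to_take_one_down(1600))  # False -- the "+1" has been eaten by load growth
```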

In the next post, measures to mitigate the risk of human error will be discussed.

References:

  1. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  2. https://aws.amazon.com/message/4372T8/
  3. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  4. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/

Trust but verify

Incident #1

In the middle of last year, while I was in Shanghai conducting a data center training course, a friend and I arranged to meet at a café and ordered lunch. Something was wrong when the waiter brought three main courses; we told him we had only ordered two. He said he had also found it strange that we would order three mains. He had penciled our orders onto his order sheet, which he should have used to repeat the orders back to us. His supervisor came over to apologize for the confusion and to cancel the extra order.

Incident #2

Back in the days when I was working at an IT outsourcing company, I was called out to assist with another site's investigation of a problem.

A VIP user in the company had accidentally deleted an important email, and a service request was raised to restore his mailbox from the previous night's backup. Zilch. Nothing. So the server admin used the tape from the day before. Nothing.

And so on, all the way back to a week earlier. Still nothing. They checked the tapes: not only was there no backup of that VIP user's mailbox, the entire Exchange server's mailboxes had not been backed up. They then checked the backups of all the enterprise servers, which were handled by a dedicated backup server running for six hours from midnight. Nothing. It had been backing up nothing since a year earlier, when they migrated to a centralized backup server running NetBackup software.

The CIO demanded an investigation by the IT outsourcing vendor and I was called onsite.

I asked my senior support engineer, a backup expert, to come along. A quick check by him showed that a checkbox in the NetBackup software on the centralized backup server was not ticked. When ticked, this option backs up the system and data as an incremental backup if a full backup exists, and as a full backup if none exists. In this situation there was no full backup anywhere in the backup trail, and with the checkbox unticked the tape drives had been backing up nothing at all.

A simple daily backup checklist covering the backup log, a test restore and a verification of the restored data would have prevented the problem.
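
As an illustration only, here is a minimal sketch of such a daily check. The paths, the size threshold and the restore-tool command are hypothetical placeholders; a real environment would query the backup software's own catalogue and restore mechanism.

```python
# Minimal sketch of a daily backup sanity check: confirm that last night's
# backup contains data and that a sample item can actually be restored.
# Paths, thresholds and the restore command are hypothetical placeholders.

import datetime
import pathlib
import subprocess

BACKUP_DIR = pathlib.Path("/backups/exchange")  # hypothetical backup target
MIN_EXPECTED_BYTES = 50 * 1024**3               # alert if last night's backup < 50 GB

def last_nights_backup_ok() -> bool:
    """A backup file from the last 24 hours exists and is not suspiciously small."""
    cutoff = datetime.datetime.now() - datetime.timedelta(hours=24)
    recent = [
        p for p in BACKUP_DIR.glob("*.bkf")
        if datetime.datetime.fromtimestamp(p.stat().st_mtime) > cutoff
    ]
    return bool(recent) and sum(p.stat().st_size for p in recent) >= MIN_EXPECTED_BYTES

def test_restore_ok() -> bool:
    """Restore one known test item to a scratch area and verify it appears there."""
    result = subprocess.run(
        ["restore-tool", "--item", "test-mailbox", "--to", "/tmp/restore-test"],
        capture_output=True,
    )  # "restore-tool" stands in for the real backup software's restore command
    return result.returncode == 0 and pathlib.Path("/tmp/restore-test").exists()

if not (last_nights_backup_ok() and test_restore_ok()):
    print("ALERT: backup verification failed; investigate before the day is out")
```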

Incident #3

A couple of years ago, a brown-field data center project, i.e. the retrofit of an industrial building into a dedicated data center co-location facility, was underway in Beijing. It took more than two years (why the retrofit took so long is another story). There were four main parties involved in the project: the owner's data center project team (let's call them A), the main contractor for the project (B), the third-party project supervision company appointed by A (C), the data center design firm (D), plus various other parties.

Three outdoor chilled water storage tanks were delivered to site, and arrangements were being made to install them. The tanks are cylindrical, 8 meters tall with a circumference of 1.5 meters. However, project supervision company C found the specifications to be wrong: according to the last drawings approved by D, they should have been 12 meters by 1 meter. The sub-contractor for the chilled water storage tanks (let's call them E) and main contractor B were adamant that there had been an agreement, in response to a request by owner A, to reduce the height of the tanks (so that they would be shorter than the building) and thereby change the specification. D said there was no record of such a request, nor had they changed the design. A said there should be one.

The tanks sat on site uninstalled for four full days. Finally, at the insistence of A's project manager, a face-to-face meeting of all five parties involved (A, B, C, D, E) was held, and the request to change the design of the chilled water storage tanks was found in an email exchange between D and E. The problem lay in the lack of documented project minutes and in none of the parties keeping proper records, which caused delays in sorting things out. Time was wasted on this and many other issues, which is one of the reasons the project ran late and took so long.

So, quoting a famous sentence attributed to Ronald Reagan, “Trust, but verify”, and follow up immediately with documentation.

Reference:

  1. https://en.wikipedia.org/wiki/Trust,_but_verify