On the same page – data center co-location market segments and terms

In China, Internet Data Center (IDC) is synonymous with all the following terms:

  •  Wholesale co-location data center
  • Multi-tenanted co-location data center
  • Retail co-location data center
  • Webhosting

The Chinese market does not break IDC down into these categories; everything is simply called an IDC.

The data center industry sometimes drops the term co-location, e.g. a wholesale co-location data center is simply called a wholesale data center.

I believe these separate terms were created by us, the data center service providers, ourselves.

I believe the different market segment terms came about because of the sheer size of the US market: somewhat distinct segments formed there, and they tend to differentiate along the scale of the client's data hall requirements.

  • Enterprise aka in-house data center
  • Wholesale co-location data center (preference to have one client per building)
  • Multi-tenanted co-location data center (preference for 1 customer = 1 data hall = 1 private vault = 1 private data room)
  • Retail co-location data center (One rack or within one rack, includes dedicated server hosting)
  • Webhosting aka managed hosting, shared server hosting, and virtual server hosting
  • Cloud Data Center

Let me put this into perspective. With the exception of the enterprise data center, the co-location terms describe the market from the perspective of the data center co-location service providers, not usually from the perspective of clients.

The big boys are the large-scale players, the DRTs and Equinixes, but there are many data center co-location service providers in the US that those of us outside that market have never heard of. Even DRT accounts for less than 10 percent of data center white space in the US, so there are a lot of service providers, and it is a big pie to slice and dice.

There is little global agreement on the definition of these terms, though. A particular type may dominate in a particular country simply because clients there are only looking for, say, retail co-location. In Jakarta, while dedicated data center buildings are being planned, most data centers are retail and multi-tenanted facilities in mixed-use buildings: demand is large in aggregate but comes from many small and mid-sized clients, because most large enterprises prefer self-built, in-house facilities.

There is no agreement on the term Cloud data center, since it can exist in any of the data center market segment types.

It is perhaps one of the reasons why there are very few campus-scale wholesale co-location data centers in China: customer demand has not reached that scale yet, and the way land is allocated in China does not favor data center industry parks, as local governments prefer job creation while data centers ultimately do not employ many workers. They do like cloud data centers, which they perceive will employ lots of IT engineers and programmers.

Clients do not care how you position your data center in the market by size; they only care whether your data center space meets their technical and financial requirements.

While the US, and thus the US-based data center co-location service providers, has enjoyed growth in demand for large-scale data centers, the rest of the global market may not see the same level of demand.

In Europe, when I attended the London edition of the DCD conference in 2016, the European multi-tenanted co-location service providers were telling the OCP workshop organizers that their data center space generally cannot cater to an entire facility meeting the average power requirements of OCP racks. In China, only Alibaba has gone the route of wanting large scale, while Baidu and Tencent manage their data center space growth through a presence in multiple buildings run by multiple multi-tenanted co-location service providers. A large-scale data center park builder in Hebei has faced the problem of building ahead of predicted demand, with few takers for the three data center buildings it has already completed.

I worked in a data center co-location service provider that started off offering retail co-location, i.e. one rack or a sub-division of a rack, and then moved into multi-tenanted co-location. We had clients of all types, and the clients did not really care how we defined our data center co-location market position. They only wanted to know whether we could meet their requirements and submit a price proposal. A rack-mount server fits into a standard IT rack, so whether the rack, the room, or the building is shared may not matter to the client.

In some cases, clients may actually prefer smaller data center service providers, who are considered more responsive, while larger enterprises may tend to go for global wholesale data center co-location service providers because they offer a global agreement and consistent standards. To each their own.

Back in the days of the dot-com bust, data center closures due to overbuilt supply saw the then-dominant players like Exodus and Digital Island change hands, scale down, or be sold.

At the end of the day, it is the clients and their needs that define which market segment will grow.

Reference:

  1. https://en.wikipedia.org/wiki/Colocation_centre
  2. https://cyrusone.com/corporate-blog/understanding-the-different-types-of-data-center-facilities/
  3. http://www.technavio.com/report/global-data-center-multi-tenant-wholesale-market
  4. http://www.missioncriticalmagazine.com/articles/88290-report-data-center-colocation-market-annualized-revenue-projected-to-reach-33bn-worldwide-by-end-of-2018
  5. https://structureresearch.net/product/marketshare-report-global-data-centre-colocation/

HUMAN ERROR: The Biggest Challenge to Data Center Availability and how we can mitigate it – Part 2

it_engineer_with_server

The previous article on this topic can be found via this link.

The layered approach to upholding data center infrastructure availability should not look like Swiss cheese: a hazard or trigger should eventually be stopped by one of the layers, and preferably as early as possible.

swiss-cheese

The layers should include the following:

  • Design (in accordance with design intent of owner) with either concurrent maintainability objective or fault tolerance
  • Implementation (in accordance with design brief) and fully tested via comprehensive testing and commissioning phase before handover with fully documented SOPs/MOPs/EOPs
  • Maintenance and operations management, where work by equipment service providers or any other work on site goes through a Method of Statement and Risk Assessment matrix prepared by suitably qualified persons
  • Incident and Problem management process, escalation management and mitigation process

and so forth

Inadequacy in each of these layers can result in:

  • Inherent Design / Setting flaw
    • Outdated design / Swiss cheese situation
    • Requires analysis and manual intervention
    • Error Producing Conditions (EPC)
  • Weakness in manual processes
    • Inadequate automation
    • Inadequate training / familiarity
    • Inadequate operations procedures
  • Insufficient Information / knowledge
    • Capacity limit reached earlier than design intent
    • Inadequate training / knowledge
    • Inadequate documentations
  • Insufficient Risk Assessment
    • MOS / RA, risk matrix
    • Vendor experience

 

Learn from other industries

Our data center industry is relatively young, and there are other industries with mission-critical infrastructure that have undergone extensive research and iterative enhancement which we can learn from and adopt.

  • Airlines' Crew Resource Management
    • Checklists and double-checking by pilot and co-pilot of the airplane's airworthiness
    • Communications within the cockpit, and between the cabin staff and the cockpit, to ensure timely and prioritized response
  • US Nuclear Regulatory Commission
    • Standardized Plant Analysis Risk – Human Reliability Analysis (SPAR-H) method to take account of the potential for human error
  • OECD's Nuclear Energy Agency
    • Ways to avoid human error, e.g.:
      • design systems to limit the need for human intervention;
      • distinctive and consistent labelling of equipment, control panels and documents;
      • display information on the state of the plant so that operators do not need to guess and make a faulty diagnosis;
      • design systems to give unambiguous responses to operator actions so incorrect actions can be easily identified; and
      • train operators better for plant emergencies, including the use of simulators.

 

 

Error Reduction Strategies and Error Precursors

In addition, error-reducing strategies can be applied across data center maintenance and operations management to reduce the probability of human error, whether in the design of the data center power and cooling infrastructure or in assessing the risk of a particular maintenance operation (e.g. a power switch-over exercise to perform UPS or back-up generator maintenance). The strategies below should all be applied.

Take, for example, the AWS S3 US-East-1 outage (http://mashable.com/2017/03/02/what-caused-amazon-aws-s3-outage/): the command set was powerful, and a typo could bring down a lot of servers very quickly. AWS said in their post-incident summary (https://aws.amazon.com/message/41926/) that they would limit the speed and scope of the tool's effect, i.e. put in safety checks, which is basically an application of the constraint strategy.
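Below is a minimal sketch of what such a constraint might look like: a hypothetical capacity-removal tool that refuses to act too fast or cut too deep. The limits and function names are illustrative assumptions, not AWS's actual tooling.

```python
# Hypothetical constraint-strategy sketch: safety checks on a capacity-removal command.
MIN_CAPACITY = 50          # assumed: never drop below this many active servers
MAX_REMOVE_PER_RUN = 5     # assumed: limit the blast radius of a single command


class SafetyCheckError(Exception):
    pass


def remove_capacity(active_servers: int, requested: int) -> int:
    """Validate a capacity-removal request; return the number approved for removal."""
    if requested > MAX_REMOVE_PER_RUN:
        raise SafetyCheckError(
            f"Refusing to remove {requested} servers in one run "
            f"(limit is {MAX_REMOVE_PER_RUN}); split into smaller batches.")
    if active_servers - requested < MIN_CAPACITY:
        raise SafetyCheckError(
            f"Removal would leave {active_servers - requested} servers, "
            f"below the safety floor of {MIN_CAPACITY}.")
    return requested


# A fat-fingered "500" instead of "5" is stopped by the first check
# instead of silently taking down a large part of the fleet.
```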

error-reduction-strategy

When a service or repair task is assigned to operations staff or to qualified technicians of an equipment service provider, evaluating whether error precursors exist and eliminating them will reduce the likelihood of human error. For example, the combination of time pressure, an inexperienced staff member already at the end of a long work shift, and an ambiguous task objective all contribute to a higher-risk task. Eliminating or reducing these precursors, for instance by re-directing the task to an experienced staff member at the start of a work shift with a clear task objective, will reduce the risk of the assigned task. A minimal sketch of such a pre-task check follows.
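As a simple illustration (not a standard tool), a pre-task check could score the presence of well-known precursors and flag the task for re-planning when too many are present at once. The precursor names are drawn from the error-precursor reference below; the weights and threshold are assumptions for the sketch.

```python
# Illustrative pre-task error-precursor check; weights and threshold are assumed values.
PRECURSOR_WEIGHTS = {
    "time_pressure": 3,
    "first_time_task": 3,
    "end_of_long_shift": 2,
    "vague_task_objective": 3,
    "distractions_interruptions": 2,
    "unfamiliar_vendor_personnel": 2,
}
REVIEW_THRESHOLD = 5  # assumed: above this, re-plan before proceeding


def assess_task(present_precursors):
    """Sum the weights of the precursors present and recommend an action."""
    score = sum(PRECURSOR_WEIGHTS.get(p, 0) for p in present_precursors)
    if score > REVIEW_THRESHOLD:
        return score, "STOP: eliminate precursors (re-schedule, re-assign, clarify scope)"
    return score, "Proceed with normal supervision"


score, advice = assess_task(["time_pressure", "first_time_task", "end_of_long_shift"])
print(score, advice)  # 8 -> STOP: eliminate precursors ...
```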

error-precursors.jpg

Risk Mitigation is a Continuous Process

A multi-pronged, multi-layered, attention-to-detail approach is required to mitigate the risk of human error causing an outage in a data center facility.

Risk Mitigation Process Flow.jpg

 

A data center should be designed and implemented to a clear and tested design intent (e.g. the objective of being concurrently maintainable). Day in and day out, operations staff, vendors, and client personnel interact with the systems within the data center, so there needs to be a well-oiled system in place, not just documentation, that works 24×7 for as long as the data center is in existence.

An iterative risk mitigation system, relying on consistent management support and attention, with knowledge learned from near misses and incidents, is a key attribute of an environment that is resilient in terms of the human aspect.

We humans can reduce human error, but effort is required

We should look at the data center organization, especially the operations team, its resources and tools, its capability, and so forth. A no-blame culture that encourages active participation by all staff in addressing potential weaknesses or error precursors, and that treats near misses as a sign of error-inducing conditions, is important to mitigate the effects of human error. We should get away from pointing fingers and instead learn from past problems, as AWS did with their incidents. Our data center industry can also do more to share and learn from one another, to prevent recurrence of issues that were faced and dealt with elsewhere.

This built-up knowledge of good practices should be documented and disseminated, with management support. The weakest link is an inexperienced staff member hesitating or, worse, making a wrong decision, so training everyone on the operations team is critical to maintaining data center availability.

A periodic (for example, annual) no-nonsense third-party review of data center operations and management, coupled with improvement plans to strengthen those weakest links, will give insight and assurance to data center C-level executives, operations managers, and clients. Most operations managers are too busy to review their own operations, it is difficult to find your own faults, and experience is limited if staff have not worked in more than one or two data center sites. A third-party operations and management review is therefore the next best way to enhance resilience against human error, provided it has full co-operation from the top to the bottom of the data center staff.

Furthermore, once a data center service provider has grown beyond two or three data centers, it becomes difficult to manage operations consistently across them, especially if they are managed independently. A third-party review applied to all of them will help rein in inconsistent operations processes, subject, of course, to having a central data center operations programme function within the service provider.

Ultimately, therefore, a data center facility depends on well-trained and knowledgeable staff who are clear about their facility's information, or know where to quickly find the documentation that contains the details, and who properly carry out the risk assessment work of evaluating equipment service vendors and upgrade works.

In summary,

  • It is worthwhile to commit resources to reduce errors
  • We can improve our resiliency, and thereby uptime, through available methods and tools
  • There are proven methods and tools we can borrow from other mission critical environments
  • Third party data center operations and management review coupled with improvement plan should be considered for large data center operations especially those that have multiple sites

 

References:

  1. https://en.wikipedia.org/wiki/Human_error_assessment_and_reduction_technique
  2. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  3. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
  4. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  5. https://www.oecd-nea.org/brief/brief-02.html
  6. http://www2.lbl.gov/ehs/training/assets/docs/Error-Precursors.pdf
  7. https://www.linkedin.com/pulse/data-center-human-factor-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  8. https://www.linkedin.com/pulse/human-errors-biggest-challenge-data-center-how-we-can-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

Demonetized currency, re-examining our ports, the winds are changing

singapore-shipyards-map

I sometimes write about wider topics before the weekend.

I travelled to India in late 2016, in the aftermath of the Indian government's sudden announcement that 500 and 1,000 rupee notes would become obsolete within a short period of a month or so. I was only able to exchange a limited amount of new notes (government-directed restrictions and limited supply) at the money changer at the Bangalore airport, barely enough to last the three days I was there to conduct training, and I had to find places that accepted credit cards to preserve whatever rupees I had on me. When I travelled to the Philippines, the same happened with older peso notes, but fortunately the timeframe to demonetize them was longer, so there was not much impact beyond the old pesos my wife had kept no longer being usable. There are good reasons to ban the old notes, and I shan't go into them here. It is the inconvenience and pain that law-abiding citizens and we tourists suffered.

Analog modems and COM ports are gone from computers. OK, pun intended.

I think we need to also look at shipyards and dry docks in Singapore.

Singapore in the 60s and 70s needed all the infrastructure it had inherited from the British to keep everyone at work. The ports are the bedrock on which Singapore's world-leading shipping hub was built, and the shipyards and dry docks created two world-leading oil rig builders: Sembawang Shipyard and Keppel Offshore & Marine.

There is a story about the Saudi government seeking advice from then-PM Lee Kuan Yew (LKY) on the new economic city of King Abdullah Economic City (KAEC). LKY was told that top foreign advisors had advised the Saudi government to line the coastline with industries and build high-class residences on the hills further inland. LKY told them this advice was rubbish; instead, the coastline should be lined with high-class residences. The background is that the oil-producing facilities are on the east coast of the Saudi peninsula, while KAEC is on the west coast and will have neither an oil refinery nor much heavy industry. LKY's advice was adopted and appreciated, given that he had the interests of the Saudis at heart, and he has been viewed as a dear friend ever since.

Singapore's coastline on the east side is lined by East Coast Park. The western end, however, is dominated by ports, docks, and dry docks. There are some high-end condominiums in and around the famous Sentosa area, but the majority of the southern to western coastline is industrial.

There were good reasons to begin with what we had, i.e. the shipyards left by the British armed forces, but the Singapore economy has become less reliant on manpower-intensive industries, large-scale manufacturing has all but shifted out, and Singapore has re-positioned its economy towards more brain-driven sectors such as services, pharmaceuticals and so forth.

It is also strange that two Temasek (i.e. state) owned oil rig builders duke it out in our tiny island state. We don't build two national airlines to compete against one another, so why do we need two oil rig building giants competing against each other and against a world that is cheaper and faster?

A shift of thinking at the big-picture level is needed; nothing should be sacrosanct when refitting our economy's engine. This creates an opportunity to re-draw the Singapore coastline, bring greater value, and strengthen our position as the winds of the new economy blow.

Lastly, to steer this post towards my favorite topic of data centers: some of that land is prime and close to one of the subsea cable landing stations, and would thus give more choices for data center site selection! OK, it is not totally related, but this is my article. 😉

Have a good weekend ahead.


2+2000+3000 = 1 big challenge

thousands-of-app-in-a-bank

You read it right.

2 DCs + 2,000 servers with 3,000 applications are going into a new data center in three years' time, and the man or woman to do it is yet to be found.

I have had the unfortunate experience of a network interruption that caused slow and unacceptable access to hundreds (my estimate at the time) of online applications in a large enterprise data center; then again, we were an internal shared co-location data center, so we only counted projects and departments, never the number of application systems.

 

The picture above is an advertisement for a data center migration project manager. It has been put up repeatedly for more than six months, and I have seen the indicated salary range go up (from 10k SGD per month to now 15k SGD). The scale and complexity are now fully spelled out: the earlier advert did not mention two data centers moving into a new one, and 2,000 servers with 3,000 application systems, all to be moved in (I presume by phases; it would be nearly mission impossible all at once) by 2020. Don't forget that systems and services will not stay frozen during this period, and new systems may still be added to the current data centers given the inter-dependencies of systems and data required to introduce these new systems and services. Hopefully no IP address change is required for any of the systems, and that is only one of many possible things to consider for such a move.

I did one data center move in the mid-1990s, for an enterprise whose one and only mission-critical system was a mini-computer. We exceeded the planned downtime window of 24 hours by an additional 30 hours because the new site's telecommunication lines were digital whereas the old site's were analog, so our analog modem would not work and we had to bring in new ones, while the migration took place on a Sunday when the vendor's warehouse was closed.

On occasion, when talking to data center facility owners, sales people and fellow consultants about the mission-critical nature of IT for most of today's enterprises, I have mentioned that hundreds of applications are in use, on average, by medium to large enterprises.

3,000 applications is one big number. I hope only 10% of them are mission critical, and that the entire application portfolio is prioritized and its inter-dependencies already mapped out.
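To give a feel for what "inter-dependencies mapped out" can translate into, here is a minimal sketch, with hypothetical application names and a made-up dependency map, of grouping applications into migration waves so that nothing moves before the systems it depends on:

```python
# Hypothetical dependency map: app -> set of apps it depends on.
# In a real migration this would come from a CMDB or discovery tooling.
DEPENDS_ON = {
    "internet-banking": {"core-banking", "auth-service"},
    "mobile-banking": {"core-banking", "auth-service"},
    "reporting": {"core-banking"},
    "core-banking": {"customer-db"},
    "auth-service": {"customer-db"},
    "customer-db": set(),
}


def migration_waves(depends_on):
    """Group applications into waves so that each application moves only
    after everything it depends on has already moved."""
    remaining = dict(depends_on)
    moved = set()
    waves = []
    while remaining:
        ready = [app for app, deps in remaining.items() if deps <= moved]
        if not ready:
            raise ValueError(f"Circular dependency among: {sorted(remaining)}")
        waves.append(sorted(ready))
        moved.update(ready)
        for app in ready:
            del remaining[app]
    return waves


for i, wave in enumerate(migration_waves(DEPENDS_ON), start=1):
    print(f"Wave {i}: {', '.join(wave)}")
# Wave 1: customer-db / Wave 2: auth-service, core-banking / Wave 3: the rest
```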

When a bank's data center runs into problems (see the reference section below), what we see externally is ATMs being down, counter staff switching over to back-up systems, and service becoming slower. What really happens behind the scenes involves a lot more effort to bring the critical applications back into service.

What makes me wonder, though: shouldn't the bank have identified such a role and brought the person in earlier, during the decision-making process for the new data center? Wouldn't this person, or better still a team of data center migration experts, be better placed in all sorts of ways than someone brought in from outside to manage the knowledge and thereby mitigate the migration risks? I am pretty sure that the local financial regulator will dive in to audit and assess the bank's migration plan.

Anyway, I have learnt that such numbers (thousands of servers and applications) are probably typical for a Singapore bank.

Best of luck to their data center migration.

Reference:

  1. http://www.datacenterknowledge.com/archives/2013/12/16/year-downtime-top-10-outages-2013/
  2. https://www.theregister.co.uk/2017/01/13/lloyds_bank_in_talks_to_outsource_bit_barns_to_ibm/
  3. https://www.linkedin.com/pulse/20140616192008-655694-best-practices-for-data-center-migration

Can you help us build a tier 5 data center?

the-data-center-build-photo

A data center consultant, K, told me this story. Around 2005 or 2006 he gave a talk at a data center conference in a famous financial and resort city somewhere in Asia. A gentleman, J, walked up to K afterwards and introduced himself as a property developer looking into building a new data center. K's talk had covered data center standards, mentioning the Uptime Institute Tiers and TIA-942, and J said he wanted to build a Tier 5 data center.

As an aside, let me defer to other posts and websites on the design standards and Tier level / Rated / Facility Class (see references 1 and 2). Generally speaking, most standards define data center design by the resiliency required, in up to four levels.

K was taken aback and asked whether J was aware that the Tier levels top out at IV / 4. J said he knew, and that he wanted to go one better than Tier IV / 4. J shared that since the city where he planned to build the new data center had no standalone data center facility, he wanted to stand out, and that city is well known for extravagant hotels, malls and the like.

The "build them and they will come" idea

K was kind enough to ask J whether he had done a market study and knew whether potential clients demanded a highly resilient, fault-tolerant data center. J replied that he had not, but he thought demand would rush in for his data center once it was announced that such a facility would be built. Well, maybe, if you have done your study and know where the competition stands, for starters. But if you have not studied market demand and competition, then what you build may be overbuilt, or so far ahead of demand that it takes far longer than your optimistic timeframe to sell.

I have on multiple occasions met potential data center owners considering building their first data center in a non-first-tier data center market in Asia. Surprisingly, a common central theme of their plans hinges on the "build them and they will come" mindset. Today, several Asian cities are in over-supply, not only in the residential and industrial sectors but also in the data center sub-sector, and over-confidence that demand will come once supply is there is one contributor to the situation. A data center facility is a huge investment. I know of a China data center company with a well-sought-after facility in Beijing that expanded into other cities it was less familiar with, suffered losses for years, and, with its overall finances dragged down, was forced to sell its crown jewel under less-than-preferred circumstances and numbers.

Client needs and supply / demand

I have two points to make. Firstly, know your market and competition, and your financial strength. If all your competitors in the market are building for shared-hosting-type clients, who only demand a UPS-backed electrical supply to their IT servers, then building to a higher level of resiliency makes your data center space pricier and it will take longer to fill up, if it ever does. There were a few such cases in Singapore: some folded after building a data center, and some spent millions of dollars on projects that could not take off and are now in limbo. Many such cases also exist in China. One case in Singapore prevailed: they built their data center during the dot-com boom and were caught in the dot-com bust, which claimed several casualties, but this data center managed to survive by building up its facility floor by floor, unlike the other two, thus putting less strain on its finances during that period.

More prudent to match cost outlay to take-up

Secondly, the main technical infrastructure design parameter of whether to build to concurrent maintainability (roughly equivalent to Tier III / Rated 3 / Facility Class 3) or fault tolerance (Tier IV / Rated 4 / Facility Class 4) depends on client demand. If the target clientele are financial institutions, or organizations that for various reasons rely on IT systems that can only run on a single host or in an active-passive set-up (airline ticket reservation systems seem to be like that), then it makes sense. Another way is to plan for multiple levels of resiliency, i.e. share the same fault-tolerant electrical infrastructure but stay flexible enough to accommodate either concurrently maintainable or fault-tolerant client demand (although this will generally be slightly more costly than a purely concurrently maintainable design and implementation).

Fortunately, these days there is so much information in the market that new owners-to-be are better informed. My other gripe is with those who know a little about one particular topic of data center knowledge and yet are so convinced of it that meaningful exchange becomes impossible, but that is a story for a future post.

Reference:

  1. http://www.datacenterknowledge.com/archives/2016/01/06/data-center-design-which-standards-to-follow/
  2. https://uptimeinstitute.com/tiers
  3. https://www.linkedin.com/pulse/data-center-tiers-tears-plus-minus-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  4. https://www.linkedin.com/pulse/making-sense-data-center-standards-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

The problem with the (use of) PUE in the Data Center industry

Montage-data-centers.jpg

I mentioned in a previous post the reporting and use of PUE, including the terms iPUE, PUE3, dPUE, etc. (https://www.linkedin.com/pulse/data-center-resource-efficiency-pue-isoiec-30134-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F).

Much like Tier / Facility Class / Rated levels, which are mentioned fuzzily in the industry without making clear which standard a facility is designed to, or whether it is certified at all, the confusion does not help potential clients or the industry as a whole. Just to clarify, I take no issue with a data center saying that its facility is designed in accordance with a particular standard, given that any potential client should, and will, review and audit the facility in detail before committing to a co-location deal.

The issue I would like to highlight in this post is the use of designed PUE (dPUE), rather than measured PUE, to market facilities or even to set policy. dPUE is itself an estimate (as per the example case in ISO 30134-2) and imprecise. The gap between the actual PUE3 and the dPUE can be huge, given that the IT load of any new data center facility will normally not ramp up to near 100% for quite some time.
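To illustrate why partial load matters, here is a minimal sketch using a simple fixed-plus-proportional overhead model; the capacity and overhead figures are made-up assumptions, not measurements from any real facility.

```python
def pue(it_load_kw, fixed_overhead_kw, proportional_overhead):
    """PUE = total facility power / IT power.
    fixed_overhead_kw: losses that do not scale with IT load
    (transformer and UPS standby losses, lighting, always-on fans).
    proportional_overhead: overhead per kW of IT load (cooling, distribution)."""
    total_kw = it_load_kw + fixed_overhead_kw + proportional_overhead * it_load_kw
    return total_kw / it_load_kw


DESIGN_IT_KW = 2000     # hypothetical IT design load
FIXED_KW = 300          # assumed fixed losses
PROPORTIONAL = 0.20     # assumed load-proportional overhead

for pct in (10, 25, 50, 75, 100):
    it_kw = DESIGN_IT_KW * pct / 100
    print(f"{pct:>3}% IT load -> PUE ~ {pue(it_kw, FIXED_KW, PROPORTIONAL):.2f}")
# At 100% load this model gives about 1.35 (a dPUE-like figure),
# but at 25% load, typical of a new facility, it is about 1.80.
```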

This encourages the owner of a yet-to-be-built data center to claim a low dPUE. It is an estimate, after all; who is to say the figure of 1.13 is wrong? You want to check my calculations? Talk to my design consultants, who are the ones who worked out that number (at my insistence that they assume the best-case situation to arrive at a low dPUE).

The ban announced by Beijing on new data centers with a PUE of 1.5 or above really means designed PUE. Given that it is a designed figure, a lot can go into estimating a low dPUE. Who is going to shut off the power after the facility is designed, the equipment selected, and the site built and operating at well below full capacity, thus yielding a poor actual interim PUE? There are many ways to make the dPUE figure work to your advantage. See reference 1.

You may ignore ancillary power usage, give a very low predicted mechanical load, or cite the most power-efficient chiller in the design but choose a less efficient one when you purchase the actual equipment. Or you may base your dPUE on the PUE1 or PUE2 way of calculating, which makes it look slightly better. It all adds (or subtracts) up.

pue-at-design-load-chart-2

Credit: CCG Facilities. http://www.ccgfacilities.com/insight/detail.aspx?ID=18

From my experience of operating and auditing more than a dozen data centers, I have seen very crude designed-PUE estimations and some better ones.

The thing is that the designed PUE always looks too good, and that stems from:

  • Not including some of the data center infrastructure losses
  • Not including electricity losses in the cables (3%)
  • Assuming installed equipment performs exactly to factory specifications
  • Estimating at the PUE1 measurement point, i.e. the UPS output, whereas PUE2 or PUE3 is the recommended way (the sketch after this list illustrates the difference)
  • Environmental conditions over 12 months in a real data center will be sub-optimal compared with design assumptions
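Here is a minimal sketch of how the measurement point alone shifts the figure; the energy numbers are illustrative assumptions, not from any standard or real facility.

```python
# PUE category comparison: same facility, different IT energy measurement points.
TOTAL_FACILITY_KWH = 10_000  # everything the utility bills for the period
UPS_OUTPUT_KWH = 7_000       # PUE1: IT energy measured at the UPS output
PDU_OUTPUT_KWH = 6_800       # PUE2: measured at the PDU output
IT_INPUT_KWH = 6_600         # PUE3: measured at the IT equipment input

for name, it_kwh in (("PUE1", UPS_OUTPUT_KWH),
                     ("PUE2", PDU_OUTPUT_KWH),
                     ("PUE3", IT_INPUT_KWH)):
    print(f"{name}: {TOTAL_FACILITY_KWH / it_kwh:.2f}")
# PUE1 1.43 < PUE2 1.47 < PUE3 1.52: the further upstream the IT energy is
# measured, the more of the distribution losses are counted as "IT" and the
# better the facility appears.
```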

A friend of mine who works for a data center co-location service provider laments that their honesty earned them a lower category in a green data center award than others in the same city who claimed lower dPUE figures and got higher awards. It may not be entirely due to the lower dPUE figures, but they play a part.

Clients are not fools, and a data center co-location service provider that claims such a low dPUE will find it tougher to negotiate co-location service contracts, as the power bill recovery in some countries is tied to the actual PUE but related to the dPUE when closer to full utilization. This will eat into their profits.

Ultimately, it is the real PUE3, measured over 365 days at the current client IT power load, that matters, along with a 100% leased-out co-location data center, which means full endorsement by the clients. Nothing speaks better than the ka-ching at the cash register; no billboard outside will take money out of the wallets of potential clients. It is the design, equipment selection, measurement and reporting, tight operations, continuous monitoring and enhancement, and people that all combine into a well-run and well-respected data center facility with a happy clientele that grows the co-location business. Playing with dPUE gets some attention, but delivering the service consistently and having clients take up more of your data center space is the indicator of a healthy data center business.

It is my hope that awards for energy-efficient data centers will be based on actual PUE rather than designed PUE.

Reference:

  1. http://www.ccgfacilities.com/insight/detail.aspx?ID=18
  2. https://www.greenbiz.com/article/new-efficiency-standard-challenges-data-center-status-quo
  3. http://www.datacenterknowledge.com/archives/2009/07/13/pue-and-marketing-mischief/
  4. ISO/IEC 30134-2, Part 2: Power Usage Effectiveness (PUE) – http://www.iso.org/iso/home/store/catalogue_tc/catalogue_tc_browse.htm?commid=654019

Human Error: The biggest challenge to data center availability and how we can mitigate it – Part 1

it_engineer_with_server

The 2016 Ponemon Institute research report on the cost of downtime (reference 1) contains a chart showing the causes of data center downtime. It classifies accidental human error at 22%, and the top six contributors to downtime as UPS system failure (25%), cyber crime (22%), accidental human error (22%), water/heat/CRAC failure (11%), weather-related events (10%), and generator failure (6%). However, the accidental human error figure does not account for latent human error that could have contributed to those UPS/CRAC/generator failures.

ponemon2016

The Uptime Institute has cited that 70% of data center outages can be attributed to human error.

The definition of human error is broader and can generally be classified into Active Error (where a deliberate action causes a deviation from the expected outcome) and Latent Error (where a non-deliberate action causes a deviation from the expected outcome). For example, a design decision on the power protection circuit for a data center room becomes a latent error if the protection is not fully co-ordinated to isolate a fault and prevent a power issue from cascading upstream to higher-level circuit breakers.

There have been many major outages in the past few years attributed to human error. The 2016 Delta Air Lines data center outage reportedly cost them USD 150 million. Part of the long delay (three days) in resuming service was that a significant part of their IT infrastructure was not connected to a backup power source, which begs the question of why it was set up that way. It was most likely due to latent error, where the IT equipment or the in-rack PDUs were not fed from two separate UPS sources or supported by an in-rack ATS.

I was asked during my presentation on this subject whether a higher Tier level, i.e. a data center designed and implemented for higher resiliency, can minimize the issue of human error. My answer is that you can design and implement 2N power and cooling infrastructure, but when one N is taken down for maintenance, any mistake or weakness (inexperienced operations staff or vendor personnel, a procedure gap where human nature overlooked something and guessed wrongly, etc.) can take down the IT load. This has happened to many data centers (do a Google search on human error and data center outage incidents).

swiss_cheese_model_of_accident_causation

There are multiple ways for human error to manifest in a data center outage: a simple external trigger that passes through loopholes, like the Swiss cheese model above, a cascade (combination of factors), or direct active human error.

As an example of a cascade, a lightning strike causing a momentary power dip (see reference) should not cause an outage in a data center; however, if the selection of circuit protection devices or the design did not cater for how the DRUPS would respond in such a situation, and the automated controls were not configured to deal with it, then no amount of SOPs/MOPs/EOPs or Method of Statement-Risk Assessment (MOS-RA) may protect the facility against that particular external trigger. In one case, at a data center in Sydney, circuit breakers that were not designed and selected for such a scenario caused the UPS to feed the grid instead of the load.

As for direct human error, I also know of a case where a UPS manufacturer's trained and authorized service engineer caused an outage: the engineer did not follow the documented service manual and caused the entire UPS set to trip, and because the circuit protection devices were unable to isolate the fault downstream, the upstream incoming breaker tripped as well. This is part of the reason why data center staff should accompany and question the service engineer at critical check-points during servicing of critical infrastructure.

An outage can also be a failure of a resilient design or implementation due to under-capacity. This can be traced to latent error (no tracking of actual power load versus designed capacity) or active error (no checking of UPS capacity before maintenance). For example, the actual power load on an N+1 UPS set may have grown until it is effectively only N, and when one of the UPS units went down, the entire UPS set shut down. A simple capacity check of the kind sketched below can catch this before maintenance starts.
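Here is a minimal sketch of such a pre-maintenance redundancy check; the unit ratings, derating factor, and load figures are assumptions for illustration only.

```python
import math


def spare_units(unit_rating_kw, units_installed, actual_load_kw, derating=0.9):
    """Return how many UPS units can be lost while still carrying the load.
    A result of 0 means the 'N+1' set has silently degraded to plain N.
    `derating` is an assumed usable fraction of the nameplate rating."""
    usable_per_unit = unit_rating_kw * derating
    required_units = math.ceil(actual_load_kw / usable_per_unit)
    return units_installed - required_units


# Hypothetical N+1 design: 4 x 500 kW modules, with the load grown to 1,400 kW.
spare = spare_units(unit_rating_kw=500, units_installed=4, actual_load_kw=1400)
print(f"Spare UPS units at current load: {spare}")  # 4 - ceil(1400/450) = 0
# Spare = 0: taking any one unit down for maintenance leaves no margin,
# which is exactly the situation described above.
```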

In the next post, measures to mitigate the risk of human error will be discussed.

References:

  1. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  2. https://aws.amazon.com/message/4372T8/
  3. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  4. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/