A list of data center mechanical and electrical design consulting firms in Singapore

On two separate occasions, a common question came up: which data center mechanical and electrical design consulting firms can they invite for submissions, as they knew of at most one or two.

Well, quite a lot, apparently. The firms listed below have each done at least one data center project or data center technical review in Singapore:

  1. ARUP
  2. Aurecon
  3. Bescon Consulting Engineers
  4. Cundall
  5. DSCO
  6. HurleyPalmerFlatt
  7. I 3 Critical Facilities
  8. J Roger Preston
  9. M+W Group
  10. Meinhardt
  11. NTT Facilities (formerly Pro-Matrix)
  12. Plan One Engineering Services
  13. RED Engineering
  14. SJ Thames
  15. TW International Counsel
  16. Wah Loon Engineering
  17. Worley Parsons

I am not associated with any of the above companies in a business capacity.

You can easily obtain each company’s contact information by searching Google for its website. If you would like another mechanical and electrical design consulting firm to be listed, please let me know, together with a project reference (just the project name), and I will update the list as and when I am able to.

Reference:

  1. http://www.datacenterjournal.com/selecting-a-data-center-consultant/
  2. https://www.linkedin.com/pulse/making-sense-data-center-standards-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  3. https://www.bca.gov.sg/PanelsConsultants/panels_consultants.html

HUMAN ERROR: The Biggest Challenge to Data Center Availability and how we can mitigate it – Part 2

The previous article on this topic can be found via this link.

The layered approach to upholding data center infrastructure availability should not look like Swiss cheese, i.e. a hazard or trigger should eventually be stopped by one of the layers, preferably as early as possible.

[Figure: Swiss cheese model]

The layers should include the following:

  • Design (in accordance with the owner’s design intent) with either a concurrent maintainability or a fault tolerance objective
  • Implementation (in accordance with the design brief), fully tested through a comprehensive testing and commissioning phase before handover, with fully documented SOPs/MOPs/EOPs
  • Maintenance and Operations Management, with any work by equipment service providers or any other work on site carried out under a Method of Statement and Risk Assessment matrix by suitably qualified persons
  • Incident and problem management process, escalation management and mitigation process

and so forth

Inadequacy in any of these layers can result in problems such as:

  • Inherent Design / Setting flaw
    • Outdated / swiss cheese situation
    • Requires analysis and manual intervention
    • Error Producing Conditions (EPC)
  • Weakness in manual processes
    • Inadequate automation
    • Inadequate training / familiarity
    • Inadequate operations procedures
  • Insufficient Information / knowledge
    • Capacity limit reached earlier than design intent
    • Inadequate training / knowledge
    • Inadequate documentation
  • Insufficient Risk Assessment
    • MOS / RA, risk matrix
    • Vendor experience

 

Learn from other industries

Our data center industry is a relatively young industry and there are other industries with mission critical infrastructure that have undergone extensive research and iterative enhancements which we can learn from and adopt.

  • Airline’s Crew Resource Management
    • Checklists and double checking of the airplane’s airworthiness by pilot and co-pilot
    • Communication within the cockpit, and between the cabin staff and the cockpit, to ensure timely and prioritized response
  • US Nuclear Regulatory Commission
    • Standardized Plant Analysis Risk – Human Reliability Analysis (SPAR-H) method to take account of the potential for human error
  • OECD’s Nuclear Energy Agency
    • Ways to avoid human error, e.g.,
      • designing systems to limit the need for human intervention;
      • distinctive and consistent labelling of equipment, control panels and documents;
      • displaying information on the state of the plant so that operators do not need to guess and risk a faulty diagnosis;
      • designing systems to give unambiguous responses to operator actions so that incorrect actions can be easily identified; and
      • better training of operators for plant emergencies, including the use of simulators

 

 

Error Reduction Strategy and Error Predictor

In addition, error reduction strategies can be applied in all areas of data center maintenance and operations management to reduce the probability of human error. Whether in the design of the data center power and cooling infrastructure, or in determining the risk of a particular maintenance operation (e.g. a power switch-over exercise to perform UPS or back-up generator maintenance), all of the strategies below should be applied.

Take for example the AWS US-East-1 outage incident (http://mashable.com/2017/03/02/what-caused-amazon-aws-s3-outage/). The command set was powerful, and a typo could bring down a lot of servers very quickly. AWS said in its post-incident summary (https://aws.amazon.com/message/41926/) that it will limit the speed of effect of the command and tool, i.e. put in a safety check, which is basically an application of the constraint strategy.

[Figure: error reduction strategies]
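To make the constraint strategy concrete, here is a minimal Python sketch of a guard on a capacity-removal command. It is not AWS’s actual tooling: the function name `plan_removal`, the parameters `min_required` and `max_per_step`, and the numbers in the usage note are all assumptions for illustration. It applies two constraints: refuse any request that would leave too few servers, and break an accepted request into small batches so that its effect is slow.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request would breach the capacity floor."""


def plan_removal(active, requested, min_required, max_per_step):
    """Split a removal request into small batches, or refuse it outright.

    All parameters are hypothetical: `active` servers are in service,
    `requested` is how many the operator typed, `min_required` is the floor
    the service needs, and `max_per_step` caps each batch.
    """
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step must be > 0")
    # Constraint 1: never go below the minimum required capacity.
    if active - requested < min_required:
        raise CapacityGuardError(
            f"removing {requested} of {active} servers would leave fewer "
            f"than the required minimum of {min_required}"
        )
    # Constraint 2: limit the speed of effect by working in small batches.
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(remaining, max_per_step)
        batches.append(step)
        remaining -= step
    return batches
```

With 100 servers, a floor of 60 and batches of at most 10, `plan_removal(100, 25, 60, 10)` returns `[10, 10, 5]`, while a typo such as 50 instead of 5 raises `CapacityGuardError` before anything is touched.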

When a service or repair task is assigned to operations staff, or to qualified technicians of an equipment service provider, evaluating whether error precursors exist and eliminating them will reduce the likelihood of human error. For example, the combination of time pressure, inexperienced staff already at the end of a long work shift, and an ambiguous task objective all contribute to a higher risk for the assigned task. Eliminating or reducing these precursors, and re-directing the task to experienced staff at the start of their work shift with a clear task objective, will reduce that risk.

[Figure: error precursors]
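The precursor check described above can be sketched as a simple pre-task screen. This is only an illustration: the four fields mirror the example in the text, and the class and function names are hypothetical, not taken from any published checklist.

```python
from dataclasses import dataclass


@dataclass
class TaskAssignment:
    """Hypothetical record of the conditions under which a task is assigned."""
    time_pressure: bool
    inexperienced_staff: bool
    end_of_long_shift: bool
    ambiguous_objective: bool


def present_precursors(task):
    """Return the names of the error precursors that apply to this task."""
    return [name for name, present in vars(task).items() if present]


# The risky assignment from the example: every precursor is present.
risky = TaskAssignment(True, True, True, True)
# The same task re-directed to experienced staff at the start of a shift,
# with a clear objective and no time pressure.
mitigated = TaskAssignment(False, False, False, False)
```

A supervisor could then hold back any task for which `present_precursors` returns a non-empty list until the listed conditions are eliminated or reduced.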

Risk Mitigation is a Continuous Process

A multi-pronged, multi-layered approach with attention to detail is required to mitigate the risk of human error causing an outage in a data center facility.

[Figure: risk mitigation process flow]

 

A data center should be designed and implemented to a set of clear and tested design intents (e.g. the objective of the data center being concurrently maintainable). Day in and day out, the operations staff, vendors, and client personnel interact with the systems within the data center. So there needs to be a well-oiled system in place, not just documentation, that works 24×7 for as long as the data center is in existence.

An iterative risk mitigation system that relies on consistent management support and attention, and on knowledge learned from near misses and incidents, is the key attribute of an environment that is resilient in terms of the human aspect.

We Humans can reduce Human Error, with effort

We should look at the data center organization, especially the operations team, its resources and tools, its capability, and so forth. A no-blame culture that encourages active participation by all staff in addressing potential weaknesses or error precursors, and in addressing near misses (which are a sign of error-inducing conditions), is important to mitigate the effects of human error. We should get away from pointing fingers and learn from past problems, like what AWS did with their incidents. And our data center industry can do more to share and learn from one another, to prevent the recurrence of issues that were faced and dealt with elsewhere.

This built-up knowledge of good practices should be documented and disseminated, with management support. The weakest link is an inexperienced staff member hesitating or, worse, making a wrong decision, so training everyone on the operations team is critical to maintaining the availability of the data center.

A periodic (for example, annual) no-nonsense third party data center operations and management review, coupled with improvement plans to strengthen the weakest links, will give more insight and assurance to data center C-level executives, data center operations managers, and clients. Most operations managers will be too busy to review their own data center operations. They are also in the difficult position of having to find their own faults, and their staff may have limited experience if they have not worked in more than one or two data center sites. A third party operations and management review is therefore the next best thing to enhance resilience against human error, provided it has the full co-operation of the data center staff from top to bottom.

Furthermore, once a data center service provider has grown beyond 2 to 3 data centers, it will be difficult to manage operations consistently across them, especially if they are managed independently. A third party review applied to all of them will help to rein in inconsistent operations processes, subject to there being a central data center operations programme function within the data center service provider, of course.

Therefore, a data center facility is ultimately dependent on well trained and knowledgeable staff who are clear about their data center facility information or know where to quickly find the documentation that contains the detailed information, and who properly do the risk assessment work of evaluating equipment service vendors or upgrade works.

In summary,

  • It is worthwhile to commit resources to reduce errors
  • We can improve our resiliency, and thereby uptime, through available methods and tools
  • There are proven methods and tools we can borrow from other mission critical environments
  • Third party data center operations and management review coupled with improvement plan should be considered for large data center operations especially those that have multiple sites

 

References:

  1. https://en.wikipedia.org/wiki/Human_error_assessment_and_reduction_technique
  2. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  3. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
  4. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  5. https://www.oecd-nea.org/brief/brief-02.html
  6. http://www2.lbl.gov/ehs/training/assets/docs/Error-Precursors.pdf
  7. https://www.linkedin.com/pulse/data-center-human-factor-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  8. https://www.linkedin.com/pulse/human-errors-biggest-challenge-data-center-how-we-can-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

Can you help us build a tier 5 data center?

A data center consultant, K, told me this story. Around 2005 or 2006, he gave a talk at a data center conference in a famous financial and resort city somewhere in Asia. A gentleman, J, walked up to K afterwards and introduced himself as a property developer who was looking into building a new data center. K’s talk had been about data center standards and had mentioned the Uptime Institute Tiers and TIA-942, and J said he wanted to build a Tier 5 data center.

As an aside, let me defer to other posts/websites on the design standards and Tier level / Rated / Facility Class (see references 1 and 2). Generally speaking, most define data center design by the resiliency required, in up to four levels.

K was taken aback and asked if J was aware that the Tier levels top out at IV / 4. J said he knew, and that he wanted to go one better than Tier IV / 4. J shared that the city where he planned to build the new data center did not yet have any standalone data center facility, that he wanted to stand out, and that the city is well known for its extravagant hotels, malls and such.

The idea of “build them and they will come”

K was kind enough to ask J if he had done a market study and knew whether potential clients demanded a highly resilient and fault tolerant data center. J replied that he had not, but he thought that demand would rush in once it was announced that such a data center would be built. Well, maybe, if you have done your study and know where the competition stands, for starters. But if you have not studied market demand and competition, then what you build may be over-built, or way ahead of demand, and will take longer than your optimistic timeframe to sell.

I have on multiple occasions met potential data center owners who were considering building their first data center in a non-first-tier data center market in Asia. Surprisingly, a common central theme of their plans is a “build them and they will come” mindset. Today, several Asian cities are in over-supply not only in the residential / industrial sectors but also in the data center sub-sector, and over-confidence that demand will come once supply is there is one contributor to the situation. A data center facility is a huge investment. A China data center company I know had a well sought-after data center facility in Beijing, but expanded into other cities it was less familiar with and suffered losses for years. This dragged down its overall finances, and it was forced to sell its crown jewel under less than preferred circumstances and at less than preferred numbers.

Client needs and supply / demand

I have two points to make. Firstly, know your market and competition, and your financial strength. If all your competitors in the market are building for shared hosting, which only demands a UPS-backed electrical supply to the IT servers, then building to a higher level of resiliency makes your data center space more pricey, and it will take longer to fill up, if ever. There were a few such cases in Singapore: some folded after building a data center, and some spent millions of dollars on a project that could not take off and is now in limbo. Many such cases also exist in China. One case in Singapore did prevail. They built their data center during the dot-com boom but were caught in the downturn of the dot-com bust, which had several casualties. This one data center managed to survive by building up its facility on a floor-by-floor basis, unlike the other two, which put less demand on its finances during that period.

More prudent to match cost outlay to take-up

Secondly, the main technical infrastructure design parameter, i.e. whether to build to concurrently maintainable (roughly equivalent to Tier III / Rated 3 / Facility Class 3) or fault tolerant (Tier IV / Rated 4 / Facility Class 4), depends on client demand. If the target clientele are financial institutions, or organizations that for various reasons are reliant on IT but whose systems can only run on a single host/system or in an active-passive set-up (it seems airline ticket reservation systems are like that), then fault tolerance makes sense. Another way is to plan for multiple levels of resiliency, i.e. share the same fault tolerant electrical infrastructure but keep it flexible enough to accommodate either the concurrently maintainable or the fault tolerant demand of a client (although generally this will be slightly more costly than a facility purely designed and implemented to be concurrently maintainable).

Fortunately, these days there is so much information in the market that new owners-to-be are better informed. My other gripe is with those who know a little about one particular data center topic and yet are so convinced of it that it precludes meaningful exchange, but that is a story for a future post.

Reference:

  1. http://www.datacenterknowledge.com/archives/2016/01/06/data-center-design-which-standards-to-follow/
  2. https://uptimeinstitute.com/tiers
  3. https://www.linkedin.com/pulse/data-center-tiers-tears-plus-minus-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  4. https://www.linkedin.com/pulse/making-sense-data-center-standards-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

The problem with the (use of) PUE in the Data Center industry

In a previous post, I discussed the reporting and use of PUE, including the terms iPUE, PUE3, dPUE etc. (https://www.linkedin.com/pulse/data-center-resource-efficiency-pue-isoiec-30134-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F).

PUE suffers from the same fuzziness as Tier / Facility Class / Rated levels, which are often mentioned in the industry without making clear which standard the facility is designed to or whether it is certified. This confusion does not help potential clients or the industry as a whole. Just to clarify, I take no stand against any data center saying that its facility is designed in accordance with a particular standard, given that any potential client should and will make a detailed review and audit of the facility before committing to a co-location deal.

The issue that I would like to highlight in this post is the use of designed PUE (dPUE) instead of PUE for marketing or even for setting policy. dPUE is itself an estimate (as per the example case in ISO/IEC 30134-2) and is imprecise. The gap between actual PUE3 and dPUE can be huge, given that the IT load of a new data center facility will normally not ramp up to near 100%.
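A rough way to see why the gap opens up at partial load is the simplified model below. It is a sketch under stated assumptions, not the ISO/IEC 30134-2 method: it uses a snapshot of power rather than energy measured over a year, it assumes the overhead splits into a fixed part and a part proportional to IT load, and the function name and all figures are made up for illustration.

```python
def pue(it_kw, fixed_overhead_kw, proportional_overhead):
    """PUE = total facility power / IT power, under a simplified overhead model.

    fixed_overhead_kw: overhead drawn regardless of IT load (assumed figure).
    proportional_overhead: overhead that scales with IT load, as a fraction
    of IT power (assumed figure).
    """
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    total_kw = it_kw + fixed_overhead_kw + proportional_overhead * it_kw
    return total_kw / it_kw


# Hypothetical facility designed for 1000 kW of IT load.
design_pue = pue(1000, fixed_overhead_kw=80, proportional_overhead=0.10)
# The same facility running at only 30% of its design IT load.
interim_pue = pue(300, fixed_overhead_kw=80, proportional_overhead=0.10)
```

In this made-up case the figure at design load is 1.18, but at 30% load the fixed overhead is spread over far less IT power and the result rises to about 1.37.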

This encourages the owner of a yet-to-be-built data center to claim a low dPUE. You know, it is an estimate, so who is to say that the figure of 1.13 is wrong? You want to check my calculations? Talk to my design consultants, who are the ones that worked out that number (at my insistence that they assume the best case situation to come up with a low dPUE).

The ban announced by Beijing on new data centers with a PUE of 1.5 or above really means designed PUE. Given that it is a designed PUE, a lot can go into estimating a low dPUE. Who is going to shut off the power after the data center facility has been designed, its equipment selected, and it is built and operating at well below full capacity, thus yielding a bad actual interim PUE? There are many ways to make the dPUE figure work to your advantage. See reference 1.

You may ignore ancillary power usage, or give a very low predicted power usage for the mechanical load, or cite the most power-efficient chiller in the design but choose a less efficient one when you purchase the actual equipment. Or you may base your dPUE on the PUE1 or PUE2 way of calculating, which makes it look slightly better. They all add (or subtract) up.

[Figure: PUE at design load chart]

Credit: CCG Facilities. http://www.ccgfacilities.com/insight/detail.aspx?ID=18

From my experience of operating and auditing more than a dozen data centers, I have seen some very crude designed PUE estimations and some better ones.

The thing is that the designed PUE always looks too good, and this stems from:

  • Not including some of the data center infrastructure losses
  • Not including electricity losses in the cables (3%)
  • Not allowing for the tolerance of installed equipment relative to factory specifications
  • Estimating using the PUE1 situation, i.e. measuring at the UPS output, whereas PUE2 or PUE3 is the recommended way
  • Not allowing for the different environmental conditions over 12 months in a real data center, which will be sub-optimal

A friend of mine who works at a data center co-location service provider laments that their honesty has placed them in a lower category in a green data center award than others in the same city that claimed lower dPUE figures and got higher awards. It may not be completely due to the lower dPUE figures, but they play a part.

The clients are not fools, and a data center co-location service provider that claims such a low dPUE will find it tougher to negotiate co-location service contracts, as power bill recovery in some countries is tied to the actual PUE at first but is related to the dPUE once the facility is closer to full utilization. This will eat into their profits.

Ultimately, what matters is the real PUE3 measured over a period of 365 days at the current client IT power load, and a 100% leased-out co-location data center, which means full endorsement by the clients. Nothing speaks better than ka-ching at the cash registers; no amount of billboards outside will take money out of the wallets of potential clients. It is the design, equipment selection, measurement and reporting, tight operations, continuous monitoring and enhancement, and people that all combine into a well-run and well-respected data center facility with a happy clientele that grows the co-location business. Playing with dPUE gets some attention, but delivering the service consistently and having clients that take up more of your data center space is the indicator of a healthy data center business.

It is my hope that awards for energy efficient data centers will be based on actual PUE rather than designed PUE.

Reference:

  1. http://www.ccgfacilities.com/insight/detail.aspx?ID=18
  2. https://www.greenbiz.com/article/new-efficiency-standard-challenges-data-center-status-quo
  3. http://www.datacenterknowledge.com/archives/2009/07/13/pue-and-marketing-mischief/
  4. ISO/IEC 30134-2    Part 2, Power Usage Effectiveness (“PUE”) – http://www.iso.org/iso/home/store/catalogue_tc/catalogue_tc_browse.htm?commid=654019

Data Center Tiers, No Tears, No Plus or Minus

Background

Press releases, promotional material and websites of some data center service providers often carry terms such as Tier 3+, Tier 4-, or Tier 3.5. This is intended to give the reader the impression that the facility has a higher level of resiliency in terms of design or implementation.

What’s in a Tier/Rated/Facility-Class

The Tier Classification System is trademarked by the Uptime Institute (UTI). In a nutshell, UTI will assess and award the appropriate Tier level if a data center facility owner or private data center client engages UTI to perform such an evaluation. UTI issues the Tier levels in Roman numerals I/II/III/IV. https://journal.uptimeinstitute.com/explaining-uptime-institutes-tier-classification-system/

The Telecommunications Industry Association, which is an American organization that issues telecommunications cabling and telecommunications facility standards, issued ANSI/TIA-942-A which is titled “Telecommunications Infrastructure Standard for Data Centers”, of which the latest 2014 edition contains three informative annexes (D, E, F) on data center space considerations, site selection and building design considerations, and data center infrastructure rating. Using the informative annexes of TIA-942-A, a data center facility can be rated according to four categories (Telecommunications, Architectural and Structural, Electrical, and Mechanical) to be Rated 1 – Basic, 2 – Redundant Component, 3 – Concurrently Maintainable, and 4 – Fault Tolerant.

The EN-50600 standard classifies a data center in a similar manner to TIA-942-A, but adds a Facility Class 0, i.e. FC-0, while FC-1 through FC-4 are essentially the same as TIA-942-A’s Rated 1 through 4. FC-0 is basically a computer room with servers directly connected to utility power without backup power.

Plus? Minus? 3.9?

None of the abovementioned standards makes any mention of a +/- for any rating or classification. None of them gives room for a partial, fractional, or plus/minus rating modifier, and neither does UTI for its Tier awards. So a data center can only be awarded a certification that states Tier III, or Rated 3, or Facility Class 3, but not 3.5, or 3+, or 4-.

Dig Deeper Below that Claimed Rating

If a data center facility announces that it is a Tier 3+ facility, check whether the rating was issued by a competent third party or a technical audit firm. No competent third party or technical audit firm should issue such a non-standard rating.

Such Tier 3+ or Tier 4- ratings are self-proclaimed, in an effort by the data center facility to signal that it has features better than Tier 3 or just a tad below Tier 4. But without a competent third party evaluation, there is no assurance that the facility meets, say, Rated 3 in the Electrical and Mechanical categories in the first place.

If the data center facility is evaluated by a third party to be Tier 4 in the Electrical category and Tier 3 in the Mechanical category, then it is given the lowest common rating, i.e. a Tier 3 rating.
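The lowest-common-rating rule is simply a minimum over the per-category ratings. A minimal Python sketch, with a hypothetical function name and example ratings:

```python
def overall_rating(category_ratings):
    """Return the overall rating: the lowest rating across all categories.

    category_ratings maps a category name to an integer rating from 1 to 4.
    Fractions and plus/minus modifiers are rejected, since the standards
    do not allow them.
    """
    if not category_ratings:
        raise ValueError("at least one category rating is required")
    for category, rating in category_ratings.items():
        if rating not in (1, 2, 3, 4):
            raise ValueError(f"{category}: {rating!r} is not a valid rating")
    return min(category_ratings.values())


# Fault tolerant electrical design but only concurrently maintainable cooling.
example = {"Electrical": 4, "Mechanical": 3}
```

Here `overall_rating(example)` returns 3, and a value such as 3.5 is refused rather than rounded.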

What should a potential Data Center Client do

If the Tier level is self-proclaimed without the words “certified by”, or the wording is something like “our latest data center is designed to Tier 3+ resiliency”, then it is most likely not certified by any third party. The potential data center client should insist on a competent technical third party evaluating the technical attributes of the data center if they want to consider co-locating their IT equipment there.

  • Ask the site to substantiate the self-proclaimed rating using a third party

We should just disregard the +, and in our minds de-rate those with a minus or a decimal, i.e. if we see a Tier 4- or a Tier 3.5, we should consider such a data center facility to be designed to Tier 3. If we decide to consider such a facility, we should engage a competent technical third party or, better yet, insist that the facility owner engages a third party and bears the cost.

The data center facility may dangle Tier 3+ as an indirect indication that its site is of high quality, by implication justifying a higher premium. However, the potential client should have a site selection process with clear requirements for a data center facility, and should not attach any score to a claimed rating unless it is justified through third party certification. Having a certification should be viewed as a hygiene factor. The evaluation criteria should request data on the technical, business/financial, and operations attributes, which allows for normalization and comparison across the different shortlisted sites.

  • Tier level and Suitability to client business IT needs

A data center’s main function is to house IT equipment. Some IT equipment requires fault tolerant power and cooling support, while some is only a test environment that can go a rung or two lower in terms of power and cooling resiliency. A data center facility that allows you to have a private suite housing critical production IT equipment in a Tier 4 set-up, plus a small suite or even a cage in a shared Tier 3 co-location hall, is then more suitable, giving a combined set-up that meets the business need with the best bang for the buck. This is also called a multi-tier or flexible-tier set-up. Not all data center facilities can meet this need, or the cost may be higher because the base set-up of a particular facility requires heavy re-work compared to one that is ready from day one to be flexible in this respect.

  • Do not over rely on the Tier level rating

A Tier 4 data center facility doesn’t mean no downtime. It’s fault tolerant, but trouble rarely comes just once; it may come twice or thrice. And it doesn’t take a power or cooling issue to bring down a critical IT system within a data center. Humans can cause problems. Or, as in the July 2016 incident in which the Singapore Stock Exchange’s trading system was unavailable for more than 5 hours, it can be a hard disk failure that drags down the entire trading system. A distributed denial of service attack or a telecommunications problem can also bring down IT.

  • Evaluate using a comprehensive set of evaluation criteria

What a potential data center client should do is look beyond the rating level, as whether a data center facility is designed, implemented, and certified to a data center rating level is just one facet of its suitability for the client’s IT needs. There is a multitude of other factors, including telecommunications facilities, the data center facility operations system, and the competencies of the facility’s people, among others, that count towards resilient IT operations in a data center.

  • 24×7 on the ball operations and Watch that capacity

Sometimes the effective Tier rating level will drop as the designed capacity is breached: N+1 suddenly becomes N and the site loses its redundancy. A concurrently maintainable or fault tolerant electrical design does mean that 1N of the 2N UPS can be taken offline for servicing, but the planning and execution of such maintenance should have the proper procedures (SOP, MOP, MOS) and backup or roll-back plans (RA). You want to minimize the risk, and the risk window, of a UPS problem while the other set of UPS is offline for maintenance. You should also not allow UPS maintenance and backup generator maintenance to take place at the same time, because this compounds the risk: if the remaining 1N UPS fails while the generators are on manual, you will be forced to rely only on utility supply. The maintenance should be done during non-operations hours. All these things come into play, and vendor experience is very important.
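The capacity point can be checked with simple arithmetic. The sketch below assumes identical units and uses made-up figures and a hypothetical function name: it works out how many units the present load needs (N) and how many installed units are left over as redundancy.

```python
import math


def spare_units(load_kw, unit_capacity_kw, installed_units):
    """Return installed units minus the units needed to carry the load.

    1 means N+1 redundancy, 0 means N only (no redundancy), and a negative
    value means the load already exceeds the installed capacity.
    """
    if load_kw < 0 or unit_capacity_kw <= 0:
        raise ValueError("load must be >= 0 and unit capacity must be > 0")
    needed = math.ceil(load_kw / unit_capacity_kw)  # this is N
    return installed_units - needed


# Hypothetical site: four 500 kW units, designed as N+1 for up to 1500 kW.
within_design = spare_units(1400, 500, 4)   # N = 3, one unit spare
over_design = spare_units(1600, 500, 4)     # N = 4, no unit spare
```

In this made-up case, letting the load creep from 1400 kW to 1600 kW silently turns an N+1 site into an N site, which is why capacity needs to be watched continuously.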

Reference:

  1. https://journal.uptimeinstitute.com/explaining-uptime-institutes-tier-classification-system/
  2. http://www.tia-942.org/content/162/289/About_Data_Centers
  3. http://www.computerweekly.com/tip/Four-data-center-tier-classification-misconceptions-demystified
  4. http://searchdatacenter.techtarget.com/feature/What-colocation-customers-should-know-about-data-center-tiers
  5. https://www.linkedin.com/pulse/sharing-data-center-site-selection-evaluation-james-soh?trk=mp-author-card