From the trenches of Asia data center build projects


Yours truly and colleagues have faced plenty of issues in data center build projects in China and elsewhere. Some of these experiences and lessons learnt are shared in this post. To protect the people involved, some specifics (e.g. country/location) are obscured without affecting the message.

Labor Strike

About halfway into the build programme for a data center in a city in the South Pacific, the electricians' union called a strike after compensation negotiations with the electrical installation companies over the four-year collective agreement broke down. The strike ran for 10 full days and 7 half days; on the half days, the electricians worked until 11:30am and walked off. Our project schedule had very little buffer. The client was kept apprised of the situation, but their legal counsel did not agree that it fell under the force majeure clause of our agreement. We managed to catch up by paying for overtime after the strike ended, and the client's representatives pushed their take-over date back by four precious days, so we avoided penalties. The lesson: look out for collective agreements coming up for renewal and plan extra schedule buffer around them.

Budget

Different countries have different practices when it comes to budget, in particular the buffer. In Singapore, we do not budget for a buffer; the usual arrangement is that an overrun of up to 5% is covered from a central fund or some pre-arranged contingency account.

When we worked on a project in Australia, the design firm had to explain to us that the 10% buffer is expected to be used, and any savings are a blessing.

In China, the budget is fixed and cannot be exceeded, and by the final round of budget review by senior management there is no buffer left. The other thing about the budget number is that almost certainly every data center equipment vendor and contractor will somehow learn of it.

While on the topic of budget, the fee for a data center design contract is roughly the same in Singapore and Australia, at around 3-5% of the whole project cost.

In China, the building regulations give a non-mandatory guideline of 3-5% for design contracts as well. However, competition is such that the awarded design fee normally comes in under 3%, sometimes under 1%. Quality and the resources allocated by the design firm suffer as a consequence of the cut-throat pricing among data center mechanical and electrical design companies in China. For example, they will send a technical liaison to site only once or twice for project meetings during the build phase, because the norm in China is that the main contractor takes over the drawings and is responsible for them thereafter; the design company endorses the drawings after the build programme is done, with very little verification or inspection. More on this in a future post.

Electricity Supply (last mile)

Securing power supply for the data center facility is critical. Our Beijing project in an industrial park had it secured through another company that acted purely as a go-between with the power company supplying the park. We were promised that the MV electrical cables would be laid and connected within six months. One month before the six months were up, the power company called for a meeting and invited a third-party electrical cabling contractor to attend. We were told that the power company required this contractor to be engaged, through us or our middle-man company, to finish the last mile of cabling from the nearest substation to our premises, and that an additional 1 million RMB was to be paid by us to this new party. We were shocked, and the various attempts to escalate and to negotiate this new party away took another half a year before we relented and found the budget. Worse, a new problem then cropped up: the power company said that because we had delayed contracting this new party, the cabling path had to change, as the previously planned route to our premises had been given to another cable for a separate project. The end result is that we and the adjacent building complex now share an MV transformer room located on the ground floor of one of our neighbour's buildings, with the MV cable running from their complex to ours. Again, it was a take-it-or-leave-it attitude from the power company. Our project ended up taking three years to complete. So China speed can be fast, but it can also be slow and end in a lousy compromise.

New Rules retrospectively applied to existing site

On the same Beijing project with the last-mile electricity supply obstacle, we had another problem: new regulations imposed after the Beijing TV tower fire (2009) required two sources of water for firefighters. The regulations applied to all buildings undergoing retrofit. When we bought over the former textile building complex and submitted our change-of-use plan, the industrial park management and the fire service bureau directed us to comply with the new rules. The complex as-is had only one water pipe, which the new regulations deemed insufficient. We could either dig a well connected to a swimming-pool-sized water storage tank or build two swimming-pool-sized tanks. We did the latter: our green grounds were dug up and we put in two 30m x 20m x 4m tanks with the associated water pumps.

On-site inspection is a must

On two separate occasions while in Shanghai, I inspected greenfield sites. One was supposedly zoned and ready, and the draft drawings looked promising, until I visited the site after a two-hour drive. It was farmland, with old single-storey houses, village folk still living there, and no road built. The plans were just paper; nothing had been done. Five years on, the project still has not moved off the paper.

In a tier-3 city in Anhui (this is a Chinese classification of cities: tier 1 cities are the likes of Beijing, Tianjin, Shanghai and Shenzhen, while tier 2 cities are usually the provincial capitals and other more developed cities), we were to retrofit the ground and second floors of a 12-year-old factory building. The paper as-built drawings took 5 days to locate from the state-owned conglomerate's building management company. We were pondering how best to relocate the water sprinkler pipes on the ground floor so that the no-water time could be kept as short as possible. When the building management office folks came by to look at the sprinkler pipe, they commented that the pipe was dry. I thought it was a pre-action dry-pipe sprinkler system, though I was puzzled, as a factory building does not require such a complicated system. It turned out I was mistaken: the sprinkler system was supposed to be charged with water, but the pipework had leaks, some pipes or valves were broken, and it had been dry for years.

When the fire service bureau representative came by to review our drawings for the retrofitted floors, he barely inspected the relocated water sprinkler system. I reminded myself to note the nearest exit out of the building every day I went there.

On a data center site inspection in Guangdong province, I came across a case where the design company's endorsed documentation showed dual incoming MV electricity supplies while the building in fact had only one.

Defect

Each country handles the defect list differently. In China, the defect list is sometimes overlooked if the main contractor's boss and the project owner's big boss are friends, which was the reason that main contractor was awarded the job in the first place. When the relationship turns sour for whatever reason, the defect list grows and grows and becomes the justification for withholding the remaining payments on the project. It is not good to mix business and personal relationships.

Protest

A few days before the Chinese New Year holidays, a main contractor asked its sub-contractor to hire some workers to stand in front of the gate of a Shanghai data center company to block incoming cars and worker transport buses. The police were called and the protestors were moved to the side of the entrance. After two days of this organized protest, it stopped, as the police had warned the sub-contractor of serious consequences. Rumor has it that the sub-contractor's manager and the colleagues responsible for organizing the protesters were "dealt with".

Impulsive Decision and Big Grand Upscale (高大上)

Impulsive decision-making (拍脑袋做决策) by key decision makers without expert advice is a major problem in Chinese organizations. Many cloud data center projects have been announced throughout China, but only a fraction are actually completed and in operation, and most of those have not reached 50% occupancy and are loss-making. One organization I know set a cloud strategy in 2015 to reach 30,000 racks of capacity within 3 years, without regard to market demand. Worse, the cities where it planned to build its large-scale cloud data centers (6,000-10,000 racks each) are less developed cities that do not need such scale, and no consideration was given to competitors, who were not sitting still. The leaders want big, grand, high-end projects; the saying in Chinese is 高大上.

Finding the Gems among all the “promising” projects

In China and elsewhere in Asia, it is a challenge to sift through all the plans announced by local governments, cloud players, and large data center park developers to find the gems and the worthy projects. There are data center projects that are more sure-footed, and data center service providers on firmer ground that are growing from strength to strength.


Rebel forces in the China Data Center Market

[Headline image: data center buildings in ChongQing LiangJiang New District XinShui High Tech Park]

The headline picture shows ChongQing city's LiangJiang New District XinShui High Tech Park, where five data center buildings have been built or are under construction. Two of them are by a server manufacturer called Inspur.

Chinese server manufacturers are going a completely different route from non-Chinese server manufacturers. The likes of IBM, HP and Dell offer what goes in the rack (i.e. servers) and what runs on top of the hardware (i.e. software and services), and are most definitely not building data centers with their own money.

In China, it is the other way around. The Chinese server manufacturers are funding and building data centers throughout China in a frenzy!

The following five server manufacturers are Chinese owned companies:

  •  华为 (Huawei)
  • 浪潮 (Inspur)
  • 曙光 (Sugon)
  • 紫光 (Unisplendour – majority owner of H3C)
  • 中兴 (ZTE)

Besides all being server manufacturers, they have each built multiple large-scale colocation data centers in China.

Huawei needs no introduction, and perhaps ZTE as well. The rest of the manufacturers mentioned above are relatively unknown outside of China, but they are well known within it.

Huawei has built more than three cloud data center facilities in China (reference 1, 2). For the data center facility sector, Huawei has a containerized data center solution and its own UPS and CRACs/CRAHs; LV switchboards, back-up generators and chillers are sourced from external parties and re-badged as Huawei. They also have their own Huawei BMS software.

Both ZTE and Sugon are state-owned enterprises, while Inspur has transitioned from state to private ownership, though a portion of its shares is still held by the Chinese government.

Inspur has built a data center facility with a capacity of 8,000 racks in ChongQing for China Unicom under a build-operate model (reference 3). Inspur has also announced plans to build seven large data center facilities and 50 smaller ones throughout China (reference 3, 4).

According to news reports, Apple will use Inspur to build and operate a data center in China (reference 5).

Sugon is well known in China for its super-computer clusters, but it also has a server product line. It has built super-computer data centers in three cities (Wuxi, ChengDu, and Nanjing) as well as cloud data centers in many cities, and has announced plans to cover 100 cities. It has also ventured outside of China, building a data center in Slovenia for the Slovenian ICT company Arctur (reference 6, 7).

Unisplendour is publicly listed on the Shenzhen stock exchange. It bought 51% of H3C from HP and started making and selling HP servers branded as H3C in China. Unisplendour announced that it will spend 2.2 billion RMB to acquire or build data centers, and has announced a plan to build a data center facility in the FangShan district of Beijing with a built-up area of 39,725 sqm (about 427,500 sqft) (reference 8).

ZTE has built a data center facility with a capacity of 13,000 racks in ChongQing for China Mobile under a build-operate model (reference 9). ZTE explicitly states on its website that it provides a total data center solution from planning and building all the way to hand-over (reference 10). It has also been looking outside of China to build data centers to meet the demand of Chinese companies that need to set up network points-of-presence (POPs) or host their applications and data.

For example, in ChongQing, there are at least two server manufacturers involved in building and managing data centers on behalf of China Unicom and China Mobile, which in turn are for cloud service providers like Baidu and Tencent.

This has been going on for more than 4 years, and quite a number of data center facilities have been completed. Inspur has been leading the charge, followed by the rest; not surprisingly, Huawei was among the last to do so (Unisplendour is slow because it only acquired the 51% stake in H3C in 2016). Huawei was probably hesitant to encroach on the IDC space but could not forgo the market share and potential revenue.

In China, the background and the incentives are vastly different from elsewhere, which is why these manufacturers have entered the data center build-operate-(optionally) transfer cycle. The data center builds sit under an umbrella or framework package deal in which the city or district government wants a public-private partnership headlined by a smart-city or government-cloud initiative, which usually ties together the following:

  • smart-city projects – community (residents and businesses) projects – employment (skilled manpower) – infrastructure (cloud, network, servers, storage) – data center

So the data center is the underpinning of the whole deal, and the server manufacturers see the volume of network/servers/storage required (tens of thousands of racks means on the order of 100,000 servers, an investment several times larger than the data center core and shell plus electrical and mechanical infrastructure).
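
As a rough back-of-envelope illustration of that comparison (every unit cost below is an assumption made up for the example, not a figure from any actual project), the IT fill-out can easily be several times the facility capex:

```python
# Back-of-envelope comparison of IT capex vs facility capex for a large park.
# All unit costs are illustrative assumptions, not actual project figures.
racks = 10_000                    # "tens of thousands of racks" scale
servers_per_rack = 10             # assumed average fill
server_unit_cost = 40_000         # assumed RMB per server
facility_cost_per_rack = 80_000   # assumed RMB per rack for core & shell + M&E

it_capex = racks * servers_per_rack * server_unit_cost
facility_capex = racks * facility_cost_per_rack

print(f"servers: {racks * servers_per_rack:,}")                        # 100,000
print(f"IT capex: {it_capex / 1e9:.1f}B RMB")                          # 4.0B RMB
print(f"facility capex: {facility_capex / 1e9:.1f}B RMB")              # 0.8B RMB
print(f"IT is ~{it_capex / facility_capex:.0f}x the facility outlay")  # ~5x
```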

The server manufacturers work with partners to deliver the entire package, or supplement it with in-house capability to develop some of the big data / cloud solutions, so they view these huge projects as opportunities for growth. The other thing is that in China there are very few big third-party co-location data center service providers with the reach and deep pockets of these mostly state-backed server manufacturers. This creates a unique opening for the Chinese server manufacturers to enter the data center build-operate-(optionally) transfer model. Note that these data centers are built by appointed main contractors, not strictly by the Chinese server manufacturers themselves.

Intertwined in this mix, Baidu/Alibaba/Tencent (BAT) are also partners of the Chinese server manufacturers in those smart-city projects, so it is not necessarily the case that the server manufacturers are squeezing out the BAT. For example, the Inspur-built data center in ChongQing's LiangJiang new district is known as the Baidu ChongQing LiangJiang Data Center.

How this will play out remains to be seen. For the moment, it is worth noting that the Chinese server manufacturers can be a channel or a partner to work with in the China data center market as well as overseas.

Reference:

  1.  http://mt.sohu.com/20170308/n482702087.shtml
  2. http://news.eastday.com/eastday/13news/auto/news/china/20170209/u7ai6476772.html
  3. http://www.liangjiang.gov.cn/Content/2015-06/19/content_90027.htm
  4. http://www.ebrun.com/20150716/141014.shtml
  5. http://www.gold678.com/dy/P/67321072
  6. http://cn.chinagate.cn/news/2016-11/16/content_39716087.htm
  7. http://www.jifang360.com/news/2014422/n402358123.html
  8. http://news.idcquan.com/news/60548.shtml
  9. http://fiber.ofweek.com/2016-09/ART-210022-8120-30033419.html
  10. http://www.zte.com.cn/cn/solutions/cocloud/201508/t20150825_443862.html

A list of Singapore Colocation Data Centers


For an enterprise looking for a data center co-location service provider, a Google search will yield page after page of links.

There are a few websites that list data centers (see references 2 to 5), though some of the providers listed there are no longer around. This post lists those that offer co-location service to enterprises, with the web page that contains their contact information. The list only contains data centers that are ready to take in a customer's full rack of IT gear.

This post is aimed at an enterprise searching for a Singapore data center co-location service provider to house at least one rack of IT gear, rather than one specifically looking for managed services. We do not distinguish between providers offering a whole rack, a private cage, or a private data hall. If the enterprise wants colocation plus managed services, the likes of HPE, IBM, Dimension Data and Atos-Origin will be able to bundle managed services with a hosting facility (be it a private suite / data hall or a private cage) from one of the service providers below.

Some of the data center co-location service providers listed below have more than one data center facility in Singapore.

1. 1-Net Singapore

http://www.1-net.com.sg/

2. AIMS

http://www.aims.com.my/co-location/data-centre-locations/singapore/

3. Ascenix Singapore

https://ascenix.net/index.php

4. AT&T

https://www.corp.att.com/worldwide/att-you-singapore.html

5. CenturyLink

http://www.centurylink.com.sg/business/enterprise.html

6. Colt

http://asia.colt.net/services/data-centre/about-colt-data-centres/sgdc1/

7. CyrusOne

https://cyrusone.com/enterprise-data-center-services/colocation-solutions/cabinets/

8. DataPipe

https://www.datapipe.com/data_centers/singapore_one_data_center/

9. Digital Realty Singapore

https://www.digitalrealty.com/data-centers/singapore/

10. Epsilon

http://www.epsilontel.com/

11. Equinix Singapore

http://www.equinix.sg/locations/singapore-colocation/singapore-data-center/

12. Fujitsu

http://www.fujitsu.com/sg/services/infrastructure/data-center-services/

13. Global Switch

http://www.globalswitch.sg/locations/singapore/

14. InterNap

http://www.internap.com/data-centers/data-center-locations/singapore/

15. IO

https://www.io.sg/data-centers/singapore-data-centre/

16. KDDI Telehouse Singapore

https://www.telehouse.com.sg/

17. Keppel Data Centres

https://www.keppeldatacentres.com/

18. Kingsland

http://kingslanddatacenter.com.sg/

19. LeaseWeb

https://www.leaseweb.com/platform/data-centers/SIN-11

20. M1

https://www.m1.com.sg/business/datacentrehosting

21. NewMediaExpress

https://www.newmediaexpress.com/colocation.html

22. NTT

https://sg.query.ntt.com/en/contact-us/office-location.html

23. Racks Central

http://www.rackscentral.com/

24. SingTel

https://www.singtel.com/business/products-and-services/singtel-data-centre-services

25. Starhub

http://www.starhub.com/business/cloud-and-data-centre/data-centre.html

26. ST Electronics DCS

http://www.steedcs.com/

27. ST Telemedia

https://www.sttelemediagdc.com/

28. T-Systems

http://www.t-systems.com.sg/abouttsystems/about-t-systems-data-centre-managed-network-services-hosting-services/621404

29. Tata Communications

https://www.tatacommunications.com/products-services/enterprises/cloud-hosting

30. Telekomunikasi Indonesia International (Telin SG)

http://telin.sg/

31. Telstra

https://www.telstraglobal.com/sg/products/cloud/colocation

32. ViewQwest

http://corporate.viewqwest.com/products/colocation.html

33. Icon-Webvisions

http://www.iwv.com.sg/services/colocation/

34. WebZilla

http://www.webzilla.sg/colocation/

Note: some of these co-location service providers are themselves leasing physical data halls from a large-scale data center co-location service provider.

Just for completeness' sake, here are some hyperscale data centers, i.e. data centers built by companies that offer cloud services to enterprises. Given that they have a cloud set-up in Singapore, there is a latency advantage in considering cloud services from the following providers:

a. Aliyun aka Alibaba Cloud

https://intl.aliyun.com/

b. AWS

https://aws.amazon.com/ (Singapore is one of the regions you can pick)

c. Microsoft Azure

https://azure.microsoft.com/en-us/regions/ (Singapore is one of Azure’s regions)

d. Google (Jurong West)

https://cloud.google.com/about/locations/ (Singapore is one of Google’s cloud regions)

e. Softlayer

http://www.softlayer.com/data-centers

Reference:

1. https://www.google.com/maps/d/viewer?mid=1EG49nDzDZ-NLjVyqbkmDLS4V5TE&hl=en&ll=1.3304834515591697%2C103.83486799999991&z=12

2. http://www.datacentermap.com/singapore/singapore/

3. https://cloudscene.com/search/data-centers?searchTerm=Singapore&pDc=1&pSp=1&pFb=1&pMar=1&sDc=providers&sSp=pops&sFb=markets&sMar=facilities

4. http://wiredre.com/singapore-data-center/

5. https://www.datacenter.sg/


A list of data center mechanical and electrical design consulting firms in Singapore


On two separate occasions, a common question came up – which data center mechanical and electrical design consulting firms could they call for submissions, as they only knew one or two at most.

Well, quite a lot, apparently. The list below covers firms that have done at least one data center project or data center technical review in Singapore:

  1. ARUP
  2. Aurecon
  3. Bescon Consulting Engineers
  4. Cundall
  5. DSCO
  6. HurleyPalmerFlatt
  7. I 3 Critical Facilities
  8. J Roger Preston
  9. M+W Group
  10. Meinhardt
  11. NTT Facilities (formerly Pro-Matrix)
  12. Plan One Engineering Services
  13. RED Engineering
  14. SJ Thames
  15. TW International Counsel
  16. Wah Loon Engineering
  17. Worley Parsons

I am not associated with any of the above companies in a business capacity.

You can easily obtain contact information from a Google search of these companies' websites. If you would like another mechanical and electrical design consulting firm to be listed, please let me know, with a project reference (just the project name), and I will update the list as and when I am able to.


On the same page – data center co-location market segments and terms


In China, Internet Data Center (IDC) is synonymous with all the following terms:

  •  Wholesale co-location data center
  • Multi-tenanted co-location data center
  • Retail co-location data center
  • Webhosting

They don’t classify the IDC into the above terms – simply just IDC.

The data center industry sometimes drops the term co-location, i.e. wholesale co-location data center = wholesale data center.

I believe these separate terms were created by us, the data center service providers, ourselves.

I believe the different market segment terms came about because of the sheer size of the US market, where these somewhat distinct segments formed, differentiated along the scale of the client's data hall requirements:

  • Enterprise aka in-house data center
  • Wholesale co-location data center (preference to have one client per building)
  • Multi-tenanted co-location data center (preference for 1 customer = 1 data hall = 1 private vault = 1 private data room)
  • Retail co-location data center (One rack or within one rack, includes dedicated server hosting)
  • Webhosting aka managed hosting, shared server hosting, and virtual server hosting
  • Cloud Data Center

Let me put this into perspective. With the exception of enterprise data center, the co-location data center terms are from the perspective of the data center co-location service providers and not usually from the perspective of clients.

The big boys are the large-scale players, the likes of Digital Realty (DRT) and Equinix, but there are lots of data center co-location service providers in the US that those of us outside that market are not aware of. Even DRT holds less than 10 percent of the data center white space in the US, so there are many service providers sharing a big pie.

There is little agreement on the definition of these terms globally, though. A particular type may dominate in a particular country simply because clients there are only looking for, say, retail co-location. In Jakarta, while dedicated data center buildings are being planned, most data centers are retail and multi-tenanted facilities in mixed-use buildings: demand is large in aggregate but comes from lots of small and mid-sized clients, because most large enterprises prefer self-built, in-house facilities.

There is no agreement on the term Cloud data center, since it can exist in any of the data center market segment types.

This is perhaps one reason why there are very few campus-scale wholesale co-location data centers in China: customer demand has not reached that scale yet, and the way land is allocated in China does not favor data-center industry parks, because local governments prefer job creation and data centers ultimately do not employ many workers. They do like cloud data centers, as they perceive these will employ lots of IT engineers/programmers.

Clients do not care how you position your data center in the market by size; they only care whether your data center space meets their technical and financial requirements.

While the US, and thus the US-based data center co-location service providers, has enjoyed growth in demand for large-scale data centers, the rest of the global market may not see the same level of demand.

In Europe, when I attended the London edition of the DCD conference in 2016, the European multi-tenanted co-location service providers were telling the OCP workshop organizers that their data center space generally cannot cater to an entire facility meeting the average power requirements of OCP racks. In China, only Alibaba has gone the large-scale route, while Baidu and Tencent manage their data center space growth through a presence in multiple buildings run by multiple multi-tenanted co-location service providers. A large-scale data center park builder in Hebei has faced the problem of building ahead of predicted demand, with few takers for the three data center buildings it has already built.

I used to work at a data center co-location service provider that started off offering retail co-location, i.e. one rack or a sub-division of a rack, and later moved into multi-tenanted co-location. We had clients of all types, and they did not really care how we defined our co-location market position; they only wanted to know whether we could meet their requirements and submit a price proposal. A rack-mount server fits into a standard IT rack, so whether the rack, the room, or the building is shared may not matter to the client.

In some cases, clients may actually prefer smaller data center service providers, which are considered more responsive, while larger enterprises may go for global wholesale co-location service providers because of global agreements and consistent standards. To each their own.

Back in the dot-com bust days, data center closures due to overbuilt supply saw then-dominant players like Exodus and Digital Island change hands, scale down, or be sold.

At the end of the day, it is the clients and their needs that define which market segments will grow.


HUMAN ERROR: The Biggest Challenge to Data Center Availability and how we can mitigate it – Part 2


The previous article on this topic can be found via this link.

The layered approach to upholding data center infrastructure availability should not look like Swiss cheese: the hazard or trigger should eventually be stopped, preferably as early as possible.

[Figure: Swiss cheese model of layered defences]

The layers should include the following:

  • Design (in accordance with design intent of owner) with either concurrent maintainability objective or fault tolerance
  • Implementation (in accordance with design brief) and fully tested via comprehensive testing and commissioning phase before handover with fully documented SOPs/MOPs/EOPs
  • Maintenance and Operations Management, work by equipment service providers or any work on site through Method of Statement and Risk Assessment matrix by suitably qualified person/persons
  • Incident and Problem management process, escalation management and mitigation process

and so forth

Inadequacy in each of these layers can result in problems such as:

  • Inherent Design / Setting flaw
    • Outdated / swiss cheese situation
    • Requires analysis and manual intervention
    • Error Producing Conditions (EPC)
  • Weakness in manual processes
    • Inadequate automation
    • Inadequate training / familiarity
    • Inadequate operations procedures
  • Insufficient Information / knowledge
    • Capacity limit reached earlier than design intent
    • Inadequate training / knowledge
    • Inadequate documentations
  • Insufficient Risk Assessment
    • MOS / RA, risk matrix
    • Vendor experience

 

Learn from other industries

The data center industry is relatively young, and there are other industries with mission-critical infrastructure whose extensive research and iterative enhancements we can learn from and adopt.

  • Airline’s Crew Resource Management
    • Checklist and double checking by pilot and co-pilot on the airplane air-worthiness
    • Communications within the cockpit and the cabin staff with the cockpit to ensure timely and prioritized response
  • US Nuclear Regulatory Commission
    • Standardized Plant Analysis Risk – Human Reliability Analysis (SPAR-H) method to take account of the potential for human error
  • OECD’s Nuclear Energy Agency
    • Ways to avoid human error, e.g.,
      • Systems should also be designed to limit the need for human intervention
      • distinctive and consistent labelling of equipment, control panels & documents;
      • displaying information concerning the state of the plant so that operators do not need to guess and make a faulty diagnosis; and
      • designing systems to give unambiguous responses to operator actions so incorrect actions can be easily identified.
      • operators to be better trained for plant emergencies, use of simulators

 

 

Error Reduction Strategies and Error Precursors

In addition, error-reducing strategies can be applied in all areas of data center maintenance and operations management to reduce the probability of human error. Whether in the design of the power and cooling infrastructure, or in assessing the risk of a particular maintenance operation (e.g. a power switch-over exercise for UPS or back-up generator maintenance), the strategies below should be applied.

Take, for example, the AWS US-East-1 outage incident (http://mashable.com/2017/03/02/what-caused-amazon-aws-s3-outage/): the command set was powerful, and a typo could take down a lot of servers very quickly. AWS said in their post-incident summary (https://aws.amazon.com/message/41926/) that they would limit how fast and how far the tool's commands can act, i.e. put in safety checks, which is basically an application of the constraint strategy.
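
As a minimal sketch of what such a constraint can look like (the tool, names and thresholds below are hypothetical, not AWS's actual implementation), a capacity-removal command can be clamped so that no single invocation can take out more than a small slice of the fleet:

```python
# Hypothetical safety check for a capacity-removal tool (a "constraint" strategy).
# Names and thresholds are illustrative, not any vendor's actual implementation.
MAX_REMOVAL_FRACTION = 0.05   # one command may remove at most 5% of the fleet
MIN_FLEET_SIZE = 50           # never drop below this many active servers

def remove_capacity(active_servers: int, requested_removal: int) -> int:
    """Return how many servers the tool is actually allowed to remove."""
    if requested_removal <= 0:
        return 0
    cap = int(active_servers * MAX_REMOVAL_FRACTION)
    allowed = min(requested_removal, cap)
    if active_servers - allowed < MIN_FLEET_SIZE:
        allowed = max(0, active_servers - MIN_FLEET_SIZE)
    return allowed

# A fat-fingered "remove 5000" against a 6000-server fleet is clamped to 300.
print(remove_capacity(6000, 5000))  # -> 300
```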

[Figure: error reduction strategies]

When a service or repair task is assigned to operations staff or to qualified technicians from an equipment service provider, evaluating whether error precursors exist and eliminating them will reduce the likelihood of human error. For example, the combination of time pressure, an inexperienced staff member already at the end of a long shift, and an ambiguous task objective all contribute to a higher-risk task. Eliminating or reducing those precursors, or re-assigning the task to an experienced staff member at the start of a shift with a clear objective, lowers the risk.
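
As an illustration of that evaluation (the precursor list and weights below are assumptions loosely modelled on common error-precursor checklists, not a standard scoring scheme), a simple pre-task check could look like this:

```python
# Illustrative pre-task error-precursor check; the weights are assumed, not standard.
PRECURSOR_WEIGHTS = {
    "time_pressure": 3,
    "inexperienced_staff": 3,
    "ambiguous_task_objective": 3,
    "end_of_long_shift": 2,
    "first_time_task": 2,
    "distractions_on_site": 1,
}

def task_risk(precursors_present):
    """Sum the weights of the precursors present and map to a risk level."""
    score = sum(PRECURSOR_WEIGHTS.get(p, 0) for p in precursors_present)
    if score >= 6:
        return "HIGH - eliminate precursors (reassign, reschedule, clarify) first"
    if score >= 3:
        return "MEDIUM - add supervision or a peer check"
    return "LOW - proceed with normal checks"

print(task_risk({"time_pressure", "inexperienced_staff", "end_of_long_shift"}))  # HIGH
```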

[Figure: common error precursors]

Risk Mitigation is a Continuous Process

A multi-pronged, multi-layered approach with attention to detail is required to mitigate the risk of human error causing an outage in a data center facility.

[Figure: risk mitigation process flow]

 

A data center should be designed and implemented to a clear and tested design intent (e.g. the objective of being concurrently maintainable). Day in and day out, operations staff, vendors, and client personnel interact with the systems within the data center, so there needs to be a well-oiled system in place, not just documentation, that works 24×7 for as long as the data center exists.

An iterative risk mitigation system, consistent management support and attention, and knowledge learned from near misses and incidents are key attributes of an environment that is resilient on the human side.

We humans can reduce human error, but effort is required

We should look at the data center organization, especially the operations team: its resources, tools, capabilities, and so forth. A no-blame culture, active participation by all staff in addressing potential weaknesses and error precursors, and follow-up on near misses (which are a sign of error-inducing conditions) are important in mitigating the effects of human error. We should move away from finger-pointing and learn from past problems, as AWS did with their incidents. Our data center industry can also do more to share and learn from one another, to prevent the recurrence of issues that have already been faced and dealt with elsewhere.

This built-up knowledge of good practices should be documented and disseminated, with management support. The weakest link is an inexperienced staff member hesitating or, worse, making a wrong decision, so training everyone on the operations team is critical to maintaining data center availability.

A periodic (for example, annual) no-nonsense third-party review of data center operations and management, coupled with improvement plans to strengthen the weakest links, gives insight and assurance to C-level executives, data center operations managers, and clients. Most operations managers are too busy to review their own operations, it is difficult to find fault with your own work, and experience is limited if staff have not worked in more than one or two data center sites. A third-party operations and management review is therefore the next best way to strengthen resilience against human error, provided it has full co-operation from the top to the bottom of the data center staff.

Furthermore, once a data center service provider has grown beyond 2 or 3 data centers, it becomes difficult to manage operations consistently across them, especially if each is managed independently. A third-party review applied to all of them will help rein in inconsistent operations processes, subject of course to having a central data center operations programme function within the service provider.

Ultimately, a data center facility depends on well-trained and knowledgeable staff who know their facility, or know where to quickly find the documentation containing the details, and who properly carry out the risk assessment work when evaluating equipment service vendors or upgrade works.

In summary,

  • It is worthwhile to commit resources to reduce errors
  • We can improve our resiliency, and thereby uptime, through available methods and tools
  • There are proven methods and tools we can borrow from other mission critical environments
  • Third party data center operations and management review coupled with improvement plan should be considered for large data center operations especially those that have multiple sites

 


2+2000+3000 = 1 big challenge

[Image: a job advertisement for a data center migration project manager at a bank]

You read it right.

2 DCs + 2,000 servers running 3,000 applications are going into a new data center in three years' time, and the man/woman to do it is yet to be found.

I have had the unfortunate experience of a network interruption that caused slow, unacceptable access to hundreds (my estimate at the time) of online applications in a large enterprise data center. Then again, we were an internal shared co-location data center, so we only counted projects and departments, never the number of application systems.

 

The picture above shows an advertisement for a data center migration project manager. It has been put up repeatedly for more than 6 months, and I have seen the indicated salary range go up (from 10k SGD per month to now 15k SGD). The scale and complexity are now fully spelled out: the earlier advert did not indicate two data centers moving into a new one, with 2,000 servers and 3,000 application systems all to be moved in (I presume in phases; it would be nearly mission impossible all at once) by 2020. Don't forget that systems will not stay frozen during this period: new systems and services may still be added to the current data centers, given the inter-dependencies of systems and data needed to introduce them. Hopefully no IP address changes are required for any of the systems; and that is only one of many things to consider for such a move.

I did one data center move in the mid-1990s for an enterprise whose one and only mission-critical system was a mini-computer. We overran the planned 24-hour downtime window by an additional 30 hours because the new site's telecommunication lines were digital whereas the old site's were analog; our analog modems would not work and we had to bring in new ones, but the migration took place on a Sunday, when the vendor's warehouse was closed.

On various occasions, when talking to data center facility owners, sales people and fellow consultants about the mission-critical nature of IT for most of today's enterprises, I have mentioned that hundreds of applications are in use, on average, by medium to large enterprises.


3,000 applications is a big number. I hope only 10% of them are mission critical, and that the entire application portfolio has been prioritized with inter-dependencies already mapped out, as illustrated in the sketch below.
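
To show why that dependency mapping matters, here is a toy sketch (the application names and dependencies are invented, not the bank's actual portfolio) that groups applications into migration waves so a system only moves after everything it depends on has moved:

```python
# Toy sketch: group applications into migration waves from a dependency map.
# Application names and dependencies are invented for illustration only.
from graphlib import TopologicalSorter  # Python 3.9+

dependencies = {
    "internet_banking": {"core_banking", "auth_service"},
    "atm_switch":       {"core_banking"},
    "core_banking":     {"db_cluster"},
    "auth_service":     {"db_cluster"},
    "db_cluster":       set(),
}

ts = TopologicalSorter(dependencies)
ts.prepare()
wave = 1
while ts.is_active():
    ready = sorted(ts.get_ready())   # everything whose dependencies have all moved
    print(f"Wave {wave}: {ready}")
    ts.done(*ready)
    wave += 1
# Wave 1: ['db_cluster']
# Wave 2: ['auth_service', 'core_banking']
# Wave 3: ['atm_switch', 'internet_banking']
```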

When a bank data center runs into problems (see the reference section below), what we see externally is ATMs being down, counter staff switching to back-up systems, and service slowing down. What really happens behind the scenes involves a lot more effort to bring the critical applications back into service.

What makes me wonder, though: shouldn't the bank have identified such a role and brought the person in earlier, during the decision-making for the new data center? Having this person, or better still a team of data center migration experts, involved early would be better in all sorts of ways than bringing in an outsider late who then has to build up the knowledge needed to manage and mitigate the migration risks. I am pretty sure the local financial regulator will dive in to audit and assess the bank's migration plan.

Anyway, I have learnt that these numbers (thousands of servers and applications) are probably typical for a Singapore bank.

Best of luck to their data center migration.

Reference:

  1. http://www.datacenterknowledge.com/archives/2013/12/16/year-downtime-top-10-outages-2013/
  2. https://www.theregister.co.uk/2017/01/13/lloyds_bank_in_talks_to_outsource_bit_barns_to_ibm/
  3. https://www.linkedin.com/pulse/20140616192008-655694-best-practices-for-data-center-migration

Can you help us build a tier 5 data center?


A data center consultant, K, told me this story. Around 2005 or 2006 he gave a talk at a data center conference in a famous financial and resort city somewhere in Asia. A gentleman, J, walked up to K afterwards and introduced himself as a property developer looking into building a new data center. K had spoken on stage about data center standards, mentioning the Uptime Institute Tiers and TIA-942, and J said he wanted to build a Tier 5 data center.

As an aside, let me defer to other posts/websites on the design standards and the Tier level / Rated / Facility Class schemes (see references 1 and 2). Generally speaking, most define data center designs by the resiliency required, in up to four levels.

K was taken aback and asked whether J was aware that the Tier levels top out at IV / 4. J said he knew, and that he wanted to go one better than Tier IV / 4. J shared that since the city where he planned to build had no standalone data center facility yet, he wanted to stand out; that city is well known for its extravagant hotels, malls and the like.

The "build it and they will come" idea

K was kind enough to ask J whether he had done a market study and knew whether potential clients demanded a highly resilient, fault-tolerant data center. J replied that he had not, but that he believed demand would rush in once it was announced that such a data center would be built. Well, maybe, if you have done your study and at least know where the competition stands. But if you have not studied market demand and competition at all, then what you build may be over-built, or so far ahead of demand that it takes much longer than your optimistic timeframe to sell.

I have on multiple occasions met potential data center owners considering building their first data center in a non-first-tier data center market in Asia. Surprisingly, a common central theme of their plans hinges on the "build it and they will come" mindset. Today, several Asian cities are over-supplied not only in the residential and industrial sectors but also in the data center sub-sector, and over-confidence that demand will come once supply is there is one contributor to the situation. A data center facility is a huge investment. One China data center company I know has a well-sought-after facility in Beijing, but it expanded into other cities it was less familiar with and suffered losses for years; this dragged down its overall finances and it was forced to sell its crown jewel under less than preferred circumstances and numbers.

Client needs and supply / demand

I have two points to make. Firstly, know your market, your competition, and your financial strength. If all your competitors are building for shared-hosting-type demand, which only requires a UPS-backed electrical supply to the IT servers, then building to a higher level of resiliency makes your data center space pricier and it will take longer to fill up, if it ever does. There were a few such cases in Singapore: some folded after building a data center, and some spent millions of dollars on projects that could not take off and are now in limbo. Many such cases also exist in China. One Singapore case did prevail: they built their data center during the dot-com boom and were caught in the dot-com bust, which claimed several casualties, but this one data center survived because it built up its facility floor by floor, unlike the other two, placing less strain on its finances during that period.

More prudent to match cost outlay to take-up

Secondly, the main technical design parameter – whether to build to concurrent maintainability (roughly equivalent to Tier III / Rated 3 / Facility Class 3) or to fault tolerance (Tier IV / Rated 4 / Facility Class 4) – depends on client demand. If the target clientele are financial institutions, or organizations that for various reasons rely on IT systems that can only run on a single host or in an active-passive set-up (airline ticket reservation systems seem to be like that), then the higher level makes sense. Another way is to plan for multiple levels of resiliency, i.e. share the same fault-tolerant electrical infrastructure but stay flexible enough to accommodate either concurrently maintainable or fault-tolerant client demand (although generally this will be slightly more costly than a design built purely for concurrent maintainability).

Fortunately, these days there is so much information in the market that new owners-to-be are better informed. My other gripe is those who know a little about one particular data center topic yet are so convinced of it that meaningful exchange becomes impossible, but that is a story for a future post.

Reference:

  1. http://www.datacenterknowledge.com/archives/2016/01/06/data-center-design-which-standards-to-follow/
  2. https://uptimeinstitute.com/tiers
  3. https://www.linkedin.com/pulse/data-center-tiers-tears-plus-minus-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  4. https://www.linkedin.com/pulse/making-sense-data-center-standards-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

The problem with the (use of) PUE in the Data Center industry


I have written in a previous post about the reporting and use of PUE, including the terms iPUE, PUE3, dPUE etc. (https://www.linkedin.com/pulse/data-center-resource-efficiency-pue-isoiec-30134-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F).

Much like the fuzzy way Tier / Facility Class / Rated levels are mentioned in the industry, without making clear which standard a facility is designed to and whether it is certified, the resulting confusion does not help potential clients or the industry as a whole. Just to clarify, I take no issue with a data center saying that its facility is designed in accordance with a particular standard, given that any potential client should and will do a detailed review and audit of the facility before committing to a co-location deal.

The issue I want to highlight in this post is the use of designed PUE (dPUE) instead of actual PUE for marketing and even for setting policy. dPUE is itself an estimate (as per the example in ISO 30134-2) and imprecise. The gap between actual PUE3 and dPUE can be huge, given that the IT load of any new data center facility will normally not ramp up to near 100% for quite some time.

This encourages the owner of a yet-to-be-built data center to claim a low dPUE. It is an estimate, after all; who is to say the figure of 1.13 is wrong? You want to check my calculations? Talk to my design consultants, who worked out that number (at my insistence that they assume the best-case situation to arrive at a low dPUE).

Beijing's announced ban on new data centers with a PUE of 1.5 or above really means designed PUE. And since it is a designed PUE, a lot can go into estimating a low number. Who is going to shut off the power after the facility is designed, the equipment selected, and it is built and operating at well below full capacity, yielding a poor actual interim PUE? There are many ways to make the dPUE figure work to your advantage; see reference 1.

You may ignore ancillary power usage, assume a very low mechanical load, or cite the most power-efficient chiller in the design but purchase a less efficient one when the actual equipment is bought. Or you base your dPUE on the PUE1 or PUE2 way of calculating, which makes it look slightly better. It all adds (or subtracts) up.
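
To make the arithmetic concrete (every kW figure below is an assumption for illustration; PUE is total facility energy divided by IT energy, and PUE1/PUE2/PUE3 differ only in where the IT energy is measured), a small sketch shows how the choice of measurement point and the ignored losses move the number:

```python
# Illustrative PUE arithmetic; every kW figure is an assumption, not measured data.
it_equipment = 1000.0      # kW drawn at the IT equipment input (PUE3 boundary)
rack_cabling_loss = 15.0   # losses between PDU output and the IT equipment
pdu_loss = 20.0            # losses inside the PDUs
ups_loss = 50.0            # losses inside the UPS
cooling = 350.0            # chillers / CRAHs / pumps at this load
ancillary = 65.0           # lighting, offices, security, etc.

total_facility = (it_equipment + rack_cabling_loss + pdu_loss
                  + ups_loss + cooling + ancillary)

pue3 = total_facility / it_equipment                                   # IT metered at the equipment
pue2 = total_facility / (it_equipment + rack_cabling_loss)             # IT metered at PDU output
pue1 = total_facility / (it_equipment + rack_cabling_loss + pdu_loss)  # IT metered at UPS output

# An "optimistic" dPUE that quietly drops the losses and ancillary load:
optimistic_dpue = (it_equipment + cooling) / it_equipment

print(f"PUE3 {pue3:.2f}  PUE2 {pue2:.2f}  PUE1 {pue1:.2f}  optimistic dPUE {optimistic_dpue:.2f}")
# -> PUE3 1.50  PUE2 1.48  PUE1 1.45  optimistic dPUE 1.35
```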

[Figure: PUE at design load versus actual partial load]

Credit: CCG Facilities. http://www.ccgfacilities.com/insight/detail.aspx?ID=18

From my experience of operating and auditing more than a dozen data centers, I have seen very crude designed-PUE estimates and some better ones.

The thing is that the designed PUE always looks too good, and that stems from:

  • Not including some of the data center infrastructure losses
  • Not including electricity losses in the cables (around 3%)
  • Assuming the installed equipment performs exactly to factory specifications, with no tolerance
  • Estimating on a PUE1 basis, i.e. measuring IT load at the UPS output, whereas PUE2 or PUE3 is the recommended way
  • Assuming ideal environmental conditions, whereas real conditions over 12 months will be sub-optimal

A friend of mine who works at a data center co-location service provider laments that their honesty earned them a lower category in a green data center award, while others in the same city that claimed lower dPUE figures received higher awards. It may not be entirely due to the dPUE figures, but they play a part.

Clients are not fools, and a colocation service provider that claims a very low dPUE will find it tougher to negotiate co-location contracts, because in some countries the power bill recovery is tied to the actual PUE, which only approaches the dPUE when the facility is close to full utilization. This eats into their profits.
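
As a rough illustration of that recovery gap (tariff, load and PUE values below are assumptions), consider a month where the contracted figure is the marketed dPUE but the facility actually runs at a higher PUE:

```python
# Illustrative power-cost recovery gap; tariff, load and PUE values are assumed.
it_energy_kwh = 500_000    # client IT energy consumed in the month
tariff = 0.70              # assumed power tariff per kWh
contracted_pue = 1.30      # the marketed / designed PUE written into the contract
actual_pue = 1.60          # real PUE while the facility is only partially utilized

billed_to_client = it_energy_kwh * contracted_pue * tariff
actual_power_cost = it_energy_kwh * actual_pue * tariff

print(f"under-recovery this month: {actual_power_cost - billed_to_client:,.0f}")
# -> under-recovery this month: 105,000
```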

Ultimately, what matters is the real PUE3, measured over 365 days at the current client IT load, and a 100% leased-out co-location data center, which means full endorsement by the clients. Nothing speaks better than the ka-ching of the cash register; no billboard outside will take money out of the wallets of potential clients. It is the design, equipment selection, measurement and reporting, tight operations, continuous monitoring and enhancement, and people that together make a well-run, well-respected data center facility with a happy clientele that grows the co-location business. Playing with dPUE gets some attention, but delivering the service consistently and having clients take up more of your data center space is the real indicator of a healthy data center business.

It is my hope that energy-efficient data center awards will be based on actual PUE rather than designed PUE.

Reference:

  1. http://www.ccgfacilities.com/insight/detail.aspx?ID=18
  2. https://www.greenbiz.com/article/new-efficiency-standard-challenges-data-center-status-quo
  3. http://www.datacenterknowledge.com/archives/2009/07/13/pue-and-marketing-mischief/
  4. ISO/IEC 30134-2    Part 2, Power Usage Effectiveness (“PUE”) – http://www.iso.org/iso/home/store/catalogue_tc/catalogue_tc_browse.htm?commid=654019

Human Error: The biggest challenge to data center availability and how we can mitigate it – Part 1


The 2016 Ponemon Institute research report on the cost of downtime (reference 1) contains a chart showing the causes of data center downtime. It puts accidental human error at 22%, with the top six contributors being UPS system failure (25%), cyber crime (22%), accidental human error (22%), water/heat/CRAC failure (11%), weather (10%), and generator failure (6%). However, the accidental human error figure does not account for latent human error that may have contributed to those UPS/CRAC/generator failures.

[Figure: Ponemon 2016 root causes of unplanned data center outages]

The Uptime Institute has cited that 70% of data center outages can be attributed to human error.

The definition of human error is broader and can generally be classified into active error (where a deliberate action causes a deviation from the expected outcome) and latent error (where a non-deliberate action causes the deviation). An example of a latent error is a design decision on the power protection scheme for a data center room that is not fully coordinated, so a downstream fault is not isolated and cascades upstream to trip higher-level circuit breakers.

There have been many major outages in the past few years attributed to human error. The 2016 Delta Air Lines data center outage is reported to have cost USD 150 million. Part of the reason for the long (3-day) delay in resuming service is that a significant part of their IT infrastructure was not connected to a backup power source, which begs the question of why it was set up that way. It was most likely a latent error, where the IT equipment or the in-rack PDUs were not fed from two separate UPS systems or supported by an in-rack ATS.

During a presentation on this subject I was asked whether a data center designed and implemented to a higher tier level, i.e. higher resiliency, can minimize the issue of human error. My answer: you can design and implement 2N power and cooling infrastructure, but when one N is taken down for maintenance, any mistake or weakness (inexperienced operations staff or vendor personnel, a procedure gap where someone overlooks something and makes the wrong guess, etc.) can take down the IT load. This has happened to many data centers (do a Google search on human error and data center outage incidents).

[Figure: Swiss cheese model of accident causation]

There are multiple ways for human error to manifest in a data center outage: a simple external trigger that passes through aligned holes as in the Swiss cheese model above, a cascade (combination of factors), or a direct active human error.

As an example of a cascade: a lightning strike causing a momentary power dip (see the references below) should not cause an outage in a data center; however, if the selection of circuit protection devices or the design did not cater for how the DRUPS would respond in such a situation, and the automated controls were not configured to deal with it, then no amount of SOPs/MOPs/EOPs or Method of Statement and Risk Assessment (MOS-RA) may protect the facility against that particular external trigger. In one case at a data center in Sydney, circuit breakers that were not designed and selected for such a scenario caused the UPS to feed power back to the grid instead of to the load.

As for direct human error, I know of a case where a manufacturer-trained and authorized UPS service engineer caused an outage: the engineer did not follow the documented service manual and tripped the entire UPS set, and because the circuit protection devices could not isolate the fault downstream, the upstream incoming breaker tripped as well. This is part of the reason why data center staff should accompany and question the service engineer at critical checkpoints during servicing of critical infrastructure.

An outage can also be a failure of the resilient design / implementation due to under-capacity. This can be traced to latent error (no tracking of actual power load versus designed capacity) or active error (no check of UPS capacity before maintenance). For example, the actual power load on an N+1 UPS set grows until it has effectively become N, and when one of the UPS modules goes down, the entire UPS set shuts down.
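
A minimal sketch of the kind of check that catches this creeping loss of redundancy (the module rating and load below are assumed figures for illustration):

```python
# Illustrative N+1 redundancy check; module rating and load are assumed figures.
module_rating_kw = 500.0     # rating of each UPS module
modules_installed = 4        # designed as N+1 with N = 3
current_it_load_kw = 1600.0  # actual load after racks filled up over time

def maintenance_margin_kw(load_kw, modules, rating_kw, modules_out=1):
    """Headroom left when `modules_out` UPS modules are taken out of service."""
    return (modules - modules_out) * rating_kw - load_kw

margin = maintenance_margin_kw(current_it_load_kw, modules_installed, module_rating_kw)
if margin < 0:
    print(f"NOT safe to take a module out: short by {-margin:.0f} kW (N+1 has become N)")
else:
    print(f"Safe to take a module out: {margin:.0f} kW headroom remains")
# -> NOT safe to take a module out: short by 100 kW (N+1 has become N)
```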

In the next post, measures to mitigate the risk of human error will be discussed.

References:

  1. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  2. https://aws.amazon.com/message/4372T8/
  3. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  4. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/