From the trenches of Asia data center build projects


My colleagues and I have run into a range of issues on data center build projects in China and elsewhere. This post shares some of those experiences and the lessons learnt. To protect the people involved, some specifics (e.g. country or location) are obscured, without affecting the message being shared.

Labor Strike

About halfway into the build programme for a data center in a city in the South Pacific, the electricians’ union called a strike because compensation negotiations with the electrical installation companies over the four-year collective agreement had broken down. The strike went on for 10 full days and 7 half days. On the half-day strikes, the electricians worked until 11:30am and then walked off. Our project schedule had a very small buffer. The client was apprised of the strike, but their legal counsel did not agree that it fell under the force majeure condition of our agreement. We managed to catch up by paying for some overtime days after the strike was over, the client’s representatives delayed their take-over date by four precious days, and we avoided a penalty. We should look out for such collective agreements coming up for renewal and plan a bit more buffer whenever they do.
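To put the strike in schedule terms, here is a minimal sketch of the arithmetic. It assumes each half-day stoppage costs half a working day and that whatever the take-over extension did not absorb had to be recovered through overtime; both are simplifications of mine, and the function names are hypothetical.

```python
def working_days_lost(full_days: int, half_days: int,
                      half_day_fraction: float = 0.5) -> float:
    """Working days lost to a strike with full-day and half-day stoppages.

    half_day_fraction is an assumption: walking off at 11:30am is treated
    as losing half of a working day.
    """
    return full_days + half_days * half_day_fraction


def days_to_recover(lost: float, extension_days: float) -> float:
    """Days still to be recovered (e.g. by overtime) after a deadline extension."""
    return max(0.0, lost - extension_days)


lost = working_days_lost(full_days=10, half_days=7)
remaining = days_to_recover(lost, extension_days=4)

print(f"Working days lost to the strike: {lost}")
print(f"Days left to recover after the 4-day extension: {remaining}")
```

Under these assumptions the strike cost 13.5 working days, of which 9.5 had to be clawed back, which gives a feel for how much buffer a collective agreement renewal can consume.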

Budget

Different countries have different practices when it comes to budget, in particular the buffer. In Singapore, we do not budget for a buffer; the usual arrangement is that if the budget is exceeded by no more than 5%, the overrun is covered from a central fund or some pre-arranged contingency account.

When we worked on a project in Australia, the design firm had to explain to us that the buffer of 10% is expected to be used and savings, if any, are a blessing.

In China, the budget is fixed and cannot be exceeded, and by the final round of budget review by senior management there is no buffer left in it. The other thing about the budget number is that almost all data center equipment vendors and contractors will learn of it, somehow.
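The three practices above can be compared side by side in a small sketch. It is only a simplification of the anecdotes in this post, not a statement of any official rule: the class, the table, and the verdict strings are all hypothetical names of mine, and the percentages are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetPractice:
    built_in_buffer: float       # buffer inside the approved budget (fraction of base cost)
    external_contingency: float  # overrun a central/contingency fund may cover (fraction)


# Assumption: figures taken from the anecdotes in this post, nothing more.
PRACTICES = {
    "Singapore": BudgetPractice(built_in_buffer=0.00, external_contingency=0.05),
    "Australia": BudgetPractice(built_in_buffer=0.10, external_contingency=0.00),
    "China": BudgetPractice(built_in_buffer=0.00, external_contingency=0.00),
}


def overrun_verdict(country: str, base_cost: float, actual_cost: float) -> str:
    """Classify an actual cost against the budget practice of a country."""
    practice = PRACTICES[country]
    approved = base_cost * (1 + practice.built_in_buffer)
    if actual_cost <= approved:
        return "within budget"
    if actual_cost <= approved * (1 + practice.external_contingency):
        return "covered by contingency fund"
    return "over budget"


for country in PRACTICES:
    # Same project everywhere: base cost 100, actual cost 104.
    print(country, "->", overrun_verdict(country, base_cost=100, actual_cost=104))
```

The same 4% overrun is a non-event in Australia, a contingency claim in Singapore, and a problem in China, which is the point of the comparison.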

While on the topic of budget, the budget for data center design contract is roughly the same in Singapore and Australia, which is around 3-5% of the whole project cost.

In China, the building regulations give a non-mandatory guideline that design contracts should also be 3-5%. However, competition for contracts is such that the awarded design contract normally comes in under 3%, sometimes not even 1%. Quality and resource allocation by the design firm suffer as a consequence of this cut-throat pricing among the data center mechanical and electrical design companies in China. For example, they will send a technical liaison on-site only once or twice during the build phase, just for project meetings, because the norm in China is that the main contractor takes over the drawings and is responsible for them thereafter; the design company endorses the drawings after the build programme is done, with very little work to verify or inspect. More on this in a future post.

Electricity Supply (last mile)

Securing the power supply for a data center facility is critical. Our Beijing project, in an industrial park, had it secured through another company that acts purely as a go-between with the power company supplying the industrial park. We were promised that the MV electrical cables would be laid and connected within six months. One month before the six months were up, the power company called a meeting and invited a third-party electrical cabling contractor to attend. We were told that the power company required this contractor to be engaged, through us or our middleman company, to finish the last mile of electrical cabling from the nearest power substation to our premises, and that we were to pay an additional 1 million RMB to this new party. We were shocked. We tried various ways to escalate and to negotiate the new party away, which took another half a year before we relented and found the budget. Worse, a new problem cropped up afterwards: the power company said that since we had delayed in contracting the new party, the cabling path had to be changed, because the previously planned path to our premises had been given to another cable for a separate project. The outcome was that we and the owner of the adjacent building complex were to share an MV transformer room on the ground floor of one of our neighbor’s buildings, with the MV cable running from their complex to ours. Again, it was a take-it-or-leave-it attitude from the power company. Our project ended up taking three years to complete. So China speed can be fast, but it can also be slow and end in a lousy compromise.

New Rules retrospectively applied to existing site

On the same Beijing project with the last-mile electricity supply obstacle, we had another problem: new regulations imposed after the Beijing TV tower fire incident (2009) required two sources of water for fire fighters to use. All buildings being retrofitted had to comply. When we bought the former textile building complex and submitted our plan for change of use, the industrial park management and the fire service bureau issued a directive that we were to comply with the new rules. The complex as-is had only one water pipe, which is insufficient provision under the new regulations. We could either dig a well connected to a swimming-pool-size water storage tank, or have two swimming-pool-size water storage tanks. We did the latter. Our green grounds were dug up and we put in two 30m x 20m x 4m tanks with associated water pumps.
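To get a sense of what "swimming pool size" means here, a quick sketch of the gross tank volumes follows. It uses the stated outer dimensions only (usable volume would be lower once freeboard and structure are accounted for), and the Olympic pool comparison of 50m x 25m x 2m is my own reference point, not part of the regulation.

```python
def tank_volume_m3(length_m: float, width_m: float, depth_m: float) -> float:
    """Gross volume of a rectangular tank in cubic metres."""
    return length_m * width_m * depth_m


per_tank = tank_volume_m3(30, 20, 4)
total = 2 * per_tank

# Comparison assumption: an Olympic pool of 50 m x 25 m at 2 m depth.
olympic_pool = tank_volume_m3(50, 25, 2)

print(f"Per tank: {per_tank} m3, both tanks: {total} m3")
print(f"Each tank is {per_tank / olympic_pool:.0%} of an Olympic pool")
```

Each tank holds roughly 2,400 cubic metres gross, so the description is not an exaggeration.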

On-site inspection is a must

On two separate occasions while in Shanghai, I inspected two green field sites. One was supposedly zoned and ready, and the draft drawings looked promising, until I visited the site after a two-hour drive. It was farm land, with old single-level residential buildings, village folks still around, and no road built. The plans were just paper; nothing had been done yet. Five years on, the project has still not got off the paper.

In a tier-3 city in Anhui (the tiers are a Chinese classification of cities: tier 1 cities are the likes of Beijing, Tianjin, Shanghai and Shenzhen, while tier 2 are usually the provincial capitals and more developed cities), we were to retrofit the ground and second floors of a 12-year-old factory building. The paper as-built drawings took 5 days to locate from the state-owned conglomerate’s building management company. We were pondering how best to relocate the water sprinkler pipes on the ground floor so that the no-water time would be as short as possible. When the building management office folks came by to look at the sprinkler pipe, they commented that it was dry. I thought it was a pre-action dry sprinkler system, but was puzzled, as a factory building does not require such a complicated system. It turned out I was mistaken. The sprinkler system was supposed to be charged with water, but the pipe had leaks, some pipes or valves were broken, and it had been dry for quite some years.

When the fire service bureau representative came by to review our drawing plans for the retrofitted floor, he did not so much as inspect the relocated water sprinkler system. I reminded myself to look for the nearest exit every day I went to that building.

On a data center site inspection in GuangDong province, I came across a case where the original documentation endorsed by the design company showed dual incoming MV electricity supplies, while in fact the building had only one.

Defect

How each country handles the defect list is different. In China, the defect list is sometimes overlooked if the main contractor’s boss and the project owner’s big boss are friends, and that friendship was the reason for awarding the contract to that main contractor in the first place. When the relationship turns sour for whatever reason, the defect list grows and grows, and it is used as a reason for non-payment of the remaining sums of the project. It is not good to mix business and relationships.

Protest

A few days before the Chinese New Year holidays, a main contractor asked its sub-contractor to hire some workers to stand in front of the gate of a Shanghai data center company and block incoming cars and worker transport buses. The police were called and the protesters were moved to the side of the entrance. After two days, the organized protest stopped as the police warned the sub-contractor of serious consequences. Rumor has it that the sub-contractor’s manager and his colleagues responsible for organizing the protesters were “dealt” with.

Impulsive Decision and Big Grand Upscale (高大上)

Impulsive decision making (拍脑袋做决策) by key decision makers, without expert advice, is a major problem in Chinese organizations. Many cloud data center projects were announced throughout China, but only a fraction have actually been completed and are in operation. Most of those have not reached 50% occupancy and are loss making. One organization I know set its cloud strategy in 2015 to reach 30,000 racks of capacity in 3 years, without regard to market demand. Worse, the cities where it planned to build its large scale cloud data centers (6,000-10,000 racks each) are less developed cities that do not need such scale, and no consideration was given to competitors, who are not sitting still. The leaders want big, grand projects; the saying in Chinese is 高大上.

Finding the Gems among all the “promising” projects

In China and elsewhere in Asia, it is a challenge to sift through all the plans announced by local governments, cloud players, and developers of large data center parks to find the gems, the worthy projects. There are data center projects that are more sure-footed, and data center service providers that are on firmer ground and growing from strength to strength.


Rebel forces in the China Data Center Market


The headline picture shows XinShui High Tech Park in LiangJiang New District, ChongQing, where five data center buildings are built or under construction. Two of them are by a server manufacturer called Inspur.

Chinese server manufacturers are going a completely different route from non-Chinese server manufacturers. The likes of IBM, HP and Dell offer their services in the rack (i.e. servers) and on top of the hardware (i.e. software and services), and are most definitely not building data centers with their own money.

In China, it is the other way around. The Chinese server manufacturers are funding and building data centers throughout China in a frenzy!

The following five server manufacturers are Chinese owned companies:

  •  华为 (Huawei)
  • 浪潮 (Inspur)
  • 曙光 (Sugon)
  • 紫光 (Unisplendour – majority owner of H3C)
  • 中兴 (ZTE)

Besides all being server manufacturers, they have each built multiple large scale colocation data centers in China.

Huawei needs no introduction, and perhaps neither does ZTE. The rest of the server manufacturers mentioned above are relatively unknown outside of China, but they are well known within it.

Huawei has built more than three cloud data center facilities in China (reference 1, 2). For the data center facility sector, Huawei has a containerized data center solution and its own UPS and CRACs/CRAHs. LV switchboards, back-up generators and chillers are sourced from external parties and re-badged as Huawei. It also has its own Huawei BMS software.

Both ZTE and Sugon are state-owned enterprises, while Inspur has transitioned from state to private ownership, though a portion of its shares is still owned by the Chinese government.

Inspur has built a data center facility with a capacity of 8,000 racks in ChongQing for China Unicom under a build-operate model (reference 3). Inspur has announced plans to build seven large data center facilities and 50 smaller ones throughout China (reference 3, 4).

According to news reports, Apple will use Inspur to build and operate a data center in China (reference 5).

Sugon is well known in China for its supercomputer clusters, but it also has a server product line. It has built supercomputer data centers in three cities (Wuxi, ChengDu, and Nanjing) and cloud data centers in many others. It has announced plans to cover 100 cities. It has also ventured outside of China, building a data center in Slovenia for the Slovenian ICT company Arctur (reference 6, 7).

Unisplendour is publicly listed on the Shenzhen stock exchange. It bought a 51% stake in H3C from HP and started making and selling HP servers branded as H3C in China. Unisplendour announced that it will spend 2.2 billion RMB to acquire or build data centers. It has announced a plan to build a data center facility in the FangShan district of Beijing, with a built-up area of 39,725 sqm (about 427,500 sqft) (reference 8).

ZTE has built a data center facility with a capacity of 13,000 racks in ChongQing for China Mobile under a build-operate model (reference 9). ZTE states explicitly on its website that it provides a total data center solution, from planning and building all the way to hand-over (reference 10). It has also been looking outside of China to build data centers, to meet the demand of Chinese companies that need to set up network points-of-presence (POPs) or host their applications and data.

For example, in ChongQing, there are at least two server manufacturers involved in building and managing data centers on behalf of China Unicom and China Mobile, which in turn are for cloud service providers like Baidu and Tencent.

This has been going on for more than 4 years, and thus quite a number of data center facilities have been completed. Inspur has been leading the charge, followed by the rest, and not surprisingly Huawei was among the last to join (Unisplendour is slow because it only acquired its 51% share of H3C in 2016). Huawei was probably hesitant to encroach on the IDC space but could not forgo the market share and potential revenue.

In China, the background and the incentives are vastly different from elsewhere, which is why these manufacturers have entered the data center build-operate-(optional)transfer cycle. The data center builds come under an umbrella or framework package deal in which the city or district government wants a public-private partnership headlined by a smart-city or government-cloud initiative, which usually ties together the following:

  • smart-city projects – communities (residents and businesses) projects – employment (skilled manpower) – infrastructure (cloud, network, servers, storage) – data center

So the data center underpins the whole deal, and the server manufacturers see the volume of network, server and storage equipment required (tens of thousands of racks means roughly 100,000 servers, an investment several times that of the data center core and shell plus its electrical and mechanical infrastructure).

The server manufacturers will work with partners to deliver the entire package, or supplement it with their in-house capability to develop some of the big data / cloud solutions. So these server manufacturers view these huge projects as opportunities for growth. The other thing is that in China there are very few big third-party co-location data center service providers with the reach and deep pockets of these mostly state-backed server manufacturers. This creates a unique situation for the Chinese server manufacturers to enter the data center build-operate-(optional)transfer model. Note that these data centers are built by appointed main contractors and not, strictly speaking, by the server manufacturers themselves.

Intertwined in this mix, Baidu, Alibaba and Tencent (BAT) are also among the partners of the Chinese server manufacturers in those smart-city projects, so the server manufacturers are not necessarily squeezing BAT out of them. For example, the Inspur-built data center in ChongQing LiangJiang New District is known as the Baidu ChongQing LiangJiang Data Center.

How this will play out in the future remains to be seen. For the moment, it is worth noting that the China server manufacturers can be a channel or a partner to work with in the China data center market as well as overseas.

Reference:

  1.  http://mt.sohu.com/20170308/n482702087.shtml
  2. http://news.eastday.com/eastday/13news/auto/news/china/20170209/u7ai6476772.html
  3. http://www.liangjiang.gov.cn/Content/2015-06/19/content_90027.htm
  4. http://www.ebrun.com/20150716/141014.shtml
  5. http://www.gold678.com/dy/P/67321072
  6. http://cn.chinagate.cn/news/2016-11/16/content_39716087.htm
  7. http://www.jifang360.com/news/2014422/n402358123.html
  8. http://news.idcquan.com/news/60548.shtml
  9. http://fiber.ofweek.com/2016-09/ART-210022-8120-30033419.html
  10. http://www.zte.com.cn/cn/solutions/cocloud/201508/t20150825_443862.html

A list of Singapore Colocation Data Centers


For an enterprise looking for a data center co-location service provider, a Google search will yield pages and pages of links.

Apparently, there are a few websites that list the data centers (see reference 2 to 5), but some of the data center colocation service providers listed there are no longer available. This post lists those that offer co-location service to enterprises, along with the web page that contains their contact information. The list only contains data centers that are ready to take in a customer’s single rack full of IT gear.

This post will help an enterprise that is searching for a Singapore data center co-location service provider to house at least one rack of IT gear, and is not specifically looking for managed services. We do not distinguish between providers that offer whole racks, private cages, or private data halls. If the enterprise is looking for colocation plus managed services, then the likes of HPE, IBM, Dimension Data and Atos-Origin will be able to bundle a hosting facility (be it a private suite / data hall or a private cage) from one of the service providers below with their managed services.

Some of the data center co-location service providers listed below have more than one data center facility in Singapore.

1. 1-Net Singapore

http://www.1-net.com.sg/

2. AIMS

http://www.aims.com.my/co-location/data-centre-locations/singapore/

3. Ascenix Singapore

https://ascenix.net/index.php

4. AT&T

https://www.corp.att.com/worldwide/att-you-singapore.html

5. CenturyLink

http://www.centurylink.com.sg/business/enterprise.html

6. Colt

http://asia.colt.net/services/data-centre/about-colt-data-centres/sgdc1/

7. CyrusOne

https://cyrusone.com/enterprise-data-center-services/colocation-solutions/cabinets/

8. DataPipe

https://www.datapipe.com/data_centers/singapore_one_data_center/

9. Digital Realty Singapore

https://www.digitalrealty.com/data-centers/singapore/

10. Epsilon

http://www.epsilontel.com/

11. Equinix Singapore

http://www.equinix.sg/locations/singapore-colocation/singapore-data-center/

12. Fujitsu

http://www.fujitsu.com/sg/services/infrastructure/data-center-services/

13. Global Switch

http://www.globalswitch.sg/locations/singapore/

14. InterNap

http://www.internap.com/data-centers/data-center-locations/singapore/

15. IO

https://www.io.sg/data-centers/singapore-data-centre/

16. KDDI Telehouse Singapore

https://www.telehouse.com.sg/

17. Keppel Data Centres

https://www.keppeldatacentres.com/

18. Kingsland

http://kingslanddatacenter.com.sg/

19. LeaseWeb

https://www.leaseweb.com/platform/data-centers/SIN-11

20. M1

https://www.m1.com.sg/business/datacentrehosting

21. NewMediaExpress

https://www.newmediaexpress.com/colocation.html

22. NTT

https://sg.query.ntt.com/en/contact-us/office-location.html

23. Racks Central

http://www.rackscentral.com/

24. SingTel

https://www.singtel.com/business/products-and-services/singtel-data-centre-services

25. Starhub

http://www.starhub.com/business/cloud-and-data-centre/data-centre.html

26. ST Electronics DCS

http://www.steedcs.com/

27. ST Telemedia

https://www.sttelemediagdc.com/

28. T-Systems

http://www.t-systems.com.sg/abouttsystems/about-t-systems-data-centre-managed-network-services-hosting-services/621404

29. Tata Communications

https://www.tatacommunications.com/products-services/enterprises/cloud-hosting

30. Telekomunikasi Indonesia International (Telin SG)

http://telin.sg/

31. Telstra

https://www.telstraglobal.com/sg/products/cloud/colocation

32. ViewQwest

http://corporate.viewqwest.com/products/colocation.html

33. Icon-Webvisions

http://www.iwv.com.sg/services/colocation/

34. WebZilla

http://www.webzilla.sg/colocation/

Note: some of these data center co-location service providers are themselves leasing a physical data hall from a large scale data center co-location service provider.

Just for completeness’ sake, here are some hyperscale data centers, i.e. data centers built by enterprises that offer cloud services to other enterprises. Given that they have a cloud set-up in Singapore, there is a speed advantage in considering cloud services from the following providers:

a. Aliyun aka Alibaba Cloud

https://intl.aliyun.com/

b. AWS

https://aws.amazon.com/ (Singapore is one of the Regions that you can pick)

c. Microsoft Azure

https://azure.microsoft.com/en-us/regions/ (Singapore is one of Azure’s regions)

d. Google (Jurong West)

https://cloud.google.com/about/locations/ (Singapore is one of Google’s cloud regions)

e. Softlayer

http://www.softlayer.com/data-centers

Reference:

1. https://www.google.com/maps/d/viewer?mid=1EG49nDzDZ-NLjVyqbkmDLS4V5TE&hl=en&ll=1.3304834515591697%2C103.83486799999991&z=12

2. http://www.datacentermap.com/singapore/singapore/

3. https://cloudscene.com/search/data-centers?searchTerm=Singapore&pDc=1&pSp=1&pFb=1&pMar=1&sDc=providers&sSp=pops&sFb=markets&sMar=facilities

4. http://wiredre.com/singapore-data-center/

5. https://www.datacenter.sg/


A list of data center mechanical and electrical design consulting firms in Singapore


On two separate occasions, a common question came up: which data center mechanical and electrical design consulting firms can one call for submissions? The people asking knew of at most one or two.

Well, quite a lot, apparently. The firms listed below have each done at least one data center project or data center technical review in Singapore:

  1. ARUP
  2. Aurecon
  3. Bescon Consulting Engineers
  4. Cundall
  5. DSCO
  6. HurleyPalmerFlatt
  7. I 3 Critical Facilities
  8. J Roger Preston
  9. M+W Group
  10. Meinhardt
  11. NTT Facilities (formerly Pro-Matrix)
  12. Plan One Engineering Services
  13. RED Engineering
  14. SJ Thames
  15. TW International Counsel
  16. Wah Loon Engineering
  17. Worley Parsons

I am not associated with any of the above companies in a business capacity.

You can easily obtain the contact information from a Google search for these companies’ websites. If you would like another mechanical and electrical design consulting firm to be listed, please let me know, with a project reference (just the project name), and I will update the list as and when I am able to.

Reference:

  1. http://www.datacenterjournal.com/selecting-a-data-center-consultant/
  2. https://www.linkedin.com/pulse/making-sense-data-center-standards-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  3. https://www.bca.gov.sg/PanelsConsultants/panels_consultants.html

On the same page – data center co-location market segments and terms


In China, Internet Data Center (IDC) is synonymous with all the following terms:

  •  Wholesale co-location data center
  • Multi-tenanted co-location data center
  • Retail co-location data center
  • Webhosting

They don’t classify the IDC into the above terms – simply just IDC.

The data center industry sometimes drops the term co-location, i.e. wholesale co-location data center = wholesale data center.

I believe these separate terms were created by us, the data center service providers, ourselves.

I believe the different market segment terms came about because of the sheer size of the US market, where these somewhat distinct segments formed; they tend to differentiate along the scale of the client’s data hall requirements.

  • Enterprise aka in-house data center
  • Wholesale co-location data center (preference to have one client per building)
  • Multi-tenanted co-location data center (preference for 1 customer = 1 data hall = 1 private vault =1 private data room)
  • Retail co-location data center (One rack or within one rack, includes dedicated server hosting)
  • Webhosting aka managed hosting, shared server hosting, and virtual server hosting
  • Cloud Data Center
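The three co-location segments in this list differ mainly in the unit of space a single client takes. A minimal sketch of that mapping, based only on the descriptions in the bullets above, could look like the following; the enum and function names are hypothetical, and real providers draw these lines differently.

```python
from enum import Enum


class Footprint(Enum):
    """Unit of space a single client asks for (hypothetical categories)."""
    RACK_OR_LESS = "one rack or part of a rack"
    DATA_HALL = "one private data hall, vault or room"
    BUILDING = "one whole building"


class Segment(Enum):
    RETAIL = "retail co-location"
    MULTI_TENANTED = "multi-tenanted co-location"
    WHOLESALE = "wholesale co-location"


# Assumption: a one-to-one mapping taken from the bullet descriptions above.
SEGMENT_BY_FOOTPRINT = {
    Footprint.RACK_OR_LESS: Segment.RETAIL,
    Footprint.DATA_HALL: Segment.MULTI_TENANTED,
    Footprint.BUILDING: Segment.WHOLESALE,
}


def segment_for(footprint: Footprint) -> Segment:
    """Return the co-location segment that typically serves a footprint."""
    return SEGMENT_BY_FOOTPRINT[footprint]


for footprint in Footprint:
    print(f"{footprint.value} -> {segment_for(footprint).value}")
```

Enterprise, webhosting and cloud data centers are left out on purpose, since they are not distinguished by the size of a tenant's space.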

Let me put this into perspective. With the exception of enterprise data center, the co-location data center terms are from the perspective of the data center co-location service providers and not usually from the perspective of clients.

The big boys are the large-scale players, the likes of DRT (Digital Realty) and Equinix, but there are lots of data center co-location service providers in the US that those of us outside that market are not aware of. In the end, DRT still accounts for less than 10 percent of data center white space in the US, so it is a big pie with a lot of service providers slicing and dicing it.

There is little global agreement on the definition of these terms, though. A particular type may be dominant in a particular country because clients there are only looking for, say, retail co-location data centers. In Jakarta, while dedicated data center buildings are being planned, most data centers are retail and multi-tenanted facilities in mixed-use buildings, because demand is large in aggregate but comes from lots of small and medium-sized clients, while most large enterprises prefer self-built, in-house facilities.

There is no agreement on the term Cloud data center, since it can exist in any of the data center market segment types.

This is perhaps one of the reasons why there are very few campus-scale wholesale co-location data centers in China: customer demand has not reached that scale yet, and the way land is allocated in China does not favor data center industry parks, as local governments prefer job creation while data centers ultimately do not employ a lot of workers. They do like cloud data centers, as they perceive that these will employ lots of IT engineers and programmers.

Clients do not care how you position your data center in the market by size; they only care whether your data center space meets their requirements in terms of technical and financial attributes.

While the US, and thus the US-based data center co-location service providers, have enjoyed growth in demand for large scale data centers, the rest of the global market may not see the same level of demand.

In Europe, when I attended the London edition of the DCD conference in 2016, the European multi-tenanted co-location service providers were giving feedback to the OCP workshop organizers that their data center space generally does not cater for an entire facility meeting the average power requirements of OCP racks. In China, only Alibaba has gone the route of wanting large scale, while Baidu and Tencent are managing their data center space growth via a presence in multiple buildings run by multiple multi-tenanted co-location service providers. A large scale data center park builder in Hebei has faced the problem of building ahead of predicted demand, with few takers for its 3 already-built data center buildings.

I worked at a data center co-location service provider that started off offering retail co-location, i.e. one rack or a sub-division of a rack, and then moved into multi-tenanted co-location. We had clients of all types, and they did not really care how we defined our market position. They only wanted to know whether we would meet their requirements, and for us to submit a price proposal. A rack-mount server fits into a standard IT rack, so whether the rack, the room or the building is shared may not matter to the client.

In some cases, clients may actually prefer smaller data center service providers, who are considered more responsive, while larger enterprises may tend to go for global wholesale data center co-location service providers because they offer a global agreement and a consistent standard. To each their own.

Back in the days of the dotcom bust, data center closures due to overbuilt supply saw then-dominant players like Exodus and Digital Island change hands, scale down, or be sold.

At the end of the day, it is the clients and their needs that define which market segment will grow.

Reference:

  1. https://en.wikipedia.org/wiki/Colocation_centre
  2. https://cyrusone.com/corporate-blog/understanding-the-different-types-of-data-center-facilities/
  3. http://www.technavio.com/report/global-data-center-multi-tenant-wholesale-market
  4. http://www.missioncriticalmagazine.com/articles/88290-report-data-center-colocation-market-annualized-revenue-projected-to-reach-33bn-worldwide-by-end-of-2018
  5. https://structureresearch.net/product/marketshare-report-global-data-centre-colocation/

HUMAN ERROR: The Biggest Challenge to Data Center Availability and how we can mitigate it – Part 2


The previous article on this topic can be found via this link.

The layered approach to upholding data center infrastructure availability should not look like Swiss cheese, i.e. a hazard or trigger should eventually be stopped by one of the layers, preferably as early as possible.

[Figure: Swiss cheese model]

The layers should include the following:

  • Design (in accordance with design intent of owner) with either concurrent maintainability objective or fault tolerance
  • Implementation (in accordance with design brief) and fully tested via comprehensive testing and commissioning phase before handover with fully documented SOPs/MOPs/EOPs
  • Maintenance and Operations Management, work by equipment service providers or any work on site through Method of Statement and Risk Assessment matrix by suitably qualified person/persons
  • Incident and Problem management process, escalation management and mitigation process

and so forth
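A rough way to see why more, and tighter, layers matter is to treat each layer as having some chance of missing a hazard. The sketch below assumes the layers fail independently of one another, which real incidents often violate, and the probabilities are made-up illustrative numbers, not measurements from any site.

```python
from math import prod
from typing import List, Sequence


def breach_probability(miss_probs: Sequence[float]) -> float:
    """Probability that a hazard slips through every layer.

    Assumes independent layers; miss_probs[i] is the chance that
    layer i fails to stop the hazard.
    """
    return prod(miss_probs)


def stopped_at_each_layer(miss_probs: Sequence[float]) -> List[float]:
    """Probability that the hazard is stopped at layer i, having passed layers 0..i-1."""
    stopped = []
    passed_so_far = 1.0
    for miss in miss_probs:
        stopped.append(passed_so_far * (1.0 - miss))
        passed_so_far *= miss
    return stopped


# Hypothetical figures for design, implementation, operations, incident management.
layers = [0.1, 0.1, 0.1, 0.1]
print(f"Chance of a hazard passing all layers: {breach_probability(layers):.6f}")
print("Chance of being stopped at each layer:",
      [round(p, 4) for p in stopped_at_each_layer(layers)])
```

With these illustrative numbers most hazards are caught by the first layer, which is the sense in which stopping the trigger "as early as possible" pays off.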

Inadequacy in any of these layers can result in problems such as:

  • Inherent Design / Setting flaw
    • Outdated / swiss cheese situation
    • Requires analysis and manual intervention
    • Error Producing Conditions (EPC)
  • Weakness in manual processes
    • Inadequate automation
    • Inadequate training / familiarity
    • Inadequate operations procedures
  • Insufficient Information / knowledge
    • Capacity limit reached earlier than design intent
    • Inadequate training / knowledge
    • Inadequate documentations
  • Insufficient Risk Assessment
    • MOS / RA, risk matrix
    • Vendor experience

 

Learn from other industries

Our data center industry is a relatively young industry and there are other industries with mission critical infrastructure that have undergone extensive research and iterative enhancements which we can learn from and adopt.

  • Airlines’ Crew Resource Management
    • Checklists and double checking by pilot and co-pilot of the airplane’s airworthiness
    • Communications within the cockpit, and between the cabin staff and the cockpit, to ensure timely and prioritized response
  • US Nuclear Regulatory Commission
    • Standardized Plant Analysis Risk – Human Reliability Analysis (SPAR-H) method to take account of the potential for human error
  • OECD’s Nuclear Energy Agency
    • Ways to avoid human error, e.g.
      • designing systems to limit the need for human intervention;
      • distinctive and consistent labelling of equipment, control panels and documents;
      • displaying information concerning the state of the plant so that operators do not need to guess and risk a faulty diagnosis;
      • designing systems to give unambiguous responses to operator actions so that incorrect actions can be easily identified; and
      • better training of operators for plant emergencies, including the use of simulators.
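To give a feel for how the SPAR-H method mentioned above puts a number on human error, here is a minimal sketch of its arithmetic as I understand the published method: a nominal human error probability (HEP) is multiplied by performance shaping factor (PSF) multipliers, with an adjustment when three or more PSFs are negative so that the result stays below 1. The function name and the multiplier values in the example are my own illustrative assumptions, not values looked up from the SPAR-H worksheets.

```python
from math import prod

# Nominal HEPs used by SPAR-H for the two task types.
NOMINAL_HEP_DIAGNOSIS = 0.01
NOMINAL_HEP_ACTION = 0.001


def spar_h_hep(nominal_hep, psf_multipliers):
    """Return the human error probability for one task.

    psf_multipliers holds one multiplier per performance shaping factor
    (a value above 1 means that factor makes an error more likely).
    """
    composite = prod(psf_multipliers)
    negative_count = sum(1 for m in psf_multipliers if m > 1)
    if negative_count >= 3:
        # Adjustment for three or more negative PSFs, keeps the HEP below 1.
        return nominal_hep * composite / (nominal_hep * (composite - 1) + 1)
    return min(1.0, nominal_hep * composite)


# Illustrative (assumed) multipliers for a switch-over task done under
# time pressure, by inexperienced staff, who are also fatigued.
hep = spar_h_hep(NOMINAL_HEP_ACTION, [10, 3, 5])
print(f"Estimated HEP: {hep:.3f}")
```

With these assumed multipliers, the estimate for an action task rises from 1 in 1,000 to roughly 1 in 8, which is why removing even one negative factor is worth the effort.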

 

 

Error Reduction Strategy and Error Predictor

In addition, error reduction strategies can be applied in all areas of data center maintenance and operations management to reduce the probability of human error. Whether in the design of the data center power and cooling infrastructure, or in determining the risk of a particular maintenance operation (e.g. a power switch-over exercise to perform UPS or back-up generator maintenance), all of the strategies below should be applied.

Take for example the AWS S3 outage in US-East-1 (http://mashable.com/2017/03/02/what-caused-amazon-aws-s3-outage/): the command set was powerful and a typo could bring down a lot of servers very quickly. AWS said in their post-incident summary (https://aws.amazon.com/message/41926/) that they will limit the speed at which the command and tool take effect and put in safety checks, which is basically an application of the constraint strategy.
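The constraint strategy described above can be sketched in a few lines. This is only an illustration under my own assumptions, not how the AWS tool works: the class, the field names and the limits are hypothetical. The idea is that the tool itself refuses a command that removes too much capacity at once, or that would take the pool below its minimum required level, so a typo cannot do the damage.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request violates a safety constraint."""


@dataclass
class CapacityPool:
    # All names and limits here are hypothetical, for illustration only.
    name: str
    active_servers: int
    min_required: int             # never go below this many servers
    max_removal_per_command: int  # limits how fast capacity can be removed

    def remove(self, count: int) -> int:
        """Remove servers from service if the request passes both checks."""
        if count <= 0:
            raise ValueError("count must be a positive number")
        if count > self.max_removal_per_command:
            raise UnsafeRemovalError(
                f"{count} exceeds the per-command limit of "
                f"{self.max_removal_per_command}"
            )
        if self.active_servers - count < self.min_required:
            raise UnsafeRemovalError(
                f"removing {count} would leave {self.name} below its "
                f"minimum of {self.min_required} servers"
            )
        self.active_servers -= count
        return self.active_servers


pool = CapacityPool("example-subsystem", active_servers=100,
                    min_required=80, max_removal_per_command=5)
pool.remove(5)       # accepted: small step, still above the minimum
try:
    pool.remove(50)  # a typo for 5 is rejected instead of executed
except UnsafeRemovalError as err:
    print(f"Blocked: {err}")
```

The operator can still remove capacity, but only in small steps, and the check runs every time rather than relying on someone remembering a procedure.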

error-reduction-strategy

When a service or repair task is assigned to operations staff, or to qualified technicians of an equipment service provider, evaluating whether error precursors exist and eliminating them will reduce the likelihood of human error. For example, the combination of time pressure, inexperienced staff already at the end of a long work shift, and an ambiguous task objective all contribute to a higher risk for the assigned task. Eliminating or reducing these precursors, for instance by re-directing the task to an experienced staff member at the start of the work shift with a clear task objective, will reduce that risk.
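The evaluation described above can be as simple as a checklist run before a task is assigned. Below is a minimal sketch covering only the four precursors from the example; the field names and the 8 hour threshold for a long shift are my own assumptions, and a real checklist would cover many more precursors.

```python
from dataclasses import dataclass

# Assumed threshold: at or beyond this many hours the shift counts as long.
LONG_SHIFT_HOURS = 8.0


@dataclass
class TaskAssignment:
    # Hypothetical fields, for illustration only.
    description: str
    time_pressure: bool
    staff_experienced: bool
    hours_into_shift: float
    objective_is_clear: bool


def error_precursors(task: TaskAssignment) -> list[str]:
    """Return the error precursors present in this assignment."""
    found = []
    if task.time_pressure:
        found.append("time pressure")
    if not task.staff_experienced:
        found.append("inexperienced staff")
    if task.hours_into_shift >= LONG_SHIFT_HOURS:
        found.append("end of long work shift")
    if not task.objective_is_clear:
        found.append("ambiguous task objective")
    return found


risky = TaskAssignment("UPS bypass for maintenance", time_pressure=True,
                       staff_experienced=False, hours_into_shift=11.0,
                       objective_is_clear=False)
redirected = TaskAssignment("UPS bypass for maintenance", time_pressure=False,
                            staff_experienced=True, hours_into_shift=0.5,
                            objective_is_clear=True)

for task in (risky, redirected):
    found = error_precursors(task)
    print(found if found else "no precursors found, proceed")
```

Anything the checklist returns is a prompt to eliminate, reduce or re-direct before the work starts, not a score to be argued over.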

error-precursors.jpg

Risk Mitigation is a Continuous Process

A multi-pronged, multi-layered approach with attention to detail is required to mitigate the risk of human error causing an outage in a data center facility.

Risk Mitigation Process Flow.jpg

 

A data center should be designed and implemented to a clear and tested design intent (e.g. the objective of the data center being concurrently maintainable). Day in and day out, the operations staff, vendors and client personnel interact with the systems within the data center. So there needs to be a well-oiled system in place, not just documentation, that works 24×7 for as long as the data center is in existence.

An iterative risk mitigation system that relies on consistent management support and attention, and that feeds on knowledge learned from near misses and incidents, is the key attribute of an environment that is resilient in terms of the human aspect.

We Humans can reduce Human Error, but effort is required

We should look at the data center organization, especially the operations team: its resources and tools, its capability, and so forth. A no-blame culture that encourages active participation by all staff in addressing potential weaknesses and error precursors, and in following up on near misses (which are a sign of error-inducing conditions), is important to mitigate the effects of human error. We should get away from pointing fingers and learn from past problems, like what AWS did with their incidents. And our data center industry can do more to share and learn from one another, to prevent the recurrence of issues that were faced and dealt with elsewhere.

This built-up knowledge of good practices should be documented and disseminated, with management support. The weakest link is an inexperienced staff member hesitating or, worse, making a wrong decision, so training everyone on the operations team is critical to maintaining data center availability.

A periodic (for example annual) no-nonsense third party review of data center operations and management, coupled with improvement plans to strengthen the weakest links, will give insight and assurance to data center C-level executives, data center operations managers, and clients. Most operations managers are too busy to review their own data center operations. It is also difficult to find fault with your own work, and experience is limited if the staff have not worked in more than one or two data center sites. A third party operations and management review is therefore the next best thing to enhance resilience against human error, provided it has the full co-operation of the data center staff from top to bottom.

Furthermore, once a data center service provider has grown beyond 2 to 3 data centers, it becomes difficult to manage operations consistently across them, especially if they are managed independently. A third party review applied to all of them will help to rein in inconsistent operations processes, provided of course that there is a central data center operations programme function within the data center service provider.

Therefore, a data center facility is ultimately dependent on well trained and knowledgeable staff who are clear about their data center facility information, or know where to quickly find the documentation that contains the details, and who properly do the risk assessment work of evaluating equipment service vendors or upgrade works.

In summary,

  • It is worthwhile to commit resources to reduce errors
  • We can improve our resiliency and thereby uptime through available methods and tools
  • There are proven methods and tools we can borrow from other mission critical environments
  • Third party data center operations and management review coupled with improvement plan should be considered for large data center operations especially those that have multiple sites

 

References:

  1. https://en.wikipedia.org/wiki/Human_error_assessment_and_reduction_technique
  2. http://news.delta.com/chief-operating-officer-gives-delta-operations-update
  3. https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
  4. http://www.enterpriseinnovation.net/system/files/whitepapers/1_2016-cost-of-data-center-outages-final-2.pdf
  5. https://www.oecd-nea.org/brief/brief-02.html
  6. http://www2.lbl.gov/ehs/training/assets/docs/Error-Precursors.pdf
  7. https://www.linkedin.com/pulse/data-center-human-factor-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F
  8. https://www.linkedin.com/pulse/human-errors-biggest-challenge-data-center-how-we-can-james-soh-%E8%8B%8F%E6%97%AD%E6%B1%9F

2+2000+3000 = 1 big challenge

thousands-of-app-in-a-bank

You read it right.

2 DCs + 2,000 servers with 3,000 applications are going into a new data center in three years’ time, and the man or woman to do it is yet to be found.

I have had the unfortunate experience of a network interruption that caused slow and unacceptable access to hundreds (my estimate then) of online applications in a large enterprise data center. Then again, we were an internal shared co-location data center, so we only counted the projects and departments but never the number of application systems.

 

The picture above is an advertisement calling for a data center migration project manager. It has been put up repeatedly for more than 6 months, and I have seen the indicated salary range go up (from 10k SGD per month to 15k SGD now). The scale and complexity are now fully spelled out too. Previously the advert did not indicate that two data centers are moving into a new one, with 2,000 servers and 3,000 application systems, all to be moved in by 2020 (I presume by phases; it would be nearly mission impossible if done all at once). Don’t forget that system changes will not stay frozen during this period, and new systems and services may still be added to the current data centers, given the inter-dependencies of the systems and data required to introduce them. Hopefully no IP address change is required for any of the systems, and that is only one of many possible things to consider for such a move.

I did one data center move in the mid 1990s, for an enterprise with one mini-computer system as its one and only mission critical system. We overran the planned downtime window of 24 hours by an additional 30 hours, because the new site’s telecommunication lines were digital while the old site’s were analog, so our analog modems were unable to work. We had to bring in new ones, and the migration took place on a Sunday when the vendor’s warehouse was closed.

On occasions, when talking to data center facility owners, sales people and fellow consultants about the mission critical nature of IT for most of today’s enterprises, I have mentioned that medium to large enterprises use hundreds of applications on average.


3,000 applications is one big number. I hope only 10% of them are mission critical, and that the entire application system portfolio is prioritized with the inter-dependencies already mapped out.
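Once the inter-dependencies are mapped out, turning them into migration phases is mostly mechanical. The sketch below is one simple way to do it, under my own assumption that an application moves only after everything it depends on has moved; the application names are made up. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def plan_migration_waves(depends_on):
    """Group applications into waves that can each move together.

    depends_on maps an application to the set of applications it needs.
    Every application lands in a later wave than all of its dependencies.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable plan
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical portfolio, for illustration only.
dependencies = {
    "internet-banking": {"core-banking", "auth"},
    "atm-switch": {"core-banking"},
    "core-banking": {"database"},
    "auth": set(),
    "database": set(),
}

for number, wave in enumerate(plan_migration_waves(dependencies), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

A circular dependency stops the plan with an error, which is useful in itself: those applications probably have to move together in the same downtime window. A real plan would also weigh business priority and downtime windows, which this sketch ignores.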

When a bank data center runs into a problem (see the reference section below), what we see externally is ATMs being down, and counter staff switching over to a back-up system with service becoming slower. What really happens behind the scenes involves a lot more effort to bring the critical applications back into service.

What makes me wonder, though, is this: shouldn’t the bank have identified such a role and brought the person in earlier, when the decision to have a new data center was being made? Wouldn’t having this person, or better a team of data center migration experts, on board from the start be better in all sorts of ways than bringing in someone from outside? And who is to manage the knowledge and thereby mitigate the migration risks? I am pretty sure that the local financial regulator will dive in to audit and assess the bank’s migration plan.

Anyway, I have learnt that numbers like these (thousands of servers and applications) are probably typical for a Singapore bank.

Best of luck to their data center migration.

Reference:

  1. http://www.datacenterknowledge.com/archives/2013/12/16/year-downtime-top-10-outages-2013/
  2. https://www.theregister.co.uk/2017/01/13/lloyds_bank_in_talks_to_outsource_bit_barns_to_ibm/
  3. https://www.linkedin.com/pulse/20140616192008-655694-best-practices-for-data-center-migration