Incident #1

In the middle of last year, while I was in Shanghai conducting a data center training course, a friend and I met at a café and ordered lunch. Something was wrong when the waiter brought three main courses, and we told him we had ordered only two. He said he had also found it strange when he thought we had ordered three mains. He had penciled our orders on his order sheet, which he should have used to repeat our orders back to us. His supervisor came over to apologize for the confusion and said the extra order would be cancelled.

Incident #2

Back in the days when I was working in an IT outsourcing company, I was called out to assist with another site’s investigation of a problem.

A VIP user in the company had accidentally deleted an important email and raised a service request to restore his mailbox from the previous night’s copy. Zilch. Nothing. So the server admin tried the tape from the day before. Nothing.

They went all the way back to a week earlier. Still nothing. They checked the tapes: not only was there no backup of that VIP user’s mailbox, none of the mailboxes on the Exchange server had been backed up. Then they checked the backups of all the enterprise servers, which were linked to a server dedicated to performing this backup for six hours starting from midnight. Nothing. It had been backing up nothing since a year earlier, when they migrated to a centralized backup server using NetBackup software.

The CIO demanded an investigation by the IT outsourcing vendor and I was called onsite.

I asked my senior support engineer, who is a backup expert, to come along. A quick check by him revealed that a check box in the NetBackup software on the centralized backup server was not ticked. When ticked, this check box makes the software perform an incremental backup of the system and data if a full backup already exists; if no full backup exists, it performs a full backup of the entire system and data. In this case there was no full backup in the trail of backups and the check box was not ticked, so the tape drives had been backing up nothing.

A simple daily backup checklist (check the backup log, perform a restore, and test the restored data) would have prevented the problem.
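To make the checklist idea concrete, here is a minimal sketch of what an automated daily check might look like. It is only an illustration: the `BackupJob` record, its fields, and the server names are hypothetical and do not come from NetBackup or any other backup product. In practice the values would have to be pulled from whatever log or report the backup software produces.

```python
from dataclasses import dataclass


@dataclass
class BackupJob:
    """Hypothetical summary of one night's backup job for one server."""
    server: str
    bytes_written: int      # how much data the job reported writing
    restore_tested: bool    # whether someone restored and checked a sample


def daily_backup_check(jobs):
    """Return a list of human-readable problems found in the given jobs."""
    problems = []
    for job in jobs:
        if job.bytes_written == 0:
            # A job that "succeeds" but writes nothing is the failure in this incident.
            problems.append(f"{job.server}: backup wrote no data")
        elif not job.restore_tested:
            problems.append(f"{job.server}: no test restore performed")
    return problems


# Example data (made up for illustration).
jobs = [
    BackupJob("exchange01", 0, False),
    BackupJob("file01", 52_000_000, True),
]

for problem in daily_backup_check(jobs):
    print(problem)
```

Even a check this crude would have flagged the empty backups on the first morning rather than a year later.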

Incident #3

A couple of years ago, a brownfield data center project, i.e. retrofitting an industrial building into a dedicated data center co-location facility, was underway in Beijing. It took more than two years (why the retrofit took so long is another story). Four main parties were involved in the project: the owner’s data center project team (let’s call them A), the main contractor for the project (let’s call them B), a third-party project supervision company appointed by A (let’s call them C), and the data center design firm (let’s call them D), along with various other parties.

Three outdoor chilled water storage tanks were delivered on site and arrangements were being made to install them. The tanks are cylindrical, 8 meters tall with a circumference of 1.5 meters. However, project superintendent C found that these dimensions did not match the last drawings approved by D, which specified 12 meters by 1 meter. The sub-contractor for the chilled water storage tanks (let’s call them E) and main contractor B were adamant that there had been an agreement, in response to a request by owner A, to reduce the height of the tanks (so that they would be shorter than the building), thereby changing the specifications. D said there was no record of such a request, nor had they changed the design. A said there should be one.

The tanks sat on site uninstalled for four full days. Finally, at the insistence of A’s project manager, a face-to-face meeting of all five parties involved (A, B, C, D, E) was held, and the request to change the design of the chilled water storage tanks was found in an email exchange between D and E. The problem was that there were no documented project minutes and none of the parties kept proper records, which caused delays in sorting things out. Time was wasted on this and many other things, which is one of the reasons the project ran late and took so long.

So, to quote a famous line attributed to Ronald Reagan: “Trust, but verify!” And follow up immediately with documentation.


