Blast from the past – a lesson on documentation

Background

I had joined one of the largest IT outsourcing companies in Singapore, which held most of the government IT outsourcing business, as the technical service manager for the national public library. The library had a headquarters, a main branch and 9 large branches, with over 1,000 users in total. When I came on board, I learnt that I had a dozen staff: besides 2 IT engineers and an almost non-existent helpdesk, the rest were IT support engineers assigned one to every two branches, and I had no idea who was where. We gave such bad IT support service that I was surprised we were still holding on to this account. The user IT director had taken back the network planning and support functions, and he was revamping the entire library IT network.

The account had been with our company for a long time, at least 8 years, but there was no documentation of the LAN or the wider WAN. Our company did not take good care of this account; no technical service manager lasted more than a year. Mind you, this was in the days of the transition from mainframe/mini-computer to client-server and LAN technology. None of my predecessors wanted to take on this account, which had already implemented a LAN/WAN and was planning a new one to coincide with moving the library IT system and computer room to another site in the next three months. So this senior IT engineer, i.e. me, suddenly became the pseudo technical service manager; officially there was a technical service manager, but he was transitioning out to another account that needed his mainframe knowledge.

Entire Library System grinds to a crawl

In the middle of the day, frantic calls suddenly came in; one of my senior IT support engineers took the calls and was told that the library book borrowing and returning system was extremely slow. I was next to the computer room which housed the mini-computer running that critical IT system for the main library and all the branches. As the minutes ticked by, the system became almost inaccessible from anywhere outside the computer room's local area network.

All of us were scratching our heads, and my supervisor called in other senior IT support engineers, but they were not able to find the problem. The user IT director gave us stern warnings, both verbal and via email. By the end of the day, when the libraries had closed, the problem still persisted.

The next day, the same problem continued to wreak havoc on the main function of borrowing and returning books, and the libraries switched to manual borrowing and returning, a process that had not been used since the automated IT system went live three to four years earlier.

With my rudimentary knowledge of the main IT system (which ran on Unix), the computer room LAN and the wider WAN linking all the branches to the main library hub (the computer room was in the same main library building), and being the man in charge (no one else had any clue and I was the manager on site), I took the brunt of the users' and the IT director's fire. I ran ping against all the main system's IP addresses and all the branches' networks. The trend was that almost everything timed out; only a few packets got through. Killing non-essential processes on the main IT system did not seem to help much. By Friday evening, my wife came to have dinner with me at 8pm because I had told her I would be there throughout the night trying to solve the problem. My staff Cheng Woon stayed with me until past 11pm, and I persisted.
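In hindsight, that manual ping exercise could have been scripted. Here is a minimal sketch of the kind of packet-loss sweep I was doing by hand, assuming Linux-style ping flags and a purely hypothetical list of addresses (the library's real IPs are long gone from my memory):

```python
#!/usr/bin/env python3
"""Rough ping sweep to report packet loss per host."""
import subprocess

# Hypothetical addresses for illustration only.
HOSTS = [
    "10.0.0.1",   # main IT system
    "10.0.1.1",   # branch 1 link
    "10.0.2.1",   # branch 2 link
]

def packet_loss(host: str, count: int = 10) -> float:
    """Send `count` pings and return the fraction of packets lost."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", "2", host],  # Linux-style flags assumed
        capture_output=True, text=True,
    )
    received = result.stdout.count("bytes from")  # one per successful reply
    return 1.0 - received / count

if __name__ == "__main__":
    for host in HOSTS:
        print(f"{host}: {packet_loss(host):.0%} packet loss")
```

During the incident, almost every host in such a sweep would have shown heavy loss, which by itself already pointed to a network-wide problem rather than a single slow server.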

Only then did I learn that the entire library LAN and WAN was one flat LAN: all the branches were bridged onto the main library LAN via bridge-routers that had never been set to router mode. They were simply left in bridge mode, for reasons never known.

Suspects

By midnight that night, I reasoned to myself that it had to be a network problem: some errant source was blasting packets and flooding the entire main library LAN and WAN. I could not easily find the source, and reducing the load on the main IT system did not help. I had to find the source of the network storm, i.e. cut off the bad branch to give the rest a chance to access the main IT system.

At about 3-4am on Saturday I resorted to the desperate move of pulling network cables one by one from the 5 to 6 LAN switches to see if I could isolate the source, i.e. cut and slash. Slowly but surely, when I split the network into halves, and then halves again, the main IT system became usable for the isolated half (at that time we had not used VLANs; everything was one big LAN). But even with all the network cables pulled out, the symptoms were nearly the same; the problem was still there. That led me to think it had to be in the main library LAN itself, but I had already pulled out all the twisted-pair LAN cables, so what was left?
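What I was doing with the cables was really a binary search over network segments: disconnect half, see if the network recovers, repeat. A small sketch of that idea, with a hypothetical is_healthy() check standing in for "pull the cables and watch ping":

```python
def find_bad_segment(segments, is_healthy):
    """Isolate a misbehaving segment by repeatedly disconnecting half.

    `segments` is a list of segment names; `is_healthy(active)` is a
    hypothetical check that returns True when the network works with
    only the `active` segments connected.
    """
    candidates = list(segments)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        rest = candidates[len(candidates) // 2 :]
        # Disconnect `half`; if the network recovers, the culprit is in `half`.
        if is_healthy(rest):
            candidates = half
        else:
            candidates = rest
    return candidates[0]
```

Of course, the search only works if the culprit is actually among the segments you can disconnect. In my case it was not, which is why the symptoms never fully cleared no matter what I pulled.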

At about 8am it was time for breakfast and coffee, and I had my favourite wanton noodles at the old library coffee shop next to the side entrance of the old library building (it has since been torn down, in the 2000s).

Eureka

I went back and started anew, now looking for the elusive LAN cable. I looked at an old IT rack cabinet; it held a couple of old LAN switches, still running with lights flashing.

Curious, I peered around that IT rack, went behind it, and found a couple of thick Ethernet cables coming out of the back of those two old LAN switches. Thick Ethernet (10BASE5 coax) was used in the late 1980s; no one was using it anymore even by the 1990s. But there they were, probably in place since the 1980s. So I unplugged them, praying for luck, and lo and behold the network problem was gone. All my pings to the main IT system and all the branches suddenly returned to normal: no dropped packets, and all responses within 100ms. Wow. I hung around for another 30 minutes and nothing changed. So I called Cheng Woon to come in and replace me, and I went home to shower and rest. The user IT director was informed, and he said not to change back from the manual book borrowing and returning process over the weekend; things would stay as they were until Monday.

The Apparent Culprit

Everyone was in first thing Monday morning; we tested system access without problems, and the system performed as normal. In the afternoon, the user IT director told me that the thick Ethernet cable had been traced to the basement level of the main library, where some of the main library support staff worked, while others had already been relocated to branch libraries ahead of the closure of the main library building planned a few months later. A subcontractor carrying out removal works had cut some of the cables above the ceiling boards. The thick Ethernet cable was supposed to be non-essential, since it only served old PC connections in that section of the building (the details I do not know, and they do not matter anyway), and if cut properly it would not have caused any issue. But thick Ethernet behaves in such a way that, if cut incorrectly, the conductors can be shorted together, causing the LAN switch to see transmit/receive errors that get broadcast to the entire main library LAN; that was the cause of the problem that plagued the main library system for three days. Just a non-essential cable that was supposed to be removed, but cut the wrong way, was enough to cripple the entire organisation's main IT system and network (libraries in Singapore stay open on weekends).

Tens of thousands of public library patrons were forced to queue at counters manned by staff who had been recalled for the purpose, when previously they had breezed through the automated self-checkout systems. Complaints were published in the newspapers and letters were written in.

I am glad that the library staff were very understanding, but to this day, the first thing I do when I look at any operation is check whether its documentation is comprehensive and up to date.

From this incident, I learnt a few things:

  1. If we had documented the library LAN/WAN, we might have been able to identify the problem much earlier and more easily.
  2. If we had experienced staff who had been with the account and knew the LAN/WAN and the systems, we could have saved the time spent learning what we should already have known. I think this is why the user IT director was not happy with us, though he did not blame me since I had only been on the job for a month.
  3. If we had known where everything was in the computer room and what it was for, we might have found the problem much earlier.

 

The silver lining of this incident, and of the subsequent problems I managed at that site that year, was gaining the confidence and trust of the user IT director and the CIO. I did several things right:

  1. Leave no stone unturned. Dig to the bottom; go back to basics.
  2. Take responsibility. Your people and your users trust someone who takes charge.
  3. Something small can trip something big; the small stuff we ignore will cause big problems.