CASE STUDY: Main Server Crash!

We expected to find a few things wrong when we started working with a client who was running a 15-year-old server.

This case was no exception! During the initial Technical Assessment, where we only scratch the surface of the computer network, we found:

  • 10-year-old “Swiss cheese” firewall (multiple Internet-facing ports open)
  • 15-year-old server running Windows Server 2003 (purpose unknown)
  • Another server running Windows Server 2008 (main file server)
  • Backups of main file server appeared to be failing
  • Hard disk drives (HDDs) on main file server storage array appeared to be failing (see image above)
  • Servers located in same room as the high-security safes (we had limited access to the room)
  • Undocumented IT environment (nobody knew how anything was set up)
  • Weak Wi-Fi password (company phone number)
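
For readers who want to spot-check the first item on that list, here is a minimal Python sketch of a TCP port check. It is an illustration only, not the tool we used during the assessment. The function name `open_ports` and its parameters are hypothetical, and you should only scan systems you are authorized to test.

```python
import socket


def open_ports(host, ports, timeout=0.5):
    """Return the ports in `ports` that accept a TCP connection on `host`."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found
```

Calling something like `open_ports("203.0.113.10", [21, 23, 80, 443, 3389])` against a firewall's public address (the address here is a documentation placeholder) should ideally return an empty list, or only the ports the business deliberately exposes.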

And that was just for starters. It was clear to us almost immediately that the client’s previous IT service provider had neglected to properly upgrade, secure, and maintain the client’s business network for many, many years.

After discussing these issues with the client and making sure they were aware of the dire circumstances, we began forming a plan to get them out of this mess!

Backup & disaster recovery

Since the main server hard drives AND backups looked to be failing, the first task was to create a known-good backup, and to test our ability to restore that backup when (not IF) the server(s) crashed.
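
One simple way to test a restore is to restore the backup to a scratch location and compare file checksums against the source. The Python sketch below shows the idea. It is not the procedure or product we used for this client, and the function names (`file_digest`, `tree_digests`, `verify_restore`) are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def verify_restore(source_root, restored_root):
    """Return (missing, changed): files absent from or different in the restore."""
    src = tree_digests(source_root)
    dst = tree_digests(restored_root)
    missing = sorted(set(src) - set(dst))
    changed = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return missing, changed
```

If both returned lists are empty, every file in the source was restored with identical contents. Anything listed in `missing` or `changed` means the backup cannot be trusted yet.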

The problems began when we discovered that the file server was so crammed with legacy apps (for example, multiple deprecated SQL databases consuming over 90% of RAM) that it barely worked. Our attempt to run a new backup failed, so the next step was to try a reboot to start fresh…

… and that’s when the main file server crashed – and did not reboot.

Down-time and data loss

“The server is down” is never something you want to hear – especially if you’ve been tasked with keeping the server “up”. As suspected, their old backups did NOT work when we needed to use them for disaster recovery.

The client was now at serious risk of extended down-time (end users could no longer access their files over the network) and data loss (all the files were stored on a server that just crashed and was not backed up).

We spent days (and nights) scrambling to recover the client’s data and restore user access to shared files. Amazingly, we were able to recover 100% of their nearly 2 TB (2,000 GB) of data. Thanks to the ingenious work of one of our senior engineers, we hacked together a temporary virtual domain controller and a network-attached storage (NAS) file server in only a few days, which allowed users to resume work fairly quickly (given the situation).

Major down-time and data loss: narrowly avoided! However, the client continued to operate on life support for nearly two (2) weeks while we waited for their new replacement servers to be delivered.

New servers

We installed two new servers, and overall the new server implementation went according to plan. What made the implementation particularly difficult was the *unique* setup of the client’s network (which we were still learning because they were a brand-new client) combined with the technical challenge of minimizing down-time while migrating an old Windows 2003 domain.

It was also challenging to completely remove their old 2003 server because some users were still accessing files on it, and it was acting as the DNS server for their VoIP phones. Finally, we were able to disconnect this hunk of junk (and its failed 2008 partner in crime) from the network – which was something the client’s old IT provider was never willing (or able) to do!

The client’s building somewhat randomly decided to do “maintenance” the same week we started replacing the failed servers. We had to work around an electrical outage that pushed our planned weekend data migration to the following week. This was only one of several surprises we encountered during the new server implementation!

Perhaps worst of all, the whole situation looked really bad. Even after we managed to get a handle on the tech, my fear was that this brand-new client probably thought we broke their servers intentionally, just so that we could charge them extra to install new ones (good news: they didn’t think that!).

Conclusion

Needless to say, this was a tough case. The last part of the project was testing our client’s new backup and disaster recovery (BDR) plan to make sure that it works (more good news – it does!)

Now, if there is ever a serious problem with the server in the future, we have a plan to get the client back up and running quickly (within hours, not days or weeks).

Of course, now that the client has a server that isn’t 15 years old and on the verge of failing, the risk of experiencing serious problems has been significantly reduced!

Author: Kevin S.

Kevin Sanders is a Los Angeles native who has worked in tech support and customer service since 2000. He specializes in professional IT consulting, cloud technology, cyber security, networking and Wi-Fi, hardware/software diagnostics and repair, and custom systems building.