After today’s power outage, I figure it would be beneficial to have a little debrief listing what went well, what didn’t go so well, and some of the things we can do with whatever resources we have to make things better for the next outage.
FAILURES:
- No emergency lights were available anywhere in the building! It was quite hazardous just to walk around.
- The outside generator seems to have powered on, but it never switched over to feed the UPS units inside.
- Even if the outside generator had switched over, it would only have powered half the data center. With our current virtual infrastructure, where everything rests on a few key components, a generator powering only one UPS is as good as no generator at all.
- UPS power lasted about 10 minutes from the time we lost power, after which all servers went down, along with all network equipment.
- There was no mechanism to automatically shut down servers when a power outage occurs. Shutting everything down manually takes far more than 10 minutes and is not a viable solution.
- None of us knew where some of the key breakers for the data center are. When the power was flickering in and out, we had to guess which breakers would completely cut power to the servers in order to avoid frying the hardware.
- After power was restored, it took us longer than it should have to get things back up, because too many components have to be functional before core network services (i.e. DHCP/DNS) are restored for basic access.
SUCCESSES:
- After the power came back up, we were able to bring everything back online and functional in about 20 minutes.
- We managed to make it across the street to get our coffee.
CHANGES REQUIRED:
- Get emergency lighting installed in the building, especially up in the tower area and in the server room.
- Given the lack of budget for drastic changes to our backup power systems, at least do the following:
  - Have the maintenance department take responsibility for guaranteeing the functionality and successful switch-over of our backup generator by performing monthly tests on the equipment.
  - Update/replace the batteries in our UPS systems, so that we have a bit more time to power things down.
  - Look at the APC interface, and perhaps bring in APC, to figure out how to use the auto-shutdown functionality to power servers down on power loss (a rough sketch of the idea appears after this list).
- Get maps of the key breaker locations used to cut and restore power to the data center.
- Deploy a physical server that provides DHCP and DNS, independent of the VMware cluster, so that it can power itself on and stand alone. Having this in place, barring any catastrophic hardware failures, could bring our server recovery time down from 25-30 minutes to about 5-10 minutes.
- Get a hamster-operated coffee maker.
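For what it’s worth, here is a minimal sketch of what that auto-shutdown could look like, assuming apcupsd’s apcaccess utility is available on the host attached to the UPS. APC’s own tools (PowerChute / apcupsd hooks) can likely do this for us out of the box, so this is only to illustrate the idea; the host names and shutdown order below are placeholders, not our actual servers.

```python
#!/usr/bin/env python3
"""Sketch: shut servers down in order once the UPS reports it is on battery."""
import subprocess
import time

# Placeholder hosts, least critical first; the VM hosts go last.
HOSTS_IN_SHUTDOWN_ORDER = [
    "app1.example.lan",
    "app2.example.lan",
    "vmhost1.example.lan",
]

def ups_on_battery() -> bool:
    """Return True if apcaccess reports the UPS is running on battery."""
    out = subprocess.run(["apcaccess", "status"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith("STATUS"):
            return "ONBATT" in line
    return False

def shut_down_hosts() -> None:
    """Issue remote shutdowns over SSH, least critical hosts first."""
    for host in HOSTS_IN_SHUTDOWN_ORDER:
        subprocess.run(["ssh", host, "shutdown", "-h", "now"])

if __name__ == "__main__":
    while True:
        if ups_on_battery():
            shut_down_hosts()
            break
        time.sleep(30)  # poll the UPS every 30 seconds
```

In practice we would rather hang this off the UPS vendor’s own event hooks than maintain a homegrown watchdog, but the sketch shows how little it would take to avoid another round of hard power-offs.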
It’s unfortunate that the list of failures is much longer than the list of successes, and it’s also unfortunate that we will still be in this boat, running a huge risk of losing big chunks, if not all, of our data if such an outage happens again. Until we either have a big fund available to revamp our data center, or can build a new one to code, we will have to live with this and improve on the failures in small ways, as suggested above.