Right now I'm waiting on a QA person to verify the code push that just happened, which seems like no big deal ...until I mention that half the team was unable to connect for most of the time. I've come to find out that things happen that you NEVER see coming, and I'm talking about really oddball things you can NEVER plan for. This is a perfect example.
This evening, after making contact with everyone involved, I disabled the load balancer and let the others know it's time to do their thing. One at a time, each server went down, files were being copied, everything was as it should be, noooo problems ...until about 1 hour into it, I lost my VPN connection. Not good. Why? Last time this happened it was (still is) a huge fire causing all sorts of pain and suffering, and it happened right at 1am on the dot. I messaged the other two who were connected and they reported no problems. About that time I asked the 4th person -- they had lost their connection about the same time I did. Long story short, Road Runner must have been having an outage around the same time we were doing ours. All DNS requests resolved, but IP routes fell flat. Traceroutes did their usual thing: hop a few times, then disappear (no ping reply). The other two (AT&T / WOW) had no problems. After about an hour and a half, we were back up.
So how do you plan for this? Indirectly, you do, by having multiple people in multiple places if at all possible. If we had all been at work and that connection had an issue, it would've been a fire sale and people would've been freaking out (rightfully so) -- your app is down in a remote location, you just disabled your network gear, and there's no way to get it back quickly other than to get back the connection you just lost. At the same time, because we had multiple networks at our disposal, we were able to nail down that it WAS indeed Road Runner and not the production systems we thought we had lost. Imagine the conversations about what really happened ...and what the REAL answer was?
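For what it's worth, the reasoning we did over chat can be written down as a rough sketch. This is not a tool we actually ran; the `Report` class, the `diagnose` function, and the sample data are all made up for illustration, and I'm assuming the 4th person was on Road Runner like me. The idea is simply that if at least one network can still reach production, production itself is probably fine.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One person's view of production (hypothetical structure)."""
    person: str
    isp: str
    can_reach_prod: bool


def diagnose(reports):
    """Make a rough guess at where the fault is, based on who can connect."""
    working = {r.isp for r in reports if r.can_reach_prod}
    failing = {r.isp for r in reports if not r.can_reach_prod}

    if not failing:
        return "no problem seen"
    if not working:
        # Nobody can get in: could be production, could be something shared.
        return "production (or something shared) is likely down"

    # ISPs where nobody can connect, while other ISPs work fine.
    only_failing = failing - working
    if only_failing:
        return "likely ISP outage: " + ", ".join(sorted(only_failing))
    return "inconclusive"


# Sample data mirroring that night (ISP of the 4th person is an assumption).
reports = [
    Report("me", "Road Runner", False),
    Report("person 2", "AT&T", True),
    Report("person 3", "WOW", True),
    Report("person 4", "Road Runner", False),
]
print(diagnose(reports))
```

With only one network in the mix, every failure lands in the "production is likely down" branch, which is exactly the panic scenario described above.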