Don’s Corner: More Training = More Uptime

Posted on June 22, 2012

Don Melchert, Critical Facility Specialist

Late in the night at the NOC, an alarm sounds…

John:  Hey, there’s an alarm on the UPS.

Larry:  What is it?  Did you log in?

John:  What’s the password?

Larry:  I dunno, try “GeauxT1gers”

John:   Nope, but now it has 3 alarms.

Larry:  Oh wait…change the S to a Z and try again.  What alarms are showing?

John:  That’s it…I’m in.  I don’t know what all this means, but there’s a lot of red on the screen.  Oh, here it is, ALARMS.  It says… “Input out of tolerance…UPS on Battery…Low Battery Warning…”  We might have lost power.

Larry:  The lights are still on, but I don’t know, did we lose power?

John:  Guess so…wait, now it says… “Battery Weak.”  What do we do?

Larry:  Go to Bypass, maybe, you think?  Do you know how to do that?

John:  Heck no!  Let’s call that UPS guy.  Where’s the book for it?

Larry:  I think it’s in Frank’s cube? If the power is out, won’t the generator kick on?

John:  Maybe it takes a few minutes to kick over?  Never mind, we just lost the network.

In working with Data Centers around the country, I’ve come to realize that scenarios like this happen more often than we care to discuss.  After the damage assessment is performed and the data analyzed, another similarity shows itself: as with most accidents, every one of them was preventable in one way or another.  While you were reading the scenario above, I’m sure many solutions jumped to mind, but the one that would have made the biggest difference, in my humble opinion, is Site Training.  Let’s break it down and see how Site Training would have saved the day, and possibly even spared John and Larry their inevitable butt chewing, or worse.

Speed– Regular Site Training would have given our confused NOC members the familiarity to access the alarming UPS quickly, so they would have had more information in hand to begin fault analysis.  Each member of your IT support team, meaning anyone with access to your Data Center, should be trained in how to respond to an alarm on any piece of critical equipment, not just the servers.  Consider this: if your UPS has a battery runtime of less than 10 minutes, a team member must be able to respond and correct the problem in less time than it takes most people to take a shower.  How long does it take your newest team member to diagnose a fault and know what to do next?
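To put a number on that last question, you could time each team member during a drill and compare the result against your battery runtime. The sketch below is only an illustration: the drill times, the 10-minute runtime, and the idea of budgeting half the runtime for human response are all made-up assumptions, not figures from any real site, and the names `DrillResult`, `within_runtime`, and `needs_more_training` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    """One team member's timed drill (all values are hypothetical)."""
    member: str
    minutes_to_diagnose: float  # alarm sounds -> fault understood
    minutes_to_act: float       # fault understood -> corrective action done

    @property
    def total_minutes(self) -> float:
        return self.minutes_to_diagnose + self.minutes_to_act


def within_runtime(result, battery_runtime_minutes, margin=0.5):
    """True if the drill finished inside the budgeted share of battery runtime.

    `margin` is an assumed safety factor: 0.5 means only half of the
    rated runtime is budgeted for the human response.
    """
    return result.total_minutes <= battery_runtime_minutes * margin


def needs_more_training(results, battery_runtime_minutes, margin=0.5):
    """Names of members whose drill time exceeded the budget."""
    return [
        r.member
        for r in results
        if not within_runtime(r, battery_runtime_minutes, margin)
    ]


# Made-up drill times for the two characters in the scenario.
drills = [
    DrillResult("John", minutes_to_diagnose=6.0, minutes_to_act=5.0),
    DrillResult("Larry", minutes_to_diagnose=2.0, minutes_to_act=2.5),
]
print(needs_more_training(drills, battery_runtime_minutes=10))
```

With these invented numbers, only John lands on the list. The half-runtime margin is a judgment call; weak batteries like the ones in the scenario may deliver far less than their rated runtime, so a tighter budget could be the safer choice.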

Understanding– Properly coordinated Site Training brings to light the idiosyncrasies of your particular data center and how each piece of your NCPI (Network-Critical Physical Infrastructure) depends on the others.  In our scenario, once the UPS screen was accessed, their training would have allowed them to realize they had lost a phase of their utility feed and were rapidly draining their weak batteries into oblivion.  Even if they had forgotten what in the universe an electrical phase is, John or Larry would have at least realized they were missing one.  Having attended their quarterly Site Training, the late night NOC crew might have saved the day by manually starting their generator.  When is the last time your IT and Facilities teams practiced starting the generator and transferring the critical load on the ATS (automatic transfer switch)?

Confidence– Site-Specific Training gives people, the backbone of any critical operation, the ability to push fear and confusion aside, allowing them to see the way out of a bad situation.  John and Larry, having been provided regular Hands-on Site Training, would have been confident enough in their ability to operate the critical equipment that they would not have hesitated to get up out of their chairs and walk up to the alarming UPS to investigate things further.

As an instructor for Data Center University and today with UNS Data Center Institute, I’ve learned that many of today’s most intelligent professionals freeze when faced with the fear of failing in front of their peers.  I’ve practically had to shove students toward the training lab, but open the door for lunch and off they run!  Old habits from grade school die hard, don’t they?  In many cases, both IT and Facilities staff take a “hands-off approach” when it comes to touching their NCPI assets, simply because they are afraid of causing a failure themselves.  Think about it: in the scenario above, what was the outcome?  Exactly!  Confusion and fear resulted in a total failure of their critical network.  Instead of having to explain to the CEO why their company couldn’t take orders for 5 hours, John and Larry’s IT Director could have been praising the speed, understanding and confidence of their IT Team.  Only Site Training, and the hands-on familiarity that comes with it, can give you that.  Their IT Director might have had a lot more fun in the morning meeting, and hey, while the rest of the Execs are still smiling and clapping, that would be a great time to ask for that new In-Row cooling unit…and maybe even some new, comfy chairs for the NOC?

If your organization hasn’t been afforded the opportunity to conduct Site Training in the past few months, or if you’re unsure where to even start when it comes to determining which NCPI assets to train on, never fear, UNS is here to help!

To learn more about Site Training, please visit Universal Networking Services Institute.
