Recently we had a very strange issue occur with a Dell VRTX. The system is a production chassis, less than two years old, running two physical M620 server blades. One blade is a physical Windows Domain Controller (DC); the other is a Microsoft Hyper-V (HV) host. One day, during a routine administrative task, we noticed the DC was “not responding” correctly.
The server was still serving requests for authentication and DNS, but we were unable to log in to the console via Remote Desktop, remote management software, or the Dell iDRAC (integrated Dell Remote Access Controller). The server did not respond to remote shutdown or restart commands, yet it still seemed to be a functioning part of the network, and users were not experiencing any performance issues.
Because the server is a critical component of the network, we held our breath and power cycled it. This is not a preferred method of recovery, but since we had no other way of connecting to the server to gracefully restart the operating system, we rolled the dice, and the server rebooted without issue.
In hindsight we should have sought out expertise at this point, but as a Managed Service Provider we saw the situation as resolved, and other pressing issues immediately took the stage. Our thought at the time was that this was most likely an OS issue and that restarting the server would correct the problem.
The Weirdness Returns
Needless to say, some days later the issue recurred. Once again we power cycled the server, but this time, once it rebooted, we started to receive multiple log events with Active Directory errors: AD not syncing correctly, plus various other errors. We punted: we pulled the server offline, locked down the switch ports to ensure it could not connect to the network, seized the FSMO roles, and rebuilt a virtual server as a secondary server for redundancy.
We called Dell Enterprise Support, and the recommendation was to upgrade a myriad of drivers and firmware across the entire VRTX, including the Chassis, Main-Board, IOMINF, Shared PERC, Backplane, Physical Disks, iDRAC, Lifecycle Controllers, Broadcom, BIOS, PERC, UEFI Diagnostics, and Integrated Switch. Dell’s list of firmware and driver updates was extensive.
To Patch or Not to Patch
This brings us to the subject of this posting: when to patch and when not to patch. In the past we have taken an “if it’s not broken, don’t fix it” approach to firmware and drivers on servers. If we start to experience an issue with a system, then we look at what needs to be upgraded to bring the system up to current release standards.
For example, we have a number of other servers reporting that firmware upgrades are available, but these servers are running and performing without issue, and the patches are minor. So we do not have a plan to update the firmware on these servers at this time. To be clear, I am not referring to Microsoft Windows Updates, which are performed on a regular basis; we’re talking primarily about firmware and driver updates.
Newer technology requires more frequent updating
One of the things we are finding is that the “newer” a server platform is, the more critical components need updating, and the more important it is to stay on top of the most current firmware and drivers, compared with servers that have been on the market for some time. If a server has been in customers’ hands for a while, many of the initial bugs in its hardware components will already have been fixed.
To return to our VRTX example, this particular hardware installation was put into production when VRTX barely had a website available. Even our Dell support techs commiserated with us on the change to standard operating procedure, stating, “I know, normally I wouldn’t think that you’d need to do firmware patches this soon in the production life, but with the newness of the VRTX all of these firmware and driver updates are falling under the critical and recommended Updates category instead of optional.”
If it’s important, make sure it works
One final thought: the more complex and integrated the system, the greater the need to stay current with firmware and driver releases. If you’re not actively patching, it’s at least important to be aware of the release notes and fixes present in each new release. The more important any given hardware system is in your environment, the more important it is that it be stable.
So there’s no clear-cut answer on when to patch, or even a clear and obvious schedule. A lot will depend on your particular business and, more importantly, on the type of hardware and the importance of that hardware in your environment. But it pays to be aware of when updates are available and what issues they fix. If an update is recommended, there’s probably a reason. If it’s critical, it pays to complete the update in your next update cycle.
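For readers who like to see a rule of thumb written down, here is a rough sketch of the triage described above, in Python. It is only an illustration, not a tool we run or anything Dell provides. The severity names mirror the critical, recommended, and optional categories mentioned earlier; everything else (the `decide` function, the `Action` labels, and the 24-month cutoff for a “new” platform) is a hypothetical choice made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    OPTIONAL = 1
    RECOMMENDED = 2
    CRITICAL = 3


class Action(Enum):
    READ_RELEASE_NOTES = "read the release notes, no patch planned"
    NEXT_UPDATE_CYCLE = "patch in the next update cycle"
    PATCH_NOW = "patch as part of troubleshooting"


@dataclass
class Update:
    component: str      # e.g. "BIOS" or "iDRAC"
    severity: Severity  # category assigned by the vendor


# Assumption: a platform counts as "new" for its first 24 months on the
# market. This number is illustrative only; pick your own.
NEW_PLATFORM_MONTHS = 24


def decide(update, platform_age_months, business_critical, has_open_issue):
    """Return a suggested Action for one firmware or driver update."""
    # A system that is already misbehaving gets brought up to current.
    if has_open_issue:
        return Action.PATCH_NOW
    # Critical updates go into the next update cycle regardless.
    if update.severity is Severity.CRITICAL:
        return Action.NEXT_UPDATE_CYCLE
    # Recommended updates matter more on new or important platforms.
    if update.severity is Severity.RECOMMENDED and (
        platform_age_months < NEW_PLATFORM_MONTHS or business_critical
    ):
        return Action.NEXT_UPDATE_CYCLE
    # Otherwise stay aware of what the release fixes, but leave it alone.
    return Action.READ_RELEASE_NOTES


if __name__ == "__main__":
    bios = Update("BIOS", Severity.RECOMMENDED)
    print(decide(bios, platform_age_months=12,
                 business_critical=True, has_open_issue=False).value)
```

The order of the checks is the point: an open issue trumps everything, then the vendor’s severity, then how new and how important the platform is. A stable, mature, low-priority server with only optional updates pending falls through to “read the release notes,” which matches how we treat our other servers today.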
Hamlet image from the RSC production, 2008