An otherwise invisible piece of equipment might not seem critical to a business. But when a single cooling fan fails, takes a generator down with it, and leaves tens or even hundreds of thousands of users with costly problems for an extended period, it becomes clear that being able to estimate which components of your infrastructure might fail - and when - is of paramount importance. That's where mean time between failures (MTBF) comes in: it is the metric on which IT professionals rely to estimate when critical equipment is likely to fail. Here we take a look at what finally kills some common types of critical equipment, and how MTBF can help save the day.
What Is MTBF?
Every piece of IT equipment manufactured is assigned a unique model number. Those that play some part in critical infrastructure are supplied to customers with an MTBF estimate. The complex calculations to work out the MTBF for a piece of equipment take place during the lengthy testing phase within a product's research and development and are relatively specific to a particular model.
If you are looking to find the MTBF for a particular piece of equipment, you will find it in the detailed specification sheet supplied by the manufacturer. You can also contact the manufacturer directly.
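For intuition about what a published figure summarizes, the simplest form of the calculation divides the total operating hours accumulated across all test units by the number of failures observed. The sketch below shows only that basic point estimate; the function name and the test figures are hypothetical, and a manufacturer's actual method is likely to be considerably more involved.

```python
def mtbf_point_estimate(unit_hours, failures):
    """Basic MTBF estimate: total operating hours / number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for a point estimate")
    return sum(unit_hours) / failures

# Hypothetical test run: 500 units, each operated for 2,000 hours,
# with 4 failures observed across the whole population.
hours = [2000.0] * 500
estimate = mtbf_point_estimate(hours, failures=4)
print(f"Estimated MTBF: {estimate:,.0f} hours")
```

Note that the estimate comes from many units running for a comparatively short time, which is how a figure far longer than any practical test period can be quoted.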
Routing
An enterprise-grade router includes many parts, some moving and others static. Power supply units (PSUs) and cooling fans both have moving parts, and it's those elements that tend to be points of failure, especially if the unit isn't housed inside a relatively dust-free data center. Thankfully, with some administrator input, most routers will report to a syslog server so that any failed components can be flagged.
Switches
In a similar vein, the next level within an enterprise network is the switching hardware. Although enterprise-grade switches also tend to rely on fans, there are usually fewer of them than in a router chassis. If the fans are still working, a faulty switch will usually misbehave at the software level, either by disabling a switch port unexpectedly or, more commonly, by dropping packets, causing varying levels of traffic disruption, or changing user-defined settings without being asked to.
The networking behemoth Cisco advertises an MTBF of 188,574 hours for one of its switches, the Cisco Catalyst 3750G-24TS. If we divide that by 8,765.81277 (the number of hours in an average year), we see that this model has an MTBF estimate of around 21.5 years. That figure is of some reassurance when you consider that this equipment needs to perform well 24/7 without fault, although in reality it is a statistical indication of reliability rather than a guaranteed lifespan. Even so, it gives users an educated basis for judging how dependable that piece of equipment is likely to be.
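The arithmetic above can be reproduced in a few lines. The sketch also shows one common way to read an MTBF figure: under the simplifying assumption of a constant failure rate (an assumption of this example, not something the manufacturer states), the chance that a given unit fails within a period is 1 - exp(-period / MTBF). The function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8765.81277  # hours-per-year figure used in the text

def mtbf_years(mtbf_hours):
    """Convert an MTBF quoted in hours to years."""
    return mtbf_hours / HOURS_PER_YEAR

def failure_probability(mtbf_hours, period_hours):
    """Chance of at least one failure within the period.

    ASSUMPTION: constant failure rate (exponential model).
    """
    return 1.0 - math.exp(-period_hours / mtbf_hours)

mtbf = 188_574  # advertised MTBF in hours
years = mtbf_years(mtbf)
p_year = failure_probability(mtbf, HOURS_PER_YEAR)
print(f"MTBF: {years:.1f} years")
print(f"Chance of failure in any one year: {p_year:.1%}")
```

Under that model, an MTBF of roughly 21.5 years corresponds to about a 4.5% chance that any one unit fails in a given year, which is a more practical planning figure than the headline number.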
Resilient Power
Uninterruptible power supplies (UPSs) hooked up to a large number of batteries can provide backup power within the enterprise during the brief spell before generators spin up in a power outage. Software faults can materialize within a UPS, as with any piece of equipment, but the batteries from which it draws power usually cause the most concern. If a UPS battery is frequently discharged and recharged, its capacity will diminish more quickly and its operating time will shorten dramatically. Unsurprisingly, it's also possible for UPS batteries to fail entirely. A UPS can report faults over a modem or network connection, but more often than not, older UPSs will simply trigger an audible alarm when an issue first arises.
Protected Storage
The hard disks we rely on so heavily today have become significantly more reliable over the past decade or so. They are, however, far from infallible, and how long they continue to function correctly depends on a number of factors that vary from one study to the next. (A great opinion piece about this can be found on The Remarketer.) If detailed reporting is enabled and the drive is providing feedback about errors, then corrupt sectors and read/write failures are the key signs that a disk within a storage array is failing. Another common issue within servers that use several disks connected to a RAID controller is that the controller itself will fail. Unfortunately, hard disks sometimes simply stop working without any warning whatsoever, an issue that is hard to guard against reliably.
Servers
Aside from the drives built into servers and the moving parts, such as the aforementioned cooling fans and PSUs, a number of issues can also arise within a server's other hardware components. Reporting at the software level (which usually means the BIOS or other low-level hardware diagnostics) is key to spotting when components have failed or, more importantly, are showing signs of failing. One issue that may not be immediately obvious affects motherboards. It makes perfect sense that machines dislike too much heat. But even today, if a modern circuit board is subjected to rapid heat loss - going from running very hot to suddenly becoming cold - cracks can appear, causing the board to fail disastrously. It's an issue to bear in mind, especially if you're moving equipment between buildings within a maintenance window's unforgiving time frame.
MTBF: It Can Fail Too
As useful as MTBF predictions are, it's important to calculate levels of acceptable risk for any equipment upon which a business must rely. Unfortunately, even with all the statistical reassurances provided by manufacturers, the only dependable way to protect the availability of the equipment that runs critical systems is to double it up, so that a redundant unit can take over through failover when the primary stops responding.
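To put a rough number on what doubling up buys, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR), where MTTR is the mean time to repair or replace a failed unit. The sketch below is illustrative only: the 24-hour MTTR is an assumed figure, and the calculation assumes the two units fail independently and that failover is instantaneous and never fails itself.

```python
HOURS_PER_YEAR = 8765.81277  # hours-per-year figure used in the text

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def redundant_availability(single, copies=2):
    """Availability when any one of `copies` units is enough.

    ASSUMPTION: units fail independently and failover is perfect.
    """
    return 1.0 - (1.0 - single) ** copies

# 188,574 h is the advertised MTBF; the 24 h MTTR is a hypothetical value.
single = availability(188_574, mttr_hours=24)
pair = redundant_availability(single, copies=2)

for label, a in (("single unit", single), ("redundant pair", pair)):
    downtime_minutes = (1.0 - a) * HOURS_PER_YEAR * 60
    print(f"{label}: {a:.8f} available, ~{downtime_minutes:.2f} min/year down")
```

In practice, shared power feeds, shared configuration errors, and imperfect failover all erode the benefit, so the paired figure is best read as an upper bound.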
Each and every individual piece of hardware used in the enterprise is made up of many different components, so the true MTBF is far from a trivial calculation. Clearly, it's critical not to rest a business's future on these measurements of likelihood but instead use them as a yardstick to make informed decisions in relation to business continuity and disaster recovery procedures. After all, reducing downtime through meticulous advance planning might mean the difference between a successful business and business failure.
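As a simplified illustration of why a device's overall MTBF is lower than that of any of its parts, the sketch below combines component figures for a device that fails whenever any one component fails. It assumes independent components with constant failure rates, so that the failure rates (1 / MTBF) simply add; the component names and values are hypothetical and not taken from any data sheet.

```python
def series_mtbf(component_mtbfs):
    """System MTBF when any single component failure fails the system.

    ASSUMPTION: independent components with constant failure rates,
    so the system failure rate is the sum of the component rates.
    """
    total_rate = sum(1.0 / hours for hours in component_mtbfs.values())
    return 1.0 / total_rate

# Hypothetical component MTBFs in hours, for illustration only.
components = {"fan": 100_000, "psu": 200_000, "mainboard": 400_000}
system = series_mtbf(components)
print(f"System MTBF: {system:,.0f} hours")
```

With these made-up figures the system comes out at roughly 57,000 hours, well below the weakest component's 100,000 hours, which is one reason a device-level figure cannot be read off from its best or even its worst part.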