Keeping IT services up and running is obviously important, and system manufacturers have given the subject a great deal of thought. Some critical financial computers have been running continuously for years; there is a well-known story on the internet about a Novell NetWare 3 server that was finally shut down after 16 years. For network uptime, the common benchmark is "five nines," or 99.999% availability. Achieving maximum uptime is an important consideration for any IT service offering.
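To put "five nines" in perspective, a short calculation (a simple sketch, assuming a 365-day year) shows how little downtime each availability level actually permits:

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} availability -> about {downtime_min:.1f} minutes of downtime per year")
```

At 99.999%, that works out to only about five minutes of downtime in an entire year.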
How is maximum uptime achieved? Good management is the key. The International Organization for Standardization (ISO) created a framework for network management called FCAPS, which stands for:
- Fault management
- Configuration management
- Accounting management
- Performance management
- Security management
Issues with individual network components are handled both proactively and reactively using this model. Faults are monitored through alarms and event notifications, which are collected by agents using protocols such as the Simple Network Management Protocol (SNMP) or by proprietary tools. Customizable thresholds can trigger alarms and even automatically generate tickets that land in the queues of monitoring personnel in data centers. Large carrier networks may have separate departments for the core, distribution and access layers of the network. After a major event, root cause analysis attempts to isolate and define the critical issues.
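As a rough illustration of that threshold-to-ticket flow (a minimal sketch; the metric names, threshold values and the `open_ticket` helper are hypothetical, not part of any particular monitoring product):

```python
# Minimal sketch of threshold-based alarming: compare collected metrics
# against configured thresholds and raise a ticket when one is exceeded.

THRESHOLDS = {                      # hypothetical per-metric alarm thresholds
    "cpu_percent": 90.0,
    "interface_errors_per_min": 50,
    "temperature_c": 75.0,
}

def open_ticket(device, metric, value, limit):
    # Placeholder for integration with a real ticketing system.
    print(f"TICKET: {device} {metric}={value} exceeded threshold {limit}")

def check_device(device, metrics):
    """Evaluate one device's latest metric samples against the thresholds."""
    for metric, limit in THRESHOLDS.items():
        value = metrics.get(metric)
        if value is not None and value > limit:
            open_ticket(device, metric, value, limit)

# Example: metrics as they might arrive from an SNMP poller or agent.
check_device("core-switch-01", {"cpu_percent": 97.2, "temperature_c": 61.0})
```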
Similar processes are used for system management. Internet service providers (ISPs) and managed hosting centers employ system administrators to monitor and manage the health of servers, storage systems and other devices. Individual processes on Windows or Linux machines, for instance, can be viewed and controlled through graphical user interface (GUI) management programs in much the same way that network devices are.
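The same visibility is available programmatically. The sketch below uses the third-party psutil library to list busy processes, roughly what an administrator would eyeball in Task Manager or top (an assumption: psutil is installed, and the 20% CPU threshold is arbitrary):

```python
import psutil  # third-party library ("pip install psutil"); assumed to be available

# List processes currently using more than an arbitrary CPU threshold.
CPU_THRESHOLD = 20.0

for proc in psutil.process_iter(attrs=["pid", "name", "cpu_percent"]):
    info = proc.info
    if (info["cpu_percent"] or 0.0) > CPU_THRESHOLD:
        print(f'{info["pid"]:>7}  {info["name"]:<25} {info["cpu_percent"]:5.1f}% CPU')
```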
Remote monitoring and configuration of network components and systems provide the real-time capability needed to maximize uptime, whether that means pushing configuration changes, collecting key performance indicators or implementing security enhancements.
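One common way to collect such an indicator remotely is over SSH. The sketch below uses the third-party paramiko library; the hostname, account, key path and command are placeholders, and a production setup would verify host keys rather than auto-accepting them:

```python
import paramiko  # third-party SSH library; assumed installed, with SSH access permitted

# Collect a simple KPI (load average via "uptime") from a remote Linux host.
HOST = "server01.example.com"   # placeholder hostname
USER = "monitor"                # placeholder account

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # acceptable for a sketch only
client.connect(HOST, username=USER, key_filename="/path/to/key")  # placeholder key path

stdin, stdout, stderr = client.exec_command("uptime")
print(stdout.read().decode().strip())
client.close()
```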
One way to look at uptime and the robustness of any system is through the model IBM called RAS: reliability, availability and serviceability. Many methods have been developed to ensure RAS, including redundancy, data backup, uninterruptible power supplies (UPS), hot-swappable components and automatic updates. Planned changes and maintenance windows offer opportunities to correct or improve known issues without disrupting users.
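A quick calculation shows why redundancy in particular is so effective: a set of redundant components is down only when every one of them is down at once (a simple sketch using the standard parallel-availability formula, assuming the components fail independently; the figures are illustrative):

```python
# Availability of n independent, redundant components, each with availability a:
# the combined system fails only when all n fail simultaneously.
def parallel_availability(a, n):
    return 1 - (1 - a) ** n

single = 0.99
print(f"one component  : {single:.4%}")                            # 99.0000%
print(f"two redundant  : {parallel_availability(single, 2):.4%}")  # 99.9900%
print(f"three redundant: {parallel_availability(single, 3):.4%}")  # 99.9999%
```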
Eventually, systems and networks will fail. Redundancy is one of the keys to system resiliency, and it can apply to hardware, software or data. Those responsible for ensuring reliability in a network or software system look for anything that could be a single point of failure (SPOF). Does the entire network flow through a single switch or cable? Are all processes running on a lone server? Is there only one copy of a critical data set? Without redundancy, a company can lose in an instant what may have taken years to develop.
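In software, removing a SPOF often comes down to having a second place to send the work. The sketch below shows basic client-side failover between a primary and a standby endpoint; the URLs are placeholders, and real deployments typically rely on load balancers or clustering rather than ad hoc retry logic:

```python
import urllib.request
import urllib.error

# Placeholder endpoints: a primary service and a redundant standby.
ENDPOINTS = [
    "https://primary.example.com/health",
    "https://standby.example.com/health",
]

def fetch_with_failover(urls, timeout=3):
    """Try each redundant endpoint in turn; fail only if all are unreachable."""
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return url, resp.status
        except (urllib.error.URLError, OSError):
            continue  # this endpoint is down; try the next one
    raise RuntimeError("all redundant endpoints are unavailable")

# Example use:
# url, status = fetch_with_failover(ENDPOINTS)
```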
Maximizing uptime is an “all-of-the-above” endeavor. Best practices have been developed through decades of experience and collaboration, and new solutions such as self-healing networks, virtualization, data analytics and improved architectures are continually being put in place. No single method will answer every issue that arises in complex systems; each company tries to use its IT resources as efficiently as possible within the life cycle of the equipment at its disposal.