5 Best Practices for Automating Major Incident Management
With a smart automation strategy, you can make incident response quicker and easier than ever - minimizing downtime and potential security breaches.
Major IT incidents take place within companies every single day. While only a handful make the headlines, events like outages and security breaches can seriously cripple employee productivity, negatively influence customer perceptions, and most importantly, result in lost revenue.
So when it comes to managing major IT incidents, it’s best to focus on the business impact and the bottom line. According to the Ponemon Institute, the average cost of downtime in 2016 was $8,851 per minute – that’s over $500,000 per hour, and typical downtimes average more than 90 minutes. And this is just the immediate cost! Longer-term impacts like reputation damage and customer attrition are unpredictable and potentially catastrophic.
While you can’t avoid all major incidents, you can prepare your organization to tackle them when they arise. A major component of your strategy should be automation. Organizations that maximize the use of automation in their major incident resolution processes restore service faster and make far fewer mistakes due to human error. This is because automation directly affects your ability to shrink the business impact window, that is, the costly period in which your users and business operations actually feel the impact of an incident. (To learn more about automation, see Automation: The Future of Data Science and Machine Learning?)
In order to maximize the benefits of automation, you should examine which activities need to take place during the impact window, and figure out how to move all other activities to either before the incident starts or after business has returned to normal operations. Here are five helpful ways to get started.
1. Develop & Define a Process
Defining a major incident management process is about pinpointing what can be planned, coordinated or executed during an incident. This may mean identifying key support team members by skillset and schedule, for instance, so that your service desk can engage them as quickly and efficiently as possible. It also means figuring out how you will relay relevant information to your team so they can begin resolving the issue right away, as well as keeping the right stakeholders informed and updated.
Automation is critical for key aspects of this process. For example, you could automate the inclusion of relevant information from your monitoring tools in your service desk tickets, or include information from the service desk in notifications to the incident resolvers. You can also document the entire incident in a single, comprehensive source of truth that is accessible to all. Remember that you can practice this process to get it right – you don’t need to wait for a real-world incident to test your approach.
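To make the first of those ideas concrete, here is a minimal Python sketch of attaching monitoring details to a service desk ticket. The MonitoringAlert and Ticket shapes and the enrich_ticket function are hypothetical names invented for illustration; real monitoring and service desk tools expose their own schemas and APIs.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringAlert:
    # Hypothetical alert shape; real monitoring tools use their own schemas.
    service: str
    metric: str
    value: float
    threshold: float
    started_at: str


@dataclass
class Ticket:
    # Hypothetical ticket shape, not any specific service desk's format.
    title: str
    description: str
    context: dict = field(default_factory=dict)


def enrich_ticket(ticket: Ticket, alert: MonitoringAlert) -> Ticket:
    """Return a copy of the ticket with monitoring details attached."""
    context = dict(ticket.context)
    context.update(
        service=alert.service,
        metric=alert.metric,
        observed=alert.value,
        threshold=alert.threshold,
        started_at=alert.started_at,
    )
    summary = (
        f"{alert.metric} on {alert.service} is {alert.value} "
        f"(threshold {alert.threshold}) since {alert.started_at}"
    )
    description = f"{ticket.description}\n\nMonitoring: {summary}"
    return Ticket(ticket.title, description, context)
```

With something along these lines, a resolver who opens the ticket sees what was measured and when it started, without having to look it up in a separate tool.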
2. Get Your Infrastructure Right
In this day and age of alert fatigue, it’s essential that you don’t continue to bombard your teams with irrelevant notifications and information that doesn’t apply to them. Applying filters to your monitoring alerts will empower your teams to more easily zero in on the needle in the haystack of routine noise. This is key to making all of your insights and data truly actionable, rather than just adding to information overload.
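As a rough illustration of alert filtering, the Python sketch below keeps only the alerts that belong to a given team and meet a minimum severity. The severity names, the alert dictionary keys and the relevant_alerts function are assumptions made for this example, not the format of any particular monitoring product.

```python
# Hypothetical severity scale; adjust to whatever your monitoring tool emits.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


def relevant_alerts(alerts, team, min_severity="warning"):
    """Keep alerts owned by `team` at or above `min_severity`."""
    floor = SEVERITY_ORDER[min_severity]
    return [
        alert
        for alert in alerts
        if alert["team"] == team and SEVERITY_ORDER[alert["severity"]] >= floor
    ]


alerts = [
    {"team": "payments", "severity": "critical", "text": "Checkout error rate high"},
    {"team": "payments", "severity": "info", "text": "Nightly job finished"},
    {"team": "search", "severity": "critical", "text": "Index lag growing"},
]

# Only the critical payments alert survives the filter.
payments_page = relevant_alerts(alerts, team="payments")
```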
Good ways to automate include using an application performance monitoring (APM) solution to crawl all of your applications and systems and proactively pinpoint root causes as soon as performance starts to degrade, before they cause major service outages. You can also integrate your monitoring, service desk, collaboration apps and chat tools to share contextual information in real time.
3. Accurately Measure MTTR
How do you measure mean time to repair (MTTR)? Do you base it on the total time that IT teams are engaged, or on the total time that the business is actually impacted? If your answer is the former, you should consider measuring the impact window from the business perspective instead. This is a much more accurate context for your optimization efforts, because your goal is to minimize the impact of incidents, and not simply present better response reports to your board. (To learn more about downtime and how it's handled, check out What Mean Time Between Failures Really Means.)
Automation helps here by providing full visibility into applications, so that you can retroactively “start the clock” if necessary, and by preserving a full record of your resolution activities and communications for analysis and audit to improve your processes.
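The difference between the two ways of measuring MTTR is easy to see in a small Python sketch. The incident records and field names below (impact_start, engaged, restored) are made up for illustration; in practice these timestamps would come from your monitoring and service desk records.

```python
from datetime import datetime, timedelta


def mean_duration(windows):
    """Average length of a list of (start, end) datetime pairs."""
    total = sum((end - start for start, end in windows), timedelta())
    return total / len(windows)


def business_mttr(incidents):
    # Clock starts when users first felt the impact.
    return mean_duration([(i["impact_start"], i["restored"]) for i in incidents])


def engagement_mttr(incidents):
    # Clock starts only when IT teams were engaged.
    return mean_duration([(i["engaged"], i["restored"]) for i in incidents])


# Illustrative data only.
incidents = [
    {
        "impact_start": datetime(2024, 1, 1, 10, 0),
        "engaged": datetime(2024, 1, 1, 10, 20),
        "restored": datetime(2024, 1, 1, 11, 0),
    },
    {
        "impact_start": datetime(2024, 1, 2, 14, 0),
        "engaged": datetime(2024, 1, 2, 14, 10),
        "restored": datetime(2024, 1, 2, 16, 0),
    },
]

print(business_mttr(incidents))    # 1:30:00
print(engagement_mttr(incidents))  # 1:15:00
```

In this made-up data, measuring from engagement understates the average impact window by 15 minutes, which is exactly the time the business was affected before anyone was working on the problem.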
4. Keep Stakeholders Informed – But Without Interrupting Resolution
Stakeholders expect effective and timely communications while also expecting subject matter experts to stay laser-focused on fixing problems. While you could designate a communications point of contact to monitor and engage business users, a more effective strategy would be to create a self-service web page with status updates. This lets stakeholders check for themselves without bombarding your team with further calls and emails. Just remember to update your stakeholders at regular intervals so they always receive, and know to expect, the latest status report. Don’t forget that communication shouldn’t stop simply because service is restored! It’s important that stakeholders get a summary of what happened, what was learned, and how the situation can be prevented in the future.
Here, automation can generate a real-time status page for stakeholders, and slash commands built into your chat tool can update that page.
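One way this could look is sketched below in Python. The /status command, its set of states and the plain-text page format are all assumptions for the sake of the example; a real chat tool would deliver the command text through its own integration mechanism, and a real status page would likely be HTML.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical incident states for the status page.
ALLOWED_STATES = {"investigating", "identified", "monitoring", "resolved"}


@dataclass
class StatusUpdate:
    state: str
    message: str
    posted_at: datetime


def parse_status_command(text, now=None):
    """Turn '/status <state> <message>' into a StatusUpdate."""
    parts = text.strip().split(maxsplit=2)
    if len(parts) < 3 or parts[0] != "/status":
        raise ValueError("usage: /status <state> <message>")
    state = parts[1].lower()
    if state not in ALLOWED_STATES:
        raise ValueError(f"unknown state: {state}")
    return StatusUpdate(state, parts[2], now or datetime.now(timezone.utc))


def render_status_page(updates):
    """Render updates as plain text, newest first."""
    newest_first = sorted(updates, key=lambda u: u.posted_at, reverse=True)
    return "\n".join(
        f"[{u.posted_at:%Y-%m-%d %H:%M} UTC] {u.state.upper()}: {u.message}"
        for u in newest_first
    )
```

The point of the design is that the resolver never leaves the chat tool: one short command both records the update and refreshes what stakeholders see.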
5. Collect Data to Support Problem Management
Restoring service does not represent the end of incident management! In fact, some of the most valuable activities occur in the aftermath of resolution. By collecting diagnostic and impact data and performing root cause analysis, you can perform a full audit of a major incident that includes putting preventive measures in place to avoid similar incidents in the future. In addition, so that you are ready if a recognizable incident does occur again, you can create a defined procedure for what kinds of data you need to collect and the steps that need to occur to drive resolution. This way your team simply has to refer to a checklist and focus on their core objective of restoring service, rather than worrying about what they need and when.
Automation here can capture and preserve resolution activities, including things like chat transcripts, in a single system of record for analysis. In addition, it will help you build a catalog of familiar incidents or issues, solidify best practices for each, and therefore increase the speed of resolution in the future.
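A catalog of familiar incidents can start out very simple. The Python sketch below stores past resolutions under a signature and returns the steps from the fastest one as a checklist. The IncidentCatalog class, the signature strings and the choice of "fastest past resolution" as the best practice are all illustrative assumptions, not a prescribed method.

```python
from collections import defaultdict


class IncidentCatalog:
    """In-memory catalog of past resolutions, keyed by incident signature."""

    def __init__(self):
        self._records = defaultdict(list)

    def record(self, signature, steps, minutes_to_restore):
        # Store a copy of the steps so later edits do not alter history.
        self._records[signature].append(
            {"steps": list(steps), "minutes": minutes_to_restore}
        )

    def checklist(self, signature):
        """Return steps from the fastest past resolution, or None if unseen."""
        past = self._records.get(signature)
        if not past:
            return None
        return min(past, key=lambda r: r["minutes"])["steps"]


catalog = IncidentCatalog()
catalog.record("db-connection-pool-exhausted", ["Restart app tier"], 95)
catalog.record(
    "db-connection-pool-exhausted",
    ["Capture pool metrics", "Raise pool limit", "Recycle idle connections"],
    40,
)

steps = catalog.checklist("db-connection-pool-exhausted")
```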
In Conclusion: Automate Smarter, Not More
Be aware that more automation isn’t necessarily better! It’s more important that you understand when, where and how to connect your IT systems together to support incident management. You don’t want to add unnecessary complexity just for the sake of automating more processes. Remember that the goal is to simplify and consolidate operations as much as possible so that your teams feel empowered to tackle problems efficiently. It’s about intelligently implementing automation to facilitate a well-coordinated set of processes, knowledgeable staff and effective stakeholder communications, so as to minimize the overall business impact of major incidents.