During an outage, if you feel like your computer is on fire, chaos abounds, and the world is coming to an end, it's typically a good sign that your incident management process could use a bit of tuning. Gartner indicated in a now-famous blog post that an outage typically costs an organization $5,600 per minute of downtime. At that rate, an hour-long outage costs roughly $336,000. As Amazon or Knight Capital will tell you, that number can be significantly higher if the outage occurs in a revenue-generating system.
IT teams must find a smart, stable response and resolution to these incidents, usually very quickly, in hopes of calming down a manager doing his best Vernon Dursley impression. With the myriad of tools available, we at Praecipio Consulting have seen IT teams develop creative solutions to acknowledge, respond to, and ultimately resolve downed services and systems. But as with most processes, we've also seen overly complicated procedures that require messy integrations and are unreliable at best. The key to managing an outage gracefully is to understand not only that the system is down, but also who owns it, how to recover it, and how to communicate about it.
At Praecipio Consulting, we typically see three big inhibitors IT teams face in reducing downtime:
- working in multiple systems
- alert overload
- lack of communication and visibility
Working in Multiple Systems
As microservices become more prevalent in IT organizations, ops engineers are frequently required to work in several disparate systems, resulting in costly context switches that hurt productivity. In addition to the (very expensive) wasted time, information can be lost in the transition. An effective solution is a single system with several integration points, where information flows in and can be acted on. Reducing the need for context switches helps users retain information and provides a single source of truth. As a bonus, after the incident is triaged and resolved, the information on how it was resolved is all in one location.
This is just one of the many reasons we love the Atlassian products. Jira Service Desk, in combination with Confluence as a knowledge base, can serve as the central location for all things outage. Whether a request is created automatically or manually, a central ticket where the team can swarm, communicate, and collaborate is essential to dealing with the outage quickly. Coupled with a knowledge base filled with Standard Operating Procedures, it helps the IT team reduce the chaos and confusion of an outage and move toward resolution. Notifications can be sent automatically through Jira Service Desk to any interested parties using Filter Subscriptions, and the root cause analysis can be shared via a page in Confluence.
Alert Overload
There is a plethora of wonderful monitoring tools on the market today providing a wealth of information to system engineers. The problem is that during an outage, we don't want to wade through a mountain of data to figure out what happened. Instead, we need a way to reduce the noise and get straight to the source of the incident.
Enter companies like Moogsoft, which specialize in aggregating all of that data and sifting through it to identify cause and effect. Building out timelines of when certain alerts were triggered and applying machine learning to identify patterns can greatly reduce the time it takes to get to a root cause.
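To make the timeline idea concrete, here is a minimal sketch in Python. It is not how Moogsoft or any other vendor works internally; it simply sorts alerts by time and clusters the ones that fire close together, so the earliest alert in each cluster stands out as a candidate trigger. The `Alert` fields, the `build_timeline` name, and the five-minute window are all our own assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical alert shape; real monitoring tools expose richer data.
    service: str
    message: str
    raised_at: datetime


def build_timeline(alerts, window=timedelta(minutes=5)):
    """Sort alerts by time and cluster those that fire close together.

    A new cluster starts whenever the gap since the previous alert
    exceeds `window`. The first alert in a cluster is a candidate for
    the triggering event; the rest are likely downstream effects.
    """
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.raised_at):
        if clusters and alert.raised_at - clusters[-1][-1].raised_at <= window:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


# Example: three related alerts within minutes, one unrelated alert later.
t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    Alert("web", "HTTP 500 rate high", t0 + timedelta(minutes=2)),
    Alert("db", "disk 95% full", t0),
    Alert("api", "request timeouts", t0 + timedelta(minutes=1)),
    Alert("mail", "queue backlog", t0 + timedelta(hours=2)),
]
timeline = build_timeline(alerts)
for cluster in timeline:
    first = cluster[0]
    print(f"{first.raised_at:%H:%M} {first.service}: {first.message} "
          f"(+{len(cluster) - 1} related)")
```

A fixed time window is the crudest possible grouping rule. It is only meant to show why an ordered timeline narrows the search, not to replace pattern-based correlation.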
Of course, integration with your single system for work is critical. The information should funnel in automatically, enhancing the system instead of pulling users away from it. Integrating alert systems with Jira Service Desk, so that an outage, a disk-space shortage, or even an access alert automatically triggers the creation of a request, is invaluable to an IT team looking to respond and resolve as quickly as possible.
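As a rough illustration of that funnel, the sketch below turns a monitoring alert into the JSON body that Jira's REST endpoint for creating issues (`POST /rest/api/2/issue`) accepts. The alert dictionary shape, the `OPS` project key, and the `Incident` issue type are assumptions for the example; substitute whatever your monitor emits and your Jira project defines. The sketch stops at building the payload and leaves authentication and the HTTP call to your client of choice.

```python
import json


def alert_to_issue_payload(alert, project_key="OPS", issue_type="Incident"):
    """Translate a monitoring alert (a plain dict) into a Jira issue body.

    `project_key` and `issue_type` are placeholders; they must match a
    project and issue type that exist in your own Jira instance.
    """
    summary = f"[{alert['severity'].upper()}] {alert['service']}: {alert['title']}"
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": issue_type},
            "summary": summary,
            "description": alert.get("details", "No details supplied by the monitor."),
        }
    }


# Hypothetical alert, shaped the way we assume the monitor delivers it.
disk_alert = {
    "service": "build-server",
    "severity": "critical",
    "title": "Disk space below 5%",
    "details": "/var is 96% full on build-server-01",
}
payload = alert_to_issue_payload(disk_alert)

# Serialize and send this body to <your-jira-base-url>/rest/api/2/issue
# with an authenticated HTTP client.
body = json.dumps(payload, indent=2)
print(body)
```

Keeping the translation in one small function means every alert source lands in the ticket with the same summary format, which makes the queue much easier to scan during an outage.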
Lack of Communication and Visibility
We spoke with a client recently who was reminiscing about 70-person emergency bridges, recalling how chaotic and comical they were. After a good laugh, we were glad those days were behind him, because for many IT teams this is still an all-too-real part of the job.
We prefer systems that integrate with a collaboration tool and enable a user to proactively reach out to the right support. Ideally, once we're in the communication and collaboration stage, relevant information has already been gathered into a single ticket. Spinning up a chat room from that ticket, and then using an application like xMatters to proactively alert the on-call members of the right support group, enables us to quickly and effectively get the right people looking at the issue. When integrated with Jira Service Desk, the chat room is created with the click of a button, and if integrated with an asset management tool such as Insight by Riada, the right people are automatically notified and can join the conversation.
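The routing logic behind "the right people are automatically notified" can be sketched in a few lines of Python. Nothing here calls xMatters, Insight, or any chat tool; the lookup tables, the fallback group, and the room-naming scheme are hypothetical stand-ins for data those tools would supply.

```python
# Hypothetical data: in practice the ownership map would come from an
# asset management tool and the roster from an on-call scheduler.
ASSET_OWNERS = {
    "payments-api": "payments-team",
    "build-server": "devops-team",
}
ON_CALL = {
    "payments-team": ["alice", "bob"],
    "devops-team": ["carol"],
}
FALLBACK_GROUP = "service-desk"  # used when a service has no recorded owner


def responders_for(service):
    """Return (support group, on-call members) for an affected service."""
    group = ASSET_OWNERS.get(service, FALLBACK_GROUP)
    return group, ON_CALL.get(group, [])


def invitation(ticket_key, service):
    """Describe the chat room to open and who should be paged into it."""
    group, members = responders_for(service)
    return {
        "room": f"incident-{ticket_key.lower()}",
        "group": group,
        "notify": members,
    }


print(invitation("OPS-42", "build-server"))
```

The point of the sketch is the order of operations: the ticket identifies the affected asset, the asset identifies the owning group, and only that group's on-call members are paged. That is what keeps the bridge from growing to 70 people.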
Connecting the right people with the right process in the right tools empowers IT teams to quickly and effectively address incidents. While we all know incidents are painful, the process to identify, work on, and resolve them doesn't have to be. Having a mission control system that intelligently handles alerts, allows for proactive notifications, and promotes collaboration can drastically reduce the time spent working incidents.
How We Can Help
If you're interested in learning more about how you can establish your own mission control system, give us a call. We can assess your current toolchain configuration and provide next steps on how you can move forward with the technology you have, or help you find the tools that work best for your team.