We have all been there - we need to place an urgent purchase order and the IT system goes down at that very moment. So we log a ticket and wait for a response. In some cases, the resolution comes within minutes, while in others the delay leaves us guessing as to what is happening behind the scenes. Should we escalate it? Now? Or exercise some more patience?
In today’s world, we expect enterprise IT systems to be available round the clock every day of the year, 24x7x365. This stringent requirement places tremendous pressure on the operations and engineering teams. Application architectures have grown complex and problems are often cross-functional, requiring multiple subject matter experts to be involved. Teams can be dispersed geographically across different time zones, adding another layer of complexity to the problem investigation. According to Gartner, the cost of an outage can easily run into hundreds if not thousands of dollars per minute.
Having lived and breathed this scenario in real-life, high-pressure enterprise environments, my co-founders and I wanted to design a transformative solution that uses the power of new AI-based cognitive paradigms to help resolve operational problems faster and more easily. So, what would be our guiding principles? As a budding startup we were in uncharted territory and could let our imagination have free rein!
Intense discussions with my co-founders led us to identify four essential requirements for our next-generation Incident Response Center:
Augmented Intelligence (aka AI) to help diagnose & resolve problems: Our system has to help humans reach good decisions rapidly through recommendations for causes & solutions of new problems based on historical data of prior solved problems.
Explainable Analysis: The #AI/#ML in our system should be explainable, not a black box with opaque logic, so that its suggestions can be trusted and easily understood by humans. For example, investigators need to be able to see why specific recommendations were made, i.e., connect the suggestions back to the underlying data.
Continuous Learning: Our solution needs to keep improving with time, gathering knowledge automatically from user actions as they solve problems. To maximize its power, it should also try to leverage crowd-sourced knowledge in an anonymous way, while strictly preserving data security & privacy.
Efficient Collaboration for Teams: Last but not least, it has to enable efficient #collaboration among teams of 2 to 200+ people. In real life, investigation teams can be quite large, in some cases involving hundreds of engineers and managers. A prime example (pun intended :-) is the Amazon Prime Day 2018 crash & scramble. It is simply not efficient to collaborate and reach a shared understanding of investigation status and strategies using text-based updates in a ticketing or service management tool.
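To make the first two requirements a little more concrete, here is a toy Python sketch of an explainable recommendation: it ranks previously solved incidents by word overlap with a new problem description and returns the shared terms as the "why" behind each suggestion. This is purely illustrative and does not describe any actual implementation; the record shape (`Incident`, `Recommendation`), the `recommend` function, the stopword list, and the choice of Jaccard similarity are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Incident:
    """A previously solved problem (hypothetical record shape)."""
    ticket_id: str
    description: str
    resolution: str


@dataclass
class Recommendation:
    ticket_id: str
    resolution: str
    score: float              # Jaccard similarity, between 0 and 1
    shared_terms: List[str]   # evidence an investigator can inspect


# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "is", "to", "on", "in", "and", "of", "for", "with"}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and keep alphanumeric words that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def recommend(new_description: str, history: List[Incident], top_k: int = 3) -> List[Recommendation]:
    """Rank past incidents by term overlap with the new problem description."""
    new_terms = tokenize(new_description)
    results = []
    for past in history:
        past_terms = tokenize(past.description)
        shared = new_terms & past_terms
        if not shared:
            continue  # nothing in common, so there is nothing to explain
        score = len(shared) / len(new_terms | past_terms)
        results.append(
            Recommendation(past.ticket_id, past.resolution, score, sorted(shared))
        )
    results.sort(key=lambda r: r.score, reverse=True)
    return results[:top_k]
```

Each suggestion carries its own evidence in `shared_terms`, so an investigator can trace it back to the underlying ticket text rather than taking the score on faith. A real system would likely need far richer signals than word overlap.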
Our mission is to make understanding long, convoluted problem tickets a thing of the past. Today, major incidents often result in experts writing complex knowledge articles in an effort to improve future response. However, no one has the time (or even the capability!) to read and assimilate the huge amounts of information available in internal and public knowledge bases - there's simply too much! In our world, knowledge articles would be superseded by intelligent recommendations generated automatically in our Incident Response Center from the history of solved problems, and team members would collaborate efficiently and effectively in the same place.
Curious to know what our next-gen Incident Response solution looks like? Could it help transform how you resolve your #ITOps problems? We are still in stealth mode, so contact us directly for an in-depth demo, and see how to take your #ITOperations & #ITSupport to the next level with the power of AI & visual collaboration!