AI-Driven Automation for ITSM Incident Management
Incidents generated through self-service can be light on data and frequently require triage by the level 1 service desk. The primary function of the triage team is to assign a category or business service, which triggers the assignment process. While it seems simple, this extra touch-point can increase both resolution time and support cost. Let’s take a look at how we can leverage artificial intelligence (AI) to streamline this IT service management process.
ITSM Incident Management: Current Processes
First, we’ll review how humans on the triage team would handle a typical ITSM incident management categorization problem.
A service desk agent receives a new incident that has the following information:
- caller reporting the issue
- a short description of the issue
- possibly a more detailed description
- urgency and/or impact level
The following workflow is an example of the high-level process an agent would use to triage a new incident. The process is heavily dependent on the existing knowledge of the agent or the agent’s access to readily available and contextual knowledge.
The agent either makes an assessment based on existing knowledge and experience or must look for information to help make a decision.
There are a few primary concerns with this approach:
- Once agents acquire the knowledge to quickly assess and categorize incidents, they usually move on to a more technical support role so their knowledge is no longer available for this process.
- The “learning” that happens during the process is rarely documented so it only has a single-use value.
- There usually isn’t a feedback system to help the team improve the process. When incidents are incorrectly categorized and assigned to the wrong team, that team will usually just re-assign the incident to the team they think is the best resource.
You could attempt to automate some of the process by checking incidents against a list of common keywords and mapping them to categories or services. This requires someone to own and manage the keyword mappings, which could easily become outdated as systems and services change.
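To make the keyword approach concrete, here is a minimal sketch in Python. The keyword table, the category names, and the function name are all hypothetical and chosen only for illustration; a real mapping would come from your own service catalog.

```python
# Hypothetical keyword-to-category table. Someone has to own this
# and keep it current by hand as systems and services change.
KEYWORD_CATEGORIES = {
    "password": "Access",
    "login": "Access",
    "vpn": "Network",
    "wifi": "Network",
    "printer": "Hardware",
    "outlook": "Email",
}


def categorize_by_keyword(short_description, default="Needs triage"):
    """Return the category of the first known keyword, else the default."""
    for word in short_description.lower().split():
        # Strip trailing punctuation so "password." still matches.
        category = KEYWORD_CATEGORIES.get(word.strip(".,!?:;"))
        if category is not None:
            return category
    return default


if __name__ == "__main__":
    print(categorize_by_keyword("Cannot connect to VPN from home"))
    print(categorize_by_keyword("My laptop screen is flickering"))
```

Anything that doesn't contain a listed keyword falls back to manual triage, and every new system or renamed service means another manual edit to the table. That maintenance burden is the weakness described above.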
The AI Way
Now let’s see how machine learning concepts can be used to improve the existing ITSM incident management process and reduce the amount of time it takes to get the incident into the correct hands for resolution.
On the Job Training
Just like a newly hired agent, machines also require some training before they can perform their job. Luckily, machines can learn much faster, build more complex associations of related data, and even determine which associations are the most useful in making decisions.
The bulk of the training is accomplished by analyzing a collection of existing incidents, identifying key attributes that correlate with categories, and creating a model that makes predictions from these patterns. In the most basic sense, a model is just a procedure that takes the attributes of an incident and produces a prediction. In our case, the prediction target is the category or business service. The features (inputs) would be the known information from the new incident (caller, descriptions, …).
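As a rough illustration of features and prediction targets, the sketch below trains a tiny naive Bayes text classifier using only the Python standard library. Naive Bayes is my choice for the example, not necessarily what any particular product uses. The sample incidents, the category names, and the function names are made up, and only the short description is used as a feature.

```python
import math
from collections import Counter, defaultdict

# Made-up historical incidents: (short description, category).
HISTORY = [
    ("cannot login to email", "Email"),
    ("outlook will not open", "Email"),
    ("vpn connection keeps dropping", "Network"),
    ("wifi is slow in the office", "Network"),
    ("printer out of toner", "Hardware"),
    ("laptop screen is cracked", "Hardware"),
]


def tokenize(text):
    return text.lower().split()


def train(incidents):
    """Count how often each word appears under each category."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for description, category in incidents:
        category_counts[category] += 1
        for word in tokenize(description):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return {
        "category_counts": category_counts,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
    }


def predict(model, description):
    """Return (category, confidence) for a new short description."""
    total = sum(model["category_counts"].values())
    vocab_size = len(model["vocabulary"])
    log_scores = {}
    for category, count in model["category_counts"].items():
        score = math.log(count / total)  # prior from historical volume
        words_in_category = sum(model["word_counts"][category].values())
        for word in tokenize(description):
            seen = model["word_counts"][category][word]
            # Add-one smoothing so unseen words don't zero out the score.
            score += math.log((seen + 1) / (words_in_category + vocab_size))
        log_scores[category] = score
    # Normalize the scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


if __name__ == "__main__":
    model = train(HISTORY)
    print(predict(model, "vpn is slow"))
```

Six incidents are nowhere near enough for a useful model. The point is only to show the shape of the process: historical incidents go in, word-to-category patterns are counted, and a new description comes out with a predicted category and a confidence score.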
There’s a lot more to making this happen. I’ll cover the training, testing, and model concepts in a follow-up post.
The real value of machine learning is that the learning process is continuous. It gets smarter by learning from the success of the previous predictions.
No machine learning model is perfect. For the cases where it was unable to predict a value with high enough confidence, the model will attempt to learn better ways to handle similar incidents in the future. And for cases where an incident was later manually corrected by a human, the change event is collected and used to improve the model.
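One way to picture this feedback loop is sketched below. The 0.8 threshold, the field names, and both function names are assumptions made for the example; the right cutoff depends on your own data and tolerance for misrouted incidents.

```python
# Hypothetical cutoff: below this, a human triages the incident.
CONFIDENCE_THRESHOLD = 0.8


def route(incident, predicted_category, confidence,
          threshold=CONFIDENCE_THRESHOLD):
    """Accept the prediction only when the model is confident enough."""
    if confidence >= threshold:
        return {**incident, "category": predicted_category,
                "assigned_by": "model"}
    # Low confidence: leave the category empty for the triage team.
    return {**incident, "category": None, "assigned_by": "triage"}


def record_correction(training_log, incident, corrected_category):
    """Keep a human correction as a labeled example for later retraining."""
    training_log.append((incident["short_description"], corrected_category))
    return training_log


if __name__ == "__main__":
    corrections = []
    confident = route({"short_description": "vpn down"}, "Network", 0.92)
    unsure = route({"short_description": "something odd"}, "Email", 0.40)
    record_correction(corrections, unsure, "Hardware")
    print(confident, unsure, corrections)
```

The collected corrections are simply more labeled history. Feeding them into the next training run is how the manual fixes end up improving future predictions.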
Hopefully, you’re already seeing that this solution isn’t limited to predicting categories. Improved assignment rules, change risk analysis, and predicting major incidents and outages before humans identify the patterns are all possible. My layman’s explanation is that anywhere you have to manage data lookup definitions or rules is a candidate for machine learning.
While this may seem like a turn-key “be all and end all” solution, my experience with implementing this for Astound.ai customers on the ServiceNow platform has identified a few challenges. As noted, the training process is heavily dependent on existing data to learn from. If you’re light on data, or your process is inconsistent, the initial accuracy can be low.
Another challenge is the common practice of using an “Other” category. While this is a well-known bad practice, sometimes you just need that bucket for things that don’t smell like anything else. This can quickly turn into an overused option. The machine learning analysis and training will attempt to identify common patterns or features in the collection of “Other” incidents. Since those incidents vary greatly, there’s little the model will be able to leverage. This usually leads to excluding the data from training.
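Excluding the catch-all bucket can be as simple as a filter before training. The sketch below is a hypothetical helper, with made-up incidents, that drops those rows and reports what share of the history was lost, which is a quick indicator of how overused the bucket has become.

```python
def prepare_training_set(incidents, catch_all="Other"):
    """Drop catch-all incidents and report the share that was excluded.

    incidents: list of (short_description, category) pairs.
    """
    kept = [
        (description, category)
        for description, category in incidents
        if category.strip().lower() != catch_all.lower()
    ]
    excluded_share = 1 - len(kept) / len(incidents) if incidents else 0.0
    return kept, excluded_share


if __name__ == "__main__":
    history = [
        ("vpn connection keeps dropping", "Network"),
        ("printer out of toner", "Hardware"),
        ("need help with a thing", "Other"),
        ("outlook will not open", "Email"),
    ]
    kept, excluded_share = prepare_training_set(history)
    print(len(kept), excluded_share)
```

A large excluded share means a large part of your history teaches the model nothing, which ties back to the point about cleaning up the process before expecting good predictions.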
If you don’t get anything else from this article, at least understand that the more you can clean up your ITSM incident management process and data, the easier it will be to implement machine learning solutions and the more value you’ll get once it’s implemented.
Stay tuned for more exciting machine learning and data science concepts explained in “easier”-to-understand terms.