In most IT organizations that specialize in support and maintenance contracts for a customer's IT and software estate, most of the emphasis is on achieving 100% availability and adhering to Service Level Agreements.
However, one of the most important aspects of Service Management that helps achieve the above is an effective incident management system.
The ITIL definition of an incident is as follows:
"any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service"
A fair enough definition, simple to understand and crystal clear!
In my opinion, to make the incident management process as effective as possible in operations, you need a basic framework in place that involves the right people, the right processes and the right tools. In this blog post, I will concentrate on the process part of these three.
There are three main parts of the incident management process.
- Detect – detecting the occurrence of an incident and understanding its nature and implications
- Diagnose – diagnosing the cause and carrying out the investigation to find a solution
- Resolve – resolving the incident by putting either a permanent solution or a workaround in place
I would like to go a little deeper and explain the tasks that the support team would typically carry out in each of the above phases of incident management.
First of all, it is very important to establish whether an incident has actually occurred. I have often seen someone report an incident and the team start investigating straight away, potentially wasting time on something that turns out not to be an issue at all.
In this phase, it is important to understand and establish the nature of the incident, define its context in terms of impact and urgency, assign a priority to the incident, and carry out a quick impact analysis.
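To make the impact and urgency idea concrete, here is a minimal sketch of a priority matrix. The three levels, the `Level` and `assign_priority` names, and the "impact + urgency - 1" rule are my own assumptions for illustration; your organization's matrix and labels may well differ.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale, used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def assign_priority(impact: Level, urgency: Level) -> str:
    """Combine impact and urgency into a priority label from P1 to P5.

    Assumed rule: priority number = impact + urgency - 1, so that
    high impact with high urgency gives P1 and low with low gives P5.
    """
    return f"P{int(impact) + int(urgency) - 1}"


if __name__ == "__main__":
    # Example: a widespread outage (high impact) that must be fixed now.
    print(assign_priority(Level.HIGH, Level.HIGH))
    # Example: a cosmetic issue affecting one user with no deadline.
    print(assign_priority(Level.LOW, Level.LOW))
```

Writing the matrix down like this, even in a spreadsheet, stops the priority from being decided by whoever shouts loudest during the incident.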
The use of technology and tools is very important in this phase. It is always recommended to have proactive monitoring in place to keep an eye on the system components, with enough alarms and alerts to inform the respective teams of any potential issues or incidents within the system components and associated services.
Some of the suggested tasks that need to be done immediately after establishing the incident context are as follows:
- Understand whether this is a manual report or an automated trap. If it is a manual report of an issue, you have almost certainly already made some customer unhappy with your service!
- If this is an automated alert, then you have done a good job. Now establish the source of the alert and the scenarios in which the alert or alarm is triggered.
- Establish whether this alert or alarm has turned into an incident. Sometimes proactive monitoring raises one-off traps and the system goes back to a stable state.
- If an incident has been detected, establish the nature of the incident, do a quick impact analysis and understand the business implications of the incident
- Inform key stakeholders of the incident occurrence
- Refer to the known error database and knowledge base for any information or clues that might help you diagnose the cause and resolve the incident.
- Proceed to diagnosis
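One way to separate a one-off trap from a real incident is to look at how often a check has failed recently rather than reacting to a single alert. The sketch below assumes a simple "N failures within the last M checks" rule; the `AlertTracker` class and its `window` and `threshold` parameters are hypothetical names of mine, not part of any monitoring product.

```python
from collections import deque
from typing import Deque


class AlertTracker:
    """Flag an incident only when failures persist across recent checks."""

    def __init__(self, window: int = 5, threshold: int = 3) -> None:
        # Assumed rule: at least `threshold` failures among the last `window` checks.
        self.threshold = threshold
        self.recent: Deque[bool] = deque(maxlen=window)

    def record(self, check_failed: bool) -> bool:
        """Store the latest check result and return True if it now looks like an incident."""
        self.recent.append(check_failed)
        return sum(self.recent) >= self.threshold


if __name__ == "__main__":
    tracker = AlertTracker(window=5, threshold=3)
    # A single failed check followed by recovery is treated as a one-off trap.
    for failed in [True, False, False]:
        print(tracker.record(failed))
```

The right window and threshold depend entirely on the service. For a business critical system you may prefer to treat even a single failed check as an incident and accept the occasional false alarm.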
Once you have established the context of the incident and informed the key stakeholders of its nature, it is important to proceed with diagnosing the cause of the incident and investigating it thoroughly so that it can be resolved.
Some of the key tasks that need to be done as part of the diagnosis are as follows:
- The support team is expected to have a health check checklist for the system that helps them understand the cause of the incident and whether or not the cause lies within the supported components. If you have one, the first thing to do is run through the checklist and perform the most common tasks.
- Most probably the common tasks will be checking your servers individually, checking machine-to-machine connectivity, network checks via telnet and ping, and HTTP checks for the web pages. Do them and see if you can find the cause. If the cause is obvious, you are most likely to find it within 5-10 minutes, provided you have a good health check plan and a supporting checklist for managing incidents.
- Visit the log files and error files. In all likelihood they contain information about the errors and misbehaviour of the system
- Check the components individually for errors and establish whether the cause can be isolated to a component, system, piece of software or any other entity.
- Prepare a resolution plan. Mind you, during incident management the topmost priority is quick service restoration. The RCA can always be done later, as long as the information and evidence are kept secure during the resolution process.
- Inform the stakeholders of the progress of the incident.
- Proceed to resolution
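The connectivity and HTTP checks from the checklist above are easy to script so that they can be run in seconds rather than typed by hand under pressure. This is only a rough sketch using the Python standard library; the helper names `tcp_check`, `http_check` and `run_checklist` are mine, and the hosts, ports and URL in the example are placeholders you would replace with your own.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable, List, Tuple


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection can be opened (roughly what a telnet test tells you)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the page answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, connection failures, timeouts and malformed URLs.
        return False


def run_checklist(checks: List[Tuple[str, Callable[[], bool]]]) -> List[Tuple[str, bool]]:
    """Run each named check in order and collect (name, passed) pairs."""
    return [(name, check()) for name, check in checks]


if __name__ == "__main__":
    # Placeholder targets: replace with the components you actually support.
    checklist = [
        ("app server port", lambda: tcp_check("app01.example.internal", 8080)),
        ("database port", lambda: tcp_check("db01.example.internal", 5432)),
        ("home page", lambda: http_check("https://www.example.com/")),
    ]
    for name, passed in run_checklist(checklist):
        print(f"{name}: {'OK' if passed else 'FAILED'}")
```

A script like this does not replace the checklist. It only makes the first few items on it quick and repeatable, so that you can spend those 5-10 minutes on reading the results rather than producing them.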
This phase makes the incident manager's life easy, but only if (and it is a big if) you did your work diligently in the earlier two phases. When you enter this phase, it is expected that you have already found the cause of the incident, have a plan in place to resolve it, and are ready to implement that plan.
Some of the most common tasks expected within this phase are as follows. They are mostly process oriented and obviously vary from project to project and team to team.
- Gather evidence and back up log files and anything else that would help the later RCA
- Obtain the necessary approvals and sign-offs for the incident resolution plan. For example, if the resolution involves bouncing a server, can this be done in the daytime? Does the business manager agree to a daytime bounce?
- Implement the resolution on production
- Inform the stakeholders of the resolution and update them on the expected RCA completion timeline
- Update the knowledge base with the resolution steps and the common symptoms of the problem, which will help you detect it more quickly if it occurs again
- Progress with the RCA on the basis of the gathered evidence and complete the analysis. Add the findings to the knowledge base and, if required, proceed to the permanent resolution, e.g. a code fix, patch upgrade or software upgrade.
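Gathering evidence before you touch production is the step most often skipped in the heat of the moment, so it helps to have it scripted. Below is a minimal sketch that copies log files into a timestamped folder and records a checksum for each one. The function name `preserve_evidence`, the folder naming and the `SHA256SUMS` manifest file are my own choices for illustration, not a standard.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Iterable


def preserve_evidence(log_files: Iterable[Path], evidence_root: Path, incident_id: str) -> Dict[str, str]:
    """Copy log files into a new timestamped folder and return their SHA-256 checksums.

    Assumes the file names are unique; two sources with the same name
    would overwrite each other in the evidence folder.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = evidence_root / f"{incident_id}_{stamp}"
    target.mkdir(parents=True, exist_ok=False)

    checksums: Dict[str, str] = {}
    for source in log_files:
        # copy2 also preserves timestamps, which are useful during the RCA.
        copied = Path(shutil.copy2(source, target))
        checksums[copied.name] = hashlib.sha256(copied.read_bytes()).hexdigest()

    # Write a manifest so the copies can be verified later.
    manifest = target / "SHA256SUMS"
    manifest.write_text("".join(f"{digest}  {name}\n" for name, digest in checksums.items()))
    return checksums
```

You would call it with the list of log files for the affected component, a folder on a disk that the resolution will not touch, and the incident reference, and only then go ahead with the bounce or the fix.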
In a real-world scenario, where teams are working under the pressure of supporting business critical systems such as banks, share trading sites and financial transaction sites, it is really important to have thorough checklists, proactive monitoring and very strong processes based on the above points to ensure that you resolve your incidents to a satisfactory level.
Beyond having a strong incident management process based on the above three parts, it is equally essential to complete a thorough RCA of the incident and not to allow a repeat incident of the same nature.
In the next post in this series on incident management, I would like to cover various RCA techniques and how you should put them into practice.
Huh .. it's well over midnight now and my little daughter is crying a bit in her sleep. So it's time to go back and carry out the duty of a father!