I have written quite a few articles on Incident Management recently, and you can find all of them here. In this article I want to put down my thoughts on incidents that are categorized as “self-inflicted” or “invited” incidents (henceforth SIIs) and how to prevent them from occurring.
Self-inflicted incidents that result in a service outage or disruption are normally followed by remedies against the vendor providing the application support services. Customers nowadays are careful to put remedy clauses in their contracts, so it is all the more important to keep incidents, let alone self-inflicted ones, away by putting additional monitoring and proactive measures in place.
I wrote earlier about the DDR framework for managing incidents and how important it is for major incidents to be detected earlier, diagnosed quicker and resolved sooner. It is worth reading that article if you have not done so yet.
While you are managing an incident, it is important to understand whether it can be classified as self-inflicted or not. The sooner you identify the type of incident (self-inflicted or situational), the better your chances of “managing” it appropriately and avoiding heavy fines / remedies against your organization.
The most common cause of SIIs is manual oversight or carelessness while making a change to the production system. Any change made to the production system without understanding its implications can be really harmful and can come back to bite you hard. Hence it is really important for application support teams to understand each change happening on the platform and around the platform, and then line up the implementation steps and the pre- and post-implementation checks accordingly to safeguard against potential SIIs.
Once you identify an incident as an SII, it is very important to “manage” it properly. Two key lessons to keep in mind while managing an SII are:
Never hide an SII from the customer
Never lie about the facts around an SII
Most of the time, I think, support team management takes a political route in handling the aftermath of an SII to save themselves from potential remedies and fines. While in some cases it makes sense to do so, more often than not, with a wiser and slightly smarter customer, it falls flat on its face. After all, your reputation is on the line!
If you have a very good working relationship with the customer, speak with them and explain the situation in full honesty. While you do that, it is equally important to learn the lesson and take steps so that the incident is not repeated. There is no point in making false promises to the customer if your team cannot keep them. If circumstances forced your team into an SII, explain those circumstances to the customer and see how they could be overcome. In most cases, where the customer is reasonably sensible (rather than horrible :-)), this approach will prevail.
Remember! It is always important to keep the customer informed and not keep them in the dark over the investigation. After all, you are the service provider and they are paying you for your services.
Now, moving on to tips for avoiding SIIs. There is no defined process or guaranteed path that ensures there will never be an SII while you are providing application support, but there are certainly enough tips and tricks to help you reduce the probability.
First of all, find out the most common root causes of the incidents that happened in the past year. More often than not, if an incident happened in the past year, its root cause was found and a fix was applied, there is a lesson you can take from that experience.
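As a rough illustration of this first tip, here is a minimal sketch in Python that tallies root causes over the past year. The incident records, the cause names and the top_root_causes helper are all made up for the example; in practice the data would come from whatever incident tracking tool your team uses.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical incident records: (date raised, root cause, self-inflicted?)
INCIDENTS = [
    (date(2024, 1, 10), "manual change error", True),
    (date(2024, 3, 2), "network failure", False),
    (date(2024, 5, 21), "manual change error", True),
    (date(2024, 8, 14), "missed post-implementation check", True),
    (date(2022, 11, 30), "disk full", False),  # older than one year, ignored
]


def top_root_causes(incidents, today, days=365, limit=3):
    """Count root causes of incidents raised within the last `days` days."""
    cutoff = today - timedelta(days=days)
    counts = Counter(cause for raised, cause, _ in incidents if raised >= cutoff)
    return counts.most_common(limit)


for cause, count in top_root_causes(INCIDENTS, today=date(2024, 12, 1)):
    print(f"{count} x {cause}")
```

The causes at the top of this list are the ones worth turning into checklist items first.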
Have a very good checklist for health-checking the system. Automate the monitoring of components and potential failure points as much as you can, so that in case of an incident the monitoring output helps you gather evidence.
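A health check checklist can be automated along these lines. This is only a sketch under my own assumptions: the three checks below are placeholders that return canned results, and real ones would query your database, disks, queues and so on. The point is that every result is timestamped, so it can serve as evidence later.

```python
from datetime import datetime, timezone


def run_health_checks(checks):
    """Run each named check and record a timestamped result as evidence."""
    results = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:  # a check that crashes is itself a finding
            ok, detail = False, f"check raised {exc!r}"
        results.append({
            "check": name,
            "ok": ok,
            "detail": detail,
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })
    return results


# Hypothetical checks with canned results, for illustration only.
def database_reachable():
    return True, "connected"


def disk_space():
    return False, "free space below threshold"


def monitoring_agent():
    raise RuntimeError("agent not installed")


results = run_health_checks({
    "database": database_reachable,
    "disk": disk_space,
    "agent": monitoring_agent,
})
failed = [r["check"] for r in results if not r["ok"]]
print("Failed checks:", failed)
```

Running something like this on a schedule and keeping the output gives you a history to look back on when an incident is raised.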
Keep a useful incident checklist handy. You can read about how to prepare an incident checklist in my previous article here. You need to take all your understanding of the platform, its connection points and its failure points into consideration when you create the checklist to detect and diagnose an incident.
Most importantly, for all scheduled / unscheduled changes on the platform, ensure that they are thoroughly checked, their implications are understood and the risk is flagged accordingly. There is no point in keeping quiet if you know that a network change might cause an outage to your portal when it is switched over. You are better off giving the customer a heads-up and seeking approval before such a change than explaining later why you allowed it on production! If the application support team can detect and predict which changes are potentially harmful to the system before they are approved for implementation, more than half of your job is done.
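One way to make this kind of risk flagging routine is to keep a simple map of which services depend on which components and check every change against it. The sketch below is purely illustrative: the dependency map, the Change record and the assess_change function are names I have invented, and the rule it applies (a change touching a dependency needs customer approval first) is just the one described above.

```python
from dataclasses import dataclass

# Hypothetical dependency map: service -> components it relies on
DEPENDENCIES = {
    "customer portal": {"core switch", "portal database", "load balancer"},
    "reporting": {"reporting database"},
}


@dataclass
class Change:
    summary: str
    components: set
    customer_approved: bool = False


def assess_change(change, dependencies=DEPENDENCIES):
    """Return the services a change could affect and whether it may proceed."""
    affected = sorted(
        service
        for service, needs in dependencies.items()
        if needs & change.components
    )
    # A change that touches a dependency needs customer approval first.
    may_proceed = not affected or change.customer_approved
    return affected, may_proceed


network_change = Change("Switch over core network", {"core switch"})
affected, may_proceed = assess_change(network_change)
print(f"Affected: {affected}, may proceed: {may_proceed}")
```

Here the network change is flagged as a risk to the customer portal and blocked until the customer has approved it.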
While implementing a change, obviously be very careful about what you are working on. Even if the change sounds simple and not intrusive or disruptive to the service, there is no point in being careless about it. I once managed an incident where one of my colleagues (a few years ago) had deleted production database tables instead of the reference database tables, and the system went down for a full 3 days!
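A cheap safeguard against this kind of mistake is to make destructive scripts refuse to run unless the environment they are connected to matches the one the operator says they intend to change. The following is a minimal sketch, not a real database tool: drop_tables, WrongEnvironmentError and the execute callable (a stand-in for a database cursor) are all hypothetical.

```python
class WrongEnvironmentError(Exception):
    """Raised when a destructive step targets an unintended environment."""


def drop_tables(connected_env, intended_env, tables, execute):
    """Drop tables only if we are connected to the intended environment."""
    if connected_env != intended_env:
        raise WrongEnvironmentError(
            f"connected to {connected_env!r} but change is for {intended_env!r}"
        )
    for table in tables:
        execute(f"DROP TABLE {table}")


# `statements.append` stands in for a real cursor's execute method.
statements = []
drop_tables("reference", "reference", ["lookup_codes"], statements.append)
print(statements)

try:
    # The operator meant the reference database but is connected to production.
    drop_tables("production", "reference", ["lookup_codes"], statements.append)
except WrongEnvironmentError as err:
    print("Blocked:", err)
```

The guard does nothing clever; it simply forces the person making the change to state the target explicitly, which is often enough to catch a wrong connection before any damage is done.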
There are a lot of things you can do as an application support team to avoid potential SIIs and ultimately maintain a stable system. I have noted a few of them above; do let me know if there are more, and share your knowledge with me too!