Production incidents are one of the best avenues to accelerate the maturity of a managed platform. While incidents are stressful when we are dealing with them they provide clear and direct feedback on gaps in the platform. First, don’t indulge in blame games and don’t waste time fretting that it has happened. Second, if you step back from the heat incidents are an excellent means to learn more about your assumptions and risks.
- Did you assume that an external dependency will always be available? More specifically, did you assume that the dependency will respond within a certain threshold latency window?
- Was there manual effort involved in identifying the problem? if so, how much time did it take to get to the root cause? what was missing in your supportability tooling? Every manual task opens the door for additional risks so examining them is key. Think about how to get to the root cause faster:
- Instrumentation about what was happening during the incident – were there pending transactions? pending events to be processed? how “busy” was your process or service and was that below or above expected thresholds?
- Is there a particular poison / rogue message that triggered a chain reaction of sorts? did your platform get overwhelmed by too many requests within a certain time window?
- Did you get alerted? if so, was the alert about a symptom or did it provide any clues to the underlying root cause? did it include enough diagnostic information for additional troubleshooting? was there an opportunity to treat the issue as an intermittent failure – instead of alerting, could the platform have automatically healed itself?
- Was the issue caused by a ill-behaved component or external dependency? If so, has this happened before (routine) or is it new behavior?
- Think about defect prevention and proactive controls. There are a variety of strategies to achieve this: load shedding, deferring non-critical maintenance activities, monitoring trends for out of band behavior, and so on. Invest in automated controls that warn threshold breaches: availability of individual services within the platform, unusual peak / drop in requests, rogue clients that hog file system or other critical platform resources, etc.
The above isn’t an exhaustive list but the key message is to use the incident as an opportunity to improve the managed platform holistically. Don’t settle for a band-aid that will simply postpone a repeat incident!