Make Services Fault Tolerant & Supportable for Production Use

A lot of teams are building services for clients both internal and external to your organization. Typically, there is quite a bit of focus on succeeding from a functional sense – did we get the key requirements addressed? does it cover the plethora of rules across markets / jurisdictions? and so on. Some of the more experienced teams, consider the non-functional aspects as well – e.g. logging, auditing, exception handling, metrics, etc. and I talked about the value of service mediation for addressing these in an earlier post.

There is an expanded set of capabilities that are also necessary when addressing non-functional requirements – those that are very relevant specially when your service grows in popularity and usage. They fall under two categories: operational agility and fault-tolerance.  Here are a few candidate capabilities in both these categories:

Operational Agility / Supportability:

  • Ability to enable / disable both services and operations within a service
  • Ability to provision additional service instances based on demand (elastic scaling)
  • Maintenance APIs for managing resources (reset connection pool, individual connections, clear cached data, etc.)
  • Ability to view Service APIs that are breaching and ones that are the risk of breaching operational SLAs
  • Model and detect out of band behavior with respect to resource consumption, transaction volumes, usage trends during a time period etc.

Fault Tolerance:

  • Failing fast when there is no point executing operations partially
  • Ability to detect denial of service attacks from malicious clients
  • Ability to gracefully handle unexpected usage spikes via load shedding, re-balancing, deferring non-critical tasks, etc.
  • Detecting failures that impact individual operations as well as services as a whole
  • Dealing with unavailable downstream dependencies
  • Leveraging time outs when integrating with one or more dependencies
  • Automatically recovering from component failures

In future posts, I will expand on each of these topics covering both design and implementation strategies. It is also important to point out that both these aspects are heavily interconnected and influence each other.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: