June 9, 2016
When you build and evolve a managed platform, there are a variety of resources to think and reason about. These could be:
- Local dependencies – file system (local, network based mounts)
- External dependencies – REST services (proprietary APIs, public APIs)
- State stores – data stores (firm-hosted, cloud-hosted stores,…)
- Platform services – services that manage the platform functionality (provisioning, runtime management, operational tooling,…)
- Managed Resources – these are resources that your platform is managing on behalf of clients. These are the bread-and-butter abstraction over which you typically will have most control
- Desired States – this is the list of features that must be made available for regular processing, when platform is being upgraded, when platform is undergoing routine maintenance, etc.
This isn’t an exhaustive list by any means but my intention is to summarize resources a managed platform is made up of. There are numerous benefits in making your platform machine readable. You can now:
- Think about the assumptions your team is making about non-functional requirements and characteristics of your internal services, dependencies, and managed resources. These would be things like response time, availability, etc. and can help you design and tune integration proxies.
- Catalogue features that are important for your clients and under what conditions (e.g. ability to handle a provisioned resource, upgrading, etc.). More specifically, define the relationship between features and resource dependencies.
- Define all these elements in a single, consistent, machine readable definition. This will allow your team to view resources and their state, visualize and report feature dependencies.
- Design/ implement feature toggling – feature states can be derived using resources
- Apply self-healing techniques – reset resource (e.g. close and re-initialize a corrupt connection pool, automatically start service instances in the event of a host crashing, etc.)
I will explore each of these points in much more detail in follow up posts. Needless to say, getting your platform in a machine readable state has several benefits.
June 8, 2016
There are a number of risks when your platform integrates with an external service / dependency. For instance, here are a few risks and things that can go wrong:
- Doesn’t respond at all. Just blocks indefinitely eating client-side resources.
- Responds progressively slower – i.e. response time degradation.
- Needs retry logic to deal with transient failures (Note: obviously care needs to be taken if the call isn’t idempotent!)
- Responds with an unexpected return code – e.g. internal server error or service unavailable error, etc.
- Gets overwhelmed by the rate of requests being set to it. Ideally, it should have protection against this but what if it is not entirely in your team’s control.
- Becomes unavailable throwing runtime exceptions forcing undesirable side-effects on the caller rather than failing fast.
Michael Nygard in his book Release It! talks about leveraging circuit breakers to deal with integration risks. Broadening that idea a bit, we could combine circuit breakers and mediation into a more generic Integration Proxy component. This proxy could implement a number of common concerns when working with external APIs:
- Capture response time and route metrics to an analytics agent asynchronously
- Monitor stale connections and automatically reset them if possible
- Host the circuit breaker with associated logic to toggle based on service health
- Provide “fallback” responses if circuit breaker kicks in to disable integration point.
- Host sleep / retry invocation logic using parameters like interval and max attempts
- Automatically flush pending / bufferred messages when service is available again.
- Enable request and response capture – specially for debugging production issues.
June 5, 2016
Production incidents are one of the best avenues to accelerate the maturity of a managed platform. While incidents are stressful when we are dealing with them they provide clear and direct feedback on gaps in the platform. First, don’t indulge in blame games and don’t waste time fretting that it has happened. Second, if you step back from the heat incidents are an excellent means to learn more about your assumptions and risks.
- Did you assume that an external dependency will always be available? More specifically, did you assume that the dependency will respond within a certain threshold latency window?
- Was there manual effort involved in identifying the problem? if so, how much time did it take to get to the root cause? what was missing in your supportability tooling? Every manual task opens the door for additional risks so examining them is key. Think about how to get to the root cause faster:
- Instrumentation about what was happening during the incident – were there pending transactions? pending events to be processed? how “busy” was your process or service and was that below or above expected thresholds?
- Is there a particular poison / rogue message that triggered a chain reaction of sorts? did your platform get overwhelmed by too many requests within a certain time window?
- Did you get alerted? if so, was the alert about a symptom or did it provide any clues to the underlying root cause? did it include enough diagnostic information for additional troubleshooting? was there an opportunity to treat the issue as an intermittent failure – instead of alerting, could the platform have automatically healed itself?
- Was the issue caused by a ill-behaved component or external dependency? If so, has this happened before (routine) or is it new behavior?
- Think about defect prevention and proactive controls. There are a variety of strategies to achieve this: load shedding, deferring non-critical maintenance activities, monitoring trends for out of band behavior, and so on. Invest in automated controls that warn threshold breaches: availability of individual services within the platform, unusual peak / drop in requests, rogue clients that hog file system or other critical platform resources, etc.
The above isn’t an exhaustive list but the key message is to use the incident as an opportunity to improve the managed platform holistically. Don’t settle for a band-aid that will simply postpone a repeat incident!
June 4, 2016
Creating a managed platform is a powerful strategy – key is to help your clients and proactively manage adoption risks. Risks are everywhere from losing control on infrastructure, release management, upgrades to reduced learning curve and operational supportability. Here are a few strategies to manage adoption risks – these will not only help your clients but help the platform team as well:
- Understand key technical drivers for platform adoption – what do your clients care about the most? Is it faster functional development? ease of deployment? rich tooling? testability? ability to dip into a rich developer ecosystem?
- Provide an integrated console for integrating provisioning, runtime management, and operational support. The key word here is integrated – an integrated toolset that makes it easier for a team to provision a resource, deploy / activate it, elastically scale it , and troubleshoot problems is extremely important.
- Empathize with your client’s adoption challenges: they are losing direct control and access in exchange for a host of powerful platform benefits. But they still need answers to questions like:
- how rich and useful is the instrumentation (for transparency into transactions or events or requests being handled, for errors / warnings whilst processing, historical metrics / trends)?
- how do I get access to log messages? are the logs linked to particular request ids or transaction references? how much is the latency between actual processing and log messages reflecting them?
- can I help myself is something goes wrong during production use? e.g. what if a process or execution takes longer than expected? what if it crashes mid-way? is there support for automatic alerting? how easy or difficult is train my devops team members?
- Provide automated controls to reduce risk when hosting untrusted code. Let’s face it – managed platforms take on a large amount of risk by hosting code that is largely outside it’s control. It is therefore, very critical to reduce defects and address risks via automated controls. You can check for unsupported API calls in your SDK, risky or unsafe libraries being packaged, etc. to address risks while provisioning. This is a vast topic and I will author a follow up post on controls and why they are indispensable to create stable managed platforms
June 2, 2016
In an earlier post, I wrote why managed platforms can drive large scale reuse alongside other benefits. Given their importance, it is imperative to think about avoiding failure. Below are 5 reasons why platform efforts fail:
- Not providing a public API that speaks to your client’s domain and exposing implementation details. Remember if you don’t manage this, whatever is shared will become the de facto public API. This in turn will tie your hands considerably when you need to refactor, improve, refine the API.
- Ignoring developers and their experience when using the platform’s public APIs and tools.
- Not investing in automated tests that can certify platform functionality
- Making it difficult (or sometimes impossible) to customize behavior. Clients forced to learn platform-specific terms and practices at the expense of their actual problem.
- Assuming your team has all the answers – this manifests in the form of inability to listen to what your clients are saying, not collaborating with them effectively, and not exploring opportunities to co-create / co-evolve the platform solutions
May 31, 2016
I wrote earlier about easing integration when providing reusable software assets. In this post, will elaborate on specific techniques for driving platform adoption & their rationale. Recognize that most developers who will evaluate your platform are trying to solve a specific business / functional problem. They want to quickly ascertain if what you are providing is a good fit or not. Instead of convincing them, help them arrive at a decision. Fast. How exactly do you do that? Here are a few ideas:
- Provide details on the kinds of use cases your platform is designed to address. Equally important is to be transparent about the use cases that you don’t support. Not now and never ever will.
- Create developer accelerators – e.g. a Maven Project Archetype or a sample project to try out common functionality
- Identify areas where developers can extend the platform functionality – how will they supply or override new behavior? How will you make it possible to inject, easy to test, and safe to execute? There are lots of techniques that you can use but first you have to decide to what extent you want to allow this in the first place.
- Make your platform available in “localhost” mode – i.e. conducive for use with the IDE toolset. This is more challenging than what it sounds – e.g. if your platform isn’t modular, making it work in local mode will be very challenging. Ditto if your platform relies on external services / connectivity / data stores, etc. that aren’t easily replaceable with in-memory / mock equivalents.
- Allow developers to discover your platform via multiple learning paths. Some might want to explore using a series of Kata lessons that tackle increasingly complex use cases. Others might be looking for answers to a specific problem. You need a user guide, code kata, examples, and more importantly, you need to make them easy to access.
- Identify which areas of the platform adoption curve are the most time consuming and figure out how to reduce if not eliminate them entirely. For instance – does your platform require an elaborate onboarding process? Are there steps that can be deferred till production deployment?
May 29, 2016
Managed platforms are a very effective and pragmatic way to drive systematic reuse across an organization. Managed platforms can provide a number of benefits ranging from simplified developer experience, out of the box productivity components and tooling, and most importantly, a whole host of non-functional concerns being addressed in an integrated fashion. Good managed platforms exhibit some common traits. They:
- Solve a specific problem really, really well
- Are easy to signup, develop in, and use via a developer SDK
- Provide a number of integrated components that address specific pain points
- Are extensible and provide clear and safe injection points via a public, published, well maintained API
- Free the developer from having to procure hardware or orchestrate deployment activities
- Make it easy to report bugs, fixes, and contribute enhancements
The best part of managed platforms? They can dramatically alter the productivity curve for your development teams. Every developer doesn’t have to worry about high availability, horizontal scaling, capacity management, public APIs, version management, backward compatibility, and ongoing care and feeding of core reusable components. The platform provides value and more specifically peace of mind via powerful abstractions.
This isn’t an exhaustive list but given the general push towards cloud based architectures, good platforms will give your teams much more than just reuse! I will elaborate on each of the above traits in follow up posts.