June 13, 2016
There are a lot of teams building platforms – whether for an internal organization, for a public cloud, or somewhere in between. The question to ask yourself: are you focused on providing a managed service, or are you focused only on being a technical platform? The question is key because a managed service is all about understanding customer priorities, focusing on customer success, and being paranoid about all aspects of the offering, not only the technical bit.
Don’t get me wrong – without the underlying technology the platform doesn’t work, it doesn’t exist, and there isn’t any foundation to stand on. However, the purely technical bit is necessary but not sufficient for your platform to succeed. Shift your perspective and view the platform from the vantage point of offering a managed service. You will appreciate that there are additional elements that are equally important:
- How do customers sign up to use the platform? What prerequisites exist and at what point in the process do they have to address them?
- How is the platform supported? Is there a dedicated support team and a documented, communicated procedure for escalating production issues? How are incidents and follow-ups handled?
- What is the release management philosophy for the platform? Is there a published schedule and is it honored? What about critical bug fixes and their rollout time windows?
- Are you promising any specific uptime / availability numbers to customers? Do you have a business rationale for these commitments (i.e. non-functional requirements aren’t being assumed…)?
- Will you charge your customers for platform usage? If so, how will you capture usage statistics and how will that translate to billing units? What changes have to be made in your software stack to account for usage exceeding limits/quotas, usage during critical operational time windows, usage during business critical events, etc.?
- How will your current and prospective customers find out about new features (including ones that are beta/early-release vs. those that are available for broader use)? Is there a committed testing procedure that promotes a feature into the platform?
- How do you certify that the platform does what it is supposed to do – iteration after iteration, release after release? Do you have tests and test evidence tying customer-facing features and their availability to a particular platform version under test?
- What is your philosophy on public APIs? Do you have / plan to provide language bindings against your public REST APIs, for instance? If so, who owns them and who keeps them up to date as the underlying platform goes through revisions?
- How is capacity managed in the platform? Is there an overall resource pool, or are there customer-specific runtimes? How can you accommodate unforeseen spikes in demand? Have you tied new customer provisioning to platform inventory management?
The point of the above isn’t to list every possible facet of what you need to think and plan for. It is to illustrate the fact that providing a managed service is more than just having a technical platform.
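To make the billing question a little more concrete, here is a minimal sketch of capturing usage and translating it into billing units. Everything in it is an assumption for illustration: the `UsageMeter` class, the per-period `quota`, and the conversion of 1000 requests to one billing unit are hypothetical and not taken from any particular platform.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UsageMeter:
    """Tracks per-customer usage and converts it to billing units (hypothetical model)."""
    quota: int                      # requests included per billing period (assumed)
    requests_per_unit: int = 1000   # assumed conversion: 1000 requests = 1 billing unit
    counts: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, customer: str, requests: int = 1) -> bool:
        """Record usage; return False once the customer has exceeded the quota."""
        self.counts[customer] += requests
        return self.counts[customer] <= self.quota

    def billing_units(self, customer: str) -> int:
        """Round the recorded usage up to whole billing units."""
        used = self.counts[customer]
        return -(-used // self.requests_per_unit)  # ceiling division


meter = UsageMeter(quota=2500)
within_quota = meter.record("acme", 2000)   # still inside the quota
over_quota = not meter.record("acme", 600)  # 2600 requests exceeds 2500
```

What the platform does when `record` reports an overage (throttle, bill at a different rate, alert the account owner) is exactly the kind of managed-service decision the list above is asking about.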
June 12, 2016
Managed platforms are quite effective in providing a host of benefits. There is quite rightly a strong focus on the core value that a platform provides. To increase the odds of success with your platform, focus on building a broader ecosystem.
What do I mean by a broader ecosystem? Here are a few things to consider:
- Co-creation: Enable your client developer community to create capabilities for the platform. Design, publish, and evangelize APIs that allow them to extend and contribute to the platform. You can start simple via interfaces or have a full-fledged plugin mechanism. Either way, ensure there is a well-defined process to assess fit, maturity, and platform integration. More importantly, ensure there is an easy way for contributors to test their creations. You can provide utilities that work with common testing frameworks such as JUnit (e.g. a Rule that encapsulates the complexities of using the underlying platform APIs, bootstrapper classes that initialize / shut down platform APIs, etc.).
- Enterprise Integration: Make your platform work out of the box with existing platforms / standards within the enterprise, specifically in areas such as authentication / authorization, alerting / notifications, instrumentation, discovery, etc. Taking optionality away from your clients is a very good thing in this space: they save costs because these things work natively, with no need to worry about creating, testing, and integrating them. Obviously, if something doesn’t exist and needs to, co-create it with your clients if that can be done as part of a business deliverable!
- Documentation: Platform teams must make client-facing documentation a priority. Think about making your firm’s developers more productive. Make it easy for them to learn the core concepts and API constructs so they can get up and running on the platform. Provide API documentation (such as Javadocs), cheatsheets, user manuals, and code examples that demonstrate typical usage scenarios.
- Support Tooling: Getting clients to self-serve via enterprise-approved tools will be extremely important for their day-to-day experience with the platform. Think about logging, remote debugging, tracing, reconstructing incident state, etc. and provide integration with tools that they are already familiar with from existing apps. You may not need or want to expose too much here, but constantly think about native integration that is presented in a manner that makes it easy for clients to get to the root cause. If they keep calling you to troubleshoot issues, it is time to introspect and improve!
- Community Events: Provide opportunities and channels for your clients to communicate not only with the platform team but also with each other. Regular meetups – from simple introductions to joint design / contribution reviews and updates – are very effective. If there is a broader community to rely on, clients can help themselves and learn from each other on an ongoing basis.
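As an illustration of the testing utilities mentioned under co-creation, here is a rough sketch of a bootstrapper that starts and shuts down the platform around a contributor’s test. It uses a Python context manager with `unittest` as a stand-in for the JUnit Rule idea. `FakePlatform`, `platform_under_test`, and the plugin registration calls are hypothetical names invented for this example, not a real platform API.

```python
import unittest
from contextlib import contextmanager


class FakePlatform:
    """Hypothetical in-memory stand-in for the real platform API."""

    def __init__(self):
        self.running = False
        self.plugins = {}

    def start(self):
        self.running = True

    def shutdown(self):
        self.running = False

    def register(self, name, handler):
        if not self.running:
            raise RuntimeError("platform is not running")
        self.plugins[name] = handler

    def invoke(self, name, payload):
        return self.plugins[name](payload)


@contextmanager
def platform_under_test():
    """Bootstrap the platform before a test and always shut it down afterwards."""
    platform = FakePlatform()
    platform.start()
    try:
        yield platform
    finally:
        platform.shutdown()


class UppercasePluginTest(unittest.TestCase):
    """What a contributor's test could look like with the helper above."""

    def test_plugin_is_invoked(self):
        with platform_under_test() as platform:
            platform.register("upper", str.upper)
            self.assertEqual(platform.invoke("upper", "hello"), "HELLO")
        self.assertFalse(platform.running)
```

The point of the helper is that contributors never write the start / shutdown calls themselves, so a change in how the platform bootstraps does not ripple through every contributed test.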
June 10, 2016
Let’s say your team built an amazing managed platform – one that enables large-scale systematic reuse and provides several compelling benefits to your organization and its developer community.
You break open the champagne and celebrate, right? Not so fast!
Sustaining and evolving the platform requires discipline and persistence. As with any other complex system, entropy will set in, and if you don’t get ahead of it, the platform will wither away.
So, what does platform entropy look like and how do we tackle it? Instead of trying to define it, let me suggest a few signs to watch out for:
- Focus shifts disproportionately away from improving the platform’s key functional use cases – i.e. the platform is not constantly improving the manner in which it addresses the bread-and-butter business problem it was designed for.
- There isn’t a core group of committers who are constantly monitoring the health and well being of the platform codebase. This includes tracking and fixing nagging bugs, modifying and correcting abstractions, introducing better documentation, making it easy for new developers to understand and extend functionality, etc. If there isn’t anyone obsessing about a sustainable codebase, entropy will win. It is just a matter of time.
- There isn’t a committed platform roadmap and releases start to become more and more ad hoc. If the platform codebase isn’t released in a frequent, easy-to-execute fashion, don’t be surprised if it gets harder and harder to deploy critical fixes and upgrades!
- The feedback from existing and potential client developers isn’t acted upon. Are you truly listening to your client developers? Are they complaining of onboarding complexities? Platform jargon that takes a long time to learn? Inability to test their code without jumping through hoops?
- There is no investment in making the platform easy to support. Supportability helps with overall platform health in numerous ways – most importantly, it drives your clients’ ability to self-serve. If support tooling and automation are short-changed, you will spend an ever-increasing amount of time and effort trying to get the platform to behave in production. If your team is having a tough time differentiating between platform issues and issues in the client code, it is time to invest in better and more supportable tooling. Both your teams and your clients deserve that investment.
June 9, 2016
When you build and evolve a managed platform, there are a variety of resources to think and reason about. These could be:
- Local dependencies – file system (local, network based mounts)
- External dependencies – REST services (proprietary APIs, public APIs)
- State stores – data stores (firm-hosted, cloud-hosted stores,…)
- Platform services – services that manage the platform functionality (provisioning, runtime management, operational tooling,…)
- Managed Resources – these are resources that your platform is managing on behalf of clients. These are the bread-and-butter abstraction over which you will typically have the most control.
- Desired States – this is the list of features that must be made available during regular processing, when the platform is being upgraded, when the platform is undergoing routine maintenance, etc.
This isn’t an exhaustive list by any means but my intention is to summarize resources a managed platform is made up of. There are numerous benefits in making your platform machine readable. You can now:
- Think about the assumptions your team is making about non-functional requirements and characteristics of your internal services, dependencies, and managed resources. These would be things like response time, availability, etc. and can help you design and tune integration proxies.
- Catalogue features that are important for your clients and under what conditions (e.g. ability to handle a provisioned resource, upgrading, etc.). More specifically, define the relationship between features and resource dependencies.
- Define all these elements in a single, consistent, machine readable definition. This will allow your team to view resources and their state, visualize and report feature dependencies.
- Design / implement feature toggling – feature states can be derived from the states of the resources they depend on.
- Apply self-healing techniques – reset resources (e.g. close and re-initialize a corrupt connection pool, automatically start service instances in the event of a host crashing, etc.).
I will explore each of these points in much more detail in follow up posts. Needless to say, getting your platform in a machine readable state has several benefits.
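To give a flavor of what a machine-readable definition could look like, here is a minimal sketch that models resources and features and derives feature toggles from resource state. The `Resource`, `Feature`, and `State` types, the field names, and the example resource names are all assumptions made up for this illustration. A real definition might just as well live in a configuration file.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass
class Resource:
    name: str
    kind: str              # e.g. "external", "state-store", "platform-service"
    max_response_ms: int   # assumed non-functional requirement for this resource
    state: State = State.UP


@dataclass
class Feature:
    name: str
    requires: list         # names of the resources this feature depends on


def feature_states(features, resources):
    """Derive each feature's on/off toggle from the state of its dependencies."""
    by_name = {r.name: r for r in resources}
    return {
        f.name: all(by_name[dep].state is State.UP for dep in f.requires)
        for f in features
    }


# Hypothetical platform definition: one healthy dependency, one that is down.
resources = [
    Resource("pricing-api", "external", max_response_ms=200),
    Resource("orders-db", "state-store", max_response_ms=50, state=State.DOWN),
]
features = [
    Feature("quote", requires=["pricing-api"]),
    Feature("checkout", requires=["pricing-api", "orders-db"]),
]
toggles = feature_states(features, resources)
```

With this definition, `toggles` reports `quote` as available and `checkout` as unavailable, because `checkout` depends on a state store that is down. The same structure could feed a dashboard, a dependency report, or a self-healing routine that resets resources that are not `UP`.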
June 8, 2016
There are a number of risks when your platform integrates with an external service / dependency. For instance, here are a few risks and things that can go wrong:
- Doesn’t respond at all. Just blocks indefinitely eating client-side resources.
- Responds progressively slower – i.e. response time degradation.
- Needs retry logic to deal with transient failures (Note: obviously care needs to be taken if the call isn’t idempotent!)
- Responds with an unexpected return code – e.g. internal server error or service unavailable error, etc.
- Gets overwhelmed by the rate of requests being sent to it. Ideally, it should have protection against this, but what if it is not entirely in your team’s control?
- Becomes unavailable and throws runtime exceptions, forcing undesirable side effects on the caller rather than failing fast.
Michael Nygard in his book Release It! talks about leveraging circuit breakers to deal with integration risks. Broadening that idea a bit, we could combine circuit breakers and mediation into a more generic Integration Proxy component. This proxy could implement a number of common concerns when working with external APIs:
- Capture response time and route metrics to an analytics agent asynchronously
- Monitor stale connections and automatically reset them if possible
- Host the circuit breaker with associated logic to toggle based on service health
- Provide “fallback” responses if circuit breaker kicks in to disable integration point.
- Host sleep / retry invocation logic using parameters like interval and max attempts
- Automatically flush pending / buffered messages when service is available again.
- Enable request and response capture – especially for debugging production issues.
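A few of these concerns can be sketched in code. The following is a simplified illustration, not the implementation from Release It!: a hypothetical `IntegrationProxy` class that combines sleep / retry logic, a basic circuit breaker, a fallback response, and response time capture. The class name, parameter names, and default values are assumptions, and the in-memory `response_times` list merely stands in for an asynchronous analytics agent. Retrying like this is only safe when the wrapped call is idempotent.

```python
import time


class IntegrationProxy:
    """Wraps calls to an external service with retry, a circuit breaker, and a fallback."""

    def __init__(self, call, fallback, max_attempts=3, interval=0.1,
                 failure_threshold=2, reset_after=30.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.call = call                    # function that talks to the external service
        self.fallback = fallback            # function that produces the fallback response
        self.max_attempts = max_attempts    # attempts per invocation
        self.interval = interval            # seconds to sleep between attempts
        self.failure_threshold = failure_threshold  # failed invocations before opening
        self.reset_after = reset_after      # seconds before a trial call is allowed
        self.clock = clock
        self.sleep = sleep
        self.failures = 0                   # consecutive failed invocations
        self.opened_at = None               # when the breaker opened, None if closed
        self.response_times = []            # stand-in for a metrics agent

    def _is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Half-open: allow one trial invocation; a failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return False
        return True

    def invoke(self, *args):
        if self._is_open():
            return self.fallback(*args)     # fail fast without touching the service
        for attempt in range(1, self.max_attempts + 1):
            started = self.clock()
            try:
                result = self.call(*args)
            except Exception:
                if attempt < self.max_attempts:
                    self.sleep(self.interval)
                continue
            self.response_times.append(self.clock() - started)
            self.failures = 0
            return result
        # Every attempt failed: count it and open the breaker if needed.
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
        return self.fallback(*args)
```

Passing `clock` and `sleep` in as parameters is a design choice that keeps the proxy testable without real waiting. A fuller version would also distinguish transient from permanent errors before retrying.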
June 7, 2016
You need to practice minimal design to be effective with systematic reuse. The design needs to continuously look for opportunities to align iteration goals with your systematic reuse roadmap. Too many developers mistakenly think that adopting agile means abandoning design. This couldn’t be further from the truth. You design whether you explicitly allocate time for it or not. Your code will reflect the design, and you will affect the technical debt of your codebase one way or another. Your end goal should be implementing user stories and paying down technical debt, not avoiding design altogether.
Always design for meeting your iteration goals. Avoid designing for several weeks or months and surely avoid putting technical elegance ahead of delivering real user needs. You should design minimally. Just enough to take advantage of existing reusable components, identify new ones, and plan refactoring to existing code. Specifically this means:
1. Keeping a list of short-term and medium-term business goals in mind when designing
2. Always looking for ways to make domain-relevant software assets more reusable
3. Being aware of which distribution channels your business is looking to grow
4. Ensuring the design reflects the domain as closely as possible and that your reusable assets map to commonly occurring entities in your business domain
5. Placing value on identifying the product lines that your business wants to invest in and evolving your reusable assets to mirror product line needs
6. Treating design not as a pursuit of perfection but as an iterative exercise in alignment with your domain
What you decide to encapsulate, abstract, and scale are all natural byproducts of this design approach. Rather than spending a lot of effort on a one-time design approach, you need to do just enough design.
June 5, 2016
Production incidents are one of the best avenues to accelerate the maturity of a managed platform. While incidents are stressful when we are dealing with them, they provide clear and direct feedback on gaps in the platform. First, don’t indulge in blame games and don’t waste time fretting that it has happened. Second, if you step back from the heat, incidents are an excellent means to learn more about your assumptions and risks.
- Did you assume that an external dependency will always be available? More specifically, did you assume that the dependency will respond within a certain threshold latency window?
- Was there manual effort involved in identifying the problem? If so, how much time did it take to get to the root cause? What was missing in your supportability tooling? Every manual task opens the door for additional risks, so examining them is key. Think about how to get to the root cause faster:
  - Instrumentation about what was happening during the incident – were there pending transactions? Pending events to be processed? How “busy” was your process or service, and was that below or above expected thresholds?
  - Is there a particular poison / rogue message that triggered a chain reaction of sorts? Did your platform get overwhelmed by too many requests within a certain time window?
- Did you get alerted? If so, was the alert about a symptom or did it provide any clues to the underlying root cause? Did it include enough diagnostic information for additional troubleshooting? Was there an opportunity to treat the issue as an intermittent failure – instead of alerting, could the platform have automatically healed itself?
- Was the issue caused by an ill-behaved component or external dependency? If so, has this happened before (routine) or is it new behavior?
- Think about defect prevention and proactive controls. There are a variety of strategies to achieve this: load shedding, deferring non-critical maintenance activities, monitoring trends for out-of-band behavior, and so on. Invest in automated controls that warn of threshold breaches: availability of individual services within the platform, an unusual peak / drop in requests, rogue clients that hog the file system or other critical platform resources, etc.
The above isn’t an exhaustive list but the key message is to use the incident as an opportunity to improve the managed platform holistically. Don’t settle for a band-aid that will simply postpone a repeat incident!
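As one example of an automated control that warns of threshold breaches, here is a minimal sketch of a sliding-window monitor that flags an unusual peak or drop in requests. The `RateMonitor` class, its thresholds, and the window size are hypothetical values chosen for illustration. It also assumes timestamps are recorded in increasing order.

```python
from collections import deque


class RateMonitor:
    """Flags when the number of requests in a sliding window leaves expected bounds."""

    def __init__(self, window_seconds, low, high):
        self.window_seconds = window_seconds
        self.low = low      # fewer requests than this counts as an unusual drop
        self.high = high    # more requests than this counts as an unusual peak
        self.timestamps = deque()

    def record(self, timestamp):
        """Record one request; timestamps are assumed to arrive in order."""
        self.timestamps.append(timestamp)

    def check(self, now):
        """Return 'peak', 'drop', or 'ok' for the window ending at `now`."""
        # Evict requests that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        count = len(self.timestamps)
        if count > self.high:
            return "peak"
        if count < self.low:
            return "drop"
        return "ok"
```

A check like this could feed an alert, or trigger load shedding directly when it reports a peak, so that the platform reacts before the breach turns into an incident.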