June 9, 2016
When you build and evolve a managed platform, there are a variety of resources to think and reason about. These could be:
- Local dependencies – file system (local, network based mounts)
- External dependencies – REST services (proprietary APIs, public APIs)
- State stores – data stores (firm-hosted, cloud-hosted stores,…)
- Platform services – services that manage the platform functionality (provisioning, runtime management, operational tooling,…)
- Managed Resources – resources that your platform manages on behalf of clients. These are the bread-and-butter abstractions over which you will typically have the most control
- Desired States – the list of features that must be available during regular processing, when the platform is being upgraded, when the platform is undergoing routine maintenance, etc.
This isn’t an exhaustive list by any means, but my intention is to summarize the resources a managed platform is made up of. There are numerous benefits to making your platform machine readable. You can now:
- Think about the assumptions your team is making about non-functional requirements and characteristics of your internal services, dependencies, and managed resources. These would be things like response time, availability, etc. and can help you design and tune integration proxies.
- Catalogue features that are important for your clients and under what conditions (e.g. ability to handle a provisioned resource, upgrading, etc.). More specifically, define the relationship between features and resource dependencies.
- Define all these elements in a single, consistent, machine readable definition. This will allow your team to view resources and their state, visualize and report feature dependencies.
- Design / implement feature toggling – feature states can be derived from resource states
- Apply self-healing techniques – reset resource (e.g. close and re-initialize a corrupt connection pool, automatically start service instances in the event of a host crashing, etc.)
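To make the idea concrete, a machine readable platform definition could be sketched as a simple declarative structure that maps features to their resource dependencies. The resource names, feature names, and latency figures below are hypothetical examples, not a prescribed schema:

```python
# Minimal sketch of a machine readable platform definition.
# Resource names, feature names, and thresholds are hypothetical.

RESOURCES = {
    "orders-db":    {"type": "state-store",         "expected_latency_ms": 50},
    "pricing-api":  {"type": "external-dependency", "expected_latency_ms": 200},
    "shared-mount": {"type": "local-dependency",    "expected_latency_ms": 10},
}

# Features declare which resources they depend on.
FEATURES = {
    "order-intake":  ["orders-db", "pricing-api"],
    "batch-reports": ["orders-db", "shared-mount"],
}

def feature_state(feature, resource_health):
    """Derive a feature's state from the health of its dependencies.

    resource_health maps resource name -> True (healthy) / False.
    A feature is enabled only when every dependency is healthy.
    """
    return all(resource_health.get(r, False) for r in FEATURES[feature])

health = {"orders-db": True, "pricing-api": False, "shared-mount": True}
print(feature_state("order-intake", health))   # False: pricing-api is down
print(feature_state("batch-reports", health))  # True
```

With a definition like this in one place, the same data can drive feature toggles, dependency reports, and visualizations.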
I will explore each of these points in much more detail in follow-up posts. Needless to say, getting your platform into a machine readable state has several benefits.
June 8, 2016
There are a number of risks when your platform integrates with an external service or dependency. For instance, here are a few things that can go wrong:
- Doesn’t respond at all. Just blocks indefinitely eating client-side resources.
- Responds progressively slower – i.e. response time degradation.
- Needs retry logic to deal with transient failures (Note: obviously care needs to be taken if the call isn’t idempotent!)
- Responds with an unexpected return code – e.g. internal server error or service unavailable error, etc.
- Gets overwhelmed by the rate of requests being sent to it. Ideally it should protect itself against this, but that protection may not be entirely in your team’s control.
- Becomes unavailable, throwing runtime exceptions that force undesirable side-effects on the caller rather than failing fast.
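The retry point above can be sketched briefly. This is a minimal illustration, assuming exponential backoff with jitter and a transient `IOError`; the parameter values are placeholders, and it only makes sense for idempotent calls:

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay_s=0.5):
    """Retry sketch for transient failures; parameters are illustrative.

    Only safe for idempotent calls - retrying a non-idempotent
    operation can cause duplicate side-effects.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts:
                raise  # transient-failure budget exhausted
            # Exponential backoff with jitter to avoid retry storms.
            delay = base_delay_s * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))
```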
Michael Nygard in his book Release It! talks about leveraging circuit breakers to deal with integration risks. Broadening that idea a bit, we could combine circuit breakers and mediation into a more generic Integration Proxy component. This proxy could implement a number of common concerns when working with external APIs:
- Capture response time and route metrics to an analytics agent asynchronously
- Monitor stale connections and automatically reset them if possible
- Host the circuit breaker with associated logic to toggle based on service health
- Provide “fallback” responses if the circuit breaker kicks in to disable the integration point.
- Host sleep / retry invocation logic using parameters like interval and max attempts
- Automatically flush pending / buffered messages when the service is available again.
- Enable request and response capture – especially for debugging production issues.
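The circuit breaker plus fallback portion of such a proxy could look roughly like the sketch below. The threshold, timeout, and fallback mechanics here are illustrative assumptions, not the design from Release It!:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch; thresholds and states are
    illustrative, not a production design."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        # While the circuit is open, short-circuit to the fallback
        # until the reset timeout elapses (then allow one trial call).
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback()
            self.opened_at = None  # half-open: try the real call once
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0  # success resets the failure count
        return result

breaker = CircuitBreaker(failure_threshold=2)

def flaky():
    raise IOError("service unavailable")

print(breaker.call(flaky, lambda: "cached response"))  # cached response
```

A real integration proxy would wrap this with the metrics capture, connection monitoring, and request/response logging listed above.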
June 7, 2016
You need to practice minimal design to be effective with systematic reuse. The design needs to continuously look for opportunities to align iteration goals with your systematic reuse roadmap. Too many developers mistakenly think that adopting agile means abandoning design. This couldn’t be farther from the truth. You design whether you explicitly allocate time for it or not. Your code will reflect the design and you will impact the technical debt for your codebase in one way, shape, or form. Implementing user stories and paying down technical debt should be your end goal and not avoiding design altogether.
Always design for meeting your iteration goals. Avoid designing for several weeks or months and surely avoid putting technical elegance ahead of delivering real user needs. You should design minimally. Just enough to take advantage of existing reusable components, identify new ones, and plan refactoring to existing code. Specifically this means:
1. Keeping a list of short-term and medium-term business goals in mind when designing
2. Always looking for ways to make domain-relevant software assets more reusable
3. Staying aware of which distribution channels your business is looking to grow
4. Ensuring the design reflects the domain as closely as possible and that your reusable assets map to commonly occurring entities in your business domain
5. Placing value on identifying the product lines your business wants to invest in and evolving your reusable assets to mirror product line needs
6. Treating design not as a pursuit of perfection but as an iterative exercise in alignment with your domain
What you decide to encapsulate, abstract, and scale are all natural byproducts of this design approach. Rather than spending a lot of effort on a one-time design approach, do just enough design.
June 5, 2016
Production incidents are one of the best avenues to accelerate the maturity of a managed platform. Incidents are stressful while we are dealing with them, but they provide clear and direct feedback on gaps in the platform. First, don’t indulge in blame games and don’t waste time fretting that it has happened. Second, if you step back from the heat, incidents are an excellent means to learn more about your assumptions and risks.
- Did you assume that an external dependency will always be available? More specifically, did you assume that the dependency will respond within a certain threshold latency window?
- Was there manual effort involved in identifying the problem? If so, how much time did it take to get to the root cause? What was missing in your supportability tooling? Every manual task opens the door for additional risks, so examining them is key. Think about how to get to the root cause faster:
- Instrumentation about what was happening during the incident – were there pending transactions? Pending events to be processed? How “busy” was your process or service, and was that below or above expected thresholds?
- Was there a particular poison / rogue message that triggered a chain reaction of sorts? Did your platform get overwhelmed by too many requests within a certain time window?
- Did you get alerted? If so, was the alert about a symptom or did it provide any clues to the underlying root cause? Did it include enough diagnostic information for additional troubleshooting? Was there an opportunity to treat the issue as an intermittent failure – instead of alerting, could the platform have automatically healed itself?
- Was the issue caused by an ill-behaved component or external dependency? If so, has this happened before (routine) or is it new behavior?
- Think about defect prevention and proactive controls. There are a variety of strategies to achieve this: load shedding, deferring non-critical maintenance activities, monitoring trends for out of band behavior, and so on. Invest in automated controls that warn threshold breaches: availability of individual services within the platform, unusual peak / drop in requests, rogue clients that hog file system or other critical platform resources, etc.
The above isn’t an exhaustive list but the key message is to use the incident as an opportunity to improve the managed platform holistically. Don’t settle for a band-aid that will simply postpone a repeat incident!
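The automated threshold controls mentioned above can be sketched very simply: declare an expected band per metric and alert when a reading falls outside it. The metric names and bands below are hypothetical examples:

```python
# Sketch of an automated control that warns on threshold breaches.
# Metric names and expected bands are hypothetical examples.

THRESHOLDS = {
    "requests_per_min":         (100, 5000),   # (expected min, expected max)
    "pending_events":           (0, 1000),
    "service_availability_pct": (99.0, 100.0),
}

def check_thresholds(metrics):
    """Return alerts for any metric outside its expected band."""
    alerts = []
    for name, value in metrics.items():
        if name not in THRESHOLDS:
            continue  # no control configured for this metric
        low, high = THRESHOLDS[name]
        if not (low <= value <= high):
            alerts.append(f"{name}={value} outside expected band [{low}, {high}]")
    return alerts

print(check_thresholds({"requests_per_min": 9000, "pending_events": 50}))
```

Run periodically against live metrics, a check like this catches unusual peaks or drops before they turn into incidents.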
September 7, 2014
Systematic reuse initiatives don’t have to be big-bang events preceded by a lot of noise. They can be pursued quietly – project by project, with a resolute focus on getting targeted wins. As I’ve blogged before, the key is discipline – not technology. The most fundamental question to ask your teams – do they have the basics in place? Specifically:
- How do they go from requirements to design – is there a set of known patterns and frameworks that will anchor the design? If so, ensure every project is aware of these and when appropriate leverages them in the design.
- Are there trust issues with the software being produced by the sister teams? Before you dismiss this as a ‘soft’ issue – remember, your developers and development leads are human and need healthy social relationships at work before they let others influence them. Influence translates to systematic reuse – not occasional but project after project.
- What happens in cases where a project develops a lot of potentially reusable code – who knows about its existence outside the immediate development team? Who is going to be accountable for ensuring appropriate teams leverage this work? If you don’t know the answers, don’t be surprised that your software solutions are siloed. There is no getting around Conway’s Law.
- Do you send project updates and accomplishments like many development teams? Most of the time, the target audience is management and the intended message is to gloat about how successful the delivery was. Celebrate and reward your teams, but also take the time to reflect on two additional themes: did we best utilize the organization’s existing software assets – including prior requirements, component libraries, frameworks, services, and patterns? And did we contribute back to the organization’s repository of shared software assets? These aren’t tough questions to ask, but you will be surprised by the answers!
- What are the biggest roadblocks to sharing software assets? Don’t assume it’s communication or organization structure or code quality or learning curve or integration ease (or all of them!). Go to the scene of action – spend time pair programming with developers to empathize with their circumstances. Watch them struggle to get something to play well in their IDE, or plainly compile, or execute after a myriad of tweaks. Don’t assume – collect evidence and focus your improvement efforts on making their lives better.
December 26, 2011
One common criticism against systematic software reuse is the myth that it implies perfection – creating a reusable asset automatically conjures up visions of a perfect design, something that is done once and done right. Many developers and managers confuse reusability with design purity. However, reusability is a quality attribute like maintainability, scalability, or availability in a software solution. It isn’t necessary or advisable to pursue a generic design approach or what one believes is highly reusable without the right context.
The key is to go back to the basics of good design: identify what varies and encapsulate it.
The myth that you can somehow create a masterpiece that is infinitely reusable and should never be touched is just that – a myth, divorced from reality. Reusable doesn’t imply:
- that you invest a lot in big up front design effort
- you account for everything that will vary in the design – the critical factor is to understand the domain – well enough, deep enough, so you can identify the sub-set of variability that truly matters
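“Identify what varies and encapsulate it” can be illustrated with a small sketch. The pricing domain and policy names below are hypothetical; the point is that the varying part (the discount policy) sits behind a small interface while the core logic stays stable:

```python
# Sketch of "identify what varies and encapsulate it": the discount
# policy is the part that varies, so it is encapsulated behind a
# small interface. Domain and policy names are hypothetical.

from abc import ABC, abstractmethod

class DiscountPolicy(ABC):
    @abstractmethod
    def apply(self, amount: float) -> float: ...

class NoDiscount(DiscountPolicy):
    def apply(self, amount):
        return amount

class PercentageDiscount(DiscountPolicy):
    def __init__(self, pct):
        self.pct = pct
    def apply(self, amount):
        return amount * (1 - self.pct / 100)

def price_order(amount: float, policy: DiscountPolicy) -> float:
    # The stable core logic stays the same; only the policy varies.
    return round(policy.apply(amount), 2)

print(price_order(100.0, NoDiscount()))            # 100.0
print(price_order(100.0, PercentageDiscount(15)))  # 85.0
```

New discount variations are added by writing a new policy class, without touching `price_order` – reuse emerges from encapsulating the variability that the domain actually exhibits.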
In the same vein, reusability strives for separating concerns that should be kept distinct. Ask repeatedly:
- Are there multiple integration points accessing the core domain logic?
- Is there a requirement to support more than one client and if so, how will multiple clients use the same interface?
- What interfaces do your consumers need? Is there a need to support more than one?
- What are the common input parameters and what are those that vary across the consumer base?
These are the key questions that will lead the designer to anticipate the appropriate places where reuse is likely to happen. Finally, it is important that we don’t build for unknown future needs – the asset should solve a particular problem, solve it well, solve it for more than one or two consumers, and finally have the potential to be used beyond the original intent. At each step, design decisions are made and discarded, with continuous refactoring and refinements to the domain model – if not re-definition altogether.
Don’t set out trying to get to the end state or you will run the risk of adding needless complexity and significant schedule risk.
November 7, 2011
Many teams are pursuing BPM and SOA based initiatives to automate, streamline, and standardize business processes. As more teams embark on BPM-based solutions, there is a need for a common set of software components that aid in hosting and managing business processes. The following capabilities need to be present in such a solution:
- Common messaging architecture & utilities for facilitating the development and maintenance of stateful business processes & stateless services.
- Support business process orchestrations that join across multiple services (data services, business services, legacy services, etc.). This is essential for orchestrating complex, multi-step business processes.
- Handle workflow and system business process events via a configuration driven Event Handler Service, enabling reuse of event handler processes
- Provide ability to reuse sub-processes across larger business processes.
- Runtime metrics including reporting and the ability to perform diagnostic troubleshooting
- Reusable schemas for request dispatching, event handling, generic transport listeners, metrics, and error handling
- Support synchronous and asynchronous request/reply & fire/forget message exchange patterns
- Provide the ability to create reusable components for assembling new business processes
- Provide standard client interfaces across multiple transports such as HTTP and EMS
- Provide the ability to query various data sources and rules engines, as well as write custom Java code to integrate with existing functionality
- Provide an interface for executing administrative functions
- Provide developer tools for WSDL generation, unit testing, deployment, & viewing metrics
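The configuration-driven event handler mentioned above could be sketched roughly as follows. The event types and handler names are hypothetical, and a real BPM suite would supply its own registration and dispatch mechanism; the point is that routing lives in configuration so handler processes can be reused:

```python
# Rough sketch of a configuration-driven event handler service.
# Event types and handler names are hypothetical examples.

# Configuration maps event types to handler names, so new routings
# can be added without changing the dispatcher code.
EVENT_CONFIG = {
    "order.created": "start_fulfillment_process",
    "order.failed":  "notify_operations",
}

HANDLERS = {}

def handler(name):
    """Register a reusable handler process under a configured name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("start_fulfillment_process")
def start_fulfillment(event):
    return f"fulfillment started for {event['id']}"

@handler("notify_operations")
def notify_ops(event):
    return f"ops notified about {event['id']}"

def dispatch(event):
    """Route an event to its configured handler process."""
    name = EVENT_CONFIG.get(event["type"])
    if name is None:
        return "no handler configured"
    return HANDLERS[name](event)

print(dispatch({"type": "order.created", "id": "42"}))
```

Because the mapping is data rather than code, the same handler process can serve multiple event types across larger business processes.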