Engineering for Resiliency

Money transfers, fare payments, fleet movements: Ximedes plays a central role in realizing core processes where resiliency is critical. Achieving system resiliency, that is, ensuring adequate quality of service while being fault tolerant, quickly becomes tricky due to the inherent complexity of such cyber-physical systems. With critical business processes operating across disparate platforms and devices, how would one go about engineering resiliency to emerge "naturally", similar to herd immunity and other emergent phenomena?

TL;DR - Extending system test scenarios with simulation basics (a common clock and reproducible randomness) helps to rigorously analyse the complex contexts in which undesired system behaviors emerge, and to prevent them.

Hold the Chimps

Although important, "standard" functional testing may fall short in verifying a system's resiliency to system-level risks, particularly risks regarding the physical portion of the cyber-physical systems mentioned earlier. To maintain sanity and prevent regression, business logic typically undergoes verification & validation continuously, and perhaps also the occasional (costly) accreditation during development.

A myriad of software testing methods will help to achieve satisfactory levels of code coverage and other quality assurance metrics across many use cases beyond merely the happy flow. Unfortunately, consistently reproducing the (non-functional and/or physical) disturbances and disruptions from which the system may need to recover autonomously could take testers a bit more work.

Companies like Netflix, for instance, apply chaos engineering, unleashing a full Simian Army to wreak havoc on production environments and thereby help improve system resiliency. Once a system is released, remaining undesired behaviors may be mitigated or even resolved during quality control and after-sales service. Presumably, however, one is better safe than sorry.

At Ximedes, we understand the challenges involved in handling critical events in large-scale platforms. Examples include our omni-channel PSP services for merchants (OmniKassa), controlling a fleet of public transport vehicles (GIVA), and seamless payments for vending machines, fares, unattended shops, etc. Because they support the primary business process, these mission-critical systems require not only secure coding throughout, but also extended testing practices to validate their robustness.

Determinism

To analyse system behaviors with rigor, we would like to reproduce the exact steps that culminated in notable (undesired) emergent phenomena in the system under test. Engineering deterministic system behavior is, however, not straightforward: concurrent events occurring across cloud-based and mobile devices are preferably handled asynchronously and reactively, without synchronizing behaviors (across threads, actors, nodes, etc.). This helps to prevent deadlocks, but it also means the order in which events are handled may differ from one run to the next.

In cases where synchronization across disparate system components is critical, coordination languages help to re-introduce determinism by making timing assumptions explicit, along with their associated (fault) behaviors. Lingua Franca, for instance, extends the C and TypeScript programming languages (others to follow) with constructs for delays and deadlines, applying universal coordination principles based on simulation formalisms like DEVS and standards like HLA.

Even without time coordination, just using events or an event-driven architecture (EDA) for the business logic can prove helpful, particularly when history and scale are essential. By guaranteeing that all changes to domain-specific objects are initiated by event objects, the system not only

provides an event log suitable for auditing and debugging the intent or reason of state changes, but also
allows temporal querying of the system's state across multiple, altered timelines, and even
enables system recovery via event replay, possibly starting from recent snapshots.

The event sourcing approach for instance achieves this by organizing all domain event handling into aggregates and views. Since retrofitting such coordination approaches into your business logic is often cumbersome, difficult, or simply impossible, one had best apply these approaches early on.
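
As an illustration of the three benefits listed above, here is a minimal Python sketch. It is not a full event sourcing framework: the Event class, the replay function and the account-balance domain are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical domain event: an amount credited (+) or debited (-) at some time."""
    timestamp: float
    amount: int


def replay(events: Iterable[Event], until: float = float("inf"), snapshot: int = 0) -> int:
    """Rebuild a balance by applying all events up to `until` on top of a snapshot."""
    balance = snapshot
    for event in events:
        if event.timestamp <= until:
            balance += event.amount
    return balance


# The append-only event log doubles as an audit trail of every state change.
log: List[Event] = [Event(1.0, 100), Event(2.0, -30), Event(3.0, 50)]

current = replay(log)                                # recovery by full replay
as_of_2 = replay(log, until=2.0)                     # temporal query: state at t = 2.0
what_if = replay(log[:1] + log[2:])                  # altered timeline: debit never happened
from_snapshot = replay(log[2:], snapshot=as_of_2)    # recovery from a recent snapshot
```

Because state is only ever derived from the log, replaying the same events always yields the same state, which is exactly the kind of determinism we are after.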

There is however a good chance your system is already mature and shows nondeterministic behavior, for instance varying execution and communication performance due to mobile or cloud deployment with asynchronous event handling. Fortunately, you can still benefit from determinism: in the system's simulated test environment!

Modeling discretely

Model-driven engineering (MDE) processes typically involve delivering a set of system test scenarios. By applying some simulation techniques, perhaps these scenarios could serve beyond merely verifying the integrity of the business logic, and also help to stress-test its resiliency. But which techniques or type of simulation should we apply?

Continuous simulation models consist of ordinary differential equations (ODEs), but lack stochastic transitions as well as discrete (integer) counts. Examples are stock-and-flow models such as the beer distribution game for demonstrating the bullwhip effect occurring in supply chains, and compartmental models that approximate susceptible-infectious-recovered (SIR) proportions of a population during an epidemic.
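
To make the compartmental example concrete, the sketch below integrates the classic SIR equations (dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I) with a simple forward Euler scheme. The function name, the parameter values and the step size are assumptions picked for illustration, not calibrated to any real epidemic.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt, steps):
    """Forward Euler integration of the SIR model on population proportions."""
    s, i, r = s0, i0, r0
    trajectory = [(0.0, s, i, r)]
    for k in range(1, steps + 1):
        new_infections = beta * s * i * dt   # flow from S to I during one step
        new_recoveries = gamma * i * dt      # flow from I to R during one step
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((k * dt, s, i, r))
    return trajectory


# Illustrative values only: 1% initially infectious, 100 time units in steps of 0.1.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, dt=0.1, steps=1000)
```

Note that the state consists of real-valued proportions and that nothing is random: the model can express neither "exactly 17 infected people" nor a chance event.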

Conversely, discrete-time simulations lack continuous time and are stuck in fixed-length cycles or ticks, as for example in turn-based games and cellular automata such as Conway's Game of Life and simpler ticker tape-like automata.
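
For comparison, a discrete-time model only knows "the next tick". The following sketch of Conway's Game of Life represents the board as a set of live cell coordinates; the tick function and the variable names are our own choices for this example.

```python
from collections import Counter
from typing import Set, Tuple

Cell = Tuple[int, int]


def tick(live: Set[Cell]) -> Set[Cell]:
    """Advance Conway's Game of Life by exactly one fixed-length step."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbours, survival on 2 or 3.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live)
    }


blinker = {(0, 1), (1, 1), (2, 1)}   # horizontal bar of three cells
after_one = tick(blinker)            # becomes a vertical bar
after_two = tick(after_one)          # and flips back again
```

Everything happens in lockstep: there is no way to express that one cell changes at t = 0.3 and another at t = 0.7.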

Alternatively, discrete-event simulation (DES) is generally a good choice, as this approach yields reproducible simulation models that have it all: continuous time, discrete counts and stochastic state transitions. The main ingredients for a reproducible or deterministic DES-type simulation scenario are:

a common clock or time reference which is advanced by a priority queue of time-ordered test events;
a common pseudo-random number generator for all probability distributions representing uncertainty in your test model.
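
These two ingredients fit in a few lines of Python. The sketch below is a minimal kernel under our own assumptions, not a production-grade simulator: the Simulation class, the run_scenario function, the 20% timeout probability and the exponential inter-arrival times are all hypothetical.

```python
import heapq
import random
from typing import Callable, List, Tuple


class Simulation:
    """Minimal discrete-event kernel: one clock, one event queue, one seeded RNG."""

    def __init__(self, seed: int) -> None:
        self.now = 0.0                    # the common clock
        self.rng = random.Random(seed)    # the common pseudo-random number generator
        self._queue: List[Tuple[float, int, Callable[[], None]]] = []
        self._sequence = 0                # tie-breaker: same-time events keep scheduling order

    def schedule(self, delay: float, action: Callable[[], None]) -> None:
        heapq.heappush(self._queue, (self.now + delay, self._sequence, action))
        self._sequence += 1

    def run(self, until: float) -> None:
        # The clock only advances by jumping to the time of the next queued event.
        while self._queue and self._queue[0][0] <= until:
            self.now, _, action = heapq.heappop(self._queue)
            action()


def run_scenario(seed: int) -> List[Tuple[float, str]]:
    """Hypothetical test scenario: requests arrive at random, some of them time out."""
    sim = Simulation(seed)
    trace: List[Tuple[float, str]] = []

    def request() -> None:
        trace.append((sim.now, "request"))
        if sim.rng.random() < 0.2:        # assumed 20% chance of a disruption
            trace.append((sim.now, "timeout"))
        sim.schedule(sim.rng.expovariate(1.0), request)   # next arrival

    sim.schedule(0.0, request)
    sim.run(until=20.0)
    return trace
```

Calling run_scenario twice with the same seed yields the exact same trace, timeouts included, so a disruption found once can be replayed at will. This only holds as long as every component under test reads time from the simulation clock and draws randomness from the shared generator rather than from the wall clock or the global random module.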

Stay tuned

In a follow-up we will explore how these simulation techniques can be integrated with little effort into existing system integration tests, to help engineer some reproducible chaos and fix undesired behaviors before deploying to production.

Want to know more about how Ximedes can help you evaluate the resiliency of your system? Contact us!
