
Resilience is by design by Jonas Bonér

Last week, Virtual JUG viewers were treated to a session about software resilience by one of the most accomplished speakers on the topic. Jonas Bonér, the founder and CTO of Typesafe, the original author of the Akka framework, and a Java Champion, presented his first vJUG session, titled “Resilience is by Design”. The main focus of the session was the inevitability of failures in large and complex software systems. The natural conclusion is that we, as software developers, have to embrace failure and understand how to design systems that bounce back from it rather than suffer from it.

You can find the full recording of the session below and watch it for yourself. In addition, in this post I’ll try to explain what I learned from the session and highlight its key points.

Resilience is by design

Almost any good technical session lays the groundwork before it dives into code or higher matters. This was an exceptional presentation, and it followed suit. Jonas started by explaining the differences between several failure-related terms, which I’ve summarized below.

Fault tolerance is the ability to absorb failures and the challenges they pose. It is the stamina of your code: how much can it take and still be somewhat working? Fault tolerance is an important concept. Whether you are designing a bridge, a fridge or a space station, you have to think about it. You cannot anticipate all the conditions in which your project will run. It will be bent, used in ways you couldn’t imagine, and it will get broken. Fault tolerance as an aspect of your project describes how much it can take before it stops moving completely. Merely surviving works for fridges, but a software project of even moderate complexity will suffer greatly if it can only crawl after something unexpected happens.

The next step on the fault tolerance scale is resilience. Resilience is a system’s ability to fully recover after an unexpected situation. So, instead of limping forward after a failure, the system should get back into its original shape and keep moving at full speed. At the most basic level, think of a runtime exception in a web application. Yes, something went south and the server couldn’t serve some requests correctly, but it’s not the end of the world, is it? After the unfortunate incident has passed, everything is shiny again.

There’s a further level of failure handling as well. Some systems, perhaps not software systems, are antifragile, meaning they get better when things go wrong. Just as Littlefinger thrives on chaos in the Game of Thrones universe, antifragile systems need to be shaken up to improve. The term antifragility was coined by Nassim Taleb, and if you want to learn more about such systems, definitely read his book Antifragile: Things That Gain From Disorder.

Now, there’s little chance for a typical system to gain anything from failures, so antifragility is at best a possible future, but we can certainly talk about resilient systems now. To do that, we need to better understand software systems and their evolution. Since software is really complex, it’s hard to model it correctly and anticipate the impact any tiny change can inflict upon it.

Note that there’s a difference between complex and complicated. Complicated systems consist of many small yet different parts. It might be hard, but you can comprehend such a system fully. And while a change in a single component might be impactful, its effect is localized quite well.

On the other hand, a complex system is built from similar parts, but the interactions between these parts are so numerous that they make the system hard to understand. The example Jonas uses is Conway’s Game of Life, which has very simple rules, yet the systems it creates have interactions so intricate that it is incredibly hard to predict the behavior of these small, well-defined, but connected elements.
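To show how little code those simple rules need, here is a minimal sketch of one Game of Life generation. It is my own illustration rather than code from the session. The class name `LifeSketch` is hypothetical, and the sketch assumes a small bounded grid where everything outside the grid counts as dead.

```java
// Hypothetical illustration: one generation of Conway's Game of Life.
// Assumption: the grid is bounded and cells outside it are treated as dead.
final class LifeSketch {

    // Counts the live cells among the (up to) eight neighbours of (r, c).
    static int liveNeighbours(boolean[][] grid, int r, int c) {
        int count = 0;
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                if (dr == 0 && dc == 0) continue; // skip the cell itself
                int nr = r + dr;
                int nc = c + dc;
                boolean inside = nr >= 0 && nr < grid.length
                        && nc >= 0 && nc < grid[nr].length;
                if (inside && grid[nr][nc]) count++;
            }
        }
        return count;
    }

    // Applies the rules to every cell and returns a new grid.
    static boolean[][] step(boolean[][] grid) {
        boolean[][] next = new boolean[grid.length][];
        for (int r = 0; r < grid.length; r++) {
            next[r] = new boolean[grid[r].length];
            for (int c = 0; c < grid[r].length; c++) {
                int n = liveNeighbours(grid, r, c);
                // A live cell survives with 2 or 3 neighbours;
                // a dead cell comes alive with exactly 3.
                next[r][c] = grid[r][c] ? (n == 2 || n == 3) : (n == 3);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        boolean[][] grid = new boolean[3][3];
        grid[1][0] = grid[1][1] = grid[1][2] = true; // a horizontal bar of three cells
        for (boolean[] row : step(grid)) {
            StringBuilder line = new StringBuilder();
            for (boolean cell : row) line.append(cell ? '#' : '.');
            System.out.println(line);
        }
    }
}
```

The whole rule set fits into one line of `step`, yet the patterns that emerge on larger grids are very hard to predict by reading the code alone.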

Jay Forrester said that complex systems are counterintuitive. If people reason about them intuitively, they usually get it very wrong. So in order to reason about software systems successfully, we need to introduce abstractions and models to reduce the complexity of the systems.

Operating on the edge of failure

In the life of any software project there are three boundaries, and crossing any one of them drives the project to failure:

  • Economic boundary – cross that and you’re out of business: your project is no longer sustainable, it doesn’t bring in any profit, and management will at some point cut it.
  • Unacceptable workload boundary – you can operate at a certain workload, but if your team takes on more than it can handle, your project will fail.
  • Accident boundary – if you cross it, your software fails. Nobody dies, hopefully, but you fail and your software cannot deliver what it has promised to the clients.

The first two are not exactly the concerns of developers, so Jonas didn’t focus on them. The third, however, is entirely in the hands of developers. We deliver the functionality, and it is our responsibility to make sure that it keeps working as the software evolves and has to scale with a growing client base.

A curious thing happens when you scale any project to a certain size. Events of very low probability start happening all the time. What do you think is the probability of a server catching fire? Quite low? What about when you have thousands of servers? Almost certainly one of them is on fire at any given point in time. This is an extreme example, of course, but it illustrates the concept well. Failures will happen and we will have to learn how to deal with them.
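The arithmetic behind this intuition is short enough to sketch. If each server fails independently with probability p, the chance that at least one of n servers fails is 1 - (1 - p)^n. The independence assumption, the class name `FailureOdds` and the per-server probability used in `main` are my own illustrative choices, not figures from the session.

```java
// Hypothetical illustration of how rare events become routine at scale.
// Assumption: servers fail independently of each other.
final class FailureOdds {

    // Probability that at least one of n servers fails,
    // given a per-server failure probability p.
    static double atLeastOneFailure(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double p = 0.001; // made-up per-server probability, for illustration only
        for (int n : new int[] {1, 100, 5000}) {
            System.out.printf("%d servers: %.4f%n", n, atLeastOneFailure(p, n));
        }
    }
}
```

With these made-up numbers, a one-in-a-thousand event becomes close to a certainty once the fleet reaches a few thousand machines.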

The main takeaway at this point is that we have to embrace failure by designing our systems for it. Since we’ve accepted that failure is inevitable, we have to anticipate it. Otherwise it will bring the systems we build to their knees and quite possibly make it much harder to stay within the boundaries needed to stay in business. Resilience is a concept we need to design our projects around.

The good thing is that it’s not a new concept. Many other fields have had resilience built in for ages, and so have the evolution of the human species and our society.


Jonas explained how well-established social and biological systems make themselves resilient. They feature both diversity and redundancy. They have different components of different utility value and they diversify the roles within the systems. They have the capacity to adapt and self-organize, so their performance degrades for some time but then bounces back, and they become stronger after a while.

Now in software systems we don’t do that. And one of the reasons why this session was so insightful is that Jonas offered a way to actually achieve it: a way to design a system with diverse and self-organizing components in mind.

The key concept here is “crash-only” software. It doesn’t degrade when it starts failing, and it doesn’t try to handle all errors. Besides fulfilling its business logic, it knows how to do two things:

  • stop – crash safely and fast,
  • start – recover fast and get working again.

These two things are enough to ensure that your code will bounce back from whatever happened to it and quickly return to its original form.
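As a rough illustration of the idea, here is a minimal sketch of a crash-only component in plain Java. The class `CrashOnlyCache` and its methods are hypothetical names of mine and not part of Akka or of the code shown in the session. The component keeps only re-computable state, so crashing and starting again costs nothing but a cold cache.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical crash-only component: apart from its business logic
// it only knows how to stop (crash fast) and start (recover fast).
final class CrashOnlyCache {
    private Map<String, Integer> cache; // re-computable state, discarded on every crash
    private boolean running;

    // start: recover quickly from a blank slate.
    void start() {
        cache = new HashMap<>();
        running = true;
    }

    // stop: crash fast and throw the state away instead of trying to repair it.
    void stop() {
        cache = null;
        running = false;
    }

    boolean isRunning() {
        return running;
    }

    // Business logic: returns the length of a word, caching the result.
    int lengthOf(String word) {
        if (!running) {
            throw new IllegalStateException("component is down");
        }
        if (word == null) {
            // No attempt to handle the unexpected input: crash and let the caller decide.
            stop();
            throw new IllegalArgumentException("unexpected input, component crashed");
        }
        return cache.computeIfAbsent(word, String::length);
    }

    public static void main(String[] args) {
        CrashOnlyCache component = new CrashOnlyCache();
        component.start();
        System.out.println(component.lengthOf("akka"));
        try {
            component.lengthOf(null);
        } catch (IllegalArgumentException e) {
            System.out.println("crashed, running = " + component.isRunning());
        }
        component.start(); // recovery is just another start
        System.out.println(component.lengthOf("akka"));
    }
}
```

Note that recovery is the same code path as the very first start, which is what keeps it fast and well exercised.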

Now if you apply this pattern recursively to all the components in your system, you’ll be able to gracefully tolerate failures at all levels of the system. Naturally, this means that the components in your code are allowed to say no and refuse to serve a request while they recover. The caller should then handle such events according to its own capabilities. It might happen that the caller cannot handle the failure itself and has to fail in turn. However, this way you can bound any failure and limit its effect on the system as a whole.
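The recursive part can be sketched as a parent that restarts a failing child a bounded number of times and fails itself once that budget is spent. The names `FlakyWorker` and `SupervisorSketch` are hypothetical, and retrying the same request after a restart is a simplification of mine. It is not how Akka's supervision API works.

```java
// Hypothetical child component that crashes a fixed number of times before it works.
final class FlakyWorker {
    private int failuresLeft;
    int restarts = 0; // how many times the parent has restarted this worker

    FlakyWorker(int failuresLeft) {
        this.failuresLeft = failuresLeft;
    }

    void restart() {
        restarts++;
    }

    String handle(String request) {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new IllegalStateException("worker crashed");
        }
        return "ok:" + request;
    }
}

// Hypothetical parent: contains the child's failures, and escalates
// by failing itself when restarting does not help.
final class SupervisorSketch {
    private final FlakyWorker worker;
    private final int maxRestarts;

    SupervisorSketch(FlakyWorker worker, int maxRestarts) {
        this.worker = worker;
        this.maxRestarts = maxRestarts;
    }

    String handle(String request) {
        for (int attempt = 0; ; attempt++) {
            try {
                return worker.handle(request);
            } catch (RuntimeException failure) {
                if (attempt >= maxRestarts) {
                    // Cannot deal with it at this level: fail in turn, one level up.
                    throw new IllegalStateException("escalating: worker keeps failing", failure);
                }
                worker.restart();
            }
        }
    }

    public static void main(String[] args) {
        FlakyWorker worker = new FlakyWorker(2);
        SupervisorSketch supervisor = new SupervisorSketch(worker, 3);
        System.out.println(supervisor.handle("ping") + " after " + worker.restarts + " restarts");
    }
}
```

Because a supervisor can itself be supervised, the same shape repeats at every level, and a failure only travels upward as far as it has to.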

There’s one catch to having the components of your system routinely go down and restart themselves. While the code is running, it accumulates runtime state, and if you restart your system or any part of it, some state will inevitably be lost. Luckily, most of the state in a system is either static or re-computable at some cost. The caches can be filled again, the database preserves the state well, and so forth. But some of the state, dynamic and not re-computable, might be lost. This suggests that resilient systems should be mostly stateless to accommodate this possibility.

To deal with failures successfully, systems have to treat failures as first-class situations and not as deviations from the success path. Failures have to be contained, reified (as messages), signalled (asynchronously), observed and managed.
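One way to picture "reified as messages and signalled asynchronously" in plain Java is a task whose exception never escapes its own thread and instead lands in a mailbox as an ordinary value. This is a sketch under my own assumptions: the names `FailureSignals`, `Failed`, `runContained` and `awaitFailure` are hypothetical, and a `BlockingQueue` stands in for a real actor mailbox.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: failures become messages instead of propagating exceptions.
final class FailureSignals {

    // A failure reified as a plain value that can be queued, logged or inspected.
    static final class Failed {
        final String component;
        final Throwable cause;

        Failed(String component, Throwable cause) {
            this.component = component;
            this.cause = cause;
        }
    }

    // Runs the task on its own thread. An exception is contained there and
    // signalled asynchronously by putting a Failed message into the mailbox.
    static Thread runContained(String name, Runnable task, BlockingQueue<Failed> mailbox) {
        Thread thread = new Thread(() -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                mailbox.add(new Failed(name, e));
            }
        }, name);
        thread.start();
        return thread;
    }

    // The observer side: waits for a failure message, or returns null on timeout.
    static Failed awaitFailure(BlockingQueue<Failed> mailbox, long timeoutMillis) {
        try {
            return mailbox.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Failed> mailbox = new LinkedBlockingQueue<>();
        runContained("parser", () -> { throw new IllegalStateException("boom"); }, mailbox);
        Failed failure = awaitFailure(mailbox, 2000);
        if (failure != null) {
            System.out.println(failure.component + " failed: " + failure.cause.getMessage());
        }
    }
}
```

The observer decides what to do with the message, whether that is restarting the component, ignoring the failure or escalating it, which is the "managed" part of the list above.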

At this point in the session, Jonas showed us how Akka, the framework for building highly concurrent, distributed, and resilient message-driven applications, embraces this philosophy and allows you to build systems with these guidelines in mind. Isolating components, passing messages between them, and building a structured supervision solution that keeps any failure within the boundaries of the subsystem that created it are core concepts built into Akka from the beginning.

If you want to learn more about the actor model for parallel computations, see code examples of how to use Akka to keep failures at bay, or understand more deeply how the concepts we talked about can be implemented in code, watch the full session. It’s worth it.

Interview

After the session, I had a chance to ask Jonas a couple of questions in our regular RebelLabs interview with the Virtual JUG speakers. We talked about which components form the core of the Akka framework and how developers can better understand what is happening when they use it. We also discussed whether we can take software further than being resilient and create antifragile systems that not only bounce back from failures, but thrive on chaos and become better, smarter and more stable after them. Oh, and if you want to know what Jonas knows about the upcoming Typesafe rename, definitely check out the interview below!

Now you’ve watched the interview. Well done! If you have any more questions for Jonas, ping him on Twitter at @jboner and ask there! Or tell him and us (@ZeroTurnaround) what you think.
