Distributed Systems Fundamentals Part 2: You Can Say That Again!
At TheoremOne, we approach problems by reasoning about them from first principles. To truly understand a problem and be confident about your solution, you must have a working understanding of the fundamental constructs and motivating logic behind what you’re doing. In this spirit, we’re releasing a a blog series about the fundamentals of distributed systems. This post is the second of that series.
Whether we know it or not, every solution we build is a distributed system, and sound analysis of the distributed nature of a solution is critical to its reliability. Reading this blog series and thinking critically about how it applies to your work will put you well on your way to internalizing these concepts and leveraging them to build rock-solid solutions that will stand the test of time.
I’m Only Going to Say This Once
As we covered in the previous post, the possibility of message loss is an ever-pervasive concern in all distributed systems. Losing messages usually isn’t a very desirable outcome, so we also introduced strategies for preventing message loss by repeating messages, which adds the possibility of message duplication. This tradeoff is often referred to as the design choice between “at-most-once” and “at-least-once” messaging, respectively.
To be clear, what we’d love to have is an easy solution for “exactly-once” messaging. Both “at-most-once” and “at-least-once” strategies have their unfortunate consequences to deal with, and in an ideal world we’d prefer to “have our cake and eat it too”. We don’t want to go unheard, and we don’t want to repeat ourselves - we only want to say it once.
But on its face, exactly-once messaging seems to be a fool’s errand! In fact, many people claim that exactly-once messaging is impossible, including the author of this blog post who goes as far as to say: “If it claims exactly-once, it’s because they are lying to your face in hopes that you will buy it or they themselves do not understand distributed systems”. These arguments are usually well-reasoned, and they are not wrong.
However, on the other side we continue to have respectable vendors of well-designed software like Kafka Streams touting their exactly-once semantics, and drawing varying degrees of controversy in the ensuing reaction. These vendors (or at least this one vendor in this one case) are also well-reasoned, and they are also not wrong.
So, we have two sides of a disagreement in which both sides are not wrong. What’s going on here? Well, as you may have guessed if you’ve seen this pattern of human interaction before, it comes down to not agreeing on what these terms mean. That’s right, this argument about “exactly-once semantics” is about semantics.
You Can Say That Again!
Let’s rewind for a moment and consider our motivations here from first principles. Is exactly-once messaging really what we want? Do we really care how many times we send or receive a message?
Arguably, the message itself isn’t what we care about at all: we care about the resulting action. Going further, we don’t even really care about the action either: we care about the effect that it has on the world. It matters not to us how many times the message is received, or even how many times the receiver of the message acts on the message.
In other words, if we ignore the quixotic pursuit of exactly-once messaging, we can focus on building a system which guarantees exactly-once effects, which turns out to be what we actually wanted in the first place. You may have to repeat yourself, but that’s okay: you can say that again because repeated receptions of the message will not cause any further or deviating effects.
Semantics and problem-solving from first principles
This scenario is a striking example that demonstrates why it's so important to TheoremOne that we approach problem-solving from first principles. If you chose to consider only the definition of the problem that we started with (i.e. "exactly-once messaging"), any attempts to solve the problem would have been destined for failure. Challenging the problem definition and reaching to understand the true motivation hiding behind it opens up the possibility of redefining the problem in a new way, with a fruitful path to a solution.
While it was said earlier that this disagreement at play is about semantics, that doesn't mean it is a trivial or unimportant point (as people often imply when they say "it's just semantics"). On the contrary, the words we use to define our problem statement and what they mean to us in that context are critical to defining a tractable paradigm in which the problem can be solved. Matters of semantics are far from trivial: they are central to how we think.
The All-Important Idempotence
This system property we’ve been discussing has a name: it’s called idempotence. It literally means “(the quality of having) the same power”. In essence, we’re talking about acting in such a way that the effect won’t happen more than once, even if the action happens more than once.
For example, if I ask you to delete all records from your database matching my username, I’ll get the effect I wanted as long as you do what I asked at least once. If you do it once, then you’ll find all the records that existed and delete them, as I wanted. If you then do it again, you won’t find any records to delete and the repeated action will thus have no further effect.
To reiterate, designing your system to only have effects that are idempotent is a safeguard against problems implied by the possibility of message duplication, and message duplication is a safeguard against problems implied by the possibility of message loss. Taken together, this design pattern is one of the most powerful tools for designing robust distributed systems.
Notice that idempotent effects are often declarations about what the final state should look like, whereas non-idempotent effects are often descriptions of a change that should be made. This is not always so clear-cut, but it’s a good correlation to remember - declarative operations tend also to be idempotent.
Idempotent State Stores
To set the stage for further discussion, let’s consider for a moment what an “effect” is. Any meaningful effect is something that changes the world - in technical jargon, we’d say that it mutates a state store. That state mutation could be something obvious, like inserting into a database or writing a file to a file system. It could also be something a little less obvious, like sending an email (mutating the state of someone’s inbox), or displaying information to the user (mutating the state of someone’s brain). Take note that these actions are not meaningful in and of themselves but only in the effects that they cause on a state store.
Because an effect implies a change in state, then we can reason that if we want to have idempotent effects, we must have a state abstraction that supports idempotent change operations. This statement may seem trivial, but it has some rather far-reaching implications. When selecting a state abstraction, you should keep this in mind if exactly-once effects are required in your application. Similarly, when designing a state abstraction, you should respect this need in consumers of the abstraction and aim to support idempotent operations when possible.
For example, if you are designing an HTTP API for managing resources, you should consider supporting an idempotent
PUT verb for creating/updating a specified resource (e.g.
PUT /products/<code>) instead of or in addition to using a
POST verb to create an unspecified resource into a collection (e.g.
POST /products). There are also other ways of supporting an idempotent creation operation, but the important part is that the API consumer has a way of uniquely identifying a resource that hasn’t necessarily been created yet (which usually implies that the resource has an identifier that is chosen by the API consumer). Again, this can have some rather far-reaching implications in the design of an abstraction, so it is worthy of some forethought and planning.
So what happens when we have to use a state abstraction that doesn’t support inherently idempotent operations? Well, we can do our best to try to simulate them, but in practice, there are often flaws in our approach that end up putting holes in the guarantee we’re trying to construct (that is, edge cases in which the guarantee doesn’t hold, making it not a true guarantee). Let’s take a look at a few examples of this.
As seen above, the approach of checking before making the change is vulnerable to a race condition concurrency bug (unless the check operation and the change operation can be grouped together as an uninterruptible/atomic operation), which in turn leads to a violation of the idempotence guarantee. We could hack around this by preventing concurrency with a lock, but as shown in this optional supplementary diagram the approach of passing a distributed lock around is also susceptible to message loss, which just kicks the can of the problem down the road. So try another strategy to fix it:
This works, but notice that this approach requires the wrapping abstraction that tracks the messages to be capable of storing its own state in an idempotent way. That is, because the original state store didn’t support idempotent operations, the caller was forced to become an idempotent state store.
There’s also a number of other unfortunate limitations arising from this approach: now that the wrapping abstraction is stateful it becomes non-trivial to scale. We’ll cover the challenges of distributed state more fully in a future blog post, but for now just recognize that it’s a very challenging engineering problem in its own right, so if it’s not part of your core product you may want to consider staying out of the distributed state business and using a vendored solution.
This is actually where Kafka comes back into the picture. Kafka is a well-engineered distributed message log and the Kafka Streams framework that we mentioned earlier as touting their exactly-once semantics is built to leverage this to create a holistic system in which your application can have only idempotent effects because it can interact with Kafka as an abstraction for all of them.
The actual details of how the Kafka Streams guarantees are implemented are outside the scope of this article, but the general principle is the same as what we’ve summarized here.
In this post, we’ve discussed message delivery guarantees and the controversy surrounding the utopian ideal of exactly-once message delivery. However, we reasoned that it isn’t the message delivery we really care about, but rather the messages’ effects. We introduced the concept of idempotence, which fills the need of preventing unwanted duplicate effects, and we emphasized the importance of support for idempotence in all stateful abstractions.
Stay tuned for the next post, and in the meantime, carry these ideas back to your day-to-day work to consider how you might use them to model the behavior of whatever you’re building this week.