Turning actors inside-out

Or... do we really need actor systems?

The actor model, and its concrete incarnation in platforms like Erlang/OTP or Akka, has been presented as a great way to create scalable and resilient systems.

However, there is a dimension in which actor systems fall short in my experience. They don't scale in terms of complexity. Each time I've had to look at applications built with actors I found the overall behavior of the system pretty hard to reason about. Why?

The original sin

The reason is quite simple. As Paul Chiusano describes in "Actors are not a Good Concurrency Model", the fundamental signature of an actor is A => Unit. You send it a message and... something happens. This presents two major drawbacks:

  • You need to look inside the implementation of an actor to have the faintest idea of what it is effectively doing

  • Since its behavior is a side-effect, any refactoring can have unintended consequences

Despite these issues, proponents of the Actor model still claim that their approach is a great way to build concurrent and reliable systems. This is not an entirely baseless claim since a company like WhatsApp, using Erlang, was supporting 900 million users with 50 engineers in 2015! Surely I must be very wrong. Or... is it a question of context?

In this blog post, I want to examine:

  • What are the properties that make actor systems attractive?

  • Are there ways to keep those desirable properties while still making the system as a whole a lot more amenable to navigating and reasoning?

  • What kind of systems are well-suited to the Actor model?

What is so good then?

For this post, I am going to ignore the use of actors for distributed systems and focus on actors used in local, concurrent systems.

There are 4 main advantages to actors:

  1. The sequential access to in-memory data to avoid data races

  2. An inherently asynchronous programming model

  3. The modeling of actors as state machines mapping 1-1 to some conceptual entity

  4. The use of supervisors for error management

Taming concurrency with actors

One major difficulty with concurrent programming is the necessity to keep our data consistent. If I keep a list of bank accounts and want to model a transfer of money between 2 accounts, I'd better make sure that the transfer happens atomically. It should not be possible to observe that money has left Alice's account without being credited to Bob's account. This kind of problem is compounded when multi-threading is involved and when the compiler can freely re-arrange your code.

In that situation, an object like a Bank does not have a concurrency-safe interface:

trait Bank {
  fn withdraw(&mut self, amount: Amount, from: Account)
     -> Result<(), WithdrawError>;
  fn credit(&mut self, amount: Amount, to: Account);
  fn balance(&self, account: Account) -> Result<Amount, BalanceError>;
}

Even if you combine withdraw and credit into one transfer function, a naive implementation might still not be thread-safe:

trait Bank {
  fn transfer(&mut self, amount: Amount, from: Account, to: Account)
     -> Result<(), TransferError>;
  fn balance(&self, account: Account) -> Result<Amount, BalanceError>;
}

The solution that actors propose is to "lift" the methods into data, as messages:

enum Message {
  Transfer {
    amount: Amount,
    from: Account,
    to: Account
  },
  Balance(Account),
}

trait Bank: Actor {
  fn handle_message(&mut self, message: Message);
}

Those messages can be sent to the message queue of an actor and become effectively serialized. The actor system of your choice then makes sure that the execution of each actor is single-threaded: problem solved!
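This is the shape of the idea, not how Erlang or Akka actually implement it. As a minimal sketch (with made-up names like `spawn_bank`), a channel plus a single thread draining it already gives us an "actor": messages are reified as data and processed strictly sequentially, so the state never sees a data race.

```rust
use std::sync::mpsc;
use std::thread;

// Messages are plain data; the actor's state is only ever
// touched by the single thread draining the channel.
enum Message {
    Transfer { amount: i64, from: usize, to: usize },
    Balance { account: usize, reply: mpsc::Sender<i64> },
}

fn spawn_bank(initial: Vec<i64>) -> mpsc::Sender<Message> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut accounts = initial;
        for msg in rx {
            match msg {
                Message::Transfer { amount, from, to } => {
                    accounts[from] -= amount;
                    accounts[to] += amount;
                }
                Message::Balance { account, reply } => {
                    let _ = reply.send(accounts[account]);
                }
            }
        }
    });
    tx
}

fn main() {
    let bank = spawn_bank(vec![100, 0]);
    bank.send(Message::Transfer { amount: 40, from: 0, to: 1 }).unwrap();

    let (reply_tx, reply_rx) = mpsc::channel();
    bank.send(Message::Balance { account: 1, reply: reply_tx }).unwrap();
    // Messages are processed strictly in order, so the balance
    // reflects the transfer.
    println!("{}", reply_rx.recv().unwrap()); // prints 40
}
```

Note how even this tiny sketch already exhibits the drawback discussed next: to get an answer back, the caller has to thread a reply channel through the message.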

Problem solved, at a massive cost

All the problems in computer science can be solved by another level of indirection.

— David Wheeler, quoted in Beautiful Code: https://www2.dmst.aueb.gr/dds/pubs/inbook/beautiful_code/html/Spi07g.html

We solved this local, tactical, problem at the expense of a massive indirection. Now our program turns from:

if let Ok(_) = bank.transfer(Amount(100), alice, bob) {
  sms_system.send_message(bob, "transfer done!")
}

to:

bank.send_message(Transfer::new(Amount(100), alice, bob));

This has many implications in terms of practical concerns:

  • We can no longer click on the transfer call to jump to its implementation. Instead, we need to go through the entire pattern match inside handle_message to find where the Transfer case is handled. If there are several message variants, the actual code dealing with transfers is probably tucked away in an internal function somewhere else

  • How do we get back a result telling us that the transfer was successful? Now the Bank actor needs to get a reference to the sender and send a result back

In general, specifying what happens next is not obvious. In the first example, with functions, we notify the SMS system that Bob's account has been credited. With actors, we would probably start the Bank actor with a reference to the Sms actor and send a Credited message from inside the actor implementation.

With actors, we have to read the code to understand the flow of data in our system. When reading the code is not enough, we have to instrument the actors at runtime to get some insight into what's going on.

It's not a magic wand

Each actor has its own internal state, free of race conditions: great. Yet this does not solve all state-related issues. What if we try to model each Account as an actor? How can we implement the transfer scenario so that a transfer between two accounts remains atomic?

We have to come up with other patterns and pieces of infrastructure to solve these problems, for example a coordinating actor orchestrating the transfer across the two account actors.

Don't (a)wait for me!

A system like Akka became quite popular at a time when async programming was becoming an absolute requirement. A web server dealing with thousands of requests could not allow its threads to be idly waiting for responses from other services or a database. On the contrary, an actor system can spin off millions of actors quietly waiting for an IO request to complete:

  • The caller of the actor posts a message and doesn't have to block until that message is processed

  • It can be notified, via a message, when some data has arrived

This is a great recipe for scalability: a few cores handling millions of requests!

But this is not the only solution to that problem, or, let's say, not the only way to package it. The heart of the solution is to have some sort of "continuation": what should I do when some data finally arrives?

Nowadays there are many alternatives to actors for dealing with such continuations: futures and promises, async/await, coroutines, and reactive streams.

The major difference between these approaches and actors is that they don't expose a programming model made of A => Unit functions. With a Future, you program as if you already had the value you're waiting for, and you map it to describe what to do next. Now the flow of data between computations is obvious and computing resources are used wisely. We will come back to that later.
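To make the continuation idea concrete without pulling in an async runtime, here is a toy stand-in for a Future built on plain threads: a `JoinHandle<T>` is a value of type T that hasn't arrived yet, and a hypothetical `map` helper lets us describe what to do next. The names are illustrative, not a real futures API.

```rust
use std::thread::{self, JoinHandle};

// A toy stand-in for a Future: a computation running elsewhere
// whose result we can transform *before* it has arrived.
fn map<T, S>(handle: JoinHandle<T>, f: impl FnOnce(T) -> S + Send + 'static) -> JoinHandle<S>
where
    T: Send + 'static,
    S: Send + 'static,
{
    thread::spawn(move || f(handle.join().expect("task panicked")))
}

fn main() {
    // "Fetch" a value asynchronously...
    let quote: JoinHandle<i64> = thread::spawn(|| 42);
    // ...and describe what to do next: the flow of data is explicit
    // and fully typed.
    let doubled = map(quote, |q| q * 2);
    println!("{}", doubled.join().unwrap()); // prints 84
}
```

Contrast this with an actor: the result of the computation has a type, and the next step is attached right where the value is produced instead of being hidden inside some handle_message.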

Actors as top-models

According to Carl Hewitt, unlike previous models of computation, the actor model was inspired by physics, including general relativity and quantum mechanics.

https://en.wikipedia.org/wiki/Actor_model

Actors are a tool of choice to represent independent entities when they have their own internal state and interfaces. For example, it seems quite natural to model a game like Halo with an actor system (using Microsoft's Orleans platform).

Moreover, we can use state machines to represent:

  • The internal state of an actor as some state variables

  • The received messages as state events

  • The processing of messages as transitions in the state machine

This is wonderful because there are many tools to:

  • Represent state machines

  • Reason about them: are there unreachable states? Will we always arrive at an end state? Are we missing some events?

  • Test state machines: what is the minimal set of tests covering every state?

However, any time we have several independent systems collaborating, we also get a combinatorial explosion in complexity. Each local state might be well specified, but the overall combination of all states might exhibit some emergent behavior that we did not expect. Think about how the Game of Life generates amazing behaviors out of very simple cells.

Not every system is a physics simulation

The thing is, many of our systems do not need to be represented as small individual processes. And the fewer states we need to maintain, the better off we are.

The minimal amount of state is no state at all: a pure function. Look ma', no state:

fn compute_taxes(income: Amount) -> Amount

Yet it is not always practical, or even possible, to compute only with non-stateful functions. It would be quite repetitive to pass around all the state required for computations:

fn compute_taxes(
   income: Amount, 
   tax_rules: TaxRules,
   previous_tax_paid: Vec<Amount>) -> Amount

Worse, it is not even desirable. There are implementation details that we wish to hide, like the fact that the tax algorithm is parameterized by rules or that it uses the amount of taxes paid in the previous year to calculate this year's taxes.

In any case, even if we end up with a function having some side effects, by enclosing some state, we are still in a better position than when using actors. Because functions give us a flow of data. "Something in, something out". And the "something out" is ready to become "something in" for another function.
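A small sketch of that point, with hypothetical names (TaxCalculator, format_notice): a component can enclose its state and hide its implementation details while still exposing a plain "something in, something out" function whose output feeds the next function.

```rust
// Hypothetical names: the component hides its parameterization and
// history but keeps a function-shaped interface.
struct TaxCalculator {
    rate_percent: u64,           // hidden detail: the rules parameterizing the algorithm
    previous_tax_paid: Vec<u64>, // hidden detail: taxes paid in previous years
}

impl TaxCalculator {
    fn compute_taxes(&mut self, income: u64) -> u64 {
        let tax = income * self.rate_percent / 100;
        self.previous_tax_paid.push(tax);
        tax // the output is ready to become the input of another function
    }
}

fn format_notice(tax: u64) -> String {
    format!("You owe {tax}")
}

fn main() {
    let mut calc = TaxCalculator { rate_percent: 20, previous_tax_paid: vec![] };
    // The flow of data is visible: compute_taxes feeds format_notice.
    println!("{}", format_notice(calc.compute_taxes(1000))); // prints "You owe 200"
}
```

Even though compute_taxes has a side effect on its enclosed state, the data still flows visibly from one call to the next, which is exactly what an A => Unit message send gives up.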

It is not entirely by accident that an actor system like Akka ended up proposing a Stream abstraction:

We have found it tedious and error-prone to implement all the proper measures in order to achieve stable streaming between actors, since in addition to sending and receiving we also need to take care to not overflow any buffers or mailboxes in the process. Another pitfall is that Actor messages can be lost and must be retransmitted in that case. Failure to do so would lead to holes at the receiving side.

For me, this is thinking backward:

  1. "Let's start with a general system where we can do everything"

  2. "Let's constrain our system by adding another layer"

In some ways you can see a similar issue with dynamic vs. static typing in programming languages:

  1. "Haha, no pesky compiler to prevent me from treating this variable as an int here and a bool there. It just works. Why would I care?"

  2. "Oops my system is really hard to understand, give me types please":

    1. for Python: https://mypy-lang.org

    2. for JavaScript: https://www.typescriptlang.org

    3. for Ruby: https://sorbet.org

In the words of Rúnar Bjarnason:

Constraints liberate, liberties constrain

very relevant talk excerpt: https://youtu.be/GqmsQeSzMdw?t=1452

I'm tolerant. To a fault.

One often-cited advantage of actor systems is the idea that those systems can be made fault-tolerant. How does that work?

Turn it off and on again!

No really, that's it. If some state ends up being corrupted in an actor, a sensible strategy is to start afresh. And if several actors are cooperating to accomplish a task, it makes sense to restart all of them to ensure that their global state is consistent.

This is the role of Supervisors in a system like Erlang/OTP or Akka. Since actors can spawn other actors, an actor system defines a tree of actors in parent/child relationships. With each level, we can associate a supervisor. In case of the failure of an actor at that level, the supervisor can decide to:

  • Restart that actor

  • Restart all the actors on that level

  • Propagate the failure one level up

This way of dealing with faults is effective for two reasons:

  1. It addresses some common reasons for failures: incorrect state, maxed-out internal queue, or unreachable resources.

  2. It allows the definition of independent processes or sub-processes that can be restarted individually thus containing errors to a limited part of the system
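The restart policy above can be sketched in a few lines. This is only the spirit of a supervisor, nothing like the rich one_for_one/one_for_all strategies of Erlang/OTP or Akka: run a worker and, if it panics, start it afresh up to a maximum number of restarts, then escalate.

```rust
use std::thread;

// A minimal "restart" strategy: run the worker, restart it on panic,
// and propagate the failure one level up after too many attempts.
fn supervise<F>(max_restarts: usize, worker: F) -> Result<(), String>
where
    F: Fn() + Clone + Send + 'static,
{
    for attempt in 0..=max_restarts {
        let w = worker.clone();
        match thread::spawn(move || w()).join() {
            Ok(()) => return Ok(()), // the worker finished normally
            Err(_) => eprintln!("worker crashed (attempt {attempt}), restarting"),
        }
    }
    // Escalate: let the next level up deal with the failure.
    Err("too many restarts".to_string())
}

fn main() {
    // A deterministic failure: restarting does not help and the
    // supervisor eventually escalates.
    assert!(supervise(2, || panic!("corrupted state")).is_err());
    // A healthy worker succeeds on the first attempt.
    assert!(supervise(0, || ()).is_ok());
    println!("supervision done");
}
```

Note that the deterministic failure is escalated no matter how many restarts we allow, which is precisely the limit of the "turn it off and on" strategy discussed next.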

Magic wand again?

I think we must not conflate the problem with the solution. Ideally, we want to:

  1. Avoid inconsistent state

  2. Isolate failures

The first point should foremost be addressed by having a minimal amount of state and being able to reason about it. We don't want to end up with a database like Oracle where the fix to some combinatorial state issue is to add even more states!

If an actor fails deterministically with some corrupted state, you can restart it 100 times and you will still get the same result. The "Let it crash!" motto seems to me a way of saying: "Let us acknowledge that our system is so complex that we will enter some undefined states. Hopefully turning it off and on will work 🙏".

It is easy to understand that this cannot be the only strategy to tame the complexity related to state management. Yet, actors, as a network of interacting state machines, encourage a combinatorial explosion of state!

The second condition, "isolate failures", also does not come out magically from supervisors. Maybe a hundred actors represent individuals in a chat room but if one of them can make the whole chat room fail, the whole system fails. We need to carefully understand the functions of our system and their dependencies to be able to isolate them. Yet, actors, by being able to start other actors and communicate with them, hide the functional dependencies of our systems!

Eventually, it seems to me that the main benefit of supervisors is to force us to think about error modes and how to address them.

Can we do better?

I think so. We don't have to reinvent the wheel. We have ways to program differently for a vast category of systems. I would describe those systems as "linear".

The communication graph is not always complete

A "linear" system is a system where data flows almost exclusively in one direction (think "activity diagram" in UML). Such a system:

  • Accepts a request, possibly after a round of negotiation, authentication, and authorization, or pulls an event from a queue

  • Processes the request or event data in several steps, possibly incorporating some local persisted state

  • Updates its local state

  • Publishes events for other systems to consume

  • Sends back a response
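The steps above can be sketched as a pipeline of plain functions. The types (Request, Event, Response) and steps are hypothetical; the point is that each stage is an ordinary function, so the data flows in one direction and the wiring is visible at a glance.

```rust
// Hypothetical types for a linear request-handling pipeline.
struct Request { user: String, amount: i64 }
struct Event { text: String }
struct Response { status: u16 }

fn authenticate(req: Request) -> Result<Request, Response> {
    if req.user.is_empty() { Err(Response { status: 401 }) } else { Ok(req) }
}

fn process(req: &Request, state: &mut i64) -> Event {
    *state += req.amount; // update the local state
    Event { text: format!("{} paid {}", req.user, req.amount) }
}

fn publish(event: &Event) {
    println!("event: {}", event.text); // for other systems to consume
}

fn handle(req: Request, state: &mut i64) -> Response {
    match authenticate(req) {
        Err(response) => response,
        Ok(req) => {
            let event = process(&req, state);
            publish(&event);
            Response { status: 200 } // send back a response
        }
    }
}

fn main() {
    let mut balance = 0;
    let response = handle(Request { user: "alice".into(), amount: 10 }, &mut balance);
    assert_eq!(response.status, 200);
    assert_eq!(balance, 10);
}
```

Every arrow in the activity diagram corresponds to a typed function call; nothing requires each box to be an independent process with its own mailbox.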

In a linear system:

  • There is a clear direction of data flow: progress is made mostly in one direction

  • Data comes front and center, the identity of its processor is less important

  • Requests are largely independent of one another, and so is their processing

  • Concurrency or asynchrony is desirable, especially if IO is involved

  • Parallelism is desirable if some processes are CPU intensive (cryptographic operations, numerical computations)

Examples of linear systems include many forms of web services, APIs, or even point-to-point communications. Non-linear systems are typically multiplayer games like Halo 4 or group-chat applications like WhatsApp and Discord.

The Future is already there

I hinted earlier that actors have been successful at a moment where asynchronous computing was sorely needed. Yet, actors are not the only solution to that problem. Case in point, the Actix web framework in Rust. It used to be built on an actor library, but:

its usefulness as a general tool is diminishing as the futures and async/await ecosystem matures

https://actix.rs/docs/whatis

Indeed, futures and async/await in our programming languages and libraries give us everything we need. We actually get a lot more:

  • We get types. A computation Future<T> produces a value of type T

  • We get combinators to process the value: future.map(f). We know the type of the resulting value!

  • We get combinators to run processes concurrently and, for example, take the first value that succeeds in being computed (think give_me_first_available_quote)

  • We get a lot closer to structured concurrency, which allows us to reason about errors and resources in a concurrent setting. You can read more about this in the fabulous blog post "Notes on structured concurrency, or: Go statement considered harmful".

There's blocking and blocking

Great: it seems that we have solved the question of blocking while still being able to read our code as a series of typed transformations of data. But what about the inconsistent-state issues that actors solve with their message queue?

One obvious solution is to introduce locks to protect resources and guarantee their atomic update:

struct Bank {
  accounts: RwLock<Vec<Account>>
}

impl Bank {
  pub async fn transfer(
    &self,
    amount: Amount,
    from: Account,
    to: Account) -> Result<(), TransferError> {
    // take the write lock (an async lock, e.g. tokio::sync::RwLock)
    let mut accounts = self.accounts.write().await;
    ...
  }
}

Compared to the actor example:

  • Calls to transfer are serialized so the data will be consistent

  • We don't have to reify the transfer function into a full data type

  • A caller waiting for its transfer action to be processed will not consume any major system resource, like a thread. The async runtime will park that computation until it is ready to proceed

  • We lose the full asynchronicity. Maybe the caller was truly not interested in the result of the transfer.

The last point seems to be a problem. Previously, the caller of an actor was able to just leave a message, then go away and process its own messages. Now the caller of the transfer function has to block until the underlying resource is free to be updated. This might not block a thread, but it will still prevent the caller from doing anything else useful. I'll come back to that soon.

Note: I am aware that locks come with their own set of pitfalls!
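To make the lock-based example above fully concrete without an async runtime, here is a blocking variant using std::sync::RwLock (the async version would use e.g. tokio::sync::RwLock instead). The write lock serializes transfers, so the data stays consistent even under concurrent writers.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Blocking sketch: the write lock plays the role of the actor's
// message queue, serializing all updates to the accounts.
struct Bank {
    accounts: RwLock<Vec<i64>>,
}

impl Bank {
    fn transfer(&self, amount: i64, from: usize, to: usize) {
        let mut accounts = self.accounts.write().unwrap();
        accounts[from] -= amount;
        accounts[to] += amount; // both updates happen under one lock
    }

    fn total(&self) -> i64 {
        self.accounts.read().unwrap().iter().sum()
    }
}

fn main() {
    let bank = Arc::new(Bank { accounts: RwLock::new(vec![1000, 0]) });
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let bank = Arc::clone(&bank);
            thread::spawn(move || {
                for _ in 0..100 {
                    bank.transfer(1, 0, 1);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // No money was created or destroyed despite 8 concurrent writers.
    assert_eq!(bank.total(), 1000);
    assert_eq!(*bank.accounts.read().unwrap(), vec![200, 800]);
}
```

The invariant that a transfer never creates or destroys money holds because both account updates happen under a single write lock, exactly the atomicity the actor's mailbox was providing.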

Stateful components

A set of async functions operating on some shared data can be grouped into a component:

trait ReviewCommittee {
   async fn propose_talk(&mut self, name: Name, talk: Talk) -> Result<Acceptance>;
   async fn withdraw_talk(&mut self, name: Name, talk: Talk) -> Result<()>;
   async fn get_final_talks(&self) -> Result<Vec<Talk>>;
}

That component might be the source of data for another component:

if let Accepted(talk) = committee.propose_talk(
    eric,
    turning_actors_inside_out).await? {
    scheduler.schedule(talk).await?
}

If we lock the internal state of the ReviewCommittee and use async functions, we don't need actors. Actually, this is no different from what GenServer does, on top of actors, in Elixir!

defmodule ReviewCommittee do
  use GenServer

  def handle_call({:propose_talk, _name, talk}, _from, talks) do
    {:reply, :accepted, [talk | talks]}
  end

  def handle_call(:get_final_talks, _from, talks) do
    {:reply, talks, talks}
  end

  def handle_cast({:withdraw_talk, _name, talk}, state) do
    {:noreply, ...}
  end
end

The documentation says:

We can primarily interact with the server by sending two types of messages. call messages expect a reply from the server (and are therefore synchronous) while cast messages do not.

With GenServer, we get the same benefits as a stateful component in async languages, and the same constraints. But we lose types, and we have to consider the fact that the implementation can spawn other independent processes that we won't know about.

Turning actors inside-out

One other interesting viewpoint for linear systems is to see them as stream-processing systems. I wrote earlier that actors have this non-blocking mode of interaction where an actor can just leave a message for another actor and go on with its day. This is a way of saying: "Here is a bit of information you might be interested in; do whatever you want with it, I don't care".

We can invert that pattern and have components that say: "I am producing a piece of information, take it if you want". This is essentially a producer-consumer pattern.

This pattern introduces a level of indirection, like actors do. We now need to create data types to encapsulate the kind of information we want to communicate between computations. But the intent is very different! Instead of creating a message that says "schedule this talk" we create a fact saying "this talk was approved".

We can now have streams of facts that we can transform, join, store, reuse, etc. This idea is very well presented in Martin Kleppmann's wonderful talk "Turning the database inside-out".

By turning actors inside out as publishers and subscribers we get amazing powers:

  • We get back types: A Subscriber<T> can only subscribe to a Producer<T>. Our software components now come with a manual on how to assemble them!

  • We get combinators: I can map a Producer<T> with a function T -> S to get a Producer<S> (and contramap a Subscriber<T> with a function S -> T to get a Subscriber<S>)

  • We decouple our software: a Producer truly doesn't have to know about its Subscribers. Whereas actors have to know about each other

  • We get a full specification on how to deal with gnarly problems like cancellation and backpressure
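A minimal sketch of the shape of these types (Producer, Subscriber, map, contramap are illustrative names, not a real reactive-streams API, and real libraries also handle backpressure and cancellation):

```rust
// Single-value stand-ins for a stream producer and consumer.
struct Producer<T>(Box<dyn Fn() -> T>);
struct Subscriber<T>(Box<dyn Fn(T)>);

impl<T: 'static> Producer<T> {
    // map: transform what is produced.
    fn map<S>(self, f: impl Fn(T) -> S + 'static) -> Producer<S> {
        Producer(Box::new(move || f((self.0)())))
    }
}

impl<T: 'static> Subscriber<T> {
    // contramap: adapt what is consumed.
    fn contramap<S>(self, f: impl Fn(S) -> T + 'static) -> Subscriber<S> {
        Subscriber(Box::new(move |s| (self.0)(f(s))))
    }
}

// A Subscriber<T> can only be wired to a Producer<T>: the types are
// the assembly manual, and neither side knows about the other.
fn subscribe<T>(producer: &Producer<T>, subscriber: &Subscriber<T>) {
    (subscriber.0)((producer.0)());
}

fn main() {
    let quotes: Producer<i64> = Producer(Box::new(|| 42));
    let printer: Subscriber<String> = Subscriber(Box::new(|s| println!("{s}")));

    // Adjust both ends until the types line up:
    let texts = quotes.map(|q| q * 2);
    let strings = printer.contramap(|n: i64| format!("quote: {n}"));
    subscribe(&texts, &strings); // prints "quote: 84"
}
```

Notice the decoupling: the producer is built without any reference to its subscribers, and the only thing connecting them is a type that must match, which is exactly the opposite of actors holding references to each other.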

Final scene

I don't want to dismiss all the great work that has been done by Erlang/Elixir, and Akka users around the world. People have built large and successful systems with these technologies. Besides, the Erlang VM has some impressive features for distribution and hot code replacement, which are worth considering.

I believe however that the actor model is way too dynamic and untyped for many of the data-oriented systems that we are building every day. The same can be said, in my opinion, about dynamic programming languages. We can do amazing things with a language like Racket, but the safety net and discoverability of Haskell (cf. Hoogle) are preferable in many cases. In the words of John Ky:

Static typing is Lego, dynamic typing is playdough

Which I would steal as

Reactive streams are Lego, actors are playdough

As far as I'm concerned, I prefer building with Legos 😊.