Turning actors inside-out
Or... do we really need actor systems?
The actor model, and its concrete incarnation in platforms like Erlang/OTP or Akka, has been presented as a great way to create scalable and resilient systems.
However, there is a dimension in which actor systems fall short in my experience. They don't scale in terms of complexity. Each time I've had to look at applications built with actors I found the overall behavior of the system pretty hard to reason about. Why?
The original sin
The reason is quite simple. As Paul Chiusano describes in "Actors are not a Good Concurrency Model", the fundamental signature of an actor is `A => Unit`. You send it a message and... something happens. This presents two major drawbacks:
You need to look inside the implementation of an actor to have the faintest idea of what it is effectively doing
Since its behavior is a side-effect, any refactoring can have unintended consequences
Despite these issues, proponents of the Actor model still claim that their approach is a great way to build concurrent and reliable systems. This is not an entirely baseless claim since a company like WhatsApp, using Erlang, was supporting 900 million users with 50 engineers in 2015! Surely I must be very wrong. Or,... is it a question of context?
In this blog post, I want to examine:
What are the properties that make actor systems attractive?
Are there ways to keep those desirable properties while still making the system as a whole a lot more amenable to navigating and reasoning?
What kind of systems are well-suited to the Actor model?
What is so good then?
For this post, I am going to ignore the use of actors for distributed systems and focus on actors used in local, concurrent systems.
There are 4 main advantages to actors:
The sequential access to in-memory data to avoid data races
An inherently asynchronous programming model
The modeling of actors as state machines mapping 1-1 to some conceptual entity
The use of supervisors for error management
Taming concurrency with actors
One major difficulty with concurrent programming is the necessity to keep our data consistent. If I keep a list of bank accounts and want to model a transfer of money between 2 accounts, I'd better make sure that the transfer happens atomically. It should not be possible to observe that money has left Alice's account without being credited to Bob's account. This kind of problem is compounded when multi-threading is involved and when the compiler can freely re-arrange your code.
In that situation, an object like a `Bank` does not have a concurrency-safe interface:
```rust
trait Bank {
    fn withdraw(&mut self, amount: Amount, from: Account)
        -> Result<(), WithdrawError>;
    fn credit(&mut self, amount: Amount, to: Account);
    fn balance(&self, account: Account) -> Result<Amount, BalanceError>;
}
```
Even if you combine `withdraw` and `credit` into one function, the implementation might not be thread-safe if we implement things naively:
```rust
trait Bank {
    fn transfer(&mut self, amount: Amount, from: Account, to: Account)
        -> Result<(), TransferError>;
    fn balance(&self, account: Account) -> Result<Amount, BalanceError>;
}
```
The solution that actors propose is to "lift" the methods into data, as messages:
```rust
enum Message {
    Transfer {
        amount: Amount,
        from: Account,
        to: Account,
    },
    Balance(Account),
}
```
```rust
trait Bank: Actor {
    fn handle_message(&mut self, message: Message);
}
```
Those messages can be sent to the message queue of an actor and become effectively serialized. The actor system of your choice then makes sure that the execution of each actor is single-threaded: problem solved!
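To make this concrete, here is a minimal sketch of that serialization, using a plain std channel and a thread as a stand-in for an actor runtime. The `u64` amounts, the account indices, and the reply channel tucked inside `Balance` are simplifications for the example, not the API of any particular actor library:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical message type mirroring the `Message` enum above
enum Message {
    Transfer { amount: u64, from: usize, to: usize },
    // Getting an answer back already requires shipping a reply channel
    // along with the message
    Balance { account: usize, reply: mpsc::Sender<u64> },
}

fn spawn_bank(initial: Vec<u64>) -> mpsc::Sender<Message> {
    let (sender, receiver) = mpsc::channel::<Message>();
    thread::spawn(move || {
        let mut accounts = initial;
        // Single-threaded processing: messages are handled one at a time,
        // so the accounts can never be observed in a half-updated state
        for message in receiver {
            match message {
                Message::Transfer { amount, from, to } => {
                    if accounts[from] >= amount {
                        accounts[from] -= amount;
                        accounts[to] += amount;
                    }
                }
                Message::Balance { account, reply } => {
                    let _ = reply.send(accounts[account]);
                }
            }
        }
    });
    sender
}
```

Note that even in this tiny sketch, reading a balance back forces us to thread a reply channel through the message, which foreshadows the drawbacks discussed below.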
Problem solved, at a massive cost
All the problems in computer science can be solved by another level of indirection
https://www2.dmst.aueb.gr/dds/pubs/inbook/beautiful_code/html/Spi07g.html
We solved this local, tactical, problem at the expense of a massive indirection. Now our program turns from:
```rust
if let Ok(_) = bank.transfer(Amount(100), alice, bob) {
    sms_system.send_message(bob, "transfer done!")
}
```
to:
```rust
bank.send_message(Message::Transfer { amount: Amount(100), from: alice, to: bob });
```
This has many implications in terms of practical concerns:
We cannot anymore click on the `transfer` call and have a look at the implementation. Instead, we need to go through the entire pattern match inside `handle_message` to see where the case for `Transfer` is handled. If there are several message variants, the actual code dealing with transfers is probably tucked away in an internal function somewhere else
How do we get back a result telling us that the transfer was successful? Now the `Bank` actor needs to get a reference to the sender and send a result back
In general, specifying what happens next is not obvious. In the first example, with functions, we notify the SMS system that Bob's account has been credited. With actors, we would probably start the `Bank` actor with a reference to the `Sms` actor and send a `Credited` message from inside the actor implementation.
With actors, we have to read the code to understand the flow of data in our system. When reading the code is not enough, we have to instrument the actors at runtime to get some insight into what's going on.
It's not a magic wand
Each actor has its own internal state, free of race conditions, great. Yet this does not solve all state-related issues. What if we try to model each `Account` as an actor? How can we implement a transfer scenario so that the transfer between two accounts becomes atomic?
We have to come up with other patterns and pieces of infrastructure, a coordinating actor or a transaction protocol for example, to solve these problems.
Don't (a)wait for me!
A system like Akka became quite popular at a time when async programming was becoming an absolute requirement. A web server dealing with thousands of requests could not allow its threads to be idly waiting for responses from other services or a database. On the contrary, an actor system can spin off millions of actors quietly waiting for an IO request to complete:
The caller of the actor posts a message and doesn't have to block until that message is processed
It can be notified, via a message, when some data has arrived
This is a great recipe for scalability: a few cores handling millions of requests!
But this is not the only solution to that problem, or, let's say, not the only way to package it. The heart of the solution is to have some sort of "continuation": what should I do when some data finally arrives?
Nowadays there are many alternatives to actors for dealing with such continuations:
Futures in Scala
async/await in Rust
async and monads in Haskell
async runtimes in Scala: cats effect and ZIO
The major difference between these approaches and actors is that they don't expose a programming model made of `A => Unit` functions. With a `Future` you act as if you already had the value you're waiting on and you `map` it to describe what to do next. Now the flow of data between computations is obvious and computing resources are used wisely. We will come back to that later.
Actors as top-models
According to Carl Hewitt, unlike previous models of computation, the actor model was inspired by physics, including general relativity and quantum mechanics.
Actors are a tool of choice to represent independent entities when they have their own internal state and interfaces. For example, it seems quite natural to model a game like Halo with an actor system (using Microsoft's Orleans platform).
Moreover, we can use state machines to represent:
The internal state of an actor as some state variables
The received messages as state events
The processing of messages as transitions in the state machine
This is wonderful because there exist a large number of tools to:
Represent state machines
Reason about them: are there unreachable states? Will we always arrive at an end state? Are we missing some events?
Test state machines: what is the minimal number of tests covering each state?
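As a sketch, here is a talk submission modeled this way (the states, events, and transitions are illustrative). The whole behavior is one total, inspectable transition function, and the compiler itself checks that no (state, event) pair is forgotten:

```rust
// Illustrative model of a talk submission as a state machine
#[derive(Debug, PartialEq, Clone, Copy)]
enum TalkState {
    Submitted,
    Accepted,
    Scheduled,
    Withdrawn,
}

#[derive(Clone, Copy)]
enum TalkEvent {
    Accept,
    Schedule,
    Withdraw,
}

// One total transition function: easy to inspect for unreachable states
// or missing events, and exhaustiveness is enforced by the compiler
fn transition(state: TalkState, event: TalkEvent) -> TalkState {
    use TalkEvent::*;
    use TalkState::*;
    match (state, event) {
        (Submitted, Accept) => Accepted,
        (Accepted, Schedule) => Scheduled,
        (Submitted | Accepted, Withdraw) => Withdrawn,
        // any other combination leaves the state unchanged
        (state, _) => state,
    }
}
```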
However, any time we have several independent systems collaborating, we also get a combinatorial explosion in complexity. Each local state might be well specified, but the overall combination of all states might exhibit some emergent behavior that we did not expect. Think about how the Game of Life generates amazing behaviors out of very simple cells.
Not every system is a physics simulation
The thing is, many of our systems do not need to be represented as small individual processes. And the fewer states we need to maintain, the better off we are.
The least amount of state is a pure function. Look ma', no state:
```rust
fn compute_taxes(income: Amount) -> Amount
```
Yet it is not always practical, or even possible, to compute only with non-stateful functions. It would be quite repetitive to pass around all the state required for computations:
```rust
fn compute_taxes(
    income: Amount,
    tax_rules: TaxRules,
    previous_tax_paid: Vec<Amount>) -> Amount
```
Worse, it is not even desirable. There are implementation details that we wish to hide, like the fact that the tax algorithm is parameterized by rules or that it uses the amount of taxes paid in the previous year to calculate this year's taxes.
In any case, even if we end up with a function having some side effects, by enclosing some state, we are still in a better position than when using actors. Because functions give us a flow of data. "Something in, something out". And the "something out" is ready to become "something in" for another function.
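A sketch of such a stateful-but-flow-preserving component (the flat tax rate and the carry-over credit rule are made up for the example):

```rust
// Hypothetical: the calculator encloses its rules and history as private
// state but keeps a plain "something in, something out" interface
struct TaxCalculator {
    tax_rate_percent: u64,
    previous_tax_paid: Vec<u64>,
}

impl TaxCalculator {
    fn compute_taxes(&self, income: u64) -> u64 {
        let base = income * self.tax_rate_percent / 100;
        // Implementation detail hidden from callers: 10% of last year's
        // taxes is deducted as a credit
        let credit = self.previous_tax_paid.last().copied().unwrap_or(0) / 10;
        base.saturating_sub(credit)
    }
}
```

The returned amount is plain data, ready to become the input of the next function in the flow.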
It is not entirely by accident that an actor system like Akka ended up proposing a `Stream` abstraction:
We have found it tedious and error-prone to implement all the proper measures in order to achieve stable streaming between actors, since in addition to sending and receiving we also need to take care to not overflow any buffers or mailboxes in the process. Another pitfall is that Actor messages can be lost and must be retransmitted in that case. Failure to do so would lead to holes at the receiving side
For me, this is thinking backward:
"Let's start with a general system where we can do everything"
"Let's constrain our system by adding another layer"
In some ways you can see a similar issue with dynamic vs. static typing in programming languages:
"Haha, no pesky compiler to prevent me from treating this variable as an `int` here and a `bool` there. It just works. Why would I care?"
"Oops, my system is really hard to understand, give me types please":
for Python: https://mypy-lang.org
for Javascript: https://www.typescriptlang.org
for Ruby: https://sorbet.org
In the words of Rúnar Bjarnason:
Constraints liberate, liberties constrain
very relevant talk excerpt: https://youtu.be/GqmsQeSzMdw?t=1452
I'm tolerant. To a fault.
One often-cited advantage of actor systems is the idea that those systems can be made fault-tolerant. How does that work? You restart the failing actor.
No really, that's it. If some state ends up being corrupted in an actor, a sensible strategy is to start afresh. And if several actors are cooperating to accomplish a task, it makes sense to restart all of them to ensure that their global state is correct.
This is the role of `Supervisors` in a system like Erlang/OTP or Akka. Since actors can spawn other actors, an actor system defines a tree of actors in a parent/child relationship. At each level, we can associate a supervisor. In case of a failure of an actor at that level, it can decide to:
Restart that actor
Restart all the actors on that level
Propagate the failure one level up
This way of dealing with faults is efficient for two reasons:
It addresses some common reasons for failures: incorrect state, maxed-out internal queue, or unreachable resources.
It allows the definition of independent processes or sub-processes that can be restarted individually thus containing errors to a limited part of the system
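The decision itself can be pictured as a small function. This is only a sketch of the idea, not the actual Erlang/OTP or Akka API, and the mapping from failures to strategies is invented for the example:

```rust
// Hypothetical supervision sketch, not a real supervisor API
#[derive(Debug, PartialEq)]
enum Directive {
    RestartChild,
    RestartAllChildren,
    Escalate,
}

enum Failure {
    CorruptedState,
    MailboxFull,
    ResourceUnreachable,
}

// The supervisor inspects a child failure and picks one of the three
// strategies listed above; past a restart budget it escalates
fn decide(failure: &Failure, restarts_so_far: u32, max_restarts: u32) -> Directive {
    if restarts_so_far >= max_restarts {
        // Restarting did not help: propagate the failure one level up
        return Directive::Escalate;
    }
    match failure {
        // Fresh state is likely enough to recover this one actor
        Failure::CorruptedState => Directive::RestartChild,
        // These failures may involve sibling actors: restart the whole level
        Failure::MailboxFull | Failure::ResourceUnreachable => Directive::RestartAllChildren,
    }
}
```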
Magic wand again?
I think we must not conflate the problem with the solution. Ideally, we want to:
Avoid inconsistent state
Isolate failures
The first point should foremost be addressed by having a minimal amount of state and being able to reason about it. We don't want to end up with a database like Oracle where the fix to some combinatorial state issue is to add even more states!
If an actor fails deterministically with some corrupted state, you can restart it 100 times and you will still get the same result. The "Let it crash!" motto seems to me a way of saying: "Let us acknowledge that our system is so complex that we will enter some undefined states. Hopefully turning it off/on will work 🙏".
It is easy to understand that this cannot be the only strategy to tame the complexity related to state management. Yet, actors, as a network of interacting state machines, encourage a combinatorial explosion of state!
The second condition, "isolate failures", also does not come out magically from supervisors. Maybe a hundred actors represent individuals in a chat room but if one of them can make the whole chat room fail, the whole system fails. We need to carefully understand the functions of our system and their dependencies to be able to isolate them. Yet, actors, by being able to start other actors and communicate with them, hide the functional dependencies of our systems!
Eventually, it seems to me that the main benefit of supervisors is to force us to think about error modes and how to address them.
Can we do better?
I think so. We don't have to reinvent the wheel. We have ways to program differently for a vast category of systems. I would describe those systems as "linear".
The communication graph is not always complete
A "Linear" system is a system where data flows almost exclusively in one direction (think "activity diagram" in UML). Such a system:
Accepts a request, possibly after a round of negotiation, authentication, and authorization. Alternatively, pulls an event from a queue
Processes the request or event data in several steps, possibly incorporating some local persisted state
Updates its local state
Publishes events for other systems to consume
Sends back a response
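The steps above can be sketched as one linear flow. All the types and step implementations here are hypothetical placeholders:

```rust
// Hypothetical placeholders for the steps listed above
struct Request { user: String, body: String }
struct Event { kind: String }
struct Response { status: u32 }

fn authenticate(request: &Request) -> Result<(), u32> {
    if request.user.is_empty() { Err(401) } else { Ok(()) }
}

fn handle(request: Request, local_state: &mut Vec<String>, events: &mut Vec<Event>) -> Response {
    // 1. accept the request, after authentication/authorization
    if let Err(status) = authenticate(&request) {
        return Response { status };
    }
    // 2. and 3. process the request data and update the local state
    local_state.push(request.body);
    // 4. publish an event for other systems to consume
    events.push(Event { kind: "request_processed".to_string() });
    // 5. send back a response
    Response { status: 200 }
}
```

Data enters at the top and exits at the bottom; no step needs to know who called it.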
In a linear system:
Data has a clear direction and makes progress mostly one way
Data comes front and center, the identity of its processor is less important
There is some level of independence between requests and how they are processed
Concurrency or asynchronism is desirable, especially if IO is involved
Parallelism is desirable if some processes are CPU intensive (cryptographic operations, numerical computations)
Examples of linear systems include many forms of web services, APIs, or even point-to-point communications. Non-linear systems are typically multiplayer games like Halo 4 or group chat applications like WhatsApp and Discord.
The `Future` is already there
I hinted earlier that actors have been successful at a moment where asynchronous computing was sorely needed. Yet, actors are not the only solution to that problem. Case in point, the Actix web framework in Rust. It used to be built on an actor library, but:
its usefulness as a general tool is diminishing as the futures and async/await ecosystem matures
Indeed, futures and `async` support in our programming languages and libraries give us all the support we need. We actually get a lot more:
We get types. A computation `Future<T>` produces a value of type `T`
We get combinators to process the value: `future.map(f)`. We know the type of the resulting value!
We get combinators to run processes concurrently and, for example, take the first value that succeeds in being computed: `give_me_first_available_quote`
We get a lot closer to structured concurrency, which allows us to reason about errors and resources in a concurrent setting. You can read more about this in the fabulous blog post "Go statement considered harmful".
There's blocking and blocking
Great, it seems that we solved the question of blocking while still being able to read our code as a series of typed transformations of data. But what about the issues with inconsistent state that actors solve with their message queue?
One obvious solution is to introduce locks to protect resources and guarantee their atomic update:
```rust
struct Bank {
    accounts: RwLock<Vec<Account>>
}

impl Bank {
    pub async fn transfer(
        &self,
        amount: Amount,
        from: Account,
        to: Account) -> Result<(), TransferError> {
        // take an exclusive, async-aware lock (e.g. tokio::sync::RwLock)
        let mut accounts = self.accounts.write().await;
        ...
    }
}
```
Compared to the actor example:
Calls to `transfer` are serialized so the data will be consistent
We don't have to reify the `transfer` function into a full data type
A caller waiting for its `transfer` action to be processed will not consume any major system resource, like a thread. The async runtime will park that computation until it is ready to proceed and continue
We lose the full asynchronicity. Maybe the caller was truly not interested in the result of the transfer.
The last point seems to be a problem. Previously the caller of an actor was able to just leave a message and then could go away and process its own messages. Now, the caller of the `transfer` function has to block until the underlying resource is free to be updated. This might not block a thread but it will still prevent the function caller from doing anything else useful. I'll come back to that soon.
Note: I am aware that locks come with their own set of pitfalls!
Stateful components
A set of `async` functions operating on some shared data can be regrouped in a component:
```rust
trait ReviewCommittee {
    async fn propose_talk(&mut self, name: Name, talk: Talk) -> Result<Acceptance>;
    async fn withdraw_talk(&mut self, name: Name, talk: Talk) -> Result<()>;
    async fn get_final_talks(&self) -> Result<Vec<Talk>>;
}
```
That component might be the source of data for another component:
```rust
if let Accepted(talk) = committee.propose_talk(eric, turning_actors_inside_out).await? {
    scheduler.schedule(talk).await?
}
```
If we lock the internal state of the `ReviewCommittee` and use `async` functions, we don't need actors. Actually, this is not different from what `GenServer` is doing, on top of actors, in Elixir!
```elixir
defmodule ReviewCommittee do
  use GenServer

  def handle_call({:propose_talk, name, talk}, _from, talks) do
    {:reply, :accepted, [talk | talks]}
  end

  def handle_call(:get_final_talks, _from, talks) do
    {:reply, talks, talks}
  end

  def handle_cast({:withdraw_talk, name, talk}, talks) do
    {:noreply, ...}
  end
end
```
The documentation says:
We can primarily interact with the server by sending two types of messages. call messages expect a reply from the server (and are therefore synchronous) while cast messages do not.
With `GenServer` we get the same benefits as a stateful component in `async` languages, and the same constraints. But we lose types and have to consider the fact that the implementation can spawn other independent processes that we won't know about.
Turning actors inside-out
One other interesting viewpoint for linear systems is to see them as stream-processing systems. I wrote earlier that actors have this non-blocking mode of interaction where an actor can just leave a message for another actor and go on with its day. This is a way of saying: "Here is a bit of information you might be interested in, do whatever you want with it, I don't care".
We can invert that pattern and have components that say: "I am producing a piece of information, take it if you want". This is essentially a producer-consumer pattern.
This pattern introduces a level of indirection, like actors do. We now need to create data types to encapsulate the kind of information we want to communicate between computations. But the intent is very different! Instead of creating a message that says "schedule this talk" we create a fact saying "this talk was approved".
We can now have streams of facts that we can transform, join, store, reuse etc... This idea is very well presented by the wonderful talk from Martin Kleppmann, "Turning the database inside-out".
By turning actors inside out as publishers and subscribers we get amazing powers:
We get back types: a `Subscriber<T>` can only subscribe to a `Producer<T>`. Our software components now come with a manual on how to assemble them!
We get combinators: I can `map` a `Producer<T>` with a function `T -> S` to get a `Producer<S>` (and `contramap` a `Subscriber<T>` with a function `S -> T` to get a `Subscriber<S>`)
We decouple our software: a `Producer` truly doesn't have to know about its `Subscriber`s, whereas actors have to know about each other
We get a full specification on how to deal with gnarly problems like cancellation and backpressure
Final scene
I don't want to dismiss all the great work that has been done by Erlang, Elixir, and Akka users around the world. People have built large and successful systems with these technologies. Besides, the Erlang VM has some impressive features for distribution and hot code replacement, which are worth considering.
I believe however that the actor model is way too dynamic and untyped for many of the data-oriented systems that we are building every day. The same can be said, in my opinion, about dynamic programming languages. We can do amazing things with a language like Racket, but the safety net and discoverability of Haskell (cf. Hoogle) are preferable in many cases. In the words of John Ky:
Static typing is Lego, dynamic typing is playdough
Which I would steal as
Reactive streams are Lego, actors are playdough
As far as I'm concerned, I prefer building with Legos 😊.