Refactoring an Actor Model System from Nact.io to Camunda Cloud

by Josh Wulf on Dec 14 2020 in Tutorials, Use cases.

This month marks the 20th anniversary of the open-sourcing of the Erlang language and the Erlang virtual machine.

Renowned for its reliability, the ability to hot-patch running code, and the Actor model, Erlang has been incredibly influential in software development, particularly in scalable distributed systems. If you look through the Zeebe source code, you’ll see Actor components sprinkled throughout.

Zeebe core engineer Deepthi Akkoorath wrote her PhD thesis on Scalable Consistency in the Multi-core Era and is a huge fan of Erlang. Here is a video of Deepthi describing the architecture of Zeebe, the workflow engine that powers Camunda Cloud:

A couple of months ago, I wrote a statistics collector for the Camunda DevRel team. You can see the project in the Camunda DevRel GitHub here. Once a month it queries the NPM API and the APIs of a number of Discourse forums to collect a running total of package downloads, forum sign-ups, and posts. This allows us to monitor the health and progress of our community.

I took the opportunity to experiment with Nact.io, an actor model library for JavaScript.

The Nact.io version of the stats collector application is in this tag.

Nact allows you to use the actor model to architect your application. Every actor in the system is an independent, optionally stateful component. Actors communicate by passing messages (one-way) to other actors, or by executing queries (call and response) on them. In Nact, in contrast to Erlang, actors run in the same memory space and on the same thread, so it doesn't have the same resiliency guarantees. You can optionally configure Postgres to persist application state, so that if your application code crashes, the application can restart from where it left off.

Messages are events, and the event log is replayed on application startup to rebuild the application state. Nact bills itself as “Redux for the server”, and the idea is that your actors are reducers over actions (messages) that coordinate to build application state.
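
The mental model is simple enough to sketch without the library. This is not the Nact API, just a minimal illustration of the idea: an actor whose state is a reduction over the messages it has received, exactly like a Redux reducer over actions.

```typescript
// Minimal actor sketch (not the Nact API): each actor owns its state and
// evolves it by reducing, in order, over the messages it receives.
type Handler<S, M> = (state: S, msg: M) => S;

class Actor<S, M> {
  private mailbox: M[] = [];
  constructor(public state: S, private handler: Handler<S, M>) {}

  // "dispatch" is one-way: the message is queued, then reduced into state.
  // Replaying the same message log from scratch rebuilds the same state,
  // which is what makes persistence-by-event-log possible.
  dispatch(msg: M) {
    this.mailbox.push(msg);
    while (this.mailbox.length) {
      this.state = this.handler(this.state, this.mailbox.shift()!);
    }
  }
}

// A counter actor: its state is just the fold of its messages
const counter = new Actor({ count: 0 }, (s, m: { add: number }) => ({
  count: s.count + m.add,
}));
counter.dispatch({ add: 2 });
counter.dispatch({ add: 3 });
```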

My DevRel stats collector was a refactoring of a simple script that ran as an AWS Lambda. That one was built by an engineer in another team. When I took it over, we spent a bunch of time trying to move the deployment and credentials between accounts and departments. The lack of debugging visibility while I tried to get the existing APIs working and add new ones led me to reimplement it as something I could run on my local machine, and inspect while it ran on a VM.

Oftentimes, total complexity in a system is conserved - it’s just a question of how you distribute and encapsulate it.

While rewriting this system, I figured I’d get my hands dirty with the actor model, and I hooked it up to Camunda Cloud at the same time.

The entire Camunda Cloud workflow looked like this:

This is marketing-driven development, or “aspirational dog-fooding” more than anything else. My application uses the node-schedule package to start an instance of the workflow on the second day of each month, and the workflow has a single task, which is serviced by a worker in my application. This worker sends a message to the actor at the top of the hierarchy.
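
The trigger logic boils down to very little. A sketch, with stand-ins rather than the real node-schedule and zeebe-node calls (the process id here is hypothetical):

```typescript
// Illustrative stand-ins, not the real node-schedule / zeebe-node APIs.
// The real application uses node-schedule; the effect is just this:
const shouldRunToday = (date: Date) => date.getDate() === 2;

// Hypothetical stand-in for creating a workflow instance in Camunda Cloud
const startStatsWorkflow = async () => ({ bpmnProcessId: "devrel-stats" });

async function tick(now: Date) {
  // On the second day of the month, kick off the BPMN process; its single
  // task is then serviced by the worker in this same application.
  return shouldRunToday(now) ? startStatsWorkflow() : null;
}
```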

That actor sends a query each to an NPM actor and a Discourse actor, which retrieve the data from the remote APIs and return it to the system actor. The system actor then reduces the result into a message that it sends to a Google actor, which writes it into a spreadsheet.

Camunda Cloud is doing nothing more than adding a logo to the tech stack in this implementation.

When it came time to write this blog post, I’d had more time to think about it.

The Actor model is like the Smalltalk idea of OOP: objects that pass messages.

It’s very much like the model of Camunda Cloud: distributed stateless workers orchestrated by a stateful engine. The stateful engine holds the reduced application state (effectively playing the role of Nact's Postgres persistence plus rehydration).

In my system actor I had a map function to send one query per query subkey to the actors, and a reducer to collate the results. These map to the multi-instance marker in BPMN.

My first cut at refactoring my Actor-based application into the Zeebe distributed worker / BPMN model looked like this:

I call this one “the worst of all worlds”.

I’ve now split the orchestration of my system between Zeebe and my code, increasing its complexity. But I haven’t increased the intelligibility of my system - in fact, it has decreased.

My Nact system relies on query keys to route subkeys of the query to the appropriate actor. The query is a JSON object with two top-level keys: npmPackageDownloads and discourseForumStats. These are also the names that the actors are registered with, so the map function in the system actor is generic, and now you can see that in the model.
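
In code, that convention looks something like this. The two key names are the real ones from the post; the registry and its responses are illustrative stand-ins for the actors:

```typescript
// The query's top-level keys double as the names the actors are
// registered under, so the system actor's map is completely generic.
const actorRegistry: Record<string, (subquery: unknown) => string> = {
  // Stand-ins for the real NPM and Discourse actors
  npmPackageDownloads: (q) => `npm queried with ${JSON.stringify(q)}`,
  discourseForumStats: (q) => `discourse queried with ${JSON.stringify(q)}`,
};

// One query per top-level subkey, routed by name: no routing table needed
const routeQuery = (query: Record<string, unknown>) =>
  Object.entries(query).map(([key, subquery]) => actorRegistry[key](subquery));

const responses = routeQuery({
  npmPackageDownloads: { packages: ["zeebe-node"] },
  discourseForumStats: { forums: ["forum.camunda.org"] },
});
```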

This is the kind of “configuration by convention” magic that you find in Ember.js (and I imagine Rails and Elixir, which come from the same family).

However, once I lift that into BPMN, I get all the complexity of a distributed system, with none of the benefits of decomposition or visibility.

So, I went with this model:

This model makes the query subkeys concrete. They are already concrete in the actors / workers, so this is actually a more faithful representation of the system. I got the system documentation for free.

The other day, someone asked in the Zeebe Forum about Zeebe vs Temporal.io. Mauricio and I had a looooong Twitter discussion about it with the former Uber Cadence/Temporal founders.

My Nact.io implementation was pure code. In terms of comprehensibility of the workflow nature of the system, I would be better off using something like Temporal, if I were going to keep it as pure code.

Lifted into BPMN, the generic machinery of my map/reducer, while elegant in code, is incomprehensible. BPMN forced me to be more explicit and communicate my intent, rather than write the most abstract coded solution possible. That's a win for future maintenance.

However, BPMN has a different audience than code. When I first encountered Zeebe, doing a POC at a previous company, I worked with a Business Analyst, Wendy. Wendy couldn’t read or write code, but she understood processes, and BPMN.

This diagram is aimed at the business side of the business. My manager now, Mary Thengvall, may not be able to parse the code that implements this system - but she can understand that diagram.

And any new engineer who comes to work on it can grok it at a glance. That includes me in six months' time. Your future self can grok Temporal workflows too, so if you want a pure-code solution, I'd recommend that over a custom actor implementation. Processes are directed evolutions of state over time, so you want them as a first-class entity in your system.

And if you need to communicate and collaborate on the business process across organisational departments, BPMN gives you a system representation that everyone can understand, and that doesn’t go out of date.

Refactoring from Actors to Workers

The refactoring from actors to workers was quite simple. A Zeebe system running in Camunda Cloud turns out to be conceptually very similar to an actor system.

I had to change the shape of the data structures slightly to accommodate the reduction of results by the multi-instance into a specific key in the payload.
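
Roughly, the change looks like this. The key names and numbers here are made up for the example, not taken from the project:

```typescript
// Illustrative only: each multi-instance task instance returns one
// result object...
const instanceResults = [
  { source: "npmPackageDownloads", stats: { "zeebe-node": 1000 } },
  { source: "discourseForumStats", stats: { signups: 10, posts: 100 } },
];

// ...and the multi-instance output collection gathers them under a single
// key in the payload, so the next task in the workflow sees:
const payload = { collectedStats: instanceResults };
```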

I used the parameterisation of the worker to strongly type payloads in the Zeebe Node client. This gave me intellisense while refactoring, and meant that the data structure was itself a first-class entity.
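
Sketching the idea (the interfaces and names here are mine, not from the project; in the Zeebe Node client, the worker's type parameters play this role):

```typescript
// Hypothetical payload types; parameterising the worker with types like
// these makes the job variables typed end to end.
interface StatsJobVariables {
  month: string; // e.g. "2020-12"
  sources: string[];
}
interface StatsJobOutput {
  collectedStats: Record<string, number>;
}

// A function shaped like a worker's task handler: typed variables in,
// typed variables out, with intellisense on both sides.
const handleStatsJob = (vars: StatsJobVariables): StatsJobOutput => ({
  // Placeholder counts; the real handler would call the remote APIs
  collectedStats: Object.fromEntries(vars.sources.map((s) => [s, 0])),
});

const output = handleStatsJob({ month: "2020-12", sources: ["npm", "discourse"] });
```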

I also had to change slightly where things like the date range are generated, and how they are serialised, because my workers, although running on the same machine in the same memory space (in this instance), now communicate using JSON that is serialised over gRPC.
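
Concretely, the issue is that a Date object survives in-process message passing, but not a JSON round-trip:

```typescript
// In-process, the date range can be passed around as Date objects...
const range = { from: new Date(2020, 10, 1), to: new Date(2020, 11, 1) };

// ...but once it travels as workflow variables (JSON over gRPC), the
// Dates arrive as ISO strings and must be re-parsed at the receiving end.
const overTheWire = JSON.parse(JSON.stringify(range));
const from = new Date(overTheWire.from);
```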

Using the BPMN multi-instance marker, I got rid of this epic results reducer:

const response = outcomes.reduce(
  (prev, curr) => {
    if (curr.status === "fulfilled") {
      return curr.value.result
        ? {
            ...prev,
            results: { ...prev.results, ...curr.value.result },
          }
        : {
            ...prev,
            errors: [...prev.errors, curr.value.error],
          };
    }
    return {
      ...prev,
      errors: [...prev.errors, { PromiseRejected: curr }],
    };
  },
  { results: {}, errors: [] }
);

Good developers copy, great developers paste - legendary developers delete.

Handling success and failure in an array of asynchronous tasks is one of the most common issues for network programmers (based on my analysis of StackOverflow questions in the JavaScript tag).

In my Nact implementation, failures are silently dropped (they do get logged to the console), and successful tasks are written to the spreadsheet.

In the new system, a failure of any of the calls results in the worker handler throwing, the workflow execution halting, and an incident being raised in Operate.
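
The handler side of that is now trivial. A sketch, with stand-in API calls rather than the project's real functions: any rejection escapes the handler, which fails the job, and the incident surfaces in Operate.

```typescript
// Stand-ins for the real API calls; one of them fails.
const fetchNpmStats = async () => ({ downloads: 100 });
const fetchDiscourseStats = async (): Promise<{ posts: number }> => {
  throw new Error("Discourse API returned 500");
};

// Instead of reducing over settled outcomes, the worker handler just
// awaits everything. Promise.all rejects on the first failure, and the
// thrown error fails the job instead of being silently dropped.
async function collectStats() {
  const [npm, discourse] = await Promise.all([
    fetchNpmStats(),
    fetchDiscourseStats(),
  ]);
  return { npm, discourse };
}
```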

This raises the visibility of an issue (you don’t have to notice something missing in the spreadsheet to detect it), and forces it to be addressed, while providing debugging information.

What you didn’t know was missing from your life

I was scheduled to stream live with one of the developers I met at unStack Africa 2020, but she had a power cut right after we started, so instead I streamed this entire refactoring - all three hours of it.

You’re welcome!