Performance Profiling Zeebe

by Josh Wulf and Klaus Nji on Dec 22 2019 in Benchmarks, Performance.

We frequently get questions about Zeebe’s performance. The answer to any performance question is easy: “It depends”. In this post, Zeebe Developer Advocate Josh Wulf and Zeebe Community member Klaus Nji talk about what it depends on, and how you can get performance benchmarks that answer the question that you actually want to answer: “Can Zeebe do what I need it to do, and how do I need to configure it to do that?”

As Albert Einstein famously said: “There are lies, damned lies, and then there are benchmarks” (or was that Aristotle?)

Every system has a performance envelope. It is multi-dimensional, and its boundaries change in response to different variables. How the boundaries change and the rate of that change in various scenarios give the performance envelope characteristics.

What does Zeebe’s performance envelope look like?

It depends on what you do with it.

There are so many variables that someone else’s performance tests can be irrelevant or misleading to performance in your use-case.

If a test shows that you can start 2000 workflows/second, but you find out later that you need to wait 5s for each one to complete under that load - and you need it to be 3s, now what?

You have to build a mock scenario that matches your use-case, and performance profile it yourself, systematically mapping the performance envelope - and when you get one, you need to re-run it on each release of the broker.
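The measuring side of such a profile can be sketched in a few lines of Java. This is only a sketch, not part of any Zeebe API: the class `LatencyProfile`, its methods, and the sample numbers are made up for illustration. It assumes your load test records, for every workflow instance, the time from creation to completion in milliseconds, and it checks a percentile of those samples against your own target (3000 ms here, to match the scenario above).

```java
import java.util.Arrays;

/** Summarises end-to-end workflow latencies collected by your own load test (hypothetical helper). */
final class LatencyProfile {

    /** Nearest-rank percentile, with p in (0, 100], over latency samples in milliseconds. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // 1-based rank of the sample at or below which p percent of samples fall.
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True if the chosen percentile stays within the target latency. */
    static boolean meetsTarget(long[] samplesMs, double p, long targetMs) {
        return percentile(samplesMs, p) <= targetMs;
    }

    public static void main(String[] args) {
        // Made-up samples: time from "instance created" to "instance completed", in ms.
        long[] samplesMs = {2100, 2400, 2600, 2900, 3100, 3400, 4200, 5000};
        System.out.println("p50 = " + percentile(samplesMs, 50) + " ms");
        System.out.println("p95 = " + percentile(samplesMs, 95) + " ms");
        System.out.println("meets 3000 ms at p95: " + meetsTarget(samplesMs, 95, 3000));
    }
}
```

Looking at a high percentile rather than the average is a deliberate choice here: a start rate can look healthy while the slowest instances blow through your completion target.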

Yes, there are existing benchmarks, and I’ll list some below. As more people profile their use-cases, the body of knowledge about Zeebe’s performance envelope will grow. But there is no substitute for doing it with your specific use case.

From experience: I performance profiled Zeebe in late 2018/early 2019, and the workflow instance creation rate in our configuration was sufficient. I used a benchmarking repo that Bernd Rücker made, one that Terraforms an AWS cluster with massive nodes in it.

It was only later that we discovered in our own profiling that end-to-end processing of a workflow on 0.22-alpha.1 incurs a 34-52ms overhead per BPMN flow node, and you can’t reduce it by adding more brokers, because it all happens on a single broker - and we needed it to be 100ms for the entire workflow. (It will get faster, but not in 0.22.)
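To make that arithmetic concrete, here is a back-of-the-envelope sketch in Java. The 34-52 ms per-node range and the 100 ms budget come from our measurements above. Everything else is an assumption for illustration: the class and method names are invented, the overhead is assumed to simply add up per flow node, worker execution time is ignored, and the five-node workflow is hypothetical.

```java
/** Back-of-the-envelope check of per-node engine overhead against a latency budget (hypothetical helper). */
final class NodeOverheadBudget {

    /** Total engine overhead in ms, assuming the per-node overhead simply adds up. */
    static long totalOverheadMs(int flowNodes, long perNodeOverheadMs) {
        return (long) flowNodes * perNodeOverheadMs;
    }

    /** Largest number of flow nodes whose combined overhead still fits in the budget. */
    static long maxFlowNodes(long budgetMs, long perNodeOverheadMs) {
        return budgetMs / perNodeOverheadMs; // integer division rounds down
    }

    public static void main(String[] args) {
        long budgetMs = 100;   // end-to-end target for the whole workflow
        long bestCaseMs = 34;  // lower bound of the measured per-node overhead
        long worstCaseMs = 52; // upper bound of the measured per-node overhead
        int flowNodes = 5;     // hypothetical workflow size

        System.out.println("overhead for " + flowNodes + " nodes: "
                + totalOverheadMs(flowNodes, bestCaseMs) + "-"
                + totalOverheadMs(flowNodes, worstCaseMs) + " ms");
        System.out.println("nodes that fit in " + budgetMs + " ms: "
                + maxFlowNodes(budgetMs, worstCaseMs) + " (worst case) to "
                + maxFlowNodes(budgetMs, bestCaseMs) + " (best case)");
    }
}
```

Under these assumptions, only one or two flow nodes fit inside a 100 ms budget, before any worker has done real work. That is the kind of result a start-rate benchmark will never show you.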

We should have profiled the performance envelope of the entire system, systematically, with the parameters of our use-case and a representative workload.

There is no substitute for that, and someone else’s performance test will not match your parameters. So benchmarks should all be taken with a grain of salt, except for your own, which you bet your tech stack on.

Example benchmarks

These benchmarks are best consumed to get ideas on how you write your own benchmarking / profiling.

Yo Klaus, drop some knowledge!

General observations on tuning Zeebe performance

When thinking about performance, note that clustering Zeebe is mostly about achieving fault tolerance and throughput, measured as how many workflow instances you complete in a certain amount of time. The number of workflow instances that can be started is not a realistic measure of performance, because creating and starting workflow instances only for them to fail or not complete is not useful. Clustering comes with the overhead of managing nodes, partitioning and replication, which takes CPU cycles away from actually executing workflow instances. In other words, it is not free, which is why you should not expect x times the number of completed workflow instances if you increase your broker count by a factor of x. Expect less than x and be happy with that.
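One way to put a number on "less than x" is to compare completed-instances-per-second from two of your own runs. The sketch below is an illustration only: the class `ScalingEfficiency`, its methods, and the throughput figures are invented, not measured Zeebe results.

```java
/** Compares completed-instance throughput between two cluster sizes (hypothetical helper). */
final class ScalingEfficiency {

    /** Speedup = throughput of the larger cluster divided by the baseline throughput. */
    static double speedup(double baselinePerSec, double scaledPerSec) {
        if (baselinePerSec <= 0) {
            throw new IllegalArgumentException("baseline throughput must be positive");
        }
        return scaledPerSec / baselinePerSec;
    }

    /** Efficiency = speedup divided by the broker factor; 1.0 would be perfectly linear scaling. */
    static double efficiency(double baselinePerSec, double scaledPerSec, int brokerFactor) {
        if (brokerFactor <= 0) {
            throw new IllegalArgumentException("broker factor must be positive");
        }
        return speedup(baselinePerSec, scaledPerSec) / brokerFactor;
    }

    public static void main(String[] args) {
        // Made-up numbers: completed workflow instances per second from your own runs.
        double oneBroker = 400.0;
        double threeBrokers = 900.0;
        System.out.println("speedup:    " + speedup(oneBroker, threeBrokers));
        System.out.println("efficiency: " + efficiency(oneBroker, threeBrokers, 3));
    }
}
```

Tracking the efficiency figure across broker counts and releases tells you how much of each added broker actually ends up as completed workflow instances.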

Raw performance, meaning how fast workflow instances complete, depends on several factors, including broker load. But I would say that if you assume a worst-case scenario for broker load, to account for the additional overhead of clustering, partitioning, replication, etc., then overall workflow complexity and the execution time of each job handled by a worker will carry greater weight in this equation. However, having a fast machine allows things to get done quicker.

Guidelines

I like equations, so here are some guidelines we use:

broker load = function (number of partitions + replications)

If you anticipate lots of workflow instances being started at once (the burst scenario) and are not overly worried about how fast workflow instances complete, because some jobs take a long time, be prepared to scale Zeebe horizontally.

number of workflow instances completed = function (broker load + flow complexity )

Same argument. If you are concerned about the number of workflow instances that can be started, such as when dealing with bursts, again horizontal scaling of Zeebe is a good thing.

execution time per workflow instance to complete = function (broker load + complexity of workflow)

If you want workflow instances to complete relatively quickly, then deploy your broker on beefy machines and ensure your jobs are not taking a long time. Also pay attention to variable sizes.

Summarizing notes for best practices:

Large documents incur a serialization hit, not to mention storage space. Think of the performance hit during replication as well.

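So leverage the fetchVariables API in the workers to fetch only the variables a job needs, passing each variable name as its own argument, as in (Java client; handler stands in for your own JobHandler):

client.newWorker().jobType("some-type").handler(handler).fetchVariables("only", "those", "you", "need").open();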

Small, quick jobs may appear to produce more chatter and create more events, but they give you better visibility and more places to optimize, and they free RocksDB from having to maintain many incomplete instances, which is also a price to pay during replication.

CPU speed will allow Zeebe to do things quicker. Fast memory goes without saying; sufficient memory capacity, however, allows a broker to hold more state, which means processing more workflow instances.

Do you have a benchmark that demonstrates Zeebe’s performance envelope in your scenario? Drop a link in the Zeebe Slack.