Announcing Zeebe 0.23.5 and 0.24.2

by Zeebe & Operate Team on Aug 17 2020 in Releases.

New patch releases for Zeebe are available now: 0.23.5 and 0.24.2. Both contain various bug fixes as well as minor enhancements. You can grab the releases via the usual channels.

Zeebe 0.24.2 is fully compatible with 0.23.4 and 0.24.1, as is 0.23.5 with 0.23.4. This means it is possible to perform a rolling upgrade for your existing clusters. For Camunda Cloud users, Zeebe 0.24.2 is already the latest production version, meaning your existing clusters should have been migrated by the time you see this post.

Without further ado, here is a list of the notable changes.

Zeebe 0.23.5

All bug fixes in 0.23.5 are part of 0.24.2 as well, except one, which was already part of 0.24.0. You can read about the other bug fixes below as part of the 0.24.2 release notes.

Multi-instance output element expression with space prefix leads to incorrect variable scope

When the output element expression of a multi-instance element directly refers to a variable or a property of a variable, a new variable with that name should be initialized to nil at the multi-instance body scope. This is especially important for parallel multi-instance elements, where a missing local variable is a common source of race conditions.

However, if the expression was prefixed with whitespace, the local variable would be wrongly initialized. This was already corrected in 0.24.0 via #4100 but had not yet been backported to 0.23.x until now.

Zeebe 0.24.2

Here are the highlights of the 0.24.2 patch; you can read more about all the fixes on the GitHub release page.

No leader elected on failover

In our long-running QA tests, we ran into an issue where the leader node got preempted, and yet the cluster would not elect a new leader for a partition until that node came back. Without a leader for a given partition, it’s impossible for Zeebe to make any progress, as no consensus can be achieved.

Digging into it, we found that the situation was very similar to a previously fixed issue, and it turned out to be an issue with the resource lifecycle management, which was promptly fixed.

Missing variables in message start event of event sub processes

When publishing a message to Zeebe, it’s possible to send a payload along with it. When the message is correlated to a workflow instance, that payload will be propagated to the scope of the message catch event, setting variables within that scope to the correct values. It turns out there was a bug specifically with message start events inside event subprocesses: when a message was correlated to one of them, no variables would be propagated to its scope. The variables are now properly set on the message start event scope, which means they will go through the output mappings and be correctly propagated.

Out of memory error on broker startup

When a broker starts, one of the first things it does is scan its log. This allows it to build up its log block index (a way to map logical addresses to physical addresses), as well as perform some consistency checks. Historically, it would also pass the events to the replicated state machine, which would use them to build up its state. This meant handing off each event to a different thread: to synchronize this, the event would be submitted to a queue, which was consumed by the other thread. On a very large log, it could happen that this queue was not consumed fast enough and grew large enough to cause out of memory errors. This is now fixed, and the behaviour will be completely refactored out in 0.25.0.
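To illustrate the failure mode, here is a minimal sketch of a producer/consumer hand-off where the queue is bounded, so a slow consumer makes the producer wait instead of letting memory grow. This is not the broker's actual code, nor necessarily how the fix was implemented; the class `BoundedHandoff`, the method `replay`, and the sentinel value are hypothetical names chosen for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration, not Zeebe code: a log scanner hands event
// positions to a state machine thread through a bounded queue.
final class BoundedHandoff {
  // Sentinel telling the consumer that the scan is finished (assumption of this sketch).
  static final long POISON = -1L;

  // Scans eventCount events and returns how many the consumer applied.
  static long replay(int eventCount, int capacity) {
    BlockingQueue<Long> queue = new ArrayBlockingQueue<>(capacity);
    long[] applied = new long[1];

    Thread consumer = new Thread(() -> {
      try {
        // take() blocks until an element is available.
        while (queue.take() != POISON) {
          applied[0]++;
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    consumer.start();

    try {
      for (long position = 0; position < eventCount; position++) {
        // put() blocks while the queue is full, capping memory at `capacity` entries.
        queue.put(position);
      }
      queue.put(POISON);
      consumer.join(); // join() makes the consumer's writes visible here
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted during replay", e);
    }
    return applied[0];
  }

  public static void main(String[] args) {
    System.out.println(replay(100_000, 64));
  }
}
```

With an unbounded queue (for example a `LinkedBlockingQueue` created without a capacity), the same loop never blocks the producer, which is the situation where a very large log can exhaust the heap.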

Endless loop on job activation if job does not fit in max message size

Zeebe makes certain assumptions about the maximum size of a single entry in the log. This is useful for various optimizations, but has some downsides, such as the one described in this issue. To activate a job, all of its variables have to be aggregated and written to a single entry in the log. This is necessary to ensure determinism, and also as a way to send a consistent view of the scope back to the worker activating the job. However, as the variables are aggregated from the current instance state, it was possible that the cumulative size of all those variables would exceed the maximum message size, which would prevent this job from being activated and could block progress for that instance.

From 0.24.2 and 0.23.5 on, in this case, an incident will be raised. In order to resolve it, you must either reduce the size of the current element instance state, or activate the job by specifying only a subset of these variables, at which point you can resolve the incident and the instance can make progress again.
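As a rough illustration of why fetching only a subset of variables helps, the sketch below sums the size of the variables that would be written for an activation and compares it to a limit. It is a simplification under stated assumptions: the names `ActivationSizeCheck`, `payloadSize`, and `fits` are hypothetical, sizes are counted as UTF-8 bytes of key and value strings, and an empty fetch list is taken to mean "all variables". The broker's real encoding and size accounting differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Zeebe code.
final class ActivationSizeCheck {
  // Sums key and value sizes for the variables selected by `fetch`.
  // Assumption of this sketch: an empty `fetch` selects every variable.
  static int payloadSize(Map<String, String> variables, Collection<String> fetch) {
    int size = 0;
    for (Map.Entry<String, String> entry : variables.entrySet()) {
      if (fetch.isEmpty() || fetch.contains(entry.getKey())) {
        size += entry.getKey().getBytes(StandardCharsets.UTF_8).length;
        size += entry.getValue().getBytes(StandardCharsets.UTF_8).length;
      }
    }
    return size;
  }

  // True if the selected variables fit into a single entry of maxSize bytes.
  static boolean fits(Map<String, String> variables, Collection<String> fetch, int maxSize) {
    return payloadSize(variables, fetch) <= maxSize;
  }

  public static void main(String[] args) {
    Map<String, String> variables = new LinkedHashMap<>();
    variables.put("orderId", "\"A-1\"");
    variables.put("scan", "\"a large document body that the worker does not need\"");

    int maxSize = 32; // made-up limit for the example
    System.out.println(fits(variables, Collections.<String>emptyList(), maxSize));
    System.out.println(fits(variables, Arrays.asList("orderId"), maxSize));
  }
}
```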

Broken rolling upgrade from 0.23.x to 0.24.1

There was an issue with the rolling upgrade between 0.23.x and 0.24.1, which only occurred if a 0.23.x broker attempted to send a snapshot to a newly upgraded broker during its restart. This was because the snapshot mechanism was improved between 0.23.x and 0.24.1, but there was no test for this particular case (replication during upgrade). A new regression test was added, and the rolling upgrade is now fixed between 0.23.x and 0.24.2.

Failure to export records to Elasticsearch with unicode characters

As part of 0.24.0, the Elasticsearch High-Level REST Client was dropped in favor of the barebones REST client, primarily due to the heavy number of dependencies brought in by the former. According to the Elasticsearch 6.8.x documentation, bulk insertion requests should be sent with Content-Type: application/x-ndjson; however, when doing so it was not possible to set the charset, and the Elasticsearch nodes defaulted to ASCII, resulting in the failure described in this issue. The fix was to switch to the accepted (but seemingly not documented) Content-Type: application/json, which let us specify the correct charset.
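The effect of a charset mismatch is easy to reproduce with the standard library alone. The sketch below encodes a bulk line as UTF-8 and decodes it with two different charsets; it does not talk to Elasticsearch, and the class `BulkCharsetSketch` and its methods are hypothetical. The exact header string built by `contentType` is an assumption about what specifying the charset looks like, not a copy of the exporter's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not the exporter's code.
final class BulkCharsetSketch {
  // Assumed header form: media type plus an explicit charset parameter.
  static String contentType(Charset charset) {
    return "application/json; charset=" + charset.name();
  }

  // Encodes the body as UTF-8 (what the sender does) and decodes it with the
  // charset the receiver assumes.
  static String roundTrip(String body, Charset receiverCharset) {
    byte[] wire = body.getBytes(StandardCharsets.UTF_8);
    return new String(wire, receiverCharset);
  }

  public static void main(String[] args) {
    // "Zürich" written with a Unicode escape to keep the source file ASCII-only.
    String line = "{\"city\":\"Z\u00fcrich\"}\n";

    System.out.println(contentType(StandardCharsets.UTF_8));
    // Receiver agrees on UTF-8: the text survives.
    System.out.println(roundTrip(line, StandardCharsets.UTF_8).equals(line));
    // Receiver assumes ASCII: the non-ASCII bytes cannot be decoded.
    System.out.println(roundTrip(line, StandardCharsets.US_ASCII).equals(line));
  }
}
```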

Segmentation fault on broker close

When shutting down the broker, there was previously a race condition in which the local state (a RocksDB instance) could be closed before the exporters. As a common exporter behaviour on shutdown is to update the latest exported position, this would sometimes cause segmentation faults as the RocksDB object would attempt to access memory that had been freed. This is now fixed by ensuring that resources are closed in the correct order.
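One common way to guarantee such an ordering is to close resources in the reverse of the order in which they were opened, so that anything depending on the state is shut down before the state itself. The sketch below shows that pattern; it is not the broker's shutdown code, and `OrderedShutdown` and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration, not Zeebe code: closers run in reverse
// registration order (last opened, first closed).
final class OrderedShutdown {
  private final Deque<Runnable> closers = new ArrayDeque<>();

  // Registers a close action; call this right after opening the resource.
  void register(Runnable closer) {
    closers.push(closer);
  }

  // Runs all close actions, most recently registered first.
  void closeAll() {
    while (!closers.isEmpty()) {
      closers.pop().run();
    }
  }

  // Opens "state" first and "exporters" second, then shuts down.
  static List<String> demo() {
    List<String> log = new ArrayList<>();
    OrderedShutdown shutdown = new OrderedShutdown();
    shutdown.register(() -> log.add("state closed"));     // opened first
    shutdown.register(() -> log.add("exporters closed")); // depends on state
    shutdown.closeAll();
    return log;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [exporters closed, state closed]
  }
}
```

Because the exporters are registered after the state they depend on, they are closed first and can still record their last exported position.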

Removing all exporters causes compaction to stop

One of the guarantees in Zeebe regarding durability is that data will not be deleted until it has been processed and exported. However, sometimes users will want to remove exporters which they are not using anymore. Removing an exporter from the configuration will do this, and should then allow Zeebe to start compacting even if the data has not been exported. Here we had a bug where, when removing all exporters from the configuration, some exporter-related services were not loaded, which caused Zeebe to incorrectly calculate what data could be deleted. This is now fixed from 0.24.2 and 0.23.5 on.
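The durability rule can be read as a lower bound: data is only deletable up to the smallest of the last processed position and the positions acknowledged by each configured exporter. The sketch below expresses that reading, including the case with no exporters at all. It is a simplified model, not the broker's implementation; `CompactionBound` and `compactablePosition` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

// Hypothetical illustration, not Zeebe code.
final class CompactionBound {
  // Returns the highest position up to which data may be deleted: nothing
  // beyond what has been processed, and nothing beyond what the slowest
  // configured exporter has exported.
  static long compactablePosition(long lastProcessed, Collection<Long> exporterPositions) {
    long bound = lastProcessed;
    for (long exported : exporterPositions) {
      bound = Math.min(bound, exported);
    }
    // With no exporters configured, only the processed position limits compaction.
    return bound;
  }

  public static void main(String[] args) {
    System.out.println(compactablePosition(100L, Arrays.asList(40L, 75L)));      // prints 40
    System.out.println(compactablePosition(100L, Collections.<Long>emptyList())); // prints 100
  }
}
```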

Get In Touch

There are a number of ways to get in touch with the Zeebe community to ask questions and give us feedback.

We hope to hear from you!