Announcing Zeebe 0.22.5 and 0.23.4
New patch releases for Zeebe are available now: 0.22.5 and 0.23.4. They contain various bug fixes as well as minor enhancements. You can grab the releases via the usual channels.
Zeebe 0.23.4 is fully compatible with 0.23.3, as is 0.22.5 with 0.22.4. This means it is possible to perform a rolling upgrade for your existing clusters. For Camunda Cloud users, Zeebe 0.23.4 is already the latest production version, meaning your existing clusters should have been migrated by the time you see this post.
Without further ado, here is a list of the notable changes.
The 0.22.5 patch contains 6 bug fixes, all of which are part of 0.23.4. You can read more about these in the 0.23.4 release notes below.
One important thing to note about 0.22.5: it is the last planned 0.22.x release. Once 0.24.0 is out, fixes will be applied only to the 0.23 and 0.24 branches (see our release cycle page for more).
Long polling blocked with available jobs
Long polling in Zeebe is a feature that was built for job activation. A client sends an ActivateJobsRequest to the gateway, and if no jobs are available after polling all partitions, the gateway parks that connection and waits for a notification from the broker that new jobs are available.
Previously, it could happen that connections were parked even though there were jobs available - this would result in the connection waiting unnecessarily and eventually timing out, which could have a non-negligible impact on performance.
The fix here is two-fold:
- When the gateway did not activate any jobs, and some partitions returned RESOURCE_EXHAUSTED as an error code, it will now close the connection and propagate the error. This allows the client to back off and react to the gateway’s back pressure.
- The gateway will now immediately enqueue requests that come in for long polling. This avoids a race condition where the broker would send a notification that more jobs were available after the initial polling, but before the connection was parked and waiting for such a notification.
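The two parts of the fix can be illustrated with a minimal sketch. It is written in Python for brevity and is not Zeebe's actual gateway code: the names `Outcome`, `decide`, and `PendingRequests` are hypothetical, and only the RESOURCE_EXHAUSTED error code comes from the description above.

```python
from enum import Enum


class Outcome(Enum):
    RETURN_JOBS = "return_jobs"
    PROPAGATE_ERROR = "propagate_error"
    PARK = "park"


def decide(partition_results):
    """Pick the gateway's reaction to one round of polling.

    partition_results is a list of (activated_job_count, error_code) pairs,
    one per partition; error_code is None when the partition answered normally.
    """
    if sum(count for count, _ in partition_results) > 0:
        return Outcome.RETURN_JOBS
    # No jobs were activated: surface back pressure instead of parking.
    if any(error == "RESOURCE_EXHAUSTED" for _, error in partition_results):
        return Outcome.PROPAGATE_ERROR
    return Outcome.PARK


class PendingRequests:
    """Hypothetical registry of long-polling requests.

    Requests are enqueued before polling starts, so a notification that
    arrives while polling is still in progress is recorded rather than lost.
    """

    def __init__(self):
        self._notified = {}

    def enqueue(self, request_id):
        self._notified[request_id] = False

    def notify_jobs_available(self):
        for request_id in self._notified:
            self._notified[request_id] = True

    def should_poll_again(self, request_id):
        # True if a notification arrived after this request was enqueued.
        return self._notified[request_id]
```

In this sketch, a request that finds no jobs is parked only if `should_poll_again` is false. Otherwise it polls again right away instead of waiting for a notification that has already been sent.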
Max message size not respected on write
There is a setting in Zeebe, maxMessageSize, which controls the maximum size of an entry in the replicated log. It’s used notably as an upper bound in the dispatcher - essentially a ring buffer - which is used to synchronize communication between multiple writers and a single reader.
There was a bug, however, where writers could write more than maxMessageSize, but readers would only read up to that limit, resulting in unreadable messages. It now behaves as expected: it’s not possible to write more than maxMessageSize to the dispatcher.
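The invariant that was restored can be sketched as follows. This is an illustrative Python model, not Zeebe's dispatcher: the `Dispatcher` class and its methods are hypothetical, and a simple queue stands in for the ring buffer.

```python
from collections import deque


class Dispatcher:
    """Toy stand-in for the dispatcher: several writers, a single reader."""

    def __init__(self, max_message_size):
        self.max_message_size = max_message_size
        self._entries = deque()

    def write(self, payload: bytes) -> None:
        # Enforce the limit on the write side, so that the reader never
        # encounters an entry larger than it is prepared to read.
        if len(payload) > self.max_message_size:
            raise ValueError(
                f"entry of {len(payload)} bytes exceeds "
                f"maxMessageSize of {self.max_message_size} bytes"
            )
        self._entries.append(payload)

    def read(self) -> bytes:
        return self._entries.popleft()
```

Rejecting the oversized entry at write time means the failure shows up at the writer, where it can be handled, rather than as an unreadable entry later on.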
Elasticsearch exporter bulk is not limited
To avoid overloading its Elasticsearch endpoint, the Elasticsearch Exporter buffers incoming records in memory and flushes them together as a batch. We noticed that when a flush operation failed (e.g. the database was down), the batch was correctly retained, but the exporter kept adding more records to it. In the long run, this could cause an out of memory error as more and more records were buffered.
The batch is now properly capped (1000 by default, but it’s configurable), and makes use of the built-in retry mechanism of the exporters. If it fails to flush, it will simply retry the current record with a small back off, but the in-memory batch size will not grow.
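A minimal sketch of this capping behaviour, in Python and under stated assumptions: `BatchingExporter`, `export`, and the `flush` callable are hypothetical names, and the caller is assumed to retry a rejected record after a back off. It is not the exporter's actual implementation.

```python
class BatchingExporter:
    """Buffers records and flushes them in bulk, never exceeding the cap."""

    def __init__(self, flush, max_batch_size=1000):
        self._flush = flush  # callable that sends a list of records
        self.max_batch_size = max_batch_size
        self.batch = []

    def export(self, record) -> bool:
        """Return True if the record was buffered.

        False means the batch is full and could not be flushed; the caller
        should retry the same record later.
        """
        if len(self.batch) >= self.max_batch_size and not self._try_flush():
            return False  # batch stays at the cap, the record is not added
        self.batch.append(record)
        return True

    def _try_flush(self) -> bool:
        try:
            self._flush(list(self.batch))
        except ConnectionError:
            return False  # keep the batch intact for the next attempt
        self.batch.clear()
        return True
```

The key point is that a failed flush rejects the current record instead of buffering it, so memory use stays bounded by the batch cap while the endpoint is unavailable.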
Error event is not caught by boundary event on multi-instance subprocess
Error events thrown in a multi-instance subprocess were previously not caught by boundary events attached to the subprocess. This could result in seemingly frozen workflow instances.
This is now fixed and error events are properly caught.
Follower partitions are falsely marked as unhealthy
Some follower partitions would occasionally report that they were unhealthy even though they were healthy, because a critical component was not properly removed.
It’s now correctly de-registered and the health status of each partition is correctly set.
Incorrect health status of leader partition
A leader partition in Zeebe is, among other things, the one which does the workflow processing. Previously, its health status only indicated whether or not it had successfully installed all its subcomponents. This could result in a false positive, where a subcomponent would fail but the partition would still report it was healthy.
The fix was to have the partition probe its components from time to time and correct its health status, such that it now correctly reflects its state.
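The difference between a status set once at install time and one refreshed by probing can be sketched like this. The `Partition` class and its `probe` method are hypothetical and written in Python for illustration; they do not mirror Zeebe's health monitoring API.

```python
class Partition:
    """Derives its health from periodic probes of its components."""

    def __init__(self, components):
        # components maps a component name to a callable returning True
        # when that component is currently healthy.
        self.components = dict(components)
        self.healthy = self.probe()

    def probe(self) -> bool:
        # Re-evaluate every component instead of trusting the status
        # computed when the components were installed.
        self.healthy = all(check() for check in self.components.values())
        return self.healthy
```

In a real system the probe would run on a timer. Here it is called explicitly, so a component that fails after installation is reflected the next time `probe` runs.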
Get In Touch
There are a number of ways to get in touch with the Zeebe community to ask questions and give us feedback.
- Join the Zeebe user forum
- Join the Zeebe Slack community
- Reach out to dev advocate Josh Wulf on Twitter
- Reach out to dev advocate Mauricio Salatino on Twitter
We hope to hear from you!