Incident Alert Exporter

by Josh Wulf on Feb 3 2020 in Resources, Use cases.

I took a break today from the article I’m working on about “Orchestrating GitHub Actions with Zeebe and Camunda Cloud” (stay tuned, because it is lit) to build an exporter for Zeebe, one that can alert you whenever an incident is raised - for example via Pushover, Pager Duty, or by calling you via the Twilio API.

If you just want to see the code, it is on GitHub: Zeebe Incident Alerter. There are a couple of videos of the stream of me coding it at the end of the post if you want to see that.

Using tutorials to write Zeebe extensions

I followed a couple of tutorials from June last year to accomplish it - Writing a Zeebe Exporter Part One and Part Two.

Not much has changed architecturally since then, and those articles - along with the intellisense in my IDE - got me through. This time I wrote it in Kotlin, which is my favorite JVM bytecode language.

I used an earlier article - Generating a Zeebe-Python Client Stub in Less Than An Hour: A gRPC + Zeebe Tutorial - to write the Zeebe Node client back in the day. These old-school tutorials are great, even if - maybe especially because - you have to do some exploring and understanding to figure out things that have changed since they were written. They act as maps, while letting you experience the thrill of exploration, and the accomplishment of discovery.

Exporter Configuration

One thing that took me some time to figure out - if you’re writing an exporter - is how to structure your config in the zeebe.cfg.toml file so that your exporter gets hydrated on load. I didn’t cover it in my articles last year, and the version I checked into the demo repo was malformed (!), which led me on a meandering trek through the broker source code before I looked at the Hazelcast exporter docs.

It looks like this, when it is working:

[[exporters]]
id = "IncidentAlerter"
className = "io.zeebe.IncidentAlerter"

    [exporters.args]
    url="https://your-webhook-endpoint"
    # token="Some optional token for authorization header"

Incident Alert Exporter

Anyhoo - today’s exporter calls a configured webhook whenever an incident is created. It’s brought to you by Dan Shapir, who asked in the Zeebe Slack about getting a pager alert whenever an incident is created. That’s dedication to Operations right there.

“That should be easy to do in an exporter!” I said, and dusted off last year’s tutorials.

The working exporter is now available on GitHub, with no warranty. It’s a great POC, and you could use it in a running system, or use it to build your own exporter.

Whenever an incident is raised on a broker with this exporter loaded, the exporter posts the incident record to a webhook. There you can configure whatever behaviour you want, based on the incident record.

This one works by filtering records and only accepting INCIDENT events. This is accomplished in a Context.RecordFilter implementation. That has two methods that you have to implement: acceptType and acceptValue.

The Incident Alerter RecordFilter filters for INCIDENT events like this (source):

override fun acceptValue(valueType: ValueType?): Boolean {
    return valueType?.equals(ValueType.INCIDENT) ?: false
}

It’s a standard filter predicate that should return true for any records that the exporter should process.
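As a self-contained illustration of both methods together - using stand-in enums rather than the real Zeebe exporter API types, so the shapes here are assumptions for the sketch - the filter looks roughly like this:

```kotlin
// Stand-ins for the Zeebe exporter API enums (assumed shapes, illustration only)
enum class RecordType { COMMAND, EVENT, COMMAND_REJECTION }
enum class ValueType { DEPLOYMENT, JOB, INCIDENT }

// acceptType lets every record type through, so both the command and the
// resulting event reach the export method; acceptValue narrows to INCIDENT.
class IncidentRecordFilter {
    fun acceptType(recordType: RecordType?): Boolean = recordType != null
    fun acceptValue(valueType: ValueType?): Boolean = valueType == ValueType.INCIDENT
}
```

Note that in Kotlin the null-safe `valueType == ValueType.INCIDENT` is equivalent to the `?.equals(...) ?: false` form above, because `==` on a nullable value already yields false for null.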

This predicate will actually pass two records to the exporter for every incident. In the predicate function you get only the ValueType to examine. In the export lifecycle you get the entire record, which includes a field called intent. Two records get passed in: one with the intent CREATE, the second with the intent CREATED.

This is the semantics of the Zeebe broker event log: a record that says “Hey, I got a command to create an incident”, followed by a record that says “Hey, so you know, I just created an incident”.

We don’t want to call the webhook twice, so we only act on the second intent (source):

if (record.intent == IncidentIntent.CREATED) {
    log.info(record.toString())
    postIncident(record)
}

The postIncident implementation uses the OkHttp library. It has an asynchronous API that manages its own thread pool.

All exporters run in a single thread, so blocking in an exporter can impact performance.
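The real exporter uses OkHttp; as a self-contained sketch of the same non-blocking pattern, here is an equivalent using the JDK 11 `java.net.http.HttpClient` (the URL and JSON payload are placeholders, and `buildIncidentPost`/`postIncidentAsync` are names made up for this sketch):

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Build a POST request carrying the incident record as JSON.
fun buildIncidentPost(url: String, json: String): HttpRequest =
    HttpRequest.newBuilder()
        .uri(URI.create(url))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build()

// Fire-and-forget: sendAsync runs on the client's own executor,
// so the single exporter thread never blocks on the network.
fun postIncidentAsync(client: HttpClient, request: HttpRequest) {
    client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
        .whenComplete { response, error ->
            if (error != null) println("Incident webhook failed: ${error.message}")
            else println("Incident webhook responded: ${response.statusCode()}")
        }
}
```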

It’s important to advance the exporter position whenever it gets handed a record to export. Records are not marked deletable from the event log until they have been processed by the stream processor and marked as exported by all exporters that see them. So if an exporter throws (say, because the webhook server is unreachable), and doesn’t catch the exception and advance the exporter position, then disk usage will grow.

The async HTTP post of OkHttp happens in another thread, so it doesn’t block the exporter, nor does a failure prevent the exporter position from advancing.
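That pattern can be sketched like this. The `Controller` here is a stand-in mock for the one the broker hands your exporter - the real interface does expose `updateLastExportedRecordPosition`, but everything else in the sketch is assumed:

```kotlin
// Mock of the broker-provided Controller, just enough to show the pattern.
class Controller {
    var lastExportedPosition: Long = -1
    fun updateLastExportedRecordPosition(position: Long) {
        lastExportedPosition = position
    }
}

// Always advance the position, even when dispatching the alert throws,
// so the broker can truncate the event log and disk usage stays bounded.
fun exportRecord(controller: Controller, recordPosition: Long, post: () -> Unit) {
    try {
        post() // hand off to the async HTTP client
    } catch (e: Exception) {
        println("Failed to dispatch incident alert: ${e.message}")
    }
    controller.updateLastExportedRecordPosition(recordPosition)
}
```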

Planning for failure modes

It does raise an interesting problem though - if the incident alert could not be communicated to the configured webhook, what happens?

Should it retry? If it does, for how long? What happens to broker resources if there are many incidents and no webhook server alive?

If the exporter position is advanced and the retries are held in memory, at what point do we exhaust threads / memory? How does it impact the overall performance of the broker and other exporters? What happens to these in-memory retries if this broker is rescheduled on Kubernetes?

If we don’t advance the exporter position, disk usage will grow. How do we manage the retries? The exporter thread is not the place to do back-off retries, so we will have to manage some other thread to do this. This is getting complicated…

10% of programming is getting it to work. The other 90% is coding for when it doesn’t work.

I did some experiments with Kotlin coroutines, and thread pooling, then realised that the complexity I was introducing was multiplying potential failure modes, and doing it in the core of the system - in the broker. This kind of thing is relatively simple in Node.js (at least the mechanics of multiple retry-state-machines with a single thread) with the event loop and intrinsic async, but on the JVM it is a challenge, with many sharp edges you can cut yourself on. Thread management is the memory management of the 21st century.

And at the end of the day, I’m not even clear that I’m solving the right problem, in the right way, or in the right place. In this case, the effort required to do it may have saved me from implementing the wrong solution in the wrong part of the system.

Atwood’s Law - “Any application that can be written in JavaScript, will eventually be written in JavaScript” - has a corollary: “Just because something can be done, doesn’t mean it should be.”

So I opted for trying once, and logging an error if it doesn’t succeed.

This means that you don’t have a fail-safe system for alerting on incidents with this exporter - but at least when it fails it won’t bring the broker down with it.

Distributed systems need distributed monitoring

Distributed systems are inherently hard, and require thoughtful compromise. They involve trade-offs, consistency vs speed being a key one. Any solution to monitoring a distributed system needs to be multi-dimensional and layered, to make sure that the monitoring itself is actually working. The webhook server needs an alert when it goes down. You also need to be alerted when the webhook server is up, but the exporter cannot reach it.

While this is happening, you need a back-up system to check for incidents. It could be manually looking at Operate, or something that reads the Elasticsearch export.

In my running Zeebe systems, I use canaries - processes that send pulses through alerting systems on a timer, and services that alert when a heartbeat is missed. This lets me know that the silence from my alerting systems is not because they have failed.

In the case of this exporter: a process with a timer start event that raises - and immediately resolves - a special heartbeat incident, plus a code path in the webhook server that pings a healthcheck service like healthchecks.io on these “heartbeat” incidents, will alert you if the exporter -> webhook link has gone down. It will also alert you if the broker has gone away.
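On the webhook-server side, that heartbeat scheme boils down to one routing decision. A minimal sketch, where the process id `heartbeat-canary`, the record shape, and the function names are all assumptions, not taken from the exporter code:

```kotlin
// Assumed shape of the incident payload fields we care about.
data class IncidentAlert(val bpmnProcessId: String, val errorMessage: String)

// Heartbeat incidents ping a healthcheck service (e.g. healthchecks.io);
// everything else triggers a real alert. A missed ping means some link in
// the broker -> exporter -> webhook chain is down.
fun route(incident: IncidentAlert): String =
    if (incident.bpmnProcessId == "heartbeat-canary") "PING_HEALTHCHECK"
    else "SEND_ALERT"
```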

Live stream

I live-streamed a lot of the coding of this exporter on my Twitch.tv channel. The Zeebe coding sessions are archived in the Live Coding on Zeebe playlist on YouTube. These videos are documentary proof that you don’t have to be an expert to get things done with Zeebe!

Part One

Part Two