3 reasons why Webhooks suck and 2 Masterclasses to replace them
The most popular way for different services to send messages to each other should never have existed. We review why, and how we can do better, taking real-world implementations as examples.
The most common way for independent services to exchange messages — even more so on public APIs — is the webhook. A beauty of simplicity: you provide a URL that you want notified when an event occurs, and the other service simply has to make an HTTP call to it. Except, not really.
Webhooks suck
In the shadow of this superficial simplicity lurk major problems that make webhooks hard to use well on both ends.
Not missing a drop
First of all, the most basic prerequisite for a webhook to work is that the receiving end is able to receive, meaning that the web service must be up and running. But what happens if maintenance is ongoing, a technical issue takes the server down, or the network connection fails for an instant?
The message is purely and simply lost. It's a Byzantine fault: how can you know that a message was sent if the sender is unable to contact you in the first place?
In order to remedy this, most providers resort to retry mechanisms, which are fairly complex to implement: you need to store, somewhere, the fact that a given set of messages will have to be delivered in the future, and wake up accordingly. Most queuing systems struggle to do this reliably because they work in "at least once" mode, meaning the same message could be sent twice. You can decide you don't care, but then your client has a problem on their side.
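To give an idea of what even the naive version involves, here is a minimal sender-side sketch in Python; the `deliver` helper and its backoff policy are illustrative, not any provider's actual implementation:

```python
import time

import requests


def deliver(url: str, payload: dict, max_attempts: int = 5) -> bool:
    """POST a webhook payload, backing off exponentially between attempts."""
    for attempt in range(max_attempts):
        try:
            resp = requests.post(url, json=payload, timeout=5)
            if resp.ok:
                # A 200 only proves that *something* answered,
                # not that the right service received the message.
                return True
        except requests.RequestException:
            pass  # network error: treat it like any other failed attempt
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... before the next try
    # Delivery failed: the failure now has to be persisted and rescheduled,
    # which is exactly where queuing systems start to struggle.
    return False
```

And this is only the happy-ish path: a real implementation still has to survive its own process restarting mid-loop.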
Another issue: if you are doing maintenance on your server, maybe you configured something wrong and it ends up responding 200 while the wrong service is actually receiving the messages. In that case the message is simply obliterated: the sender thinks it was received, and the receiver has no idea it even exists.
Avoiding flashbacks
This retry logic also amplifies another danger: messages can very well arrive out of order, and for several different reasons.
For example, suppose your retry mechanism treats all messages as independent. If the receiver becomes unavailable for a while, it risks getting the missing messages after it has already started catching up with newer ones. You can leave the receiver with the burden of reordering, but honestly, that burden will get ignored most of the time.
Ordering can also break if your receiver operates at a larger scale and has at least two web servers: if two of your messages arrive at the same time and get processed by two different processes simultaneously, there is no saying which one will be dealt with first.
Harder to develop
Now, this is a practical rather than theoretical consideration, but most of the time developers won't have the luxury of a public IP address on their development machine. That is a big problem, since with webhooks the sender initiates the network connection towards you, meaning you will probably end up resorting to tools like HTTP tunnels.
On top of that, your code needs to be aware of its own public URL, which is hard to do automatically. With a regular API you never need to declare your public address; with webhooks you need to know it and register it, often through clunky back-offices or with propagation delays.
As a result, you end up with an extra configuration variable you could probably avoid otherwise, you probably need some manual setup, and on top of that, free plans of popular HTTP tunnels change your URL every time, so you may end up updating it constantly.
There are alternatives
How do we handle this better than with webhooks? First, realize that you are actually trying to solve two separate problems:
Knowing that there is at least one update pending — when an event occurs, your code needs to wake up and do its job, preferably as fast as possible after said event.
Synchronizing state — the end goal is to have different systems converge to the same state, whether it's knowing if the user wants the light on or getting the full status of a shared online document.
Waking up remote code
The most naive thing you can come up with is polling: every X seconds, you check if updates are available. This is however considered wildly inefficient:
The cost of establishing a connection is pretty high relative to other options¹.
You won't get the updates "in real time" but only whenever you poll.
That's why polling is rarely recommended, and both software and hardware architectures have been designed to avoid it. Simplifying to the extreme, modern computers are input-driven: a physical electrical signal on your network card triggers a processing chain that eventually wakes up the relevant process, all the way up to your favorite abstraction in Python, JS or any other language.
This is what makes webhooks attractive: a remote computer can wake up your local process. But it's not the only way to do it. If you open a network connection from your local machine to the remote API (extremely easy to do, even without a public IP address), then as long as the connection is up, packets can flow both ways.
WebSockets were invented exactly for this. They are an easy way for a client, typically behind a NAT or a proxy, to connect to a server and receive real-time updates. That would be my go-to option for waking up remote code.
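For instance, assuming a hypothetical endpoint that pushes updates over a WebSocket, the entire client fits in a few lines with Python's `websockets` library:

```python
import asyncio

import websockets  # pip install websockets


async def listen() -> None:
    # The client initiates the connection, so this works fine
    # behind a NAT or a proxy, with no public IP address needed.
    async with websockets.connect("wss://api.example.com/updates") as ws:
        async for message in ws:  # wakes up whenever the server pushes
            print("update received:", message)


asyncio.run(listen())
```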
Alternatively, before WebSockets we used a technique called "long polling". The idea is to make a regular HTTP query that hangs for a long time (typically minutes) until an update happens, at which point the query returns with the message. A bit messy, but almost as efficient as WebSockets if your throughput is not very high, and no more costly than webhooks.
When implementing this kind of technique, keep in mind that you will be maintaining one full TCP connection with every single client. That used to be a challenge; it is becoming quite easy these days if you can use an async infrastructure.
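As a sketch of why this got easy, here is a minimal async server with the `websockets` library (assuming a recent version, where the handler receives only the connection); every client costs you one lightweight coroutine, not one thread:

```python
import asyncio

import websockets


async def handler(connection) -> None:
    # One lightweight coroutine per client: thousands of concurrent
    # connections are fine within a single async process.
    async for message in connection:
        await connection.send(f"ack: {message}")


async def main() -> None:
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever


asyncio.run(main())
```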
You can also turn towards dedicated services like Google's Pub/Sub, AWS EventBridge or countless others. For example, Shopify offers webhooks but recommends notifications through AWS and Google. It's much the same as handling the WebSocket yourself, except someone else manages the scale for you.
Staying on the same page
Distributed systems are notoriously hard, and I am not aware of a universal law that covers every situation, especially as you scale up. However, it usually boils down to the same core idea, which can be remixed at will to fit the project's needs.
Consider that your data model is a bit like a Git repository: at a point in time the source code has a given state, but a series of edits had to happen to reach it. Put differently, if you replay all the edits, you get the state of the code at that point in time.
So the key is to identify which edits happen in your model, convert them into a stream of events, and re-compose them on the other side. This can be more or less difficult to achieve: Google Wave used Operational Transformation, which took two years to develop, but if you're just dealing with a messaging app your life should be much simpler.
Now imagine all those edits as a sequential log. As you read the log, you keep track of a cursor pointing to the latest known edit. When you are notified of a new event, you read starting from that cursor.
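A minimal sketch of the consuming side, where the event log, `fetch_events` and the cursor handling are simplified in-memory stand-ins for your real transport and storage:

```python
from dataclasses import dataclass


@dataclass
class Event:
    id: int        # position in the sequential log
    payload: dict  # the edit itself


LOG: list[Event] = []  # stand-in for the provider's edit log
cursor = 0             # latest edit we know we applied; persist this durably


def fetch_events(since: int) -> list[Event]:
    """Stand-in for a network call: everything after our cursor."""
    return [e for e in LOG if e.id > since]


def apply_edit(event: Event) -> None:
    """Mutate local state, Redux-style; here we just print."""
    print(f"applying edit {event.id}: {event.payload}")


def on_notification() -> None:
    """Run whenever we learn that at least one update is pending."""
    global cursor
    for event in fetch_events(since=cursor):
        apply_edit(event)
        cursor = event.id  # advance (and persist) only once the edit is applied
```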
This resolves a lot of issues raised earlier:
By using edit logs, your communication protocol basically writes itself and stays fairly simple. If you're used to Vuex or Redux, it's essentially the idea behind mutations.
The cursor lets you know where you are in the update stream. If you lost a notification because your program was down or crashed, you can catch up from your latest known state.
Even if the transmission of messages fails, you can easily have a retry mechanism to eventually get up to date.
There is no risk in getting the same message twice: since messages are sequential, numbered items, duplicates are trivially detected.
Looking at WhatsApp's WebSocket communications, you can presume they use this kind of strategy. It is even what enables them to offer end-to-end encryption with consistent shared state between participants and devices, while the servers remain completely oblivious to the actual content of conversations.
Masterclasses
Having recently interacted with different APIs, two of them really stand out in my opinion, showing how a public API can avoid the pitfalls explained earlier. I picked them because the choices they made highlight how you can implement things correctly while keeping them simple.
Telegram
The world of instant messaging is highly competitive, with all major players pushing their platforms as hard as they can. Facebook has the two most popular platforms — WhatsApp and Messenger — whereas the third one is a pure player gaining traction through its strategy alone².
One part of this strategy is an amazing bot experience, allowing developers to create real-time applications with very little effort. This is particularly prominent in the cryptocurrency world, but it is also, for example, a tool heavily used in Ukraine to follow bombing threats.
The basic idea of Telegram is pretty simple. You have different conversations to which you add messages. Then more complex things can happen: people adding reactions, messages being edited, users clicking buttons, and so on. All of these are listed and documented as updates.
Now the interesting part. How do you get those updates?
The first method is the webhook. As you know by now, it sucks. The more interesting one is the long-polling getUpdates call, which combines two techniques explained earlier:
Long polling — the HTTP call will hang until either an update or a timeout happens. Not as efficient as WebSockets but very easy to implement because you can do it with literally any HTTP client ever written. And of course it works from a private IP address.
Cursor — the call takes an offset argument, which corresponds to the ID of the last update you received. This is a smart way to get you to acknowledge the previous messages and receive the new updates in one single call.
On the other hand, if you pass an offset of 0, it resumes from the last offset that was used. This means that if you restart your app you don't need to remember the last offset, which is incredibly convenient.
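Putting the two together, a from-scratch Python client can be as small as this; the token is a placeholder and the handling is a bare print:

```python
import requests

TOKEN = "123456:ABC-DEF..."  # placeholder: your bot token
API = f"https://api.telegram.org/bot{TOKEN}"

offset = 0  # 0 resumes from the last confirmed update, as described above
while True:
    # Long polling: the request hangs for up to `timeout` seconds
    # until an update arrives, then returns immediately.
    resp = requests.get(
        f"{API}/getUpdates",
        params={"offset": offset, "timeout": 50},
        timeout=60,  # client timeout slightly above the long-poll timeout
    )
    for update in resp.json()["result"]:
        print("got update:", update)      # your actual handling goes here
        offset = update["update_id"] + 1  # acknowledges everything up to here
```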
As a result, developing a client for the Telegram Bot API is a very smooth and simple experience. All you need is an HTTP client and a tiny wrapper around it to get started. You can use a lib of course, but implementing a client from scratch is an easy task both in terms of code (no need for crazy libs) and of infrastructure (almost no constraints).
Plaid
If you have never heard of Open Banking, it's basically all the banks in the world somewhat converging towards standardized, modern APIs for all their services. At least in theory; in practice the capabilities and implementation details vary greatly from country to country, and instead of a truly open standard you go through middlemen such as Plaid. This is not my field of expertise so I can't go into the details, but I can say that Plaid does a great job at converting dinosaur banks into REST APIs.
They have a wide range of APIs, but the one I'm interested in is the Transactions API. The most interesting information about a bank account, especially if you are building a personal finance app, is the list of transactions that happened there.
One of three things can happen, with examples:
A new transaction happened (you bought something)
A transaction got modified (exchange rate got finalized)
Or it can be deleted (transaction was not captured in the end)
In the case of Plaid, they work a lot with batches. I don't even want to know how they receive those transactions, but if you told me they came from a latin-1-encoded CSV file dropped on an FTP server every 3 hours, I would not be surprised. As a result, it's much less real-time-ish than Telegram, and shipping every event individually would not be particularly relevant.
Instead, they give you a cursor (up to you to keep track of it in that case) along with aggregated added/modified/removed transactions, which makes it very easy to update your own database. If you only had the list of latest transactions, you would have to diff against your DB to know what to create, update or delete. Here you can blindly do a bulk insert, a bulk update and a delete: 3 SQL queries maximum and done.
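A sketch of what consuming their /transactions/sync endpoint can look like; the credentials are placeholders and the bulk_* helpers are hypothetical stand-ins for your real SQL:

```python
import requests

PLAID_URL = "https://sandbox.plaid.com"  # Plaid's sandbox environment


# Hypothetical database helpers: one bulk statement per batch, no diffing.
def bulk_insert(rows: list) -> None: print(f"insert {len(rows)} rows")
def bulk_update(rows: list) -> None: print(f"update {len(rows)} rows")
def bulk_delete(rows: list) -> None: print(f"delete {len(rows)} rows")


def sync_transactions(access_token: str, cursor: str = "") -> str:
    """Pull all pending change batches, apply them, return the new cursor."""
    while True:
        data = requests.post(f"{PLAID_URL}/transactions/sync", json={
            "client_id": "...",          # placeholder credentials
            "secret": "...",
            "access_token": access_token,
            "cursor": cursor,            # empty on the very first sync
        }).json()
        bulk_insert(data["added"])
        bulk_update(data["modified"])
        bulk_delete(data["removed"])
        cursor = data["next_cursor"]     # persist this between runs
        if not data["has_more"]:
            return cursor
```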
The only issue I have with their system is that… It’s based on webhooks 😓
But that's not causing much harm. It does mean you need to set up an HTTP tunnel before developing against their API, but because of this sync method you avoid all the other drawbacks pretty easily. You can even poll the API once a day if you don't care about being "as fast as possible".
Takeaway
Webhooks suck because they bring a horde of subtle yet annoying problems. Most queue systems are either "at most once" or "at least once"; webhooks are "probably once 🤞🏻", and they bring a terrible developer experience along with them.
But what we really need to do is decouple two problems: waking up remote code, and synchronizing state.
Waking up remote code is fairly easy now that async architectures are widespread: you can either rely on an external cloud provider or simply let people open WebSockets to you.
And then regarding state synchronization, most likely you want a somewhat linear sequence of events to be streamed to your consumer, relying heavily on the concept of cursors to let remote code communicate its current knowledge of the state.
At the end of the day, if you are making a public API, the developer experience is going to matter a lot, and in the present case it involves two main elements:
How complicated is the code using your API going to be? The lighter the required wrapping and the lighter the data post-processing, the better.
How hard will the infrastructure problems be? State to keep, network flows, etc. Keep in mind that most apps start small, so optimize for small operations rather than world-scale conglomerates.
So if you are making a public API — for the wide web to use or simply for other parts of your company — please think hard about how you can make the lives of your peers easier and safer!
¹ It's not that high; I still do a lot of polling when I'm short on time, and it makes almost no difference to the result.
² I'm not making any judgement or recommendation here. You can be pretty sure that half the secret services in the world read your Telegram messages, but it is a massive platform on which you can build many interesting things.