Design a Resilient Message Sending Service

Ensure that messages are guaranteed to be sent, and never lost

Build your integration to be resilient to errors by implementing a message queue and a retry mechanism with exponential backoff

To ensure that clients don't lose any messages or become alarmed by minor glitches in the 360dialog <> partner connection, two patterns can be combined to create a more resilient and reliable messaging system: a retry mechanism and a message queue.

Your hosted WABA instance combines message queues and retry mechanisms to ensure that callbacks are always delivered to your webhook endpoint, and messages are always delivered to the WhatsApp network. To bring the same level of reliability to your own systems and integrations, we strongly encourage you to implement a similar solution.

Let’s break down the problem and analyse it step by step.

Different fault types

Clarity on fault types is essential for identifying the right solution. Faults come in three forms: transient, intermittent, and permanent.

Transient and intermittent faults are time-bound disruptions. Examples include a service glitch that requires a restart, a temporary loss of connectivity between your systems and 360dialog, or scheduled maintenance procedures that result in weekly service restarts.

Permanent faults, on the other hand, persist until the faulty component is fixed. Examples include long-lasting connectivity issues between your ISP and 360dialog, a full-scale malfunction of a cloud provider, and a persistent resource shortage in 360dialog systems that causes all API requests to fail for an extended period.

How to solve transient and intermittent faults - Retry Mechanism

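For transient and intermittent faults, you can establish a retry mechanism in which failed requests are attempted again until they succeed. Every time a retry is initiated, the waiting time is increased by a factor of 2 (for example 1, 2, 4, then 8 seconds), as demonstrated in the diagram below. This doubling of the waiting time is what is meant by exponential backoff.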

Warning: Without an exponential backoff, your integration may send too many requests to 360dialog's systems, triggering rate limiting that can temporarily prevent access to the API (waba.360dialog.io) for your system.
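The retry behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not 360dialog's implementation: the function name retry_with_backoff, its parameters, and the simulated sender are all assumptions made for the example. The sleep function is injectable so that the demonstration runs without actually waiting.

```python
import time


def retry_with_backoff(send, max_attempts=5, base_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call send() until it returns True or max_attempts is reached.

    Returns (success, delays), where delays lists the seconds waited
    between attempts. All names here are hypothetical.
    """
    delays = []
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if send():
            return True, delays
        if attempt == max_attempts:
            break  # no point in waiting after the final attempt
        sleep(delay)
        delays.append(delay)
        delay *= factor  # double the waiting time before the next retry
    return False, delays


# Simulated sender (an assumption): fails three times, then succeeds.
outcomes = iter([False, False, False, True])
ok, waited = retry_with_backoff(lambda: next(outcomes), sleep=lambda s: None)
print(ok, waited)
```

In a real integration, send would wrap the HTTP request to the API, and sleep would be left at its default so that the process really pauses between attempts.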

How to solve permanent faults - Message Queue

While a retry mechanism with exponential backoff may be sufficient for transient and intermittent faults, especially at low message volumes, it may not be able to deal with permanent faults. A retry mechanism that keeps pending messages in memory will eventually consume all available memory, causing it to fail and potentially lose messages.

A software design pattern that can be implemented to handle such cases is a message queue.

A message queue is an asynchronous service that facilitates the transfer of data between two points. The entity that initiates the message is referred to as the Producer, and the recipient is known as the Consumer.

Message Queue Functionality

The Producer begins the process by adding a message to the message queue, where it remains until it is retrieved by the Consumer. The Consumer retrieves each message and attempts to send it to the 360dialog API. If the message is successfully sent, it is removed from the message queue. In the event of failure, the message is returned to the queue, ensuring that no unsent messages are ever lost. The implementation of this process is demonstrated in the following diagram.

Implementing this solution will guarantee that the messages will always be delivered regardless of the destination status, as the messages will sit in the message queue until the destination is reachable.
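The Producer and Consumer flow described above can be sketched as follows. The class InMemoryQueue and the function consume_once are hypothetical names chosen for this example, and the in-memory deque is only a stand-in to show the flow: a real queue should be persistent, for the reasons given earlier.

```python
from collections import deque


class InMemoryQueue:
    """Illustrative stand-in for a persistent message queue."""

    def __init__(self):
        self._items = deque()

    def put(self, message):
        self._items.append(message)

    def get(self):
        # Return the oldest message, or None when the queue is empty.
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


def consume_once(queue, deliver):
    """Take one message and try to deliver it; re-queue it on failure."""
    message = queue.get()
    if message is None:
        return False
    if deliver(message):
        return True  # delivered, so the message stays removed
    queue.put(message)  # failed, so the message goes back to the queue
    return False


queue = InMemoryQueue()
queue.put({"to": "123", "text": "hello"})           # Producer side
first = consume_once(queue, lambda m: False)        # destination unreachable
pending_after_failure = len(queue)
second = consume_once(queue, lambda m: True)        # destination reachable
pending_after_success = len(queue)
```

The message leaves the queue only on the second attempt, when delivery succeeds.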

Note: Failed messages should only be returned to the message queue when the failure is due to a server error (5XX status). Client errors (4XX status) signify that there is something wrong with the message request itself (e.g. sending to the wrong URL), so the message will always be rejected by 360dialog. Messages that fail with a client error must be removed from the queue and handled elsewhere.
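One possible way to encode the rule from the note is a small function that maps the HTTP status code to a queue action. The function name and the action labels are assumptions for this sketch; sending anything that is neither 2XX nor 5XX to a separate place for inspection is likewise a design choice of the example, not a documented 360dialog behaviour.

```python
def queue_action_for_status(status_code):
    """Decide what to do with a message based on the HTTP status code.

    Returns one of three hypothetical action labels:
      "delete"      - delivered, remove the message from the queue
      "requeue"     - server error, return the message to the queue
      "dead_letter" - remove from the queue and handle elsewhere
    """
    if 200 <= status_code < 300:
        return "delete"
    if 500 <= status_code < 600:
        return "requeue"
    # 4XX client errors, plus anything unexpected (assumption), are not
    # retried because the same request would be rejected again.
    return "dead_letter"


for code in (200, 404, 503):
    print(code, queue_action_for_status(code))
```

Messages labelled "dead_letter" would typically be stored and surfaced to whoever can correct the request.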

Implementing a comprehensive solution

For maximum resilience and reliability, we suggest combining a message queue with a Consumer that implements a retry mechanism with exponential backoff. The message should only be removed from the queue once successful delivery is confirmed. This way, your system will also be able to handle its own malfunctions: if the Consumer fails partway through a delivery, the message is still in the queue.

Implementation examples and suggestions

To see examples of how to implement a retry mechanism with exponential back-off, we recommend checking out the following resources:

When implementing a message queue, it's important to note that it doesn't have to be built using commonly-used tools like Celery and Redis, RabbitMQ, or Kafka. For example, the widely used mail server Postfix implements a file-based approach that leverages the atomic file rename and move characteristics of UNIX file systems.
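To illustrate the idea of a file-based queue built on atomic renames, here is a minimal sketch. It is not Postfix's actual layout or code: the directory names tmp, new, and active, and the functions enqueue and claim, are assumptions for this example. It relies on os.replace, which is atomic on POSIX systems when source and destination are on the same file system.

```python
import os
import tempfile
import uuid


def enqueue(queue_dir, payload):
    """Write the message under tmp/, then atomically move it into new/.

    Consumers only look at new/, so they never see a half-written file.
    """
    name = uuid.uuid4().hex
    tmp_path = os.path.join(queue_dir, "tmp", name)
    new_path = os.path.join(queue_dir, "new", name)
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, new_path)
    return name


def claim(queue_dir):
    """Atomically move one message from new/ to active/ and return its path."""
    new_dir = os.path.join(queue_dir, "new")
    for name in sorted(os.listdir(new_dir)):
        source = os.path.join(new_dir, name)
        target = os.path.join(queue_dir, "active", name)
        try:
            os.replace(source, target)
        except FileNotFoundError:
            continue  # another Consumer claimed this file first
        return target
    return None


queue_dir = tempfile.mkdtemp()
for sub in ("tmp", "new", "active"):
    os.makedirs(os.path.join(queue_dir, sub))

enqueue(queue_dir, '{"to": "123", "text": "hi"}')
claimed_path = claim(queue_dir)
with open(claimed_path, encoding="utf-8") as handle:
    claimed_payload = handle.read()
os.remove(claimed_path)  # delete only after delivery has succeeded
```

On a failed delivery, the Consumer would move the file from active/ back to new/ instead of deleting it. Note that the random file names used here do not preserve submission order.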

You could also implement a message queue on a SQL database, similarly to how it is implemented in the WABA instance itself.
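As a rough sketch of a SQL-backed queue, the following uses Python's built-in sqlite3 module. The table name outbox, its columns, and the helper functions are assumptions for this example and do not reflect the schema used inside the WABA instance. Each row stores the current backoff and the earliest time at which the next attempt may be made, so the Consumer never has to sleep while holding a message.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    """Open (and create if needed) a hypothetical 'outbox' queue table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " backoff_time INTEGER NOT NULL DEFAULT 0,"
        " next_attempt_at REAL NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def add_message(conn, payload):
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def get_due(conn, now=None):
    """Return (id, payload, backoff_time) for messages that may be sent now."""
    now = time.time() if now is None else now
    return conn.execute(
        "SELECT id, payload, backoff_time FROM outbox"
        " WHERE next_attempt_at <= ? ORDER BY id",
        (now,),
    ).fetchall()


def delete_message(conn, message_id):
    with conn:
        conn.execute("DELETE FROM outbox WHERE id = ?", (message_id,))


def reschedule(conn, message_id, backoff_time, now=None):
    """Keep the message in the queue and postpone its next attempt."""
    now = time.time() if now is None else now
    with conn:
        conn.execute(
            "UPDATE outbox SET backoff_time = ?, next_attempt_at = ?"
            " WHERE id = ?",
            (backoff_time, now + backoff_time, message_id),
        )
```

Passing a file path instead of ":memory:" to open_queue makes the queue survive process restarts, which is the point of using a database here.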

This Python library implements a queue that can be configured to work on files, sqlite, or MySQL.

If you're working in a cloud environment, you can explore the message queue options available in services like GCP, AWS, or Azure.

Pseudocode Implementation of a Message Sending Queue, with a Built-in Retry Mechanism

The first step of sending a message is to add the message to the message queue. Here, message_queue stands for whichever queue implementation you choose.

import message_queue

def send_message(message):
    message_queue.add_message(message, backoff_time=0)

The second step runs in a separate, dedicated process: the main function retrieves messages from the queue and attempts to send each one. If a message is sent successfully, it is deleted from the queue; otherwise it is returned to the message queue with an increased backoff time.

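import time

import message_queue  # placeholder for your own queue implementation

BACKOFF_LIMIT = 86400  # upper bound for the backoff: 24 hours, in seconds


def deliver_message_to_waba_api(message):
    # Implementation of message sending via the 360dialog API.
    # Must return True if the message was accepted, False otherwise.
    raise NotImplementedError


def send_message(message, backoff_time):
    # Wait out the backoff accumulated by previous failed attempts.
    if backoff_time:
        time.sleep(backoff_time)

    is_sent = deliver_message_to_waba_api(message)
    if not is_sent:
        if backoff_time == 0:
            backoff_time = 1

        # Double the backoff time, but never let it exceed 24H.
        # In some cases it could make sense to remove the message from the queue
        # after a certain time limit.
        backoff_time = min(backoff_time * 2, BACKOFF_LIMIT)

    return is_sent, backoff_time


def main():
    # Each queue entry carries the message and its current backoff time.
    messages = message_queue.get()
    for message, backoff_time in messages:
        is_sent, backoff_time = send_message(message, backoff_time)
        if is_sent:
            # Remove the message only once delivery is confirmed.
            message_queue.delete_message(message)
        else:
            message_queue.return_to_queue(message, backoff_time)


if __name__ == "__main__":
    main()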
