
In a distributed system like Apache Kafka, the "fire-and-forget" principle is a recipe for data loss and inconsistent states. When a producer sends a critical message – whether it’s a financial transaction, a sensor value from a wind turbine, or important logistics information – you as a developer or architect must guarantee that this message reaches the cluster safely and reliably.
Without the right configurations, a network outage or broker failure can cause messages to either be completely lost or duplicated unnoticed. This not only undermines data integrity but can also lead to incorrect business decisions and loss of customer trust. In this guide, we’ll dive into the technical configurations that are essential for guaranteed and loss-free message transmission.
acks=all and Idempotence
The first lever for reliable producing is the acknowledgment strategy (acks). This parameter defines how many acknowledgments the producer waits for before considering a message successfully sent.
acks=0: The producer sends the message and does not wait for any acknowledgment. Maximum performance, but high risk of data loss.
acks=1: The producer waits only for the acknowledgment from the leader broker of the respective partition. This was the default until Kafka 3.4. The message is safe only as long as the leader does not fail right after acknowledging the write but before the followers have replicated it.
acks=all (or -1): The producer waits until the leader and all In-Sync Replicas (ISRs) have received the message. This is the new default since Kafka 3.4 and provides the highest guarantee against data loss.
For maximum reliability, acks=all is therefore mandatory. But this alone is not enough. What happens when the producer doesn’t receive an acknowledgment due to a temporary network problem and resends the message? Without further precautions, you would have the message duplicated in the topic.
This is where idempotence comes into play. By setting enable.idempotence=true (default since Kafka 3.4), the producer assigns a sequence number to each message. The broker stores the last sequence number per producer and partition and automatically discards duplicates.
With the combination of acks=all and enable.idempotence=true, you ensure that messages are guaranteed to arrive and are not duplicated.
These two settings are the foundation for "Exactly-Once" semantics on the producer side.
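To make this concrete, here is a minimal sketch of a Java producer configured with exactly these two settings. The bootstrap address, topic name, key, value, and the String serializers are placeholder assumptions for the example, not part of any particular setup.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReliableProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Wait until the leader and all in-sync replicas have acknowledged each write.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // Broker-side deduplication of internal retries via per-producer sequence numbers.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Topic, key, and value are illustrative only.
                producer.send(new ProducerRecord<>("payments", "order-42", "amount=99.90"));
            }
        }
    }

Closing the producer in the try-with-resources block flushes any buffered records before the process exits.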
We’ve set acks=all, which means the producer waits for acknowledgment from all In-Sync Replicas. But what exactly does "in-sync" mean, and how many replicas do we need?
A replica is considered "in-sync" if it’s not too far behind the leader. If a broker fails or becomes too slow, the controller removes it from the ISR list for the affected partitions. So acks=all doesn’t wait for all configured replicas, but only for those currently considered healthy. This is a clever mechanism, because otherwise the failure of a single broker would block the entire write operation.
Now comes the crucial question: What happens when too many brokers fail and only the leader remains? acks=all would then be identical to acks=1, and we lose our safety guarantee.
To prevent this scenario, we configure the min.insync.replicas parameter at the broker or topic level. This value determines how many replicas must be in the ISR list for a produce request to be accepted at all.
Our recommendation for topics:
Replication Factor: 3
min.insync.replicas: 2
With this configuration, one broker can fail without affecting production. If two brokers fail, the producer stops producing to the affected partitions, since the min.insync.replicas=2 condition is no longer met and the broker rejects the request with a NotEnoughReplicasException. This is exactly what we want: better to stop production than risk potential data loss.
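As an illustration, such a topic could be created programmatically with the Java AdminClient roughly as follows. The topic name, partition count, and bootstrap address are assumptions made only for this sketch.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateReliableTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

            try (Admin admin = Admin.create(props)) {
                // 6 partitions (arbitrary), replication factor 3, min.insync.replicas 2.
                NewTopic topic = new NewTopic("payments", 6, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }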
Warning: It’s rarely sensible to set min.insync.replicas equal to the replication factor. If you have a replication factor of 3 and set min.insync.replicas to 3, not a single broker can fail without halting production.
Even with perfect configuration, errors can occur. The producer could crash, or the delivery.timeout.ms could expire before an acknowledgment arrives from the broker. How do you handle this?
Wrong approach: Blind resending
Simply resending the message in a callback is a bad idea. The internal retry logic of the Kafka client library is much more intelligent and respects the idempotence guarantee. A manual retry bypasses these guarantees and leads to duplicates or ordering problems.
Better ideas:
Exponential Backoff: Wait a certain time after an error and try again. Double the wait time with each further failure. This can help with temporary problems but carries the risk of violating message order and is difficult to implement cleanly without losing the Exactly-Once guarantee.
Write messages locally to disk: If a message is transient (i.e., it only exists in your producer’s memory), persisting to local disk can be an emergency solution. A separate process can then later attempt to resend these messages. However, this significantly increases complexity.
My favorite: Let it crash! (Controlled Crash)
This sounds counterintuitive but is often the cleanest and most robust approach, especially in modern, containerized environments like Kubernetes. If your producer reads data from a source system (e.g., a database), processes it, and writes to Kafka, the approach is simple: When an irreparable error occurs during sending, terminate the process.
Why is this good? An orchestration system like Kubernetes automatically restarts the container. After restart, the producer begins again at the last successfully processed position in the source system. You shift error handling from complex code to robust infrastructure. It’s clean, simple to document, and ideally triggers an alert in monitoring, making the problem transparent.
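A minimal sketch of this pattern could look like the following. The synchronous send, the two-minute delivery.timeout.ms, and the exit code are illustrative assumptions; the point is only that an unrecoverable send error ends the process instead of being patched over in code.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class LetItCrashProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            // Let the client's internal retry logic work for up to two minutes per record.
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            try {
                // Blocking get(): either all in-sync replicas acknowledge the record, or we get an exception.
                producer.send(new ProducerRecord<>("payments", "order-42", "amount=99.90")).get();
            } catch (Exception e) {
                // Unrecoverable from our point of view: log, terminate, and let the orchestrator
                // restart the container, which then resumes from the last processed source position.
                System.err.println("Send failed, terminating for a clean restart: " + e);
                producer.close();
                System.exit(1);
            }
            producer.close();
        }
    }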
It’s important to understand what our configuration achieves – and what it doesn’t.
Guarantee: As long as the producer doesn’t crash and delivery.timeout.ms is not exceeded, the message will be stored exactly once in a partition.
Limitation: This guarantee only applies to the producer. It doesn’t ensure that a consumer processes the message exactly once. For complete end-to-end exactly-once processing (e.g., consume-process-produce), additional mechanisms are necessary.
What if multiple messages need to be written atomically? If you need to write multiple messages, possibly even to different topics, as a single, indivisible unit ("all or nothing"), you need Producer Transactions. This is an advanced topic that would exceed the scope of this article.
For consume-process-produce scenarios, Kafka Streams with the setting processing.guarantee=exactly_once_v2 is often the best and simplest solution, as it manages transactions internally.
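For illustration, enabling this guarantee in a Kafka Streams application is a single configuration entry; the application id and bootstrap address below are placeholders, and the actual topology is omitted.

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceStreamsConfig {
        public static Properties buildConfig() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-pipeline"); // assumed id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            // Kafka Streams then wraps each consume-process-produce cycle in a transaction internally.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
            return props;
        }
    }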
Reliability in distributed systems is no accident, but the result of deliberate architectural decisions and careful configuration. To ensure your producers don’t lose messages, you’ve learned to use the three most important levers:
acks=all: Wait for acknowledgment from all In-Sync Replicas.
enable.idempotence=true: Prevent duplicates during internal retries.
min.insync.replicas=2 (with replication factor 3): Better to stop producing than risk data loss when too many brokers fail.
For error cases, we’ve seen that a controlled restart through infrastructure is often the cleanest solution. Remember that these guarantees are limited to the producer. For end-to-end scenarios, advanced concepts like Kafka Streams or Producer Transactions are the next logical step.
Official Kafka documentation on Producer: Producer Configs
Our Deep Dive on Transactions (coming soon here in the Knowledge Hub)
Our book Apache Kafka in Action for comprehensive understanding of all concepts.