
In a distributed system like Apache Kafka, the "fire-and-forget" principle is a recipe for data loss and inconsistent states. When a producer sends a critical message – whether it’s a financial transaction, a sensor value from a wind turbine, or important logistics information – you as a developer or architect must guarantee that this message reaches the cluster safely and reliably.
Without the right configurations, a network outage or broker failure can cause messages to either be completely lost or duplicated unnoticed. This not only undermines data integrity but can also lead to incorrect business decisions and loss of customer trust. In this guide, we’ll dive into the technical configurations that are essential for guaranteed and loss-free message transmission.
acks=all and Idempotence
The first lever for reliable producing is the acknowledgment strategy (acks). This parameter defines how many acknowledgments the producer waits for before considering a message successfully sent.
acks=0: The producer sends the message and does not wait for any acknowledgment. Maximum performance, but high risk of data loss.
acks=1: The producer waits only for the acknowledgment from the leader broker of the respective partition. This was the default until Kafka 3.4. The message is safe only as long as the leader does not fail right after acknowledging the write but before the followers have replicated it.
acks=all (or -1): The producer waits until the leader and all In-Sync Replicas (ISRs) have received the message. This is the new default since Kafka 3.4 and provides the highest guarantee against data loss.
For maximum reliability, acks=all is therefore mandatory. But this alone is not enough. What happens when the producer doesn’t receive an acknowledgment due to a temporary network problem and resends the message? Without further precautions, you would have the message duplicated in the topic.
This is where idempotence comes into play. By setting enable.idempotence=true (default since Kafka 3.4), the producer assigns a sequence number to each message. The broker stores the last sequence number per producer and partition and automatically discards duplicates.
With the combination of acks=all and enable.idempotence=true, you ensure that messages are guaranteed to arrive and are not duplicated.
These two settings are the foundation for "Exactly-Once" semantics on the producer side.
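To make this concrete, here is a minimal sketch of a Java producer configured with exactly these two settings. The bootstrap address, topic name, key, value, and the String serializers are placeholder assumptions for the example, not part of any particular setup.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReliableProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Wait until the leader and all in-sync replicas have acknowledged each write.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // Broker-side deduplication of internal retries via per-producer sequence numbers.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Topic, key, and value are illustrative only.
                producer.send(new ProducerRecord<>("payments", "order-42", "amount=99.90"));
            }
        }
    }

Closing the producer in the try-with-resources block flushes any buffered records before the process exits.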
We’ve set acks=all, which means the producer waits for acknowledgment from all In-Sync Replicas. But what exactly does "in-sync" mean, and how many replicas do we need?
A replica is considered "in-sync" if it’s not too far behind the leader. If a broker fails or becomes too slow, the controller removes it from the ISR list for the affected partitions. So acks=all doesn’t wait for all configured replicas, but only for those currently considered healthy. This is a clever mechanism, because otherwise the failure of a single broker would block the entire write operation.
Now comes the crucial question: What happens when too many brokers fail and only the leader remains? acks=all would then be identical to acks=1, and we lose our safety guarantee.
To prevent this scenario, we configure the min.insync.replicas parameter at the broker or topic level. This value determines how many replicas must be in the ISR list for a produce request to be accepted at all.
Our recommendation for topics:
Replication Factor: 3
min.insync.replicas: 2
With this configuration, one broker can fail without affecting production. If two brokers fail, the producer stops producing to the affected partitions, since the min.insync.replicas=2 condition is no longer met and the broker rejects the request with a NotEnoughReplicasException. This is exactly what we want: better to stop production than risk potential data loss.
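As an illustration, such a topic could be created programmatically with the Java AdminClient roughly as follows. The topic name, partition count, and bootstrap address are assumptions made only for this sketch.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateReliableTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

            try (Admin admin = Admin.create(props)) {
                // 6 partitions (arbitrary), replication factor 3, min.insync.replicas 2.
                NewTopic topic = new NewTopic("payments", 6, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }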
Warning: It’s rarely sensible to set min.insync.replicas equal to the replication factor. If you have a replication factor of 3 and set min.insync.replicas to 3, not a single broker can fail without halting production.
Even with perfect configuration, errors can occur. The producer could crash, or the delivery.timeout.ms could expire before an acknowledgment arrives from the broker. How do you handle this?
Wrong approach: Blind resending
Simply resending the message in a callback is a bad idea. The internal retry logic of the Kafka client library is much more intelligent and respects the idempotence guarantee. A manual retry bypasses these guarantees and leads to duplicates or ordering problems.
Better ideas:
Exponential Backoff: Wait a certain time after an error and try again. Double the wait time with each further failure. This can help with temporary problems but carries the risk of violating message order and is difficult to implement cleanly without losing the Exactly-Once guarantee.
Write messages locally to disk: If a message is transient (i.e., it only exists in your producer’s memory), persisting to local disk can be an emergency solution. A separate process can then later attempt to resend these messages. However, this significantly increases complexity.
My favorite: Let it crash! (Controlled Crash)
This sounds counterintuitive but is often the cleanest and most robust approach, especially in modern, containerized environments like Kubernetes. If your producer reads data from a source system (e.g., a database), processes it, and writes to Kafka, the approach is simple: When an irreparable error occurs during sending, terminate the process.
Why is this good? An orchestration system like Kubernetes automatically restarts the container. After restart, the producer begins again at the last successfully processed position in the source system. You shift error handling from complex code to robust infrastructure. It’s clean, simple to document, and ideally triggers an alert in monitoring, making the problem transparent.
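A minimal sketch of this pattern could look like the following. The synchronous send, the two-minute delivery.timeout.ms, and the exit code are illustrative assumptions; the point is only that an unrecoverable send error ends the process instead of being patched over in code.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class LetItCrashProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            // Let the client's internal retry logic work for up to two minutes per record.
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            try {
                // Blocking get(): either all in-sync replicas acknowledge the record, or we get an exception.
                producer.send(new ProducerRecord<>("payments", "order-42", "amount=99.90")).get();
            } catch (Exception e) {
                // Unrecoverable from our point of view: log, terminate, and let the orchestrator
                // restart the container, which then resumes from the last processed source position.
                System.err.println("Send failed, terminating for a clean restart: " + e);
                producer.close();
                System.exit(1);
            }
            producer.close();
        }
    }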
It’s important to understand what our configuration achieves – and what it doesn’t.
Guarantee: As long as the producer doesn’t crash and delivery.timeout.ms is not exceeded, the message will be stored exactly once in a partition.
Limitation: This guarantee only applies to the producer. It doesn’t ensure that a consumer processes the message exactly once. For complete end-to-end exactly-once processing (e.g., consume-process-produce), additional mechanisms are necessary.
What if multiple messages need to be written atomically? If you need to write multiple messages, possibly even to different topics, as a single, indivisible unit ("all or nothing"), you need Producer Transactions. This is an advanced topic that would exceed the scope of this article.
For consume-process-produce scenarios, Kafka Streams with the setting processing.guarantee=exactly_once_v2 is often the best and simplest solution, as it manages transactions internally.
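For illustration, enabling this guarantee in a Kafka Streams application is a single configuration entry; the application id and bootstrap address below are placeholders, and the actual topology is omitted.

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceStreamsConfig {
        public static Properties buildConfig() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-pipeline"); // assumed id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            // Kafka Streams then wraps each consume-process-produce cycle in a transaction internally.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
            return props;
        }
    }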
Reliability in distributed systems is no accident, but the result of deliberate architectural decisions and careful configuration. To ensure your producers don’t lose messages, you’ve learned to use the three most important levers:
acks=all: Wait for acknowledgment from all In-Sync Replicas.
enable.idempotence=true: Prevent duplicates during internal retries.
min.insync.replicas=2 (with replication factor 3): Better to stop producing than risk data loss when too many brokers fail.
For error cases, we’ve seen that a controlled restart through infrastructure is often the cleanest solution. Remember that these guarantees are limited to the producer. For end-to-end scenarios, advanced concepts like Kafka Streams or Producer Transactions are the next logical step.
Official Kafka documentation on Producer: Producer Configs
Our Deep Dive on Transactions (coming soon here in the Knowledge Hub)
Our book Apache Kafka in Action for comprehensive understanding of all concepts.