Kafka Sizing & Scaling: Hardware Requirements and Scaling Strategies

You don't need a massive cluster to run Kafka efficiently. We debunk hardware myths and show how to start with 3 brokers and scale easily.

Reading vendor documentation often gives the impression that building a Kafka cluster requires enormous amounts of hardware. Fortunately, the reality is quite different. Kafka can achieve a lot with surprisingly few resources.

In this post, we discuss sensible minimum requirements and scaling options.

The Most Common Mistake: The Clients

Before discussing hardware, an important note: high resource usage is often caused not by the Kafka brokers themselves, but by misconfigured clients.

Poorly chosen batch sizes or unnecessarily aggressive polling intervals can bring a cluster to its knees, no matter how much hardware you throw at it. So check your clients before expanding the cluster.
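For illustration, producer batching is controlled by the linger.ms and batch.size settings. Below is a minimal Java sketch; the broker address and the concrete values are placeholders chosen to show the knobs, not tuning recommendations for your workload:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerBatchingExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Batch records instead of sending each one individually:
        // wait up to 50 ms to fill batches of up to 64 KB (example values).
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // Compression reduces network and disk load on the brokers.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records here ...
        }
    }
}
```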

You can find more information on proper client configuration in the article Configuring Apache Kafka Clients Correctly.

Controller

We recommend starting with three dedicated controller nodes.

Although it is technically possible to run controllers and brokers in the same JVM, separating them makes later scaling significantly easier and operations more stable.

Usually, three controller nodes are sufficient even for larger Kafka clusters. They are relatively frugal:

  • Start: 1-2 GB RAM and 0.5 CPU cores are perfectly fine to begin with.

  • Storage: 10 GB of storage space will last a very long time.

  • Peak: Even in a heavily loaded Kafka cluster, controllers require a maximum of 4-8 GB RAM and 1-2 CPU cores.

The number of controllers does not affect performance, only the reliability of the quorum. With three controllers, one can fail without causing disruptions. With five controllers, two can fail. In our opinion, three controllers are enough for the majority of use cases.
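The underlying arithmetic: a majority quorum of n voters tolerates floor((n - 1) / 2) failures. A tiny sketch of that rule:

```java
public class QuorumTolerance {
    // A majority quorum of n voters tolerates floor((n - 1) / 2) failures.
    static int tolerableFailures(int voters) {
        return (voters - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("%d controllers -> %d may fail%n", n, tolerableFailures(n));
        }
    }
}
```

This also shows why even numbers of controllers are not worth it: four controllers tolerate no more failures than three.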

Note: These rules apply to both the modern KRaft mode and older ZooKeeper setups.

Broker

For the actual brokers, things are a bit more complex. Here we need to consider Storage, RAM, CPU, and Network.

Storage

It is crucial to provide brokers with fast and low-latency storage. Kafka is sensitive to latency spikes when writing to disk.

Stay away from NFS & Co.!

Avoid slow network-based storage such as NFS, GlusterFS, or Portworx. Their fluctuating latencies lead to stability issues sooner or later.

Instead, use iSCSI volumes, fast network or cloud block storage, or ideally local SSD storage.

The required disk size is simply determined by the expected data volume, the retention time, and the replication factor.

  • Formula: Daily Ingest × Retention Time × Replication Factor

  • Example: If 10 GB are produced per day and kept for 7 days, we need a total of 210 GB storage space in the cluster. Divided by three brokers, this means each broker needs at least 70 GB of storage space.
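The same calculation as a small Java sketch, using the numbers from the example above:

```java
public class StorageSizing {
    public static void main(String[] args) {
        double dailyIngestGb = 10;    // produced per day
        int retentionDays = 7;        // how long data is kept
        int replicationFactor = 3;    // copies of each partition
        int brokerCount = 3;

        double clusterGb = dailyIngestGb * retentionDays * replicationFactor; // 210 GB
        double perBrokerGb = clusterGb / brokerCount;                         // 70 GB
        System.out.printf("Cluster: %.0f GB, per broker: %.0f GB%n", clusterGb, perBrokerGb);
    }
}
```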

RAM (Memory)

Kafka brokers use RAM in two ways:

  1. JVM Heap: For the application itself.

  2. Page Cache: As buffer memory at the operating system level.

Even in very large clusters, brokers rarely need more than 6 GB of heap (configured via -Xmx and -Xms). A common rule of thumb is to split RAM 50/50 between heap and page cache until the heap reaches 6 GB. All remaining RAM is automatically used by the operating system as page cache, which speeds up writes and especially reads. The more page cache is available, the more requests Kafka can answer directly from RAM without having to touch the disk.
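Expressed as code, that rule of thumb looks roughly like this (an illustrative sketch, not an official sizing formula):

```java
public class HeapSizing {
    // Rule of thumb: half of RAM for the heap, capped at 6 GB;
    // everything else stays available as page cache.
    static long heapMb(long totalRamMb) {
        return Math.min(totalRamMb / 2, 6 * 1024);
    }

    public static void main(String[] args) {
        for (long ramGb : new long[] {1, 4, 16, 64}) {
            long heap = heapMb(ramGb * 1024);
            System.out.printf("%2d GB RAM -> -Xms%dm -Xmx%dm, ~%d MB page cache%n",
                    ramGb, heap, heap, ramGb * 1024 - heap);
        }
    }
}
```

For a 1 GB machine this yields the 512 MB heap mentioned in the next section; from 12 GB RAM upwards, the heap stays capped at 6 GB.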

CPU & Ratio

For very small clusters, it is often enough to start with 1 GB RAM, 512 MB heap, and 1 CPU core.

To get a feel for the right ratio between RAM and CPU, it is worth looking at the Memory Optimized Instances from AWS. A good rule of thumb here is:

RAM in GB = vCPU * 8

This corresponds, for example, to 16 GB RAM with 2 vCPUs.
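Applied to a few sizes, the rule gives (plain arithmetic, nothing Kafka-specific):

```java
public class RamCpuRatio {
    public static void main(String[] args) {
        // Rule of thumb: RAM in GB = vCPU * 8
        for (int vcpu : new int[] {1, 2, 4, 8}) {
            System.out.printf("%d vCPU -> %d GB RAM%n", vcpu, vcpu * 8);
        }
    }
}
```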

Network

For the network, the rule of thumb is: take what you can get. Kafka benefits from high bandwidth. If 10 or even 25 Gbit/s are available, use them. In cloud environments, keep in mind that network bandwidth is often tied to instance size – sometimes you have to go up an instance size just to get more throughput.

Scaling

When the load increases, the question arises: Scale horizontally (more brokers) or vertically (bigger servers)?

Vertical scaling is usually operationally easier, but it has a disadvantage: if one of three brokers fails, the remaining two suddenly have to take over 50% more load each (at replication factor 3). The larger the individual brokers, the harder a failure hits.
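To make that failure math concrete, here is the arithmetic for evenly balanced clusters of different sizes (a plain calculation, no Kafka APIs involved):

```java
public class FailureLoadMath {
    public static void main(String[] args) {
        // In an evenly balanced cluster with replication, the survivors
        // share the failed broker's load: each gains 1 / (n - 1) of its
        // previous load when one of n brokers fails.
        for (int brokers : new int[] {3, 6, 12}) {
            double increasePercent = 100.0 / (brokers - 1);
            System.out.printf("%2d brokers: one failure adds ~%.0f%% load to each survivor%n",
                    brokers, increasePercent);
        }
    }
}
```

With three brokers, each survivor takes on 50% more; with six it is only 20%, which is one argument for scaling out once brokers get big.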

Our Recommendation:

Start with three brokers. Scale these vertically up to 64 GB RAM (and corresponding CPUs). After that, we would slowly increase the number of brokers, perhaps up to about 6 brokers.

At that point, at the latest, you should take a very close look at your metrics to find the real bottleneck. Is it really the CPU? Or rather the network or disk I/O? Based on this analysis, decide whether more brokers or more resources per broker make more sense.

Important: Rebalancing

When you add new brokers, nothing happens at first: the load is not redistributed automatically. That is the job of tools like Cruise Control, which intelligently move partitions onto the new nodes.
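For illustration, Cruise Control exposes a REST API for such rebalances. The sketch below assumes the documented /kafkacruisecontrol/rebalance endpoint and a hypothetical host and port; treat it as a starting point, not a ready-made integration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CruiseControlRebalance {
    public static void main(String[] args) throws Exception {
        // Hypothetical Cruise Control address; adjust for your setup.
        // dryrun=true only returns the proposed plan; set dryrun=false
        // to actually move partitions.
        URI uri = URI.create(
                "http://cruise-control:9090/kafkacruisecontrol/rebalance?dryrun=true");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```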
