Apache Kafka is a high-performance, highly scalable event streaming platform. To unlock Kafka’s full potential, you need to carefully consider the design of your application. It’s all too easy to write Kafka applications that perform poorly or eventually hit a scalability brick wall. Since 2015, IBM has provided the IBM Event Streams service, which is a fully managed Apache Kafka service running on IBM Cloud®. Since then, the service has helped many customers, as well as teams within IBM, resolve scalability and performance problems with the Kafka applications they have written. This article describes some of the common problems encountered by Apache Kafka applications and provides recommendations for how you can avoid running into scalability problems with your own.
1. Minimize waiting for network round-trips
Certain Kafka operations work by the client sending data to the broker and waiting for a response. A whole round-trip might take 10 milliseconds, which sounds speedy, but if you perform these operations sequentially it limits you to at most 100 operations per second. For this reason, it’s recommended that you try to avoid these kinds of operations whenever possible. Fortunately, Kafka clients provide ways for you to avoid waiting on these round-trip times. You just need to ensure that you’re taking advantage of them.
Tips to maximize throughput:
- Don’t wait to check that each message sent has succeeded before sending the next. Kafka’s API allows you to decouple sending a message from checking whether the broker received it successfully.
- Don’t follow the processing of each message with an offset commit. Committing offsets (synchronously) is implemented as a network round-trip with the server.
If you read the above and thought, “Uh oh, won’t that make my application more complex?” — the answer is yes, it likely will. There is a trade-off between throughput and application complexity. What makes network round-trip time a particularly insidious pitfall is that once you hit this limit, it can require extensive application changes to achieve further throughput improvements.
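To see why the decoupled pattern matters, here is a minimal sketch. It does not use the Kafka client API at all: `fake_send`, the 10 ms round-trip figure, and the message count are all stand-ins that model a produce request as one network round-trip, so the two timings can be compared directly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ROUND_TRIP_SECONDS = 0.01  # modelled broker round-trip (~10 ms)

def fake_send(message):
    """Stand-in for a produce request: one network round-trip per message."""
    time.sleep(ROUND_TRIP_SECONDS)
    return f"ack:{message}"

messages = [f"msg-{i}" for i in range(20)]

# Synchronous pattern: wait for each acknowledgement before the next send.
start = time.perf_counter()
sync_acks = [fake_send(m) for m in messages]
sync_elapsed = time.perf_counter() - start

# Decoupled pattern: issue all of the sends first, then check the
# acknowledgements afterwards, so the round-trips overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(fake_send, m) for m in messages]
    async_acks = [f.result() for f in futures]
async_elapsed = time.perf_counter() - start

print(f"synchronous: {sync_elapsed:.3f}s, decoupled: {async_elapsed:.3f}s")
```

With 20 messages, the synchronous loop pays 20 full round-trips (roughly 200 ms in this model), while the decoupled version overlaps them. The same idea applies to real Kafka producers, where sending returns immediately and delivery is confirmed later via a future or callback.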
2. Don’t let increased processing times be mistaken for consumer failures
One helpful feature of Kafka is that it monitors the “liveness” of consuming applications and disconnects any that might have failed. This works by having the broker track when each consuming client last called “poll” (Kafka’s terminology for asking for more messages). Unfortunately, with this scheme the Kafka broker can’t distinguish between a client that is taking a long time to process the messages it received and a client that has actually failed. If the gap between calls to poll exceeds the consumer’s “max.poll.interval.ms” setting (5 minutes by default), the consumer is considered failed and its partitions are reassigned to other members of the consumer group. If your processing time per batch can approach this limit, either increase “max.poll.interval.ms” or use “max.poll.records” to reduce the number of messages returned by each poll, so that each batch can be processed comfortably within the interval.
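The limitation can be illustrated with a toy model. This is not Kafka code: `LivenessMonitor`, the consumer ID, and the scaled-down interval are all invented for the sketch, which only mimics the idea of judging liveness by the time since the last poll.

```python
import time

MAX_POLL_INTERVAL_SECONDS = 0.1  # scaled-down stand-in for max.poll.interval.ms

class LivenessMonitor:
    """Toy model of judging consumer liveness by the time since its last poll."""

    def __init__(self):
        self.last_poll = {}

    def record_poll(self, consumer_id):
        self.last_poll[consumer_id] = time.monotonic()

    def is_considered_failed(self, consumer_id):
        # The monitor only sees the gap since the last poll; it cannot tell
        # slow processing apart from a genuinely crashed consumer.
        gap = time.monotonic() - self.last_poll[consumer_id]
        return gap > MAX_POLL_INTERVAL_SECONDS

monitor = LivenessMonitor()
monitor.record_poll("slow-but-alive")
time.sleep(0.2)  # consumer is busy processing a large batch, not failed
print(monitor.is_considered_failed("slow-but-alive"))  # flagged despite being alive
```

The consumer here is perfectly healthy, just slow, yet the monitor flags it as failed, which is exactly the mistaken-failure scenario the section describes.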
3. Minimize the cost of idle consumers
Under the hood, the protocol used by the Kafka consumer to receive messages works by sending a “fetch” request to a Kafka broker. If no messages are available, the request waits on the broker for up to “fetch.max.wait.ms” before returning an empty response, and the consumer immediately sends another fetch. As a result, even completely idle consumers generate a steady stream of requests that consume broker resources. Changing the Kafka consumer configuration can help reduce this effect. If you want to receive messages as soon as they arrive, the “fetch.min.bytes” must remain at its default of 1; however, the “fetch.max.wait.ms” setting can be increased to a larger value and doing so will reduce the number of requests made by idle consumers.
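The effect of increasing “fetch.max.wait.ms” can be estimated with simple arithmetic. This sketch assumes an idle consumer whose fetch requests always wait out the full timeout before returning empty; the 5000 ms value is illustrative, not a recommendation.

```python
def idle_fetch_rate(fetch_max_wait_ms):
    """Approximate fetch requests per second issued by one idle consumer.

    With fetch.min.bytes=1 and no data arriving, each fetch parks on the
    broker for up to fetch_max_wait_ms, returns empty, and the client
    immediately issues the next fetch.
    """
    return 1000.0 / fetch_max_wait_ms

print(idle_fetch_rate(500))   # default fetch.max.wait.ms: ~2 requests/sec
print(idle_fetch_rate(5000))  # illustrative larger value: ~0.2 requests/sec
```

At the default of 500 ms, each idle consumer polls the broker roughly twice a second; with many idle consumers, that background load adds up, which is why raising the wait time helps.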