"Translate" to tune Apache Kafka cluster

Tags: apache flink, kafka streams

Today's post is a translation of "Tuning an Apache Kafka cluster". There are not many new ideas in it, but the summary is detailed: the article derives a different set of parameter configurations for each of four different goals, which makes it worth reading. The original is at: https://www.confluent.io/blog/optimizing-apache-kafka-deployment/

==========================================

Apache Kafka is currently the best enterprise-class streaming platform. Connect your applications to a Kafka cluster and Kafka takes care of the rest: automatic load balancing, zero-copy data transfer, automatic rebalancing when consumer group membership changes, persistent storage for application state, automatic failover of partition leaders, and so on. The dream of operations engineers has finally come true!

Translator's note: I have been looking at Apache Flink recently. As far as stream processing goes, Flink is no worse than Kafka Streams. As for whether Kafka is really the best streaming platform, take that with a grain of salt.

The default Kafka parameter configuration is enough to stand up a Kafka cluster for development and testing from scratch, but it usually does not fit a production environment, so some degree of tuning is required. After all, different usage scenarios have different requirements and performance targets, and the various parameters Kafka exposes exist to optimize for them. Kafka provides a large number of configuration options precisely so that a deployment can be made to meet its target requirements, so it is important to investigate the meaning of these parameters carefully and to test different values. All of this work should be done before Kafka goes into production, and the chosen configuration should take future growth of the cluster into account.

The tuning process is as follows:

    1. Define the tuning goals
    2. Configure the Kafka server-side and client-side parameters accordingly
    3. Run performance tests and monitor the relevant metrics to determine whether the requirements are met and whether further tuning is possible

I. Setting goals

The first step is to identify the performance tuning goals, which fall into 4 areas: throughput, latency, durability, and availability. Based on the actual usage scenario, determine which one (or ones) of these 4 goals you need to reach. Sometimes it can be hard to decide, so try this: get the team together, discuss the business usage scenarios, and work out what the main business goals are. There are two main reasons for setting goals explicitly:

    • "You can't have your cake and eat it"-you can't maximize all your goals. There must be a tradeoff between the 4 (tradeoff). Common tradeoff include tradeoffs between throughput and latency tradeoffs, persistence, and availability. But when we think about the whole system, we don't usually think about one aspect of it in isolation, but we need to consider it all. While they are not mutually exclusive, it is almost certainly impossible to make all the targets at the same time optimal.
    • We need to constantly adjust the Kafka configuration parameters to achieve these goals, and to ensure that our Kafka is optimized to meet the needs of the user's actual usage scenarios.

Here are some questions that can help you set your goals:

    • Do you expect Kafka to achieve high throughput (TPS, i.e. the producer production rate and the consumer consumption rate), for example millions of messages per second? Thanks to Kafka's design, producing very large volumes of messages is not difficult. Kafka is much faster than a traditional database or key-value store, and it can achieve this on ordinary hardware.
    • Do you expect Kafka to achieve low latency (i.e. the shorter the interval between a message being written and being read, the better)? A practical low-latency application is a chat program, where the faster a message arrives the better. Other examples include interactive websites where users expect to see their friends' activity in real time, and real-time stream processing in the Internet of Things.
    • Do you expect Kafka to provide high durability, that is, a successfully committed message is never lost? For example, an event-driven microservices data pipeline that uses Kafka as the underlying data store requires that Kafka not lose events. Likewise, when a stream processing framework reads from persistent storage, it must not miss business-critical events.
    • Do you expect high availability from Kafka, meaning no overall service outage even when a crash occurs? Kafka is a distributed system and is naturally able to tolerate crashes. If high availability is your primary goal, configuring specific parameters ensures that Kafka can recover from a crash in a timely manner.

II. Configuring parameters

Below we discuss the optimization of each of these four goals and the corresponding parameter settings. The parameters cover different configurations on the producer side, broker side, and consumer side. As mentioned earlier, many of these configurations involve some degree of trade-off, so be clear about what each one really means and apply it with the target in mind.

Producer side

    • batch.size
    • linger.ms
    • compression.type
    • acks
    • retries
    • max.in.flight.requests.per.connection
    • buffer.memory

Broker side

    • default.replication.factor
    • num.replica.fetchers
    • auto.create.topics.enable
    • min.insync.replicas
    • unclean.leader.election.enable
    • broker.rack
    • log.flush.interval.messages
    • log.flush.interval.ms
    • num.recovery.threads.per.data.dir

Consumer side

    • fetch.min.bytes
    • auto.commit.enable
    • session.timeout.ms

1 Tuning throughput

Producer side

    • batch.size = 100000-200000 (the default of 16384 is usually too small)
    • linger.ms = 10-100 (default is 0)
    • compression.type = lz4
    • acks = 1
    • retries = 0
    • buffer.memory: if there are many partitions, increase this appropriately (default is 32MB)

Consumer side

    • fetch.min.bytes = 10 ~ 100000 (default is 1)
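
To make the producer-side settings above concrete, here is a minimal Java sketch of a throughput-oriented producer. The broker address, topic name, and the exact values chosen (a 128 KB batch size, a 50 ms linger, a 64 MB buffer) are assumptions for illustration, not values from the article; pick values inside the suggested ranges that match your own workload.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ThroughputProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");             // assumption: local broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // Throughput-oriented settings from the list above
            props.put("batch.size", 131072);       // ~128 KB, inside the suggested 100000-200000 range
            props.put("linger.ms", 50);            // wait up to 50 ms so batches can fill up
            props.put("compression.type", "lz4");  // cheap compression, fewer bytes on the wire
            props.put("acks", "1");                // only the partition leader acknowledges
            props.put("retries", 0);
            props.put("buffer.memory", 67108864L); // 64 MB, raised from the 32 MB default

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 1000; i++) {
                    producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "message-" + i));
                }
                producer.flush();
            }
        }
    }

On the consumer side, raising fetch.min.bytes as listed above lets the broker accumulate more data per fetch response, trading a little extra latency for fewer, larger requests.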

2 Tuning latency

Producer side

    • linger.ms = 0
    • compression.type = none
    • acks = 1

Broker side

    • num.replica.fetchers: if replicas frequently drop out of and rejoin the ISR, or followers cannot catch up with the leader, increase this value appropriately, but usually not beyond the number of CPU cores + 1

Consumer side

    • fetch.min.bytes = 1
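
Here is a minimal sketch of the client-side settings for low latency. The broker address and group id are made-up placeholders, and the broker-side num.replica.fetchers change belongs in server.properties, so it only appears as a comment.

    import java.util.Properties;

    public class LatencyConfigs {

        // Broker side (server.properties), not client code:
        //   num.replica.fetchers=<raise if followers fall behind; usually at most CPU cores + 1>

        // Producer settings aimed at low end-to-end latency
        static Properties producerProps() {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("linger.ms", 0);              // send immediately, do not wait for batches to fill
            p.put("compression.type", "none");  // skip the compression CPU cost
            p.put("acks", "1");                 // wait only for the leader
            return p;
        }

        // Consumer settings: the broker answers fetches as soon as any data is available
        static Properties consumerProps() {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            c.put("group.id", "low-latency-group");        // hypothetical group id
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("fetch.min.bytes", 1);        // do not wait for data to accumulate on the broker
            return c;
        }
    }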

3 Tuning durability

Producer side

    • replication.factor = 3 (a topic-level setting, applied when the topic is created)
    • acks = all
    • retries = a relatively large value, e.g. 5 ~ 10
    • max.in.flight.requests.per.connection = 1 (prevents message reordering)

Broker side

    • default.replication.factor = 3
    • auto.create.topics.enable = false
    • min.insync.replicas = 2, i.e. set it to replication factor - 1
    • unclean.leader.election.enable = false
    • broker.rack: if rack information is available, it is best to set this value so that replicas are spread across racks for higher durability
    • log.flush.interval.messages and log.flush.interval.ms: for particularly important topics whose TPS is not high, consider setting a low value such as 1

Consumer side

    • auto.commit.enable = false, and commit offsets manually
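
A minimal Java sketch of the durability-oriented client settings, assuming a local broker and a hypothetical topic and consumer group. The broker-side settings above (default.replication.factor, min.insync.replicas, unclean.leader.election.enable, broker.rack, the flush intervals) belong in server.properties and are not repeated here.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DurabilityExample {
        public static void main(String[] args) {
            // Producer: do not treat a send as successful until all in-sync replicas have it
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("acks", "all");
            p.put("retries", 10);
            p.put("max.in.flight.requests.per.connection", 1);  // avoid reordering on retry

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("critical-events", "key", "value"));  // hypothetical topic
                producer.flush();
            }

            // Consumer: disable auto-commit and commit offsets only after processing succeeds
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            c.put("group.id", "durable-consumers");        // hypothetical group id
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("enable.auto.commit", "false");          // auto.commit.enable on the old consumer

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(Collections.singletonList("critical-events"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("processing " + record.value());  // real processing goes here
                }
                consumer.commitSync();  // commit offsets manually, only after processing is done
            }
        }
    }

Committing offsets manually after processing means a consumer crash leads to reprocessing rather than silently lost messages.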

4 Tuning High Availability

Broker side

    • unclean.leader.election.enable = true
    • min.insync.replicas = 1
    • num.recovery.threads.per.data.dir = the number of directories configured in log.dirs

Consumer side

    • session.timeout.ms: set it as low as is practical
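
A short sketch of the availability-oriented settings, assuming a local broker and a hypothetical group id; the broker-side values only appear as comments because they go in server.properties.

    import java.util.Properties;

    public class AvailabilityConfigs {

        // Broker side (server.properties), not client code:
        //   unclean.leader.election.enable=true
        //   min.insync.replicas=1
        //   num.recovery.threads.per.data.dir=<number of directories in log.dirs>

        // Consumer settings: detect dead consumers quickly so partitions are reassigned fast
        static Properties consumerProps() {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            c.put("group.id", "ha-consumers");             // hypothetical group id
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("session.timeout.ms", 6000);  // low value; the broker's group.min.session.timeout.ms bounds how low this can go
            return c;
        }
    }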

III. Monitoring metrics

1 Operating-system-level metrics

    • Memory utilization
    • Disk usage
    • CPU usage
    • Number of open file handles
    • Disk I/O utilization
    • Network bandwidth utilization

2 Common Kafka JMX metrics

3 JMX metrics that help locate bottlenecks

4 Common client-side JMX metrics

5 Broker-side ISR-related JMX metrics
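
As an example of what the broker-side ISR monitoring could look like in practice, here is a small JMX probe. It assumes the broker was started with JMX enabled on port 9999 (a made-up value, e.g. via the JMX_PORT environment variable); the two MBean names are Kafka's standard ReplicaManager metrics.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class IsrMetricsProbe {
        public static void main(String[] args) throws Exception {
            // Assumption: the broker exposes JMX at localhost:9999
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Partitions whose ISR is smaller than the replication factor; should normally be 0
                ObjectName underReplicated = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
                System.out.println("UnderReplicatedPartitions = "
                        + mbs.getAttribute(underReplicated, "Value"));

                // Rate at which replicas are dropping out of the ISR; spikes indicate trouble
                ObjectName isrShrinks = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec");
                System.out.println("IsrShrinksPerSec (1-min rate) = "
                        + mbs.getAttribute(isrShrinks, "OneMinuteRate"));
            }
        }
    }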

==========================================

The above is a rough translation of the original. As I said, many of the parameter settings are already common knowledge, and there are not many new ideas. However, the article organizes its thinking around the 4 aspects of throughput, latency, durability, and availability, and from that angle it is worth reading.

"Translate" to tune Apache Kafka cluster

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.