"Translate" to tune Apache Kafka cluster

Tags: apache flink, kafka streams

Today's post is a translation of "Tuning an Apache Kafka cluster". There are not many new ideas in it, but the summary is detailed: the article derives a different set of parameter configurations for each of four different goals, which makes it worth reading. The original is at: https://www.confluent.io/blog/optimizing-apache-kafka-deployment/

==========================================

Apache Kafka is currently the best enterprise-class streaming platform. Connect your applications to a Kafka cluster and Kafka takes care of the rest: automatic load balancing, zero-copy data transfer, automatic rebalancing when consumer group membership changes, persistent storage for application state, automatic failover of partition leaders, and so on. The dream of operations engineers has finally come true!

Translator's note: I have been looking at Apache Flink recently. As far as stream processing goes, Flink is no worse than Kafka Streams. As for whether Kafka is really the best streaming platform, take that with a grain of salt.

The default Kafka parameter configuration is enough to stand up a Kafka cluster for development and testing from scratch, but it usually does not fit a production environment, so some degree of tuning is required. After all, different usage scenarios have different requirements and performance targets, and the various parameters Kafka exposes exist to optimize for them. Kafka provides a large number of configuration options precisely so that a deployment can be made to meet its target requirements, so it is important to investigate the meaning of these parameters carefully and to test different values. All of this work should be done before Kafka goes into production, and the chosen configuration should take future growth of the cluster into account.

The tuning process is as follows:

    1. Define the tuning goals
    2. Configure the Kafka server-side and client-side parameters accordingly
    3. Run performance tests and monitor the relevant metrics to determine whether the requirements are met and whether further tuning is possible

I. Setting goals

The first step is to identify the performance tuning goals, which fall into 4 areas: throughput, latency, durability, and availability. Based on the actual usage scenario, determine which one (or ones) of these 4 goals you need to reach. Sometimes it can be hard to decide, so try this: get the team together, discuss the business usage scenarios, and work out what the main business goals are. There are two main reasons for setting goals explicitly:

    • "You can't have your cake and eat it"-you can't maximize all your goals. There must be a tradeoff between the 4 (tradeoff). Common tradeoff include tradeoffs between throughput and latency tradeoffs, persistence, and availability. But when we think about the whole system, we don't usually think about one aspect of it in isolation, but we need to consider it all. While they are not mutually exclusive, it is almost certainly impossible to make all the targets at the same time optimal.
    • We need to constantly adjust the Kafka configuration parameters to achieve these goals, and to ensure that our Kafka is optimized to meet the needs of the user's actual usage scenarios.

Here are some questions that can help you set your goals:

    • Do you expect Kafka to achieve high throughput (TPS, i.e. the producer production rate and the consumer consumption rate), for example millions of messages per second? Thanks to Kafka's design, producing very large volumes of messages is not difficult. Kafka is much faster than a traditional database or key-value store, and it can achieve this on ordinary hardware.
    • Do you expect Kafka to achieve low latency (i.e. the shorter the interval between a message being written and being read, the better)? A practical low-latency application is a chat program, where the faster a message arrives the better. Other examples include interactive websites where users expect to see their friends' activity in real time, and real-time stream processing in the Internet of Things.
    • Do you expect Kafka to provide high durability, that is, a successfully committed message is never lost? For example, an event-driven microservices data pipeline that uses Kafka as the underlying data store requires that Kafka not lose events. Likewise, when a stream processing framework reads from persistent storage, it must not miss business-critical events.
    • Do you expect high availability from Kafka, meaning no overall service outage even when a crash occurs? Kafka is a distributed system and is naturally able to tolerate crashes. If high availability is your primary goal, configuring specific parameters ensures that Kafka can recover from a crash in a timely manner.

II. Configuring parameters

Below we discuss the optimization of each of these four goals and the corresponding parameter settings. The parameters cover different configurations on the producer side, broker side, and consumer side. As mentioned earlier, many of these configurations involve some degree of trade-off, so be clear about what each one really means and apply it with the target in mind.

Producer side

    • batch.size
    • linger.ms
    • compression.type
    • acks
    • retries
    • max.in.flight.requests.per.connection
    • buffer.memory

Broker side

    • default.replication.factor
    • num.replica.fetchers
    • auto.create.topics.enable
    • min.insync.replicas
    • unclean.leader.election.enable
    • broker.rack
    • log.flush.interval.messages
    • log.flush.interval.ms
    • num.recovery.threads.per.data.dir

Consumer side

    • fetch.min.bytes
    • auto.commit.enable
    • session.timeout.ms

1 Tuning throughput

Producer side

    • batch.size = 100000-200000 (the default of 16384 is usually too small)
    • linger.ms = 10-100 (default is 0)
    • compression.type = lz4
    • acks = 1
    • retries = 0
    • buffer.memory: if there are many partitions, increase this appropriately (default is 32MB)

Consumer side

    • fetch.min.bytes = 10 ~ 100000 (default is 1)
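
To make the producer-side settings above concrete, here is a minimal Java sketch of a throughput-oriented producer. The broker address, topic name, and the exact values chosen (a 128 KB batch size, a 50 ms linger, a 64 MB buffer) are assumptions for illustration, not values from the article; pick values inside the suggested ranges that match your own workload.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ThroughputProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");             // assumption: local broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // Throughput-oriented settings from the list above
            props.put("batch.size", 131072);       // ~128 KB, inside the suggested 100000-200000 range
            props.put("linger.ms", 50);            // wait up to 50 ms so batches can fill up
            props.put("compression.type", "lz4");  // cheap compression, fewer bytes on the wire
            props.put("acks", "1");                // only the partition leader acknowledges
            props.put("retries", 0);
            props.put("buffer.memory", 67108864L); // 64 MB, raised from the 32 MB default

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 1000; i++) {
                    producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "message-" + i));
                }
                producer.flush();
            }
        }
    }

On the consumer side, raising fetch.min.bytes as listed above lets the broker accumulate more data per fetch response, trading a little extra latency for fewer, larger requests.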

2 Tuning latency

Producer side

    • linger.ms = 0
    • compression.type = none
    • acks = 1

Broker side

    • num.replica.fetchers: if replicas frequently drop out of and rejoin the ISR, or followers cannot catch up with the leader, increase this value appropriately, but usually not beyond the number of CPU cores + 1

Consumer side

    • fetch.min.bytes = 1
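
Here is a minimal sketch of the client-side settings for low latency. The broker address and group id are made-up placeholders, and the broker-side num.replica.fetchers change belongs in server.properties, so it only appears as a comment.

    import java.util.Properties;

    public class LatencyConfigs {

        // Broker side (server.properties), not client code:
        //   num.replica.fetchers=<raise if followers fall behind; usually at most CPU cores + 1>

        // Producer settings aimed at low end-to-end latency
        static Properties producerProps() {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("linger.ms", 0);              // send immediately, do not wait for batches to fill
            p.put("compression.type", "none");  // skip the compression CPU cost
            p.put("acks", "1");                 // wait only for the leader
            return p;
        }

        // Consumer settings: the broker answers fetches as soon as any data is available
        static Properties consumerProps() {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            c.put("group.id", "low-latency-group");        // hypothetical group id
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("fetch.min.bytes", 1);        // do not wait for data to accumulate on the broker
            return c;
        }
    }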

3 Tuning durability

Producer side

    • replication.factor = 3 (a topic-level setting, applied when the topic is created)
    • acks = all
    • retries = a relatively large value, e.g. 5 ~ 10
    • max.in.flight.requests.per.connection = 1 (prevents message reordering)

Broker side

    • default.replication.factor = 3
    • auto.create.topics.enable = false
    • min.insync.replicas = 2, i.e. set it to replication factor - 1
    • unclean.leader.election.enable = false
    • broker.rack: if rack information is available, it is best to set this value so that replicas are spread across racks for higher durability
    • log.flush.interval.messages and log.flush.interval.ms: for particularly important topics whose TPS is not high, consider setting a low value such as 1

Consumer side

    • auto.commit.enable = false, and commit offsets manually
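
A minimal Java sketch of the durability-oriented client settings, assuming a local broker and a hypothetical topic and consumer group. The broker-side settings above (default.replication.factor, min.insync.replicas, unclean.leader.election.enable, broker.rack, the flush intervals) belong in server.properties and are not repeated here.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DurabilityExample {
        public static void main(String[] args) {
            // Producer: do not treat a send as successful until all in-sync replicas have it
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("acks", "all");
            p.put("retries", 10);
            p.put("max.in.flight.requests.per.connection", 1);  // avoid reordering on retry

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("critical-events", "key", "value"));  // hypothetical topic
                producer.flush();
            }

            // Consumer: disable auto-commit and commit offsets only after processing succeeds
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            c.put("group.id", "durable-consumers");        // hypothetical group id
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("enable.auto.commit", "false");          // auto.commit.enable on the old consumer

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(Collections.singletonList("critical-events"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("processing " + record.value());  // real processing goes here
                }
                consumer.commitSync();  // commit offsets manually, only after processing is done
            }
        }
    }

Committing offsets manually after processing means a consumer crash leads to reprocessing rather than silently lost messages.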

4 Tuning High Availability

Broker side

    • unclean.leader.election.enable = true
    • min.insync.replicas = 1
    • num.recovery.threads.per.data.dir = the number of directories configured in log.dirs

Consumer side

    • session.timeout.ms: set it as low as is practical
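
A short sketch of the availability-oriented settings, assuming a local broker and a hypothetical group id; the broker-side values only appear as comments because they go in server.properties.

    import java.util.Properties;

    public class AvailabilityConfigs {

        // Broker side (server.properties), not client code:
        //   unclean.leader.election.enable=true
        //   min.insync.replicas=1
        //   num.recovery.threads.per.data.dir=<number of directories in log.dirs>

        // Consumer settings: detect dead consumers quickly so partitions are reassigned fast
        static Properties consumerProps() {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
            c.put("group.id", "ha-consumers");             // hypothetical group id
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("session.timeout.ms", 6000);  // low value; the broker's group.min.session.timeout.ms bounds how low this can go
            return c;
        }
    }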

III. Monitoring metrics

1 Operating-system-level metrics

    • Memory utilization
    • Disk usage
    • CPU usage
    • Number of open file handles
    • Disk I/O utilization
    • Network bandwidth utilization

2 Common Kafka JMX metrics

3 JMX metrics that help locate bottlenecks

4 Common client-side JMX metrics

5 Broker-side ISR-related JMX metrics
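
As an example of what the broker-side ISR monitoring could look like in practice, here is a small JMX probe. It assumes the broker was started with JMX enabled on port 9999 (a made-up value, e.g. via the JMX_PORT environment variable); the two MBean names are Kafka's standard ReplicaManager metrics.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class IsrMetricsProbe {
        public static void main(String[] args) throws Exception {
            // Assumption: the broker exposes JMX at localhost:9999
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Partitions whose ISR is smaller than the replication factor; should normally be 0
                ObjectName underReplicated = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
                System.out.println("UnderReplicatedPartitions = "
                        + mbs.getAttribute(underReplicated, "Value"));

                // Rate at which replicas are dropping out of the ISR; spikes indicate trouble
                ObjectName isrShrinks = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec");
                System.out.println("IsrShrinksPerSec (1-min rate) = "
                        + mbs.getAttribute(isrShrinks, "OneMinuteRate"));
            }
        }
    }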

==========================================

The above is a rough translation of the original. As I said, many of the parameter settings are already common knowledge, and there are not many new ideas. However, the article organizes its thinking around the 4 aspects of throughput, latency, durability, and availability, and from that angle it is worth reading.

"Translate" to tune Apache Kafka cluster

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.