Secrets of Kafka performance parameters and stress tests
The previous article, Kafka high throughput performance secrets, explained how Kafka is designed for low latency and high throughput. It focused on the underlying principles and architecture, so it was largely theoretical. This time, from the perspective of application and operations, we will discuss how best to configure parameters and test performance once a cluster is in place. Kafka's configuration is detailed and complex, and comprehensive performance tuning requires mastering a great deal of information. Here I draw on practical experience from my own work to pick out the points with the greatest impact on cluster performance. Everything elaborated below is limited to the environment I describe, so please adapt it to your own environment.
Today's article is divided into two parts. The first part introduces the performance-related parameters, their meanings, and the tuning strategies I have summarized. The second part presents some control-group test results from my own practice. The specific values and results will vary with scenario, machine, and environment, but the overall approach and methods should carry over.
Before getting into the topic, here is the machine configuration used in this test:
Six physical machines: three deployed as Brokers, and three dedicated to generating load.
Each physical machine: 24 processors, 189 GB of memory, and 2 Gb/s of per-host network bandwidth.
During this test, I set the Broker's HeapSize to 30 GB in order to cover some "unconventional" usage.
Related Parameters
When debugging and optimizing a system developed in Java, the first step inevitably involves tuning the JVM, and Kafka is no exception. The focus of JVM tuning is memory.
In fact, the Kafka service itself does not need much memory. As described in detail in the previous article, Kafka relies on the PageCache provided by the operating system to meet its performance requirements, and tools such as VisualVM show clearly that Heap Space usage stays low. The purpose of setting 30 GB of memory in this test is to support higher concurrency: high concurrency inevitably demands more memory, and it also means that related caches such as the SocketBuffer grow accordingly. In actual use, the guiding principle for sizing memory is to leave as much free memory as possible to the operating system; the Broker itself needs little.
With the heap size settled, let's talk about the garbage collector. The official documentation recommends the newer G1 to replace CMS as the garbage collector. However, it also clearly points out that G1 still has stability problems on some earlier JDK builds (for example 1.7u21), and the recommended minimum version is JDK 1.7u51. The JVM memory configuration of the Broker in this test is as follows:
-Xms30g -Xmx30g -XX:PermSize=48m -XX:MaxPermSize=48m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
In fact, G1 was first introduced as an experimental version as early as JDK 1.6u14. However, owing to early missteps in its promotion, the fee required for production use, and its own bugs and instability, it did not see wide adoption until the later updates of JDK 1.7.
Compared with CMS, G1 has the following advantages:
The memory division method is different: the Eden, Survivor, and Old regions are no longer fixed in size, which makes memory usage more efficient. By dividing memory into Regions, G1 effectively avoids memory fragmentation.
G1 lets you specify a target for how long GC may pause application threads (strict compliance is not guaranteed), while CMS offers no such control.
CMS compacts memory only after a Full GC, whereas G1 combines collection and compaction.
CMS can only be used on the Old generation; Young collection usually pairs it with ParNew. G1 unifies the collection algorithm across both kinds of regions.
Applicable scenarios of G1:
Services where the JVM occupies a large amount of memory (at least 4 GB).
Applications that frequently allocate and release memory, generating a large number of memory fragments.
Applications that are sensitive to GC pause times.
Next, we will summarize the configuration items that may affect the performance of Kafka.
Broker
num.network.threads: 3
The number of threads used to receive and process network requests; the default is 3. The internal implementation follows the Selector (Reactor) model: one thread is started as an Acceptor to establish connections, and then num.network.threads processor threads take turns reading requests from the Sockets. Generally this does not need to change unless the volume of concurrent upstream and downstream requests is very large.
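To make this threading model concrete, here is a minimal Java NIO sketch of the same pattern: a single Acceptor thread hands accepted connections round-robin to a fixed pool of processor threads, each owning a private Selector. This illustrates the model only, not Kafka's actual SocketServer code; the port, class names, and buffer size are invented for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ReactorSketch {
    static final int PORT = 9092;                // illustrative port
    static final int NUM_NETWORK_THREADS = 3;    // mirrors num.network.threads

    public static void main(String[] args) throws IOException {
        Processor[] processors = new Processor[NUM_NETWORK_THREADS];
        for (int i = 0; i < processors.length; i++) {
            processors[i] = new Processor();
            new Thread(processors[i], "processor-" + i).start();
        }
        // The single Acceptor thread: accepts connections and hands them
        // out to processors round-robin.
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(PORT));
        int next = 0;
        while (true) {
            SocketChannel ch = server.accept();   // blocking accept
            ch.configureBlocking(false);
            processors[next].assign(ch);
            next = (next + 1) % processors.length;
        }
    }

    // Each Processor owns a private Selector and services only the
    // connections assigned to it.
    static class Processor implements Runnable {
        private final Selector selector = Selector.open();
        private final ConcurrentLinkedQueue<SocketChannel> newChannels =
                new ConcurrentLinkedQueue<>();

        Processor() throws IOException {}

        void assign(SocketChannel ch) {
            newChannels.add(ch);
            selector.wakeup();                    // interrupt select()
        }

        @Override public void run() {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            try {
                while (true) {
                    SocketChannel ch;
                    while ((ch = newChannels.poll()) != null) {
                        ch.register(selector, SelectionKey.OP_READ);
                    }
                    selector.select(300);
                    for (SelectionKey key : selector.selectedKeys()) {
                        if (key.isReadable()) {
                            buf.clear();
                            int n = ((SocketChannel) key.channel()).read(buf);
                            if (n < 0) key.cancel();  // peer closed
                            // a real broker would parse a request here and
                            // queue it for the I/O (request handler) threads
                        }
                    }
                    selector.selectedKeys().clear();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
```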
num.partitions: 1
The number of Partitions directly affects the throughput of a Kafka cluster. For example, I once wrote a MapReduce task that read data from Kafka, with each Partition consumed by one Mapper. Because the number of Partitions was too small, there were too few Mappers and the task ran very slowly. In addition, when the Partition count is set too small for the volume of inbound and outbound data, or when the business logic is mismatched with the partitioning, a large amount of reading and writing concentrates on a few Partitions. With so many read/write requests focused on one or a few machines, it is easy to saturate the NIC. Then not only does that Partition hit a performance bottleneck, but other Partitions or services on the same Broker can be starved of network resources.
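For illustration, a topic can be created programmatically with an adequate partition count. The sketch below uses the modern kafka-clients AdminClient API, which postdates the 0.8-era brokers this article discusses; the broker address, topic name, and counts are example values.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 24 partitions spread load across brokers and give downstream
            // consumers (e.g. one Mapper per partition) enough parallelism;
            // replication factor 3 matches the three-Broker test cluster.
            NewTopic topic = new NewTopic("stress-test", 24, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```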
queued.max.requests: 500
This parameter caps the queue used to buffer network requests; once the queue is full, no new requests are accepted. It generally does not become a bottleneck unless I/O performance is very poor, in which case it should be adjusted together with num.io.threads and related settings.
Replica configurations:
replica.lag.time.max.ms: 10000
replica.lag.max.messages: 4000
num.replica.fetchers: 1
The previous article briefly introduced the meaning of the first two settings, so they are not repeated here; the focus is on the third. Any (Broker, Leader) tuple will have replication.factor - 1 Brokers acting as Replicas. Each Replica starts several Fetch threads to synchronize the corresponding data to the local machine, and num.replica.fetchers controls the number of these Fetch threads.
Generally, if you find that a Partition's ISR contains only the Leader itself and no new Replica catches up for a long time, you can increase this parameter to speed up replication. In the internal implementation, each Fetch thread corresponds to a SimpleConsumer: for each Leader on another machine that needs to be caught up with, the Broker creates num.replica.fetchers SimpleConsumers to pull the log.
When I first saw this design I was quite confused. The Kafka documentation states solemnly, right at the start, that within a Consumer Group a Consumer and a Partition must be in a one-to-one consumption relationship at any given time. So how can adding SimpleConsumers increase efficiency?
Checking the source code, we can see in AbstractFetcherThread.scala that the multiple threads started for fetching are in fact SimpleConsumers.
First, getFetcherId() uses numFetchers to bound the range of fetcher IDs, and hence the number of SimpleConsumers. The partitionsPerFetcher structure is a mapping that groups each Partition under the Fetcher started for it.
Each of the Fetchers (SimpleConsumers) started for the Partitions then uses partitionMap: mutable.HashMap[TopicAndPartition, Long] to share each Partition's offset, so that data can be fetched in parallel. The shared offsets preserve the one-to-one relationship between Consumer and Partition at any given time, while still allowing us to raise efficiency by adding Fetch threads.
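The sketch below re-expresses that idea in Java: partitions are hashed onto a bounded pool of fetchers (the role of getFetcherId()), and each fetcher keeps a per-partition offset map (the role of partitionMap). This is a simplified illustration under those assumptions, not Kafka's Scala source; names and counts are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FetcherSketch {
    static final int NUM_FETCHERS = 4;  // plays the role of num.replica.fetchers

    // Like getFetcherId(): bound the fetcher id by the configured pool size,
    // so the same partition is always served by the same fetcher.
    // The mask keeps the hash non-negative.
    static int fetcherId(String topic, int partition) {
        return ((31 * topic.hashCode() + partition) & 0x7fffffff) % NUM_FETCHERS;
    }

    // One map per fetcher, playing the role of
    // partitionMap: mutable.HashMap[TopicAndPartition, Long]:
    // the next offset to fetch for every partition the fetcher owns.
    static final Map<Integer, Map<String, Long>> offsetsPerFetcher =
            new ConcurrentHashMap<>();

    public static void main(String[] args) {
        String[] topics = {"logs", "metrics"};
        for (String topic : topics) {
            for (int p = 0; p < 8; p++) {
                int id = fetcherId(topic, p);
                offsetsPerFetcher
                        .computeIfAbsent(id, k -> new ConcurrentHashMap<>())
                        .put(topic + "-" + p, 0L);  // start fetching at offset 0
            }
        }
        // Each fetcher thread would loop over its own offset map, issue one
        // fetch per partition, and advance the stored offset on success.
        offsetsPerFetcher.forEach((id, parts) ->
                System.out.println("fetcher-" + id + " owns " + parts.keySet()));
    }
}
```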
default.replication.factor: 1
This parameter sets the default number of Replicas when a Topic is created. When the Producer's acks != 0 && acks != 1, the number of Replicas can significantly affect the performance of producing data. Too few Replicas hurt data availability; too many waste storage resources. Generally 2 to 3 is recommended.
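As a concrete illustration of the acks side of this trade-off, the following sketch configures a producer with acks=all, in which case every in-sync replica must acknowledge a write, so produce latency grows with the replication factor. It uses the modern kafka-clients Producer API (newer than the brokers this article benchmarks); the broker address and topic name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                StringSerializer.class.getName());
        // "all" waits for every in-sync replica; "1" waits only for the
        // leader; "0" does not wait at all.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("stress-test", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) exception.printStackTrace();
                        else System.out.println("acked at offset " + metadata.offset());
                    });
            producer.flush();
        }
    }
}
```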
fetch.purgatory.purge.interval.requests: 1000
producer.purgatory.purge.interval.requests: 1000
These two settings control how often, measured in number of requests, the fetch and producer request Purgatories are purged. The Purgatory holds requests that cannot be answered immediately, such as fetch requests waiting for enough data or produce requests waiting for acknowledgments, and purging it periodically frees the memory held by requests that have already been satisfied.