Kafka partition number and consumer number

Last Update:2018-10-21 Source: Internet

Author: User

Tags zookeeper

Is a larger number of Kafka partitions always better?

Advantages of multiple partitions

Kafka uses partitioning to spread a topic's messages across multiple partitions on different brokers, which enables high throughput for both producers and consumers. Producers and consumers can work in parallel across multiple threads, with each thread handling the data of one partition. The partition is therefore the smallest unit for tuning parallelism in Kafka. A producer uses multiple threads to open socket connections to the brokers that host the different partitions and sends messages to those partitions concurrently. On the consumer side, each consumer thread within a consumer group consumes from specific partitions of the subscribed topic.

So, in theory, the more partitions a topic has, the greater the throughput the whole cluster can achieve.

More partitions are not always better

Is it better to have more partitions? Obviously not, because each partition has its own overhead:

First, more memory is needed on both the client and the server. Since Kafka 0.8.2, the producer client has had a batch.size parameter, which defaults to 16 KB. The producer buffers messages per partition and sends them as a batch once the buffer is full. This looks like a design that improves performance, but because the parameter applies per partition, more partitions mean more memory for these buffers. With 10,000 partitions and the default setting, the buffers consume roughly 156 MB (10,000 × 16 KB). What about the consumer? Leaving aside the memory needed to fetch data, consider the thread overhead alone. With the same 10,000 partitions and a number of consumer threads matching the number of partitions (in most cases the configuration with the best consumption throughput), the consumer client creates 10,000 threads and about 10,000 sockets to fetch the partition data. The cost of context switching among that many threads is no longer negligible.
The server-side overhead is not small either. Reading the Kafka source shows that many server-side components, such as the Controller and FetcherManager, maintain partition-level caches in memory, so the more partitions there are, the more these caches cost.
Second, file handle overhead. Each partition has its own directory in the underlying file system. This directory usually contains two files: base_offset.log and base_offset.index. Kafka's controller and ReplicaManager keep both file handles open on each broker. The more partitions there are, the more file handles must stay open, which may eventually exceed your ulimit -n limit.
Third, reduced availability. Kafka achieves high availability through its replica mechanism: it keeps several copies of each partition (the replication factor specifies how many), each stored on a different broker. One replica acts as the leader and handles producer and consumer requests. The other replicas act as followers, and the Kafka controller is responsible for keeping them in sync with the leader. If the broker hosting the leader fails, the controller detects this and, with the help of ZooKeeper, elects a new leader. During this time there is a short window of unavailability, although in most cases it may last only a few milliseconds. But if you have 10,000 partitions and 10 brokers, each broker hosts 1,000 partitions on average. If one broker dies, ZooKeeper and the controller must immediately elect leaders for those 1,000 partitions. This takes much longer than electing leaders for a handful of partitions, and the cost usually does not simply add up linearly. It is even worse if the failed broker was also the controller.
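
The arithmetic behind these overheads can be sketched in a few lines of Scala. This is only a back-of-the-envelope estimate, not anything taken from Kafka itself: the object and method names are hypothetical, 16 KB is taken as 16 × 1024 bytes, and leaders are assumed to be spread evenly across brokers.

```scala
object PartitionOverhead {
  // Producer-side batch buffers: one buffer of `batchSizeBytes` per partition.
  // The default mirrors the 16 KB batch.size mentioned in the text.
  def producerBatchBytes(partitions: Int, batchSizeBytes: Long = 16L * 1024): Long =
    partitions.toLong * batchSizeBytes

  // Number of partitions that need a new leader when one broker fails,
  // assuming (hypothetically) that leaders are spread evenly across brokers.
  def leadersPerBroker(partitions: Int, brokers: Int): Int = {
    require(brokers > 0, "at least one broker is required")
    math.ceil(partitions.toDouble / brokers).toInt
  }

  def main(args: Array[String]): Unit = {
    val bytes = producerBatchBytes(10000)
    println(s"producer batch buffers: $bytes bytes (${bytes / (1024 * 1024)} MiB, rounded down)")
    println(s"leaders to re-elect after one broker failure: ${leadersPerBroker(10000, 10)}")
  }
}
```

The estimate ignores everything else that grows with the partition count (server-side caches, file handles, threads), so it is a lower bound rather than a sizing tool.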

How do I determine the number of partitions?

You can determine the number of partitions with the following steps. Create a topic with only 1 partition, then measure the producer throughput and the consumer throughput for that topic. Call these values Tp and Tc respectively, in a unit such as MB/s. If the total target throughput is Tt, then the number of partitions = max(Tt/Tp, Tt/Tc), so that both the producer side and the consumer side can reach the target.

Note: Tp is the throughput of the producer. Testing the producer is usually easy because its logic is simple: it just sends messages directly to Kafka. Tc is the throughput of the consumer. Measuring Tc depends more on the application, because its value depends on what you do with each message after receiving it, so testing Tc is usually more cumbersome.
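
As a rough illustration of this sizing rule, the sketch below computes a partition count from measured throughputs. It assumes the count must be large enough for the slower of the two sides, that is max(Tt/Tp, Tt/Tc) rounded up; the object name and the sample numbers are made up for the example.

```scala
object PartitionCount {
  // tt: target total throughput, tp / tc: measured single-partition producer /
  // consumer throughput, all in the same unit (for example MB/s).
  def required(tt: Double, tp: Double, tc: Double): Int = {
    require(tt > 0 && tp > 0 && tc > 0, "throughputs must be positive")
    // Enough partitions for the slower side, rounded up to a whole partition.
    math.ceil(math.max(tt / tp, tt / tc)).toInt
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical measurements: producer 20 MB/s, consumer 50 MB/s, target 100 MB/s.
    println(required(tt = 100, tp = 20, tc = 50))
  }
}
```

The result is a starting point for load testing, not a guarantee; measured throughput per partition rarely scales perfectly as partitions are added.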

How does a message know which partition to go to?

Assignment by key

By default, Kafka chooses the partition based on the key of the message, that is, hash(key) % numPartitions:

def partition(key: Any, numPartitions: Int): Int = {
  Utils.abs(key.hashCode) % numPartitions
}

This ensures that messages of the same key must be routed to the same partition.
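
A small self-contained sketch makes this behavior easy to try out. It does not use Kafka's own Utils class; nonNegative is a hypothetical stand-in that maps a hash code to a non-negative value, and the key names are invented.

```scala
object KeyPartitioning {
  // Hypothetical stand-in for a non-negative hash. Int.MinValue has no
  // positive counterpart, so it is mapped to 0 explicitly.
  def nonNegative(n: Int): Int = if (n == Int.MinValue) 0 else math.abs(n)

  // Same shape as the partition function above: hash(key) % numPartitions.
  def partition(key: Any, numPartitions: Int): Int =
    nonNegative(key.hashCode) % numPartitions

  def main(args: Array[String]): Unit = {
    val keys = Seq("user-1", "user-2", "user-1", "user-3", "user-2")
    // Repeated keys always print the same partition number.
    keys.foreach(k => println(s"$k -> partition ${partition(k, 4)}"))
  }
}
```

Note that the mapping depends on numPartitions, so adding partitions to a topic later can send an existing key to a different partition.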

When the key is null, the partition ID is taken from the cache or chosen at random

If you don't specify a key, then how does Kafka determine which partition the message goes to?

if (key == null) {
  // No key was specified
  val id = sendPartitionPerTopicCache.get(topic)
  // Check whether Kafka already has a cached partition ID for this topic
  id match {
    case Some(partitionId) =>
      // If so, use this partition ID directly
      partitionId
    case None =>
      // If not, find all partitions whose leader broker is available
      val availablePartitions = topicPartitionList.filter(_.leaderBrokerIdOpt.isDefined)
      if (availablePartitions.isEmpty)
        throw new LeaderNotAvailableException("No leader for any partition in topic " + topic)
      // Pick one of them at random
      val index = Utils.abs(Random.nextInt) % availablePartitions.size
      val partitionId = availablePartitions(index).partitionId
      // Update the cache so the ID can be reused directly next time
      sendPartitionPerTopicCache.put(topic, partitionId)
      partitionId
  }
}

When no key is specified, Kafka picks a partition almost at random for the keyless message and then stores the partition ID in the cache so that it can be reused directly. Kafka itself clears this cache periodically (by default every 10 minutes, or whenever the topic metadata is requested).
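
The cache-then-random behavior described above can be modeled outside Kafka with a small class. This is a simplified sketch, not the producer's real implementation: StickyPartitionChooser and its methods are hypothetical names, partitions are represented by plain integer IDs, and clearing the cache is left to the caller.

```scala
import scala.collection.mutable
import scala.util.Random

// Hypothetical model of "pick a random partition once, then reuse it".
class StickyPartitionChooser(random: Random = new Random) {
  private val cache = mutable.Map.empty[String, Int]

  // availablePartitions: IDs of the partitions that currently have a leader.
  def choose(topic: String, availablePartitions: IndexedSeq[Int]): Int =
    cache.getOrElseUpdate(topic, {
      require(availablePartitions.nonEmpty, s"no leader for any partition in topic $topic")
      availablePartitions(random.nextInt(availablePartitions.size))
    })

  // Models the periodic cache reset mentioned in the text.
  def clear(): Unit = cache.clear()
}

object StickyPartitionDemo {
  def main(args: Array[String]): Unit = {
    val chooser = new StickyPartitionChooser()
    val picks = (1 to 5).map(_ => chooser.choose("T1", Vector(0, 1, 2)))
    println(picks.distinct.size) // all five picks hit the cached partition
  }
}
```

One consequence worth keeping in mind: until the cache is cleared, every keyless message for a topic lands on the same partition, so short test runs can look badly unbalanced.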

What is the relationship between the number of consumers and the number of partitions?

A partition of a topic can be consumed by only one consumer thread within a consumer group, but the reverse does not hold: a single consumer thread can consume data from multiple partitions. For example, the consumer supplied with Kafka by default uses just one thread that consumes data from all partitions.

In other words, the number of partitions determines the maximum number of consumers in the same group that can consume in parallel.

So, if a topic has N partitions, the best number of consumer threads is also N, which usually achieves maximum throughput. Configuring more than N threads wastes system resources, because the extra threads are not assigned any partitions.

Partition assignment strategies for consumers

Kafka provides two assignment strategies, Range and RoundRobin, selected with the parameter partition.assignment.strategy. The default is the Range strategy.

Kafka performs a partition assignment when any of the following events occurs:

A new consumer joins the consumer group
A consumer leaves the consumer group it belongs to, whether by shutting down or by crashing
A partition is added to a subscribed topic

Moving the ownership of a partition from one consumer to another is called rebalancing (rebalance), and how a rebalance is carried out depends on the partition assignment strategy discussed in this article.
Here we look in detail at the two partition assignment strategies built into Kafka. This article assumes a topic named T1 that contains 10 partitions, and two consumers (C1, C2) that consume data from these 10 partitions, where C1 has num.streams = 1 and C2 has num.streams = 2.

Range strategy

The Range strategy works per topic. It first sorts the partitions of the topic by number and sorts the consumer threads alphabetically. In our example, the partition order is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and the consumer thread order is C1-0, C2-0, C2-1. It then divides the number of partitions by the total number of consumer threads to decide how many partitions each thread consumes. If the division is not exact, the first few consumer threads each consume one extra partition. In our example there are 10 partitions and 3 consumer threads; 10 / 3 = 3 with a remainder of 1, so consumer thread C1-0 consumes one extra partition, and the final partition assignment looks like this:

C1-0 will consume partitions 0, 1, 2, 3
C2-0 will consume partitions 4, 5, 6
C2-1 will consume partitions 7, 8, 9

If we had 11 partitions, the final partition assignment would look like this:

C1-0 will consume partitions 0, 1, 2, 3
C2-0 will consume partitions 4, 5, 6, 7
C2-1 will consume partitions 8, 9, 10

If we have 2 topics (T1 and T2), each with 10 partitions, the final partition assignment looks like this:

C1-0 will consume partitions 0, 1, 2, 3 of T1 and partitions 0, 1, 2, 3 of T2
C2-0 will consume partitions 4, 5, 6 of T1 and partitions 4, 5, 6 of T2
C2-1 will consume partitions 7, 8, 9 of T1 and partitions 7, 8, 9 of T2

As you can see, consumer thread C1-0 consumes 2 more partitions than each of the other consumer threads, which is a clear drawback of the Range strategy.
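
The Range examples above can be reproduced with a short sketch. It follows the description in this article rather than Kafka's actual assignor code; RangeAssignment and assign are hypothetical names, and the function handles a single topic at a time.

```scala
object RangeAssignment {
  // Returns consumer thread -> partition IDs for one topic with partitions
  // numbered 0 until numPartitions.
  def assign(numPartitions: Int, threads: Seq[String]): Map[String, Seq[Int]] = {
    require(threads.nonEmpty, "at least one consumer thread is required")
    val sorted = threads.sorted                  // threads in alphabetical order
    val perThread = numPartitions / sorted.size  // base share of every thread
    val extra = numPartitions % sorted.size      // first `extra` threads get one more
    sorted.zipWithIndex.map { case (thread, i) =>
      val start = perThread * i + math.min(i, extra)
      val count = perThread + (if (i < extra) 1 else 0)
      thread -> (start until start + count)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val threads = Seq("C1-0", "C2-0", "C2-1")
    for (n <- Seq(10, 11)) {
      println(s"$n partitions:")
      assign(n, threads).toSeq.sortBy(_._1).foreach { case (t, ps) =>
        println(s"  $t -> ${ps.mkString(", ")}")
      }
    }
  }
}
```

Because the function runs once per topic, the thread that sorts first picks up the extra partition of every topic, which is exactly how the imbalance in the two-topic example accumulates.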

RoundRobin strategy

Two prerequisites must be met to use the RoundRobin strategy:

The num.streams of all consumers in the same consumer group must be equal;
Every consumer must subscribe to the same topics.

Therefore, assume here that both of the consumers mentioned above have num.streams = 2. The RoundRobin strategy works as follows: build a TopicAndPartition list from all partitions of all topics, then sort the list by hash code, as the following code shows:

val allTopicPartitions = ctx.partitionsForTopic.flatMap { case (topic, partitions) =>
  info("Consumer %s rebalancing the following partitions for topic %s: %s"
    .format(ctx.consumerId, topic, partitions))
  partitions.map(partition => {
    TopicAndPartition(topic, partition)
  })
}.toSeq.sortWith((topicPartition1, topicPartition2) => {
  /*
   * Randomize the order by taking the hashcode to reduce the likelihood of all partitions of a given topic ending
   * up on one consumer (if it had a high enough stream count).
   */
  topicPartition1.toString.hashCode < topicPartition2.toString.hashCode
})

Finally, the partitions are assigned to the consumer threads in round-robin fashion.

In this example, if the topic-partitions sorted by hash code are T1-5, T1-3, T1-0, T1-8, T1-2, T1-1, T1-4, T1-7, T1-6, T1-9, and the consumer threads are ordered C1-0, C1-1, C2-0, C2-1, the final partition assignment is:

C1-0 will consume partitions T1-5, T1-2, T1-6;
C1-1 will consume partitions T1-3, T1-1, T1-9;
C2-0 will consume partitions T1-0, T1-4;
C2-1 will consume partitions T1-8, T1-7;
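
The round-robin step can be checked with a short sketch. It takes the hash-sorted partition order from the example as given instead of recomputing hash codes, and it is an illustration of the idea rather than Kafka's assignor; RoundRobinAssignment and assign are hypothetical names.

```scala
object RoundRobinAssignment {
  // orderedPartitions: topic-partitions already sorted (here: the order from the text).
  // Returns consumer thread -> assigned topic-partitions, in assignment order.
  def assign(orderedPartitions: Seq[String], threads: Seq[String]): Map[String, Seq[String]] = {
    require(threads.nonEmpty, "at least one consumer thread is required")
    val sorted = threads.sorted.toIndexedSeq
    orderedPartitions.zipWithIndex
      .groupBy { case (_, i) => sorted(i % sorted.size) } // deal partitions out in turn
      .map { case (thread, ps) => thread -> ps.map(_._1) }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq("T1-5", "T1-3", "T1-0", "T1-8", "T1-2",
                         "T1-1", "T1-4", "T1-7", "T1-6", "T1-9")
    val threads = Seq("C1-0", "C1-1", "C2-0", "C2-1")
    assign(partitions, threads).toSeq.sortBy(_._1).foreach { case (t, ps) =>
      println(s"$t -> ${ps.mkString(", ")}")
    }
  }
}
```

In this sketch a thread that receives no partition simply does not appear in the resulting map, which happens whenever there are more threads than partitions.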

Partition assignment for multiple topics works the same way as for a single topic. Unfortunately, the partition assignment strategy cannot yet be customized; you can only choose range or roundrobin through the partition.assignment.strategy parameter.

Transferred from: https://www.jianshu.com/p/dbbca800f607
