One of the most important functions of Kafka topics is to let consumers pick out the subset of messages they want to consume. At one extreme, putting all of your data into a single topic is probably a bad idea, because consumers then have no way of choosing the events they are interested in; they have to consume everything. At the other extreme, having millions of different topics is also a bad idea, because every topic in Kafka has a cost, and a huge number of topics will hurt performance.
Actually, from a performance point of view, the number of partitions is the key factor. Every Kafka topic has at least one partition, so if you have n topics, you have at least n partitions. Some time ago, Jun Rao wrote a blog post explaining the cost of having many partitions (end-to-end latency, file descriptors, memory overhead, recovery time after a failure). As a rule of thumb, if you care about latency, a few hundred partitions per broker node is probably fine; once you get into tens of thousands of partitions per node, latency degrades significantly.
This performance discussion gives some guidance for designing your topic structure: if you find yourself with many thousands of topics, it is probably wise to merge some fine-grained, low-throughput topics into coarser-grained topics, which avoids a proliferation of partitions.
However, performance is not the only thing we care about. In my opinion, even more important are the data integrity and data modeling aspects of your topic structure. We will discuss these in the remainder of this article.
Is a topic a collection of events of the same type?
It is generally accepted that events of the same type belong in the same topic and that different event types should go into different topics. This idea is reminiscent of relational databases, where a table is a collection of records of the same type, which gives us an analogy between database tables and Kafka topics.
The Confluent Avro Schema Registry reinforces this notion further, because it encourages you to use the same Avro schema for all messages in a topic. The schema can evolve while maintaining compatibility (for example, by adding optional fields), but ultimately all messages have to conform to a single record type. I will come back to this later.
For some kinds of streaming data, such as activity events, it is reasonable to require that all messages in a topic conform to the same schema. However, some people use Kafka more like a database, for example for event sourcing or for exchanging data between microservices. In those cases, I think it is less important to define a topic as a collection of messages with the same schema. What matters much more here is that the messages within a topic partition are ordered.
Imagine a scenario in which you have an entity (say, a customer) to which many different things can happen: a customer is created, the customer changes their address, a new credit card is added to the account, a customer service request is opened, the customer pays an invoice, and finally the account is closed.
The order of these events matters. For example, we expect other events to occur only after the customer has been created, and no further events to occur after the account has been closed. With Kafka, you keep events in order by putting them all in the same topic partition. In this example, you would use the customer ID as the partitioning key and put all of the events in one topic. They have to be in the same topic because different topics mean different partitions, and Kafka does not guarantee ordering across partitions.
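As a minimal sketch of this idea (the topic name customer-events, the broker address, and the JSON payloads are placeholders of mine, not taken from the original setup), a producer that keys every event by customer ID might look like this:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CustomerEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Partitioning key: all events for this customer land in the same partition
            String customerId = "customer-42";
            // Different event types, same key, same topic: ordering is preserved per customer
            producer.send(new ProducerRecord<>("customer-events", customerId,
                    "{\"type\":\"CustomerCreated\",\"customerId\":\"customer-42\"}"));
            producer.send(new ProducerRecord<>("customer-events", customerId,
                    "{\"type\":\"CustomerAddressChanged\",\"customerId\":\"customer-42\"}"));
        }
    }
}
```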
Ordering problems
If you use separate topics for CustomerCreated, CustomerAddressChanged, and CustomerInvoicePaid events, consumers of these topics may not see the correct order between them. For example, a consumer might see an address change for a customer that does not exist yet (the corresponding CustomerCreated event may simply be delayed).
The likelihood of events being seen out of order is even higher if a consumer is paused for a while (for example, for maintenance or to deploy a new version). While the consumer is stopped, events keep being published and are stored in the relevant topic partitions. When the consumer starts up again, it consumes the backlog of events from those partitions. If the consumer reads only one partition, there is no problem: the backlog is processed sequentially, in the order in which it was stored. But if the consumer reads several topics at the same time, it reads their data in an arbitrary order. It might read the entire backlog of one topic before reading the backlog of another, or it might interleave records from several topics.
So if you put CustomerCreated, CustomerAddressChanged, and CustomerInvoicePaid events into three separate topics, a consumer might see a CustomerAddressChanged event before the corresponding CustomerCreated event. In other words, the consumer may see an address change for a customer that, from its point of view, has not been created yet.
You might think of attaching a timestamp to each message and using it to order events. That may be fine if you are importing events into a data warehouse and sorting them there, but timestamps alone are not enough in streaming data: when you receive an event with a particular timestamp, you do not know whether you still need to wait for an event with an earlier timestamp, or whether all earlier events have already arrived. Relying on clocks for synchronization usually leads to nightmares; for more detail on clock problems, see Chapter 8 of Designing Data-Intensive Applications.
When to split topics, and when to merge them?
With this background, here are some rules of thumb to help you decide which data belongs in the same topic and which data should go into separate topics.
First, events that need to stay in a fixed order must go in the same topic (and must use the same partitioning key). The order of events matters most when the events belong to the same entity, so we can say that all events concerning the same entity should go into the same topic.
The ordering of events is especially important if you use event sourcing for data modeling: the state of an aggregate object is derived by replaying its event log in a specific order. Therefore, even though there may be several different event types, all of the events that an aggregate needs must be in the same topic.
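To make the replay idea concrete, here is a small hypothetical sketch: a Customer aggregate whose state is rebuilt by applying events in log order. The event classes and their fields are assumptions chosen for illustration, not part of any real API.

```java
import java.util.List;

// Hypothetical event types for a Customer aggregate (illustration only)
interface CustomerEvent {}
record CustomerCreated(String customerId, String name) implements CustomerEvent {}
record CustomerAddressChanged(String customerId, String address) implements CustomerEvent {}
record CustomerAccountClosed(String customerId) implements CustomerEvent {}

class Customer {
    String id;
    String name;
    String address;
    boolean closed;

    // Replaying the event log in order yields the current state of the aggregate;
    // replaying the same events in a different order could produce a wrong state.
    static Customer replay(List<CustomerEvent> events) {
        Customer c = new Customer();
        for (CustomerEvent e : events) {
            if (e instanceof CustomerCreated created) {
                c.id = created.customerId();
                c.name = created.name();
            } else if (e instanceof CustomerAddressChanged moved) {
                c.address = moved.address();
            } else if (e instanceof CustomerAccountClosed) {
                c.closed = true;
            }
        }
        return c;
    }
}
```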
What about events for different entities: should they go into the same topic or into separate topics? I would say that if one entity depends on another (for example, an address belongs to a customer), or if the entities are frequently needed together, they should also go into the same topic. On the other hand, if they are unrelated and owned by different teams, it is better to put them in separate topics.
It also depends on throughput: if one entity type has a much higher rate of events than the others, it is better to split it into separate topics so that it does not drown out consumers who only want the low-throughput entities (but see the fourth rule, below). Several entities that all have low throughput can, however, be combined.
What if an event involves more than one entity? For example, an order involves a product and a customer, and a transfer involves at least two accounts.
I recommend initially recording such an event as a single atomic message, rather than splitting it into several messages that go to different topics. When recording an event, it is best to keep it intact, preserving the data in as raw a form as possible. You can always use a stream processor later to split up a composite event, but it is much harder to reconstruct the original event if it was split prematurely. If you can give the initial event a unique ID (such as a UUID), then when you later split the original event, you can carry that ID along, making every derived event traceable back to its origin.
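Here is a minimal sketch of that pattern, with hypothetical event types of my own choosing: the composite order event gets a UUID when it is first recorded, and the per-entity events derived from it later carry that ID along.

```java
import java.util.UUID;

// Hypothetical composite event and the per-entity events derived from it (illustration only)
record OrderPlaced(String eventId, String orderId, String customerId, String productId) {}
record CustomerOrderEvent(String sourceEventId, String customerId, String orderId) {}
record ProductOrderEvent(String sourceEventId, String productId, String orderId) {}

class OrderSplitter {
    // Record the composite event once, as a single atomic message, with its own unique ID
    static OrderPlaced newOrder(String orderId, String customerId, String productId) {
        return new OrderPlaced(UUID.randomUUID().toString(), orderId, customerId, productId);
    }

    // A stream processor can later derive per-entity events; carrying the source
    // event ID makes each derived event traceable back to its origin.
    static CustomerOrderEvent forCustomer(OrderPlaced e) {
        return new CustomerOrderEvent(e.eventId(), e.customerId(), e.orderId());
    }

    static ProductOrderEvent forProduct(OrderPlaced e) {
        return new ProductOrderEvent(e.eventId(), e.productId(), e.orderId());
    }
}
```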
Also look at the number of topics a consumer needs to subscribe to. If several consumers all subscribe to a particular group of topics, that is a hint that those topics may belong together.
If you merge fine-grained topics into a coarser-grained one, some consumers will receive events they do not need and will have to ignore them. That is not a big deal: consuming messages is very cheap, so even if a consumer ends up ignoring more than half of the events, the overall cost is probably not significant. Only when a consumer has to ignore the overwhelming majority of messages (say, 99.9% of them) would I recommend splitting the high-volume event stream off from the low-volume one.
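As a rough sketch of what "consume and ignore" looks like (the topic name, group ID, and the naive string-based type check are all placeholders), a consumer that only cares about invoice payments could simply skip everything else:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class InvoiceEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "invoice-processor");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Skip event types we do not care about; consuming and ignoring them is cheap
                    if (!record.value().contains("\"type\":\"CustomerInvoicePaid\"")) {
                        continue;
                    }
                    System.out.println("Invoice paid by customer " + record.key());
                }
            }
        }
    }
}
```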
Changelog topics that back a Kafka Streams state store (KTable) should be kept separate from all other topics. In this case, the topic is managed by the Kafka Streams process and should not contain any other types of events.
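For illustration, a minimal Kafka Streams sketch that materializes a KTable from a topic might look like the following; the application ID and topic name are placeholders. The point is that the changelog topic backing this table is managed by Kafka Streams itself, so nothing else should write other event types into it.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class CustomerProfileTable {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-profile-app"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // The KTable is materialized from this topic; its backing changelog topic
        // is created and maintained by Kafka Streams.
        KTable<String, String> profiles = builder.table("customer-profile");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```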
Finally, what if none of the rules above gives you a clear answer? Then group events by type and put events of the same type in the same topic. However, I consider this rule the least important of them all.
Schema Management
If your data is plain text, such as JSON, and you do not use a statically defined schema, you can easily put different types of events in the same topic. However, if you use a schema-based encoding such as Avro, there is more to consider when holding multiple types of events in a single topic.
As mentioned above, the Avro-based Confluent Schema Registry works on the assumption that there is one schema per topic (more precisely, one schema for the message key and one for the message value). You can register new versions of a schema, and the registry checks that they are forward and backward compatible. A benefit of this design is that different producers and consumers can use different versions of the schema at the same time and still remain compatible with each other.
Confluent's Avro serializer registers schemas in the registry under a subject name. By default, the subject for a message key is <topic>-key and the subject for a message value is <topic>-value. The Schema Registry checks the compatibility of all schemas registered under a particular subject.
Recently, I contributed a patch (https://github.com/confluentinc/schema-registry/pull/680) to the Avro serializer that makes compatibility checking more flexible. The patch adds two new configuration options: key.subject.name.strategy (which defines how the subject name for a message key is constructed) and value.subject.name.strategy (which defines how the subject name for a message value is constructed). They can take the following values (a configuration sketch follows the list):
io.confluent.kafka.serializers.subject.TopicNameStrategy (default): the subject name for a message key is <topic>-key and for a message value is <topic>-value. This means that the schemas of all messages in a topic must be compatible with each other.
io.confluent.kafka.serializers.subject.RecordNameStrategy: the subject name is the fully qualified name of the Avro record type. The Schema Registry therefore checks compatibility per record type, regardless of topic. This setting allows a topic to contain different types of events.
io.confluent.kafka.serializers.subject.TopicRecordNameStrategy: the subject name is <topic>-<record type>, where <topic> is the Kafka topic name and <record type> is the fully qualified name of the Avro record type. This setting also allows a topic to contain different types of events, and additionally checks compatibility per record type scoped to the current topic.
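As a sketch of how this might be configured (the broker and registry addresses are placeholders), a producer that registers value schemas per record type rather than per topic could look like this:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class MultiTypeAvroProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder broker address
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Register value schemas under the Avro record name instead of the topic name,
        // so several record types can share one topic while each type stays compatible with itself.
        props.put("value.subject.name.strategy", "io.confluent.kafka.serializers.subject.RecordNameStrategy");

        KafkaProducer<String, Object> producer = new KafkaProducer<>(props);
        // ... send Avro records of different types to the same topic ...
        producer.close();
    }
}
```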
With this new feature, you can easily put all the different event types of a particular entity into the same topic. You are now free to choose the granularity of your topics, rather than being restricted to one topic per event type.