A very important design principle of distributed systems is loose coupling, that is, minimizing dependencies between subsystems. This way, subsystems can be evolved, maintained, and reused independently of each other. A message queue (MQ) is a good means of decoupling. For more information about the role of MQ in system integration, see the Enterprise Integration Patterns (EIP) book or its companion website. Simply put, the publisher only publishes a message to the MQ, regardless of who will consume it, and the consumer only retrieves messages from the MQ, regardless of who published them. This way, neither publisher nor consumer needs to know of the other's existence.
There are many MQ products, including many open-source ones; common examples are ActiveMQ, OpenMQ, and RabbitMQ. I have used MQ systems before, and recently I have been thinking about how to use an MQ in a SaaS system. Looking around for an MQ system with good scalability that can support large-scale data streams, one soon finds Kafka.
1. What is Kafka?
Kafka is a distributed MQ system developed and open-sourced by LinkedIn. It is now an Apache incubator project. Its homepage describes Kafka as a high-throughput distributed MQ that can distribute messages across different nodes. In a blog post, the author briefly mentioned the reasons for developing Kafka rather than selecting an existing MQ system. There are two: performance and scalability. These deserve some explanation.
Basically, most (if not all) MQ systems are designed for enterprise integration applications rather than for large-scale service applications. What is the difference between the two?
The basic characteristic of enterprise integration is integrating existing, unrelated applications within an enterprise. For example, an enterprise may want to integrate its financial system and its warehouse management system to reduce the cost and time of inter-department settlement and circulation, and to better support upper-layer decision-making. However, these two systems were made by different vendors and cannot be modified. In addition, enterprise integration is a continuous, incremental process with frequent requirement changes, so the MQ system must be flexible and customizable. Common MQ systems can therefore be customized through XML configuration or plug-in development to meet the business-process needs of different enterprises, and most provide varying levels of support for the patterns defined in EIP. Their design goals, however, do not focus much on scalability and performance, because enterprise applications generally do not have very large data streams or scale; even the relatively large ones can get by with a high-end server or a simple cluster of a few nodes.
A large-scale service refers to an application that targets the general public at the level of Facebook, Google, LinkedIn, or Taobao, or that may grow to this level. Compared with enterprise integration, the business processes of these applications are relatively stable, and the complexity of integration between subsystems is relatively low, because the subsystems are usually carefully selected, carefully designed, and can be adjusted. Therefore, the demands on MQ customizability and flexibility are not high. However, because the data volume is huge, a few servers cannot meet the requirements; dozens or even hundreds may be needed, and performance requirements are high in order to reduce costs. The MQ system must therefore scale well.
Kafka is an MQ system that meets these SaaS-style requirements: it improves performance and scalability by reducing the complexity of the MQ system.
2. Kafka Design
The design document of Kafka details its design philosophy. Here is a brief list and discussion.
Basic Concepts
Kafka works basically the same way as other MQ systems, but it differs in some of its naming and terminology. For a better discussion, here is a brief explanation of these terms; through them, you can get a general idea of how Kafka works.
- Producer (P): the client that sends messages to Kafka.
- Consumer (C): the client that retrieves messages from Kafka.
- Topic (T): can be understood as a queue.
- Consumer group (CG): the mechanism Kafka uses to support both broadcasting a topic's messages (sending to all consumers) and unicasting them (sending to exactly one consumer). A topic can have multiple CGs. Each message of a topic is copied (conceptually, not physically) to every CG, but within a CG the message is delivered to only one consumer. To implement broadcast, give each consumer its own CG; to implement unicast, put all consumers in the same CG. CGs also let you group consumers freely without having to send messages to multiple topics.
- Broker (B): a Kafka server is a broker. A cluster consists of multiple brokers, and a broker can host multiple topics.
- Partition (P): to achieve scalability, a very large topic can be distributed across multiple brokers (servers). Kafka only guarantees that messages within one partition are delivered to a consumer in order; it does not guarantee ordering across the whole topic (between multiple partitions).
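The broadcast/unicast semantics of consumer groups can be sketched in a few lines of Python. This is a toy model for illustration only, not the real Kafka protocol; the class and method names here are made up:

```python
from collections import defaultdict

class MiniTopic:
    """Toy model of consumer-group semantics: every group sees every
    message, but within a group each message goes to one consumer."""
    def __init__(self):
        self.groups = {}                   # group name -> consumer names
        self.inboxes = defaultdict(list)   # consumer name -> messages
        self.rr = defaultdict(int)         # round-robin counter per group

    def subscribe(self, group, consumer):
        self.groups.setdefault(group, []).append(consumer)

    def publish(self, message):
        for group, members in self.groups.items():
            # conceptually "copied" to each CG, delivered to one member
            target = members[self.rr[group] % len(members)]
            self.rr[group] += 1
            self.inboxes[target].append(message)

t = MiniTopic()
t.subscribe("workers", "w1"); t.subscribe("workers", "w2")  # unicast pair
t.subscribe("audit", "a1")                                  # its own CG
for m in ["m1", "m2", "m3"]:
    t.publish(m)
# "workers" split the messages; "audit" (a lone-consumer CG) got them all
assert t.inboxes["w1"] == ["m1", "m3"] and t.inboxes["w2"] == ["m2"]
assert t.inboxes["a1"] == ["m1", "m2", "m3"]
```

Note how the same publish path yields unicast or broadcast purely from how consumers are grouped, which is the point of the CG abstraction.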
Reliability (consistency)
An MQ must implement reliable message transmission and delivery from producer to consumer. Traditional MQ systems usually implement this through an ACK mechanism between the broker and the consumer, with the state of message delivery saved on the broker. Even so, consistency is hard to guarantee (refer to the original article). In Kafka, this state is kept by the consumer instead, and the broker does not track acknowledgments at all. Although this makes the consumer's burden heavier, it is actually more flexible: if a message needs to be reprocessed for any reason on the consumer side, it can simply be fetched from the broker again.
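A minimal sketch of this consumer-held state in plain Python (the names are hypothetical; the real Kafka client API is different). The "broker" is just a list, and the only thing the consumer persists is an offset:

```python
log = ["event-0", "event-1", "event-2", "event-3"]  # broker-side log

class OffsetConsumer:
    """Sketch of a consumer that tracks its own position in the log."""
    def __init__(self):
        self.offset = 0  # the only state the consumer must keep

    def poll(self, log, max_messages=2):
        # fetch the next batch starting from the remembered offset
        batch = log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        # reprocessing is just resetting the offset and fetching again
        self.offset = offset

c = OffsetConsumer()
first = c.poll(log)    # ["event-0", "event-1"]
second = c.poll(log)   # ["event-2", "event-3"]
c.rewind(1)
replay = c.poll(log)   # ["event-1", "event-2"], re-fetched from the log
```

Because the broker never tracks per-message delivery state, replaying is cheap: the consumer just rewinds its own offset.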
Kafka's producer has an asynchronous send operation to improve performance: the producer buffers the message in memory and returns immediately, so the caller (the application) does not have to wait for the network transmission to finish. Buffered messages are then sent to the broker in batches in the background. Because a message stays in memory for a period of time, there is a risk of losing it during that window, so you should evaluate this point carefully when using this operation.
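The idea, and the loss window it introduces, can be sketched as follows (a toy illustration with made-up names; the real producer flushes from a background thread, while here flushing is triggered synchronously by a size threshold):

```python
class BatchingProducer:
    """Sketch of async send: messages are buffered in memory and handed
    to the broker in batches, so send() returns to the caller at once."""
    def __init__(self, send_batch, batch_size=3):
        self.send_batch = send_batch  # callable that does the network send
        self.buffer = []
        self.batch_size = batch_size

    def send(self, message):
        # returns without touching the network
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_batch(self.buffer)
            # anything still buffered is lost if the process dies here
            self.buffer = []

sent = []
p = BatchingProducer(sent.append, batch_size=3)
for m in ["a", "b", "c", "d"]:
    p.send(m)
# the first three went out as one batch; "d" still sits in memory at risk
assert sent == [["a", "b", "c"]]
assert p.buffer == ["d"]
```

The final assertion is exactly the risk the design document warns about: "d" has been accepted from the caller but would be lost on a crash before the next flush.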
In addition, the latest version also implements message replication between brokers to remove the broker as a single point of failure (SPOF).
Scalability
Kafka uses ZooKeeper to implement dynamic cluster expansion without changing the configuration of the clients (producers and consumers). Brokers register themselves in ZooKeeper and keep the related metadata (topics, partition information, and so on) up to date there; clients register watchers on ZooKeeper, so when anything changes they can promptly detect it and adjust accordingly. This way, load is automatically rebalanced among brokers when brokers are added or removed.
Load Balancing
Load balancing has two parts: balancing the messages sent by producers across brokers, and balancing the messages read across consumers.
A producer keeps a connection pool to all brokers. When a message needs to be sent, the producer must decide which broker (partition) to send it to. This decision is made by a partitioner, which is implemented by the application; the application can implement any partitioning scheme. Achieving load balance while preserving message order (only messages within one partition/broker are delivered in order) makes the partitioner non-trivial to implement, and I personally think this remains to be improved.
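A common compromise is to partition by a message key, so that order is preserved per key while load still spreads across partitions. A sketch of such an application-supplied partitioner (hypothetical, not Kafka's built-in):

```python
import zlib

def key_partitioner(key: str, num_partitions: int) -> int:
    """Same key -> same partition, so per-key order is preserved while
    different keys spread the load across partitions."""
    # crc32 is deterministic across runs, unlike Python's built-in hash()
    return zlib.crc32(key.encode()) % num_partitions

# all messages for one user land on one partition -> ordered for that user
p1 = key_partitioner("user-42", 4)
p2 = key_partitioner("user-42", 4)
assert p1 == p2 and 0 <= p1 < 4
```

This trades global ordering for scalability: two different keys may be processed out of order relative to each other, which is exactly the limitation discussed above.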
When a consumer reads messages, in addition to the state of the current broker, it also needs to take the other consumers into account to determine which partition to read from. The specific mechanism is not very clear to me and needs further research.
Performance
Performance is a key factor in Kafka's design. Multiple techniques are used to ensure stable O(1) performance.
Kafka saves received messages in disk files. It uses a mechanism similar to a WAL (write-ahead log): messages are appended and periodically flushed to disk in batches, so disk writes are sequential, and messages are also read sequentially. This matches the sequential-read, append-write access pattern of an MQ.
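The append-only log idea can be sketched with an ordinary file (a toy illustration of the access pattern, not Kafka's actual on-disk format):

```python
import os
import tempfile

# one log "segment" file; writes only ever go to the end of it
path = os.path.join(tempfile.mkdtemp(), "segment.log")

def append(path, messages):
    with open(path, "ab") as f:        # append-only: sequential writes
        for m in messages:
            f.write(m.encode() + b"\n")
        f.flush()
        os.fsync(f.fileno())           # batch flush to disk

def read_from(path, offset):
    with open(path, "rb") as f:
        f.seek(offset)                 # consumers scan forward from here
        return f.read().splitlines()

append(path, ["m1", "m2"])
append(path, ["m3"])
msgs = read_from(path, 0)
assert msgs == [b"m1", b"m2", b"m3"]
```

Because both producers (appending) and consumers (scanning forward) touch the file sequentially, the disk sees almost no random I/O, which is what makes the stable performance possible.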
In addition, Kafka reduces network transmission through batched message transfers, and uses the sendfile (zero-copy) mechanism available in Java to reduce the number of memory copies and kernel/user mode switches between reading files and sending messages.
According to Kafka's performance test report, its performance basically reaches O(1) complexity.
3. Summary
From the above, I personally think that Kafka is well suited for simple message transmission and delivery at large data volumes. However, implementing the complex EIP patterns is not as easy as with traditional MQ systems. In addition, only messages within a partition are guaranteed to be delivered in order; if total message ordering is important and good scalability is also required, this may be difficult to achieve with Kafka. Therefore, Kafka is suitable for processing simple events and messages, such as log collection and real-time analysis of large amounts of fact data (Kafka can be integrated with MapReduce).
However, note that Kafka is still an Apache incubator project and is not yet very mature, although development activity is quite lively.