Before explaining why we use Kafka, we first need to understand what Kafka is.
1. What is Kafka
Kafka, a distributed messaging system developed by LinkedIn, is written in Scala and is widely used because it scales horizontally and delivers high throughput. A growing number of open-source distributed processing systems, such as Storm, Spark, and Flink, support integration with Kafka. Our real-time data processing platform also uses Kafka. Many different kinds of companies now use it as a data pipeline and as a messaging system.
2. Why use a messaging system
We said above that Kafka is a distributed messaging system. So why use a messaging system in our data processing platform, and what benefits does it bring?
(1) Decoupling
At the start of a project it is extremely difficult to anticipate what requirements will come up later. A messaging system inserts an implicit, data-based interface layer between processing stages, and the stages on both sides implement that interface. This lets you extend or modify the processing on either side independently, as long as both sides adhere to the same interface constraints.
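As a minimal sketch of this idea, the snippet below uses Python's standard `queue.Queue` as a stand-in for a messaging system. The field names `user_id` and `event` and all function names are hypothetical, chosen only for illustration. The producer and consumer share nothing except the message format.

```python
import json
import queue

# The only thing both sides share: the message format (hypothetical fields).
REQUIRED_FIELDS = {"user_id", "event"}

def encode(user_id, event):
    """Producer side: build a message that follows the shared format."""
    return json.dumps({"user_id": user_id, "event": event})

def decode(raw):
    """Consumer side: parse a message and check it follows the shared format."""
    message = json.loads(raw)
    missing = REQUIRED_FIELDS - set(message)
    if missing:
        raise ValueError(f"message is missing fields: {sorted(missing)}")
    return message

def produce(channel):
    channel.put(encode(1, "login"))
    channel.put(encode(2, "click"))

def count_events(channel):
    """One possible consumer; it can be replaced without touching produce()."""
    counts = {}
    while not channel.empty():
        message = decode(channel.get())
        counts[message["event"]] = counts.get(message["event"], 0) + 1
    return counts

channel = queue.Queue()
produce(channel)
event_counts = count_events(channel)
print(event_counts)
```

Replacing `count_events` with a different consumer requires no change to `produce`, provided the new consumer still accepts the same fields.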
(2) Redundancy
Sometimes a processing step fails. Unless the data has been persisted, it is lost. A message queue persists data until it has been fully processed, which avoids this risk of data loss. In the insert-get-delete paradigm used by many message queues, your processing system must explicitly indicate that a message has been processed before it is removed from the queue, so your data is kept safely until you have finished using it.
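The insert-get-delete paradigm can be illustrated with a toy in-memory queue. This is only a sketch: the class `AckQueue` and its method names are hypothetical, nothing is actually written to disk, and it does not represent how any particular product implements the paradigm.

```python
from collections import deque

class AckQueue:
    """Toy in-memory queue: a message is deleted only after an explicit ack."""

    def __init__(self):
        self._ready = deque()   # messages waiting to be delivered
        self._in_flight = {}    # delivered but not yet acknowledged
        self._next_id = 0

    def insert(self, payload):
        self._ready.append((self._next_id, payload))
        self._next_id += 1

    def get(self):
        """Hand out the next message but keep a copy until it is deleted."""
        message_id, payload = self._ready.popleft()
        self._in_flight[message_id] = payload
        return message_id, payload

    def delete(self, message_id):
        """Acknowledge: the consumer is done, so the message can be dropped."""
        del self._in_flight[message_id]

    def requeue_unacked(self):
        """Simulate recovery after a consumer crash: redeliver unacked messages."""
        for message_id in sorted(self._in_flight, reverse=True):
            self._ready.appendleft((message_id, self._in_flight.pop(message_id)))

    def pending(self):
        return len(self._ready) + len(self._in_flight)

q = AckQueue()
q.insert("order-1")
q.insert("order-2")

first_id, first = q.get()   # consumer receives "order-1" ...
q.requeue_unacked()         # ... but crashes before acknowledging it
retry_id, retry = q.get()   # the same message is delivered again
q.delete(retry_id)          # processed successfully this time
remaining = q.pending()
print(retry, remaining)
```

Because `get` does not remove the message, a crash between `get` and `delete` leaves it available for redelivery.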
(3) Scalability
Because a message queue decouples your processing stages, it is easy to increase the rate at which messages are enqueued and processed: you simply add more processing instances. There is no need to change code or tune parameters. Scaling out is as simple as turning up a dial.
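A rough sketch of this property, using threads and Python's standard `queue.Queue` in place of real consumers and a real message queue. The function `run_workers` and the upper-casing "processing" step are made up for illustration. The worker code is identical in both runs and only the number of workers changes.

```python
import queue
import threading

def run_workers(messages, worker_count):
    """Process all messages with `worker_count` identical workers."""
    tasks = queue.Queue()
    for message in messages:
        tasks.put(message)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                message = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            processed = message.upper()  # stand-in for real processing
            with results_lock:
                results.append(processed)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)

messages = [f"msg-{i}" for i in range(20)]
one_worker = run_workers(messages, worker_count=1)
four_workers = run_workers(messages, worker_count=4)
print(one_worker == four_workers)
```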
(4) Flexibility & peak handling capability
An application must keep working when traffic surges, but such bursts are uncommon, and it would be a huge waste to keep resources on standby that are sized for peak load. A message queue lets critical components withstand a sudden burst of requests instead of collapsing under the overload.
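The effect can be seen in a small deterministic simulation. The load numbers are invented for illustration: a consumer that handles 10 messages per tick faces a one-off burst of 40. The queue absorbs the burst and the backlog drains over the following ticks, so the consumer never needs to be sized for 40 messages per tick.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the queue depth after each tick for a fixed-capacity consumer."""
    backlog = 0
    depths = []
    for arrivals in arrivals_per_tick:
        backlog += arrivals                         # new messages are enqueued
        backlog -= min(backlog, capacity_per_tick)  # consumer drains what it can
        depths.append(backlog)
    return depths

# Hypothetical load: a steady 5 messages per tick, with one burst of 40.
arrivals = [5, 5, 40, 5, 5, 5, 5, 5, 5, 5]
depths = simulate_backlog(arrivals, capacity_per_tick=10)
print(depths)  # [0, 0, 30, 25, 20, 15, 10, 5, 0, 0]
```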
(5) Order Guarantee
In most use cases, the order in which data is processed matters. Most message queues are inherently ordered and can guarantee that data is processed in a specific order. Kafka guarantees the ordering of messages within a partition.
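To make the per-partition guarantee concrete, here is a simplified model that assumes records are routed to a partition by a hash of their key. The function names are hypothetical, and CRC32 is a stand-in chosen for this sketch, not the hash a Kafka client uses. Each partition is an append-only list, so records that share a key keep their relative order, while no order is implied across partitions.

```python
import zlib
from collections import defaultdict

def partition_for(key, partition_count):
    """Map a key to a partition. CRC32 is a stand-in hash for this sketch."""
    return zlib.crc32(key.encode("utf-8")) % partition_count

def append_all(records, partition_count):
    """Append (key, value) records to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in records:
        logs[partition_for(key, partition_count)].append((key, value))
    return logs

records = [("alice", 1), ("bob", 1), ("alice", 2), ("bob", 2), ("alice", 3)]
logs = append_all(records, partition_count=3)

# Within whichever partition a key lands in, its values keep their send order.
alice_log = logs[partition_for("alice", 3)]
alice_values = [value for key, value in alice_log if key == "alice"]
print(alice_values)  # [1, 2, 3]
```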
(6) Buffer
In any non-trivial system, some components need different processing times than others. For example, loading an image takes less time than applying a filter to it. A message queue acts as a buffer layer that helps each task run as efficiently as possible: writes to the queue can proceed as fast as possible, without waiting for the slower downstream stage. This buffering helps control and optimize the rate at which data flows through the system.
3. Why Kafka
We have established that a data processing system needs a messaging system, but why choose Kafka in particular? Kafka is not the only messaging system available, and as the saying goes, it pays to shop around, so let us look at how Kafka differs from other messaging systems.
The LinkedIn team ran an experimental study comparing the performance of Kafka with Apache ActiveMQ 5.4 and RabbitMQ 2.4. The experiments ran on two Linux machines, each with eight 2 GHz cores, 16 GB of memory, and 6 disks in RAID 10. The two machines were connected by a 1 Gb network link. One machine served as the broker and the other as the producer or consumer.
3.1 Producer Testing
For each system, a single producer published a total of 10 million messages of 200 bytes each. The Kafka producer sent messages in batches of 1 and 50. ActiveMQ and RabbitMQ do not appear to offer an easy way to batch messages, so LinkedIn assumed a batch size of 1 for them. The results are shown in the following figure:
The main reasons for Kafka's performance advantage are:
(1) The Kafka producer does not wait for acknowledgments from the broker and sends messages as fast as the broker can handle them.
(2) Kafka has a more efficient storage format. On average, each message carries 9 bytes of overhead in Kafka, compared with 144 bytes in ActiveMQ. This is due to the heavy message headers required by JMS and the cost of maintaining various index structures. LinkedIn notes that one of the busiest threads in ActiveMQ spent most of its time accessing a B-Tree to maintain message metadata and state.
3.2 Consumer Testing
For the consumer test, LinkedIn used a single consumer to retrieve a total of 10 million messages. All systems were configured so that each pull request prefetched approximately the same amount of data, up to 1000 messages or about 200 KB. For both ActiveMQ and RabbitMQ, LinkedIn set the consumer acknowledgment mode to automatic. The results are shown in the following figure:
The main reasons for Kafka's performance advantage are:
(1) Kafka has a more efficient storage format, so fewer bytes are transferred from the broker to the consumer.
(2) The brokers in both ActiveMQ and RabbitMQ must maintain the delivery state of every message. The LinkedIn team observed that one of the ActiveMQ threads was busy writing KahaDB pages to disk during the test. In contrast, the Kafka broker performed no disk writes. Finally, Kafka reduces transmission overhead by using the sendfile API.
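To give a feel for the sendfile approach mentioned above, here is a small sketch using Python's standard `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it and otherwise falls back to ordinary `send()` calls. It is not Kafka's code. The temporary file stands in for a log segment, the socket pair stands in for a broker-to-consumer connection, and all names are hypothetical.

```python
import os
import socket
import tempfile

def send_file_over_socket(path, sock):
    """Send a whole file with socket.sendfile(), which uses os.sendfile()
    where available, so the data need not be copied through user space."""
    with open(path, "rb") as source:
        return sock.sendfile(source)

# Hypothetical "log segment": 2000 bytes in a temporary file.
payload = b"x" * 2000
with tempfile.NamedTemporaryFile(delete=False) as segment:
    segment.write(payload)
    segment_path = segment.name

broker_side, consumer_side = socket.socketpair()
try:
    sent = send_file_over_socket(segment_path, broker_side)
    broker_side.close()  # signal end of stream to the reader

    received = b""
    while True:
        chunk = consumer_side.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    broker_side.close()
    consumer_side.close()
    os.remove(segment_path)

print(sent, received == payload)
```

The application never reads the file contents into its own buffer before sending them. Where the operating system supports `sendfile`, the kernel moves the bytes from the file to the socket directly.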