Analysis of Kafka design concepts


This article tries to explain the design concepts behind Kafka from the following two aspects:

  • Kafka's design background and motivation

  • Design Features of Kafka

Kafka's design background and motivation

Kafka was initially designed by LinkedIn to process activity stream data and operational data. Activity stream data is generated by user actions, such as page views and search keywords; a common scenario is the timeline, i.e. new-activity notifications and rankings of what users view. Operational data describes server health, such as CPU usage, load, and request counts. It is mostly consumed by background services: security systems can monitor it to detect users attacking the servers and take countermeasures, and performance monitoring can raise an alert the moment a problem occurs.

Both types of data fall into the log category. Traditional log systems such as Scribe collect the data and then derive results through offline batch processing, for example on a Hadoop cluster. Offline processing is generally infrequent, perhaps once an hour or even once a day, which does not suit real-time applications such as the timeline. Existing message queue systems fit such latency-sensitive scenarios well, but because they keep message queues in memory, the amount of data they can hold is limited. By now the attentive reader has probably guessed the motivation for Kafka: to merge the two scenarios into one system, so that both offline big-data analysis and online real-time analysis can be served by it.

Kafka's design also retains the common operations of a message queue, which is why it has been increasingly adopted as a general-purpose message queue since its birth; it is not limited to the two types of data mentioned above.

Design Features of Kafka

Kafka has the following features:

  • Message data is linearly accessed through a disk
  • Throughput
  • The consumption status is maintained by the consumer
  • Distributed

Message data is linearly accessed through a disk

This is probably Kafka's most surprising design choice. Intuitively, a hard disk reads and writes data far more slowly than memory, so almost all data-processing programs use memory whenever possible. Yet after research and testing, Kafka's designers boldly chose to keep all message data on disk. Their main reasoning is:

  • Hard disks provide excellent performance in linear read/write scenarios (see the sketch after this list).
    • According to the Kafka design documentation, linear writes reach roughly 600 MB/sec on a RAID-5 array of six 7200 rpm SATA drives, while random writes manage only on the order of 100 KB/sec. This excellent linear performance owes a great deal to the operating system: writes are buffered in the page cache, and reads benefit from read-ahead prefetching, which greatly improves read efficiency. Because this cache lives at the OS level, it even survives a crash and restart of the Kafka process.
  • Reduced JVM GC pressure.
    • Objects in the JVM carry overhead beyond their actual data (such as class metadata), so in-memory storage is not compact and wastes space. Worse, as the number of messages held in memory grows, GC is triggered more and more frequently, which severely hurts application response times. Keeping messages on disk instead of on the heap avoids this GC impact.
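
The following minimal Java sketch (my illustration, not from the original article) contrasts appending blocks sequentially with writing the same blocks at random offsets. The file names, block size, and counts are arbitrary, and the absolute numbers depend entirely on the hardware, but the gap between the two timings demonstrates the point:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Random;

    public class SeqVsRandomWrite {
        static final int BLOCK = 4096;      // write 4 KB at a time
        static final int COUNT = 25_000;    // ~100 MB in total

        public static void main(String[] args) throws IOException {
            ByteBuffer block = ByteBuffer.allocate(BLOCK);

            // Sequential: every write lands at the current end of the file.
            long t0 = System.nanoTime();
            try (FileChannel ch = FileChannel.open(Paths.get("seq.dat"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND)) {
                for (int i = 0; i < COUNT; i++) {
                    block.rewind();
                    ch.write(block);
                }
                ch.force(true);             // flush so the cache does not hide the cost
            }
            System.out.printf("sequential: %d ms%n", (System.nanoTime() - t0) / 1_000_000);

            // Random: seek to an arbitrary block-aligned offset before every write.
            Random rnd = new Random(42);
            long t1 = System.nanoTime();
            try (RandomAccessFile raf = new RandomAccessFile("rand.dat", "rw")) {
                raf.setLength((long) BLOCK * COUNT);
                for (int i = 0; i < COUNT; i++) {
                    raf.seek((long) rnd.nextInt(COUNT) * BLOCK);
                    raf.write(block.array());
                }
                raf.getFD().sync();         // same flush for a fair comparison
            }
            System.out.printf("random:     %d ms%n", (System.nanoTime() - t1) / 1_000_000);
        }
    }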

In the Kafka paper, a performance comparison with ActiveMQ and other message queues further validated this design.

Throughput

Kafka was designed to handle terabytes of data and therefore places great emphasis on throughput. It optimizes heavily for throughput on both the write path and the read path.

Write

A previous article described how Kafka stores message data: each topic-partition directory holds segment files named after their starting offset, such as 0000.kafka, 1024.kafka, and 2048.kafka. When Kafka starts, it opens every file in the directory as a channel, but only the last segment is opened for reading and writing; the rest are opened read-only. New messages are appended directly to the last file, so writes are strictly sequential. As discussed above, sequential writes are extremely fast, which is what guarantees write performance.
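
Below is a hypothetical, heavily simplified Java sketch of that scheme (the real broker is far more involved); the segment names and the 1 KB roll size merely mirror the 0000/1024/2048 names above:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Simplified segment log: only the newest segment file is writable,
    // all appends go to its tail, and a full segment is never written again.
    public class SegmentLog {
        static final long SEGMENT_BYTES = 1024;  // roll size, matching names like 1024.kafka

        private final Path dir;
        private long baseOffset;                 // offset of the first byte in the active segment
        private FileChannel active;              // the newest segment, opened for append

        SegmentLog(Path dir, long baseOffset) throws IOException {
            this.dir = dir;
            this.baseOffset = baseOffset;
            this.active = openSegment(baseOffset);
        }

        // Segment files are named after the offset at which they start.
        private FileChannel openSegment(long offset) throws IOException {
            Path p = dir.resolve(String.format("%04d.kafka", offset));
            return FileChannel.open(p, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        void append(byte[] message) throws IOException {
            active.write(ByteBuffer.wrap(message));  // sequential write at the tail
            if (active.size() >= SEGMENT_BYTES) {    // segment full: roll to a new one
                baseOffset += active.size();
                active.close();                      // older segments are read-only from here on
                active = openSegment(baseOffset);
            }
        }

        public static void main(String[] args) throws IOException {
            SegmentLog log = new SegmentLog(Paths.get("."), 0);
            for (int i = 0; i < 200; i++) {
                log.append(("message " + i + "\n").getBytes());
            }
        }
    }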

Read

Kafka reads data with sendfile, a system call that implements so-called zero-copy (interested readers can look up IBM's article on the technique). It reduces the number of copies the system makes, greatly improving the efficiency of data transfer. A brief description:

Consider a program that reads a file and sends its contents over a socket. The conventional path involves the following steps:

  • The program makes a syscall such as read, trapping into the kernel, which reads the file contents into the kernel page cache.

  • The kernel copies the file contents from the page cache into the program's user-space memory.

  • The program makes another syscall, such as write on the socket, trapping into the kernel again; the contents are copied from user space into the socket buffer, ready to send.

  • The socket buffer contents are copied to the NIC (network interface controller) buffer, and the data is sent.

That is two syscalls and four data copies, and two of those copies are clearly unnecessary: why not move data from the page cache straight to the NIC buffer? This is exactly what the sendfile system call does, and it greatly improves transfer efficiency. In Java, the corresponding call is

FileChannel.transferTo
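
For illustration, here is a minimal sketch of streaming a file to a socket with FileChannel.transferTo, which the JVM maps to sendfile where the platform supports it; the file name and port are made up:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ZeroCopySend {
        public static void main(String[] args) throws IOException {
            try (FileChannel file = FileChannel.open(Paths.get("0000.kafka"),
                                                     StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(
                         new InetSocketAddress("localhost", 9092))) {
                long position = 0;
                long remaining = file.size();
                // transferTo may move fewer bytes than requested, so loop until done.
                while (remaining > 0) {
                    long sent = file.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }

Note that the data never enters user space: the loop only shepherds offsets while the kernel moves the bytes.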

In addition, Kafka further improves throughput by handling messages in batches: sets of messages are compressed, transmitted, and stored together.
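
As a rough illustration of the batching idea only (this is not Kafka's actual message-set format), a group of small messages can be compressed into one payload so that per-message network and compression overhead is amortized across the batch:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.List;
    import java.util.zip.GZIPOutputStream;

    public class BatchCompressor {
        // Pack many small messages into one compressed payload so the cost of a
        // network round trip and of compression headers is amortized over the batch.
        static byte[] compressBatch(List<byte[]> messages) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                for (byte[] m : messages) {
                    gz.write(m);
                    gz.write('\n');          // illustrative record separator
                }
            }
            return buf.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            List<byte[]> batch = List.of(
                    "page_view user=1".getBytes(),
                    "page_view user=2".getBytes(),
                    "search q=kafka".getBytes());
            System.out.println("compressed batch: "
                    + compressBatch(batch).length + " bytes");
        }
    }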

The consumption status is maintained by the consumer

In Kafka, the consumption state of message data is maintained by the consumer itself. The reasons are not detailed here; interested readers can consult the Kafka paper. Briefly, the benefits are:

  • It removes the burden of tracking consumption state from the server.

  • It gives consumers freedom in how they store and use their position: the offset can live in ZooKeeper, a database, or HDFS, whatever suits the consumer's needs.

  • It accommodates special requirements: if processing a message fails, the consumer can roll back its offset and consume the message again (see the sketch after this list).
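
A hypothetical sketch of the idea: the consumer owns its position, decides when to advance it, and can refuse to advance it in order to re-consume after a failure. The file-based store here is purely illustrative; ZooKeeper or HDFS would serve the same role:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Consumer-side offset store: the consumer decides where its
    // position lives and when to move it.
    public class OffsetStore {
        private final Path file;

        OffsetStore(Path file) { this.file = file; }

        long load() throws IOException {
            if (!Files.exists(file)) return 0L;   // first run: start from the beginning
            return Long.parseLong(Files.readString(file).trim());
        }

        void save(long offset) throws IOException {
            Files.writeString(file, Long.toString(offset));
        }

        public static void main(String[] args) throws IOException {
            OffsetStore store = new OffsetStore(Paths.get("consumer.offset"));
            long offset = store.load();
            // ... fetch and process messages starting at `offset` ...
            boolean processingSucceeded = true;    // illustrative outcome
            if (processingSucceeded) {
                store.save(offset + 100);          // advance only after success
            } // on failure we simply do not save, so the next run re-consumes
        }
    }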

Distributed

Kafka's brokers, producers, and consumers can all be distributed. The implementation maintains information about all three in the cluster through ZooKeeper, which is how they find and interact with one another. A detailed analysis will follow in a later article; the sketch below shows the registration idea.
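
The coordination idea can be sketched with the standard ZooKeeper Java client; the znode path and addresses below are illustrative, not Kafka's actual registry layout. Each broker registers itself as an ephemeral node, which vanishes automatically when the broker's session dies, so the rest of the cluster notices the failure:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class BrokerRegistration {
        public static void main(String[] args) throws Exception {
            // Connect to the ZooKeeper ensemble (address is illustrative).
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

            // Assumes the parent znodes /brokers/ids already exist. An ephemeral
            // node lives only as long as this client's session, so a crashed
            // broker automatically disappears from the cluster view.
            String path = zk.create("/brokers/ids/0",
                    "host1:9092".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL);
            System.out.println("registered at " + path);
        }
    }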

With the above introduction you should have a feel for Kafka's design. It genuinely unifies offline big-data processing with online real-time processing: for offline processing, Kafka supports migrating data to HDFS in batches; for online processing, it is simply a matter of configuring consumers appropriately. LinkedIn once described a Kafka application in its paper that is worth sharing here: Kafka is used to update search indexes. When a user updates their data, a producer generates a message and sends it to the broker; a consumer picks it up immediately and updates the index, so the change becomes searchable within seconds.

Summary

This article has attempted to explain the origins and design features of Kafka; the author's ability is limited, so please leave a message if you find any mistakes. Next, I will make a simple analysis of the Kafka source code, and I hope interested readers will participate.
