Original address: http://blog.csdn.net/honglei915/article/details/37564757
A companion Kafka video tutorial is being released alongside this series; you are welcome to watch it.
Kafka puts a great deal of effort into efficiency. One of its primary use cases is processing website activity logs, whose volume is very large: every page view produces several writes. On the read side, even assuming each message is consumed only once, the read volume is also very large, so Kafka tries to make reads as lightweight as possible. We discussed disk performance earlier; for linear reads and writes, roughly two things hurt disk performance: too many small I/O operations and too many byte copies. The I/O problem occurs both between the client and the server and in the server's internal persistence operations.
Message Sets
To avoid these problems, Kafka introduced the concept of the message set, which groups messages together as the unit of processing. Handling messages as a set performs better than handling them one at a time: the producer sends a set of messages to the server rather than one message per request; the server appends a whole message set to the log file in a single operation, which reduces small I/O operations; and the consumer can likewise fetch a whole message set in one request.
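For illustration, here is a minimal sketch of this batching behavior using the modern Java producer client; the broker address, topic name, and tuning values are assumptions for the example, not part of the original article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Accumulate up to 64 KB of records per partition before sending...
        props.put("batch.size", "65536");
        // ...and wait up to 10 ms for a batch to fill, trading a little
        // latency for fewer, larger requests (example values).
        props.put("linger.ms", "10");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // These sends are accumulated into message sets (batches)
                // and written to the broker in bulk, not one request each.
                producer.send(new ProducerRecord<>("activity-log", "page-view-" + i));
            }
        } // close() flushes any remaining batched records
    }
}
```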
Another optimization targets byte copying. This is not a problem at low load, but under high load its impact becomes significant. To avoid it, Kafka uses a standardized binary message format that the producer, the broker, and the consumer all share, so data can be passed between them without modification.
Zero Copy
The message log maintained by the broker is simply a directory of files. Message sets are written to the log files in the same standard format shared by producers and consumers, which enables an important optimization: the transfer of messages over the network. Modern UNIX operating systems provide a high-performance system call that sends data from the page cache directly to a socket; on Linux this is sendfile.
To better understand the benefit of sendfile, consider the typical data path for sending data from a file to a socket:
1. The operating system reads the data from disk into the page cache in kernel space.
2. The application copies the data from the page cache into its own user-space buffer.
3. The application copies the data from its user-space buffer into the socket buffer in kernel space.
4. The operating system copies the data from the socket buffer to the NIC buffer, from which it is sent over the network.
This is clearly inefficient: four copies and two system calls. With sendfile, the data is sent directly from the page cache to the NIC buffer, avoiding the redundant copies and greatly improving performance.
In a multi-consumer scenario, the data is copied into the page cache just once and reused, rather than being copied each time a message is consumed. This allows messages to be sent at close to network-bandwidth rates. At the disk level you will see almost no read activity, because the data is served to the network straight from the page cache.
This article describes in detail how sendfile and zero-copy techniques are applied in Java.
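In Java, the sendfile system call is exposed through FileChannel.transferTo, which hands data from the file's page cache to the target channel without passing through user-space buffers. Below is a minimal sketch of serving a file over a socket this way; the file name and port are assumptions for the example:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(9999)); // assumed port
            try (SocketChannel client = server.accept();
                 FileChannel file = FileChannel.open(Paths.get("segment.log"),
                                                     StandardOpenOption.READ)) {
                long position = 0;
                long remaining = file.size();
                while (remaining > 0) {
                    // transferTo maps to sendfile on Linux: bytes move from
                    // the page cache to the socket without an intermediate
                    // copy into the application's own buffers.
                    long sent = file.transferTo(position, remaining, client);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }
}
```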
Data Compression
Often the performance bottleneck is not the CPU or the disk but network bandwidth, especially for applications that transfer large amounts of data between data centers. Of course, users can compress their messages themselves without Kafka's support, but this yields a poorer compression ratio: compressing a batch of messages together compresses far better than compressing each message separately.
Kafka instead uses end-to-end compression. Because of the message set concept, the client can compress a batch of messages together and send it to the server, where it is written to the log file in compressed form and later delivered to the consumer still compressed. The messages stay compressed all the way from producer to consumer and are decompressed only when the consumer uses them, hence the name "end-to-end compression."
Kafka supports the GZIP and Snappy compression protocols. More detailed information can be found here.
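With the modern Java client, enabling end-to-end compression is a single producer-side setting; the broker stores the compressed batch as-is and the consumer decompresses it transparently. A minimal sketch, with the broker address and topic name assumed as before:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Compress whole batches on the producer; the broker stores them
        // compressed and the consumer decompresses them on consumption.
        props.put("compression.type", "gzip"); // or "snappy"

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("activity-log", "compressed payload"));
        }
    }
}
```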