If you are reading about Kafka for the first time, start with the earlier article, A Preliminary Look at the Distributed Message System Kafka.
People sometimes ask how Kafka differs from a general-purpose MQ, and that question is hard to answer directly. A better approach, I think, is to analyze how Kafka is implemented. Based on the design document on the official website, this article examines in detail how Kafka achieves its high performance and high throughput. Since the topic is long, I will split it across two articles: this one covers Kafka's implementation details from a macro perspective, and the next analyzes Kafka through its specific techniques.
Let's first look at the design elements of Kafka:
1. In general, Kafka is built around message persistence (persistent messages).
2. Throughput is the primary goal of Kafka's design.
3. Consumption state is tracked by the consumer, not by the server. To be precise, the "server" here is the broker: how much data a consumer has read is recorded by the consumer itself, not by the broker. One could reasonably argue that the consumption record is itself a log that could live on the broker; why Kafka chooses otherwise will be discussed later. (A sketch of consumer-side tracking follows this list.)
4. Kafka is distributed: producers, brokers, and consumers can each be spread across multiple machines.
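To make design element 3 concrete, here is a minimal sketch in plain Java of a consumer that keeps its own consumption position instead of asking the server to remember it. Everything here (the Broker interface, the fetch method, the topic name) is hypothetical, invented for illustration rather than taken from Kafka's API:

```java
import java.util.List;

// Hypothetical illustration of design element 3: the consumer, not the
// broker, remembers how far it has read.
public class OffsetTrackingConsumer {
    private long offset = 0; // consumption state lives in the consumer

    // fetch() stands in for a real "give me messages from this offset" RPC
    public void pollOnce(Broker broker) {
        List<String> batch = broker.fetch("my-topic", offset, 100);
        for (String message : batch) {
            System.out.println("processing: " + message);
        }
        offset += batch.size(); // advance our own record of what we consumed
    }

    // Minimal stand-in for a broker that serves a log by offset.
    interface Broker {
        List<String> fetch(String topic, long fromOffset, int maxMessages);
    }
}
```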
Before we get to the implementation principles, there are a few terms to understand:
- topic: curiously, the official site barely explains this word, yet topic is the key to understanding Kafka. In Kafka, different kinds of data are stored under different topics.
- message: the message is the unit of data Kafka handles. Messages are published to a topic on a broker, and consumers likewise fetch data from the corresponding topic; in other words, messages are stored and organized by topic.
- consumer group: design element 4 above says that producers, brokers, and consumers can all be deployed across multiple machines, but for consumers this kind of scaling needs special support. A consumer group makes multiple (related) processes or machines act logically as a single consumer. Groups exist to support both of the classic JMS semantics: put all the consumers into one group and you have a queue, where each message is delivered to exactly one consumer; give every consumer its own group and you have a topic in the JMS sense, where every consumer sees every message. Either way, no matter how many consumers a topic has, each message is stored only a single time. (You may be wondering how the data is backed up; that comes later. A configuration sketch follows this list.)
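As an illustration of how group membership yields the two semantics, here is a hedged sketch using today's official Java client (org.apache.kafka:kafka-clients), which did not exist when this article was written; the bootstrap address, group ID, and topic name are made up:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupSemanticsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Queue semantics: give every consumer process the SAME group.id,
        // and each message is delivered to only one of them.
        // Pub/sub semantics: give each consumer its OWN group.id,
        // and every consumer sees every message.
        props.put("group.id", "billing-service");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```

Run several copies of this program with the same group.id and they divide the messages between them like a queue; give each copy a different group.id and each one receives the full stream.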
Next, let's take a look at what Kafka's implementation depends on.
1. On the hardware side, Kafka reads and writes the disk directly, though with a deliberate strategy behind it. A RAID-5 array of six 7200 RPM SATA drives has a linear read/write speed of about 300 MB/sec, but only about 50 KB/sec for random access; the difference is dramatic. Kafka therefore chooses linear (sequential) access. How the data is actually laid out will be covered in the storage discussion; a rough sketch of the access-pattern difference follows.
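As a rough illustration (not a rigorous benchmark, since the page cache and the filesystem will soften the difference on small files), this plain-Java sketch contrasts sequential appends with random-position writes; the file names and sizes are arbitrary:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

// Rough illustration of why Kafka prefers linear disk access:
// sequential appends vs. random-position writes of the same data volume.
public class SeqVsRandom {
    public static void main(String[] args) throws IOException {
        byte[] block = new byte[4096];
        int blocks = 25_000; // ~100 MB total

        try (RandomAccessFile f = new RandomAccessFile("seq.dat", "rw")) {
            long t0 = System.nanoTime();
            for (int i = 0; i < blocks; i++) {
                f.write(block);               // always appends at the end
            }
            f.getFD().sync();                 // force to disk so the timing is honest
            System.out.printf("sequential: %d ms%n", (System.nanoTime() - t0) / 1_000_000);
        }

        try (RandomAccessFile f = new RandomAccessFile("rand.dat", "rw")) {
            f.setLength((long) blocks * block.length);
            Random rnd = new Random(42);
            long t0 = System.nanoTime();
            for (int i = 0; i < blocks; i++) {
                f.seek((long) rnd.nextInt(blocks) * block.length); // jump around
                f.write(block);
            }
            f.getFD().sync();
            System.out.printf("random:     %d ms%n", (System.nanoTime() - t0) / 1_000_000);
        }
    }
}
```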
2. Regarding caching, Kafka does not maintain its own in-memory cache. The operating system already provides one: unless direct I/O is used, all idle memory is put to work as disk cache (the page cache). If a process keeps its own in-process cache of the same data, everything ends up stored twice, once in the page cache and once inside the process. In addition, Kafka runs on the JVM, where object overhead and garbage collection make large in-memory caches expensive, so Kafka does not rely on its own memory for caching. All data is immediately written to a persistent log on the filesystem without any explicit call to flush it; the kernel flushes in the background. Bear in mind that a warm cache is not free either way: warming 32 GB of cache can take around 10 minutes. A minimal sketch of the write-without-flush pattern follows.
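Here is a minimal sketch of that pattern, assuming a plain NIO FileChannel and an invented file name:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch of "write immediately, never flush explicitly": the write() below
// lands in the OS page cache, and the kernel flushes it in the background.
public class PageCacheLog {
    public static void main(String[] args) throws IOException {
        try (FileChannel log = FileChannel.open(Paths.get("messages.log"),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            ByteBuffer msg = ByteBuffer.wrap("hello kafka\n".getBytes(StandardCharsets.UTF_8));
            log.write(msg);
            // Deliberately NO log.force(true): calling force() would fsync
            // every write and defeat the point of leaning on the page cache.
        }
    }
}
```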
3. Linear writes/reads: a log does not support as rich a set of operations as a B-tree, but its operations are O(1), and reads and writes do not block each other. Better still, performance is decoupled from data size: the log stays fast no matter how large it grows, so cheap high-capacity disks yield excellent cost effectiveness. A toy log illustrating the access pattern follows.
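A purely illustrative in-memory toy showing why both ends of the access pattern are O(1):

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory log: append is O(1), and reading by offset is O(1),
// independent of how much data has accumulated. A real log is on disk,
// but the access pattern is the same.
public class ToyLog {
    private final List<byte[]> entries = new ArrayList<>();

    public long append(byte[] message) {   // O(1): just add at the end
        entries.add(message);
        return entries.size() - 1;         // the message's offset
    }

    public byte[] read(long offset) {      // O(1): direct index, no tree walk
        return entries.get((int) offset);
    }
}
```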
4. Zero-copy: writing data from the hard disk to a socket normally takes several copies between kernel space and user space. You can count the steps yourself; this is standard operating-systems material, and the answer is at the end of the article. For details, refer to http://my.oschina.net/ielts0909/blog/85147. In one sentence: zero-copy removes the intermediate copy steps from the I/O path.
5. GZIP and Snappy compression: since the biggest transmission bottleneck is the network, Kafka supports several compression protocols for the data it sends. A sketch of the idea follows.
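The idea, sketched below with the JDK's built-in GZIP support (Snappy requires a third-party library, so it is omitted), is to compress a whole batch of messages before it crosses the network; the class and method names here are invented for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Sketch of batch compression: a whole set of messages is compressed
// together, and the compressed bytes, not the raw batch, go on the wire.
public class BatchCompression {
    public static byte[] compressBatch(List<String> messages) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buf)) {
            for (String m : messages) {
                gzip.write(m.getBytes(StandardCharsets.UTF_8));
                gzip.write('\n'); // simple delimiter between messages
            }
        }
        return buf.toByteArray();
    }
}
```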
6. Transaction mechanism: although Kafka is weak at transaction processing, it still implements definite policies for message delivery to keep data delivery accurate (a sketch of the first two guarantees follows the list):
- At most once: this handles the first case described; messages are immediately marked as consumed, so they can never be handed out twice, but many failure scenarios may result in lost messages.
- At least once: this is the second case, where each message is guaranteed to be delivered at least once, but under failure it may be delivered more than once.
- Exactly once: this is what people actually want; each message is delivered once and only once.
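The first two guarantees come down to whether the consumer records its offset before or after processing. A conceptual sketch, with every interface and method invented for illustration:

```java
// Conceptual sketch (all types hypothetical): the order of "process" vs.
// "record the offset" decides the delivery guarantee.
public class DeliverySemantics {

    // At most once: record the offset BEFORE processing. A crash during
    // processing means the message is never redelivered, i.e. it is lost.
    void atMostOnce(Log log, OffsetStore store, Handler handler) {
        Message m = log.next();
        store.save(m.offset());   // committed first
        handler.process(m);       // crash here => message lost, never re-read
    }

    // At least once: process BEFORE recording the offset. A crash between
    // the two steps means the message is processed again after restart.
    void atLeastOnce(Log log, OffsetStore store, Handler handler) {
        Message m = log.next();
        handler.process(m);       // crash here => message will be redelivered
        store.save(m.offset());
    }

    interface Log { Message next(); }
    interface OffsetStore { void save(long offset); }
    interface Handler { void process(Message m); }
    interface Message { long offset(); }
}
```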
The above covers the implementation details of Kafka, mainly the technologies it uses and the reasons behind them. In the next article, I will focus on how producers, brokers, and consumers cooperate, and on Kafka's storage layer.
--------------------------------------------------------------------------------
To understand the impact of sendfile, it is important to understand the common data path for transfer of data from file to socket:
- The operating system reads data from the disk into pagecache in kernel space
- The application reads the data from kernel space into a user-space Buffer
- The application writes the data back into kernel space into a socket buffer
- The operating system copies the data from the socket buffer to the NIC buffer where it is sent over the network
In fact, zero-copy is already in everyday use: transferTo in NIO's FileChannel works on exactly this principle (a sketch follows).
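A minimal sketch of that call, with an invented host, port, and file name; on Linux, transferTo typically maps to sendfile, skipping the two user-space copies listed above:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch of the zero-copy path: FileChannel.transferTo asks the kernel to
// move file bytes straight to the socket, with no user-space buffer.
public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Paths.get("messages.log"),
                                                 StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("consumer.example.com", 9092))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {          // transferTo may send less than asked
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```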
From: http://my.oschina.net/ielts0909/blog/94153