Reprinted from http://blog.csdn.net/xiaolang85/article/details/18048631
== What is Kafka ==
Simply put, Kafka is a distributed message queue system developed by LinkedIn.
Design goals (problems to solve)
Kafka was developed mainly to build a data processing framework for massive volumes of logs, user behavior data, and website operational statistics. Combined with the needs of data mining, behavior analysis, and operational monitoring, it has to serve both real-time online applications and batch offline applications, which demand low latency and high batch throughput respectively. Fundamentally, high throughput is the first requirement, followed by real-time delivery and durability.
Existing message queue frameworks fall short in one of two ways. Some provide strong guarantees for reliable message delivery, which imposes a heavy overhead and cannot meet the high-throughput requirement. Others are designed entirely for real-time message processing and cannot provide the buffering and durability that bulk offline processing needs.
Most log collection systems developed for big data applications (e.g. Scribe, Flume) are better suited to bulk offline processing and offer little support for real-time online processing.
Overall, Kafka tries to provide a single messaging system that handles massive amounts of data for both online and offline use.
== How it is implemented ==
A Kafka cluster has multiple broker servers. Each type of message is defined as a topic. Messages within the same topic are partitioned according to a key and an algorithm, and the partitions are stored on different brokers. Producers and consumers can produce and consume a topic across multiple brokers.
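The key-based partitioning described above can be illustrated with a minimal sketch. The hash function (CRC32), the round-robin placement of partitions on brokers, and the names `choose_partition` and `partition_to_broker` are assumptions made for illustration; they are not Kafka's actual partitioner or placement logic.

```python
import zlib

def choose_partition(key, num_partitions):
    """Map a message key (bytes) to a partition index with a stable hash."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() on bytes.
    return zlib.crc32(key) % num_partitions

def partition_to_broker(partition, brokers):
    """Spread partitions over brokers round-robin (illustrative placement only)."""
    return brokers[partition % len(brokers)]

# Hypothetical cluster: three brokers hosting six partitions of one topic.
brokers = ["broker-0", "broker-1", "broker-2"]
num_partitions = 6

for key in (b"user-1", b"user-2", b"user-1"):
    p = choose_partition(key, num_partitions)
    print(key.decode(), "-> partition", p, "on", partition_to_broker(p, brokers))
```

Messages with the same key always land in the same partition, which is what lets per-key ordering survive the spread across brokers.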
Core Ideas
With high efficiency as the first design principle, Kafka makes radical trade-offs in many aspects of its design.
= Minimalist data structure and usage model =
A message queue is stored as a log file. A producer can only append messages to the end of the existing file. Messages carry no ID; a message is located purely by its offset within the file, so consumers can only read messages in offset order. This removes the need to maintain complex index structures to support random reads.
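To make the offset-addressed log concrete, here is a toy in-memory sketch. The 4-byte length prefix and the class name `AppendOnlyLog` are assumptions for illustration and do not reflect Kafka's actual on-disk format. Note that the reader, not the log, holds the current offset.

```python
import struct

class AppendOnlyLog:
    """Toy log: messages are length-prefixed and addressed by byte offset."""

    def __init__(self):
        self._data = bytearray()

    def append(self, payload):
        """Append a message at the end and return the offset where it starts."""
        offset = len(self._data)
        self._data += struct.pack(">I", len(payload)) + payload
        return offset

    def read(self, offset):
        """Return (payload, next_offset) for the message starting at offset."""
        (length,) = struct.unpack_from(">I", self._data, offset)
        start = offset + 4
        payload = bytes(self._data[start:start + length])
        return payload, start + length

    def end_offset(self):
        """Offset just past the last message, i.e. where the next append goes."""
        return len(self._data)

log = AppendOnlyLog()
for msg in (b"login", b"click", b"logout"):
    log.append(msg)

# The consumer keeps its own position and walks the log sequentially.
offset = 0
while offset < log.end_offset():
    payload, offset = log.read(offset)
    print(payload.decode())
```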
The Kafka broker does not track or coordinate the consumption state of multiple consumers; each consumer maintains its own offset to locate messages.
The smallest unit of concurrent access is the partition. Within one consumer group (which can be understood as all the concurrent processes of one application), only one consumer may access a given partition. The number of partitions is fixed, and dynamic adjustment is not supported. This simplifies concurrency control for message access among multi-process or distributed clients, but it also imposes some constraints on usage (for example, the maximum concurrency depends entirely on the number of partitions planned in advance).
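The one-consumer-per-partition rule can be sketched as a simple assignment function. The round-robin strategy and the name `assign_partitions` are assumptions for illustration, not Kafka's actual rebalancing algorithm.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Hypothetical consumer group with three worker processes.
group = ["worker-a", "worker-b", "worker-c"]

# Two partitions: one worker ends up with nothing to read.
print(assign_partitions(range(2), group))

# Six partitions: every worker owns two.
print(assign_partitions(range(6), group))
```

With fewer partitions than consumers, the surplus consumers sit idle, which is why the achievable concurrency is capped by the partition count.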
Partitioning also means that messages are ordered only within a partition, not globally. If global ordering is required, the application has to rely on other mechanisms to guarantee it.
Messages are dispatched in pull mode. The broker does not track how messages are consumed at all, such as whether a message is still unread by some consumer or has been read repeatedly. Expired messages are simply deleted periodically by a timer (for example, after 7 days), which eliminates the overhead of tracking and maintaining per-message state.
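Age-based expiry of this kind can be sketched in a few lines. The representation of the log as `(name, last_modified)` tuples and the function name `expire_segments` are assumptions for illustration; only the 7-day figure comes from the text.

```python
import time

RETENTION_SECONDS = 7 * 24 * 60 * 60  # the 7-day example from the text

def expire_segments(segments, now, retention=RETENTION_SECONDS):
    """Keep only segments whose last write falls inside the retention window.

    `segments` is a list of (name, last_modified_epoch_seconds) tuples.
    Consumption state is never consulted: age is the only criterion.
    """
    return [(name, mtime) for name, mtime in segments if now - mtime <= retention]

now = time.time()
day = 24 * 60 * 60
segments = [
    ("segment-old", now - 8 * day),  # older than 7 days, will be dropped
    ("segment-new", now - 1 * day),  # still inside the window
]
print([name for name, _ in expire_segments(segments, now)])
```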
= Maximize data transfer efficiency in a variety of ways =
For example, producers and consumers can read and write messages in batches to reduce RPC overhead.
Use zero copy to transfer file contents directly to the network socket at the kernel level, avoiding copies of the data through the application layer.
Use a suitable compression format, etc.
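The zero-copy idea from the list above can be tried out with Python's standard `socket.sendfile`, which uses the operating system's `sendfile` call where available and otherwise falls back to ordinary `send`. This is only a sketch of the technique: Kafka itself runs on the JVM and relies on Java's `FileChannel.transferTo`, and the helper name `send_file_zero_copy` and the local socket pair are assumptions for the demo.

```python
import os
import socket
import tempfile

def send_file_zero_copy(sock, path):
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as f:
        # No read() into a Python buffer: the kernel moves the bytes when
        # os.sendfile is supported on this platform.
        return sock.sendfile(f)

# Demo: write a small "log file" and push it through a local socket pair.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"message-1\nmessage-2\n")
    path = tmp.name

sender, receiver = socket.socketpair()
try:
    sent = send_file_zero_copy(sender, path)
    sender.close()  # signals end-of-stream to the receiver
    received = b""
    while True:
        chunk = receiver.recv(4096)
        if not chunk:
            break
        received += chunk
    print(sent, len(received))
finally:
    sender.close()
    receiver.close()
    os.remove(path)
```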
= Aggressive memory management mode =
The basic idea is not to manage memory at all. Kafka does not maintain a message cache inside the JVM process; messages are read from and written to files directly, relying entirely on the OS cache at the file system level. This avoids the extra data structure overhead of managing a cache in the JVM and the performance cost of GC. Given the usage pattern of batch processing and sequential reads and writes, this makes maximum use of the file system cache and offsets the performance disadvantage of file I/O relative to memory access.
= HA =
Before version 0.8, Kafka had no replication or fault-tolerance mechanism for messages. The producer worked in fire-and-forget mode, and if a broker failed, the data of the topic partitions it hosted was lost. The reason for this design is that the initial use cases, such as processing logs and user behavior messages, did not demand strong data robustness and could tolerate losing some data. In fire-and-forget mode the producer does not wait for a broker ack, which improves producer throughput.
In version 0.8, however, a replica mechanism was added. The multiple replicas of a message partition are distributed across different brokers, and the leader replica handles routine reads and writes. Failover is supervised through ZooKeeper, and the leader replicas of different partitions are load-balanced across different brokers. The producer can choose among three acknowledgement mechanisms: not waiting for any ack, waiting for the leader replica's ack, or waiting for acks from all replicas. Performance decreases across these three mechanisms in that order (producer throughput is reduced 1-3 times).
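The trade-off between acknowledgement levels can be summarized in a small sketch. It assumes the three producer modes are no ack, leader ack, and ack from all replicas; the names `AckMode` and `acks_to_wait_for` are hypothetical and are not Kafka configuration values.

```python
from enum import Enum

class AckMode(Enum):
    NONE = "none"      # fire and forget: do not wait at all
    LEADER = "leader"  # wait for the leader replica only
    ALL = "all"        # wait for every replica of the partition

def acks_to_wait_for(mode, replica_count):
    """Number of replica acknowledgements the producer blocks on per send."""
    if mode is AckMode.NONE:
        return 0
    if mode is AckMode.LEADER:
        return 1
    return replica_count

# Hypothetical partition with three replicas.
for mode in AckMode:
    print(mode.value, "->", acks_to_wait_for(mode, replica_count=3))
```

The more acknowledgements a send waits for, the stronger the durability guarantee and the lower the producer throughput.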
== Links ==
Project Home http://kafka.apache.org/
Paper http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
[Reprint] Quick Understanding Kafka distributed Message Queue framework