1 Overview

Kafka was originally a distributed messaging system developed at LinkedIn and later became part of Apache. It is written in Scala and is widely used for its horizontal scalability and high throughput. At present, more and more open-source distributed processing systems, such as Cloudera, Apache Storm, and Spark, support integration with Kafka. By virtue of these advantages, Kafka is increasingly favored by Internet companies; Vipshop, too, has adopted Kafka as one of its internal core messaging engines. For commercial-grade message middleware, the importance of message reliability is self-evident. How do we ensure that messages are transmitted accurately? How do we ensure that messages are stored accurately? How do we ensure that messages are consumed correctly? These are all questions that must be considered. This article starts from Kafka's architecture to establish its basic principles, then analyzes its reliability step by step through Kafka's storage mechanism, replication principle, synchronization principle, and reliability and durability guarantees, and finally reinforces this understanding of Kafka's high reliability through benchmarks.

2 Kafka Architecture

As shown in the figure above, a typical Kafka architecture includes several producers (producing server logs, business data, page views generated by web front ends, and so on), several brokers (Kafka supports horizontal scaling; generally, the more brokers, the higher the cluster throughput), several consumer groups, and one ZooKeeper cluster. Kafka uses ZooKeeper to manage cluster configuration, elect leaders, and rebalance when consumer group membership changes. Producers publish messages to brokers in push mode, while consumers subscribe to and consume messages from brokers in pull mode.

Terminology:

2.1 Topic & Partition

A topic can be regarded as a class of messages. Each topic is divided into multiple partitions, and at the storage level each partition is an append-only log file. Any message published to a partition is appended to the tail of its log file, and the position of each message in the file is called its offset: a long integer that uniquely identifies a message within the partition. Because each message is appended to its partition, writes are sequential disk writes and therefore very efficient (it has been shown that sequential disk writes can be faster than random memory writes, which is an important guarantee of Kafka's high throughput).

When a message is sent to the broker, it is stored in a partition chosen according to the partitioning rule. If the partitioning rule is set up reasonably, all messages are distributed evenly across different partitions, allowing horizontal scaling. (If a topic corresponded to a single file, the I/O of the machine holding that file would become that topic's performance bottleneck; partitioning solves this problem.)
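To make the append-log and offset semantics concrete, here is a minimal, purely illustrative Java sketch; it is an in-memory model, not Kafka code, and every name in it is made up:

import java.util.ArrayList;
import java.util.List;

// Illustrative model of one partition: an append-only log in which the
// position of each record is its offset. Real Kafka persists the log in
// segment files on disk; this sketch only shows the offset semantics.
public class PartitionLog {
    private final List<String> log = new ArrayList<>();

    // Appends always go to the tail; the assigned offset is returned
    // and is never reused.
    public long append(String message) {
        log.add(message);
        return log.size() - 1;
    }

    // Reading by offset is a direct positional lookup.
    public String read(long offset) {
        return log.get((int) offset);
    }

    public static void main(String[] args) {
        PartitionLog p = new PartitionLog();
        System.out.println(p.append("m0")); // prints 0
        System.out.println(p.append("m1")); // prints 1
        System.out.println(p.read(1));      // prints m1
    }
}

In a real partition these records live in segment files on disk, which is exactly why the sequential-write property matters for throughput.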
You can specify the number of partitions in $KAFKA_HOME/config/server.properties when creating a topic (see below); the partition count can also be changed after the topic has been created (an example follows the snippet).

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=3
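As a concrete illustration of changing the count after creation (the topic name my_topic is hypothetical), the topic tool in 0.8.1-era and later versions can do it:

$KAFKA_HOME/bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic my_topic --partitions 8

Note that Kafka only permits increasing the partition count of an existing topic, never decreasing it, since data has already been laid out per partition on disk.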
When sending a message, you can specify a key for it; based on this key and the partitioning mechanism, the producer determines which partition the message is sent to. The partitioning mechanism can be customized by setting the producer's partitioner.class parameter, which must name a class implementing the kafka.producer.Partitioner interface (a sketch of such a class closes this subsection). For more details on topics and partitions, refer to the section "Kafka file storage mechanism" below.
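As an illustration only, here is a minimal custom partitioner written against the 0.8.x-era Scala producer API; the class name is hypothetical, and it simply reproduces hash-by-key routing:

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

// Hypothetical sketch: route each message by the hash of its key.
// 0.8.x partitioner implementations need a constructor that accepts
// VerifiableProperties, even if no configuration is read from it.
public class HashKeyPartitioner implements Partitioner {

    public HashKeyPartitioner(VerifiableProperties props) {
        // no custom configuration in this sketch
    }

    @Override
    public int partition(Object key, int numPartitions) {
        // Math.abs guards against negative hash codes (the
        // Integer.MIN_VALUE edge case is ignored for brevity)
        return Math.abs(key.hashCode()) % numPartitions;
    }
}

The producer would then be configured with, for example, partitioner.class=com.example.HashKeyPartitioner (the package name is made up for this sketch).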
3 High-Reliability Storage Analysis

Kafka's high-reliability guarantee comes from its robust replication strategy. By tuning the replication-related parameters, Kafka can strike a balance between performance and reliability. Kafka has provided partition-level replication since the 0.8.x release; the default replication factor can be configured in $KAFKA_HOME/config/server.properties (default.replication.factor).

This section starts from Kafka's file storage mechanism, examining the storage details at the bottom layer to build a micro-level understanding of how messages are stored. It then explains the macro-level concepts through Kafka's replication principle and synchronization mode. Finally, it enriches the picture of Kafka's reliability from several dimensions: ISR, HW, leader election, and data reliability and durability guarantees.

3.1 Kafka file storage mechanism

Messages in Kafka are organized by topic: producers send messages to Kafka brokers through a topic, and consumers read data through a topic. Physically, however, a topic is grouped into partitions: a topic can be divided into several partitions, and a partition can in turn be subdivided into segments, so that one partition physically consists of multiple segments. How, then, are topics, partitions, and segments actually stored? Let's work through them one by one.

For ease of explanation, suppose there is a Kafka cluster with only one Kafka broker, i.e. a single physical machine. On this Kafka broker, configure log.dirs=/tmp/kafka-logs in $KAFKA_HOME/config/server.properties to set the Kafka message file storage directory, and create a topic named topic_zzh_test with 4 partitions:

$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 4 --replication-factor 1 --topic topic_zzh_test

Afterwards we can see that 4 directories have been generated in the /tmp/kafka-logs directory:
drwxr-xr-x 2 root root 4096 Apr 16:10 topic_zzh_test-0
drwxr-xr-x 2 root root 4096 Apr 16:10 topic_zzh_test-1
drwxr-xr-x 2 root root 4096 Apr 16:10 topic_zzh_test-2
drwxr-xr-x 2 root root 4096 Apr 16:10 topic_zzh_test-3
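Each directory is named after the topic plus the partition number (<topic>-<partition>). Looking inside any one of these directories reveals the segment level mentioned above: for a freshly created partition there is typically a single segment whose index and log files are named after the segment's base offset, for example:

00000000000000000000.index
00000000000000000000.log

Exact sizes and timestamps will of course vary; the segment structure is examined in detail below.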