Apache Kafka Reference
http://kafka.apache.org/documentation.html
Message queuing modes:
Point-to-point:
The message producer sends messages to a queue, and the message consumer then pulls messages from the queue and consumes them. Note:
- A queue supports multiple consumers, but any individual message can be consumed by only one of them.
Publish/subscribe:
The message producer (publisher) publishes messages to a topic, and multiple message consumers (subscribers) consume them. Unlike the point-to-point model, a message published to a topic is consumed by all subscribers.
Background:
Kafka is a messaging system originally developed at LinkedIn as the basis for LinkedIn's activity stream and operational data processing pipeline. It has since been used by several companies for many kinds of data pipelines and messaging systems.
Activity data is first written, in the form of logs, to files; these files are then periodically analyzed statistically. Operational data refers to server performance data (CPU and IO usage, request times, service logs, and so on); in general, a wide variety of statistical methods are applied to operational data.
As shown in the figure, a typical Kafka cluster contains several producers (which may emit page views generated by the web front end, server logs, system CPU and memory metrics, etc.), several brokers (Kafka supports horizontal scaling; in general, the more brokers, the higher the cluster throughput), several consumer groups, and one ZooKeeper cluster. Kafka uses ZooKeeper to manage cluster configuration, elect leaders, and rebalance when consumer group membership changes. Producers publish messages to brokers using a push model; consumers subscribe to and consume messages from brokers using a pull model.
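As a minimal sketch of the push side, the following snippet uses the standard Java producer API; the broker address and topic name here are illustrative assumptions, not values from the text:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PushExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; replace with your cluster's brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Push a message to the broker; "page-views" is an illustrative topic name.
            producer.send(new ProducerRecord<>("page-views", "user-42", "clicked /home"));
        }
    }
}
```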
Kafka terminology and how it works:
- topic: A category of messages. For example, page view logs and click logs can each be stored as a topic. We can think of a topic as a queue.
- producer: The message producer; the client that pushes messages to the Kafka broker.
- consumer: The message consumer; the client that pulls messages from the Kafka broker.
- consumer group (CG): The mechanism Kafka uses to implement both broadcast (to all consumers) and unicast (to any one consumer) of a topic's messages. A topic can have multiple CGs.
A topic's messages are copied (conceptually, not physically) to all CGs, but each CG delivers a given message to only one consumer within that CG.
To implement broadcast, simply give each consumer its own CG.
To implement unicast, put all the consumers in the same CG.
CGs also allow consumers to be grouped freely without having to send messages to multiple different topics.
- Partition: For scalability, a very large topic can be distributed across multiple brokers (that is, servers): a topic can be divided into multiple partitions, each of which is an ordered queue. Each message in a partition is assigned an ordered ID (its offset).
Kafka only guarantees that messages are delivered to consumers in order within a single partition; it does not guarantee ordering across a topic as a whole (multiple partitions).
- Offset: Each partition consists of a series of ordered, immutable messages that are continuously appended to it. Each message in the partition has a sequential number called the offset, which uniquely identifies the message within the partition.
Kafka names each on-disk segment file after the offset of the first message it contains, which makes lookups convenient. For example, to find the message at offset 2049, just locate the segment file named after base offset 2048. The first segment file is named after offset 0 (a zero-padded name such as 00000000000000000000.log).
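To make the offset-based lookup concrete, here is a small, self-contained sketch of the floor lookup described above; the base offsets are made-up example values, not taken from a real broker:

```java
import java.util.TreeMap;

public class SegmentLookup {
    public static void main(String[] args) {
        // Hypothetical base offsets of existing segment files.
        TreeMap<Long, String> segments = new TreeMap<>();
        segments.put(0L, "00000000000000000000.log");
        segments.put(2048L, "00000000000000002048.log");
        segments.put(4096L, "00000000000000004096.log");

        long target = 2049L;
        // The segment containing an offset is the one with the largest
        // base offset that is <= the target offset.
        String file = segments.floorEntry(target).getValue();
        System.out.println("offset " + target + " lives in " + file); // ...2048.log
    }
}
```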
Kafka Features:
- Provides message persistence through an O(1) disk data structure; this structure maintains stable performance even with terabytes of stored messages.
- High throughput: even on very ordinary hardware, Kafka can support hundreds of thousands of messages per second.
- Supports both synchronous and asynchronous replication for high availability (HA).
- Consumer clients pull data, with random reads served through the sendfile system call (zero-copy) and bulk fetching.
- Consumption state is saved on the client.
- Messages are stored with sequential writes.
- Data migration and expansion are transparent to users.
- Supports parallel data loading into Hadoop.
- Supports both online and offline scenarios.
- Persistence: prevents data loss by persisting data to disk and through replication.
- Scale out: machines can be added without downtime.
- Periodic deletion mechanism: supports configuring a retention time for partition segment files.
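As a hedged illustration of the retention setting (the topic name and the seven-day value are assumptions; retention.ms is the standard per-topic retention config), it can be changed at runtime with the Java AdminClient:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Keep segment files of the (hypothetical) "page-views" topic for 7 days.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "page-views");
            AlterConfigOp setRetention = new AlterConfigOp(
                new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                Collections.singletonMap(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```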
Reliability (consistency)
Kafka, as an MQ, must implement reliable message delivery from producer to consumer. Traditional MQ systems usually do this through an acknowledgment (ACK) mechanism between broker and consumer, with the state of message delivery saved on the broker.
Even so, consistency is difficult to guarantee (see the original text). Kafka's approach is to have the consumer save its own state, with no acknowledgments at all. Although this places a heavier burden on the consumer, it is in fact more flexible:
whatever the reason a message needs to be re-processed on the consumer side, it can simply be fetched again from the broker.
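A minimal sketch of this client-side state keeping, under the same illustrative topic, group, and broker assumptions as above: disable auto-commit and commit offsets only after processing succeeds, so an unprocessed message is simply fetched again.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "page-view-counters");      // illustrative group name
        props.put("enable.auto.commit", "false");         // the consumer owns its state
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, the offset is never committed,
                }                    // so the message will be fetched again later
                consumer.commitSync(); // persist consumption state only after success
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.offset() + ": " + record.value());
    }
}
```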
Kafka system extensibility
Kafka uses ZooKeeper to implement dynamic cluster expansion without any changes to client (producer and consumer) configuration. Brokers register with ZooKeeper and keep the relevant metadata (topic and partition information, etc.) up to date there.
Clients register watchers on ZooKeeper; as soon as something in ZooKeeper changes, clients are notified and adjust accordingly. This ensures that when brokers are added or removed, load balancing among the brokers still happens automatically.
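As a hedged illustration of the watcher mechanism (Kafka brokers register ephemeral nodes under /brokers/ids; the connection string here is an assumption), a client can observe broker arrivals and departures with the plain ZooKeeper Java API:

```java
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class BrokerWatcher {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper address; Kafka brokers register under /brokers/ids.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                try {
                    // Re-register the watch and print the current broker set.
                    List<String> ids = zk.getChildren("/brokers/ids", this);
                    System.out.println("alive brokers: " + ids);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
        // The initial read also installs the watch; any add/remove fires it.
        System.out.println("alive brokers: " + zk.getChildren("/brokers/ids", watcher));
        Thread.sleep(Long.MAX_VALUE); // keep watching
    }
}
```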
High throughput is one of Kafka's core design goals:
- zero-copy: reduces the number of IO steps.
- Supports sending and fetching data in bulk.
- Supports data compression.
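The batching and compression behavior above is exposed through producer configuration; a short sketch (the values shown are illustrative assumptions, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "65536");        // batch up to 64 KB per partition
        props.put("linger.ms", "10");            // wait up to 10 ms to fill a batch
        props.put("compression.type", "snappy"); // compress each batch before sending

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // ... send records as usual; batching and compression happen transparently.
        producer.close();
    }
}
```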
The producer sends each message to a particular partition according to a user-specified algorithm.
The consumer's pull mechanism
Because the Kafka broker persists its data and is under no cache pressure, it is appropriate for consumers to pull data, for the following reasons:
- It simplifies Kafka's design and reduces its difficulty.
- Consumers control the rate at which messages are fetched according to their own consumption capacity.
- Consumers choose their consumption pattern according to their own circumstances, such as batch consumption, repeat consumption, or starting from a particular partition or position (offset), as the sketch below shows.
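For instance, under the same illustrative topic and broker assumptions as earlier, a consumer can use assign() and seek() to start pulling from an arbitrary offset, which is how repeat consumption is done:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Pick a specific partition and position instead of subscribing.
            TopicPartition p0 = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(p0));
            consumer.seek(p0, 2049L); // re-consume starting at offset 2049
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```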
The relationship between consumers and topics, and its mechanism
Essentially, Kafka only supports topics. Each consumer belongs to one consumer group; conversely, each group can contain more than one consumer. A given message in a topic
will be consumed by only one consumer in each group subscribed to that topic; the message is never delivered to multiple consumers of the same group, and all the consumers in a group together consume the entire topic between them.
If all consumers have the same group, this works like the JMS (Java Message Service) queue pattern: messages are load-balanced across the consumers.
If all consumers have different groups, this is publish-subscribe: the message is broadcast to all consumers. A minimal sketch of both modes follows.
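In the Java client, the choice between the two modes comes down entirely to the group.id setting; the group names, topic, and broker address below are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupModes {
    // Queue mode: several consumers share ONE group, so each message
    // goes to exactly one of them (load balancing).
    static KafkaConsumer<String, String> queueMember() {
        return consumerWithGroup("analytics"); // same group.id for every member
    }

    // Publish-subscribe: each consumer uses its OWN group, so every
    // consumer receives every message (broadcast).
    static KafkaConsumer<String, String> subscriber(String name) {
        return consumerWithGroup("subscriber-" + name); // unique group.id each
    }

    static KafkaConsumer<String, String> consumerWithGroup(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }
}
```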
In Kafka, a message in a partition is consumed by only one consumer in a group (at any one time), and each group's consumption of messages is independent of the others; we can think of a group as one "subscriber".
Each partition in a topic is consumed by only one consumer within a "subscriber", but one consumer can consume messages from multiple partitions at the same time.
Kafka can only guarantee that the messages in one partition are consumed in order by one consumer. In fact, viewed at the topic level, when there are multiple partitions the messages are still not globally ordered.
Typically a group contains multiple consumers, which not only improves concurrent consumption of a topic's messages but also improves fault tolerance: if one of the consumers in the group fails,
the partitions it was consuming are automatically taken over by the other consumers. By Kafka's design, for a given topic, the same group cannot have more consumers consuming simultaneously than there are partitions;
doing so would mean that some consumers never receive any messages.
The Kafka cluster provides metadata to producers, containing information such as the list of servers alive in the cluster and the list of partition leaders
(see the node information in ZooKeeper). Once a producer has acquired this metadata, it keeps socket connections open to the leaders of all the topic's partitions; the
messages are sent by the producer directly through these sockets to the brokers, without passing through any "routing layer". In fact, the decision of which partition a message is routed to is made by the producer client.
For example, "random", "key-hash", or "round-robin" policies can be used; if a topic has multiple partitions, it is on the producer side that a balanced distribution of messages must be implemented.
In the producer-side configuration, developers can specify how messages are routed to partitions.
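For illustration, here is a sketch of a "key-hash" routing policy built on the Java client's Partitioner interface (the class name is an assumption; the interface is the client's standard extension point, and Utils.murmur2 is the hash the default partitioner uses):

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// A simple "key-hash" routing policy: the same key always maps to the same partition.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: fall back to partition 0 in this sketch
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override public void close() { }
    @Override public void configure(Map<String, ?> configs) { }
}
```

It would be registered on the producer with props.put("partitioner.class", KeyHashPartitioner.class.getName()).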
When consumers join or leave a group, a rebalance of its partitions is triggered. The ultimate goal of the rebalance is to increase the concurrency of topic consumption. As an example:
1) Suppose topic TOPIC1 has the following partitions: P0, P1, P2, P3.
2) The group contains the following consumers: C0, C1.
3) First sort the partitions by partition index: P0, P1, P2, P3.
4) Sort the consumers by consumer.id: C0, C1.
5) Compute the multiplier M = [P0, P1, P2, P3].size / [C0, C1].size, rounded up; in this example M = 2.
6) Then assign the partitions: C0 = [P0, P1], C1 = [P2, P3]; in general, Ci = [P(i * M), ..., P((i + 1) * M - 1)].
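A self-contained sketch of this range-assignment computation in plain Java (it mirrors the numbered steps above rather than Kafka's internal implementation; the names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RangeAssignment {
    // Give each consumer M consecutive partitions, where M = ceil(P / C).
    static Map<String, List<Integer>> assign(List<Integer> partitions, List<String> consumers) {
        List<Integer> sortedPartitions = new ArrayList<>(partitions);
        Collections.sort(sortedPartitions);          // step 3: sort partitions by index
        List<String> sortedConsumers = new ArrayList<>(consumers);
        Collections.sort(sortedConsumers);           // step 4: sort by consumer.id

        // step 5: multiplier M, rounded up
        int m = (int) Math.ceil((double) sortedPartitions.size() / sortedConsumers.size());

        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (int i = 0; i < sortedConsumers.size(); i++) {
            // step 6: Ci = [P(i*M) .. P((i+1)*M - 1)], clamped to the partition count
            int from = Math.min(i * m, sortedPartitions.size());
            int to = Math.min((i + 1) * m, sortedPartitions.size());
            result.put(sortedConsumers.get(i), sortedPartitions.subList(from, to));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList(0, 1, 2, 3), Arrays.asList("C0", "C1")));
        // -> {C0=[0, 1], C1=[2, 3]}
    }
}
```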
In Kafka, the replication policy is per partition, not per topic; Kafka replicates each partition's data to multiple servers, and any one partition has one leader and zero or more followers.
Messages are saved in each replica's local log. The leader is responsible for tracking the state of all followers; if a follower falls too far "behind" or fails, the leader removes it from the in-sync replicas list.
Kafka's conditions for a follower to count as alive are two:
1) The follower must maintain a good connection with ZooKeeper.
2) It must be able to follow the leader in a timely manner and not lag too far behind.
If both conditions are met, the leader considers the follower "active". If a follower fails (server failure) or falls too far behind,
the leader removes it from the sync list. Note: if such a replica falls too far behind, it continues to fetch data from the leader until it is sufficiently up to date,
and then rejoins the sync list. Kafka does not replace the replica's host, because the replicas in the synchronization list need to be fast enough to ensure that the producer's delay in receiving an ACK for a message stays low.
When the leader fails, a new leader must be chosen from among the followers. The followers may lag behind the leader, so a sufficiently "up-to-date" follower must be selected. For leader election Kafka does not adopt a "majority vote" algorithm,
because that algorithm places higher demands on network stability and on the number of voting participants, and the Kafka cluster design also needs to tolerate the failure of N-1 replicas. For Kafka,
all the replica information for each partition is available in ZooKeeper, so electing a leader is a very simple matter. When choosing a follower, one problem must be taken into account:
the number of partition leaders already hosted on the new leader's server. If there are too many partition leaders on one server, that server will be under greater IO pressure.
So the election of a new leader needs to take "load balancing" into account: a broker hosting fewer partition leaders is more likely to become the new leader.
A partition can continue to accept reads and writes in the cluster as long as at least one of its replicas survives.
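As a hedged sketch of how replication is requested in practice (the topic name, partition count, and replication factor are illustrative assumptions), a topic with three replicas per partition can be created with the Java AdminClient; on the producer side, acks=all makes the leader wait for the in-sync replicas before acknowledging:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class ReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 4 partitions, each replicated to 3 brokers (1 leader + followers).
            NewTopic topic = new NewTopic("page-views", 4, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
        // Producer-side setting that ties into the ACK discussion above:
        //   props.put("acks", "all"); // wait for the in-sync replica set
    }
}
```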
To summarize:
1) The producer side connects directly to the broker list (broker.list); the TopicMetadataResponse returned from that list tells it the leader of each of the topic's partitions, to which it establishes a socket connection and sends messages.
2) The broker side uses ZooKeeper to register broker information and to monitor the liveness of partition leaders.
3) The consumer side uses ZooKeeper to register consumer information, including the list of partitions it consumes; it also uses ZooKeeper to discover the broker list, establish socket connections with partition leaders, and fetch messages.
Kafka Messaging Service