Kafka distributed messaging system 2011-08-28 18:32:46
Category: LINUX
Kafka [1] is a distributed message queue that LinkedIn uses for log processing. LinkedIn's log volume is large, but its reliability requirements are not high; the log data consists mainly of user behavior (logins, page views, clicks, shares, likes) and system runtime metrics (CPU, memory, disk, network, and system/process state).
Many existing message-queue services provide reliable delivery guarantees and default to immediate consumption, which is not suitable for offline processing. Since highly reliable delivery is not required for LinkedIn's logs, Kafka improves performance by relaxing reliability, and by running as a distributed cluster it allows messages to accumulate in the system, so Kafka can support both offline and online log processing.
Note: in this article, publisher and producer are used interchangeably, as are subscriber and consumer.
Kafka's architecture is as follows:
Kafka Storage Policies
- Kafka manages messages by topic. Each topic contains multiple partitions; each partition corresponds to a logical log and consists of multiple segment files.
- Each segment file stores multiple messages. A message's ID is determined by its logical position, which means the storage location can be computed directly from the message ID, avoiding an extra ID-to-location mapping.
- Each partition keeps an in-memory index that records the offset of the first message in each segment (see the lookup sketch after this list).
- Messages published to a topic are distributed evenly across its partitions (randomly or via a user-specified callback function). When a broker receives a published message, it appends the message to the last segment of the corresponding partition. A segment is flushed to disk once the number of messages in it reaches a configured value or the oldest unflushed message exceeds a time threshold, and only messages that have been flushed to disk are visible to subscribers. Once a segment reaches a certain size, no more data is written to it and the broker creates a new segment.
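As referenced above, locating a message from its ID only requires a binary search over the in-memory index of per-segment starting offsets. A minimal illustrative sketch in Java (class and field names are mine, not Kafka's source):

    // SegmentIndex.java -- illustrative sketch of the per-partition index.
    // firstOffsets[i] holds the ID of the first message in segment i. A
    // message ID resolves to (segment, position) by binary searching for
    // the largest starting offset <= the ID.
    public class SegmentIndex {
        private final long[] firstOffsets; // sorted ascending, one entry per segment

        public SegmentIndex(long[] firstOffsets) { this.firstOffsets = firstOffsets; }

        public int segmentFor(long messageId) {
            int lo = 0, hi = firstOffsets.length - 1, found = 0;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (firstOffsets[mid] <= messageId) { found = mid; lo = mid + 1; }
                else { hi = mid - 1; }
            }
            return found;
        }

        // Byte position of the message inside its segment file.
        public long positionInSegment(long messageId) {
            return messageId - firstOffsets[segmentFor(messageId)];
        }
    }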
Publishing and subscribing interfaces
When publishing a message, the Kafka client constructs the message and adds it to a message set (Kafka supports batch publishing: multiple messages can be added to the set and published in one request). When sending, the client specifies the topic the messages belong to.
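As a concrete illustration, here is a minimal publisher using the modern Kafka Java client (which postdates this 2011 article; the original API differed). The broker address, topic, and payload are assumptions; the client accumulates messages into sets and sends them in batches:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LogPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Messages are accumulated into a message set and sent in batches.
                for (int i = 0; i < 100; i++) {
                    producer.send(new ProducerRecord<>("user-activity", "click event " + i));
                }
            } // close() flushes any buffered messages
        }
    }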
When subscribing, the Kafka client specifies the topic and partition number (each partition corresponds to a logical log stream; for example, the topic might represent a product line and the partition that product line's logs for one day). After subscribing, the client can iterate over the messages; if no message is available, it blocks until a new message is published. Consumers acknowledge received messages cumulatively: acknowledging the message at a given offset confirms that all earlier messages have been received successfully, and the offset registry in Zookeeper is updated accordingly (described later).
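A matching subscriber sketch under the same assumptions; with auto-commit disabled, commitSync() performs the cumulative acknowledgment described above:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class LogSubscriber {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "log-analyzers");           // assumed consumer group
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("user-activity"));
                while (true) {
                    // poll() waits up to the timeout when no new message is available
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.println(r.offset() + ": " + r.value());
                    }
                    // Cumulative acknowledgment: confirms all messages up to
                    // the last polled offset were received.
                    consumer.commitSync();
                }
            }
        }
    }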
High-efficiency data transfer
- Publishers can publish multiple messages at a time (by adding them to a message set), while subscribers iterate over one message at a time.
- Kafka does not build a separate cache; it uses the operating system's page cache. Publishers write sequentially and subscribers usually lag only slightly behind, so using the Linux page cache directly works well and avoids the overhead of cache management and garbage collection.
- sendfile is used to optimize network transmission, eliminating the extra memory copy through user space (sketched below).
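In Java this sendfile path is exposed as FileChannel.transferTo. A minimal sketch of streaming a segment file into a subscriber's socket (the file name, host, and port are made up):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopySend {
        public static void main(String[] args) throws IOException {
            try (FileChannel segment = new FileInputStream("00000000000000000000.kafka").getChannel();
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("consumer-host", 9090))) {
                long pos = 0, remaining = segment.size();
                while (remaining > 0) {
                    // transferTo maps to sendfile(2) on Linux: bytes move from the
                    // page cache to the socket without a copy through user space.
                    long sent = segment.transferTo(pos, remaining, socket);
                    pos += sent;
                    remaining -= sent;
                }
            }
        }
    }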
Stateless broker
- The broker has no replication mechanism; once a broker goes down, its messages become unavailable.
- The broker does not save subscriber state; each subscriber saves its own state.
- Statelessness makes message deletion a challenge (a message slated for deletion may still be being consumed). Kafka adopts a time-based SLA (service level agreement): a message is retained for a fixed period (typically 7 days) and then deleted.
- A subscriber can rewind to any earlier offset and re-consume. When a subscriber fails, it can restart from its last checkpointed offset and re-read the messages (sketched below).
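With the modern Kafka Java client (again, newer than this article), rewinding is a seek to any still-retained offset; savedOffset below is a hypothetical checkpoint kept by the subscriber itself:

    import java.util.Collections;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class Rewind {
        // Re-consume "user-activity" partition 0 starting from a checkpoint
        // the subscriber saved itself (savedOffset is hypothetical).
        static void rewind(KafkaConsumer<String, String> consumer, long savedOffset) {
            TopicPartition tp = new TopicPartition("user-activity", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, savedOffset); // valid as long as the offset is still retained
        }
    }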
Consumer Group
- A consumer group (containing multiple consumers, e.g. a cluster consuming in parallel) jointly consumes a topic; different consumer groups subscribe independently of each other.
- To reduce the distributed-coordination overhead among the consumers within a group, the partition is the smallest unit of parallel consumption: the consumers within a group must consume disjoint partitions (see the sketch below).
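As a sketch of the idea referenced above: because a partition can be owned by at most one consumer in a group, each consumer can deterministically compute its own share with a simple range split (illustrative Java, not Kafka's actual rebalancing code):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class RangeAssignment {
        // Every consumer sorts the group membership the same way, finds its own
        // rank, and takes a contiguous range of partitions, so no two consumers
        // in the group ever consume the same partition.
        static List<Integer> partitionsFor(int numPartitions, List<String> members, String me) {
            List<String> sorted = new ArrayList<>(members);
            Collections.sort(sorted);
            int m = sorted.size();
            int rank = sorted.indexOf(me);
            int base = numPartitions / m;
            int extra = numPartitions % m; // first `extra` consumers take one more
            int start = rank * base + Math.min(rank, extra);
            int count = base + (rank < extra ? 1 : 0);
            List<Integer> mine = new ArrayList<>();
            for (int p = start; p < start + count; p++) mine.add(p);
            return mine;
        }
    }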
Zookeeper coordinated control
1. Manage the dynamic joining and leaving of brokers and consumers.
2. Trigger rebalancing: when a broker or consumer joins or leaves, a load-balancing algorithm runs to rebalance the subscriptions across the consumers within a consumer group.
3. Maintain the consumption relationships and track the consumed offset of each partition.
Details of the Zookeeper usage:
- When a broker starts, it registers a temporary broker registry on Zookeeper containing the broker's IP address and port, plus the topics and partitions stored on it.
- When a consumer starts, it registers a temporary consumer registry on Zookeeper containing the consumer group it belongs to and the topics it subscribes to.
- Each consumer group is associated with a temporary owner registry and a persistent offset registry. For each subscribed partition there is an owner registry whose content is the ID of the consumer that owns the partition, and an offset registry containing the last consumed offset (the overall layout is sketched below).
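Taken together, the registries form a small tree in Zookeeper, roughly as follows (based on the layout documented for early Kafka versions; exact paths may differ):

    /brokers/ids/[broker_id]                          -> host:port            (ephemeral)
    /brokers/topics/[topic]/[broker_id]               -> partition count      (ephemeral)
    /consumers/[group_id]/ids/[consumer_id]           -> subscribed topics    (ephemeral)
    /consumers/[group_id]/owners/[topic]/[partition]  -> owning consumer_id   (ephemeral)
    /consumers/[group_id]/offsets/[topic]/[partition] -> last consumed offset (persistent)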
Message Delivery Guarantees
- Kafka does not impose strict requirements regarding the duplication, loss, corruption, or ordering of messages.
- Kafka provides at-least-once delivery: when a consumer goes down, some messages may be delivered more than once.
- Because each partition is consumed by only one consumer within a consumer group, Kafka guarantees that the messages within a partition are consumed in order.
- Kafka computes a CRC for each message for error detection; messages that fail the CRC check are discarded (sketched below).
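The per-message check is a plain CRC32 comparison, sketched below (Kafka's actual message format is not reproduced here):

    import java.util.zip.CRC32;

    public class MessageCheck {
        // Recompute the checksum over the payload and compare it with the CRC
        // stored alongside the message; corrupted messages are dropped.
        static boolean isValid(byte[] payload, long storedCrc) {
            CRC32 crc = new CRC32();
            crc.update(payload, 0, payload.length);
            return crc.getValue() == storedCrc;
        }
    }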
LinkedIn Application Environment
As the deployment figure shows, the left side applies Kafka to online real-time processing of the log data, while the right side applies it to offline analysis (the logs are pulled into Hadoop or a data warehouse (DWH)).
Kafka Performance
Test environment: 2 Linux machines, each with 8 2GHz cores, 16GB of memory, and 6 disks in RAID 10, connected by a 1Gb network link. One machine was used as the broker and the other as the producer or the consumer.
My evaluation of the test: (1) the environment is too simple to be conclusive; (2) there is no analysis of sustained fluctuations on the producer side; (3) with only two machines, where was Zookeeper placed?
Test results: Kafka outperformed the other message queues tested. Sending single messages (200 bytes each) reached 50,000 messages/sec; in batch mode with batches of 50, it averaged 400,000 messages/sec.
Kafka Future Research Directions
1. Data compression (to save network bandwidth and storage space)
2. Broker replication (multiple copies)
3. Stream-processing applications