Kafka: A Distributed Messaging System



Architecture

Apache Kafka was open-sourced in December 2010. Written in Scala, it uses a variety of efficiency optimizations, and its overall architecture (producers push to the broker, consumers pull from it) is relatively novel, making it well suited to heterogeneous clusters.

Design goals:

(1) O(1) cost for data access on disk
(2) High throughput: hundreds of thousands of messages per second on a commodity server
(3) A distributed architecture capable of partitioning messages
(4) Support for loading data into Hadoop in parallel


Kafka is essentially a message publish/subscribe system. A producer publishes messages to a topic, and a consumer subscribes to that topic; whenever a new message arrives on the topic, the broker delivers it to every consumer that has subscribed. In Kafka, messages are organized by topic, and each topic is divided into multiple partitions, which simplifies data management and load balancing. Kafka also uses ZooKeeper for load balancing.
There are three main roles in Kafka: the producer, the broker, and the consumer.

Producer

The producer's task is to send data to the broker. Kafka provides two producer interfaces: a low-level interface, which sends data to a particular partition of a topic on a specific broker, and a high-level interface, which supports synchronous/asynchronous sending, ZooKeeper-based broker discovery, and load balancing (via a partitioner).
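The partitioner mentioned above decides which partition a message lands on. A common approach is to hash the message key; the sketch below is illustrative only (`choose_partition` is a hypothetical name, not Kafka's actual partitioner class), assuming messages with the same key should map to the same partition:

```python
import random
import zlib

# Illustrative key-hash partitioner sketch (not Kafka's real implementation):
# messages that share a key always map to the same partition.
def choose_partition(key, num_partitions):
    if key is None:
        # Unkeyed messages fall back to a random partition.
        return random.randrange(num_partitions)
    # A stable hash keeps a given key on a fixed partition across runs.
    return zlib.crc32(key) % num_partitions

p1 = choose_partition(b"user-42", 4)
p2 = choose_partition(b"user-42", 4)
print(p1, p1 == p2)
```

Because the hash is stable, all messages for one key preserve their relative order within a single partition.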
ZooKeeper-based broker discovery is worth elaborating. A producer can obtain the list of available brokers through ZooKeeper, or it can register a listener in ZooKeeper that is notified in the following situations:

    • A broker is added;
    • A broker is removed;
    • A new topic is registered;
    • A broker registers an existing topic.

When the producer learns of any of these events, it can take action as needed.
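The notification pattern above can be sketched in-process. This is a simplified stand-in for a ZooKeeper watch (`BrokerRegistry` and its methods are hypothetical names), showing only the listener mechanics:

```python
# Minimal sketch of the ZooKeeper-style watch pattern: the producer
# registers a callback and is notified on broker membership changes.
class BrokerRegistry:
    def __init__(self):
        self.brokers = set()
        self.listeners = []

    def register_listener(self, callback):
        self.listeners.append(callback)

    def add_broker(self, broker_id):
        self.brokers.add(broker_id)
        self._notify("broker_added", broker_id)

    def remove_broker(self, broker_id):
        self.brokers.discard(broker_id)
        self._notify("broker_removed", broker_id)

    def _notify(self, event, broker_id):
        # Fan the event out to every registered listener.
        for callback in self.listeners:
            callback(event, broker_id)

events = []
registry = BrokerRegistry()
registry.register_listener(lambda ev, b: events.append((ev, b)))
registry.add_broker("broker-0")
registry.remove_broker("broker-0")
print(events)  # [('broker_added', 'broker-0'), ('broker_removed', 'broker-0')]
```

In the real system the registry lives in ZooKeeper and the callbacks fire on znode changes, but the producer-side logic is the same: refresh the broker list and rebalance.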

Broker

The broker employs several strategies to improve data-handling efficiency, including the sendfile system call and zero-copy transfer.
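The idea behind sendfile is that file bytes move to a socket inside the kernel, without being copied through user-space buffers. A small demonstration using Python's `os.sendfile` wrapper (a connected socket pair stands in for the broker-to-consumer connection; this assumes a Unix-like platform):

```python
import os
import socket
import tempfile

# Write sample "log segment" data to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"message-1\nmessage-2\n")
    path = f.name

# A connected socket pair stands in for the broker-to-consumer connection.
server_side, consumer_side = socket.socketpair()

with open(path, "rb") as segment:
    size = os.fstat(segment.fileno()).st_size
    # sendfile copies the file bytes to the socket inside the kernel,
    # skipping the usual read()/write() copies through user space.
    sent = os.sendfile(server_side.fileno(), segment.fileno(), 0, size)

server_side.close()
received = consumer_side.recv(4096)
consumer_side.close()
os.unlink(path)
print(sent, received)
```

This is the same mechanism the broker uses when serving log segment files to consumers.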

Consumer

The consumer's role is to load log data into a central storage system. Kafka provides two consumer interfaces. The low-level interface maintains a connection to a single broker, and that connection is stateless: every time the consumer pulls data, it must tell the broker the offset it wants to read from. The high-level interface hides the broker details, letting the consumer pull data without caring about the underlying network topology. More importantly, in most log systems the broker records how much data each consumer has fetched, whereas in Kafka the consumer maintains this information itself.
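The stateless pull protocol can be sketched as follows. This is a simplified in-memory model (the `Broker`/`Consumer` classes are illustrative, not Kafka's API): the broker serves whatever offset range it is asked for, and the consumption position lives entirely on the consumer side:

```python
# Sketch of the stateless pull protocol: the broker keeps no per-consumer
# state; every fetch request carries the offset the consumer wants.
class Broker:
    def __init__(self):
        self.log = []  # the partition's message log

    def append(self, msg):
        self.log.append(msg)

    def fetch(self, offset, max_messages=10):
        # The broker simply serves the range the consumer asked for.
        return self.log[offset:offset + max_messages]

class Consumer:
    def __init__(self, broker):
        self.broker = broker
        self.offset = 0  # consumption state lives with the consumer

    def poll(self):
        msgs = self.broker.fetch(self.offset)
        self.offset += len(msgs)  # advance our own offset, not the broker's
        return msgs

broker = Broker()
for i in range(3):
    broker.append(f"msg-{i}")

consumer = Consumer(broker)
print(consumer.poll())   # ['msg-0', 'msg-1', 'msg-2']
print(consumer.offset)   # 3
```

Because the broker holds no cursor, a second consumer with its own offset can read the same partition independently.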

Storage structure

1. Kafka manages messages by topic. Each topic contains multiple partitions, and each partition corresponds to a logical log made up of multiple segment files.
2. Each segment stores many messages. A message's ID is determined by its logical position, so the storage location can be computed directly from the message ID, avoiding an extra ID-to-location mapping.
3. Each partition keeps an in-memory index recording the offset of the first message in each segment.
4. Messages published to a topic are distributed evenly across its partitions (randomly, or via a user-specified callback function). When the broker receives a published message, it appends it to the last segment of the corresponding partition. A segment is flushed to disk once the number of buffered messages reaches a configured value or the messages have been held longer than a time threshold, and only messages flushed to disk are visible to subscribers. Once a segment reaches a configured size, the broker stops writing to it and creates a new segment.

A consumer always reads messages from a particular partition sequentially; if it acknowledges a given message offset, that implies it has received all earlier messages. The consumer issues an asynchronous pull request to the broker and prepares a byte buffer to receive the data; each pull request carries the offset of the message to consume. Kafka uses the sendfile API to transfer bytes from the broker's log segment files to the consumer efficiently.

The Kafka broker is stateless, which means consumers must maintain their own record of what has been consumed. The broker keeps no such state, which has two consequences:

    1. Deleting messages from the broker becomes tricky, because the broker does not know whether consumers have finished with them. Kafka solves this with a simple time-based retention policy (an SLA): a message is automatically deleted once it has been held longer than a configured retention period.
    2. This design has a side benefit: a consumer can deliberately rewind to an old offset and consume the data again. This violates the usual contract of a queue, but it turns out to be an essential feature for many consumers.
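Both points can be illustrated with a toy log. This is a heavily simplified model (the real broker drops whole segment files and offsets never shift; here the log is just a list of timestamped messages):

```python
import time

# Simplified sketch of time-based retention plus consumer rewind.
RETENTION_SECONDS = 3600

log = [
    (time.time() - 7200, "old-msg"),    # published two hours ago
    (time.time() - 60, "recent-msg"),   # published one minute ago
]

def apply_retention(log, now=None):
    now = now if now is not None else time.time()
    # Drop every message older than the retention window; the broker
    # never needs to know whether anyone has consumed it.
    return [(ts, m) for ts, m in log if now - ts <= RETENTION_SECONDS]

log = apply_retention(log)
print([m for _, m in log])  # ['recent-msg']

# Rewind: since the broker tracks nothing, a consumer may re-read
# any retained position simply by asking for an older offset again.
offset = 0
replayed = [m for _, m in log[offset:]]
print(replayed)  # ['recent-msg']
```

Retention is thus a pure function of time, entirely decoupled from consumption.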

API examples

Publishing interface

message = new Message("Your message".getBytes());
set = new MessageSet(message);
producer.send("topic", set);

When publishing a message, the Kafka client constructs a message and adds it to a message set (Kafka supports batch publishing: multiple messages can be added to the set and published in one call). When sending, the client must specify the topic the messages belong to.

Subscription interface

streams[] = consumer.createMessageStreams("topic", 1);
for (message : streams[0]) {
    bytes = message.payload();
    // do something with the bytes
}

When subscribing, the Kafka client must specify the topic and the partition number (each partition corresponds to a logical log stream; for example, a topic might represent a product line and each partition that product line's logs for one day). After subscribing, the client can iterate over the messages; if no message is available, it blocks until a new one is published. The consumer can acknowledge received messages cumulatively: acknowledging the message at a given offset implies that all earlier messages have also been received successfully, and the broker then updates the offset registry in ZooKeeper.
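Cumulative acknowledgement means only a single number needs to be stored per consumer group and partition. A sketch of that bookkeeping (`OffsetRegistry` is a hypothetical stand-in for the ZooKeeper offset registry):

```python
# Sketch of cumulative acknowledgement: committing offset N asserts that
# every message up to and including N was received, so the registry only
# ever stores one number per (consumer group, partition).
class OffsetRegistry:
    """Hypothetical stand-in for the ZooKeeper offset registry."""
    def __init__(self):
        self.committed = {}

    def commit(self, group, partition, offset):
        # A cumulative ack can only move forward; stale acks are ignored.
        current = self.committed.get((group, partition), -1)
        self.committed[(group, partition)] = max(current, offset)

    def last_committed(self, group, partition):
        return self.committed.get((group, partition), -1)

registry = OffsetRegistry()
registry.commit("group-a", 0, 4)   # acknowledges messages 0..4 in one call
registry.commit("group-a", 0, 2)   # stale ack; must not move backwards
print(registry.last_committed("group-a", 0))  # 4
```

On restart, a consumer resumes from `last_committed + 1`, which is why one committed offset is sufficient recovery state.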

Reference Documentation:

http://dongxicheng.org/search-engine/log-systems/

http://kafka.apache.org/documentation.html#gettingStarted

