Kafka Series--Basic concept


Kafka is a distributed, partitioned, replicated, commit-log-based publish-subscribe messaging system.
Traditional messaging comes in two flavors:

    • Queuing: a pool of consumers reads from a server, and each message is delivered to exactly one of them.
    • Publish-subscribe: each message is broadcast to all subscribers.

Compared to traditional messaging technologies, Kafka's advantages are:

    • Fast: a single Kafka broker can serve thousands of clients, handling hundreds of megabytes of reads and writes per second.
    • Scalable: data is partitioned and spread over a cluster of machines, supporting data streams larger than any single machine can handle.
    • Persistent: messages are persisted to disk and replicated within the cluster to prevent data loss.
    • Distributed by design: it provides fault-tolerance and durability guarantees.
Basic concepts
    1. Topic: a category name used in Kafka to distinguish different classes of messages; it is specified by the producer.
    2. Producer: a process that publishes messages to a specific Kafka topic.
    3. Consumer: a process that subscribes to a specific topic and processes its messages.
    4. Broker: published messages are stored in a set of servers called a Kafka cluster; each server in the cluster is a broker. Consumers can subscribe to one or more topics and pull data from the brokers to consume the published messages.
    5. Partition: a physical grouping of a topic. A topic can be split into multiple partitions, each of which is an ordered queue; each message in a partition is assigned a sequential ID called its offset.
    6. Message: the basic unit of communication; each producer can send messages to a topic.
Log (logs)

A log is an append-only sequence of records, ordered strictly by time. We can append records to the end of the log, and we can read records from left to right (oldest to newest). Each record is assigned a unique, ordered log record number.

Each log file is a sequence of "log entries": each entry contains a 4-byte integer (a value N) followed by an N-byte message body. Each message has a unique 64-bit offset within its partition, which indicates the message's starting position.
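
As an illustration, here is a minimal Java sketch of parsing one such length-prefixed entry from a buffer. This shows only the 4-byte-length-plus-body framing described above; Kafka's real on-disk record format carries additional fields (such as a checksum), so treat this as a simplified model, not the actual Kafka classes.

    import java.nio.ByteBuffer;

    public class LogEntryReader {
        // Read one length-prefixed log entry: a 4-byte big-endian integer N
        // followed by an N-byte message body, as described above.
        static byte[] readEntry(ByteBuffer log) {
            int size = log.getInt();       // the 4-byte length prefix N
            byte[] body = new byte[size];  // the N-byte message body
            log.get(body);
            return body;
        }
    }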

This "log Entries" is not made up of a single file, but is divided into multiple segment, each segment called the segment first message offset and ". Kafka". There will also be an index file that indicates the offset range of the log entry contained under each segment.

Topic & Partition


When it comes to Kafka storage, partitioning is unavoidable.
When you create a topic, you can specify the number of partitions. The more partitions, the higher the throughput, but also the more resources required and the higher the risk of unavailability. After receiving messages from producers, Kafka stores them in different partitions according to a balancing policy.
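
For instance, a minimal sketch of creating such a topic with the Java AdminClient; the broker address, topic name, and partition/replica counts here are placeholder assumptions:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // "my-topic" with 6 partitions, each partition replicated on 3 brokers
                admin.createTopics(Collections.singleton(
                        new NewTopic("my-topic", 6, (short) 3))).all().get();
            }
        }
    }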

To let Kafka's throughput scale horizontally, a topic is physically divided into one or more partitions, and each partition corresponds to a folder on disk that stores all of that partition's messages and index files.

Messages are stored sequentially: each partition is an ordered, immutable message queue that can be continuously appended to, with newly received messages added at the end.
Each message in a partition is assigned a sequential number called the offset (a 64-bit value), which is unique within that partition.
Because each message is appended to its partition, writing is sequential disk I/O and therefore very efficient (sequential disk writes have been shown to outperform even random memory writes, which is an important reason for Kafka's high throughput).

Interaction with the producer

When a producer sends a message to the Kafka cluster, it can send it to a specific partition by naming that partition explicitly.
It can also route messages to different partitions through a specified balancing policy.
If neither is specified, the default policy is used, which distributes messages randomly across partitions.
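
A minimal sketch with the Java producer client illustrating these three routing options; the broker address, topic name, keys, and values are placeholder assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerRoutingExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // 1. Explicit partition: always lands on partition 0 of "my-topic".
                producer.send(new ProducerRecord<>("my-topic", 0, "key-1", "value-1"));
                // 2. Key only: the configured partitioner maps the key to a partition.
                producer.send(new ProducerRecord<>("my-topic", "key-1", "value-2"));
                // 3. No partition and no key: the default policy spreads messages out.
                producer.send(new ProducerRecord<>("my-topic", "value-3"));
            }
        }
    }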

When each message is sent to the broker, a partition rule decides which partition it is stored in.
If the partition rule is set up properly, messages are distributed evenly across partitions, allowing horizontal scaling. (If a topic corresponded to a single file, the I/O of the machine holding that file would become that topic's performance bottleneck; partitioning solves this problem.)

When sending a message, you can specify a key; the producer then uses the key and the partition mechanism to decide which partition to send the message to.
The partition mechanism is specified through the producer's partitioner.class property; the class must implement the kafka.producer.Partitioner interface.
For example, if the key can be parsed as an integer, that integer modulo the total number of partitions gives the partition number, and the message is sent to the partition with that number. (Each partition has a sequence number.)
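
The kafka.producer.Partitioner interface cited above belongs to the legacy Scala client. As a sketch, the same key-modulo-partition-count idea expressed against the modern org.apache.kafka.clients.producer.Partitioner interface (the class name here is hypothetical) might look like:

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // Hypothetical partitioner: parse the key as an integer and take it
    // modulo the total number of partitions, as in the example above.
    public class IntKeyPartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            return Math.floorMod(Integer.parseInt(key.toString()), numPartitions);
        }
        @Override public void close() {}
        @Override public void configure(Map<String, ?> configs) {}
    }

It would be registered on the producer with props.put("partitioner.class", IntKeyPartitioner.class.getName()).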

Interacting with consumers

When consumers consume messages, Kafka uses an offset to record the current consumption position.
In Kafka's design, several different consumer groups can consume the same topic's messages at the same time.
When two different groups consume concurrently, each keeps its own offset record, and they do not interfere with each other.

For a single group, the number of consumers should not exceed the number of partitions,
because each partition can deliver messages to only one consumer within a group. That is, one consumer can consume multiple partitions, but a partition can be consumed by only one consumer in the same group.
Therefore, if a group has more consumers than partitions, the excess consumers will never receive any messages.
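
A minimal sketch of a consumer joining a group; the broker address, group id, and topic name are placeholder assumptions. Starting several copies of this program with the same group.id makes Kafka divide the topic's partitions among them, while copies started under a different group.id independently see every message.

    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "group-a"); // instances sharing this id divide the partitions
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Arrays.asList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> r : records)
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                          r.partition(), r.offset(), r.value());
                }
            }
        }
    }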

Offset (offsets)
    1. The offset is unique within each partition.
    2. The only metadata the consumer holds is the offset, which marks the consumer's position in the log.
    3. The offset is controlled by the consumer. Normally the offset increases linearly as the consumer consumes messages, but the consumer can also reset it to an older offset and reread messages (see the sketch after this list).
    4. One consumer's operations do not affect how other consumers handle the same log.
    5. If a topic holds 100 messages and I consume 50 of them and commit, the offset recorded by the Kafka server is 50 (offsets start from 0), and the next consumption starts from offset 50.
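
As a sketch of point 3, resetting the position with the Java consumer's seek(); the broker address, topic, partition, and target offset are placeholder assumptions:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SeekExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("my-topic", 0);
                consumer.assign(Collections.singletonList(tp)); // claim partition 0 directly
                consumer.seek(tp, 50L); // rewind: the next poll() rereads from offset 50
            }
        }
    }
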
Manual control of offsets

With auto-commit, the offset is committed automatically as soon as the data is pulled from Kafka.

In a business system, consuming data usually involves some business logic: processing the records, inserting them into a database, and so on.
The offset should be committed only after that work completes. Otherwise, the data may not yet be in the database when new data arrives,
and the records that failed to be inserted can never be consumed again.

So, to strictly avoid losing data, you need to control the offset manually.

    1. Commit the offset manually, and start exactly as many consumer processes as there are partitions (partition_num), so that each consumer process owns one partition; committing an offset then cannot affect any other partition's offset. However, this method is rather restrictive, because the number of partitions and the number of consumer processes must correspond strictly.
    2. Another method also commits the offset manually, but in addition the consumer caches all fetched data in a queue; only when all the data in the queue has been processed does it commit the offsets in a batch. This guarantees that only processed data is committed. (A sketch follows this list.)
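
A minimal sketch of the commit-after-processing pattern with the Java consumer; the broker address, group id, topic, and the saveToDatabase step are placeholder assumptions:

    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ManualCommitExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "group-a");
            props.put("enable.auto.commit", "false");         // turn off auto-commit
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Arrays.asList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> record : records) {
                        saveToDatabase(record); // hypothetical business step: must succeed first
                    }
                    consumer.commitSync();      // commit only after the batch is fully processed
                }
            }
        }
        static void saveToDatabase(ConsumerRecord<String, String> record) {
            /* insert the record into the database here */
        }
    }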

The official documentation also notes:
Manual control of the offset lets us control precisely when a message counts as consumed (data whose offset has been committed is not consumed again, while data whose offset has not been committed will be consumed again).
However, the process may fail after the data has been inserted into the database but before the offset is committed to Kafka.
The next consumption then starts from the last committed offset, and the same data is inserted into the database again.
Kafka therefore provides an "at-least-once delivery" guarantee: every message is delivered at least once, but failures can cause duplicates.
